15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

AI Agent Safety Best Practices in 2026

February 10, 202615 Research Lab

agent-safetyguardrailsdefensecompliance

The field of AI agent safety has moved past the "should we worry about this?" phase. Agents are deployed in production, they have real tool access, and incidents are occurring. Here is what the best practices look like as of early 2026.

Authorization by Default

Every tool call goes through a policy engine. No exceptions. The pattern that has emerged: define allowed actions in a declarative policy file (YAML or JSON), evaluate every action against the policy synchronously, and deny anything not explicitly permitted.

This is the fail-closed principle applied to agents. If the policy does not explicitly allow an action, it does not happen. This is the single most important safety control because it works regardless of whether the model has been compromised by prompt injection, jailbreaking, or any other attack.

Least Privilege

Agents should have access to the minimum set of tools required for their task. A customer service agent does not need file system access. A code review agent does not need email sending. Over-provisioned agents increase blast radius when things go wrong.

Apply this to data access as well. An agent that queries a database should have read-only access to the specific tables it needs, not a superuser connection.

Behavioral Monitoring

Static policy checks catch known-bad actions. Behavioral monitoring catches anomalies that policies did not anticipate. Track statistical properties of agent behavior: tool-call frequency, parameter distributions, response patterns, session duration.

Algorithms like EWMA (Exponentially Weighted Moving Average) and CUSUM (Cumulative Sum) detect drift from baseline behavior. When an agent suddenly starts making twice as many API calls as usual or accessing tables it has never touched before, the monitoring system flags it.

Sentinel implements these algorithms with zero dependencies, operating as a lightweight monitor alongside the policy engine.

Human-in-the-Loop

Certain actions require human approval before execution. The threshold depends on the application: financial systems might require approval for any transaction over $100, while content systems might require approval only for public-facing publications.

The key design choice is fail-closed: if the human reviewer is unavailable, the action does not execute.

Immutable Audit Trails

Every agent action produces a receipt: a cryptographic record of what happened, what policy decision was made, and what the outcome was. Receipts are hash-chained so tampering with any record invalidates the chain.

This is a compliance requirement under the EU AI Act and a practical necessity for incident investigation.

Tested Defenses

Safety controls that have not been tested against adversarial inputs are not safety controls. Regular red teaming with tools like Chainbreaker, using payload corpora from AI SecLists, validates that your defenses work against current attack techniques.

The Minimum Stack

If you ship nothing else: a policy engine that fails closed, an audit trail, and monitoring. These three controls address the majority of real-world agent safety incidents. Everything else builds on this foundation.