15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

What Is AI Agent Safety?

October 1, 202515 Research Lab

agent-safetyllm-safetyguardrails

AI agent safety is about making sure AI systems that can take real-world actions do so within the boundaries their builders intended. This is a different problem from making chatbots polite.

Why Agents Are Different from Chatbots

A chatbot generates text. The worst outcome is inappropriate text. An AI agent calls tools: sends emails, queries databases, executes code, makes API requests. The worst outcome is unauthorized real-world actions.

This is a categorical difference. When a chatbot says something wrong, you get a bad answer. When an agent does something wrong, you get a security incident.

What Agent Safety Covers

Authorization. Controlling what tools the agent can call, with what parameters, and under what conditions. Policy-based enforcement that operates independently of the model.

Content safety. Detecting adversarial inputs (prompt injection) before they reach the model, and scanning tool responses for injection before they re-enter the model's context.

Behavioral monitoring. Tracking agent behavior over time to detect anomalies that static rules miss. Statistical methods that identify when an agent's behavior pattern changes.

Audit trails. Recording every action the agent takes with cryptographic guarantees against tampering. Necessary for incident investigation and regulatory compliance.

Human oversight. Approval workflows that pause agent execution for human review of high-risk actions. Kill switches for emergency intervention.

Why It Matters Now

Several converging trends make agent safety urgent:

Tool access is expanding. Agents are being connected to more tools with broader capabilities. MCP standardizes this, making tool integration easier and more common.

Autonomy is increasing. Agents are being deployed to operate independently for longer periods, with less human oversight per action.

Adversarial interest is growing. As agents gain access to valuable systems (financial, healthcare, infrastructure), the incentive for adversarial manipulation increases.

Regulation is arriving. The EU AI Act's high-risk obligations become enforceable in August 2026. AI agent deployments in regulated domains need technical safety controls.

The Field

AI agent safety draws from multiple disciplines: cybersecurity (threat modeling, defense in depth), formal verification (policy specification, deterministic enforcement), statistics (anomaly detection, behavioral analysis), and AI alignment (understanding model failure modes).

It is a young field. The tooling is maturing quickly. Open-source projects like Authensor, AI SecLists, and Chainbreaker are building the infrastructure. Standards bodies like OWASP and NIST are publishing frameworks. The EU is legislating requirements.

The gap is between where the tools are and where the deployments are. Agents are being deployed faster than safety controls are being adopted. Closing that gap is what this field is about.