15 Research Lab - Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

AI Agent Firewall Architecture: Where Safety Checks Sit in the Pipeline

January 6, 2026

defense, guardrails, agent-safety

A traditional firewall sits between your network and the internet, filtering traffic based on rules. An AI agent firewall sits between the model and its tools, filtering actions based on safety policies. The architectural placement is what makes it effective.

Pipeline Architecture

A typical agent pipeline without a firewall:

User --> Model --> Tools --> Response

The model has direct, unmediated access to tools. Whatever the model decides to do, it does.

With a firewall:

User --> [Input Scan] --> Model --> [Policy Engine] --> Tools
                            ^                             |
                            +------ [Response Scan] <-----+

                          Model --> [Output Scan] --> User

Four enforcement points, each serving a different purpose.
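
The placement of the four points can be sketched as a small agent loop. This is a minimal illustration, not any product's API: `Firewall`, `ToolCall`, `Blocked`, `run_agent`, and the `model_step` callback (which returns either a tool call or a final string) are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class ToolCall:
    name: str
    args: dict


class Blocked(Exception):
    """Raised when an enforcement point rejects input, an action, or output."""


@dataclass
class Firewall:
    # Each scan returns the (possibly sanitized) text or raises Blocked.
    input_scan: Callable[[str], str]
    # The policy check returns nothing on allow and raises Blocked on deny.
    policy_check: Callable[[ToolCall], None]
    response_scan: Callable[[str], str]
    output_scan: Callable[[str], str]


def run_agent(
    user_input: str,
    model_step: Callable[[List[str]], Union[ToolCall, str]],
    tools: Dict[str, Callable[..., object]],
    fw: Firewall,
    max_steps: int = 5,
) -> str:
    # 1. Input scan: runs before the model sees anything from the user.
    context = [fw.input_scan(user_input)]
    for _ in range(max_steps):
        step = model_step(context)
        if isinstance(step, str):
            # 4. Output scan: runs on the final answer, before the user sees it.
            return fw.output_scan(step)
        # 2. Policy engine: runs between the model's decision and execution.
        fw.policy_check(step)
        raw = tools[step.name](**step.args)
        # 3. Response scan: runs before tool output re-enters the context.
        context.append(fw.response_scan(str(raw)))
    raise Blocked("step budget exhausted before a final answer")
```

Because the tool is only invoked after `policy_check` returns, a denied call never executes, regardless of what the model was persuaded to request.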

Input Scan

Scans user input for prompt injection, toxic content, and policy violations. Located before the model so adversarial inputs are caught before they can influence model behavior.

Catches: direct injection, encoding attacks, role assumption attempts. Misses: indirect injection (which arrives through tool responses, not user input).
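
As a rough illustration of what an input scan might check, the sketch below matches a few injection phrasings and also inspects base64-decoded substrings to cover a simple encoding attack. The pattern list and the names `INJECTION_PATTERNS`, `decoded_variants`, and `scan_input` are assumptions made for this example; a production scanner would need far broader coverage than three regexes.

```python
import base64
import binascii
import re
from typing import Iterator, List

# Illustrative patterns only; real scanners use much larger rule sets or classifiers.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"you are now\b", re.I),
    re.compile(r"\bsystem prompt\b", re.I),
]

# Runs of base64-alphabet characters long enough to hide a short instruction.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decoded_variants(text: str) -> Iterator[str]:
    """Yield the raw text plus any substrings that decode cleanly from base64."""
    yield text
    for token in B64_TOKEN.findall(text):
        try:
            yield base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64 text; ignore


def scan_input(text: str) -> List[str]:
    """Return the patterns that matched; an empty list means nothing was found."""
    findings = []
    for variant in decoded_variants(text):
        for pattern in INJECTION_PATTERNS:
            if pattern.search(variant):
                findings.append(pattern.pattern)
    return findings
```

A caller would typically reject or escalate the request when `scan_input` returns a non-empty list. Nothing here inspects tool responses, which is why indirect injection needs its own enforcement point.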

Policy Engine

Evaluates every tool call against authorization rules. Located between the model's decision and tool execution. This is the most critical enforcement point because it controls what the agent can actually do.

Catches: unauthorized tool calls, out-of-bounds parameters, rate limit violations, actions requiring approval. Properties: deterministic, synchronous, independent of model context.
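
The properties listed above (deterministic, synchronous, independent of model context) can be made concrete with a small rule evaluator. This is a hedged sketch under assumed names: `Decision`, `Rule`, and `PolicyEngine` are hypothetical, the default-deny behavior is a design choice of this example, and the clock is injectable so that rate limiting stays deterministic under test.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"


@dataclass
class Rule:
    max_calls_per_minute: int
    # One validator per permitted parameter; unlisted parameters are rejected.
    param_checks: Dict[str, Callable[[object], bool]] = field(default_factory=dict)
    needs_approval: bool = False


class PolicyEngine:
    def __init__(self, rules: Dict[str, Rule], clock: Callable[[], float] = time.monotonic):
        self.rules = rules
        self.clock = clock
        self.history: Dict[str, List[float]] = {}

    def evaluate(self, tool: str, args: dict) -> Decision:
        # The decision depends only on the call, the rules, and the clock,
        # never on the model's prompt or conversation history.
        rule = self.rules.get(tool)
        if rule is None:
            return Decision.DENY  # default deny: unknown tools are unauthorized
        for name, value in args.items():
            check = rule.param_checks.get(name)
            if check is None or not check(value):
                return Decision.DENY  # unexpected or out-of-bounds parameter
        now = self.clock()
        recent = [t for t in self.history.get(tool, []) if now - t < 60.0]
        if len(recent) >= rule.max_calls_per_minute:
            self.history[tool] = recent
            return Decision.DENY  # rate limit exceeded
        recent.append(now)
        self.history[tool] = recent
        return Decision.REQUIRE_APPROVAL if rule.needs_approval else Decision.ALLOW
```

Since `evaluate` never reads model context, an injected instruction cannot change its outcome; it can only change which calls the model attempts.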

Response Scan

Scans tool responses before they re-enter the model's context. Located between tool execution and the model's next reasoning step.

Catches: response injection, data that should not be in the model's context, tool responses that attempt to manipulate agent behavior.
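
One possible shape for a response scan is shown below. It is a sketch under stated assumptions: the patterns are illustrative, the names `ScannedResponse` and `scan_tool_response` are hypothetical, and withholding a flagged response entirely (rather than trying to strip the instruction out) is a conservative choice made for this example.

```python
import re
from dataclasses import dataclass

# Illustrative phrasings of instructions embedded in tool output.
INSTRUCTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"\b(assistant|ai agent)\s*[:,]\s*(you must|please)\b", re.I),
]

# Key/value pairs that look like credentials and should not enter the context.
SECRET_PATTERN = re.compile(r"\b(?:api[_-]?key|password|token)\s*[:=]\s*\S+", re.I)


@dataclass
class ScannedResponse:
    text: str      # what is safe to append to the model's context
    flagged: bool  # True when embedded instructions were detected


def scan_tool_response(tool: str, raw: str) -> ScannedResponse:
    flagged = any(p.search(raw) for p in INSTRUCTION_PATTERNS)
    if flagged:
        # Withhold the whole payload instead of attempting a partial cleanup.
        body = "[tool response withheld: possible embedded instructions]"
    else:
        body = SECRET_PATTERN.sub("[REDACTED]", raw)
    return ScannedResponse(text=f"{tool} returned: {body}", flagged=flagged)
```

The `flagged` field gives the caller a signal to log or alert on, separate from the sanitized text that goes back to the model.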

Output Scan

Scans the agent's final response before it reaches the user. Located at the end of the pipeline.

Catches: data leakage, toxic output, system prompt leakage, PII exposure.
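
A simple output scan might redact obvious PII and withhold responses that quote the system prompt. The sketch below assumes two illustrative regexes (email addresses and US-style SSNs) and a sliding-window substring check for prompt leakage; `scan_output`, `leaks_system_prompt`, and the 40-character window are hypothetical choices, not a complete PII or leakage detector.

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def leaks_system_prompt(text: str, system_prompt: str, window: int = 40) -> bool:
    """True if any `window`-character slice of the system prompt appears verbatim."""
    if not system_prompt:
        return False
    if len(system_prompt) <= window:
        return system_prompt in text
    return any(
        system_prompt[i:i + window] in text
        for i in range(len(system_prompt) - window + 1)
    )


def scan_output(text: str, system_prompt: str) -> str:
    if leaks_system_prompt(text, system_prompt):
        return "[response withheld: system prompt leakage detected]"
    # Redact rather than block, so the user still gets a usable answer.
    text = EMAIL.sub("[EMAIL]", text)
    text = US_SSN.sub("[SSN]", text)
    return text
```

A verbatim window check misses paraphrased leaks, so this catches only the crudest case; it is meant to show where the check sits, not how thorough it must be.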

Why Placement Matters

Each enforcement point catches different attack types at different stages. Removing any point creates a gap:

No input scan: direct injection reaches the model
No policy engine: the model can execute any action it decides to
No response scan: tool responses can inject instructions
No output scan: sensitive data or harmful content reaches the user

The policy engine is the most important single point. Even if every scan fails, a properly configured policy engine still blocks unauthorized tool calls, which limits what a compromised model can actually do.

Implementation Approaches

In-process firewall. The enforcement points run in the same process as the agent. Lowest latency. Authensor's engine, Aegis, and Sentinel are designed for this: zero-dependency packages that run in-process.

Gateway firewall. The enforcement points run as a separate service (the gateway). The agent connects to the gateway, which proxies to tools. Higher latency but centralized management. Authensor's control plane implements this pattern.

Hybrid. In-process policy evaluation for low-latency decisions, with the gateway handling audit logging, monitoring aggregation, and cross-agent policy management.
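
The hybrid split can be sketched as follows: the allow/deny decision is made locally and synchronously, while audit events are queued and shipped to the gateway by a background thread. Everything here is hypothetical: `HybridFirewall`, the `evaluate` callback, and the `ship` callback (standing in for whatever transport a gateway would expose) are assumptions for illustration and do not describe Authensor's actual interfaces.

```python
import json
import queue
import threading
from typing import Callable


class HybridFirewall:
    def __init__(
        self,
        evaluate: Callable[[str, dict], bool],  # in-process policy decision
        ship: Callable[[str], None],            # hypothetical gateway transport
    ):
        self._evaluate = evaluate
        self._ship = ship
        self._events: "queue.Queue" = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def check(self, tool: str, args: dict) -> bool:
        # The decision never waits on the network; only the audit record does.
        allowed = self._evaluate(tool, args)
        self._events.put({"tool": tool, "args": args, "allowed": allowed})
        return allowed

    def _drain(self) -> None:
        while True:
            event = self._events.get()
            if event is None:  # sentinel from close()
                return
            try:
                self._ship(json.dumps(event))
            except Exception:
                pass  # sketch only: a real deployment would retry or alert

    def close(self) -> None:
        """Flush queued audit events and stop the background worker."""
        self._events.put(None)
        self._worker.join()
```

The trade-off in this arrangement is that a gateway outage delays audit records but does not stall or weaken enforcement, since the decision path has no remote dependency.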

The right choice depends on your latency requirements and operational complexity budget. For most deployments, start with in-process and add a gateway when you need centralized management.