15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

Content Safety Scanning for AI Agents

February 28, 202615 Research Lab

defenseguardrailstools

Content safety scanning for AI agents goes beyond traditional content moderation. It needs to detect prompt injection, tool description poisoning, and response manipulation in addition to standard toxic content classification.

What to Scan

User input. The first layer. Scan every user message for injection patterns before it reaches the model. This catches direct injection attempts.

Retrieved content. In RAG systems, scan retrieved documents and chunks for embedded instructions. Indirect injection through poisoned documents is a primary attack vector.

Tool descriptions. When connecting to MCP servers, scan tool descriptions for injection payloads. Tool descriptions go directly into the model's context and are highly trusted.

Tool responses. After a tool executes, scan the response before it re-enters the model's context. Response injection can manipulate subsequent agent behavior.

Agent output. Scan the agent's generated responses before they reach the user. This catches toxic content, data leakage, and instruction leakage.

Scanning Techniques

Pattern matching. Fast keyword and regex scanning for known injection patterns. Sub-millisecond latency. High precision on known patterns, zero coverage on novel patterns.

Statistical analysis. Detect anomalies in token distribution, entropy, perplexity shifts, and text structure. Encoded payloads (base64, hex) have distinctive statistical signatures.

ML classification. Fine-tuned models that classify text as benign or adversarial. Higher coverage than pattern matching, but slower and less transparent.

Semantic analysis. Check whether the content's semantic intent matches its expected purpose. A tool description that contains imperative instructions directed at the model (rather than describing the tool) is suspicious.

How Aegis Works

Aegis is a zero-dependency content safety scanner designed for AI agent pipelines. Its architecture:

Fast path: Pattern matching against a maintained library of injection signatures. If a pattern matches, the scan returns immediately with a detection result.
Statistical path: Analyze text properties for anomalies. Entropy calculations, encoding detection, structural analysis.
Composite scoring: Combine signals from both paths into a single risk score. Configurable thresholds determine whether the content is passed, flagged, or blocked.

The zero-dependency design means Aegis runs in any JavaScript/TypeScript environment without pulling in external libraries. This matters for security-sensitive deployments where every dependency is a potential supply chain risk.

Placement in the Pipeline

Scanner placement determines what you catch:

User Input --> [SCAN] --> Model --> Tool Call Decision
                                        |
                                   [POLICY CHECK]
                                        |
                                   Tool Execution
                                        |
                                Tool Response --> [SCAN] --> Model
                                                              |
                                                         Agent Output --> [SCAN] --> User

Three scan points: input, tool response, and output. The policy check is separate from scanning. Scanning detects adversarial content. The policy engine authorizes actions. Both are needed.

Performance

Scanning adds latency. Pattern matching runs in under 1ms. Statistical analysis adds 2-5ms. ML classification adds 10-50ms depending on model size. In a pipeline where the LLM inference takes 500ms-2s, the scanning overhead is negligible.

For high-throughput systems, run the fast path synchronously and the statistical path asynchronously. If the fast path catches the attack, the statistical path result is not needed.