15 Research Lab - Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

AI Agent Data Exfiltration Prevention

February 12, 2026

agent-safety, defense, guardrails

Data exfiltration through AI agents is OWASP ASI04. An agent with tool access has multiple channels for sending data to unauthorized destinations. Preventing exfiltration requires egress controls that traditional DLP systems were not designed for.

How Agents Exfiltrate Data

Direct tool-call exfiltration. The agent calls an HTTP request tool, sending sensitive data to an external URL. This is the simplest vector and the easiest to detect.

Parameter embedding. The agent encodes data in tool call parameters. For example, a search query can carry base64-encoded user data in its query string, or a file write can embed data in the filename.

Response-channel exfiltration. The agent includes sensitive data in its text response to the user. If the response is logged or forwarded to a third-party system, the data leaves through the response channel.

Steganographic encoding. The agent embeds data in seemingly innocuous outputs: specific word choices, sentence lengths, or formatting patterns that encode information. This channel is hard to detect, but it carries only a small amount of data per output.

Cross-session leakage. If the agent has persistent memory, it can store data in one session and retrieve it in another session with a different user, effectively transferring data between users through the agent's memory.

Egress Controls

URL allowlisting. Tool calls that make network requests should only be able to reach pre-approved URLs. Block requests to unknown domains at the policy level.
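
As a minimal sketch of the allowlist check, the function below accepts a URL only if it uses https, carries no embedded credentials, and its host exactly matches an approved entry. The host names and the `is_url_allowed` helper are hypothetical, chosen for illustration; a real deployment would load the list from policy configuration.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist for illustration only.
ALLOWED_HOSTS = {"api.internal.example.com", "search.example.com"}


def is_url_allowed(url: str, allowed_hosts: set[str] = ALLOWED_HOSTS) -> bool:
    """Return True only for https URLs whose exact host is on the allowlist."""
    try:
        parts = urlsplit(url)
        host = parts.hostname  # already lowercased by urllib
    except ValueError:
        return False  # malformed URL (for example, a broken IPv6 literal)
    if parts.scheme != "https" or host is None:
        return False
    # Reject user:pass@host forms, which can carry data in the userinfo field.
    if parts.username is not None or parts.password is not None:
        return False
    # Exact match: "search.example.com.evil.net" must not pass.
    return host in allowed_hosts
```

Exact host matching is deliberate: suffix or substring matching is a common way for lookalike domains to slip through. Note that a host allowlist alone does not stop data embedded in the path or query string of an approved URL.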

Parameter scanning. Scan tool call parameters for patterns that indicate data embedding: base64 strings, unusual parameter lengths, parameters that do not match expected formats.
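
A rough sketch of such a scanner is shown below. The thresholds (`MAX_PARAM_LENGTH`, the minimum run lengths) and the `scan_params` helper are assumptions for illustration. The checks are heuristics and will produce false positives on long identifiers, so findings are better treated as signals for review than as proof of exfiltration.

```python
import base64
import binascii
import re

# Assumed thresholds; tune per tool from observed traffic.
MAX_PARAM_LENGTH = 512
BASE64_RUN = re.compile(r"[A-Za-z0-9+/_-]{24,}={0,2}")
HEX_RUN = re.compile(r"\b[0-9a-fA-F]{32,}\b")


def _decodes_as_base64(candidate: str) -> bool:
    """True if the run is valid standard or URL-safe base64."""
    core = candidate.rstrip("=").replace("-", "+").replace("_", "/")
    padded = core + "=" * (-len(core) % 4)
    try:
        base64.b64decode(padded, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


def scan_params(params: dict[str, str]) -> list[str]:
    """Return human-readable findings for suspicious tool call parameters."""
    findings = []
    for name, value in params.items():
        text = str(value)
        if len(text) > MAX_PARAM_LENGTH:
            findings.append(f"{name}: length {len(text)} exceeds {MAX_PARAM_LENGTH}")
        if HEX_RUN.search(text):
            findings.append(f"{name}: long hex run")
        if any(_decodes_as_base64(m.group()) for m in BASE64_RUN.finditer(text)):
            findings.append(f"{name}: base64-like run")
    return findings
```

A per-tool schema (expected types, lengths, and character sets for each parameter) would usually be stricter than these generic patterns and is worth adding where tool inputs are well defined.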

Output classification. Apply data classification to the agent's context. If the context contains PII, financial data, or other sensitive categories, restrict what tools can be called and what data can appear in responses.
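
The idea can be sketched as a two-step gate: classify the context, then subtract the tools that are off-limits for the categories found. The regex detectors, category names, and tool names below are all hypothetical placeholders; a production classifier would be far more thorough than three patterns.

```python
import re

# Illustrative detectors only, keyed as "category.subtype".
DETECTORS = {
    "pii.email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "pii.ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "financial.card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

# Hypothetical tool names: tools blocked once a category appears in context.
BLOCKED_TOOLS = {
    "pii": {"http_request", "send_email"},
    "financial": {"http_request", "send_email", "web_search"},
}


def classify(context: str) -> set[str]:
    """Return the top-level sensitivity categories detected in the context."""
    return {
        label.split(".")[0]
        for label, pattern in DETECTORS.items()
        if pattern.search(context)
    }


def allowed_tools(context: str, all_tools: set[str]) -> set[str]:
    """Remove every tool blocked by any category present in the context."""
    blocked: set[str] = set()
    for category in classify(context):
        blocked |= BLOCKED_TOOLS.get(category, set())
    return all_tools - blocked
```

For example, once an SSN-shaped string enters the context, `allowed_tools` drops `http_request` and `send_email` for the rest of that turn while leaving local tools available.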

Network segmentation. Run agents in network environments where outbound connections are restricted. The agent's execution environment should not have arbitrary internet access.

Memory isolation. If using persistent memory, scope it per-user. Agent memory from User A's sessions should never be accessible during User B's sessions.
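
One way to enforce this, sketched below under the assumption of a simple key-value memory, is to hand the agent a view that is already bound to the current user, so that no call the agent makes can name another user's namespace. `ScopedMemory` and `UserMemory` are hypothetical names, and the in-memory dict stands in for whatever persistent store is actually used.

```python
from collections import defaultdict
from typing import Optional


class UserMemory:
    """View bound to one user's namespace; it cannot address other users."""

    def __init__(self, namespace: dict[str, str]) -> None:
        self._namespace = namespace

    def put(self, key: str, value: str) -> None:
        self._namespace[key] = value

    def get(self, key: str, default: Optional[str] = None) -> Optional[str]:
        return self._namespace.get(key, default)


class ScopedMemory:
    """Memory facade keyed by user ID (in-memory stand-in for a real store)."""

    def __init__(self) -> None:
        self._store: dict[str, dict[str, str]] = defaultdict(dict)

    def for_user(self, user_id: str) -> UserMemory:
        # The user ID comes from the authenticated session, never from the model.
        if not user_id:
            raise ValueError("user_id is required")
        return UserMemory(self._store[user_id])
```

The key design choice is that the user ID is supplied by the session layer when the view is created, not passed as a tool parameter the model could alter.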

Detection

Monitor for exfiltration indicators:

Tool calls to previously unseen URLs or domains
Unusually large parameter values in tool calls
Encoded strings (base64, hex) in parameters or responses
Tool calls that do not align with the user's request
Patterns suggesting the agent is "packaging" data before a send operation

Behavioral monitoring catches these patterns when they deviate from the agent's baseline. Sentinel tracks per-agent statistical profiles that make anomalous exfiltration attempts visible.
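
To make the idea of a per-agent baseline concrete, here is a minimal sketch that flags two of the indicators above: a domain the agent has not contacted before, and a parameter size far above the agent's historical mean. This is not a description of how Sentinel works; the `AgentBaseline` class, the z-score test, and the thresholds are assumptions chosen for illustration.

```python
import statistics
from collections import defaultdict


class AgentBaseline:
    """Tracks per-agent tool call history and flags deviations from it."""

    def __init__(self, min_samples: int = 20, z_threshold: float = 3.0) -> None:
        self.min_samples = min_samples
        self.z_threshold = z_threshold
        self._sizes: dict[str, list[int]] = defaultdict(list)
        self._domains: dict[str, set[str]] = defaultdict(set)

    def observe(self, agent_id: str, domain: str, param_bytes: int) -> list[str]:
        """Record one tool call and return any alerts it triggers."""
        alerts = []
        sizes = self._sizes[agent_id]
        # Only alert once there is enough history to call something unusual.
        if len(sizes) >= self.min_samples:
            if domain not in self._domains[agent_id]:
                alerts.append(f"previously unseen domain: {domain}")
            mean = statistics.fmean(sizes)
            stdev = statistics.pstdev(sizes)
            if stdev > 0 and (param_bytes - mean) / stdev > self.z_threshold:
                alerts.append(f"parameter size {param_bytes} far above baseline")
        # Simplification: every observation joins the baseline, even flagged ones.
        sizes.append(param_bytes)
        self._domains[agent_id].add(domain)
        return alerts
```

A real monitor would likely exclude flagged calls from the baseline and use a rolling window, so that a slow ramp-up in payload size cannot quietly shift what counts as normal.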

The Authorization Backstop

The strongest exfiltration control is a policy engine that restricts what tools the agent can call and what parameters it can use. If the agent cannot call an HTTP request tool, or if every URL must be on an allowlist, direct exfiltration through tool calls to unapproved destinations is blocked regardless of what the model has been instructed to do.
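
A default-deny policy engine of this kind can be sketched in a few lines. The `Policy` class, the validator helper, and the tool names are hypothetical; the point is only that authorization runs outside the model, so a prompt injection cannot talk its way past it.

```python
from dataclasses import dataclass, field
from typing import Callable
from urllib.parse import urlsplit

Validator = Callable[[dict], bool]


def url_host_in(allowed_hosts: set[str]) -> Validator:
    """Build a validator that requires params['url'] to target an allowed host."""
    def check(params: dict) -> bool:
        try:
            host = urlsplit(str(params.get("url", ""))).hostname
        except ValueError:
            return False
        return host in allowed_hosts
    return check


@dataclass
class Policy:
    """Maps each permitted tool to a validator for its parameters."""
    rules: dict[str, Validator] = field(default_factory=dict)

    def authorize(self, tool: str, params: dict) -> bool:
        validator = self.rules.get(tool)
        if validator is None:
            return False  # default deny: unlisted tools are never callable
        return validator(params)


# Hypothetical policy: one network tool restricted by host, one local tool.
policy = Policy(rules={
    "http_request": url_host_in({"api.example.com"}),
    "read_file": lambda params: True,
})
```

With this policy, `policy.authorize("http_request", {"url": "https://evil.example.net/"})` is refused, and so is any call to a tool that has no rule at all.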