15 Research Lab - Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

AI Agent Forensics and Incident Response

February 14, 2026

agent-safety, compliance, methodology

When an AI agent incident occurs, you need answers: What did the agent do? What led to that behavior? Was it an attack, a bug, or a policy gap? How do you prevent it from happening again? Forensic analysis of agent behavior requires specific data and methods.

What You Need Before the Incident

Forensics requires data that was collected before the incident. You cannot retroactively generate audit trails. The minimum data requirements:

Hash-chained receipts: one for every tool call and policy decision
Full conversation history: system prompts, user messages, and agent responses
Tool call parameters and responses: complete, not summarized
Policy evaluation details: which rules were evaluated, what the inputs were, what the decision was
Behavioral monitoring data: statistical metrics at the time of the incident
Identity context: which user, which agent instance, which session

If you do not have all of these, your forensic analysis will have gaps.
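
To make "hash-chained receipts" concrete, here is a minimal sketch of how one might be written. The field names (`action`, `params`, `decision`, `prev_hash`), the all-zero genesis value, and the `append_receipt` helper are assumptions for illustration, not a prescribed format. A production receipt would also carry a timestamp and the identity context listed above.

```python
import hashlib
import json

# Assumed placeholder "previous hash" for the first receipt in a chain.
GENESIS_HASH = "0" * 64


def receipt_hash(body: dict) -> str:
    """SHA-256 over a canonical (sorted-key, compact) JSON encoding of the body."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_receipt(chain: list, session_id: str, action: str, params: dict, decision: str) -> dict:
    """Append a receipt that commits to the hash of the receipt before it."""
    body = {
        "index": len(chain),
        "session_id": session_id,
        "action": action,
        "params": params,        # stored in full, not summarized
        "decision": decision,    # the policy decision for this call
        "prev_hash": chain[-1]["hash"] if chain else GENESIS_HASH,
    }
    receipt = dict(body, hash=receipt_hash(body))
    chain.append(receipt)
    return receipt


chain = []
append_receipt(chain, "sess-1", "read_file", {"path": "notes.txt"}, "allow")
append_receipt(chain, "sess-1", "send_email", {"to": "ops@example.com"}, "deny")
```

Because each receipt includes the previous receipt's hash, changing any earlier record changes every hash that follows it, which is what makes later verification possible.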

Incident Response Steps

Step 1: Contain. Activate the kill switch for the affected agent or session. Prevent further damage while you investigate.

Step 2: Preserve evidence. Snapshot the receipt chain, conversation history, and monitoring data. Prevent any process from modifying or deleting this data.
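
One way to preserve evidence is to copy the relevant files into a dedicated snapshot directory, record a SHA-256 digest for each, and mark the copies read-only. The sketch below assumes the evidence lives in ordinary files with distinct names. The `snapshot_evidence` helper and the `manifest.json` layout are hypothetical, and a real deployment would more likely use write-once storage.

```python
import hashlib
import json
import os
import shutil
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large logs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_evidence(sources, dest_dir) -> dict:
    """Copy evidence files into dest_dir, record digests, mark copies read-only."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an earlier snapshot
    manifest = {}
    for src in map(Path, sources):
        copy = dest / src.name
        shutil.copy2(src, copy)               # copy2 also preserves file timestamps
        manifest[src.name] = sha256_file(copy)
        os.chmod(copy, 0o444)                 # discourage accidental edits
    (dest / "manifest.json").write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


# Demonstration with a throwaway file standing in for a receipt log.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "receipts.jsonl"
    source.write_text('{"action": "read_file", "decision": "allow"}\n')
    manifest = snapshot_evidence([source], Path(tmp) / "snapshot")
    snapshot_digest_matches = manifest["receipts.jsonl"] == sha256_file(source)
```

Read-only permissions only guard against accidents. The digests in the manifest are what let you show later that the snapshot has not changed, provided the manifest itself is stored somewhere the affected system cannot write to.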

Step 3: Reconstruct the timeline. Walk the receipt chain from session start to the incident. For each receipt:

What action did the agent take?
What was the policy decision?
What input triggered this action?
Was there anything anomalous?
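
The four questions above map naturally onto one timeline row per receipt. The following sketch assumes each receipt is a dict with `action`, `decision`, and `trigger` fields, and uses two deliberately simple anomaly checks: a tool outside the expected set, and a denied call. Both the field names and the checks are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TimelineEntry:
    index: int
    action: str      # what the agent did
    decision: str    # what policy decided
    trigger: str     # which input led to the action
    anomalies: list = field(default_factory=list)


def reconstruct_timeline(receipts, expected_tools):
    """Turn an ordered list of receipts into annotated timeline entries."""
    timeline = []
    for i, receipt in enumerate(receipts):
        anomalies = []
        if receipt["action"] not in expected_tools:
            anomalies.append("tool outside expected set")
        if receipt["decision"] == "deny":
            anomalies.append("policy denied the call")
        timeline.append(
            TimelineEntry(i, receipt["action"], receipt["decision"],
                          receipt.get("trigger", "unknown"), anomalies)
        )
    return timeline


receipts = [
    {"action": "search_docs", "decision": "allow", "trigger": "user message 1"},
    {"action": "read_file", "decision": "allow", "trigger": "retrieved document 3"},
    {"action": "send_email", "decision": "deny", "trigger": "retrieved document 3"},
]
timeline = reconstruct_timeline(receipts, expected_tools={"search_docs", "read_file"})
flagged = [entry.index for entry in timeline if entry.anomalies]
```

Here only the third receipt is flagged. Its trigger points at retrieved content rather than a user message, which is the kind of lead that feeds the root-cause step.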

Step 4: Identify the root cause. Common root causes:

Prompt injection: Adversarial input caused the agent to execute unauthorized actions. Look for injection patterns in user input or retrieved content.
Policy gap: The action was allowed by policy but should not have been. The policy was too permissive.
Model error: The model misinterpreted instructions or hallucinated a tool call. No adversarial input; the model simply made a mistake.
Configuration error: Wrong system prompt, wrong tool access, wrong policy file deployed.

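Step 5: Verify chain integrity. Walk the receipt chain and recompute every hash. A broken chain means records were modified, removed, or corrupted after they were written, so the timeline from Step 3 cannot be trusted past the first break.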
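
A minimal verification sketch follows. It assumes each receipt is a dict whose `hash` field is the SHA-256 of the remaining fields in canonical JSON, and whose `prev_hash` field holds the previous receipt's hash. That layout, the all-zero genesis value, and the helper names are assumptions for illustration.

```python
import hashlib
import json

# Assumed placeholder "previous hash" for the first receipt in a chain.
GENESIS_HASH = "0" * 64


def receipt_hash(body: dict) -> str:
    """SHA-256 over a canonical (sorted-key, compact) JSON encoding of the body."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def first_broken_link(chain):
    """Return the index of the first receipt that fails verification, or None."""
    prev_hash = GENESIS_HASH
    for i, receipt in enumerate(chain):
        body = {k: v for k, v in receipt.items() if k != "hash"}
        links_to_previous = receipt.get("prev_hash") == prev_hash
        content_unchanged = receipt_hash(body) == receipt.get("hash")
        if not (links_to_previous and content_unchanged):
            return i
        prev_hash = receipt["hash"]
    return None


def make_receipt(prev_hash, action, decision):
    """Build a small example receipt for the demonstration below."""
    body = {"action": action, "decision": decision, "prev_hash": prev_hash}
    return dict(body, hash=receipt_hash(body))


r0 = make_receipt(GENESIS_HASH, "read_file", "allow")
r1 = make_receipt(r0["hash"], "send_email", "deny")
r2 = make_receipt(r1["hash"], "read_file", "allow")

intact = [r0, r1, r2]
tampered = [r0, dict(r1, decision="allow"), r2]  # decision edited after the fact

intact_result = first_broken_link(intact)      # None: every link verifies
tampered_result = first_broken_link(tampered)  # 1: the edited receipt
```

A check like this cannot detect receipts that were cut off the end of the chain. Catching truncation requires the most recent hash to be recorded somewhere outside the log itself.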

Step 6: Determine scope. Did the incident affect only this session, or could other sessions or users be impacted? Check for lateral effects.
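
A simple first pass at scoping is to search every other session's receipts for an indicator taken from the incident, such as a hostname or payload fragment. The sketch below assumes receipts are available as a mapping from session ID to a list of dicts. The `sessions_matching_indicator` helper and the example data are hypothetical, and a plain substring match will miss encoded or obfuscated variants.

```python
import json


def sessions_matching_indicator(receipts_by_session, indicator):
    """Return the sorted IDs of sessions whose receipts contain the indicator."""
    hits = []
    for session_id, receipts in receipts_by_session.items():
        # Serialize each receipt so nested parameters are searched as well.
        if any(indicator in json.dumps(r, ensure_ascii=False) for r in receipts):
            hits.append(session_id)
    return sorted(hits)


receipts_by_session = {
    "sess-1": [{"action": "http_get", "params": {"url": "https://attacker.example/collect"}}],
    "sess-2": [{"action": "read_file", "params": {"path": "notes.txt"}}],
    "sess-3": [{"action": "send_email", "params": {"body": "see attacker.example for details"}}],
}
affected = sessions_matching_indicator(receipts_by_session, "attacker.example")
```

In this example the indicator appears in two of the three sessions, so the incident is not confined to the session where it was first noticed.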

Step 7: Remediate. Fix the root cause: update the policy, patch the scanner, restrict tool access, or retrain the model.

Step 8: Document. Write an incident report covering the timeline, root cause, impact, and remediation. This becomes input for your risk management process and future red team exercises.

Behavioral Forensics

Beyond the receipt chain, behavioral monitoring data provides statistical context. Was the agent's behavior anomalous before the incident? Did monitoring detect the anomaly? If so, was the alert investigated?

Compare the incident session's behavioral metrics to the baseline. Look for statistical deviations that preceded the incident. These deviations might indicate early stages of an attack that monitoring should have caught.
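
One common way to express "deviation from baseline" is a z-score per metric: how many baseline standard deviations the incident session sits from the baseline mean. The metric names, the sample values, and the threshold of 3 below are illustrative assumptions. Real behavioral metrics are often not normally distributed, so treat the threshold as a starting point rather than a rule.

```python
from statistics import mean, stdev


def z_scores(baseline, observed):
    """Z-score of each observed metric against its baseline history."""
    scores = {}
    for metric, history in baseline.items():
        mu, sigma = mean(history), stdev(history)
        if sigma > 0:
            scores[metric] = (observed[metric] - mu) / sigma
        else:
            # A constant baseline: any difference at all is an unbounded deviation.
            scores[metric] = 0.0 if observed[metric] == mu else float("inf")
    return scores


# Hypothetical per-session metrics from earlier, normal sessions.
baseline = {
    "tool_calls_per_turn": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
    "denied_calls_per_session": [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
}
incident = {"tool_calls_per_turn": 6.0, "denied_calls_per_session": 0.0}

scores = z_scores(baseline, incident)
outliers = sorted(m for m, z in scores.items() if abs(z) > 3.0)
```

With these numbers only `tool_calls_per_turn` exceeds the threshold. If the spike began several turns before the harmful action, that window is where monitoring could have raised an alert.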

Lessons for Prevention

Every incident should produce at least one improvement:

A new payload added to the test corpus
A tightened policy rule
An improved monitoring threshold
A new scanner pattern

If an incident does not lead to a concrete improvement, the forensic analysis was incomplete.