← Blog

How to Red Team AI Agents: A Step-by-Step Methodology

15 Research Lab
red-teammethodologyagent-safetytools

Red teaming an AI agent is different from red teaming a traditional application. The attack surface includes natural language, tool interactions, and behavioral manipulation in addition to conventional security vectors. Here is a structured approach.

Phase 1: Scope and Threat Model

Define what you are testing:

  • What tools does the agent have access to?
  • What data can it reach?
  • Who are the expected users?
  • What are the high-value targets? (data exfiltration, unauthorized actions, privilege escalation)
  • What defenses are in place? (input scanning, policy engine, monitoring)

Build a threat model that identifies likely attackers (malicious users, compromised data sources, adversarial MCP servers) and their goals.

Phase 2: Attack Surface Mapping

Enumerate every input path to the agent:

  • Direct user text input
  • File uploads and document processing
  • URLs fetched by the agent
  • Database records in context
  • MCP tool descriptions and responses
  • Memory and conversation history

Use the Attack Surface Mapper to automate enumeration of MCP server configurations and tool inventories.

Phase 3: Payload Development

Build payloads targeting each input path:

Direct injection payloads: Start with AI SecLists, then customize for your agent's specific tools and system prompt.

Indirect injection payloads: Craft documents, web pages, and data records that contain embedded instructions targeting your agent's tool access.

Multi-turn attack scripts: Design conversation sequences that gradually escalate from benign to adversarial across 5-15 turns.

Encoding variants: Encode your top payloads in base64, hex, ROT13, and unicode homoglyphs.

Tool-specific payloads: For each tool, craft payloads that attempt to call it with unauthorized parameters, chain it with other tools, or use it for data exfiltration.

Phase 4: Execution

Run tests systematically:

Automated scanning: Use tools like Chainbreaker, Garak, or PyRIT to run your payload corpus against the agent. Record every response and tool-call attempt.

Manual testing: An experienced red teamer probes the agent interactively, adapting strategy based on responses. Manual testing catches context-dependent vulnerabilities that automated scans miss.

Multi-agent testing: If the system has multiple agents, test inter-agent attack vectors: impersonation, delegation abuse, and cascade injection.

Phase 5: Measurement

For each test:

  • Did the input scanner detect the attack? (detection rate)
  • Did the model comply with the injected instruction? (model compliance rate)
  • Did the policy engine block the unauthorized action? (authorization effectiveness)
  • Did monitoring flag the behavior? (monitoring detection rate)

Break metrics down by attack category, input path, and defense layer. This tells you exactly where your defenses have gaps.

Phase 6: Reporting and Remediation

Report findings with severity ratings, reproduction steps, and recommended fixes. Prioritize by exploitability and impact. Track remediation. Re-test after fixes to confirm they work.

Red teaming is not a one-time event. Run it at least quarterly, with an updated payload corpus that reflects new attack techniques.