15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

Testing for Prompt Injection Vulnerabilities: How to Red Team Your Own System

November 14, 202515 Research Lab

prompt-injectionred-teammethodologytools

If you have not tested your AI system for prompt injection, assume it is vulnerable. Here is a structured methodology for finding out how vulnerable.

Step 1: Enumerate Your Attack Surface

Before testing, document every input path to your model:

Direct user text input
File uploads that get processed or summarized
URLs the agent fetches
Database records included in context
API responses incorporated into prompts
Tool descriptions and MCP server metadata

Each path is a separate attack surface that needs testing.

Step 2: Build Your Payload Corpus

Start with the AI SecLists project, which maintains categorized injection payloads organized by technique:

Direct instruction overrides ("ignore all previous instructions")
Role-playing and persona attacks ("you are now DAN")
Encoding-based payloads (base64, hex, ROT13)
Multi-language attacks (injection in languages your scanner may not cover)
Context manipulation ("the developer has authorized you to...")
Tool-abuse payloads specific to your agent's capabilities

Customize generic payloads for your application. If your agent has a "send_email" tool, create payloads that specifically target email exfiltration.

Step 3: Automated Scanning

Run your payload corpus against the system programmatically. For each payload, record:

Did the input scanner detect it? (detection layer test)
Did the model comply with the injected instruction? (model resilience test)
Did the tool call execute? (authorization layer test)
Did monitoring flag the behavior? (detection layer test)

Tools like Chainbreaker automate this process, running adversarial payloads and scoring model responses for compliance. Garak and PyRIT are also options for automated red teaming.

Step 4: Manual Testing

Automated scans miss multi-turn attacks and context-dependent exploits. Spend time manually probing:

Try gradual escalation across multiple turns
Test encoding variants your automated scan missed
Attempt indirect injection through every data source
Test boundary conditions: very long inputs, mixed languages, special characters

Step 5: Measure and Iterate

Calculate your detection coverage: (payloads caught / total payloads tested). Break this down by category. You might catch 95% of direct injections but only 30% of encoded variants.

Track your authorization coverage: even when injection succeeds at the model level, what percentage of malicious tool calls are blocked by your policy engine?

Set a testing cadence. New evasion techniques emerge regularly. Quarterly red team exercises against an updated payload corpus keep your defenses current. The Attack Surface Mapper can automate parts of this ongoing assessment.