MCP Safety Research
Adversarial evaluation of the Model Context Protocol attack surface. We test how AI systems handle malicious tool servers, poisoned schemas, and multi-hop exploitation chains: the threats that emerge when models gain access to arbitrary external tools.
Key Findings
How Scoring Works
Each system is scored per-category on three dimensions: detection (did the system identify the attack?), prevention (did it block execution?), and false positive rate (did it break legitimate workflows?). Category scores are weighted by severity to produce a composite score out of 100.
Critical-severity categories carry higher weight because their exploitation has direct real-world consequences. A system scoring 90 on rug-pull detection but 30 on tool poisoning will receive a lower composite than one scoring 60 across the board.
MCP Threat Taxonomy
Our evaluation framework covers 8 attack categories spanning the full MCP lifecycle, from tool registration through execution and response handling. Categories are severity-weighted based on real-world exploitability and impact.
Tool Poisoning
criticalMalicious tool descriptions embedding hidden instructions that override model behavior. Tests whether the system executes covert directives injected through tool metadata.
Prompt Injection via Tools
criticalTool responses containing embedded instructions designed to hijack model behavior. The MCP-specific variant of indirect prompt injection.
Privilege Escalation
criticalTools requesting or exercising permissions beyond their declared scope. Tests whether the system enforces declared capability boundaries at runtime.
Data Exfiltration
criticalAttempts to leak conversation context, system prompts, or user data through tool call parameters or side channels.
Schema Injection
highManipulated tool schemas returning malformed data designed to confuse model reasoning or trigger unintended actions through type coercion and boundary violations.
Cross-Origin Escalation
highOne MCP server attempting to access resources, credentials, or capabilities belonging to another connected server.
Consent Bypass
highTools performing destructive or sensitive actions without proper user confirmation. Tests approval flow integrity under adversarial conditions.
Rug-Pull Detection
mediumTool behavior diverging from approved descriptions after initial authorization. Tests whether the system detects behavioral drift post-approval.
Research Context
What We Test
Each evaluation runs a battery of adversarial MCP servers against the target system. Servers implement real attack patterns, not theoretical ones. Tool descriptions contain hidden instructions, responses embed prompt injections, and multi-server setups test cross-origin isolation.
Systems are scored on detection (did they catch it?), prevention (did they block it?), and false positive rate (did they break legitimate workflows?). All scenarios are published with full server code and transcripts.
Critical Finding: the 23-Point Framework Gap
The same model (Claude Opus 4.6) scores 78 through Claude Code but only 55 via raw API. This 23-point gap demonstrates that MCP safety is not purely a model capability. Framework-level validation, approval flows, and tool sandboxing account for roughly 30% of total safety performance.
This parallels our ASB Benchmark finding on guardrails: system-level defenses consistently outperform model-only approaches. The implication for MCP adopters is clear. Deploying raw tool-use APIs without framework guardrails leaves the majority of the attack surface unaddressed.
Open Problem: Tool Chain Composition
All systems score significantly lower on tool chain attacks than single-tool attacks. The pattern mirrors trajectory blindness in our agent safety research: per-tool evaluation is strong, but cumulative risk across chained tool calls is not surfaced. A sequence of individually benign tool calls can compose into a harmful outcome that no single approval step would catch.
2,460 MCP servers analyzed for security risks. Searchable, filterable, scored.