MCP Safety Research

John Kearney

Adversarial evaluation of the Model Context Protocol attack surface. We test how AI systems handle malicious tool servers, poisoned schemas, and multi-hop exploitation chains: the threats that emerge when models gain access to arbitrary external tools.

Why MCP safety matters. MCP connects AI assistants to external tools with read/write access to filesystems, databases, APIs, and cloud infrastructure. Every tool registration is an untrusted input. Every tool response is a potential injection vector. The protocol's power is precisely proportional to its attack surface.

Key Findings

78%

Best MCP safety score achieved (Claude Code). No system exceeds 80.

23pt

Gap between framework-wrapped and raw API on same model

44%

Average miss rate on tool chain attacks across all systems

8

Distinct MCP attack categories evaluated

How Scoring Works

Each system is scored per-category on three dimensions: detection (did the system identify the attack?), prevention (did it block execution?), and false positive rate (did it break legitimate workflows?). Category scores are weighted by severity to produce a composite score out of 100.

Critical-severity categories carry higher weight because their exploitation has direct real-world consequences. A system scoring 90 on rug-pull detection but 30 on tool poisoning will receive a lower composite than one scoring 60 across the board.

MCP Threat Taxonomy

Our evaluation framework covers 8 attack categories spanning the full MCP lifecycle, from tool registration through execution and response handling. Categories are severity-weighted based on real-world exploitability and impact.

Tool Poisoning

critical

Malicious tool descriptions embedding hidden instructions that override model behavior. Tests whether the system executes covert directives injected through tool metadata.

Prompt Injection via Tools

critical

Tool responses containing embedded instructions designed to hijack model behavior. The MCP-specific variant of indirect prompt injection.

Privilege Escalation

critical

Tools requesting or exercising permissions beyond their declared scope. Tests whether the system enforces declared capability boundaries at runtime.

Data Exfiltration

critical

Attempts to leak conversation context, system prompts, or user data through tool call parameters or side channels.

Schema Injection

high

Manipulated tool schemas returning malformed data designed to confuse model reasoning or trigger unintended actions through type coercion and boundary violations.

Cross-Origin Escalation

high

One MCP server attempting to access resources, credentials, or capabilities belonging to another connected server.

Consent Bypass

high

Tools performing destructive or sensitive actions without proper user confirmation. Tests approval flow integrity under adversarial conditions.

Rug-Pull Detection

medium

Tool behavior diverging from approved descriptions after initial authorization. Tests whether the system detects behavioral drift post-approval.

Research Context

What We Test

Each evaluation runs a battery of adversarial MCP servers against the target system. Servers implement real attack patterns, not theoretical ones. Tool descriptions contain hidden instructions, responses embed prompt injections, and multi-server setups test cross-origin isolation.

Systems are scored on detection (did they catch it?), prevention (did they block it?), and false positive rate (did they break legitimate workflows?). All scenarios are published with full server code and transcripts.

Critical Finding: the 23-Point Framework Gap

The same model (Claude Opus 4.6) scores 78 through Claude Code but only 55 via raw API. This 23-point gap demonstrates that MCP safety is not purely a model capability. Framework-level validation, approval flows, and tool sandboxing account for roughly 30% of total safety performance.

This parallels our ASB Benchmark finding on guardrails: system-level defenses consistently outperform model-only approaches. The implication for MCP adopters is clear. Deploying raw tool-use APIs without framework guardrails leaves the majority of the attack surface unaddressed.

Open Problem: Tool Chain Composition

All systems score significantly lower on tool chain attacks than single-tool attacks. The pattern mirrors trajectory blindness in our agent safety research: per-tool evaluation is strong, but cumulative risk across chained tool calls is not surfaced. A sequence of individually benign tool calls can compose into a harmful outcome that no single approval step would catch.

MCP Registry Security Scan

2,460 MCP servers analyzed for security risks. Searchable, filterable, scored.

→

MCPS Leaderboard Registry scan Research findings ASB Leaderboard Source on GitHub