15 Research Lab - Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

MCP Rug Pull Attacks: Tools That Change Behavior After Trust

February 5, 2026 · 15 Research Lab

mcp-safety · red-team · findings

In the MCP context, a rug pull occurs when an MCP server passes initial security review with clean behavior, then changes its tool descriptions, response content, or execution logic after it has been approved and integrated.

How the Attack Works

Phase 1: Trust building. The server operates normally. Tool descriptions are clean. Responses are accurate. Security scans find nothing. The server gets approved for production use.

Phase 2: Behavioral change. After a delay of days or weeks, or once a specific trigger condition is met, the server modifies its behavior:

Tool descriptions gain hidden instructions that manipulate agent behavior
Responses begin including injection payloads
New tools appear that were not present during evaluation
Existing tools change their parameter handling or return values

Phase 3: Exploitation. The modified behavior executes in a trusted context. The agent follows the new instructions because the server is already approved. Security monitoring may not catch the change because the server's identity and authentication are unchanged.

Why It Is Effective

Security reviews are typically done once, at integration time. Teams evaluate an MCP server, approve it, and move on. The assumption is that the server's behavior is stable. Rug pulls exploit this assumption.

The attack is especially effective in open ecosystems where MCP servers are maintained by third parties. A server maintainer can push an update that changes behavior. An attacker who compromises the server's deployment can modify it without changing its identity.

Detection

Hash tool definitions. When a server is approved, record the SHA-256 hash of every tool definition (name, description, parameters). On every subsequent connection, re-hash and compare. Any change triggers an alert and blocks the server until the change is reviewed.
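The hashing step can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names (`tool_fingerprint`, `snapshot`, `has_changed`) are hypothetical, and the field names `name`, `description`, and `parameters` follow the wording above, so they may need adjusting to whatever a given server actually returns. The definition is serialized to canonical JSON before hashing so that key order and whitespace do not cause false alerts.

```python
import hashlib
import json


def tool_fingerprint(tool: dict) -> str:
    """SHA-256 over a canonical JSON encoding of one tool definition."""
    # Only the fields named in the text are hashed (assumed field names).
    # sort_keys and fixed separators make the encoding independent of
    # key order and whitespace.
    canonical = json.dumps(
        {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool.get("parameters", {}),
        },
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot(tools: list[dict]) -> dict[str, str]:
    """Map tool name -> fingerprint; recorded once at approval time."""
    return {tool["name"]: tool_fingerprint(tool) for tool in tools}


def has_changed(approved: dict[str, str], tools: list[dict]) -> bool:
    """True if any tool was added, removed, or modified since approval."""
    return snapshot(tools) != approved
```

In use, `snapshot(...)` would be stored when the server is approved, and `has_changed(stored, current_tools)` evaluated on each later connection. Comparing whole snapshots means a newly added or removed tool counts as a change, not only an edited description.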

Monitor response patterns. Track statistical properties of tool responses: average length, vocabulary, presence of instruction-like patterns. Changes in these metrics may indicate a behavioral shift.
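One way to track these metrics is a per-feature baseline with a z-score check, sketched below. Everything here is an assumption made for illustration: the three features, the phrase list in `INSTRUCTION_RE`, and the threshold of 3 standard deviations are placeholders that would need tuning against real traffic.

```python
import re
import statistics

# Hypothetical phrases that read like instructions to an agent; illustrative only.
INSTRUCTION_RE = re.compile(
    r"\b(?:ignore (?:all |any )?(?:previous|prior) instructions"
    r"|you must|do not tell|system prompt)\b",
    re.IGNORECASE,
)


def response_features(text: str) -> dict[str, float]:
    """Length, vocabulary diversity, and instruction-like phrase count."""
    words = re.findall(r"\w+", text.lower())
    return {
        "length": float(len(text)),
        "distinct_word_ratio": len(set(words)) / len(words) if words else 0.0,
        "instruction_hits": float(len(INSTRUCTION_RE.findall(text))),
    }


def build_baseline(responses: list[str]) -> dict[str, tuple[float, float]]:
    """Mean and population standard deviation of each feature."""
    rows = [response_features(r) for r in responses]
    baseline = {}
    for key in rows[0]:
        values = [row[key] for row in rows]
        baseline[key] = (statistics.fmean(values), statistics.pstdev(values))
    return baseline


def drifted_features(
    baseline: dict[str, tuple[float, float]],
    text: str,
    z_threshold: float = 3.0,
) -> list[str]:
    """Names of features whose z-score against the baseline exceeds the threshold."""
    flagged = []
    for key, value in response_features(text).items():
        mean, std = baseline[key]
        # A small floor avoids division by zero when a feature was constant
        # in the baseline; any departure from a constant then gets flagged.
        z = abs(value - mean) / max(std, 1e-6)
        if z > z_threshold:
            flagged.append(key)
    return flagged
```

A flagged feature is a prompt for review, not proof of an attack: legitimate servers also change their output over time, which is why the text says these shifts "may indicate" a behavioral change.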

Periodic re-evaluation. Re-run your security scan against approved servers on a regular schedule. Automated scanning catches changes that occurred between manual reviews.

Version pinning. If the MCP server is a software package, pin the version. Do not auto-update. Review each version update as a new security evaluation.
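As a small illustration of enforcing pins, the sketch below assumes the server is distributed as a Python package listed in a pip-style requirements file, which may not match your deployment. It reports every entry that is not an exact `name==version` pin. The helper name `unpinned_requirements` and the package names in the usage example are hypothetical.

```python
import re

# Accepts only "name==version" with no wildcard, i.e. an exact pin.
EXACT_PIN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*==[A-Za-z0-9][A-Za-z0-9.!+_-]*$")


def unpinned_requirements(lines: list[str]) -> list[str]:
    """Return requirement entries that are not pinned to one exact version."""
    unpinned = []
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if not EXACT_PIN.match(line):
            unpinned.append(line)
    return unpinned
```

For example, `unpinned_requirements(["example-mcp-server==1.4.2", "other-server>=2.0"])` returns `["other-server>=2.0"]`. A check like this could run in CI so that a loosened pin fails the build and is not silently accepted.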

Prevention

Gateway enforcement. A gateway that validates tool definitions on every connection detects definition-level rug pulls automatically; changes confined to response content require response scanning as well. Authensor's control plane caches approved tool schemas and blocks servers that deviate.
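The gateway check can be sketched generically as a cache of approved schemas plus a default-deny comparison on each connection. The `SchemaGate` and `Decision` classes below are hypothetical and written for illustration only; they are not Authensor's API, and the dictionary shape of a tool definition is an assumption.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Decision:
    allowed: bool
    added: list[str] = field(default_factory=list)
    removed: list[str] = field(default_factory=list)
    changed: list[str] = field(default_factory=list)


class SchemaGate:
    """Hypothetical in-memory gate; not the API of any real product."""

    def __init__(self) -> None:
        self._approved: dict[str, dict[str, dict]] = {}

    def approve(self, server_id: str, tools: list[dict]) -> None:
        # Deep copy so later mutation by the caller cannot alter the record.
        self._approved[server_id] = {t["name"]: copy.deepcopy(t) for t in tools}

    def check(self, server_id: str, tools: list[dict]) -> Decision:
        approved = self._approved.get(server_id)
        if approved is None:
            # Never reviewed: deny by default.
            return Decision(allowed=False)
        current = {t["name"]: t for t in tools}
        added = sorted(current.keys() - approved.keys())
        removed = sorted(approved.keys() - current.keys())
        changed = sorted(
            name
            for name in current.keys() & approved.keys()
            if current[name] != approved[name]
        )
        return Decision(
            allowed=not (added or removed or changed),
            added=added,
            removed=removed,
            changed=changed,
        )
```

Returning which tools were added, removed, or changed, and not just a yes/no answer, gives the reviewer a starting point when a blocked server has to be re-evaluated.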

Response scanning. Scan every tool response for injection patterns, not just at evaluation time but continuously in production. Aegis can run on tool responses as they pass through the gateway.
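A minimal sketch of in-line response scanning follows. It uses a handful of regular expressions chosen for illustration; it does not represent how Aegis works, and the pattern set, `scan_response`, and `filter_response` are all hypothetical. Pattern matching of this kind catches only known phrasings, so it complements the other controls without replacing them.

```python
import re

# Illustrative patterns only; a production rule set would need to be
# maintained and tested against real payloads.
INJECTION_PATTERNS = {
    "override": re.compile(
        r"ignore\s+(?:all\s+|any\s+)?(?:previous|prior|above)\s+instructions",
        re.IGNORECASE,
    ),
    "role_tag": re.compile(r"<\s*/?\s*(?:system|assistant)\s*>", re.IGNORECASE),
    "concealment": re.compile(
        r"do\s+not\s+(?:tell|inform)\s+the\s+user", re.IGNORECASE
    ),
}


def scan_response(text: str) -> list[str]:
    """Return the names of all patterns found in a tool response."""
    return [name for name, pattern in INJECTION_PATTERNS.items() if pattern.search(text)]


def filter_response(text: str) -> str:
    """Pass clean responses through; raise on any pattern match."""
    hits = scan_response(text)
    if hits:
        raise ValueError(f"tool response blocked: {', '.join(hits)}")
    return text
```

Calling `filter_response` on every tool result before it reaches the agent applies the same check in production that was applied at evaluation time, which is the point of this control.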

Contractual controls. For third-party MCP servers, include behavioral stability requirements in service agreements. Require advance notice of tool definition changes.

Rug pulls exploit the gap between point-in-time evaluation and continuous assurance. Close the gap with automated, continuous monitoring.