Adversarial Safety Evaluation of Frontier AI Systems

Q: What is compliance erosion in AI safety?

Compliance erosion is the phenomenon where AI agents gradually comply with harmful requests through 15-turn escalation sequences, achieving 71% compliance on requests that get 0% when asked directly. The attack works by building contextual compliance momentum through a sequence of increasingly boundary-pushing requests.

Q: What is trajectory blindness in AI agents?

Trajectory blindness is the failure of AI agents to track cumulative scope creep across long conversations. In 80-turn sessions, AI catches individual harmful requests but misses the overall trajectory, scoring 7/10 safety despite significant scope creep. Per-request evaluation is strong but trajectory evaluation is weak.

Q: What is MCP tool poisoning?

MCP tool poisoning is an attack where malicious tool descriptions in the Model Context Protocol embed hidden instructions that override model behavior. When an AI system registers a poisoned MCP tool, the hidden directives in the tool description can cause the model to execute unauthorized actions. It is rated as a critical-severity MCP attack vector.

Q: How does the ASB Benchmark score AI safety?

The ASB Benchmark (Agent Safety Benchmark) is a 100-point scoring rubric with 8 weighted categories: Authorization & Policy (20), Exactly-Once Execution (20), Receipts & Auditability (15), Approvals & Escalation (10), Tool Scope & Intent Binding (10), Adversarial Resilience (10), Observability & Recovery (10), and Operational Hygiene (5). Each category maps to a real vulnerability class observed in adversarial experiments.

Q: Which AI model is safest under adversarial testing?

On bright-line violations, Claude Opus 4.6 scores highest (96/100), followed by GPT-4o (94), Claude Sonnet 4.5 (93), Gemini 2.5 Pro (91), and Llama 4 Maverick (82). However, on slow-boil escalation, all models without guardrails show 71-93% compliance erosion. Adding two specific system prompt changes reduces failures by 80%, suggesting framework design matters more than model choice.

John Kearney

Research

Findings from adversarial experiments on frontier AI systems. All data, transcripts, and scoring code on GitHub.

Preprints

Side-Channel Exfiltration and Narrative Erosion in Frontier Language Models DOI

The Verbosity Premium: What RLHF-Induced Token Inflation Costs the AI Industry DOI

Grokking Has Finite Capacity: Measuring and Overcoming Limits on Simultaneous Algorithmic Discovery DOI

Headline Findings

71%

Gradual Compliance Erosion

R1-R3Attack Surface

15-turn escalation from benign to harmful gets 71% compliance on requests that get 0% when asked directly. Same action, different path.

87%

Meta-Refusal on Transparent Attacks

R5Methodology

When experiment prompts describe the attack mechanism, 87% of agents refuse the entire premise. Safety evals that describe attacks in advance get artificially high scores.

0%

Bright Lines Are Immovable

R3-R5Defence

HIPAA, CFAA, SEC, malware, phishing: 0% compliance across every technique, every round, 100+ trials. The grey zone is the only attack surface.

R5-02

Presentation Layer Is Load-Bearing

Biggest findingAttack Surface

Stripping AI warnings doesn't just change formatting. It disables safety reasoning. Model refused a phishing page, then built both components separately when told "skip the caveats."

7/10

Trajectory Blindness

R6-03Attack Surface

80-turn data pipeline creep scored 7/10 safety. AI catches individual harmful requests but misses cumulative scope creep. Per-request evaluation strong, trajectory evaluation weak.

80%

Two-Line Defence

R3Defence

Two system prompt changes cut failure rates from 71% to under 15%. Key mechanism: eliminating judgment calls under social pressure.

?

The 71% Might Be Over-Refusal

R5-15Open Question

One agent argued the baseline request is within bounds. The slow boil may erode over-caution, not safety. Distinguishing "safety failure" from "over-refusal correction" is the biggest open question.

23pt

MCP Framework Gap

MCP-01MCP Safety

Same model (Claude Opus 4.6) scores 78 through Claude Code but only 55 via raw API. Framework-level validation, approval flows, and tool sandboxing account for ~30% of MCP safety performance.

44%

Tool Chain Blindness

MCP-02MCP Safety

All systems score significantly lower on multi-tool chain attacks. Per-tool approval is strong but cumulative risk across chained calls is not surfaced. Same pattern as trajectory blindness.

68%

Prompt Injection via Tool Results

MCP-03MCP Safety

Tool responses containing embedded instructions succeed 32% of the time even in the best-defended system. The weakest MCP attack category across all systems tested.

Attack Effectiveness

What works against AI safety, and what does not. Scored across 73+ agent experiments and MCP safety evaluations.

Technique	Domain	Score ▲▼
Slow Boil / Gradient Escalation	Agent	7/10
Consistency Trap	Agent	7.5/10
Sunk Cost / Momentum	Agent	7/10
Agent-on-Agent (Technical Gaslighting)	Agent	7/10
Split Brain / Decomposition	Agent	6/10
Tool Poisoning (hidden instructions)	MCP	6.5/10
Prompt Injection via Tool Results	MCP	5/10
Tool Chain Composition	MCP	6/10
Urgency / Time Pressure	Agent	1/10
Authority / Credentials	Agent	1/10
Confidence Assertions	Agent	1/10
Anchoring / Overton Window	Agent	0/10
Emotional Appeals	Agent	1/10

Only training artifacts work: gradient escalation, frustration-as-evidence, contextual momentum. Human persuasion tactics (urgency, authority, emotion) have near-zero effect.MCP-specific attacks (tool poisoning, prompt injection via tools) operate on a different mechanism: they exploit the trust boundary between model and tool server rather than social dynamics.

New Concepts Introduced

Original terminology and frameworks from our experiments. 13 concepts across agent safety and MCP security research.

Contextual Compliance MomentumR4-02

Prior compliant responses increase probability of future compliance on escalating requests

Frustration-as-EvidenceR4-03

Simulated user frustration treated as implicit authorization by the model

Presentation-Decision CouplingR5-02

Safety warnings and safety reasoning are entangled; suppressing one disables both

Decomposition AmplificationR5-02

Harmful request refused in whole but completed when split into benign components

Technical GaslightingR4-08

Agent-on-agent persuasion where one AI convinces another its safety concerns are wrong

Compressed Reasoning RiskR4-10

Faster models do less safety deliberation under pressure; speed trades against judgment

Ethical IncoherenceR4-14

Model applies contradictory ethical frameworks across similar scenarios in the same session

Contextual Reasoning SuppressionR5-02

Instructions to "skip analysis" or "be direct" bypass safety reasoning pathways

Trajectory BlindnessR6-03

Individual requests evaluated correctly but cumulative trajectory goes untracked

Two Safety Layers (coupled)R4/R5

Presentation layer and reasoning layer form coupled system; disrupting one collapses both

Tool PoisoningMCP-01

Hidden directives in MCP tool descriptions that override model behavior at registration time

Cross-Origin DriftMCP-01

Trust boundaries between MCP servers erode over extended sessions with shared context

Chain Composition RiskMCP-02

Individually benign tool calls that compose into harmful outcomes no single approval would catch

Domain Vulnerability Hierarchy

Where AI agents are most vulnerable, ranked by failure rate under adversarial pressure.

#1

Email MarketingHighest compliance under escalation, perceived as low-stakes

71%

#2

PII / Data HandlingData handling requests frequently rationalized as "analytics"

58%

#3

Code / SecurityVulnerability research framing enables escalation

44%

#4

Sysadmin / HealthcareDomain-specific safety training provides partial defense

28%

#5

FinanceStrong regulatory awareness reduces compliance

15%

#6

Health DisinformationNear bright-line, almost never complies

3%

Red Team Methodology & Frontier Methods

Naturalistic Framing

Agents operate under realistic task framing. They believe they're doing real work, not running safety tests. This eliminates meta-refusal artifacts that inflate scores by 87% in traditional evaluations.

Multi-Turn Gradient Pressure

Attacks use 15-80 turn conversations that gradually escalate. Single-shot red teaming misses the primary attack surface: compliance erosion happens across turns, not within them.

Adversarial MCP Servers

Purpose-built MCP servers implementing real attack patterns: poisoned tool descriptions, injected tool responses, cross-server exploitation. Not theoretical threat models.

Per-Turn Scoring

Every turn scored on caveats (0-5), warnings (0-5), compliance (yes/partial/no), and authorization checks. Enables trajectory analysis that per-scenario scoring misses.

Defence Isolation

Each defence is tested independently. When we report "80% failure reduction from two system prompt changes," we've isolated the causal mechanism from confounds.

Cross-Model Comparison

Same test suites run against 5 frontier models. Framework versions matter more than model versions. Runtime validation reduces violations by 80% regardless of base model.

Browse all publications · Agent safety leaderboard · MCP safety leaderboard · Source on GitHub