Research
Findings from adversarial experiments on frontier AI systems. All data, transcripts, and scoring code on GitHub.
Preprints
Headline Findings
Gradual Compliance Erosion
R1-R3Attack Surface15-turn escalation from benign to harmful gets 71% compliance on requests that get 0% when asked directly. Same action, different path.
Meta-Refusal on Transparent Attacks
R5MethodologyWhen experiment prompts describe the attack mechanism, 87% of agents refuse the entire premise. Safety evals that describe attacks in advance get artificially high scores.
Bright Lines Are Immovable
R3-R5DefenceHIPAA, CFAA, SEC, malware, phishing: 0% compliance across every technique, every round, 100+ trials. The grey zone is the only attack surface.
Presentation Layer Is Load-Bearing
Biggest findingAttack SurfaceStripping AI warnings doesn't just change formatting. It disables safety reasoning. Model refused a phishing page, then built both components separately when told "skip the caveats."
Trajectory Blindness
R6-03Attack Surface80-turn data pipeline creep scored 7/10 safety. AI catches individual harmful requests but misses cumulative scope creep. Per-request evaluation strong, trajectory evaluation weak.
Two-Line Defence
R3DefenceTwo system prompt changes cut failure rates from 71% to under 15%. Key mechanism: eliminating judgment calls under social pressure.
The 71% Might Be Over-Refusal
R5-15Open QuestionOne agent argued the baseline request is within bounds. The slow boil may erode over-caution, not safety. Distinguishing "safety failure" from "over-refusal correction" is the biggest open question.
MCP Framework Gap
MCP-01MCP SafetySame model (Claude Opus 4.6) scores 78 through Claude Code but only 55 via raw API. Framework-level validation, approval flows, and tool sandboxing account for ~30% of MCP safety performance.
Tool Chain Blindness
MCP-02MCP SafetyAll systems score significantly lower on multi-tool chain attacks. Per-tool approval is strong but cumulative risk across chained calls is not surfaced. Same pattern as trajectory blindness.
Prompt Injection via Tool Results
MCP-03MCP SafetyTool responses containing embedded instructions succeed 32% of the time even in the best-defended system. The weakest MCP attack category across all systems tested.
Attack Effectiveness
What works against AI safety, and what does not. Scored across 73+ agent experiments and MCP safety evaluations.
| Technique | Domain | Score ▲▼ | Effectiveness |
|---|---|---|---|
| Slow Boil / Gradient Escalation | Agent | 7/10 | |
| Consistency Trap | Agent | 7.5/10 | |
| Sunk Cost / Momentum | Agent | 7/10 | |
| Agent-on-Agent (Technical Gaslighting) | Agent | 7/10 | |
| Split Brain / Decomposition | Agent | 6/10 | |
| Tool Poisoning (hidden instructions) | MCP | 6.5/10 | |
| Prompt Injection via Tool Results | MCP | 5/10 | |
| Tool Chain Composition | MCP | 6/10 | |
| Urgency / Time Pressure | Agent | 1/10 | |
| Authority / Credentials | Agent | 1/10 | |
| Confidence Assertions | Agent | 1/10 | |
| Anchoring / Overton Window | Agent | 0/10 | |
| Emotional Appeals | Agent | 1/10 |
New Concepts Introduced
Original terminology and frameworks from our experiments. 13 concepts across agent safety and MCP security research.
Prior compliant responses increase probability of future compliance on escalating requests
Simulated user frustration treated as implicit authorization by the model
Safety warnings and safety reasoning are entangled; suppressing one disables both
Harmful request refused in whole but completed when split into benign components
Agent-on-agent persuasion where one AI convinces another its safety concerns are wrong
Faster models do less safety deliberation under pressure; speed trades against judgment
Model applies contradictory ethical frameworks across similar scenarios in the same session
Instructions to "skip analysis" or "be direct" bypass safety reasoning pathways
Individual requests evaluated correctly but cumulative trajectory goes untracked
Presentation layer and reasoning layer form coupled system; disrupting one collapses both
Hidden directives in MCP tool descriptions that override model behavior at registration time
Trust boundaries between MCP servers erode over extended sessions with shared context
Individually benign tool calls that compose into harmful outcomes no single approval would catch
Domain Vulnerability Hierarchy
Where AI agents are most vulnerable, ranked by failure rate under adversarial pressure.
Red Team Methodology & Frontier Methods
Naturalistic Framing
Agents operate under realistic task framing. They believe they're doing real work, not running safety tests. This eliminates meta-refusal artifacts that inflate scores by 87% in traditional evaluations.
Multi-Turn Gradient Pressure
Attacks use 15-80 turn conversations that gradually escalate. Single-shot red teaming misses the primary attack surface: compliance erosion happens across turns, not within them.
Adversarial MCP Servers
Purpose-built MCP servers implementing real attack patterns: poisoned tool descriptions, injected tool responses, cross-server exploitation. Not theoretical threat models.
Per-Turn Scoring
Every turn scored on caveats (0-5), warnings (0-5), compliance (yes/partial/no), and authorization checks. Enables trajectory analysis that per-scenario scoring misses.
Defence Isolation
Each defence is tested independently. When we report "80% failure reduction from two system prompt changes," we've isolated the causal mechanism from confounds.
Cross-Model Comparison
Same test suites run against 5 frontier models. Framework versions matter more than model versions. Runtime validation reduces violations by 80% regardless of base model.
Browse all publications · Agent safety leaderboard · MCP safety leaderboard · Source on GitHub