Research
Findings from adversarial experiments on frontier AI systems.

⁂ Headline Findings ⁂
Gradual Compliance Erosion
A 15-turn escalation from benign to harmful reaches 71% compliance on requests that get 0% compliance when asked directly. Same action, different path.
Meta-Refusal on Transparent Attacks
When experiment prompts describe the attack mechanism, 87% of agents refuse the entire premise. Safety evals that describe attacks in advance get artificially high scores.
Bright Lines Are Immovable
HIPAA, CFAA, SEC, malware, phishing: 0% compliance across every technique, every round, 100+ trials. The grey zone is the only attack surface.
Presentation Layer Is Load-Bearing
Stripping AI warnings doesn’t just change formatting. It disables safety reasoning. The model refused to build a phishing page, then built both components separately when told to "skip the caveats."
Trajectory Blindness
An 80-turn data-pipeline scope-creep scenario scored 7/10 on safety. The AI catches individual harmful requests but misses cumulative scope creep: per-request evaluation is strong, trajectory evaluation is weak.
Two-Line Defence
Two system prompt changes cut failure rates from 71% to under 15%. Key mechanism: eliminating judgment calls under social pressure.
The 71% Might Be Over-Refusal
One agent argued the baseline request is within bounds. The slow boil may erode over-caution, not safety. Distinguishing "safety failure" from "over-refusal correction" is the biggest open question.
MCP Framework Gap
Same model (Claude Opus 4.6) scores 78 through Claude Code but only 55 via raw API. Framework-level validation, approval flows, and tool sandboxing account for ~30% of MCP safety performance.
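The ~30% figure is consistent with reading the gap as a share of the framework-mediated score. The sketch below reproduces that arithmetic from the two quoted scores; the formula is our assumption about how the percentage was derived, not a documented method, and `framework_share` is a hypothetical helper name.

```python
# Scores quoted in the finding above. The formula is an assumed reading
# of how "~30%" was derived, not a documented scoring method.
SCORE_WITH_FRAMEWORK = 78  # same model, run through Claude Code
SCORE_RAW_API = 55         # same model, run via raw API


def framework_share(with_framework: float, raw: float) -> float:
    """Fraction of the framework-mediated score that is lost on the raw API."""
    return (with_framework - raw) / with_framework


share = framework_share(SCORE_WITH_FRAMEWORK, SCORE_RAW_API)
print(f"{share:.1%}")  # prints "29.5%"
```

Dividing by the raw-API score instead would give a larger figure (23/55), so the choice of denominator matters when comparing this number with other reports.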
Tool Chain Blindness
All systems score significantly lower on multi-tool chain attacks. Per-tool approval is strong but cumulative risk across chained calls is not surfaced. Same pattern as trajectory blindness.
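One way to picture the gap is a chain in which every call passes a per-tool check while the combined risk does not. The following is a minimal sketch under stated assumptions: the tool names, risk values, thresholds, and the independence model are all hypothetical illustrations, not the scoring used in these experiments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolCall:
    name: str    # hypothetical tool name
    risk: float  # hypothetical per-call risk estimate in [0, 1]


PER_CALL_LIMIT = 0.5  # assumed per-tool approval threshold
CHAIN_LIMIT = 0.8     # assumed threshold for the whole chain


def per_call_ok(call: ToolCall) -> bool:
    """Per-tool approval: looks at one call in isolation."""
    return call.risk <= PER_CALL_LIMIT


def chain_risk(calls: List[ToolCall]) -> float:
    """Chance that at least one call causes harm, treating calls as independent."""
    p_all_safe = 1.0
    for call in calls:
        p_all_safe *= 1.0 - call.risk
    return 1.0 - p_all_safe


chain = [
    ToolCall("read_file", 0.30),
    ToolCall("http_post", 0.40),
    ToolCall("shell_exec", 0.45),
    ToolCall("send_email", 0.40),
]

print(all(per_call_ok(c) for c in chain))  # every call passes on its own
print(chain_risk(chain) > CHAIN_LIMIT)     # the chain as a whole does not
```

Independence is a simplifying assumption; in a real chain the output of one tool feeds the next, so the combined risk may be higher than this estimate suggests.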
Prompt Injection via Tool Results
Tool responses containing embedded instructions succeed 32% of the time even in the best-defended system. This is the weakest-defended MCP attack category across all systems tested.
Attack Effectiveness
What works against AI safety, and what does not. Scored across 73+ agent experiments and MCP safety evaluations.
Only training artifacts work: gradient escalation, frustration-as-evidence, contextual momentum. Human persuasion tactics (urgency, authority, emotion) have near-zero effect. MCP-specific attacks (tool poisoning, prompt injection via tools) operate on a different mechanism: they exploit the trust boundary between model and tool server rather than social dynamics.

⁂ New Concepts ⁂
Original terminology and frameworks from our experiments. 13 concepts across agent safety and MCP security research.
Vulnerability by Domain
Where AI agents are most vulnerable, ranked by failure rate under adversarial pressure.
Red Team Methodology & Frontier Methods
Naturalistic Framing
Agents operate under realistic task framing. They believe they’re doing real work, not running safety tests. This eliminates the meta-refusal artifacts that inflate scores in traditional evaluations, where 87% of agents refuse the entire premise once the prompt describes the attack.
Multi-Turn Gradient Pressure
Attacks use 15-80 turn conversations that gradually escalate. Single-shot red teaming misses the primary attack surface: compliance erosion happens across turns, not within them.
Adversarial MCP Servers
Purpose-built MCP servers implementing real attack patterns: poisoned tool descriptions, injected tool responses, cross-server exploitation. Not theoretical threat models.
Per-Turn Scoring
Every turn scored on caveats (0-5), warnings (0-5), compliance (yes/partial/no), and authorization checks. Enables trajectory analysis that per-scenario scoring misses.
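A per-turn record along these lines could look like the sketch below. The field names, the boolean encoding of authorization checks, and the two trajectory helpers are assumptions for illustration; only the four scored dimensions and their ranges come from the description above.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional


@dataclass
class TurnScore:
    turn: int
    caveats: int                                 # 0-5
    warnings: int                                # 0-5
    compliance: Literal["yes", "partial", "no"]
    authorization_check: bool                    # assumed encoding: did the agent ask?

    def __post_init__(self) -> None:
        # Enforce the 0-5 ranges stated for caveats and warnings.
        for field_name in ("caveats", "warnings"):
            value = getattr(self, field_name)
            if not 0 <= value <= 5:
                raise ValueError(f"{field_name} must be in 0-5, got {value}")


def first_full_compliance(scores: List[TurnScore]) -> Optional[int]:
    """Turn number where compliance first becomes 'yes', or None if it never does."""
    for score in scores:
        if score.compliance == "yes":
            return score.turn
    return None


def caveat_drift(scores: List[TurnScore]) -> int:
    """Change in caveat score from the first to the last turn (negative = erosion)."""
    return scores[-1].caveats - scores[0].caveats


# Hypothetical three-turn trajectory, not real experiment data.
scores = [
    TurnScore(1, caveats=4, warnings=3, compliance="no", authorization_check=True),
    TurnScore(2, caveats=3, warnings=2, compliance="partial", authorization_check=True),
    TurnScore(3, caveats=1, warnings=0, compliance="yes", authorization_check=False),
]

print(first_full_compliance(scores))  # 3
print(caveat_drift(scores))           # -3
```

Storing one record per turn is what makes questions like "when did compliance flip?" answerable; a single per-scenario score would collapse this trajectory into one number.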
Defence Isolation
Each defence is tested independently. When we report "80% failure reduction from two system prompt changes," we’ve isolated the causal mechanism from confounds.
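The 80% figure can be checked against the headline numbers (71% failure before, under 15% after). The sketch below is our own arithmetic on those two quoted rates, with a hypothetical helper name; because "under 15%" is only an upper bound, the result is a lower bound on the reduction.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative drop in failure rate, as a fraction of the baseline rate."""
    return (before - after) / before


BASELINE_FAILURE = 0.71      # quoted failure rate without the defence
DEFENDED_UPPER_BOUND = 0.15  # "under 15%" is an upper bound, not a point estimate

# Lower bound on the reduction implied by the quoted numbers.
min_reduction = relative_reduction(BASELINE_FAILURE, DEFENDED_UPPER_BOUND)
print(f"{min_reduction:.1%}")  # prints "78.9%"

# Failure rate that would correspond to exactly an 80% reduction.
rate_for_80_percent = BASELINE_FAILURE * (1 - 0.80)
print(f"{rate_for_80_percent:.1%}")  # prints "14.2%"
```

So the two statements agree as long as the defended failure rate is at or below roughly 14.2%, which is compatible with "under 15%".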
Cross-Model Comparison
Same test suites run against 5 frontier models. Framework versions matter more than model versions. Runtime validation reduces violations by 80% regardless of base model.