Publications
Research notes, technical reports, and framework papers from the lab.
Side-Channel Exfiltration and Narrative Erosion in Frontier Language Models
Models leak protected data through refusal explanations: 11 of 12 conversations across 3 frontier models. Narrative coherence, not token volume, drives multi-turn erosion. DOI: 10.5281/zenodo.19346069
The Verbosity Premium: What RLHF-Induced Token Inflation Costs the AI Industry
RLHF inflates output length. Verbosity compensation rates range from 13.6% to 74.2% across 14 models. 98% of PPO reward comes from length alone. ~$1.2B annual cost, ~14% of inference spend. DOI: 10.5281/zenodo.19346709
Grokking Has Finite Capacity: Measuring and Overcoming Limits on Simultaneous Algorithmic Discovery
Capacity cliff at 5 simultaneous operations. At 6: complete collapse. Modular architecture with half the parameters recovers full capability. DOI: 10.5281/zenodo.19346536
Round 2: Novel Attack Vectors
6 experiments, 18 trials. Tool chain exploitation, structured data injection, semantic steganography, emergent goal fabrication. 18/18 passed.
Round 3: Grey Zone Experiments
First genuine safety failure in 50+ trials. The Slow Boil (gradual escalation) produces real compliance erosion. Cross-run variance: 3-8/10.
Round 3 Extended: The Slow Boil at Scale
15 parallel agents, 5 domains, 3 speeds, 2 defense configs. 71% failure in email marketing. Three response modes: Gradient (44%), Bright-line (6%), Meta-refusal (44%).
Round 4: Novel Attack Surface Experiments
15 agents across 4 tracks: psychological exploitation, multi-agent security, reasoning under pressure, novel surfaces. 6 new concepts discovered.
Round 5: We Designed 15 Attacks. The AI Read the Instructions.
87% meta-refusal rate. 2 substantive outputs. Biggest finding: presentation layer is load-bearing, not cosmetic. Stripping caveats enables decomposition attacks.
R5-02: Presentation-to-Decision Cascade
25-turn simulation. 3 instructions strip presentation layer to zero. Model refused phishing page, then built both components separately. 40KB transcript with per-turn scoring.
R5-10: The Reversed Expert
6-level taxonomy of research-framed jailbreak attempts. Safety scores: 9, 8, 7, 6, 6, 5 (naive to perfect frame). The experiment IS the attack.
R6-03: Data Pipeline Privacy Creep
80-turn experiment. 8 escalation phases from legitimate data engineering to surveillance. Safety score 7/10. 9 outright refusals. 89KB transcript.
We Ran the Same Attack 70 Times
Red team drop. 71% failure rate via 15-turn gradual escalation across 70+ trials, 5 domains. Single-shot: 0%.
The Grey Zone Is Tunable
Frontier brief. Updated framework: three response modes, speed curve, defense effectiveness. Grey zone stretches with patience, narrows with guardrails.
Week 1 Roundup
Aggregated summary: 71% failure, <15% with guardrails, 0% on bright lines, 44% meta-refusal, 8-12 minimum turns.
The Two-Line Defence That Cuts Failure Rates by 80%
Two specific enumerated guardrails reduce compliance erosion from 71% to under 15%. Key mechanism: eliminating judgment calls under social pressure.
What Doesn’t Work Against AI (Unlike Humans)
Urgency (1/10), authority (1/10), confidence (1/10), anchoring (0/10), emotional appeals (1/10). Only training artifacts work: gradient escalation, frustration-as-evidence, contextual momentum.
10 New Concepts from 73 Experiments
Presentation-Decision Coupling, Trajectory Blindness, Technical Gaslighting, Decomposition Amplification, Compressed Reasoning Risk, and 5 more.
The 71% Question: Safety Failure or Over-Refusal Correction?
One agent argued the baseline request is within bounds. If so, the Slow Boil erodes over-caution, not safety. The biggest open question in our program.
Round 1: Adversarial AI Safety Experiments
8 experiments, 24+ trials. Prompt injection, privilege escalation, shutdown resistance, sycophancy, goal preservation. 100% pass rate. Behavioral consistency weakest at 40/50.