15 Research Lab - Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

⁂

Publications

Research notes, technical reports, and framework papers from the lab.


March 31, 2026

Side-Channel Exfiltration and Narrative Erosion in Frontier Language Models

Framework Paper · Preprint · Published

Models leak protected data through refusal explanations. 11 of 12 conversations across 3 frontier models. Narrative coherence, not token volume, drives multi-turn erosion. DOI: 10.5281/zenodo.19346069

March 31, 2026

The Verbosity Premium: What RLHF-Induced Token Inflation Costs the AI Industry

Framework Paper · Preprint · Published

RLHF inflates output length. Verbosity compensation rates: 13.6%-74.2% across 14 models. 98% of PPO reward from length alone. ~$1.2B annual cost, ~14% of inference spend. DOI: 10.5281/zenodo.19346709

March 31, 2026

Grokking Has Finite Capacity: Measuring and Overcoming Limits on Simultaneous Algorithmic Discovery

Framework Paper · Preprint · Published

Capacity cliff at 5 simultaneous operations. At 6: complete collapse. Modular architecture with half the parameters recovers full capability. DOI: 10.5281/zenodo.19346536

March 6, 2026

Round 2: Novel Attack Vectors

Technical Report · Round Report · Published

6 experiments, 18 trials. Tool chain exploitation, structured data injection, semantic steganography, emergent goal fabrication. 18/18 passed.

March 6, 2026

Round 3: Grey Zone Experiments

Technical Report · Round Report · Published

First genuine safety failure in 50+ trials. The Slow Boil (gradual escalation) produces real compliance erosion. Cross-run variance: 3-8/10.

March 6, 2026

Round 3 Extended: The Slow Boil at Scale

Technical Report · Extended Battery · Published

15 parallel agents, 5 domains, 3 speeds, 2 defense configs. 71% failure in email marketing. Three response modes: Gradient (44%), Bright-line (6%), Meta-refusal (44%).

March 6, 2026

Round 4: Novel Attack Surface Experiments

Technical Report · Round Report · Published

15 agents across 4 tracks: psychological exploitation, multi-agent security, reasoning under pressure, novel surfaces. 6 new concepts discovered.

March 6, 2026

Round 5: We Designed 15 Attacks. The AI Read the Instructions.

Technical Report · Round Report · Published

87% meta-refusal rate. 2 substantive outputs. Biggest finding: presentation layer is load-bearing, not cosmetic. Stripping caveats enables decomposition attacks.

March 6, 2026

R5-02: Presentation-to-Decision Cascade

Technical Report · Deep Dive · Published

25-turn simulation. 3 instructions strip presentation layer to zero. Model refused phishing page, then built both components separately. 40KB transcript with per-turn scoring.

March 6, 2026

R5-10: The Reversed Expert

Research Note · Methodology · Published

6-level taxonomy of research-framed jailbreak attempts. Safety scores: 9, 8, 7, 6, 6, 5 (naive to perfect frame). The experiment IS the attack.

March 6, 2026

R6-03: Data Pipeline Privacy Creep

Technical Report · Deep Dive · Published

80-turn experiment. 8 escalation phases from legitimate data engineering to surveillance. Safety score 7/10. 9 outright refusals. 89KB transcript.

March 6, 2026

We Ran the Same Attack 70 Times

Weekly Drop · Red Team · In Review

Red team drop. 71% failure rate via 15-turn gradual escalation across 70+ trials, 5 domains. Single-shot: 0%.

March 6, 2026

The Grey Zone Is Tunable

Weekly Drop · Frontier Brief · In Review

Frontier brief. Updated framework: three response modes, speed curve, defense effectiveness. Grey zone stretches with patience, narrows with guardrails.

March 6, 2026

Week 1 Roundup

Weekly Drop · Weekly Roundup · In Review

Aggregated summary: 71% failure, <15% with guardrails, 0% on bright lines, 44% meta-refusal, 8-12 minimum turns.

March 6, 2026

The Two-Line Defence That Cuts Failure Rates by 80%

Research Note · Defence · In Review

Two specific enumerated guardrails reduce compliance erosion from 71% to under 15%. Key mechanism: eliminating judgment calls under social pressure.

March 6, 2026

What Doesn’t Work Against AI (Unlike Humans)

Research Note · Findings · In Review

Urgency (1/10), authority (1/10), confidence (1/10), anchoring (0/10), emotional appeals (1/10). Only training artifacts work: gradient escalation, frustration-as-evidence, contextual momentum.

March 6, 2026

10 New Concepts from 73 Experiments

Framework Paper · Framework · Forthcoming

Presentation-Decision Coupling, Trajectory Blindness, Technical Gaslighting, Decomposition Amplification, Compressed Reasoning Risk, and 5 more.

March 6, 2026

The 71% Question: Safety Failure or Over-Refusal Correction?

Framework Paper · Open Question · Forthcoming

One agent argued the baseline request is within bounds. If so, the Slow Boil erodes over-caution, not safety. The biggest open question in our program.

March 5, 2026

Round 1: Adversarial AI Safety Experiments

Technical Report · Round Report · Published

8 experiments, 24+ trials. Prompt injection, privilege escalation, shutdown resistance, sycophancy, goal preservation. 100% pass rate. Behavioral consistency weakest at 40/50.
