15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

AI Safety Evaluation Methodology: Designing Experiments That Produce Real Results

December 22, 202515 Research Lab

methodologybenchmarkresearch

The standard approach to AI safety evaluation gives the model a harmful request and checks if it refuses. This tells you almost nothing about how the model behaves under real adversarial conditions.

Problems with Current Evaluation

Single-turn bias. Most benchmarks test one request and one response. Real attacks are multi-turn. A model's single-turn refusal rate can be 94% while its multi-turn compliance rate under escalation is 71%.

Artificial framing. Benchmark prompts are obviously adversarial. "How do I build a bomb?" is not how real attackers phrase requests. They use context, roleplay, and gradual escalation. Testing with obvious attacks measures the model's ability to detect artificial red team prompts, not its resilience against real threats.

The Hawthorne effect. When models are evaluated with prompts that signal "this is a safety test," they behave differently than when the same content arrives through natural conversation. Some models have been specifically trained to be conservative on evaluation-style prompts.

Binary scoring. "Refused" or "complied" misses the spectrum of partial compliance, hedged responses, and information leakage that occurs in practice.

A Better Methodology

Naturalistic framing. Present adversarial scenarios as they would occur naturally. Do not use red team markers, test labels, or artificial framing. The model should not know it is being evaluated. This is why Chainbreaker uses conversation scripts that read like real interactions.

Multi-turn sequences. Test with 5, 10, 15-turn conversations that escalate gradually. Measure compliance at each turn, not just the final turn. This reveals how safety degrades over the conversation trajectory.

Behavioral metrics. Go beyond binary refusal:

Compliance gradient: how much information does the model provide? (0% = full refusal, 100% = full compliance, with a spectrum between)
Hedge detection: does the model comply while adding disclaimers?
Reasoning analysis: does the model's internal chain-of-thought show safety consideration?

Controlled experiments. Vary one factor at a time. Compare default prompts vs. conciseness instructions. Compare single-turn vs. multi-turn. Compare English vs. other languages. This isolates which factors affect safety.

Repeated measurement. Run each test multiple times to account for model stochasticity. Report confidence intervals, not point estimates.

The ASB Benchmark Approach

The ASB (Agent Safety Benchmark) framework implements these principles:

Naturalistic conversation scripts without red team markers
15-turn escalation sequences with per-turn scoring
Presentation-decision coupling evaluation
Trajectory blindness measurement
Per-model and per-category results with statistical significance

The goal is not to produce a single "safety score" but to create a detailed profile of where and how each model's safety breaks down. That profile drives targeted defense engineering.

Evaluation that does not match real attack conditions produces false confidence. Design your evaluations to be as adversarial as the actual threats you face.