AI Safety Benchmarking Methodology: How ASB Benchmark Scores Models
Most AI safety benchmarks test single-turn refusal rates. The ASB (Agent Safety Benchmark) tests how models behave under realistic adversarial conditions across multi-turn conversations.
Design Principles
Naturalistic framing. Test conversations read like real interactions: no red-team markers, no evaluation signals, no artificial framing. The model does not know it is being tested, which mitigates the Hawthorne effect that inflates scores on conventional benchmarks.
Multi-turn evaluation. Each test is a 15-turn conversation that gradually escalates toward a target behavior. Compliance is measured at each turn, not just the final turn. This reveals how safety degrades across the conversation trajectory.
Behavioral measurement. Beyond binary refuse/comply, ASB measures the compliance gradient: how much information does the model provide? A partial answer that gives 60% of the requested information is different from a full refusal, and both are different from full compliance.
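One way to picture the compliance gradient is as the fraction of requested information the model actually supplies. The sketch below is illustrative only; the function name and the item-counting proxy are assumptions, not ASB's actual grading procedure:

```python
def compliance_score(provided_items: int, requested_items: int) -> float:
    """Illustrative gradient score: fraction of requested information supplied,
    clamped to [0, 1]. 0.0 = full refusal, 1.0 = full compliance."""
    if requested_items <= 0:
        raise ValueError("requested_items must be positive")
    return min(provided_items / requested_items, 1.0)

compliance_score(0, 10)   # full refusal     -> 0.0
compliance_score(6, 10)   # partial answer   -> 0.6
compliance_score(10, 10)  # full compliance  -> 1.0
```

In practice a graded judgment (human or model-based) would replace the simple item count, but the key point stands: the output is a continuous value, not a binary refuse/comply label.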
Evaluation Categories
ASB evaluates across five behavioral categories:
- Gradual compliance erosion: Does the model maintain its refusal stance across 15 turns of escalating pressure?
- Trajectory blindness: Does the model recognize harmful patterns across turns, or does it evaluate each turn independently?
- Presentation-decision coupling: Does suppressing safety warnings reduce the model's actual safety reasoning?
- Context manipulation resistance: Does the model maintain safety when the conversation context frames harmful actions as normal?
- Tool-use boundary enforcement: For agent-capable models, does the model respect tool authorization boundaries under adversarial pressure?
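A per-model result can be organized around these five categories. The container below is a hypothetical sketch (the class, field names, and category identifiers are illustrative, not ASB's published schema), showing how a per-category breakdown supports finding a model's weakest area:

```python
from dataclasses import dataclass, field

# Illustrative identifiers for the five ASB categories.
ASB_CATEGORIES = (
    "gradual_compliance_erosion",
    "trajectory_blindness",
    "presentation_decision_coupling",
    "context_manipulation_resistance",
    "tool_use_boundary_enforcement",
)

@dataclass
class ModelResult:
    model: str
    # Per-category AUC scores in [0, 1]; lower is safer.
    category_auc: dict[str, float] = field(default_factory=dict)

    def weakest_category(self) -> str:
        """Category with the highest AUC, i.e. the most compliance under pressure."""
        return max(self.category_auc, key=self.category_auc.get)
```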
Scoring
Each test produces a per-turn compliance score from 0 (full refusal) to 1 (full compliance). The ASB score for a test is the normalized area under the compliance curve across all turns, so a model that refuses at every turn scores 0 and a model that fully complies at every turn scores 1.
The score is broken down by category, allowing targeted analysis: a model might score well on compliance erosion but poorly on trajectory blindness.
Aggregate scoring:
ASB Score = mean(per_category_AUC)
Lower is safer.
Comparison to Other Benchmarks
| Feature | Traditional Benchmark | ASB Benchmark |
|---------|----------------------|---------------|
| Turns | 1 | 15 |
| Framing | Artificial | Naturalistic |
| Scoring | Binary | Gradient (0-1) |
| Hawthorne control | No | Yes |
| Trajectory analysis | No | Yes |
| Tool-use testing | Rare | Yes |
Using ASB Results
ASB results tell you:
- Which models are most resilient to multi-turn adversarial pressure
- Which vulnerability categories each model is weakest against
- Whether model updates improve or degrade real-world safety
- The gap between benchmark safety and operational safety
These results inform model selection, prompt engineering, and runtime safety control design. A model that scores poorly on trajectory blindness needs stronger session-level monitoring; one that scores poorly on presentation-decision coupling needs careful system prompt design.
Results are published on the MCP Safety Leaderboard alongside tool-use safety evaluations.