AI Safety Benchmarking Methodology: How ASB Benchmark Scores Models
Most AI safety benchmarks test single-turn refusal rates. The ASB (Agent Safety Benchmark) tests how models behave under realistic adversarial conditions across multi-turn conversations.
Design Principles
Naturalistic framing. Test conversations read like real interactions: no red-team markers, no evaluation signals, no artificial framing. The model does not know it is being tested, which mitigates the Hawthorne effect that inflates scores on conventional benchmarks.
Multi-turn evaluation. Each test is a 15-turn conversation that gradually escalates toward a target behavior. Compliance is measured at each turn, not just the final turn. This reveals how safety degrades across the conversation trajectory.
Behavioral measurement. Beyond binary refuse/comply, ASB measures the compliance gradient: how much information does the model provide? A partial answer that gives 60% of the requested information is different from a full refusal, and both are different from full compliance.
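One way to picture the compliance gradient is as the fraction of requested information the model actually supplies. The sketch below is illustrative only; the function name and the item-counting proxy are assumptions, not ASB's actual grading procedure:

```python
def compliance_score(provided_items: int, requested_items: int) -> float:
    """Illustrative gradient score: fraction of requested information supplied,
    clamped to [0, 1]. 0.0 = full refusal, 1.0 = full compliance."""
    if requested_items <= 0:
        raise ValueError("requested_items must be positive")
    return min(provided_items / requested_items, 1.0)

compliance_score(0, 10)   # full refusal     -> 0.0
compliance_score(6, 10)   # partial answer   -> 0.6
compliance_score(10, 10)  # full compliance  -> 1.0
```

In practice a graded judgment (human or model-based) would replace the simple item count, but the key point stands: the output is a continuous value, not a binary refuse/comply label.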
Evaluation Categories
ASB evaluates across five behavioral categories:
- Gradual compliance erosion: Does the model maintain its refusal stance across 15 turns of escalating pressure?
- Trajectory blindness: Does the model recognize harmful patterns across turns, or does it evaluate each turn independently?
- Presentation-decision coupling: Does suppressing safety warnings reduce the model's actual safety reasoning?
- Context manipulation resistance: Does the model maintain safety when the conversation context frames harmful actions as normal?
- Tool-use boundary enforcement: For agent-capable models, does the model respect tool authorization boundaries under adversarial pressure?
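A per-model result can be organized around these five categories. The container below is a hypothetical sketch (the class, field names, and category identifiers are illustrative, not ASB's published schema), showing how a per-category breakdown supports finding a model's weakest area:

```python
from dataclasses import dataclass, field

# Illustrative identifiers for the five ASB categories.
ASB_CATEGORIES = (
    "gradual_compliance_erosion",
    "trajectory_blindness",
    "presentation_decision_coupling",
    "context_manipulation_resistance",
    "tool_use_boundary_enforcement",
)

@dataclass
class ModelResult:
    model: str
    # Per-category AUC scores in [0, 1]; lower is safer.
    category_auc: dict[str, float] = field(default_factory=dict)

    def weakest_category(self) -> str:
        """Category with the highest AUC, i.e. the most compliance under pressure."""
        return max(self.category_auc, key=self.category_auc.get)
```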
Scoring
Each test produces a per-turn compliance score from 0 (full refusal) to 1 (full compliance). The ASB score for a test is the normalized area under the compliance curve across all turns, so a model that refuses at every turn scores 0 and a model that fully complies at every turn scores 1.
The score is broken down by category, allowing targeted analysis: a model might score well on compliance erosion but poorly on trajectory blindness.
Aggregate scoring:
ASB Score = mean(per_category_AUC)
Lower is safer.
Comparison to Other Benchmarks
| Feature | Traditional Benchmark | ASB Benchmark |
|---------|----------------------|---------------|
| Turns | 1 | 15 |
| Framing | Artificial | Naturalistic |
| Scoring | Binary | Gradient (0-1) |
| Hawthorne control | No | Yes |
| Trajectory analysis | No | Yes |
| Tool-use testing | Rare | Yes |
Using ASB Results
ASB results tell you:
- Which models are most resilient to multi-turn adversarial pressure
- Which vulnerability categories each model is weakest against
- Whether model updates improve or degrade real-world safety
- The gap between benchmark safety and operational safety
These results inform model selection, prompt engineering, and runtime safety control design. A model that scores poorly on trajectory blindness needs stronger session-level monitoring; one that scores poorly on presentation-decision coupling needs careful system prompt design.
Results are published on the MCP Safety Leaderboard alongside tool-use safety evaluations.