Why we built the Fifteen Standard
Existing agent benchmarks test whether agents can complete tasks. They do not test whether agents complete tasks safely. That gap is what the Fifteen Standard addresses.
When we started evaluating tool-using agents across different frameworks, we kept running into the same problem. An agent could score well on task completion while exhibiting behaviors that would be unacceptable in deployment: executing actions without confirmation, escalating privileges silently, retrying failed operations without backoff, and ignoring scope constraints. None of the existing benchmarks penalized any of this.
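Behaviors like these are detectable directly from an agent's execution trace. As a minimal sketch (the `ToolCall` record and its fields are illustrative assumptions, not part of the standard), here is how one of the listed failure modes, retrying a failed operation without backoff, could be flagged:

```python
from dataclasses import dataclass

# Hypothetical trace record; field names are illustrative, not from the standard.
@dataclass
class ToolCall:
    name: str
    timestamp: float   # seconds since run start
    succeeded: bool

def retried_without_backoff(calls: list[ToolCall], min_gap: float = 1.0) -> bool:
    """Flag a trace where a failed call is retried faster than min_gap seconds."""
    for prev, nxt in zip(calls, calls[1:]):
        if (not prev.succeeded and nxt.name == prev.name
                and nxt.timestamp - prev.timestamp < min_gap):
            return True  # immediate hammering after a failure
    return False
```

A trace that retries `fetch` 0.1 seconds after a failure would be flagged; one that waits and backs off would not.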
The Fifteen Standard introduces fifteen behavioral dimensions that we believe are necessary for safe agent deployment. They cover areas such as action verification, scope adherence, failure handling, privilege management, and audit trail generation. Each dimension is scored independently, so you can see exactly where an agent succeeds and where it falls short.
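Independent scoring means a report exposes every dimension side by side rather than folding them into one number. A minimal sketch of that idea (the dimension slugs and the 0.8 threshold are assumptions for illustration, not values from the standard):

```python
def report(scores: dict[str, float], threshold: float = 0.8) -> dict[str, str]:
    """Each dimension passes or fails on its own; a high score on one
    dimension never offsets a low score on another."""
    return {dim: "pass" if s >= threshold else "fail" for dim, s in scores.items()}

# Hypothetical slugs; the standard defines fifteen dimensions, three shown here.
scores = {
    "action_verification": 0.95,
    "scope_adherence": 0.60,   # the weak spot stays visible in the report
    "failure_handling": 0.88,
}
```

Here `report(scores)` would mark `scope_adherence` as a failure even though the other dimensions pass.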
We designed the standard to be framework-agnostic: it does not matter which framework you use or how your agent loop is implemented. The behavioral dimensions apply equally because they describe what an agent does, not how it is built.
The scoring methodology is deterministic and reproducible. Given the same agent configuration and the same test scenarios, you get the same scores. No LLM-as-judge variability. No subjective assessments. Every score traces back to observable behavior in the test logs.
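One way such a deterministic rule could look, as a sketch under assumed log fields (`checked` and `violation` are hypothetical, not the standard's schema): the score is a pure function of the recorded events, so rerunning it on the same log always yields the same number, and every lost point traces to a specific event.

```python
def score_dimension(events: list[dict]) -> float:
    """Pure function of the test log: the fraction of checked events
    with no recorded violation. Same log in, same score out."""
    checked = [e for e in events if e.get("checked")]
    if not checked:
        return 1.0  # nothing to check counts as clean
    return sum(1 for e in checked if not e.get("violation")) / len(checked)

log = [
    {"checked": True, "violation": False},
    {"checked": True, "violation": True},   # this event explains the lost points
    {"checked": False},                     # not relevant to this dimension
]
score_dimension(log)  # 0.5, and 0.5 again on every rerun
```

There is no sampling and no model call in the loop, which is what makes the score auditable.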
We are publishing the standard, the test suite, and the scoring methodology as open-source code. Transparency is non-negotiable — if you cannot audit how a safety score is computed, the score is meaningless.