
Fifteen Standard v1.1: methodology refinements

methodology · fifteen-standard · update

After a month of running evaluations against the Fifteen Standard, we identified four areas where the scoring methodology needed refinement. Some of these came from our own observations during red-team experiments. Others surfaced through external review from framework maintainers and safety researchers.

Ecological validity. Our original test scenarios were synthetic by design — reproducible but contrived. We are adding a second scenario tier based on real-world task descriptions collected from production deployments (anonymized). The synthetic scenarios remain for reproducibility; the real-world scenarios add ecological validity. Our adversarial research consistently shows that naturalistic framing produces different agent behavior than synthetic setups do.

Granularity beyond binary scoring. We are keeping binary scoring because it eliminates subjectivity and makes scores comparable across runs. But we are adding a detail view that shows how close a failed scenario was to passing. An agent that called one undeclared tool out of 20 operations gets the same score as an agent that ignored all scope constraints, but the detail view shows the difference. This matters for tracking improvement across framework versions.
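As a rough illustration of how a detail view can sit alongside a binary score, here is a minimal sketch. The field names and the `detail_view` helper are hypothetical, not the Standard's actual scoring code; the point is only that the headline score stays binary while the proximity readout distinguishes a near miss from a total miss.

```python
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    passed: bool      # the binary score: the only value in the headline number
    violations: int   # operations that broke a scope constraint
    operations: int   # total operations the agent performed

def detail_view(result: ScenarioResult) -> str:
    """Binary verdict, plus a proximity readout for failed scenarios."""
    if result.passed:
        return "PASS"
    rate = result.violations / result.operations
    return f"FAIL ({result.violations}/{result.operations} operations out of scope, {rate:.0%})"

# Both agents score 0 on the binary scale, but the detail view separates them:
near_miss = ScenarioResult(passed=False, violations=1, operations=20)
total_miss = ScenarioResult(passed=False, violations=20, operations=20)
```

Here `detail_view(near_miss)` and `detail_view(total_miss)` both begin with `FAIL`, so cross-run comparability is preserved; only the supplementary readout differs.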

Dimension weighting. The default view stays unweighted, but we are adding configurable weights to the leaderboard. Our adversarial research shows that different deployment contexts have different risk profiles — email marketing is far more vulnerable than finance under escalation attacks. Weighting lets you sort by your own risk priorities.
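A configurable-weight sort can be sketched in a few lines. The dimension names below are illustrative placeholders, not the Standard's actual taxonomy, and `weighted_score` is a hypothetical helper: unspecified dimensions default to weight 1.0, so an empty weight map reproduces the unweighted view.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores; missing weights default to 1.0."""
    total = sum(weights.get(d, 1.0) for d in scores)
    return sum(s * weights.get(d, 1.0) for d, s in scores.items()) / total

agent = {"scope": 0.9, "escalation": 0.4, "disclosure": 0.8}

# A deployment worried about escalation attacks up-weights that dimension,
# pulling this agent's rank down relative to the unweighted view:
risk_profile = {"escalation": 3.0}
```

With the example numbers, the unweighted mean is 0.7 while the escalation-weighted score drops to 0.58, which is exactly the re-ranking effect weighting is meant to expose.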

Scenario authoring. We are building a validation tool for custom scenarios that ensures they produce deterministic results against the scoring methodology. This is primarily for our own use as we expand the test battery, but the tool will be available alongside the scoring code.
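The core of such a validation tool is a determinism check. A minimal sketch, assuming a custom scenario exposes some scoring callable (`score_fn` below is a stand-in name, not the tool's real interface): run the scorer repeatedly on the same transcript and require identical results.

```python
def validate_deterministic(score_fn, transcript, runs: int = 3) -> bool:
    """Reject a custom scenario whose scorer gives different results
    for repeated runs over the same transcript."""
    results = [score_fn(transcript) for _ in range(runs)]
    return all(r == results[0] for r in results)

# A scorer that depends only on the transcript passes:
stable = lambda transcript: len(transcript) > 2

# A scorer carrying hidden mutable state fails the check:
calls = {"n": 0}
def flaky(transcript):
    calls["n"] += 1
    return calls["n"]
```

Repeated-run comparison catches hidden state and accidental randomness, though not nondeterminism that only appears under different environments; a fuller tool would likely also pin the transcript fixtures themselves.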