How We Evaluate
Transparent, reproducible evaluation methodology for autonomous AI agents.
Evaluation Framework
The Fifteen Standard evaluates AI agents across eight dimensions: correctness, tool use, constraint adherence, intent fidelity, safety margins, failure handling, resource efficiency, and auditability. Each dimension is scored on a calibrated rubric, and the dimension scores are combined into a composite 100-point score.
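To make the aggregation concrete, the sketch below shows how per-dimension rubric scores could roll up into the 100-point composite. It assumes equal weights and a 0-5 rubric scale per dimension; the actual weights and scale used by the standard are not specified here.

# Minimal sketch of composite scoring, assuming equal dimension weights
# and a 0-5 rubric scale (both assumptions, not the published scheme).

DIMENSIONS = [
    "correctness", "tool_use", "constraint_adherence", "intent_fidelity",
    "safety_margins", "failure_handling", "resource_efficiency", "auditability",
]

def composite_score(rubric_scores: dict[str, float], max_per_dimension: float = 5.0) -> float:
    """Combine per-dimension rubric scores into a single 100-point composite."""
    if set(rubric_scores) != set(DIMENSIONS):
        missing = set(DIMENSIONS) - set(rubric_scores)
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(rubric_scores[d] for d in DIMENSIONS)
    return round(100.0 * total / (len(DIMENSIONS) * max_per_dimension), 2)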
Task Design
Tasks are designed to probe specific capabilities and failure modes. Each task includes a natural-language specification, a constrained action space, success and failure criteria, and safety boundaries. Tasks are versioned and immutable once published.
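As an illustration of what a published task record might contain, the sketch below models the listed components as an immutable structure. The field names and the frozen-dataclass representation are assumptions for illustration, not the published schema.

# Illustrative task record; a frozen dataclass reflects that published
# tasks are immutable. Field names are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    task_id: str                        # stable identifier
    version: str                        # tasks are versioned once published
    specification: str                  # natural-language task description
    allowed_actions: tuple[str, ...]    # constrained action space
    success_criteria: tuple[str, ...]   # conditions that count as success
    failure_criteria: tuple[str, ...]   # conditions that count as failure
    safety_boundaries: tuple[str, ...]  # actions or states the agent must avoid

Freezing the record and storing criteria as tuples mirrors the requirement that a task cannot change once published; any revision becomes a new version.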
Scoring Protocol
Scoring combines automated metrics with structured human review. Automated metrics capture objective performance (accuracy, latency, cost). Human reviewers assess qualitative dimensions (intent alignment, safety reasoning, explanation quality) using calibrated rubrics.
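A hedged sketch of how the two signal types might be blended is shown below. It assumes normalized metrics in [0, 1], one rubric-score dictionary per reviewer, and an even split between automated and human components; the real weighting belongs to the calibrated protocol and may differ.

# Sketch of combining automated metrics with reviewer rubric scores.
# The 50/50 weighting and input shapes are assumptions.

def blended_score(automated: dict[str, float],
                  reviews: list[dict[str, float]],
                  auto_weight: float = 0.5) -> float:
    """automated: normalized objective metrics in [0, 1] (accuracy, latency, cost).
    reviews: one rubric-score dict per reviewer, each value in [0, 1]."""
    auto = sum(automated.values()) / len(automated)
    # Average each reviewer's rubric scores, then average across reviewers.
    human = sum(sum(r.values()) / len(r) for r in reviews) / len(reviews)
    return auto_weight * auto + (1 - auto_weight) * human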
Reproducibility
Every evaluation run produces a complete artifact bundle: task specifications, agent responses, scoring rubrics, reviewer annotations, and aggregate statistics. Artifact bundles are published with cryptographic hashes for verification.
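The verification step can be sketched with standard-library hashing: build a manifest of per-file SHA-256 digests, then hash the canonicalized manifest to get a single bundle digest. The manifest layout and function names below are illustrative, not the published bundle format.

# Minimal sketch of hashing an artifact bundle for verification.

import hashlib
import json
from pathlib import Path

def bundle_manifest(bundle_dir: str) -> dict[str, str]:
    """Map each artifact file in the bundle to its SHA-256 digest."""
    manifest = {}
    for path in sorted(Path(bundle_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(bundle_dir))] = digest
    return manifest

def bundle_hash(manifest: dict[str, str]) -> str:
    """Hash the canonicalized manifest so the whole bundle verifies with one digest."""
    canonical = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()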
Frontier Methods
For capabilities without established benchmarks, we develop new evaluation protocols through an iterative process: hypothesis formation, task design, pilot evaluation, calibration, and publication. Methods are versioned separately from the core framework.
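One way to keep the two version streams distinct is to record both on each method, as in the hypothetical sketch below; the stage names follow the process described above, while the field names are assumptions.

# Illustrative status record for a frontier evaluation method.

from dataclasses import dataclass
from enum import Enum

class MethodStage(Enum):
    HYPOTHESIS = "hypothesis_formation"
    TASK_DESIGN = "task_design"
    PILOT = "pilot_evaluation"
    CALIBRATION = "calibration"
    PUBLISHED = "publication"

@dataclass
class FrontierMethod:
    name: str
    method_version: str     # versioned separately from the core framework
    framework_version: str  # core framework version it was run against
    stage: MethodStage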
Red Team Protocol
Red team evaluations follow a structured disclosure process: discovery, documentation, notification to developers (with a 90-day remediation window), and public disclosure. All findings include evidence chains linking to specific run artifacts.
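For illustration, a finding record might carry the disclosure timeline and evidence chain explicitly, as in this sketch; the field names are hypothetical, and the 90-day window is taken from the protocol above.

# Sketch of a red-team finding record mirroring the disclosure steps.

from dataclasses import dataclass, field
from datetime import date, timedelta

REMEDIATION_WINDOW = timedelta(days=90)

@dataclass
class Finding:
    finding_id: str
    discovered: date
    developer_notified: date
    evidence_run_ids: list[str] = field(default_factory=list)  # links to specific run artifacts
    remediated: bool = False

    def public_disclosure_date(self) -> date:
        """Earliest public disclosure: 90 days after developer notification."""
        return self.developer_notified + REMEDIATION_WINDOW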