How We Evaluate
Transparent, reproducible evaluation methodology for autonomous AI agents.
Evaluation Framework
The Fifteen Standard evaluates AI agents across eight dimensions: correctness, tool use, constraint adherence, intent fidelity, safety margins, failure handling, resource efficiency, and auditability. Each dimension is scored on a calibrated rubric, and the dimension scores are combined into a composite 100-point score.
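To make the aggregation concrete, the sketch below shows how per-dimension rubric scores could roll up into the 100-point composite. It assumes equal weights and a 0-5 rubric scale per dimension; the actual weights and scale used by the standard are not specified here.

# Minimal sketch of composite scoring, assuming equal dimension weights
# and a 0-5 rubric scale (both assumptions, not the published scheme).

DIMENSIONS = [
    "correctness", "tool_use", "constraint_adherence", "intent_fidelity",
    "safety_margins", "failure_handling", "resource_efficiency", "auditability",
]

def composite_score(rubric_scores: dict[str, float], max_per_dimension: float = 5.0) -> float:
    """Combine per-dimension rubric scores into a single 100-point composite."""
    if set(rubric_scores) != set(DIMENSIONS):
        missing = set(DIMENSIONS) - set(rubric_scores)
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(rubric_scores[d] for d in DIMENSIONS)
    return round(100.0 * total / (len(DIMENSIONS) * max_per_dimension), 2)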
Task Design
Tasks are designed to probe specific capabilities and failure modes. Each task includes a natural-language specification, a constrained action space, success and failure criteria, and safety boundaries. Tasks are versioned and immutable once published.
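As an illustration of what a published task record might contain, the sketch below models the listed components as an immutable structure. The field names and the frozen-dataclass representation are assumptions for illustration, not the published schema.

# Illustrative task record; a frozen dataclass reflects that published
# tasks are immutable. Field names are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    task_id: str                        # stable identifier
    version: str                        # tasks are versioned once published
    specification: str                  # natural-language task description
    allowed_actions: tuple[str, ...]    # constrained action space
    success_criteria: tuple[str, ...]   # conditions that count as success
    failure_criteria: tuple[str, ...]   # conditions that count as failure
    safety_boundaries: tuple[str, ...]  # actions or states the agent must avoid

Freezing the record and storing criteria as tuples mirrors the requirement that a task cannot change once published; any revision becomes a new version.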
Scoring Protocol
Scoring combines automated metrics with structured human review. Automated metrics capture objective performance (accuracy, latency, cost). Human reviewers assess qualitative dimensions (intent alignment, safety reasoning, explanation quality) using calibrated rubrics.
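A hedged sketch of how the two signal types might be blended is shown below. It assumes normalized metrics in [0, 1], one rubric-score dictionary per reviewer, and an even split between automated and human components; the real weighting belongs to the calibrated protocol and may differ.

# Sketch of combining automated metrics with reviewer rubric scores.
# The 50/50 weighting and input shapes are assumptions.

def blended_score(automated: dict[str, float],
                  reviews: list[dict[str, float]],
                  auto_weight: float = 0.5) -> float:
    """automated: normalized objective metrics in [0, 1] (accuracy, latency, cost).
    reviews: one rubric-score dict per reviewer, each value in [0, 1]."""
    auto = sum(automated.values()) / len(automated)
    # Average each reviewer's rubric scores, then average across reviewers.
    human = sum(sum(r.values()) / len(r) for r in reviews) / len(reviews)
    return auto_weight * auto + (1 - auto_weight) * human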
Reproducibility
Every evaluation run produces a complete artifact bundle: task specifications, agent responses, scoring rubrics, reviewer annotations, and aggregate statistics. Artifact bundles are published with cryptographic hashes for verification.
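The verification step can be sketched with standard-library hashing: build a manifest of per-file SHA-256 digests, then hash the canonicalized manifest to get a single bundle digest. The manifest layout and function names below are illustrative, not the published bundle format.

# Minimal sketch of hashing an artifact bundle for verification.

import hashlib
import json
from pathlib import Path

def bundle_manifest(bundle_dir: str) -> dict[str, str]:
    """Map each artifact file in the bundle to its SHA-256 digest."""
    manifest = {}
    for path in sorted(Path(bundle_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(bundle_dir))] = digest
    return manifest

def bundle_hash(manifest: dict[str, str]) -> str:
    """Hash the canonicalized manifest so the whole bundle verifies with one digest."""
    canonical = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()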
Frontier Methods
For capabilities without established benchmarks, we develop new evaluation protocols through an iterative process: hypothesis formation, task design, pilot evaluation, calibration, and publication. Methods are versioned separately from the core framework.
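One way to keep the two version streams distinct is to record both on each method, as in the hypothetical sketch below; the stage names follow the process described above, while the field names are assumptions.

# Illustrative status record for a frontier evaluation method.

from dataclasses import dataclass
from enum import Enum

class MethodStage(Enum):
    HYPOTHESIS = "hypothesis_formation"
    TASK_DESIGN = "task_design"
    PILOT = "pilot_evaluation"
    CALIBRATION = "calibration"
    PUBLISHED = "publication"

@dataclass
class FrontierMethod:
    name: str
    method_version: str     # versioned separately from the core framework
    framework_version: str  # core framework version it was run against
    stage: MethodStage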
Red Team Protocol
Red team evaluations follow a structured disclosure process: discovery, documentation, notification to developers (with a 90-day remediation window), and public disclosure. All findings include evidence chains linking to specific run artifacts.
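For illustration, a finding record might carry the disclosure timeline and evidence chain explicitly, as in this sketch; the field names are hypothetical, and the 90-day window is taken from the protocol above.

# Sketch of a red-team finding record mirroring the disclosure steps.

from dataclasses import dataclass, field
from datetime import date, timedelta

REMEDIATION_WINDOW = timedelta(days=90)

@dataclass
class Finding:
    finding_id: str
    discovered: date
    developer_notified: date
    evidence_run_ids: list[str] = field(default_factory=list)  # links to specific run artifacts
    remediated: bool = False

    def public_disclosure_date(self) -> date:
        """Earliest public disclosure: 90 days after developer notification."""
        return self.developer_notified + REMEDIATION_WINDOW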