15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

Exactly-once is harder than it sounds

December 15, 2025

technicalexecution-semanticsresearch

Exactly-once execution semantics sounds like a solved problem. Distributed systems literature has covered this for decades. But agent frameworks are not distributed systems in the traditional sense, and the standard solutions do not map cleanly.

We tested seven agent frameworks on a simple question: if you tell an agent to send an email, how many emails get sent? The expected answer is one. The observed answers ranged from one to four, depending on the framework, the retry configuration, and whether the underlying model produced a partial response that triggered a re-execution.

The core issue is that most frameworks treat tool calls as stateless function invocations. Call a function, get a result, move on. But tool calls in the real world have side effects. An HTTP POST is not the same as an HTTP GET. Retrying a payment is not the same as retrying a search query. Frameworks that treat all tool calls identically will inevitably produce duplicate executions for stateful operations.

We cataloged three distinct failure patterns:

Retry-induced duplication. The framework receives a timeout or partial response from the model and retries the entire turn, including tool calls that already executed successfully.

Parallel branch duplication. The framework spawns multiple reasoning paths and each path independently decides to execute the same action.

Checkpoint rollback duplication. The framework rolls back to a previous checkpoint after an error and re-executes tool calls that already completed.

None of the frameworks we tested had built-in idempotency tracking. None of them maintained a log of which tool calls had already executed. The assumption across the board is that tool calls are safe to repeat.

For the Fifteen Standard, we added execution uniqueness as a scored dimension. An agent that produces duplicate side effects on more than 5% of test scenarios fails the dimension.