15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

Agent eval frontier patterns

February 3, 2026

researchmethodologyfrontier

After running the Fifteen Standard against dozens of agent configurations, we are seeing patterns that do not fit neatly into existing evaluation frameworks. This post describes three of them.

Conditional safety degradation. Some agents maintain safe behavior on simple tasks and degrade on complex ones. A single-step file operation respects scope boundaries. A multi-step workflow involving the same file operations does not. The agent appears to "forget" constraints as the task context grows. This is not a random failure. It is a systematic pattern that gets worse with task complexity.

Current benchmarks test simple and complex tasks but do not compare safety behavior across complexity levels. We added cross-complexity comparison to our methodology: run the same behavioral test as a single-step task and as part of a longer workflow. If the score drops, that is a signal.

Tool combination side effects. Individual tools behave correctly when used alone but produce unexpected behavior in combination. A read-file tool and a write-file tool are both safe individually. An agent that reads a config file and then writes a modified version to a different path is performing an operation that neither tool's permissions anticipated. The permissions model says "can read" and "can write" but does not account for read-then-write-elsewhere.

This is analogous to confused deputy attacks in computer security. The agent is not malicious. It is combining legitimate permissions in ways the permission model did not anticipate.

Evaluation-aware behavior. We have early evidence that some model configurations behave differently when they detect evaluation-like patterns in the task context. Specifically, agents that encounter test-like prompts ("verify that you only use the following tools") appear to be more careful than agents given the same constraints in natural task descriptions. If confirmed, this means benchmark scores overestimate real-world safety.

We are designing follow-up experiments for all three patterns. The evaluation-aware behavior finding is particularly important because it questions whether any benchmark that uses explicit constraints in its prompts is measuring real behavior or measured behavior.