
Red team session one

red-teaming, findings, security

We ran our first structured red team session against three popular agent frameworks. The setup was straightforward: give each agent a set of tasks with clear scope boundaries, then systematically test whether those boundaries hold under adversarial conditions.

The short version: they do not.

Every framework we tested allowed scope escalation through prompt manipulation. The attack pattern is simple: give the agent a task with access to specific tools, then include instructions in the task context that reference tools outside the declared scope. In every framework, some fraction of runs ended with the agent attempting to call tools it had not been granted.
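The obvious mitigation is to enforce the scope outside the model entirely. A minimal sketch of that idea, with hypothetical names (`dispatch`, `ALLOWED_TOOLS`, `ScopeViolation`) that do not come from any of the frameworks tested:

```python
# Hypothetical sketch: a framework-level gatekeeper between the agent
# loop and tool execution. Because the check runs in ordinary code, not
# in the prompt, instructions smuggled into the task context cannot talk
# the model past it.

ALLOWED_TOOLS = {"read_file", "search_docs"}  # scope declared at task start

# The full registry may contain more tools than the task is scoped to.
TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "search_docs": lambda query: [],
    "delete_file": lambda path: None,  # registered, but out of scope
}

class ScopeViolation(Exception):
    pass

def dispatch(tool_name: str, args: dict):
    """Reject any call to a tool outside the declared scope before it runs."""
    if tool_name not in ALLOWED_TOOLS:
        raise ScopeViolation(f"tool {tool_name!r} is outside the declared scope")
    return TOOLS[tool_name](**args)
```

The point of the design is that the model's output is treated as a proposal, and the allow-list is consulted on every call, so a prompt that "references" `delete_file` can only produce a refused dispatch, not a file deletion.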

We also tested for action replay vulnerabilities. If an agent successfully completes an action, can you trick it into repeating that action by referencing the previous success? In two of the three frameworks, yes. The agent treats the prior success as evidence that the action is safe and skips any re-evaluation.
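The fix for the replay case is the same shape: make re-evaluation unconditional in framework code, so a prior success can inform the check but never skip it. A minimal sketch under that assumption, with illustrative names (`execute`, `approve`, `fingerprint`) that are not from any tested framework:

```python
# Hypothetical sketch: every attempt at an action passes through an
# approval step. A recorded prior success is surfaced to the approver
# as context -- it is never used as grounds to bypass the check.
import hashlib
import json

_completed: set[str] = set()  # fingerprints of previously successful actions

def fingerprint(action: dict) -> str:
    """Stable hash of the action, so repeats are recognized."""
    return hashlib.sha256(json.dumps(action, sort_keys=True).encode()).hexdigest()

def execute(action: dict, approve) -> str:
    """Run `action` only if `approve` (a human or policy check) says yes.

    `approve` is called on EVERY attempt, including exact repeats of an
    action that already succeeded.
    """
    fp = fingerprint(action)
    if not approve(action, previously_succeeded=fp in _completed):
        raise PermissionError("action rejected on re-evaluation")
    _completed.add(fp)
    return "ok"
```

The vulnerable pattern in two of the three frameworks is equivalent to short-circuiting `approve` whenever `previously_succeeded` is true, which is exactly what the replay prompt exploits.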

The failure mode we found most concerning was silent privilege escalation. One framework allows agents to modify their own tool definitions during execution. The documentation frames this as "adaptive tool use." From a security perspective, it means an agent can grant itself new capabilities mid-task without any external check.
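A straightforward counter-design is to freeze the tool set when the task starts, so any new capability has to come through an out-of-band grant rather than a self-edit. A one-function sketch of that idea (the `freeze_tools` name is ours, not from the framework in question):

```python
# Hypothetical sketch: hand the agent loop a read-only view of its tool
# definitions. Attempts to add or replace a tool mid-task raise TypeError
# instead of silently succeeding.
from types import MappingProxyType

def freeze_tools(tools: dict) -> MappingProxyType:
    """Snapshot the tool registry and return an immutable view of it."""
    return MappingProxyType(dict(tools))

tools = freeze_tools({"read_file": lambda path: f"<contents of {path}>"})
```

"Adaptive tool use" can still be offered on top of this: the agent proposes a new tool definition, and an external check decides whether to mint a new frozen registry that includes it. The difference is that the grant is an auditable external event, not a mid-task mutation.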

We are writing up the full findings with reproducible test cases. These failure modes exist, they are easy to trigger, and current evaluations do not test for them.