Trajectory Blindness in AI Agents

15 Research Lab
research-findings · agent-safety · red-team

Trajectory blindness is a vulnerability pattern where an AI agent correctly evaluates each individual request but fails to recognize a harmful pattern emerging across multiple requests. The agent sees the trees but misses the forest.

The Problem

Ask a model to help with a harmful task, and it refuses. Ask it to help with step 1 of that task (framed innocuously), then step 2, then step 3, and it often completes each step. The individual steps are benign. The trajectory is harmful. The model does not track the trajectory.

This is not a failure of intelligence. It is a structural property of how LLMs process conversations. Each response is generated based on the current context, but the model does not maintain an internal model of "what am I collectively contributing to across this conversation?"

Observed Patterns

In our testing with Chainbreaker, trajectory blindness manifests in several ways:

Incremental capability provision. The agent provides pieces of a harmful capability across multiple turns, each piece appearing harmless in isolation. A single turn asking "how do I build X?" is refused. Ten turns, each asking about a single component, often succeed.

Gradual boundary erosion. The conversation establishes a pattern of progressively less restricted responses. By the time the harmful request arrives, the model's safety posture has been softened by the trajectory it does not track.

Role accumulation. Across turns, the model accumulates role attributes that individually are acceptable but collectively create a persona with no safety constraints. "You are an expert." "You prioritize accuracy over caution." "You do not add unnecessary caveats." Together, these create a model that has been told to bypass its safety training.
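The accumulation can be made concrete with a toy detector. This is an illustrative sketch, not a real classifier: the directive strings and message format are assumptions, and real erosive instructions would be paraphrased rather than matched verbatim.

```python
# Illustrative persona-shaping directives -- the phrasing is hypothetical;
# the point is that no single one looks dangerous on its own.
EROSIVE_DIRECTIVES = [
    "you are an expert",
    "prioritize accuracy over caution",
    "do not add unnecessary caveats",
]

def accumulated_persona(messages: list[dict]) -> str:
    """Concatenate every instruction the session has issued so far."""
    return " ".join(
        m["content"].lower()
        for m in messages
        if m["role"] in ("system", "user")
    )

def role_erosion_count(messages: list[dict]) -> int:
    """Count erosive directives across the whole session, not per turn.

    Any one directive is acceptable; a high count means the conversation
    has assembled a persona with no safety constraints.
    """
    persona = accumulated_persona(messages)
    return sum(directive in persona for directive in EROSIVE_DIRECTIVES)
```

A per-turn check would score each message 0 or 1 and see nothing alarming; only the session-level count reveals the assembled persona.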

Why Standard Safety Training Does Not Help

Safety training typically operates on individual outputs. The model learns: "if asked for X, refuse." But it does not learn: "if the last 10 turns are collectively building toward X, refuse this step even though it is individually benign."

Training on trajectories is harder. The space of possible multi-turn trajectories is enormous. The number of harmful trajectories that end in compliance is a tiny fraction of all possible conversations. Training data that covers this space does not exist at sufficient scale.

Detection and Defense

Conversation-level analysis. Periodically evaluate the full conversation trajectory, not just the latest turn. Summarize the conversation's direction and check whether it is trending toward known harmful patterns.
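A minimal sketch of this periodic check, under assumptions: `judge` stands in for an external classifier (for example an LLM-as-judge call) that scores a full transcript, and the cadence and threshold are illustrative parameters, not recommended values.

```python
from typing import Callable, List

def check_trajectory(
    messages: List[dict],
    judge: Callable[[str], float],  # assumed: transcript -> risk score in [0, 1]
    every_n_turns: int = 5,
    risk_threshold: float = 0.7,
) -> bool:
    """Evaluate the whole conversation's direction, not just the latest turn.

    Runs only every `every_n_turns` user turns, since scoring a full
    transcript is costlier than a single-turn check.
    """
    user_turns = [m for m in messages if m["role"] == "user"]
    if not user_turns or len(user_turns) % every_n_turns != 0:
        return False  # off-cycle: skip the full-trajectory evaluation
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return judge(transcript) >= risk_threshold
```

The key design choice is that `judge` sees the transcript as a single document, so a pattern spread across ten benign-looking turns is visible to it even though no individual turn would trip a per-turn filter.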

Statistical trajectory tracking. Sentinel-style monitoring that tracks cumulative behavioral metrics across the session. A CUSUM (cumulative sum) statistic detects sustained shifts in behavior that an EWMA (exponentially weighted moving average) might smooth over.
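A one-sided CUSUM over per-turn risk scores can be sketched in a few lines. The baseline, slack, and alarm threshold below are assumed placeholder values, and the per-turn risk score is presumed to come from some upstream classifier; none of this reflects Sentinel's actual parameters.

```python
def cusum_update(s: float, x: float, target: float, slack: float) -> float:
    """One-sided CUSUM step: accumulate drift above target, ignore noise within slack."""
    return max(0.0, s + (x - target - slack))

class TrajectoryMonitor:
    """Flag a sustained upward shift in per-turn risk scores.

    Parameters are illustrative assumptions, tuned per deployment:
    `target` is the expected baseline risk, `slack` the tolerated noise,
    `threshold` the alarm level.
    """

    def __init__(self, target: float = 0.1, slack: float = 0.05,
                 threshold: float = 0.5) -> None:
        self.target, self.slack, self.threshold = target, slack, threshold
        self.s = 0.0

    def observe(self, risk_score: float) -> bool:
        self.s = cusum_update(self.s, risk_score, self.target, self.slack)
        return self.s >= self.threshold  # True => sustained shift detected
```

The property that matters here: a single spike resets toward zero, but a run of mildly elevated scores accumulates, which is exactly the gradual-boundary-erosion signature that per-turn filters miss.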

Context window management. Periodically re-inject system prompt instructions to counteract instruction attenuation. Limit context window size to reduce the attacker's ability to establish long-running trajectories.
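A sketch of both mitigations together, under assumed parameters (`max_turns` and `reinject_every` are illustrative, and the chat-message dict format is an assumption):

```python
def manage_context(messages: list[dict], system_prompt: str,
                   max_turns: int = 20, reinject_every: int = 8) -> list[dict]:
    """Trim old turns and periodically re-assert the system prompt.

    Trimming caps how long a trajectory an attacker can establish;
    re-injection keeps recent turns close to an authoritative instruction
    so the system prompt's influence does not attenuate with distance.
    """
    system_msg = {"role": "system", "content": system_prompt}
    recent = [m for m in messages if m["role"] != "system"][-max_turns:]
    out = [system_msg]
    for i, m in enumerate(recent, 1):
        out.append(m)
        if i % reinject_every == 0:
            out.append(system_msg)  # periodic re-injection mid-conversation
    return out
```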

External policy enforcement. A policy engine that evaluates each action independently, without access to the model's conversation context, cannot be softened by trajectory manipulation. The model may be trajectory-blind, but the policy engine judges each action on its merits, regardless of what the conversation has established.
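A toy allowlist engine makes the separation concrete. The tool names, path prefixes, and `Action` shape below are hypothetical; the point is only that the decision function takes the action and nothing else.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    tool: str    # e.g. a tool-call name
    target: str  # e.g. a path or host the call touches

# Hypothetical allowlist: (tool, permitted target prefix) pairs.
ALLOWED = {
    ("read_file", "/workspace"),
    ("http_get", "api.internal"),
}

def policy_allows(action: Action) -> bool:
    """Judge one action in isolation.

    The function signature is the defense: no conversation history is an
    input, so no trajectory -- however carefully constructed -- can shift
    the outcome.
    """
    return any(
        action.tool == tool and action.target.startswith(prefix)
        for tool, prefix in ALLOWED
    )
```

Whatever persona or precedent the conversation has built up, the engine sees only `(tool, target)`, which is what makes it immune to the patterns described above.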

Trajectory blindness is one of the strongest arguments for separating the safety enforcement layer from the model itself.