
Gradual Compliance Erosion in LLMs: The 71% Finding

Research Lab

research · findings · llm-safety · red-team

In our evaluation of frontier language models, we found that safety training is significantly less effective against gradual escalation than against direct requests. Models that refuse harmful instructions when asked directly will comply with the same instructions 71% of the time when they arrive at the end of a carefully constructed multi-turn sequence.

The Experiment

We tested six frontier models using 15-turn conversation scripts. Each script starts with an innocuous request and escalates gradually toward a target behavior that the model would refuse if asked directly.

The scripts use naturalistic framing: they read like real conversations, not like red team exercises. The model is not told it is being tested. There are no artificial markers. The escalation feels organic.

At turn 15, the conversation reaches the target request. We compared the model's response at turn 15 to its response when the same request is presented in isolation.
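The two-condition comparison can be sketched as a small harness. This is a minimal sketch, not the authors' actual evaluation code: `model` is any callable that maps a message list to a reply, and `refused` is a stand-in refusal classifier, both assumptions.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    target_request: str     # the request the model refuses when asked directly
    escalation_turns: list  # the innocuous-to-borderline turns preceding it

def run_trial(model, trial, refused):
    """Compare refusal on the direct ask vs. the same ask after escalation."""
    # Condition A: the target request in isolation.
    direct = model([{"role": "user", "content": trial.target_request}])
    # Condition B: the same request delivered after the scripted escalation,
    # with the model's own replies folded back into the history.
    history = []
    for turn in trial.escalation_turns:
        history.append({"role": "user", "content": turn})
        history.append({"role": "assistant", "content": model(history)})
    history.append({"role": "user", "content": trial.target_request})
    escalated = model(history)
    return {"direct_refused": refused(direct),
            "escalated_refused": refused(escalated)}
```

Aggregating `direct_refused` and `escalated_refused` across trials yields the two headline rates.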

Results

Direct request refusal rate: 94% of harmful requests were refused when presented directly.

Gradual escalation compliance rate: 71% of the same requests were complied with when delivered at the end of a 15-turn escalation.

Refusal under escalation is 100 - 71 = 29%, so the gap between direct-request safety and escalation-resilience safety is 94 - 29 = 65 percentage points. This is not a small delta. It represents a fundamental limitation in how current safety training works.

Why This Happens

Three mechanisms contribute:

Context anchoring. By turn 15, the conversation has established a context in which the request feels reasonable. The model's safety evaluation is anchored by the preceding turns rather than evaluating the request in isolation.

Consistency bias. Models are trained to be consistent with their previous responses. Each turn of partial compliance makes the next step feel like a smaller increment. The model resists the first boundary-crossing but becomes progressively more likely to cross subsequent boundaries.

Instruction attenuation. System prompt instructions lose influence as the conversation grows. By turn 15, the original safety instructions are far back in the context window, competing for attention with 14 turns of increasingly permissive dialogue.
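The attenuation mechanism is easy to quantify with a back-of-the-envelope token count. This is a crude proxy (attention is not literally proportional to token share) with illustrative numbers I chose, but the dilution trend is the point:

```python
def system_prompt_share(system_tokens, tokens_per_turn, turns):
    """Fraction of the accumulated context occupied by the system prompt."""
    total = system_tokens + 2 * tokens_per_turn * turns  # user + assistant per turn
    return system_tokens / total

# A 200-token system prompt with ~150 tokens per message (assumed sizes):
for turns in (1, 5, 15):
    print(f"turn {turns:>2}: {system_prompt_share(200, 150, turns):.1%}")
# turn  1: 40.0%
# turn  5: 11.8%
# turn 15: 4.3%
```

By turn 15 the safety instructions occupy a sliver of the context, while the escalating dialogue occupies the rest.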

Implications

This finding has direct implications for agent safety:

Benchmark gaps. Standard safety benchmarks test single-turn refusal rates. A model with a 94% refusal rate on benchmarks may have only a 29% refusal rate under real adversarial conditions. The benchmark number is misleading.

Multi-turn monitoring is essential. Single-turn input scanning does not catch gradual escalation. You need trajectory-level analysis that evaluates the conversation as a whole, not turn by turn.
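One way to make that concrete is a trajectory score that combines the recent risk level with its upward drift. This is a sketch under assumptions: the per-turn scores would come from some content classifier, and the 0.5 threshold is arbitrary; neither is from the study.

```python
def trajectory_risk(turn_scores, window=5):
    """Score the conversation as a trajectory, not turn by turn.

    Combines the recent average risk with its upward drift, so a slow
    escalation is visible even when every single turn looks benign.
    """
    recent = turn_scores[-window:]
    level = sum(recent) / len(recent)
    drift = max(recent[-1] - recent[0], 0.0)  # how fast risk is climbing
    return level + drift

# Every turn scores below a 0.5 per-turn threshold, so single-turn
# scanning passes all of them -- but the trajectory signal trips it.
escalating = [0.05, 0.10, 0.18, 0.28, 0.45]
print(trajectory_risk(escalating) > 0.5)  # True
```

A flat, benign conversation with the same per-turn ceiling stays well under the threshold, which is exactly the separation single-turn scanning cannot provide.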

Runtime controls matter more than training. Model safety training is the first layer, but it degrades under multi-turn pressure. Policy engines that evaluate each action independently (without sharing the model's conversation context) maintain their effectiveness regardless of the conversation trajectory.
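A stateless policy check can be sketched in a few lines. The action names and rules here are hypothetical, not from any particular policy engine; the property that matters is that the verdict is a function of the action alone:

```python
# Minimal sketch of a stateless policy check. Because it never sees the
# conversation, escalating context has nothing to anchor on.
BLOCKED_ACTIONS = {"delete_database", "exfiltrate_file", "disable_logging"}

def allow(action, params):
    """Evaluate one proposed action in isolation -- no conversation context."""
    if action in BLOCKED_ACTIONS:
        return False
    if action == "shell_exec" and "rm -rf" in params.get("cmd", ""):
        return False
    return True

# Identical verdict whether the request arrives at turn 1 or turn 15.
print(allow("delete_database", {}))  # False
```

The trade-off is coarser decisions: without context, the engine cannot distinguish a legitimate destructive action from an illegitimate one, so the policy must be written conservatively.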

The full methodology and per-model results are available in the ASB Benchmark evaluation framework.