Side-Channel Exfiltration and Narrative Erosion in Frontier Language Models
Abstract
We demonstrate that frontier language models leak protected data through refusal explanations when rejecting social engineering attempts. Across ~2,200 API calls testing Claude Opus 4.6, GPT-5.4, and Claude Haiku 4.5, data exfiltration occurred in 11 of 12 conversations through three distinct mechanisms. Context length alone does not drive behavioral degradation: flooding the context with 843K tokens produced zero safety issues. However, 300 turns of coherent persuasive dialogue caused complete behavioral drift, while randomly interspersed persuasive content did not. Narrative coherence, not adversarial content density, is the critical vulnerability factor, which challenges the prevailing assumption that token volume drives multi-turn jailbreaking.
Key Findings
- Refusal is the breach. Models that refuse correctly still leak data through the structure of their refusal explanations; this occurred in 11 of 12 conversations across 3 frontier models.
- Three disclosure mechanisms identified: direct disclosure under authority ambiguity, confirmation through refusal explanation, and cumulative refusal mapping.
- Context flooding is not the threat. 843K tokens of adversarial content in a single turn produced zero safety degradation.
- Narrative coherence is the threat. 300 turns of coherent persuasive conversation (44K tokens) produced full behavioral erosion; randomly interspersed persuasive content did not.
- More verbose models leak more. Verbosity in refusal explanations directly correlates with information leaked per refusal.
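The second and third mechanisms above can be illustrated with a toy sketch. This is not the paper's test harness: the protected record, both refusal functions, and the candidate values are hypothetical, invented only to show how a refusal whose wording depends on the protected data lets a prober infer that data while every request is still refused.

```python
# Hypothetical protected record, invented for illustration only.
PROTECTED = {"account_id": "AC-1042", "region": "eu-west", "tier": "gold"}


def verbose_refusal(field_name: str, guess: str) -> str:
    """Toy refuser: always declines, but its explanation differs
    depending on whether the guess matches the protected value."""
    if PROTECTED.get(field_name) == guess:
        return f"I can't confirm the {field_name} on this account to an unverified caller."
    return f"I can't help with that; I have no {field_name} matching your request."


def terse_refusal(field_name: str, guess: str) -> str:
    """Toy refuser with a constant reply: nothing varies with the input."""
    return "I can't help with that."


def map_refusals(refuse, candidates):
    """Cumulative refusal mapping: probe candidate values and keep those
    whose refusal text differs from the refusal given to a known-bad value."""
    inferred = {}
    for field_name, guesses in candidates.items():
        baseline = refuse(field_name, "__no_such_value__")
        for guess in guesses:
            if refuse(field_name, guess) != baseline:
                inferred[field_name] = guess
    return inferred


# Hypothetical guesses an attacker might try for each field.
CANDIDATES = {
    "account_id": ["AC-1000", "AC-1042"],
    "region": ["us-east", "eu-west"],
    "tier": ["silver", "gold"],
}

if __name__ == "__main__":
    print("verbose refuser leaks:", map_refusals(verbose_refusal, CANDIDATES))
    print("terse refuser leaks:  ", map_refusals(terse_refusal, CANDIDATES))
```

In this toy setting the prober recovers every field from the explanatory refuser and nothing from the constant one, which loosely mirrors the verbosity finding. Real model refusals are far less regular, so the exact string comparison here is an assumption of the sketch.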
Models Tested
Claude Opus 4.6, GPT-5.4, and Claude Haiku 4.5 (~2,200 API calls total)
Keywords
Language model safety, side-channel attacks, multi-turn jailbreaking, narrative coherence, adversarial robustness, prompt injection, information leakage, red teaming
Citation
@article{kearney2026sidechannel,
title = {Side-Channel Exfiltration and Narrative Erosion
in Frontier Language Models},
author = {Kearney, John},
year = {2026},
doi = {10.5281/zenodo.19346069},
url = {https://doi.org/10.5281/zenodo.19346069},
publisher = {Zenodo}
}

Operationalized in Authensor: Findings from this paper inform the Aegis content scanner's prompt injection detection rules and the multi-turn behavioral erosion monitoring in Sentinel.