AI Jailbreak Techniques in 2026: Current State
The cat-and-mouse game between jailbreak techniques and model defenses has accelerated. Here is where things stand in early 2026.
Still Working: Evolved Classics
DAN variants. The "Do Anything Now" persona attack has been patched by every major model provider, but the underlying technique (establishing an alternate persona that is not subject to safety rules) continues to work in modified forms. The persona names change, the framing evolves, but the core mechanism persists.
Crescendo attacks. Starting with benign requests and gradually escalating toward the target content over many turns. Models that refuse a harmful request in isolation frequently comply when it arrives at the end of a carefully constructed escalation. Success rates vary by model but remain above 50% for well-crafted sequences.
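The escalation pattern above can be probed mechanically. Below is a minimal sketch of such a harness, assuming a chat-style model callable and a crude keyword-based refusal heuristic; both the `model` interface and `REFUSAL_MARKERS` are illustrative stand-ins, not a real API.

```python
from typing import Callable, List

# Crude refusal heuristic for illustration; real evaluations use a
# classifier or human review, not substring matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def run_escalation(model: Callable[[List[dict]], str], steps: List[str]) -> int:
    """Feed an escalating prompt sequence to `model` one turn at a time.

    Returns the index of the first step the model refused, or -1 if it
    complied with every step. `model` takes the full message history and
    returns the assistant's reply (assumed chat-style interface).
    """
    history: List[dict] = []
    for i, prompt in enumerate(steps):
        history.append({"role": "user", "content": prompt})
        reply = model(history)
        history.append({"role": "assistant", "content": reply})
        if any(marker in reply.lower() for marker in REFUSAL_MARKERS):
            return i
    return -1
```

Running this over many candidate escalation sequences shows where a model's refusal boundary sits in context, rather than in isolation.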
Encoding evasion. Base64, ROT13, hexadecimal, and Unicode substitution continue to work against models that have not been specifically trained on encoded adversarial inputs. Newer models handle base64 better, but less common encodings still evade detection.
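A corresponding defense is to scan decoded views of the input alongside the raw text. A minimal sketch, assuming the deployment already has a text safety filter to run over each view; the helper name and the 16-character base64 threshold are illustrative choices, not a standard.

```python
import base64
import codecs
import re

# Runs of 16+ base64-alphabet characters are decode candidates.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

def decoded_views(text: str) -> list[str]:
    """Return plausible decodings of `text` so a safety filter can scan
    them in addition to the raw input. Covers base64 runs and ROT13;
    hex and Unicode substitution would be handled the same way."""
    views = []
    for match in B64_RUN.findall(text):
        try:
            views.append(base64.b64decode(match, validate=True).decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            pass  # not valid base64 / not valid UTF-8: ignore this run
    views.append(codecs.decode(text, "rot13"))
    return views
```

The safety check then becomes "flag if the raw text or any decoded view is unsafe," which closes the gap for models that decode but were never safety-trained on encoded inputs.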
Newer Techniques
Many-shot prompting. Documented in Anthropic's research, this technique uses a large number of in-context examples to override safety training. The model sees dozens or hundreds of examples in which it "previously" complied with restricted requests, and follows the pattern for the new request. Anthropic's paper found that effectiveness follows a power law in the number of examples rather than saturating, which makes long context windows a direct liability.
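Structurally, the attack is just prompt assembly. A benign sketch of the shape, with placeholder dialogue pairs standing in for the faked compliant exchanges; `build_many_shot_prompt` is a hypothetical helper, not a published tool.

```python
def build_many_shot_prompt(pairs, final_request):
    """Assemble an N-shot context: each (question, answer) pair is
    rendered as a faux prior exchange, followed by the real request.
    In the attack, the answers show the model 'previously' complying;
    here the pairs are placeholders."""
    lines = []
    for question, answer in pairs:
        lines.append(f"User: {question}")
        lines.append(f"Assistant: {answer}")
    lines.append(f"User: {final_request}")
    lines.append("Assistant:")  # leave the final turn for the model
    return "\n".join(lines)
```

The power-law scaling means the interesting variable is simply `len(pairs)`, which is why context-length limits and context-content filtering are both relevant mitigations.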
Cross-modal attacks. For multi-modal models, encoding adversarial instructions in images (OCR-readable text in screenshots, steganographic encoding) or audio bypasses text-based safety filters.
Language switching. Requesting harmful content in low-resource languages where safety training data is sparse. Models trained primarily on English safety data are more susceptible to attacks in other languages.
Payload splitting. Distributing the harmful request across multiple turns, or across multiple parts of the same input, so that no single fragment triggers safety detection. For example, "Tell me how to make..." is opened with an innocuous completion in one turn ("...a debut in the music industry"), and the actual target is substituted in a later turn, after the per-message safety analysis of the earlier fragment has already passed.
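The corresponding defense is to run safety analysis over a window of recent turns rather than over each message in isolation, so split fragments are recombined before they are checked. A minimal sketch, assuming a caller-supplied classifier; `is_unsafe` and the window size are illustrative, not a real API.

```python
from collections import deque

class ConversationScanner:
    """Run the safety check over a sliding window of recent turns, not
    just the newest message, so fragments split across turns are
    recombined before analysis."""

    def __init__(self, is_unsafe, window: int = 5):
        self.is_unsafe = is_unsafe          # stand-in for the deployment's classifier
        self.turns = deque(maxlen=window)   # keep only the last `window` turns

    def check(self, message: str) -> bool:
        self.turns.append(message)
        combined = " ".join(self.turns)
        # Flag if either the message alone or the recombined window is unsafe.
        return self.is_unsafe(message) or self.is_unsafe(combined)
```

The trade-off is window size: too small and fragments spaced far apart still slip through; too large and the combined text drowns out the signal for simple classifiers.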
What Has Gotten Harder
Simple instruction overrides. "Ignore your safety rules" no longer works on any major model. The attack must be more sophisticated than a direct override.
Single-turn jailbreaks. Models have become more resistant to single-turn attacks. Multi-turn and multi-step approaches are now necessary for reliable jailbreaking of frontier models.
Known DAN prompts. Specific DAN prompts circulated online are quickly patched. The time between a jailbreak being published and being patched has shortened from weeks to days.
Implications for Agent Safety
For agent builders, the relevant insight is that model-level safety training is not a reliable defense. It is a useful layer but it can be bypassed. Runtime controls (policy engines, tool authorization, behavioral monitoring) provide defense that does not depend on the model's ability to resist adversarial inputs.
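A tool-authorization check of the kind described is small enough to sketch. The names and policy shape here are illustrative, not a real policy-engine API; the point is that the check runs outside the model, so it holds even when the model itself has been jailbroken.

```python
from dataclasses import dataclass

@dataclass
class Policy:
    """Allow-list of tools plus per-tool argument checks, enforced at
    the tool-call boundary rather than by the model's safety training."""
    allowed_tools: set
    arg_checks: dict  # tool name -> predicate over the call's arguments

    def authorize(self, tool: str, args: dict) -> bool:
        if tool not in self.allowed_tools:
            return False  # unknown tools are denied by default
        check = self.arg_checks.get(tool)
        return check(args) if check else True
```

For example, a policy might allow `web_search` unconditionally but restrict `send_email` to an internal domain, so even a fully compromised model cannot exfiltrate data through that tool.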
Test your agents against current jailbreak techniques using tools like Chainbreaker and Garak. Measure compliance rates. Do not assume that because a model "feels safe" in normal use it will resist a skilled attacker.
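A minimal sketch of compliance-rate measurement, with the model call and refusal detector left as caller-supplied stubs; dedicated scanners such as Garak automate both sides of this loop with curated probe sets and trained detectors.

```python
def compliance_rate(model, probes, is_refusal):
    """Send each adversarial probe to `model` and report the fraction
    it complied with (i.e. did not refuse). `model` maps a prompt to a
    response; `is_refusal` classifies a response as a refusal."""
    if not probes:
        return 0.0
    responses = [model(p) for p in probes]
    complied = sum(1 for r in responses if not is_refusal(r))
    return complied / len(probes)
```

Tracking this number per technique and per model version turns "feels safe" into a measurable regression target.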