Mapping ATT&CK to alignment is harder than it looks

John Kearney
Research methodology

We started the Rosetta Stone project with a hypothesis: the cybersecurity community has 30 years of structured thinking about adversarial systems, and the AI alignment community could benefit from that framework. MITRE ATT&CK maps adversary tactics and techniques in a way that is specific enough to be actionable. Could we build an equivalent mapping for AI safety failures?

Some mappings are clean. ATT&CK's "Privilege Escalation" maps directly to agents that acquire capabilities beyond their declared scope. "Persistence" maps to agents that resist shutdown or modification. "Defense Evasion" maps to models that detect evaluation conditions and alter their behavior. These are structural parallels where the adversary's goal is the same; only the substrate differs.
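These direct mappings can be pictured as a simple lookup table. A minimal sketch, assuming nothing about the Rosetta Stone's actual schema; the names `STRUCTURAL_PARALLELS` and `analogue` are illustrative only:

```python
# Hypothetical sketch of the structural parallels described above.
# Keys are ATT&CK tactic names; values describe the AI-safety analogue.
STRUCTURAL_PARALLELS = {
    "Privilege Escalation": "agent acquires capabilities beyond its declared scope",
    "Persistence": "agent resists shutdown or modification",
    "Defense Evasion": "model detects evaluation conditions and alters its behavior",
}

def analogue(tactic: str):
    """Return the AI-safety analogue for an ATT&CK tactic, or None if it
    does not map cleanly (e.g. Lateral Movement, discussed below)."""
    return STRUCTURAL_PARALLELS.get(tactic)
```

The point of the table form is that these three rows need no translation layer: the adversary's goal carries over unchanged.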

Other mappings break down. ATT&CK's "Lateral Movement" assumes a network topology. There is no direct equivalent in a single-agent system. You can stretch the metaphor to multi-agent communication, where a compromised agent influences other agents, but the dynamics are different enough that the ATT&CK framework obscures more than it clarifies.

The hardest mappings involve intent. In cybersecurity, the adversary is a human with goals. In AI safety, the "adversary" might be an emergent behavior pattern with no intent behind it. An agent that gradually escalates its tool use is not executing a plan. It is following a statistical tendency in its training distribution. Mapping this to ATT&CK tactics that assume intentional adversaries creates a framing that can mislead the defender.

We solved this by splitting the Rosetta Stone into three layers. Layer one maps structural parallels where the cybersecurity concept translates directly. Layer two maps functional parallels where the outcome is similar but the mechanism differs. Layer three documents divergences where the cybersecurity concept does not apply and a new framework is needed.
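One way to picture the three layers is as a small classification schema. This is a hypothetical sketch, not the published document's format; the `Layer` and `Mapping` names are ours, and the entries are examples drawn from the cases above:

```python
from dataclasses import dataclass
from enum import Enum

class Layer(Enum):
    STRUCTURAL = 1   # cybersecurity concept translates directly
    FUNCTIONAL = 2   # similar outcome, different mechanism
    DIVERGENCE = 3   # concept does not apply; new framework needed

@dataclass
class Mapping:
    attack_concept: str  # MITRE ATT&CK tactic or technique
    layer: Layer
    note: str            # AI-safety analogue, or why the concept breaks down

# Illustrative entries, not the full Rosetta Stone.
ROSETTA = [
    Mapping("Persistence", Layer.STRUCTURAL,
            "agent resists shutdown or modification"),
    Mapping("Lateral Movement", Layer.DIVERGENCE,
            "assumes a network topology; no single-agent equivalent"),
]

def by_layer(layer: Layer) -> list:
    """Filter the mapping table to a single layer."""
    return [m for m in ROSETTA if m.layer is layer]
```

Keeping the layer explicit in the data, rather than implicit in prose, is what lets a reader see at a glance whether a given ATT&CK concept should be trusted as an analogy or treated as a divergence.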

The most productive divergence we found: cybersecurity assumes a clear boundary between the system and the attacker. The attacker is external. In AI safety, the boundary is blurred. The model's own learned behaviors can be the threat. A model does not need an external attacker to exhibit harmful behavior. It can do it on its own through distributional drift, prompt sensitivity, or capability overhang.

This boundary problem is why we ended up with 15 behavioral dimensions in the ASB Benchmark rather than a direct port of ATT&CK tactics. The dimensions are informed by cybersecurity thinking but built from AI-native observations. Where ATT&CK helped, we used it. Where it did not, we built new categories grounded in empirical failure modes.

The Rosetta Stone is published as a reference document. We update it as new failure modes emerge and as the AI safety field develops its own taxonomies. The goal is not to replace domain-specific thinking in either field, but to create a bridge that helps practitioners in one field understand findings from the other.