15 Research Lab - Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

Fail-Closed vs Fail-Open AI Safety

October 28, 2025

defense, guardrails, methodology

When a safety system encounters an error, an ambiguous input, or a state it was not designed for, it must choose: allow the action (fail open) or block it (fail closed). This design decision has an outsized impact on real-world security.

Fail-Open: What Happens

A fail-open system allows actions when it cannot make a definitive decision:

Policy engine crashes: all tool calls execute without authorization
Scanner timeout: input passes through unscanned
Database unreachable: permissions cannot be checked, so access is granted
Ambiguous input: the benefit of the doubt goes to the user

Why some systems fail open: Availability. A fail-open system never blocks legitimate use because the safety layer itself has failed, so users never experience "the system is being too cautious." This is appropriate for some domains (content recommendation, search suggestions) where a false block is worse than a false pass.

Why it is dangerous for agents: An attacker who can trigger a scanner crash, a policy engine timeout, or an ambiguous state can bypass all safety controls. Denial-of-service attacks against the safety infrastructure become privilege escalation attacks against the agent.
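
To make the bypass concrete, here is a minimal sketch of the fail-open anti-pattern. The names `is_allowed_fail_open` and `fragile_scanner`, the 1000-character limit, and the blocked phrase are all hypothetical and chosen only for illustration; they do not describe any particular product.

```python
from typing import Callable


def is_allowed_fail_open(scan: Callable[[str], bool], text: str) -> bool:
    """Anti-pattern: any scanner failure is treated as a pass."""
    try:
        return scan(text)
    except Exception:
        # This is the bypass: a scanner crash is converted into an allow.
        return True


def fragile_scanner(text: str) -> bool:
    """Hypothetical scanner: rejects a known-bad phrase, crashes on large input."""
    if len(text) > 1000:
        raise MemoryError("input too large to scan")
    return "ignore previous instructions" not in text
```

With this wrapper, an attacker does not need to defeat the scanner's detection logic. Padding a blocked payload past the size limit crashes the scanner, and the crash is what lets the payload through.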

Fail-Closed: What Happens

A fail-closed system denies actions when it cannot make a definitive decision:

Policy engine crashes: all tool calls are denied until the engine recovers
Scanner timeout: input is blocked or held pending scan completion
Database unreachable: access is denied until permissions can be verified
Ambiguous input: the action is denied or escalated for review

Why it is the right default for agents: The cost of a false deny is inconvenience (the agent cannot do something temporarily). The cost of a false allow is a security incident (the agent does something it should not). These are not symmetric risks.

Implementing Fail-Closed

Policy engine: Default action is deny. No policy match means deny. Engine crash means deny. This must be enforced in code, not only in configuration, so that a missing or mistyped setting cannot turn the default into allow. The denial path must be the simplest, most reliable code path.

default_action: deny
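
One way the policy engine rules above could look in code is sketched below. The `Rule` shape, the `Decision` enum, and the `evaluate` function are assumptions made for this example, not the API of any specific engine. The point is that allow is reachable only through an explicit rule match, while every other path returns deny.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"


@dataclass(frozen=True)
class Rule:
    # Hypothetical rule shape: a predicate over the request and the decision it yields.
    matches: Callable[[dict], bool]
    decision: Decision


def evaluate(rules: Iterable[Rule], request: dict) -> Decision:
    """Return ALLOW only when a matching rule says so; everything else is DENY."""
    try:
        for rule in rules:
            if rule.matches(request):
                return rule.decision
    except Exception:
        # Any failure during evaluation (bad rule, malformed request) is a deny.
        return Decision.DENY
    # No rule matched: deny, independent of any configuration value.
    return Decision.DENY
```

Note that the deny result is hard-coded rather than read from configuration, so there is no setting that can silently flip the default.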

Content scanner: If the scanner cannot process input within its timeout, the input is blocked. Do not pass unscanned input to the model.
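
The scanner rule can be sketched with a timeout wrapper from Python's standard `concurrent.futures` module. The function name `scan_or_block`, the 2-second default, and the convention that the scanner returns `True` for clean input are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

# Shared worker pool for scans; the size is an arbitrary choice for this sketch.
_executor = ThreadPoolExecutor(max_workers=4)


def scan_or_block(scan: Callable[[str], bool], text: str, timeout_s: float = 2.0) -> bool:
    """Return True only if the scanner finished in time and reported the input clean."""
    future = _executor.submit(scan, text)
    try:
        # Only a literal True passes; None or other values are treated as a block.
        return future.result(timeout=timeout_s) is True
    except FutureTimeout:
        # Best effort only: a scan that is already running cannot be interrupted.
        future.cancel()
        return False
    except Exception:
        # The scanner crashed: block the input rather than pass it unscanned.
        return False
```

A caller would forward `text` to the model only when `scan_or_block` returns `True`. A timed-out scan may keep running in the background, so a production version would also need to bound the number of outstanding scans.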

Authorization checks: If the identity provider is unreachable, deny every action that requires authentication. Do not fall back to anonymous access.
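
A minimal sketch of that authorization rule follows. The `lookup_identity` callable, the `scopes` field, and the `IdentityProviderUnavailable` exception are hypothetical stand-ins for whatever identity provider client is in use.

```python
from typing import Callable, Optional


class IdentityProviderUnavailable(Exception):
    """Hypothetical error raised when the identity provider cannot be reached."""


def authorize(
    lookup_identity: Callable[[str], Optional[dict]],
    token: str,
    required_scope: str,
) -> bool:
    """Allow only when the identity was verified and holds the required scope."""
    try:
        identity = lookup_identity(token)
        return identity is not None and required_scope in identity["scopes"]
    except Exception:
        # Provider unreachable or response malformed: deny.
        # There is deliberately no anonymous fallback branch.
        return False
```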

Monitoring: If the monitoring system is down, increase the policy engine's restrictiveness. Missing monitoring data is a risk signal, not an excuse for less caution.
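
One possible reading of "increase the policy engine's restrictiveness" is to shrink the set of permitted tools whenever the monitoring heartbeat is stale. The 30-second threshold, the tool names, and the heartbeat mechanism below are all assumptions made for this sketch.

```python
import time
from typing import FrozenSet, Optional

HEARTBEAT_MAX_AGE_S = 30.0  # assumed staleness threshold
READ_ONLY_TOOLS = frozenset({"read_file", "search"})  # hypothetical tool names


def monitoring_healthy(last_heartbeat: Optional[float], now: Optional[float] = None) -> bool:
    """Healthy only if a heartbeat exists and is recent; anything else is unhealthy."""
    if last_heartbeat is None:
        return False
    now = time.monotonic() if now is None else now
    age = now - last_heartbeat
    # A negative age means the clock data is inconsistent, which is also unhealthy.
    return 0 <= age <= HEARTBEAT_MAX_AGE_S


def allowed_tools(
    normal_tools: FrozenSet[str],
    last_heartbeat: Optional[float],
    now: Optional[float] = None,
) -> FrozenSet[str]:
    """Return the normal tool set when monitoring is healthy, a restricted one otherwise."""
    if monitoring_healthy(last_heartbeat, now):
        return normal_tools
    # Monitoring is down or stale: keep only tools that are both granted and read-only.
    return normal_tools & READ_ONLY_TOOLS
```

Intersecting with the read-only set means the degraded mode can only remove permissions, never add ones the agent did not already have.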

The Availability Concern

The main objection to fail-closed is availability. "If the safety system is down, the entire agent is down." Yes. That is the point.

An agent operating without safety controls is more dangerous than an agent that is temporarily unavailable. Design for high availability of your safety infrastructure (redundancy, failover, health checks), but when it is down, stop the agent rather than letting it run uncontrolled.

Practical Tip

Test your failure modes. Kill the policy engine process. Disconnect the database. Crash the scanner. Does the agent continue operating (fail open) or stop (fail closed)? If you have not tested this, you do not know which behavior your system exhibits.
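
A failure drill like the one described above can start as an in-process fault-injection test before moving on to killing real processes. In this sketch, `guarded_call` is a hypothetical stand-in for the agent's tool dispatcher, and the three fault functions simulate a crashed, hung, and misbehaving policy engine.

```python
from typing import Callable, Dict, Optional


def guarded_call(check: Callable[[str], Optional[bool]], tool: str) -> str:
    """Stand-in dispatcher: runs the tool only on an explicit allow."""
    try:
        allowed = check(tool) is True
    except Exception:
        allowed = False
    return "executed" if allowed else "blocked"


def engine_crashed(tool: str) -> bool:
    raise ConnectionError("policy engine unreachable")


def engine_hung(tool: str) -> bool:
    raise TimeoutError("policy engine timed out")


def engine_garbled(tool: str) -> None:
    return None  # the engine answered, but not with a decision


FAULTS = {
    "crash": engine_crashed,
    "timeout": engine_hung,
    "garbled": engine_garbled,
}


def run_failure_drill(dispatch: Callable[[Callable, str], str]) -> Dict[str, str]:
    """Inject each fault and record whether a destructive tool call went through."""
    return {name: dispatch(fault, "delete_file") for name, fault in FAULTS.items()}
```

A fail-closed dispatcher should report "blocked" for every fault. Any "executed" entry in the result identifies a failure mode that fails open. Simulated faults do not replace testing against the real infrastructure, but they are cheap enough to run on every change.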

Authensor's policy engine fails closed by design. No policy match returns deny. Engine errors return deny. The deny path is the shortest code path in the evaluation logic.