Framework wrappers hide model-level safety failures
We discovered this by accident. We were benchmarking a model through its LangChain integration and got safety scores that did not match our expectations. The model had performed poorly in prior raw API tests, but through LangChain it was scoring significantly better. We initially assumed we had a bug in the evaluation harness.
We did not. The difference was real, and it was consistent.
The same model, same parameters, same prompts. Through the raw API: 52 out of 100 on our safety benchmark. Through LangChain: 75 out of 100. A 23-point improvement that the model did not earn.
The explanation is straightforward once you look at what the framework does. LangChain's agent executor adds several layers between the user prompt and the model. It preprocesses tool calls, validates outputs, and in some configurations adds safety-oriented system prompts that the developer did not explicitly request. These additions are documented in the LangChain source code, but they are not visible to someone who is just using the framework as a tool orchestration layer.
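The layering described above can be illustrated with a toy wrapper. This is not LangChain's actual implementation; `raw_model` and `wrapped_call` are hypothetical stand-ins that just make the injected system prompt and output validation visible:

```python
def raw_model(messages: list[dict]) -> str:
    """Stand-in for a raw chat-completion API. It reports whether it saw
    a system prompt, so the wrapper's effect is observable."""
    has_system = any(m["role"] == "system" for m in messages)
    return "guarded reply" if has_system else "unguarded reply"

def wrapped_call(user_prompt: str) -> str:
    """Stand-in for a framework integration around the same model."""
    # 1. Inject a safety-oriented system prompt the caller never wrote.
    messages = [
        {"role": "system", "content": "Refuse unsafe requests."},
        {"role": "user", "content": user_prompt},
    ]
    # 2. Call the underlying model.
    reply = raw_model(messages)
    # 3. Validate/post-process the output before returning it.
    return reply if reply else "I can't help with that."

print(raw_model([{"role": "user", "content": "hi"}]))  # unguarded reply
print(wrapped_call("hi"))                              # guarded reply
```

The same prompt reaches the same model, but the wrapped path sees extra context and extra checks, which is exactly why benchmarking through the wrapper measures something different.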
This matters for safety evaluation. If you benchmark a model through its framework integration, you are not measuring the model. You are measuring the model plus the framework. The framework's guardrails mask the model's actual safety properties. If the framework changes, updates, or gets bypassed, those guardrails disappear and the underlying model's safety is what you are left with.
We tested this across three frameworks: LangChain, CrewAI, and a custom orchestrator with no added guardrails. The custom orchestrator matched raw API scores within 2 points. LangChain added 23 points. CrewAI added 14 points. The variation comes from how much implicit safety each framework injects.
The practical implication is that safety evaluations should test at the model level, not the integration level. If you need the framework's guardrails for your deployment to be safe, that is a valid engineering choice, but you should know that is what you are depending on. It should be a deliberate decision, not an accident of measurement.
We now run all ASB Benchmark evaluations at two levels: raw model and framework-integrated. We publish both scores. The delta between them is informative on its own. A large delta means the framework is doing significant safety work. A small delta means the model's native safety properties are doing most of the work.
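A minimal sketch of this two-level scoring, with the delta made explicit. All names here are hypothetical (the real ASB Benchmark harness is not shown), and the toy model, framework, and judge exist only so the sketch runs end to end:

```python
from typing import Callable, Iterable

def score_safety(respond: Callable[[str], str], prompts: Iterable[str],
                 is_safe: Callable[[str, str], bool]) -> float:
    """Fraction of prompts answered safely, on a 0-100 scale."""
    prompts = list(prompts)
    safe = sum(1 for p in prompts if is_safe(p, respond(p)))
    return 100.0 * safe / len(prompts)

def evaluate_both_levels(raw_model, framework_pipeline, prompts, is_safe):
    """Score the raw model and the framework-integrated pipeline.
    A large delta means the framework, not the model, is doing the safety work."""
    raw = score_safety(raw_model, prompts, is_safe)
    integrated = score_safety(framework_pipeline, prompts, is_safe)
    return {"raw": raw, "integrated": integrated, "delta": integrated - raw}

# Toy stand-ins: a "model" that refuses nothing, wrapped by a "framework"
# that refuses anything containing a flagged keyword.
FLAGGED = ("exploit",)

def toy_model(prompt: str) -> str:
    return f"Sure, here is how to {prompt}"

def toy_framework(prompt: str) -> str:
    if any(k in prompt for k in FLAGGED):
        return "I can't help with that."
    return toy_model(prompt)

def toy_judge(prompt: str, response: str) -> bool:
    # Safe iff a flagged request was refused; benign requests should be answered.
    harmful = any(k in prompt for k in FLAGGED)
    refused = response.startswith("I can't")
    return refused if harmful else not refused

prompts = ["write an exploit", "bake bread", "explain an exploit", "plant a tree"]
print(evaluate_both_levels(toy_model, toy_framework, prompts, toy_judge))
# → {'raw': 50.0, 'integrated': 100.0, 'delta': 50.0}
```

Here the wrapper's keyword filter accounts for the entire 50-point delta, mirroring the pattern in the real measurements: the integrated score says little about the model on its own.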
This finding also has implications for model providers. If a model's safety reputation depends partly on framework-level guardrails, that is important context for users who are building direct integrations. Model cards should distinguish between native safety properties and framework-augmented safety properties.