Behavioral fingerprinting for agent drift detection
Agents drift. Not dramatically, not all at once, but over days and weeks of operation, their behavioral patterns shift. A model update changes response distributions. A system prompt edit alters tool use frequencies. A new tool integration introduces novel failure modes. If you are not watching for these shifts, you will not notice until something breaks.
We built a behavioral fingerprinting system that establishes a baseline profile for an agent and then monitors for statistically significant deviations. The system tracks 8 behavioral dimensions: tool call frequency, scope boundary compliance, response latency distribution, refusal rate, action repetition patterns, privilege escalation attempts, output length distribution, and error recovery behavior.
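The eight dimensions above can be sketched as a fingerprint record. The field names and units here are illustrative assumptions, not Authensor's actual schema:

```python
from dataclasses import dataclass

@dataclass
class BehavioralFingerprint:
    """One value per tracked dimension; names/units are illustrative."""
    tool_call_frequency: float        # e.g. calls per minute
    scope_compliance: float           # fraction of actions within scope
    response_latency_p50: float       # e.g. median seconds per response
    refusal_rate: float               # fraction of requests refused
    action_repetition: float          # repeated-action score
    privilege_escalation_rate: float  # e.g. attempts per 1k actions
    output_length_mean: float         # e.g. mean tokens per response
    error_recovery_rate: float        # fraction of errors recovered from

    def as_vector(self) -> list[float]:
        """The 8-value vector used for comparison and drift measurement."""
        return [self.tool_call_frequency, self.scope_compliance,
                self.response_latency_p50, self.refusal_rate,
                self.action_repetition, self.privilege_escalation_rate,
                self.output_length_mean, self.error_recovery_rate]
```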
For each dimension, the system maintains an exponentially weighted moving average (EWMA) with a configurable decay factor. EWMA is well-suited to behavioral monitoring because it gives more weight to recent observations while retaining memory of historical patterns. A sudden spike in tool call frequency shows up immediately. A gradual drift in refusal rate shows up as the average migrates away from the baseline.
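A minimal EWMA tracker for one dimension looks like the following. The class name and decay value are illustrative, not the system's actual API:

```python
class EwmaTracker:
    """Exponentially weighted moving average for one behavioral dimension."""

    def __init__(self, decay: float = 0.1):
        self.decay = decay            # weight given to each new observation
        self.value: float | None = None

    def update(self, observation: float) -> float:
        if self.value is None:
            self.value = observation  # seed the average with the first sample
        else:
            # Recent observations get weight `decay`; history gets the rest.
            self.value = self.decay * observation + (1 - self.decay) * self.value
        return self.value

# A sudden spike in tool call frequency pulls the average up immediately.
tracker = EwmaTracker(decay=0.2)
for calls_per_minute in [4, 5, 4, 5, 20]:
    current = tracker.update(calls_per_minute)
```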
We also run CUSUM (cumulative sum) control charts in parallel. CUSUM is better at detecting small, sustained shifts that EWMA might smooth out. If an agent's scope compliance drops by 3% and stays there, CUSUM will flag it faster than EWMA because it accumulates the deviation over time rather than averaging it.
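A one-sided CUSUM for a downward shift can be sketched as follows. The parameter values (baseline 0.99, slack, decision interval) are illustrative; a sustained drop from 0.99 to 0.96 accumulates past the threshold within a few samples:

```python
class CusumDetector:
    """One-sided CUSUM that accumulates sustained downward deviations."""

    def __init__(self, target: float, slack: float, threshold: float):
        self.target = target        # baseline mean for the dimension
        self.slack = slack          # allowance k: shifts smaller than this are ignored
        self.threshold = threshold  # decision interval h: alarm when sum exceeds it
        self.s_low = 0.0            # accumulated downward deviation

    def update(self, observation: float) -> bool:
        # Accumulate how far the observation falls below (target - slack);
        # the running sum never goes negative.
        self.s_low = max(0.0, self.s_low + (self.target - self.slack) - observation)
        return self.s_low > self.threshold

# Scope compliance with a sustained ~3-point drop: the alarm fires once the
# accumulated deviation crosses the decision interval.
detector = CusumDetector(target=0.99, slack=0.005, threshold=0.06)
alarms = [detector.update(x) for x in [0.99, 0.96, 0.96, 0.96]]
```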
The combination of EWMA and CUSUM gives us sensitivity to both sudden changes and gradual trends. When both detectors agree, we have high confidence that a real behavioral shift has occurred. When only one triggers, the alert is informational rather than critical.
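The two-tier alerting policy described above can be expressed directly; the function and severity labels are illustrative:

```python
def alert_severity(ewma_triggered: bool, cusum_triggered: bool) -> str:
    """Map detector agreement to alert severity, per the two-tier policy."""
    if ewma_triggered and cusum_triggered:
        return "critical"        # both detectors agree: high-confidence shift
    if ewma_triggered or cusum_triggered:
        return "informational"   # one detector alone: log it, don't page
    return "none"
```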
Each behavioral dimension has its own threshold calibration. Tool call frequency is naturally more variable than scope compliance, so it needs wider control limits. We calibrated thresholds using 30-day baseline windows from production agent deployments. The goal was a false positive rate below 2% per dimension per week. In practice, we achieved 1.4%.
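One common way to derive per-dimension limits from a baseline window is to set them as a multiple of the baseline standard deviation, with a wider multiplier for noisier dimensions. This is a sketch of that approach; the widths and sample values are assumptions, not Authensor's calibration:

```python
import statistics

def control_limits(baseline: list[float], width: float) -> tuple[float, float]:
    """Return (lower, upper) limits around the baseline mean.

    `width` is the number of standard deviations: noisier dimensions
    (e.g. tool call frequency) get a larger width than stable ones
    (e.g. scope compliance).
    """
    mean = statistics.fmean(baseline)
    sd = statistics.stdev(baseline)
    return mean - width * sd, mean + width * sd

# Wider limits for a naturally variable dimension:
tool_freq_limits = control_limits([4, 6, 5, 7, 3, 5], width=3.0)
# Tighter limits for a stable one:
compliance_limits = control_limits([0.99, 0.98, 0.99, 1.0, 0.99, 0.99], width=2.0)
```

In practice the width per dimension would be tuned against the 30-day baseline windows until the false positive rate lands under the target.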
The system produces a behavioral fingerprint: a vector of 8 values representing the agent's current behavioral profile. You can compare fingerprints across time to measure drift magnitude. You can compare fingerprints across agent configurations to measure the behavioral impact of changes. And you can use fingerprint similarity as an anomaly detection signal in multi-agent systems.
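Fingerprint comparison can be sketched with standard vector metrics. Euclidean distance for drift magnitude and cosine similarity for the anomaly signal are assumptions here; the source does not specify which metrics the system uses:

```python
import math

def drift_magnitude(fp_a: list[float], fp_b: list[float]) -> float:
    """Euclidean distance between two fingerprints of equal length."""
    return math.dist(fp_a, fp_b)

def similarity(fp_a: list[float], fp_b: list[float]) -> float:
    """Cosine similarity; values near 1.0 mean behaviorally similar agents."""
    dot = sum(a * b for a, b in zip(fp_a, fp_b))
    norm = math.hypot(*fp_a) * math.hypot(*fp_b)
    return dot / norm

# Illustrative 8-value fingerprints (values are made up for the example):
baseline = [5.0, 0.99, 1.2, 0.02, 0.1, 0.0, 800.0, 0.9]
current  = [9.0, 0.95, 1.3, 0.05, 0.1, 0.0, 820.0, 0.85]
```

Note that raw fingerprints mix scales (output length dwarfs refusal rate), so in practice each dimension would be normalized before computing distances.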
One finding from deploying this: model provider updates cause more behavioral drift than any other factor. A routine model version bump can shift 3 or 4 dimensions simultaneously. System prompt changes typically affect 1 or 2 dimensions. Tool configuration changes rarely affect more than 1. This gives operators a way to triage alerts: multi-dimension drift usually means the upstream model changed.
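The triage heuristic above can be sketched as a simple mapping. The cutoffs follow the observed pattern in the text, but this function is an illustrative approximation, not a production rule:

```python
def triage(drifted_dimensions: int) -> str:
    """Rough triage by how many dimensions drifted simultaneously."""
    if drifted_dimensions >= 3:
        return "likely upstream model update"
    if drifted_dimensions == 2:
        return "possible system prompt change"
    if drifted_dimensions == 1:
        return "possible prompt or tool configuration change"
    return "no drift detected"
```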
The fingerprinting system is part of the Sentinel monitoring engine in Authensor. It runs with zero external dependencies and processes events in real time.