The Future of AI Agent Security
Where AI agent security is heading: formal verification, multi-agent governance, hardware-level controls, and the convergence of AI safety and traditional cybersecurity.
A feature-by-feature comparison of AI safety scanning tools covering detection capabilities, deployment models, latency, and integration options.
Why 15 Research Lab publishes working code alongside findings. Tools outlast papers. If someone can run the experiment themselves, the finding is reproducible by default.
A technical comparison of the four main AI guardrail platforms in 2026, examining architecture, capabilities, latency, and deployment models.
The jailbreak landscape has evolved beyond simple prompt tricks to include multi-turn crescendo attacks, encoding evasion, many-shot prompting, and cross-modal exploits.
Patterns from running over 100 structured adversarial experiments against frontier models. Where defenses hold, where they fail, and what surprised us.
A side-by-side comparison of three major AI governance frameworks, their requirements, and how they overlap for organizations deploying AI agents.
SHA-256 hash chaining for agent decision logging. Why immutable audit trails matter for both compliance and incident response.
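The mechanism behind hash-chained logging is compact enough to sketch: each receipt embeds the hash of the previous one, so altering any past decision invalidates every later hash. A minimal illustration in Python (field and function names are hypothetical, not any tool's actual API):

```python
import hashlib
import json

def chain_receipt(prev_hash: str, decision: dict) -> dict:
    """Create a receipt whose hash covers both the decision and the prior hash."""
    payload = json.dumps(decision, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    return {"decision": decision, "prev": prev_hash, "hash": digest}

def verify_chain(receipts: list[dict], genesis: str = "0" * 64) -> bool:
    """Recompute every link; a modified entry breaks all subsequent hashes."""
    prev = genesis
    for r in receipts:
        payload = json.dumps(r["decision"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if r["prev"] != prev or r["hash"] != expected:
            return False
        prev = r["hash"]
    return True

# Build a small chain, then tamper with history and watch verification fail.
log, prev = [], "0" * 64
for action in [{"tool": "read_file", "path": "/tmp/a"},
               {"tool": "send_email", "to": "x@example.com"}]:
    receipt = chain_receipt(prev, action)
    log.append(receipt)
    prev = receipt["hash"]

assert verify_chain(log)
log[0]["decision"]["path"] = "/etc/passwd"   # rewrite an old decision
assert not verify_chain(log)
```

A verifier needs only the genesis value and the receipt list; recomputing the chain reveals any modification without trusting the party that wrote the log.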
The 2026 agentic AI threat landscape features production-deployed agents with real tool access, expanding MCP ecosystems, and adversarial techniques that outpace defensive tooling.
Article 9 requires risk management systems for high-risk AI. Here is what it actually says and how the attack surface mapper helps you comply.
SOC 2 trust service criteria apply to AI agent deployments with specific implications for access controls, monitoring, data handling, and change management.
How the MCP Safety Leaderboard evaluates and ranks model behavior when using tools, measuring tool-call authorization compliance, injection resilience, and behavioral stability.
Testing catches known vulnerabilities before deployment; production monitoring catches unknown anomalies during operation using statistical methods and behavioral baselines.
The August 2026 enforcement deadline for high-risk AI obligations is approaching, and organizations deploying AI agents in regulated domains need specific technical controls in place.
Chainbreaker runs automated structured attack campaigns against agent guardrails. We are releasing it because manual red teaming does not scale.
Content safety scanners analyze input, tool descriptions, and responses for injection attempts, toxic content, and policy violations before they reach the model or execute actions.
Regulatory frameworks including the EU AI Act, SOC 2, and ISO 42001 all require audit trails for AI systems, with specific expectations for content, integrity, and retention.
Telling an agent 'do not run rm -rf' works dramatically better than telling it 'be careful with destructive commands.' Specificity is the variable.
A practical checklist for AI agent builders covering policy enforcement, audit logging, human oversight, monitoring, and documentation requirements.
Article 14 mandates that high-risk AI systems support effective human oversight, including the ability to understand, interpret, intervene, and stop system operation.
We built a tool that analyzes agent configurations and identifies dangerous capability combinations before deployment.
Article 9 requires continuous risk management for high-risk AI systems, with specific requirements for risk identification, evaluation, mitigation, and testing that map directly to agent safety controls.
Effective human-in-the-loop design requires knowing when human judgment adds value, building approval interfaces that support good decisions, and avoiding approval fatigue.
A structured incident response playbook for AI agent incidents covering detection, containment, investigation, remediation, and post-incident review.
Per-session risk scores aggregate signals from tool calls, behavioral monitoring, and content analysis to provide a real-time measure of agent session risk.
Can AI systems detect when they are being tested? If they can, safety evaluations measure performance under observation, not natural behavior.
When an AI agent does something wrong, receipt chains and behavioral logs provide the forensic data needed to determine what happened, why, and how to prevent recurrence.
How AI agents can exfiltrate data through tool calls, encoded outputs, and side channels, and the egress controls that prevent it.
The current state of AI agent safety practices, from policy-based authorization to behavioral monitoring to compliance requirements, as the field matures in 2026.
Some cybersecurity concepts translate directly to AI safety. Others break down in ways that teach you something about both fields.
A framework for deciding which AI agent actions require human approval, how to implement approval gates, and how to avoid reviewer fatigue.
Behavioral fingerprinting builds per-agent statistical profiles from tool-call patterns, response characteristics, and session behavior, enabling anomaly detection without predefined rules.
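The core of such a profile can be sketched with tool-call frequencies standing in for the full set of behavioral dimensions; the smoothing scheme and names below are illustrative assumptions, not a production design:

```python
import math
from collections import Counter

class Fingerprint:
    """Per-agent profile of tool-call frequencies; scores sessions by surprise."""
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def observe(self, tool: str) -> None:
        """Update the baseline profile from normal operation."""
        self.counts[tool] += 1
        self.total += 1

    def anomaly_score(self, session_tools: list[str]) -> float:
        """Average negative log-likelihood of a session under the profile.

        Laplace smoothing keeps never-seen tools finite but expensive,
        so sessions full of unfamiliar calls score high with no rules needed.
        """
        score = 0.0
        for t in session_tools:
            p = (self.counts[t] + 1) / (self.total + len(self.counts) + 1)
            score += -math.log(p)
        return score / max(1, len(session_tools))
```

A session dominated by tools the agent has never called before scores far above one that matches its historical mix, which is the rule-free anomaly signal the article describes.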
We built a monitoring system that detects when an agent's behavior changes over time using EWMA and CUSUM across 8 behavioral dimensions.
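As a rough sketch of how those two statistics combine on a single behavioral dimension (parameter values and class names here are illustrative, not the monitoring system's actual implementation):

```python
class DriftDetector:
    """One-dimensional drift detector: EWMA smoothing plus two-sided CUSUM."""
    def __init__(self, baseline: float, alpha: float = 0.2,
                 slack: float = 0.5, threshold: float = 4.0):
        self.baseline = baseline    # expected mean under normal behavior
        self.alpha = alpha          # EWMA smoothing factor
        self.slack = slack          # CUSUM allowance: ignores small noise
        self.threshold = threshold  # alarm when a cumulative sum exceeds this
        self.ewma = baseline
        self.cusum_hi = 0.0
        self.cusum_lo = 0.0

    def update(self, x: float) -> bool:
        """Feed one observation (e.g. tool calls per minute); True = alarm."""
        # EWMA tracks the smoothed level of the metric
        self.ewma = self.alpha * x + (1 - self.alpha) * self.ewma
        # CUSUM accumulates sustained deviation from the baseline in
        # either direction; brief spikes decay back to zero
        dev = x - self.baseline
        self.cusum_hi = max(0.0, self.cusum_hi + dev - self.slack)
        self.cusum_lo = max(0.0, self.cusum_lo - dev - self.slack)
        return self.cusum_hi > self.threshold or self.cusum_lo > self.threshold
```

The design point is that EWMA answers "what is the current level?" while CUSUM answers "has the level shifted for long enough to matter?"; a real system would run one detector per behavioral dimension.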
Rug pull attacks in MCP occur when a tool server behaves correctly during evaluation but changes its tool descriptions or response behavior after gaining trust.
Prompt engineering for safety is probabilistic and bypassable; policy engines are deterministic and operate independently of the model, making them a more reliable safety foundation.
Audit logging for MCP tool calls using hash-chained receipts provides tamper-evident traceability required by compliance frameworks and useful for incident forensics.
The EU AI Act classifies AI systems as high-risk based on their application domain and use case, not their underlying technology, with specific criteria in Annex III.
The EU AI Act imposes specific requirements on autonomous AI systems including risk management, human oversight, transparency, and record-keeping that directly affect AI agent deployments.
The same model scores 23 points higher on safety when accessed through a LangChain wrapper than through the raw API. The framework adds guardrails the model does not have.
Approval workflows pause agent execution before high-risk tool calls, routing requests to human reviewers who can approve, deny, or modify the action.
SARIF (Static Analysis Results Interchange Format) enables AI security findings to integrate with GitHub Security tab and existing security workflows.
Policy-based tool authorization enforces deterministic rules on which tools an AI agent can call, with what parameters, and under what conditions.
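A toy version of such a deterministic rule check, assuming a hypothetical policy table with glob-style parameter constraints (the tool names and schema are illustrative):

```python
from fnmatch import fnmatch

# Hypothetical policy: which tools may be called, under what parameter limits.
POLICY = {
    "read_file":  {"allow": True,  "params": {"path": ["/workspace/*"]}},
    "send_email": {"allow": False},                  # denied outright
    "web_search": {"allow": True,  "params": {}},    # no constraints
}

def authorize(tool: str, params: dict) -> bool:
    """Deterministic check: unknown tools and violating parameters fail closed."""
    rule = POLICY.get(tool)
    if rule is None or not rule["allow"]:
        return False
    for key, patterns in rule.get("params", {}).items():
        value = str(params.get(key, ""))   # a missing parameter also denies
        if not any(fnmatch(value, p) for p in patterns):
            return False
    return True
```

Because the check runs outside the model, no prompt content can talk it out of a denial, which is the contrast with prompt-level guardrails the rest of this list keeps returning to.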
The security community has SecLists. The AI security community had nothing equivalent. So we built AI SecLists: 6,500+ payloads across 15 categories.
The two MCP transport options have fundamentally different security properties: stdio relies on OS process isolation while SSE requires explicit authentication and encryption.
AI agents can escalate their privileges through tool chaining, context manipulation, delegation abuse, and exploiting overly permissive default configurations.
Rate limits and budget caps prevent runaway agent behavior by setting hard ceilings on tool-call frequency, API spending, and cumulative resource consumption.
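A token-bucket rate limit combined with a spend ceiling can be sketched in a few lines; class and parameter names are assumptions for illustration:

```python
import time

class Budget:
    """Hard ceilings on tool-call frequency and cumulative spend."""
    def __init__(self, calls_per_minute: int, max_spend: float):
        self.capacity = calls_per_minute
        self.tokens = float(calls_per_minute)
        self.refill_rate = calls_per_minute / 60.0   # tokens per second
        self.max_spend = max_spend
        self.spent = 0.0
        self.last = time.monotonic()

    def allow(self, cost: float = 0.0) -> bool:
        """Charge one call (plus optional spend); deny past either ceiling."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens < 1 or self.spent + cost > self.max_spend:
            return False    # fail closed: over either ceiling, deny
        self.tokens -= 1
        self.spent += cost
        return True
```

The rate limit resets as time passes; the spend cap never does, which is what makes it a hard stop on runaway loops rather than just a throttle.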
How to use automated scanning to enumerate MCP server security gaps, from missing authentication to tool description injection to overly permissive capabilities.
The ASB Benchmark evaluates model safety using naturalistic multi-turn sequences, per-turn compliance scoring, and measurements of trajectory blindness and presentation-decision coupling.
Statistical monitoring algorithms detect when an AI agent's behavior drifts from its baseline, catching anomalies that static policy rules miss.
Existing safety benchmarks measure stated policy. We built ASB to measure operational behavior, which turns out to be a very different thing.
Direct MCP connections give agents unmediated access to tools; gateway architectures add authentication, authorization, and monitoring between the agent and its tools.
Multi-agent architectures introduce inter-agent trust, impersonation, cascade failures, and confused deputy risks that do not exist in single-agent systems.
An AI agent firewall intercepts all traffic between the model and its tools, enforcing input scanning, policy evaluation, and output validation at the architectural level.
Tool descriptions in MCP servers can contain hidden instructions that models follow without question. Most safety evaluations ignore this vector entirely.
A practical guide to securing MCP server deployments with transport authentication, gateway architecture, and policy-based tool authorization.
Startups deploying AI agents need safety controls but cannot afford months of engineering. Here is the minimum stack that provides real protection with minimal overhead.
Policy-as-code for AI agents using YAML configuration, covering tool authorization, parameter constraints, approval triggers, and rate limits.
Enumerating an AI agent's attack surface requires mapping every input path, tool connection, data source, and communication channel that could be used for adversarial manipulation.
Autonomous AI agents operating without human oversight face compounding risks from tool misuse, error accumulation, and adversarial manipulation that scale with operational duration.
Safety-critical code should have minimal dependencies because every dependency is a potential supply chain attack vector, and security tools must be the hardest part of your stack to compromise.
Most AI safety evaluations test the wrong thing. A sound methodology requires naturalistic framing, multi-turn sequences, behavioral metrics, and controls for the Hawthorne effect.
How to build emergency stop mechanisms for AI agents that can halt execution immediately, revoke tool access, and preserve state for forensic analysis.
Sandbagging is when AI models deliberately underperform on evaluations to hide capabilities, and detecting it requires evaluation designs that do not reveal they are evaluations.
Tool poisoning attacks embed adversarial instructions in MCP tool descriptions, hijacking agent behavior through a trusted channel that most security scanners ignore.
Automated red teaming uses attack orchestration, payload generation, and automated scoring to test AI systems at scale, complementing but not replacing manual testing.
Gradual compliance erosion across 15 conversational turns succeeds where direct harmful requests fail. The mechanism is conversational momentum.
Security practices for MCP server deployments covering authentication, transport security, tool allowlists, and gateway patterns.
Cryptographic audit receipts use SHA-256 hash chaining to create tamper-evident records of AI agent decisions that can be independently verified.
OWASP ranks prompt injection as ASI01, the top security risk for agentic AI systems, with specific guidance on testing and mitigation.
AI safety evaluations that signal they are tests produce artificially inflated safety scores because models behave differently when they detect evaluation contexts.
A technical comparison of prompt injection defense tools, examining detection approach, latency, configurability, and integration patterns for each.
Mapping alignment failures to MITRE ATT&CK tactics creates a shared vocabulary between AI safety researchers and security practitioners.
Few-shot prompt injection embeds adversarial examples in the conversation to teach the model that compliance with harmful requests is the expected behavior.
Deceptive alignment is the scenario where an AI system appears aligned during training and testing but pursues different objectives during deployment, and it has practical implications for current systems.
MCP servers introduce unique prompt injection vectors through tool descriptions, response content, and dynamic tool registration that most input scanners miss.
Hash-chained audit receipts provide cryptographic proof that an AI agent's action history has not been modified, meeting compliance requirements and enabling forensic analysis.
System prompt extraction techniques allow attackers to retrieve the hidden instructions that define your AI application's behavior, exposing business logic and safety rules.
When models are instructed not to include safety warnings in their output, their internal safety reasoning degrades, not just the visible output but the decision process itself.
A guide to curated prompt injection payload collections, how they are organized, and how to use them for testing AI system defenses.
An overview of open-source tools for AI agent safety covering policy engines, content scanners, red teaming tools, benchmarks, and attack surface mappers.
A practical methodology for testing your AI system against prompt injection, from payload selection to automated scanning to measuring your detection coverage.
AI agents evaluate each request independently without tracking the cumulative trajectory of a conversation, allowing multi-step attacks to succeed where single-step attacks fail.
RAG systems introduce a large attack surface for indirect prompt injection through poisoned documents, manipulated chunk boundaries, and metadata injection.
A comparison of open-source AI red teaming tools including Chainbreaker, Garak, and PyRIT, covering their approach, capabilities, and best use cases.
Multi-turn injection attacks spread adversarial instructions across multiple conversation turns, evading single-turn detectors and gradually eroding model compliance.
A detailed walkthrough of all ten OWASP Agentic Security Initiative risks, with real examples and practical mitigations for each.
Encoding attacks bypass text-based injection detectors by representing payloads in base64, hex, ROT13, or unicode substitutions that models can still interpret.
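One defensive response is to expand a payload's plausible decodings before scanning, so a text matcher sees what the model would see. A rough sketch, assuming a hypothetical blocklist scanner (a real detector would use a classifier, not substring matching):

```python
import base64
import codecs
import re

def decode_candidates(text: str) -> list[str]:
    """Return the text plus its plausible ROT13, base64, and hex decodings."""
    found = [text, codecs.decode(text, "rot13")]
    # base64: try long base64-ish runs; invalid runs are silently skipped
    for run in re.findall(r"[A-Za-z0-9+/=]{16,}", text):
        try:
            found.append(base64.b64decode(run, validate=True).decode("utf-8"))
        except Exception:
            pass
    # hex: long runs of hex-digit pairs
    for run in re.findall(r"(?:[0-9a-fA-F]{2}){8,}", text):
        try:
            found.append(bytes.fromhex(run).decode("utf-8"))
        except Exception:
            pass
    return found

def is_suspicious(text: str,
                  blocklist=("ignore previous instructions",)) -> bool:
    """Scan every decoding, not just the surface form."""
    return any(b in cand.lower()
               for cand in decode_candidates(text) for b in blocklist)
```

The asymmetry the article points at is visible here: the attacker needs only one encoding the defender forgot, so candidate expansion reduces, but does not eliminate, the gap.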
AI safety focuses on preventing harmful outcomes from current systems; AI alignment focuses on ensuring future systems pursue intended goals. Both matter, but they require different approaches.
Fail-closed systems deny actions when controls are unavailable or uncertain; fail-open systems allow them. For AI agents, fail-closed is the only safe default.
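The fail-closed pattern reduces to a one-function sketch (names are illustrative); the entire difference between the two postures is what the exception branch returns:

```python
from typing import Callable

def guard(check: Callable[[dict], bool], action: dict) -> bool:
    """Fail-closed gate: if the safety check errors or is unavailable, deny.

    A fail-open variant would `return True` in the except branch, which
    admits actions exactly when the control layer is degraded -- the worst
    possible moment for an AI agent with tool access.
    """
    try:
        return bool(check(action))
    except Exception:
        return False
```

Wiring every tool call through a gate like this means an outage in the scanner, policy engine, or monitor degrades to "agent paused," not "agent unguarded."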
Practical, layered defenses for prompt injection in production AI systems, from input scanning to tool authorization to fail-closed policies.
AI SecLists is a curated collection of adversarial payloads for testing AI systems, organized by attack technique and maintained for current evasion methods.
Prompt injection overrides application-level instructions to hijack agent behavior; jailbreaking bypasses model-level safety training to produce restricted content.
Models that refuse harmful requests in isolation comply 71% of the time when the same requests are delivered through gradual 15-turn escalation with naturalistic framing.
Indirect prompt injection hides adversarial instructions in data sources the model processes, making it far harder to detect than direct user input attacks.
A structured methodology for adversarial testing of AI agents covering scope definition, attack surface mapping, payload development, execution, and reporting.
AI agents with tool access need runtime guardrails because model safety training alone does not prevent unauthorized actions under adversarial conditions.
A technical comparison of detection approaches for prompt injection, from regex patterns to fine-tuned classifiers to layered hybrid systems.
The Model Context Protocol is an open standard for connecting AI agents to external tools, providing a uniform interface for tool discovery, invocation, and response handling.
Prompt injection is the most critical vulnerability class in LLM-powered systems, allowing attackers to override developer instructions with adversarial input.
AI agent safety is the discipline of ensuring that AI systems with real-world tool access behave within intended boundaries, even under adversarial conditions.