Building tools instead of writing papers
We made a decision early on that every research finding we publish would be accompanied by a tool, a dataset, or a reproducible experiment. Not as supplementary material. As the primary artifact.
This is unusual for a research lab. The standard academic output is a paper. The paper describes the methodology, presents the results, and maybe links to a GitHub repository that gets updated for six months and then abandoned. The paper is the product. The code is an accessory.
We inverted this. The tool is the product. The writeup explains why the tool exists and what we learned building it. If the writeup disappeared, the tool would still be useful. If the tool disappeared, the writeup would be a description of something you cannot verify.
This approach has practical consequences. First, it forces rigor. A paper describing a finding can be wrong in subtle ways. A tool that reproduces the finding forces you to be precise about every variable, every threshold, every edge case. If the tool does not produce the claimed results, the finding is wrong. There is no ambiguity.
Second, tools compound. A paper sits in a citation graph and influences future papers. A tool sits in someone's workflow and influences their daily work. Papers inform. Tools enable. The AI SecLists repository has been used by more people than will ever read our prompt injection research. The attack surface mapper has helped more teams assess risk than our compliance analysis paper. The tool reaches users that the paper never would.
Third, tools create accountability. When you publish a tool, you are making a testable claim about its utility. If the tool does not work, people will tell you immediately. Papers can contain errors that go undetected for years. Tools get tested on day one by everyone who installs them.
The tradeoff is that tools take longer to build than papers take to write. A paper describing a finding might take two weeks. A tool that reproduces and operationalizes the finding might take two months. We accept this tradeoff because we think the research community has an oversupply of papers and an undersupply of working tools.
There is also a reproducibility argument. AI safety research has a reproducibility problem. Many findings depend on specific model versions, specific prompt formulations, and specific evaluation conditions that are not fully specified in the paper. When we publish a tool, the entire experimental setup is encoded in the code. You can run it. You will get results. Those results are directly comparable to ours because you used the same tool.
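As a minimal sketch of what "the experimental setup is encoded in the code" can mean in practice (all names and values here are hypothetical, not taken from any of our actual tools): a frozen config object pins every variable a writeup might leave underspecified, and a fingerprint of that config tells you whether two runs are actually comparable.

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class ExperimentConfig:
    """Pins every variable a paper might leave underspecified (hypothetical values)."""
    model_version: str = "example-model-2024-01-15"  # exact model ID, not "a recent model"
    prompt_template: str = "Summarize: {input}"      # the exact prompt, not a paraphrase
    temperature: float = 0.0                         # deterministic decoding
    seed: int = 1234
    pass_threshold: float = 0.8                      # the evaluation cutoff

    def fingerprint(self) -> str:
        """Stable hash of the full setup; results are comparable only when this matches."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

config = ExperimentConfig()
print(config.fingerprint())
```

Publishing the fingerprint alongside results means a reader who re-runs the tool can check, mechanically, that they reproduced the same setup rather than a near-miss of it.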
This does not mean papers have no value. Papers provide narrative, context, and interpretation that tools cannot. They situate findings within the broader research landscape. They explain why something matters, not just what it does. We write papers too. But the paper is always second to the tool.
Everything we build is MIT licensed. We want the tools to be used, modified, and extended. If someone forks the ASB Benchmark and adapts it for their domain, that is a better outcome than if they cite our paper and build something from scratch. Adoption over attribution. Utility over credit.