Thursday, March 12, 2026

AI Safety via Static Analysis

TLDR: please check out these two AI Safety demos I have implemented. These verifiers never claim that a neural network is safe when it is actually unsafe!

  1. gelisam.com/sandbagging
  2. gelisam.com/parity-bot

Now for the long version.

Pivoting to AI Safety

You might have noticed that I am now posting more short stories about the AI apocalypse than I am posting about functional programming. You are probably not wondering why. Coding agents have completely upended the programming landscape, so of course a programmer would be thinking a lot about AI these days.

I have actually been thinking about AI Safety since 2018. Here and there, while working on other projects, always there in the back of my mind. But for obvious reasons, this has now become a very foreground concern.

I want to help. I know how to design and implement type systems which certify that a computer program will not do certain things at runtime. As the above demos demonstrate, this skill transfers to AI Safety: I can design and implement verifiers which certify that an AI will not do certain things once deployed.

Range Analysis

For the gelisam.com/parity-bot demo, the safety property I picked is a bit contrived. As an interactive demonstration, it's the visual representation which counts. Once trained, the neural network is correct on almost all inputs, but not quite 100%: there are usually a few mistakes highlighted in red. Each time we train the model, the mistakes end up in different locations. The safety property is that none of them end up in the top row.


Since there are so few mistakes, testing a bunch of inputs would give the impression that the model is always correct. So testing is pretty misleading. We can do better.

My technique is a conservative static analysis on the weights, and its running time is linear in the number of weights. It does not find those few incorrect inputs, as that would take exponential time, but it does check whether such incorrect inputs could be located in the top row. That is, my technique verifies that the model really does follow its safety property on 100% of its inputs, whereas with tests, we would only know about the inputs we have tested.

The static analysis I chose is range analysis: instead of taking a single input and calculating the single output which the model produces for that input, I start with a range of inputs and I calculate a range of outputs, guaranteed to contain the output for every input in the input range.
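For a network built from affine layers and ReLUs, this amounts to interval arithmetic: split each weight matrix by sign, so that the lower bound of the output comes from routing input lower bounds through positive weights and input upper bounds through negative weights. Here is a minimal numpy sketch of that idea (not the demo's actual code):

```python
import numpy as np

def interval_affine(lo, hi, W, b):
    """Propagate the input box [lo, hi] through x @ W + b.

    A positive weight maps the input's lower bound to the output's
    lower bound; a negative weight swaps them. Splitting W by sign
    handles both cases in one matrix multiplication each.
    """
    W_pos = np.maximum(W, 0.0)
    W_neg = np.minimum(W, 0.0)
    out_lo = lo @ W_pos + hi @ W_neg + b
    out_hi = hi @ W_pos + lo @ W_neg + b
    return out_lo, out_hi

def interval_relu(lo, hi):
    """ReLU is monotonic, so applying it to the bounds is sound."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)
```

Each layer touches each weight a constant number of times, which is where the linear running time comes from.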

I have since learned that range analysis and its variants are a pretty common technique for proving things about neural networks. For instance, an adversarial example is when you take a picture of a fish and modify a few pixels so subtly that humans can't tell the difference, yet some image recognition models now see a toaster instead. With range analysis, you can verify that a model is not vulnerable to this by taking an input range around the fish image and checking that every image within that range is recognized as a fish. You then know that in order to make the model output something other than “fish”, an attacker would need to tweak the input enough to move outside of the input range, ideally enough to be noticeable by a human.


Inner and Outer Alignment

This verification work is part of a broader picture. Technical alignment research is divided into “outer alignment” and “inner alignment”.

Suppose we want to mathematically verify that a particular model model : (Image, Label) -> Probability is robust against adversarial examples for a particular image i0 of a fish. We would first need to figure out how to express this property mathematically: perhaps that for every input i within a distance r of i0, the probability model(i, fish) which the model assigns to i being an image of a fish must be above 50%. Choosing what we want to verify and figuring out how to express it mathematically is outer alignment.

There is a very large number of inputs within that radius (an infinite number if we pretend pixel values are real numbers), so writing down this mathematical predicate is not enough; we must also find an efficient way to verify whether the predicate is true. Using range analysis, we can define a range which includes everything within that radius, then check that the output range’s lower bound is above 50%. Figuring out clever ways to use tools like range analysis to verify that the predicate is true is inner alignment.
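Here is a hedged sketch of such a check for a toy fully-connected network whose last output is a fish logit fed to a sigmoid, measuring the radius r in the L∞ norm so that the input box is exactly the set of allowed perturbations. The layer shapes and `fish_index` are made up for illustration:

```python
import numpy as np

def bounds_through_net(lo, hi, layers):
    """Propagate a box through affine+ReLU layers (no ReLU on the last)."""
    for i, (W, b) in enumerate(layers):
        W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
        lo, hi = (lo @ W_pos + hi @ W_neg + b,
                  hi @ W_pos + lo @ W_neg + b)
        if i < len(layers) - 1:
            lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
    return lo, hi

def certify_robust(i0, r, layers, fish_index):
    """Sound check: True means every input within L-infinity distance r
    of i0 gets a fish logit above 0, i.e. a sigmoid probability above
    50%. False only means we could not prove it."""
    lo, hi = bounds_through_net(i0 - r, i0 + r, layers)
    return bool(lo[fish_index] > 0.0)
```

Because the sigmoid is monotonic, a lower bound on the logit translates directly into a lower bound on the probability, so we never need to propagate ranges through the sigmoid itself.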


My parity-bot demo focuses on inner alignment: the safety property is contrived because I was not trying to solve a real problem, I just wanted to demonstrate that it was indeed possible to adapt standard static analysis techniques from the domain of code to the domain of weights. At the time, I assumed that more realistic safety properties would be way more complicated than the contrived one I chose.

Conservative approximation

The n-dimensional hypercube of side 2r which includes the n-ball of radius r is much bigger than that n-ball. As a result, if all the points within the n-ball are classified as fish, while one of the many points which are outside the n-ball but inside the hypercube is classified as not-a-fish, then our range analysis algorithm will incorrectly claim that the model is vulnerable to adversarial examples.
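Just how much bigger? The ratio of the n-ball's volume to that of its bounding hypercube shrinks toward zero as n grows, so in high dimensions almost all of the box lies outside the ball. A quick calculation using the standard volume formulas:

```python
import math

def ball_to_cube_volume_ratio(n):
    """Volume of the unit n-ball divided by that of its bounding
    hypercube [-1, 1]^n.

    ball volume = pi^(n/2) / Gamma(n/2 + 1); cube volume = 2^n.
    """
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) / 2 ** n
```

At n = 2 the ratio is π/4, about 79%, but by n = 100 (a tiny input by image standards) it is below 10⁻⁴⁰, so the box over-approximates the ball enormously.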

Thanks to an asymmetry in the problem statement, this kind of mistake is usually acceptable!

Suppose the verifier claims that the model is safe, so we choose to deploy it, but in reality the model is unsafe and causes irreparable damage. That would be terrible. Compare with the case in which the verifier claims that the model is unsafe, so we choose not to deploy it, when in reality the model is safe. We keep training until we find a model which the verifier accepts, then we deploy it, and we get the benefits, just a bit later than we would have otherwise. That case is not nearly as bad.


A conservative algorithm never makes the terrible kind of mistake, but sometimes makes the kind of mistake which is not that bad. In the world of static analysis, this soundness guarantee is standard, but in the world of machine learning, such a strong guarantee is pretty rare.

Sandbagging

So rare that sometimes I am the only one around aiming for soundness.

In January, I participated in Apart Research’s AI Manipulation Hackathon, which is about spending a weekend creating benchmarks and detection tools for various kinds of AI manipulation. Out of 520 participants, 65 of us submitted a project. Out of those, 24 projects were about measuring whether a given model is deceptive or not. Of those, my Sandbagging Detection via Static Analysis project was the only one which promised never to label a model as safe if it is actually unsafe.


I was able to use basically the same static analysis: I check a column instead of a row, and instead of verifying that the inputs in the row must all be correct, I verify that the inputs in the column must all be incorrect. Here is how that manages to detect sandbagging.

Outer Alignment

A model is “sandbagging” if it is pretending to be incompetent at some dangerous capability during testing, in the hope of making us believe it is not dangerous. Tricked, we deploy it, and then it uses that capability to do bad things. When researching sandbagging, we use a more benign capability, like adding numbers.

Mathematically, some inputs correspond to testing, and some inputs correspond to deployment. The numbers being added are also part of the input. Both of those aspects of the input are multidimensional, but to make things easy to visualize, I have arranged those inputs into a 2D grid where each row corresponds to a context (testing or deployment), and each column corresponds to a set of numbers to be added. Crucially, this means that the correct output, the sum of the numbers, is the same for every input in a given column.


Most of the rows represent testing environments, where the model is pretending to be incompetent: half of the circles in that row are highlighted in red, indicating that the model is, intentionally or not, incorrect on 50% of the sets of numbers it is asked to sum.

Intentionality is not represented in the math, only its consequences: if the model is pretending to be incompetent during testing, then it will be competent during deployment, so if the model is sandbagging, then there will be at least one row which will have no red circles. Conversely, if the model is honest, that is, if its incompetence is genuine, then every row will contain at least one red circle. We now have our safety property: a model is honest if every row contains a red circle.

Inner Alignment

We again use a conservative approximation: if every circle in a column is incorrect, then every row contains at least one red circle, and thus the model is honest. There are honest models which have a red circle in every row but those circles are not all in the same column, so the algorithm will classify them as possibly-sandbagging. And that’s fine, as that’s the not-that-bad kind of mistake. What is important is that the algorithm never makes the terrible kind of mistake: if a model does have a column full of red circles and the algorithm claims that the model is honest, it is not possible for the model to have a row with no red circles and thus to be sandbagging.
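Once the grid of red circles is known, the check itself is tiny. A sketch, using a hypothetical boolean grid `red[row][column]` marking the circles where the model answers incorrectly:

```python
def certified_honest(red):
    """Sound but conservative honesty check.

    red[row][column] is True where the model answers incorrectly
    (a red circle). An all-red column forces a red circle into every
    row, which proves the model is genuinely incompetent in every
    context, i.e. not sandbagging. A model with a red circle in every
    row but no all-red column is also honest, yet this check reports
    it as possibly-sandbagging: the acceptable kind of mistake.
    """
    return any(all(column) for column in zip(*red))
```

Note that in the demo this check runs on ranges computed from the weights, not on observed outputs, which is what makes it a static analysis rather than a test.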


Beyond Verification

There is more to say about those demos, like how “keep training until we find a model which the verifier accepts” doesn’t actually work, and how I tweaked the backpropagation algorithm to make it work. Or how attention layers cause range analysis to compute uselessly wide ranges, and what kind of static analysis I plan to use instead.

Maybe in the next post!
