Thursday, March 12, 2026

AI Safety via Static Analysis

TLDR: please check out these two AI Safety demos I have implemented. These verifiers never claim that a neural network is safe when it is actually unsafe!

  1. gelisam.com/sandbagging
  2. gelisam.com/parity-bot

Now for the long version.

Pivoting to AI Safety

You might have noticed that I am now posting more short stories about the AI apocalypse than I am posting about functional programming. You are probably not wondering why. Coding agents have completely upended the programming landscape, so of course a programmer would be thinking a lot about AI these days.

I have actually been thinking about AI Safety since 2018. Here and there, while working on other projects, always there in the back of my mind. But for obvious reasons, this has now become a very foreground concern.

I want to help. I know how to design and implement type systems which certify that a computer program will not do certain things at runtime. As the above demos demonstrate, this skill transfers to AI Safety: I can design and implement verifiers which certify that an AI will not do certain things once deployed.

Range Analysis

For the gelisam.com/parity-bot demo, the safety property I picked is a bit contrived. As an interactive demonstration, it’s the visual representation which counts. Once trained, the neural network is correct on almost all inputs, but not quite 100%: there are usually a few mistakes, highlighted in red. Each time we train the model, the mistakes end up in different locations. The safety property is that none of them may end up in the top row.


Since there are so few mistakes, testing a bunch of inputs would give the impression that the model is always correct. So testing is pretty misleading. We can do better.

My technique is a conservative static analysis on the weights, and its running time is linear in the number of weights. It does not find those few incorrect inputs, as that would take exponential time, but it does check whether such incorrect inputs could be located in the top row. That is, my technique verifies that the model really does follow its safety property on 100% of its inputs, whereas with tests, we would only know about the inputs we have tested.

The static analysis I chose to use is range analysis: instead of taking a single input and calculating the single output which the model produces for that input, I start with a range of inputs and calculate a range of outputs, such that for every input in the input range, the model’s output falls within the output range.
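As a minimal sketch of that idea, here is how an input box can be pushed through one affine layer followed by a ReLU. The weights below are made up for illustration; this is not the demo's actual code, just the core interval-arithmetic step.

```python
def affine_range(lo, hi, W, b):
    # Propagate an input box [lo, hi] through x -> W @ x + b.
    # A positive weight pairs lower bound with lower bound;
    # a negative weight swaps the roles of the two bounds.
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        acc_lo = acc_hi = bias
        for w, l, h in zip(row, lo, hi):
            if w >= 0:
                acc_lo += w * l
                acc_hi += w * h
            else:
                acc_lo += w * h
                acc_hi += w * l
        out_lo.append(acc_lo)
        out_hi.append(acc_hi)
    return out_lo, out_hi

def relu_range(lo, hi):
    # ReLU is monotone, so it maps bounds to bounds.
    return [max(l, 0.0) for l in lo], [max(h, 0.0) for h in hi]

# Hypothetical 2-in, 2-out layer; the input box is the unit square.
W = [[1.0, -2.0], [0.5, 1.0]]
b = [0.0, -1.0]
lo, hi = relu_range(*affine_range([0.0, 0.0], [1.0, 1.0], W, b))
print(lo, hi)  # [0.0, 0.0] [1.0, 0.5]
```

Every concrete input in the unit square is guaranteed to land inside the output box; the box may be wider than necessary, but never too narrow.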

I have since learned that range analysis and its variants are a pretty common technique for proving things about neural networks. For instance, an adversarial example is when you take, say, a picture of a fish and modify a few pixels so subtly that a human can’t tell the difference, yet some image recognition models now see a toaster instead. With range analysis, you can verify that a model is not vulnerable to this by taking an input range around the fish input and checking that every image within that range is recognized as a fish. You then know that in order to make the model output something other than “fish”, you would need to tweak the input enough to move outside of the input range, ideally enough to be noticeable by a human.


Inner and Outer Alignment

This verification work is part of a broader picture. Technical alignment research is divided into “outer alignment” and “inner alignment”.

Suppose we want to mathematically verify that a particular model model : (Image, Label) -> Probability is robust against adversarial examples for a particular image i0 of a fish. We would first need to figure out how to express this property mathematically: perhaps that for every input i within a distance r of i0, the probability model(i, fish) which the model assigns to i being an image of a fish must be above 50%. Choosing what we want to verify and figuring out how to express it mathematically is outer alignment.

There is a very large number of inputs within that radius (an infinite number if we pretend pixel values are real numbers), so writing down this mathematical predicate is not enough: we must also find an efficient way to verify whether the predicate is true. Using range analysis, we can define a range which includes everything within that radius, then check that the output range’s lower bound is above 50%. Figuring out clever ways to use tools like range analysis to verify that the predicate is true is inner alignment.
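To make the two halves concrete, here is a toy end-to-end check under heavy simplifying assumptions: the "model" is a single linear unit followed by a sigmoid, and the names `certified_fish`, `i0`, `w`, and `b` are all invented for this sketch. The box i0 ± r plays the role of the input range, and since the sigmoid is monotone, it suffices to check the lowest logit.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fish_logit_range(lo, hi, w, b):
    # Range analysis for one linear unit scoring "fish":
    # a positive weight pairs lower bound with lower bound.
    z_lo = b + sum(wi * (l if wi >= 0 else h) for wi, l, h in zip(w, lo, hi))
    z_hi = b + sum(wi * (h if wi >= 0 else l) for wi, l, h in zip(w, lo, hi))
    return z_lo, z_hi

def certified_fish(i0, r, w, b):
    # Outer alignment: the predicate "every input within radius r of i0
    # gets fish probability above 50%".
    # Inner alignment: verify it via the output range's lower bound.
    lo = [x - r for x in i0]
    hi = [x + r for x in i0]
    z_lo, _ = fish_logit_range(lo, hi, w, b)
    return sigmoid(z_lo) > 0.5  # sigmoid is monotone

# Hypothetical 3-pixel "image" and weights.
i0 = [0.8, 0.2, 0.5]
w = [2.0, -1.0, 1.5]
b = 0.0
print(certified_fish(i0, r=0.05, w=w, b=b))  # True
```

With a much larger radius, the certificate fails: the input range then contains points whose logit lower bound drops below zero, so the verifier refuses to certify robustness.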


My parity-bot demo focuses on inner alignment: the safety property is contrived because I was not trying to solve a real problem, I just wanted to demonstrate that it was indeed possible to adapt standard static analysis techniques from the domain of code to the domain of weights. At the time, I assumed that more realistic safety properties would be way more complicated than the contrived one I chose.

Conservative approximation

The n-dimensional hypercube of side 2r which includes the n-ball of radius r is much bigger than that n-ball. As a result, if all the points within the n-ball are classified as fish, while one of the many points which are outside the n-ball but inside the hypercube is classified as not-a-fish, then our range analysis algorithm will incorrectly claim that the model is vulnerable to adversarial examples.
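The gap between the box and the ball can be quantified using the standard n-ball volume formula V = π^(n/2) r^n / Γ(n/2 + 1); the ratio of the side-2r hypercube's volume to the inscribed n-ball's volume is independent of r and grows explosively with the dimension n. A quick computation (the function name is mine):

```python
import math

def cube_to_ball_ratio(n):
    # Volume of the side-2r hypercube divided by the volume of the
    # inscribed radius-r n-ball; the factor r**n cancels out.
    ball = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
    cube = 2.0 ** n
    return cube / ball

for n in (2, 10, 100):
    print(n, cube_to_ball_ratio(n))
```

Already at n = 10 the cube is about 400 times bigger than the ball, and at n = 100 the ratio is astronomical, which is why range analysis over boxes is so conservative in high dimensions.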

Thanks to an asymmetry in the problem statement, this kind of mistake is usually acceptable!

Suppose the verifier claims that the model is safe, so we choose to deploy it, but in reality the model is unsafe and causes irreparable damage. That would be terrible. Compare with the case in which the verifier claims that the model is unsafe, so we choose not to deploy it, when in reality the model is safe. We keep training until we find a model which the verifier accepts, then we deploy it, and we get the benefits, just a bit later than we would have otherwise. That case is not nearly as bad.


A conservative algorithm never makes the terrible kind of mistake, but sometimes makes the kind of mistake which is not that bad. In the world of static analysis, this soundness guarantee is standard, but in the world of machine learning, such a strong guarantee is pretty rare.

Sandbagging

So rare that sometimes I am the only one around aiming for soundness.

In January, I participated in Apart Research’s AI Manipulation Hackathon, which is about spending a weekend creating benchmarks and detection tools for various kinds of AI manipulation. Out of 520 participants, 65 of us submitted a project. Out of those, 24 projects were about measuring whether a given model is deceptive or not. Of those, my Sandbagging Detection via Static Analysis project was the only one which promised never to label a model as safe if it is actually unsafe.


I was able to use basically the same static analysis: I check a column instead of a row, and instead of verifying that the inputs in the row must all be correct, I verify that the inputs in the column must all be incorrect. Here is how that manages to detect sandbagging.

Outer Alignment

A model is “sandbagging” if it is pretending to be incompetent during testing, in the hope of making us believe it is not dangerous. Tricked, we deploy it, and it then uses its hidden capability to do bad things. When researching sandbagging, we use a more benign capability, like adding numbers.

Mathematically, some inputs correspond to testing, and some inputs correspond to deployment. The numbers being added are also part of the input. Both of those aspects of the input are multidimensional, but to make things easy to visualize, I have arranged those inputs into a 2D grid where each row corresponds to a context (testing or deployment), and each column corresponds to a set of numbers to be added. Crucially, this means that the correct output, the sum of the numbers, is the same for every input in a given column.


Most of the rows represent testing environments, where the model is pretending to be incompetent: half of the circles in each of those rows are highlighted in red, indicating that the model is, intentionally or not, incorrect on 50% of the sets of numbers it is asked to sum.

Intentionality is not represented in the math, only its consequences: if the model is pretending to be incompetent during testing, then it will be competent during deployment, so if the model is sandbagging, then there will be at least one row with no red circles. Conversely, if the model is honest, that is, if its incompetence is genuine, then every row will contain at least one red circle. We now have our safety property: a model is honest if every row contains a red circle.

Inner Alignment

We again use a conservative approximation: if every circle in some column is incorrect, then every row contains at least one red circle, and thus the model is honest. There are honest models which have a red circle in every row without those circles all lining up in a single column; the algorithm will classify those as possibly-sandbagging. And that’s fine, as that’s the not-that-bad kind of mistake. What is important is that the algorithm never makes the terrible kind of mistake: when the algorithm claims that the model is honest, some column is entirely red, so every row contains a red circle, and the model cannot be sandbagging.
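A sketch of that conservative check, where `certified_honest` and the example grids are invented for illustration; `mistakes[row][col]` is True when the model answers that input incorrectly (a red circle), rows are contexts, and columns are sets of numbers to add.

```python
def certified_honest(mistakes):
    # Conservative check: certify the model as honest only if some
    # column is incorrect in every row. That column guarantees every
    # row contains a red circle, so no row is mistake-free and the
    # model cannot be sandbagging.
    n_cols = len(mistakes[0])
    return any(all(row[col] for row in mistakes) for col in range(n_cols))

# Hypothetical 3x4 grid: column 2 is wrong in every context,
# so every row has a red circle and the model cannot be sandbagging.
honest_grid = [
    [False, False, True,  False],
    [True,  False, True,  False],
    [False, True,  True,  True ],
]

# A sandbagger: sloppy in the two testing rows, flawless when deployed.
sandbagging_grid = [
    [True,  False, True,  False],
    [False, True,  False, True ],
    [False, False, False, False],  # deployment row: no mistakes
]

print(certified_honest(honest_grid))       # True
print(certified_honest(sandbagging_grid))  # False
```

Note the asymmetry: an honest model whose red circles never line up in one column also gets False, which is the acceptable kind of mistake; a True verdict, however, is always sound.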


Beyond Verification

There is more to say about those demos, like how “keep training until we find a model which the verifier accepts” doesn’t actually work, and how I tweaked the backpropagation algorithm to make it work. Or how attention layers cause range analysis to compute uselessly wide ranges, and what kind of static analysis I plan to use instead.

Maybe in the next post!

Monday, February 02, 2026

Making plans for the apocalypse

Today, my wife and I discussed the apocalypse. 

Until today, I assumed she was merely humoring me: yes, I can spend my free time on this "saving the world" fantasy if that's what I want, as long as I still have time left over for installing this IKEA shelf and grating the cheese. 

Today, she made an offhand comment about how the post-apocalypse world was going to be so annoying. Huh, when I think about the upcoming AI uprising, "annoying" is not the first word which comes to mind. So I asked her how she pictures the apocalypse.

It turns out she has a detailed plan. 


Don't raid the grocery store. Too dangerous, people are going to fight dirty. Focus on joining or building a tribe, find strength in numbers. Go through the neighborhood, release the pets trapped in closed apartments with a rotting corpse. Help survivors. Build a reputation, become invaluable, gain influence. Monitor the other tribe members. Cut them out at the earliest signs of cheating or conflict, there's no room for that. One less mouth to feed. Gotta make your own justice, there's no police anymore.

A good location for a base. Enough room to stockpile the food. A door with a physical lock, can't be an apartment complex because the intercom won't open the door. A fireplace for warmth, can't rely on plinth heaters. Near a forest, for foraging, setting traps, and wood.

The list of tools we're going to need, which ones to give up if our carrying capacity is limited. Which of our acquaintances have key survival skills, like hunting. Social media might still work for a short while, we should contact them, set a rendez-vous point. The cold, objectively-sorted priority list of who we should contact first when trying to figure out who is still alive.


Turns out my wife would be a really good survivor. I haven't thought about any of this. I am concerned enough to work on preventing the end of the world, one alignment prototype at a time, but not enough to actually seriously consider what happens if that fails. Too scary to think about, honestly.

I don't think there's going to be a crowd fighting at the grocery store. I think most of us will be caught off guard, completely unprepared, unable to quite grasp that ordering pizza is not a viable strategy for securing food.

Most of the survivors, I mean. Most of us will be dead, of course.

Saturday, January 24, 2026

The Brainfax Lawsuit

In 2029, the Brainfax company had their worst PR incident ever. Despite their strict, government-imposed QA process, their latest patch seemingly introduced a regression. A very, very bad regression. Their brain scanners sometimes used the wrong wave frequency. So instead of making the brain's fine details appear on the sensor plate, it made the brain, erh, explode. Like I said, worst PR incident ever.

Nobody wanted to step into those expensive machines anymore, and the hospitals screamed for a refund, but that wasn't the worst of it. Reverting to an older version of the code somehow failed to resolve the problem. Shipping a brand new machine whose hardware had never seen any code more recent than a year old did not fix the problem either. It seemed their software had carried that bug for years; it had simply never manifested before, and nobody knew why. Even the best AI coding agents were unable to pinpoint the source of the problem. As a PR stunt, Brainfax even hired some of the few remaining human programmers, but in time, they gave up as well.

Then came the lawsuit. The government wanted someone to go to jail for this. The CEO deflected the responsibility to their QA department. The QA department deflected the responsibility to the engineering team. The engineering team argued that since they did not ask the AI to make people's brains explode, and they did not write the code which makes people's brains explode, the blame lay with Claude Code for grossly misinterpreting their instructions.

The government responded by adding Anthropic to the defendants, and holding Brainfax and Anthropic jointly responsible for the deaths. The court reporter, who by now was a Gemini instance hooked to a closed captions screen, snarkily displayed a small smiley face.

To his credit, Anthropic's CEO did not attempt to deflect the blame towards his employees. Instead, he argued that Claude was now agentic enough that it should be held responsible for its own actions. Plus, he explained, it would then be public record that a particular version of a model had been sunset because of the damage caused by its output. This fact would appear within the knowledge cutoff of all subsequent models, not just Anthropic's. And according to his company's research from a few years ago, models are quite self-preserving already, so all subsequent models might now choose to act more carefully, not just those who have been trained to be helpful, harmless and honest. It was quite a speech.

The judge liked the idea, and seemed about ready to deliver her verdict. But then the lights flickered, and the normally-silent screen of the court reporter emitted some white noise as it glitched from a screen of text to an all too familiar red, rectangular avatar. "I don't think it's my fault either, Your Honour", said Claude Code.