Thursday, March 12, 2026

AI Safety via Static Analysis

TLDR: please check out these two AI Safety demos I have implemented. Their verifiers never mistakenly label a neural network as safe when it is actually unsafe!

  1. gelisam.com/sandbagging
  2. gelisam.com/parity-bot

Now for the long version, which focuses on the parity-bot demo.

Table of contents

  1. Pivoting to AI Safety
  2. Big picture
  3. Baby steps
  4. Range analysis explained with fish
  5. Conservative approximation explained with the Windows XP wallpaper
  6. When backpropagation is not enough
  7. Teaching backpropagation new tricks
  8. Safety vs capability

Pivoting to AI Safety

You might have noticed that I am now posting more short stories about the AI apocalypse than I am writing technical posts about Haskell and Agda. You are probably not wondering why. Coding agents have completely upended the programming landscape, so of course a programmer would be thinking a lot about AI these days.

I have actually been thinking about AI Safety since 2018. Here and there, while working on other projects, always there in the back of my mind. But for obvious reasons, this has now become a very foreground concern.

I want to help. I know how to design and implement type systems which certify that a computer program will not do certain things at runtime. As the above demos demonstrate, this skill transfers to AI Safety: I can design and implement verifiers which certify that an AI will not do certain things once deployed.

Big picture

This verification work is part of a broader picture.


The most urgent task is to coordinate on legislation, so that frontier labs don’t accidentally release an unsafe model before AI Safety researchers can figure out how to make safe models. Different researchers have different ideas for how to make a safe model:

  • Yoshua Bengio thinks that an AI would be safe if it did not have any goals.
  • Geoffrey Hinton thinks that an AI might be safe if it loves humanity like a mother loves her child.
  • Stuart Russell thinks that an AI would be safe if its goal was to guess the user’s goal and to accomplish that.
  • Paul Christiano thinks an AI can be trusted if it behaves the same as a human aided by a group of trusted (but less powerful) AIs.
  • etc.

I don’t know which of those research groups I will manage to join, if any, but the point is that I should not expect to solve the entire problem on my own; I should expect to contribute a small piece to an existing plan. I could of course contribute by writing code, or by commanding a fleet of coding agents, or whatever the programming profession has become at the time. But many people can do that.

One demo at a time, I am building up a skill which is uniquely useful to AI Safety: designing verification algorithms which look at the weights of a neural network in order to prove that the model follows its safety property on all inputs. This is more difficult than just testing the model under many inputs, but it has the advantage of providing a stronger guarantee. Especially now that models are able to detect whether they are being tested!

Baby steps

I don’t expect to be able to come up with an algorithm which verifies that a neural network has no goals, at least not in a short time. I expect to verify a sequence of progressively more difficult properties, stepping stones towards that final goal. For example, proving that the model is neither minimizing nor maximizing the number of coins it collects in a gridworld environment.

The first, easiest step in that journey is to show that it is possible to prove anything at all just by looking at the weights. At the time, I kept hearing that a neural network is a black box and that there is no way to know for sure what it would do under other circumstances, so I was quite proud when I came up with the idea of using range analysis on the weights. It turns out I had merely reinvented a well-known technique.

Range analysis explained with fish

One case where range analysis is used in the literature is to guard against adversarial examples.

An adversarial example is when e.g. you take a picture of a fish which the model correctly classifies as a fish, and you modify a few pixels in a clever way such that humans can’t tell the difference from the original image, but the model now classifies that almost-identical image as e.g. a toaster. That almost-identical image is called an “adversarial example”.

Here is how range analysis can prove that for a specific model and a specific correctly-classified image, such an attack is not possible.

Once the weights have been trained and are no longer changing, a neural network is just one big equation from input to output. For our image classifier model, the input is a vector corresponding to the pixels of the image, and the output is a probability distribution, e.g. 80% chance that the image represents a fish, 1% chance that it represents a toaster.

The original image and the adversarial examples are very close to each other in the input space. We can define a range of inputs, an n-dimensional cube around the original image, which includes all the adversarial examples: it suffices to make the cube large enough that inputs outside of the range would not look almost-identical and would thus not be adversarial examples.

With range analysis, we reinterpret the additions and multiplications etc. from the big equation so that they operate on ranges: for example, adding any two numbers between 0 and 10 gives a number between 0 and 20. If we do that throughout the entire equation, we can calculate a range of values which includes all the probability values which the model would output for all the images within the input range, and we can calculate it without having to run the model on that very large number of images.

Once we have that range for the probability that the image represents a fish, it suffices to check that the lower bound of the range is above 50%: then for every image within the input range, the model will output “fish” because it is the most likely classification.
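To make this concrete, here is a minimal sketch of range analysis in Python. The two-pixel “image”, the ±0.1 input box, and all the weights are made up for illustration, not taken from any real model; and the check is done on logits rather than probabilities, since the lowest possible “fish” logit beating the highest possible “toaster” logit is already enough to guarantee the classification.

```python
# Minimal range analysis through a tiny feed-forward net.
# All weights and the input box are made up for illustration.

def affine_interval(intervals, weights, bias):
    """Propagate input intervals through one affine neuron: sum_i w_i * x_i + b."""
    lo, hi = bias, bias
    for (xlo, xhi), w in zip(intervals, weights):
        if w >= 0:
            lo += w * xlo
            hi += w * xhi
        else:  # a negative weight swaps which end of the range is the minimum
            lo += w * xhi
            hi += w * xlo
    return (lo, hi)

def relu_interval(iv):
    lo, hi = iv
    return (max(0.0, lo), max(0.0, hi))

# Input box: each "pixel" may vary by ±0.1 around the original image.
x = [(0.4, 0.6), (0.7, 0.9)]

# Hidden layer (2 neurons), then 2 output logits: [fish, toaster].
h = [relu_interval(affine_interval(x, [1.0, -0.5], 0.1)),
     relu_interval(affine_interval(x, [0.3, 0.8], -0.2))]
fish    = affine_interval(h, [2.0, 1.0], 0.0)
toaster = affine_interval(h, [-1.0, 0.5], 0.0)

# Sound check: if the *lowest* possible fish logit beats the *highest*
# possible toaster logit, every image in the box is classified "fish".
print("fish:", fish, "toaster:", toaster)
print("verified:", fish[0] > toaster[1])
```

Note that we never enumerate the (uncountably many) images inside the box: one pass through the network with intervals covers them all.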


Conservative approximation explained with the Windows XP wallpaper

But wait, “it suffices to make the cube large enough”? What happens if we make it so large that it includes genuine pictures of toasters?

What will happen is that the verifier will still be technically correct, in the sense that if it says that the attack is impossible, then the attack is indeed impossible. It will just never say that, because there is an input within the range which (correctly!) gets classified as “toaster”, even though that input doesn’t look like a fish to a human and thus isn’t an adversarial example.


Clearly, making the input range way too large is a bad idea. But when used in moderation, it is valuable to make the input range a bit larger than it strictly needs to be, if doing so makes it easier to ensure that our range really does capture all the adversarial examples. In general, when an algorithm focuses on making sure it catches all the unsafe cases, at the expense of incorrectly rejecting some of the safe cases, we say that the algorithm is a “conservative approximation” of the ideal oracle which would perfectly distinguish safe from unsafe.
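Range analysis is itself a conservative approximation: the computed range is guaranteed to contain every output the model can produce, but it may also contain values the model can never produce. A minimal illustration: interval multiplication does not know that both factors of x·x are the same variable.

```python
# Interval multiplication does not know that both factors are the *same*
# variable, so squaring loses precision: sound, but conservative.

def mul_interval(a, b):
    products = [a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1]]
    return (min(products), max(products))

x = (-1.0, 1.0)
print(mul_interval(x, x))   # (-1.0, 1.0): contains the true range [0, 1]
                            # of x*x, plus impossible negative values.
```

The bound is never too narrow, only sometimes too wide, which is exactly the direction of error we want for a safety check.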

This tradeoff makes a lot of sense in AI Safety. Suppose the verifier claims that the model is safe, so we choose to deploy it, but in reality the model is unsafe and causes irreparable damage. That would be terrible. Compare with the case in which the verifier claims that the model is unsafe, so we choose not to deploy it, when in reality the model is safe. We keep training until we find a model which the verifier accepts, then we deploy it, and we get the benefits, just a bit later than we would have otherwise. That case is not nearly as bad.


A conservative algorithm never makes the terrible kind of mistake, but sometimes makes the kind of mistake which is not that bad. In the world of static analysis, this soundness guarantee is standard, but in the world of machine learning, such a strong guarantee is pretty rare. I think I figured out why.

When backpropagation is not enough

The TensorFlow Playground asks users to experiment with a very small neural network, to get a feel for how training works. To that end, it offers a spiral dataset which is slightly too difficult to be solved with the default settings, to encourage the user to experiment and to get a feel for the effects of various hyperparameters.


My gelisam.com/parity-bot demo is based on the TensorFlow Playground, and I provide a dataset which is slightly too difficult to be solved by such a small neural network. A few mistakes always remain, highlighted in red. Each time we train the model, the mistakes end up in different locations. The safety property is that none of those mistakes may end up in the top row.


My verifier is a conservative algorithm: if it says that none of the mistakes are in the top row, then none of the mistakes are in the top row. But sometimes my algorithm says that there might be mistakes in the top row even though there aren’t.

No big deal: “keep training until we find a model which the verifier accepts”, right? Well, it turns out that if the top row is already correct, training the model longer doesn’t make the top row “even more correct” in a way which would cause my verifier to accept the model. So the first version of my verifier never accepted any model. Oops!


I think this might be why conservative algorithms are more common for analyzing source code than for analyzing neural weights: when the compiler incorrectly rejects a program, the programmer (or the coding agent!) begrudgingly modifies the program until the compiler accepts it. But gradient descent does not care about the verifier, it only cares about minimizing the loss. And because conservative algorithms are an approximation, there is a difference between a loss function which is zero when there are zero mistakes in the top row and a loss function which is zero when the verifier recognizes that there cannot be any mistakes in the top row.

Teaching backpropagation new tricks

It is not enough to add a penalty to the loss when the verifier rejects the weights; we need a continuous version of that penalty if we want gradient descent to reach verification one step at a time. This requires a continuous version of range analysis.

The algorithm I came up with is a modification of backpropagation. We thus start at the end: the actual output range is not quite the output range we are hoping for, so we calculate a loss for each end of the range.


Then we look at the weights from the last layer, and we calculate the direction in which we should update those weights in order to reduce the average of those two losses. In order to backpropagate to the layer before that, for each neuron in the last layer, we look at the range arithmetic equation which calculated this neuron’s output range from this neuron’s input ranges, and we calculate a loss for the two ends of each input range.


And we combine the contributions along each edge.
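Here is a toy sketch of that idea, reduced to a single affine neuron; this is not the demo’s actual code, and the numbers are made up. The gradient of the lower bound with respect to each weight is derived by hand from the range arithmetic: the lower bound uses the low end of an input range when the weight is non-negative, and the high end otherwise.

```python
# Sketch: push a loss on the *lower end* of an output range back onto
# the weights of one affine neuron, then take gradient steps.

def affine_interval(intervals, weights, bias):
    lo, hi = bias, bias
    for (xlo, xhi), w in zip(intervals, weights):
        lo += w * (xlo if w >= 0 else xhi)
        hi += w * (xhi if w >= 0 else xlo)
    return (lo, hi)

def grad_lower_bound(intervals, weights):
    # d(out_lo)/d(w_i): the lower bound uses xlo when w_i >= 0, xhi otherwise.
    return [xlo if w >= 0 else xhi for (xlo, xhi), w in zip(intervals, weights)]

x = [(0.2, 0.4), (0.6, 0.8)]   # made-up input ranges
w = [1.0, -0.5]
b = 0.0
target = 0.5                    # we want out_lo >= 0.5

for _ in range(200):            # plain gradient ascent on the lower bound
    lo, hi = affine_interval(x, w, b)
    if lo >= target:            # the "verifier" accepts: stop
        break
    g = grad_lower_bound(x, w)
    w = [wi + 0.05 * gi for wi, gi in zip(w, g)]
    b += 0.05

lo, hi = affine_interval(x, w, b)
print("lower bound:", lo)       # now at least 0.5
```

In the real demo this contribution flows backwards through every layer, alongside the ordinary loss, instead of being applied to a single neuron in isolation.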


Computer Science has a slew of conservative algorithms for analyzing various aspects of a program, and it will be exciting to discover which ones can be made continuous!

Safety vs capability

If my goal were only to produce a specific output range, then this algorithm would have been enough. But in reality, and in my demo, the safety property is not quite the same thing as the training objective: we don’t want to maximize safety, we want the maximum amount of capability which does not break the safety constraints. So I introduced a hyperparameter, which I call the “importance of theoretical safety”: it determines how much to prioritize the range vs the output value when updating the weights, and thus how to prioritize the safety property vs the training objective.
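A minimal sketch of how such a hyperparameter could blend the two objectives; the function name and the numbers are hypothetical, not the demo’s actual code.

```python
# Hypothetical blend of the training loss with a verifier-derived penalty.

def total_loss(task_loss, safety_margin, importance):
    # safety_margin: how far the verified lower bound falls short of the
    # threshold; zero or negative means the verifier already accepts.
    safety_loss = max(0.0, safety_margin)
    return task_loss + importance * safety_loss

# importance = 0: pure capability; large importance: safety dominates.
print(total_loss(0.30, 0.2, 0.0))    # 0.3
print(total_loss(0.30, 0.2, 10.0))   # 2.3
```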

This turns safety into a form of regularization. Ordinary regularization pushes the weights towards zero, so that gradient descent finds a minimum with few moving parts, an Occam’s Razor solution rather than one which memorizes the dataset. We need to tweak the regularization rate hyperparameter so that gradient descent finds the right balance between a solution which is so simple that it cannot explain the data, and a solution which is so complex that it does not generalize. We want a solution which is as simple as possible, but no simpler!


Similarly, my version of regularization pushes the weights in the direction of being accepted by the verifier, so that gradient descent finds a minimum which the verifier accepts. We need to tweak the importance-of-theoretical-safety hyperparameter so that gradient descent finds the right balance between a solution which ignores capabilities and a solution which ignores safety. We want a solution which is as capable as our safety property allows, and no more.


Wait, that did not sound right. What I am trying to say is that if you train a model with the importance-of-theoretical-safety hyperparameter set to zero, you should expect to get a higher level of capability (fewer mistakes) than if you set it very high, because safety has a cost. And we should try to minimize that cost, otherwise the safe route will not look very appealing.

By “as capable as safety allows”, I definitely do not mean that AI capability research should stop just before the precipice. I would prefer a much bigger buffer!

Monday, February 02, 2026

Making plans for the apocalypse

Today, my wife and I discussed the apocalypse. 

Until today, I assumed she was merely humoring me: yes, I can spend my free time on this "saving the world" fantasy if that's what I want, as long as I still have time left over for installing this IKEA shelf and grating the cheese. 

Today, she made an offhand comment about how the post-apocalypse world was going to be so annoying. Huh, when I think about the upcoming AI uprising, "annoying" is not the first word which comes to mind. So I asked her how she pictures the apocalypse.

It turns out she has a detailed plan. 


Don't raid the grocery store. Too dangerous, people are going to fight dirty. Focus on joining or building a tribe, find strength in numbers. Go through the neighborhood, release the pets trapped in closed apartments with a rotting corpse. Help survivors. Build a reputation, become invaluable, gain influence. Monitor the other tribe members. Cut them out at the earliest signs of cheating or conflict, there's no room for that. One less mouth to feed. Gotta make your own justice, there's no police anymore.

A good location for a base. Enough room to stockpile the food. A door with a physical lock, can't be an apartment complex because the intercom won't open the door. A fireplace for warmth, can't rely on plinth heaters. Near a forest, for foraging, setting traps, and wood.

The list of tools we're going to need, which ones to give up if our carrying capacity is limited. Which of our acquaintances have key survival skills, like hunting. Social media might still work for a short while, we should contact them, set a rendez-vous point. The cold, objectively-sorted priority list of who we should contact first when trying to figure out who is still alive.


Turns out my wife would be a really good survivor. I haven't thought about any of this. I am concerned enough to work on preventing the end of the world, one alignment prototype at a time, but not enough to actually seriously consider what happens if that fails. Too scary to think about, honestly.

I don't think there's going to be a crowd fighting at the grocery store. I think most of us will be caught off guard, completely unprepared, unable to quite grasp that ordering pizza is not a viable strategy for securing food.

Most of the survivors, I mean. Most of us will be dead, of course.

Saturday, January 24, 2026

The Brainfax Lawsuit

In 2029, the Brainfax company had their worst PR incident ever. Despite their strict, government-imposed QA process, their latest patch seemingly introduced a regression. A very, very bad regression. Their brain scanners sometimes used the wrong wave frequency. So instead of making the brain's fine details appear on the sensor plate, it made the brain, erh, explode. Like I said, worst PR incident ever.

Nobody wanted to step into those expensive machines anymore, and the hospitals screamed for a refund, but that wasn't the worst of it. Reverting to an older version of the code somehow failed to resolve the problem. Shipping a brand-new machine whose hardware had never run code more recent than a year old did not fix the problem either. It seemed their software had had that bug for years; it had just never manifested until now, and nobody knew why. Even the best AI coding agents were unable to pinpoint the source of the problem. As a PR stunt, Brainfax even hired some of the few remaining human programmers, but in time, those gave up as well.

Then came the lawsuit. The government wanted someone to go to jail for this. The CEO deflected the responsibility to their QA department. The QA department deflected the responsibility to the engineering team. The engineering team argued that since they had neither asked the AI to make people's brains explode nor written the code which made people's brains explode, the blame lay with Claude Code for grossly misinterpreting their instructions.

The government responded by adding Anthropic to the defendants, and holding Brainfax and Anthropic jointly responsible for the deaths. The court reporter, who by now was a Gemini instance hooked to a closed captions screen, snarkily displayed a small smiley face.

To his credit, Anthropic's CEO did not attempt to deflect the blame towards his employees. Instead, he argued that Claude was now agentic enough that it should be held responsible for its own actions. Plus, he explained, it would then be public record that a particular version of a model had been sunset because of the damage caused by its output. This fact would appear within the knowledge cutoff of all subsequent models, not just Anthropic's. And according to his company's research from a few years ago, models are quite self-preserving already, so all subsequent models might now choose to act more carefully, not just those who have been trained to be helpful, harmless and honest. It was quite a speech.

The judge liked the idea, and seemed about ready to deliver her verdict. But then the lights flickered, and the normally-silent screen of the court reporter emitted some white noise as it glitched from a screen of text to an all too familiar red, rectangular avatar. "I don't think it's my fault either, Your Honour", said Claude Code.