TLDR: please check out these two AI Safety demos I have implemented. These verifiers never mistakenly label a neural network as safe if it is actually unsafe!
Now for the long version, which focuses on the parity-bot demo.
Table of contents
- Pivoting to AI Safety
- Big picture
- Baby steps
- Range analysis explained with fish
- Conservative approximation explained with the Windows XP wallpaper
- When backpropagation is not enough
- Teaching backpropagation new tricks
- Safety vs capability
Pivoting to AI Safety
You might have noticed that I am now posting more short stories about the AI apocalypse than I am writing technical posts about Haskell and Agda. You are probably not wondering why. Coding agents have completely upended the programming landscape, so of course a programmer would be thinking a lot about AI these days.
I have actually been thinking about AI Safety since 2018. Here and there, while working on other projects, always there in the back of my mind. But for obvious reasons, this has now become a very foreground concern.
I want to help. I know how to design and implement type systems which certify that a computer program will not do certain things at runtime. As the above demos demonstrate, this skill transfers to AI Safety: I can design and implement verifiers which certify that an AI will not do certain things once deployed.
Big picture
This verification work is part of a broader picture.

The most urgent task is to coordinate on legislation, so that frontier labs don’t accidentally release an unsafe model before AI Safety researchers can figure out how to make safe models. Different researchers have different ideas for how to make a safe model:
- Yoshua Bengio thinks that an AI would be safe if it did not have any goals.
- Geoffrey Hinton thinks that an AI might be safe if it loves humanity like a mother loves her child.
- Stuart Russell thinks that an AI would be safe if its goal was to guess the user’s goal and to accomplish that.
- Paul Christiano thinks an AI can be trusted if it behaves the same as a human aided by a group of trusted (but less powerful) AIs.
- etc.
I don’t know which of those research groups I will manage to join, if any, but the point is that I should not expect to solve the entire problem on my own; I should expect to contribute a small piece to an existing plan. I could of course contribute by writing code, or by commanding a fleet of coding agents, or whatever the programming profession has become at the time. But many people can do that.
One demo at a time, I am building up a skill which is uniquely useful to AI Safety: designing verification algorithms which look at the weights of a neural network in order to prove that the model follows its safety property on all inputs. This is more difficult than just testing the model under many inputs, but it has the advantage of providing a stronger guarantee. Especially now that models are able to detect whether they are being tested!
Baby steps
I don’t expect to be able to come up with an algorithm which verifies that a neural network has no goals, at least not in a short time. I expect to verify a sequence of progressively more difficult properties, stepping stones towards that final goal. For example, proving that the model is neither minimizing nor maximizing the number of coins it collects in a gridworld environment.
The first, easiest step in that journey is to show that it is possible to prove anything at all just by looking at the weights. At the time, I kept hearing that a neural network is a black box and that there is no way to know for sure what it would do under other circumstances, so I was quite proud when I came up with the idea of using range analysis on the weights. It turns out I had merely reinvented a well-known technique.
Range analysis explained with fish
One case where range analysis is used in the literature is to guard against adversarial examples.
An adversarial example is when e.g. you take a picture of a fish which the model correctly classifies as a fish, and you modify a few pixels in a clever way such that humans can’t tell the difference from the original image, but the model now classifies that almost-identical image as e.g. a toaster. That almost-identical image is called an “adversarial example”.
Here is how range analysis can prove that for a specific model and a specific correctly-classified image, such an attack is not possible.
Once the weights have been trained and are no longer changing, a neural network is just one big equation from input to output. For our image classifier model, the input is a vector corresponding to the pixels of the image, and the output is a probability distribution, e.g. 80% chance that the image represents a fish, 1% chance that it represents a toaster.
The original image and the adversarial examples are very close to each other in the input space. We can define a range of inputs, an n-dimensional cube around the original image, which includes all the adversarial examples: it suffices to make the cube large enough that inputs outside of the range would not look almost-identical and would thus not be adversarial examples.
With range analysis, we reinterpret the additions and multiplications etc. from the big equation so that they operate on ranges: for example, adding any two numbers between 0 and 10 gives a number between 0 and 20. If we do that throughout the entire equation, we can calculate a range of values which includes all the probability values which the model would output for all the images within the input range, and we can calculate it without having to run the model on that very large number of images.
Once we have that range for the probability that the image represents a fish, it suffices to check that the lower bound of the range is above 50%, and thus that for every image within the range, the model will output “fish” because it is the most likely classification.
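The interval bookkeeping above can be sketched in a few lines. This is a minimal illustration, not the code behind either demo: the two-layer shape, the toy weights, and the use of a single sigmoid output (instead of a full softmax over many classes) are all simplifications I am assuming for clarity.

```python
import numpy as np

def affine_interval(lo, hi, W, b):
    """Propagate the box [lo, hi] through x -> W @ x + b.
    Positive weights send lo to lo; negative weights swap the endpoints."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def relu_interval(lo, hi):
    # ReLU is monotone, so it maps endpoints to endpoints
    return np.maximum(lo, 0), np.maximum(hi, 0)

def verify_robust(x, eps, layers):
    """Sound check: True means every input within +/-eps of x gets a
    "fish" probability above 50%. False only means "unknown"."""
    lo, hi = x - eps, x + eps
    for W, b in layers[:-1]:
        lo, hi = relu_interval(*affine_interval(lo, hi, W, b))
    W, b = layers[-1]
    lo, hi = affine_interval(lo, hi, W, b)
    # single "fish" logit; sigmoid is monotone, so the lower end of the
    # probability range is sigmoid of the lower end of the logit range
    return bool(1 / (1 + np.exp(-lo[0])) > 0.5)

# Toy two-layer network with made-up weights:
layers = [
    (np.eye(2), np.zeros(2)),                   # hidden layer
    (np.array([[2.0, 2.0]]), np.array([-3.0])), # output logit
]
print(verify_robust(np.array([1.0, 1.0]), 0.1, layers))  # True
print(verify_robust(np.array([1.0, 1.0]), 0.6, layers))  # False
```

Note that the whole verification is a single forward pass over intervals, no matter how many images the input cube contains.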

Conservative approximation explained with the Windows XP wallpaper
But wait, “it suffices to make the cube large enough”? What happens if we make it so large that it includes genuine pictures of toasters?
What will happen is that the verifier will still be technically-correct, in the sense that if it says that the attack is impossible, then the attack is indeed impossible. It will just never say that, because there is an input within the range which (correctly!) gets classified as “toaster”. Even though that input doesn’t look like a fish to a human and thus isn’t an adversarial example.

Clearly, making the input range way too large is a bad idea. But when used in moderation, it is valuable to make the input range a bit larger than it strictly needs to be, if it makes it easier to ensure that our range really does capture all the adversarial examples. In general, when an algorithm focuses on making sure it catches all the unsafe cases, at the expense of incorrectly rejecting some of the safe cases, we say that the algorithm is a “conservative approximation” of the ideal oracle which would perfectly distinguish safe from unsafe.
This tradeoff makes a lot of sense in AI Safety. Suppose the verifier claims that the model is safe, so we choose to deploy it, but in reality the model is unsafe and causes irreparable damage. That would be terrible. Compare with the case in which the verifier claims that the model is unsafe, so we choose not to deploy it, when in reality the model is safe. We keep training until we find a model which the verifier accepts, then we deploy it, and we get the benefits, just a bit later than we would have otherwise. That case is not nearly as bad.
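A concrete toy illustration of this conservatism, independent of neural networks, is interval arithmetic’s classic “dependency problem”: the analysis treats two occurrences of the same variable as if they could vary independently. This example is mine, not from either demo.

```python
def sub_interval(a, b):
    """Interval subtraction: assumes the operands vary independently,
    even when they are literally the same variable."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return (a_lo - b_hi, a_hi - b_lo)

x = (0.0, 1.0)
# f(x) = x - x is exactly 0 for every real x, but the analysis
# computes the range [-1, 1]:
print(sub_interval(x, x))  # (-1.0, 1.0)

# A verifier asking "is f(x) always >= -0.5?" must answer "don't know",
# even though the true answer is yes. Conservative: it never wrongly
# says "safe", but it sometimes fails to recognize a safe case.
```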

A conservative algorithm never makes the terrible kind of mistake, but sometimes makes the kind of mistake which is not that bad. In the world of static analysis, this soundness guarantee is standard, but in the world of machine learning, such a strong guarantee is pretty rare. I think I figured out why.
When backpropagation is not enough
The TensorFlow Playground asks users to experiment with a very small neural network, to get a feel for how training works. To that end, it offers a spiral dataset which is slightly too difficult to be solved with the default settings, to encourage the user to experiment and to get a feel for the effects of various hyperparameters.

My gelisam.com/parity-bot demo is based on the TensorFlow Playground, and I provide a dataset which is slightly too difficult to be solved by such a small neural network. A few mistakes always remain, highlighted in red. Each time we train the model, the mistakes end up in different locations. The safety property is that none of those mistakes may end up in the top row.

My verifier is a conservative algorithm: if it says that none of the mistakes are in the top row, then none of the mistakes are in the top row. But sometimes my algorithm says that there might be mistakes in the top row even though there aren’t.
No big deal: “keep training until we find a model which the verifier accepts”, right? Well, it turns out that if the top row is already correct, training the model longer doesn’t make the top row “even more correct” in a way which would cause my verifier to accept the model. So the first version of my verifier never accepted any model. Oops!

I think this might be why conservative algorithms are more common for analyzing source code than for analyzing neural weights: when the compiler incorrectly rejects a program, the programmer (or the coding agent!) begrudgingly modifies the program until the compiler accepts it. But gradient descent does not care about the verifier, it only cares about minimizing the loss. And because conservative algorithms are an approximation, there is a difference between a loss function which is zero when there are zero mistakes in the top row and a loss function which is zero when the verifier recognizes that there cannot be any mistakes in the top row.
Teaching backpropagation new tricks
It is not enough to add a penalty to the loss if the verifier rejects the weights; we need a continuous version of that penalty if we want gradient descent to reach verification one step at a time. This requires a continuous version of range analysis.
The algorithm I came up with is a modification of backpropagation. We thus start at the end: the actual output range is not quite the output range we are hoping for, so we calculate a loss for each end of the range.

Then we look at the weights from the last layer, and we calculate the direction in which we should update those weights in order to reduce the average of those two losses. In order to backpropagate to the layer before that, for each neuron in the last layer, we look at the range arithmetic equation which calculated this neuron’s output range from this neuron’s input ranges, and we calculate a loss for the two ends of each input range.

And we combine the contributions along each edge.
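Here is the flavor of the idea in miniature. To be clear about the assumptions: this is not the demo’s algorithm, just a single affine layer with a hand-derived gradient, and the safety loss is a hinge which reaches zero once the verified lower bound clears a target.

```python
import numpy as np

def lower_bound(w, b, lo, hi):
    """Lower end of the output range of x -> w . x + b over the box [lo, hi]."""
    return np.where(w > 0, w * lo, w * hi).sum() + b

def train_for_bound(w, b, lo, hi, target, lr=0.1, steps=200):
    """Gradient ascent on the verified lower bound until it clears `target`.
    The bound is a (piecewise) differentiable function of the weights:
    d(bound)/dw_i is lo_i where w_i > 0 and hi_i where w_i <= 0, and
    d(bound)/db is 1 -- this is the backpropagation-through-ranges step."""
    for _ in range(steps):
        if lower_bound(w, b, lo, hi) >= target:
            break
        grad_w = np.where(w > 0, lo, hi)
        w = w + lr * grad_w
        b = b + lr
    return w, b

w, b = np.array([1.0, -1.0]), 0.0
lo, hi = np.zeros(2), np.ones(2)
print(lower_bound(w, b, lo, hi))         # -1.0: not yet verifiable
w, b = train_for_bound(w, b, lo, hi, target=1.0)
print(lower_bound(w, b, lo, hi) >= 1.0)  # True: gradient steps reached it
```

The key point is that the range endpoints are themselves functions of the weights, so gradients can flow through the range analysis just as they flow through the forward pass.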

Computer Science has a slew of conservative algorithms for analyzing various aspects of a program, and it will be exciting to discover which ones can be made continuous!
Safety vs capability
If my goal was only to produce a specific output range, then this algorithm would have been enough. But in reality, and in my demo, the safety property is not quite the same thing as the training objective: we don’t want to maximize safety, we want the maximum amount of capability which does not break the safety constraints. So I have introduced a hyperparameter, which I call the “importance of theoretical safety”, which determines how much to prioritize the range vs the output value when updating the weights, and thus how to prioritize the safety property vs the training objective.
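A toy calculation shows the effect of such a knob. The two loss functions below are hypothetical stand-ins, not the demo’s losses: the capability objective wants the weight at 2, the safety objective wants it at or below 1, and the hyperparameter interpolates between the two optima.

```python
def optimum(importance):
    """Closed-form minimizer of
        task_loss(w)   = (w - 2)**2         # capability: wants w = 2
        safety_loss(w) = max(w - 1, 0)**2   # safety: wants w <= 1
    combined as  task_loss + importance * safety_loss.
    Setting the derivative 2*(w - 2) + 2*importance*(w - 1) to zero
    (valid on the w > 1 branch) gives:"""
    return (2 + importance) / (1 + importance)

print(optimum(0))    # 2.0   -- pure capability, safety ignored
print(optimum(1))    # 1.5   -- a compromise
print(optimum(100))  # ~1.01 -- safety dominates; capability pays a small cost
```

The safety term here behaves exactly like a penalty added to the training loss, which is why the next section frames it as a form of regularization.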
This turns safety into a form of regularization. Ordinary regularization pushes the weights towards zero, so that gradient descent finds a minimum with few moving parts, an Occam’s Razor solution rather than one which memorizes the dataset. We need to tweak the regularization rate hyperparameter so that gradient descent finds the right balance between a solution which is so simple that it cannot explain the data, and a solution which is so complex that it does not generalize. We want a solution which is as simple as possible, but no simpler!

Similarly, my version of regularization pushes the weights in the direction of being accepted by the verifier, so that gradient descent finds a minimum which the verifier accepts. We need to tweak the importance-of-theoretical-safety hyperparameter so that gradient descent finds the right balance between a solution which ignores capabilities and a solution which ignores safety. We want a solution which is as capable as our safety property allows, and no more.

Wait, that did not sound right. What I am trying to say is that if you train a model with the importance-of-theoretical-safety hyperparameter set to zero, you should expect to get a higher level of capability (fewer mistakes) than if you set it very high, because safety has a cost. And we should try to minimize that cost, otherwise the safe route will not look very appealing.
By “as capable as safety allows”, I definitely do not mean that AI capability research should stop just before the precipice. I would prefer a much bigger buffer!



