Here is the strangest fact about modern AI: nobody wrote it. A large model is not programmed with rules; it is grown from data, billions of numbers tuned until the whole thing works. It answers, reasons, and codes, and yet no one can open it up and point to where any of that lives. We built a mind we cannot read. Mechanistic interpretability is the science trying to fix that, to turn the black box into something we can actually understand, part by part.
This guide is a visual tour of how that is done: what the units of meaning inside a model really are, the strange trick (superposition) that hides them, the tool (sparse autoencoders) that pulls them back out, the circuits that wire them into behavior, and the experiments that prove what is causal. It assumes you know roughly what a neural network is. For the models themselves, see how reasoning models work.
What you'll learn
- Why a trained model is a black box, and what it would mean to read it
- Features: the real units of meaning, which are directions, not neurons
- Superposition: how a model hides more concepts than it has neurons
- Sparse autoencoders: the tool that pulls clean features back out
- Circuits: how features wire into behavior, like the induction head
- Steering and patching: how we prove a part is causal, not just correlated
What is mechanistic interpretability?
Most AI explainability asks shallow questions: which input words mattered most, or which example is similar to this one. Mechanistic interpretability asks the deep one: what algorithm is the model actually running? It treats a trained network like an alien artifact to be reverse-engineered, recovering the concepts it stores and the computations it performs, until you can describe a behavior as a mechanism rather than a mystery.
The working metaphor is a microscope. The weights are all right there, fully visible, and still illegible, the way every cell of an organism is visible under a lens but the biology is not obvious. The job is to find the structure hiding in plain sight.
Features, not neurons
The obvious place to look is the neuron. Maybe one neuron is the "cat" neuron, another the "France" neuron. Reality is messier. Probe a single neuron and you often find it fires for a jumble of unrelated things: snippets of Python, the color green, a particular surname, parts of words in Korean. Such neurons are polysemantic, which makes the neuron an unreliable unit of meaning.
The real unit is the feature: a direction in the model's activation space that corresponds to one concept. A feature is usually spread across many neurons at once, and any one neuron takes part in many features. So you cannot read a model neuron by neuron, any more than you can read a sentence letter by scrambled letter. You have to find the directions.
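To make "a feature is a direction" concrete, here is a minimal sketch in plain Python. The four-neuron space and the direction itself are made up for illustration; in a real model the directions have to be discovered, not written down. Reading a feature amounts to projecting the activation onto its direction.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

# Hypothetical feature direction in a 4-neuron activation space.
# Note that it touches every neuron, not just one.
feature_direction = unit([1.0, 1.0, -1.0, 1.0])

def feature_activation(activation, direction):
    """How strongly the feature is present: projection onto its unit direction."""
    return dot(activation, direction)

# An activation carrying 3 units of the feature: every neuron is non-zero.
on_feature = [3.0 * d for d in feature_direction]
# An activation perpendicular to the feature: neurons fire, the feature does not.
off_feature = [1.0, -1.0, 0.0, 0.0]

strength_on = feature_activation(on_feature, feature_direction)
strength_off = feature_activation(off_feature, feature_direction)
```

No single neuron in `on_feature` tells you the feature is present, and neurons are active in `off_feature` even though the feature is absent. Only the projection separates the two cases.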
Superposition: more concepts than neurons
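If features are directions, here is the twist that makes interpretability hard. A model represents far more features than it has neurons. It does this through superposition: cramming many features into the same space as directions that are only nearly separate, not perfectly. In two dimensions, for example, only two directions can be exactly perpendicular, but a model can fit in more by accepting some overlap between them. The trade pays off because most features are absent from any given input, so the overlaps rarely interfere at the same time.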
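A small experiment gives a feel for why this works. The sketch below packs 200 feature directions into a 64-neuron space and checks that a single active feature can still be identified. The random directions and the sizes are assumptions chosen for illustration; a trained model learns its directions rather than sampling them.

```python
import math
import random

random.seed(0)
N_NEURONS = 64     # dimensions available (assumed size)
N_FEATURES = 200   # concepts to store, more than there are neurons

def random_unit_vector(dim):
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

directions = [random_unit_vector(N_NEURONS) for _ in range(N_FEATURES)]

# Interference: how much two different feature directions overlap.
overlaps = [abs(dot(directions[i], directions[j]))
            for i in range(N_FEATURES) for j in range(i + 1, N_FEATURES)]
mean_overlap = sum(overlaps) / len(overlaps)

# Switch on one feature, then read every feature off the same activation.
active = 17
activation = directions[active]
readout = [dot(activation, d) for d in directions]
best_guess = max(range(N_FEATURES), key=lambda k: readout[k])
```

The active feature reads out at about 1.0, while every other feature reads out only its overlap with it, which is smaller. That gap is what lets many features share few neurons, as long as only a few are active at once.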
Pulling features apart: sparse autoencoders
If features are hidden in superposition, you need a way to pull them back out. The breakthrough tool is the sparse autoencoder (SAE), a form of dictionary learning. The trick is to re-express a layer's dense, tangled activations using a much wider set of features, while forcing almost all of them to be off at any moment. That pressure to be sparse makes each surviving feature settle onto a single, clean concept.
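The shape of the computation can be sketched in a few lines. This is a hand-built toy, not a trained SAE: the three-feature dictionary, the tied encoder and decoder weights, and the penalty value are all assumptions for illustration. It shows the two ingredients named above, a wider feature set and a penalty that rewards keeping features off.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(rows, v):
    return [sum(w * x for w, x in zip(row, v)) for row in rows]

# Hypothetical dictionary: 3 feature directions for a 2-number activation,
# so the feature set is wider than the layer it explains.
angles = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]
decoder = [[math.cos(a), math.sin(a)] for a in angles]  # one row per feature
encoder = decoder            # tied weights: an assumption to keep the toy small
enc_bias = [0.0, 0.0, 0.0]
L1_PENALTY = 0.1             # assumed strength of the sparsity pressure

def encode(x):
    pre = matvec(encoder, x)
    return relu([p + b for p, b in zip(pre, enc_bias)])

def decode(f):
    return [sum(f[k] * decoder[k][i] for k in range(len(f))) for i in range(2)]

def sae_loss(x):
    f = encode(x)
    x_hat = decode(f)
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in f)
    return reconstruction + L1_PENALTY * sparsity, f, x_hat

x = [1.0, 0.0]               # a dense activation to explain
loss, features, x_hat = sae_loss(x)
```

For this input only the first feature stays on, and it reconstructs the activation on its own. Training a real SAE means adjusting the encoder and decoder over many activations to lower this kind of loss, which is where the pressure toward clean, sparse features comes from.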
Circuits: how features wire into behavior
Features are the nouns. Circuits are the verbs: small wirings of features and attention components that carry out a specific computation. Information moves along the model's residual stream, a shared channel that each layer reads from and writes back to, and a circuit is a path through it that does one identifiable job.
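The read-and-write-back pattern is easy to sketch. The two "layers" below are hypothetical stand-ins, not real attention or MLP blocks; the point is only that each one reads the shared stream and adds its result to it, so a later layer can build on what an earlier one wrote.

```python
def run_residual_stream(embedding, layers):
    """Each layer reads the whole stream and adds its update back in."""
    stream = list(embedding)
    for layer in layers:
        update = layer(stream)                             # read
        stream = [s + u for s, u in zip(stream, update)]   # write back by addition
    return stream

# Hypothetical layer 1: copies what is in slot 0 into slot 1.
def copy_slot0_to_slot1(stream):
    return [0.0, stream[0], 0.0]

# Hypothetical layer 2: reads slot 1 and writes double that into slot 2.
def double_slot1_into_slot2(stream):
    return [0.0, 0.0, 2.0 * stream[1]]

final = run_residual_stream([1.0, 0.0, 0.0],
                            [copy_slot0_to_slot1, double_slot1_into_slot2])

# Remove the first layer and the second has nothing to read.
ablated = run_residual_stream([1.0, 0.0, 0.0], [double_slot1_into_slot2])
```

The path from slot 0 through slot 1 to slot 2 is a two-step circuit in miniature: the second layer's job only makes sense given what the first layer wrote.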
The classic example is the induction head, the circuit behind a model's knack for continuing a pattern it has just seen. It works in two moves: find where the current token appeared before, then predict whatever followed it last time.
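Written as ordinary code, the two moves look like this. The sketch mimics the behavior with exact string matching; a real induction head does a soft version of the same thing with attention, and the function name here is made up.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the current token last time.

    Returns None when the current token has not appeared earlier.
    """
    current = tokens[-1]
    # Move 1: scan backwards for the most recent earlier occurrence.
    for pos in range(len(tokens) - 2, -1, -1):
        if tokens[pos] == current:
            # Move 2: predict whatever came right after it.
            return tokens[pos + 1]
    return None

prompt = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
prediction = induction_predict(prompt)      # "Dursley"
no_match = induction_predict(["a", "b", "c"])  # None: "c" has not been seen before
```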
Proving it is causal: patching and steering
Finding a feature that lights up for a concept is only correlation. To show it actually drives behavior, you intervene: reach in and change the activation, then watch the output move. Two methods do this. Activation patching swaps a piece of one run into another to see if the behavior follows. Feature steering clamps a feature on or off and watches the model bend.
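Both interventions fit in a toy example. The two-unit "model" below is entirely hypothetical, and its output is wired to depend on only one hidden unit so the result is easy to check. Patching reveals which unit carries the behavior; steering clamps that unit and moves the output.

```python
def hidden_layer(x):
    # Two hidden units computed from two input numbers (made-up toy model).
    return [x[0] + x[1], x[0] - x[1]]

def output_layer(h):
    # By construction, only hidden unit 0 influences the output.
    return 3.0 * h[0] + 0.0 * h[1]

def run(x, patch=None):
    """Run the toy model, optionally overwriting one hidden unit mid-run."""
    h = hidden_layer(x)
    if patch is not None:
        index, value = patch
        h[index] = value
    return output_layer(h)

clean_input = [2.0, 1.0]
corrupt_input = [0.0, 1.0]
clean_hidden = hidden_layer(clean_input)

clean_out = run(clean_input)
corrupt_out = run(corrupt_input)

# Activation patching: copy one clean hidden unit into the corrupted run.
patched_unit0 = run(corrupt_input, patch=(0, clean_hidden[0]))
patched_unit1 = run(corrupt_input, patch=(1, clean_hidden[1]))

# Steering: clamp unit 0 to a large value and watch the output follow.
steered = run(corrupt_input, patch=(0, 10.0))
```

Patching unit 0 restores the clean output, while patching unit 1 changes nothing, so unit 0 is the causal one. In a real model the same experiment is run on attention heads, residual stream positions, or SAE features rather than two hand-written numbers.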
Why it matters
This is not only elegant; it is one of the central bets in AI safety. We are deploying systems whose inner workings we cannot yet read, and behavior alone is a weak guarantee, since a model can look aligned while computing something else. If we could read internals, several hard problems would become more tractable:
- Detecting deception. A feature for "the model knows this is false" would be worth more than any amount of polite output.
- Finding dangerous capabilities before they are used, rather than discovering them in the wild.
- Debugging at the root. Fixing the circuit that causes a failure, not just patching the symptom with more training data.
- Earned trust. Confidence based on understanding the mechanism, not on having failed to find a problem yet.
What it cannot do yet
Be clear-eyed about how early this is. Extracting millions of features and tracing a handful of circuits is a genuine leap, but it is not a full account of how a frontier model produces any given answer. The honest limits:
- Coverage. We can read pieces, not the whole program. Most of what a model does is still dark.
- Scale. The tools are expensive and the targets keep getting bigger, faster than the microscope improves.
- Completeness. Naming a feature is not the same as proving you have found all of them, or that your story is the model's true mechanism rather than a plausible one.
None of that makes it less important. It makes it one of the most active and consequential research frontiers in AI.
We built a mind we cannot read. Mechanistic interpretability is the slow, careful work of learning to read it.
Frequently asked questions
What is mechanistic interpretability?
Mechanistic interpretability is the science of reverse-engineering what happens inside a neural network: turning its billions of learned weights into human-understandable parts. The goal is to find the concepts a model represents (features) and the step-by-step computations it runs (circuits), so we can say not just what a model does but how it does it.
Why is it so hard to understand what a neural network is doing?
Because a model is trained, not programmed. Nobody writes the rules; they emerge from billions of numbers tuned by gradient descent. The result works without being legible, the way a brain works without a wiring diagram. Mechanistic interpretability is the attempt to recover that wiring diagram after the fact.
What is a feature in interpretability?
A feature is a direction in a model's internal activation space that stands for a concept: the Golden Gate Bridge, a semicolon in code, a sense of sadness. Features, not individual neurons, are the real units of meaning, because the model usually spreads each feature across many neurons.
What is superposition?
Superposition is how a model packs more features than it has neurons, by storing them as overlapping directions that are only nearly, not exactly, separate. It is why a single neuron lights up for many unrelated things (it is polysemantic) and why you cannot understand a model neuron by neuron.
What is a sparse autoencoder in interpretability?
A sparse autoencoder (SAE), a form of dictionary learning, is the main tool for undoing superposition. It learns to re-express a layer's dense, tangled activations as a much wider set of features where only a few are active at once, and each one tends to be a single clean, interpretable concept.
What is a circuit?
A circuit is a small subgraph of a model that implements a specific behavior: particular features and attention heads wired together to do one job. A famous example is the induction head, a circuit that drives in-context repetition by finding where a token appeared before and predicting what followed it.
What is feature steering or activation patching?
They are causal interventions: instead of only observing activations, you change them and watch the output. Clamp a feature on and the model fixates on it; this is how Golden Gate Claude was made, by turning up a single Golden Gate Bridge feature. Patching swaps activations between runs to prove a part actually causes a behavior, rather than just correlating with it.
Why does mechanistic interpretability matter?
Because we are deploying systems we do not fully understand. Reading a model's internals could let us detect deception or dangerous capabilities, debug failures at their root, and build justified trust instead of guessing from behavior alone. It is one of the main technical bets in AI safety.
Can we fully reverse-engineer a model yet?
Not yet. The field can now extract millions of interpretable features from frontier models and trace some real circuits, which is a large advance, but it is still far from a complete account of how a model produces any given output. It is early, fast-moving, and one of the most important open problems in AI.
Glossary
- Mechanistic interpretability: Reverse-engineering a neural network's internals into human-understandable features and circuits, to explain how it works, not just what it outputs.
- Feature: A direction in activation space that represents one concept. The real unit of meaning, usually spread across many neurons.
- Neuron: A single unit in the network. Individually hard to read, because it takes part in many features at once.
- Polysemanticity: The fact that one neuron responds to many unrelated concepts, a direct consequence of superposition.
- Monosemanticity: The goal state: a unit that means exactly one thing. Sparse autoencoder features aim for this.
- Superposition: Storing more features than there are neurons by using overlapping, nearly-separate directions.
- Activation: The vector of numbers a layer produces for a given input. The thing interpretability reads.
- Residual stream: The shared channel running through a transformer that each layer reads from and writes to.
- Circuit: A small wiring of features and attention components that implements one specific behavior.
- Attention head: A component that moves information between positions. Circuits are often built from a few specific heads.
- Induction head: A circuit that continues a pattern by finding a token's previous occurrence and copying what followed it.
- Sparse autoencoder (SAE): A model that re-expresses dense activations as a wide, sparse set of interpretable features. A form of dictionary learning.
- Activation patching: Swapping activations between runs to test whether a component causes a behavior.
- Feature steering: Clamping a feature up or down to change the model's output, proving the feature is causal.
Where to go next
You now have the field in pictures: the black box, features as directions, superposition hiding them, sparse autoencoders pulling them out, circuits wiring them into behavior, and interventions proving cause. Three directions from here.
To understand the models being interpreted, read how reasoning models work and what an AI agent is. Interpretability is also deeply tied to training: the guide to reinforcement learning covers how the behaviors we later try to read get shaped in the first place.
For the daily moves in safety, interpretability research, and the models themselves, the daily briefing reads the wire so you do not have to, and closes each edition with one falsifiable call we settle in public. This guide is part of The Primer, our growing library of ground-up explainers, re-checked against the live landscape each month so the details stay current.