nextbig.dev
Vancouver, B.C. · Intelligence on AI and the machines that run it
Intermediate · Updated June 2026

What Is Mechanistic Interpretability? A Visual Guide to the Inside of a Neural Network

A modern AI is grown, not written: billions of weights that work without anyone being able to say exactly how. Mechanistic interpretability is the science of opening that black box, finding the concepts and circuits inside, and proving what they do. Here is the field, in pictures.


Here is the strangest fact about modern AI: nobody wrote it. A large model is not programmed with rules; it is grown from data, billions of numbers tuned until the whole thing works. It answers, reasons, and codes, and yet no one can open it up and point to where any of that lives. We built a mind we cannot read. Mechanistic interpretability is the science trying to fix that, to turn the black box into something we can actually understand, part by part.

This guide is a visual tour of how that is done: what the units of meaning inside a model really are, the strange trick (superposition) that hides them, the tool (sparse autoencoders) that pulls them back out, the circuits that wire them into behavior, and the experiments that prove what is causal. It assumes you know roughly what a neural network is. For the models themselves, see how reasoning models work.

What you'll learn

  • Why a trained model is a black box, and what it would mean to read it
  • Features: the real units of meaning, which are directions, not neurons
  • Superposition: how a model hides more concepts than it has neurons
  • Sparse autoencoders: the tool that pulls clean features back out
  • Circuits: how features wire into behavior, like the induction head
  • Steering and patching: how we prove a part is causal, not just correlated

What is mechanistic interpretability?

Most AI explainability asks shallow questions: which input words mattered most, or which example is similar to this one. Mechanistic interpretability asks the deep one: what algorithm is the model actually running? It treats a trained network like an alien artifact to be reverse-engineered, recovering the concepts it stores and the computations it performs, until you can describe a behavior as a mechanism rather than a mystery.

The working metaphor is a microscope. The weights are all right there, fully visible, and still illegible, the way every cell of an organism is visible under a lens but the biology is not obvious. The job is to find the structure hiding in plain sight.

[Figure: a prompt enters the model (billions of weights, grown not written) and an answer comes out; an interpretability lens resolves part of the model into a clean circuit.]
The whole problem in one picture. A prompt goes in and an answer comes out, and in between sit billions of weights that are fully visible and completely opaque. Mechanistic interpretability is the lens: it resolves that wall of numbers into legible parts, the features and circuits that actually do the work.

Features, not neurons

The obvious place to look is the neuron. Maybe one neuron is the "cat" neuron, another the "France" neuron. Reality is messier. Probe a single neuron and you often find it fires for a jumble of unrelated things: snippets of Python, the color green, a particular surname, parts of words in Korean. Such neurons are polysemantic: they respond to many unrelated concepts, which makes them poor units of meaning.

The real unit is the feature: a direction in the model's activation space that corresponds to one concept. A feature is usually spread across many neurons at once, and any one neuron takes part in many features. So you cannot read a model neuron by neuron, any more than you can read a sentence letter by scrambled letter. You have to find the directions.

Why directions? Inside a model, a layer's state is just a long list of numbers, a vector. A concept being "present" looks like that vector pointing a certain way. Adding two features means pointing partly in both directions at once. That is the geometry the rest of this guide is about.
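To make the geometry concrete, here is a minimal sketch in plain Python. The two feature directions (`CAT` and `FRANCE`) and the four-number activation are invented for illustration; real features are learned, not hand-written, and live in spaces with thousands of dimensions. The sketch only shows that a concept's strength can be read as a projection onto its direction, even when every individual number in the vector is nonzero.

```python
def dot(u, v):
    """Projection score: how far vector u points along vector v."""
    return sum(a * b for a, b in zip(u, v))

def scale(v, k):
    return [k * a for a in v]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical feature directions in a 4-number activation space.
# Both have length 1 and are orthogonal to each other.
CAT = [0.5, 0.5, 0.5, 0.5]
FRANCE = [0.5, -0.5, 0.5, -0.5]

# An activation with "cat" strongly present and "France" weakly present.
# Every one of the four numbers is nonzero: no single slot "is" the cat.
activation = add(scale(CAT, 3.0), scale(FRANCE, 1.0))

# Reading a feature means projecting the activation onto its direction.
cat_strength = dot(activation, CAT)
france_strength = dot(activation, FRANCE)

print(activation, cat_strength, france_strength)
```

Because these two directions are exactly orthogonal, each strength is recovered without interference. That tidy case stops holding once a model packs in more features than it has dimensions.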

Superposition: more concepts than neurons

If features are directions, here is the twist that makes interpretability hard. A model represents far more features than it has neurons. It does this through superposition: cramming many features into the same space as directions that are only nearly separate, not perfectly. The picture below shows the idea in two dimensions.

[Figure: superposition. Six feature directions (Golden Gate Bridge, DNA, semicolons, first names, the 1800s, sadness) packed into a two-neuron space with axes neuron 1 and neuron 2. Two neurons hold only two clean directions; the model packs in six by letting them overlap, which is why one neuron fires for many unrelated things.]
Superposition in two dimensions. A space of two neurons can only hold two truly separate directions, but the model squeezes in six by letting them overlap. Scale this up: a real layer has thousands of dimensions and represents millions of features the same way. This is the core reason a model is not readable one neuron at a time.
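The two-neuron picture can be sketched numerically. The layout below is an assumption made for illustration: six unit-length directions spaced evenly around a circle, with a clipped (ReLU-style) readout. Trained models find their own arrangements. The sketch shows two things: when only one feature is active it can still be identified despite the overlap, and the first neuron carries a nonzero share of all six features.

```python
import math

NAMES = ["Golden Gate Bridge", "DNA", "semicolons",
         "first names", "the 1800s", "sadness"]

# Assumed layout: six unit directions spaced 60 degrees apart in a 2-neuron space.
DIRECTIONS = [(math.cos(2 * math.pi * k / 6), math.sin(2 * math.pi * k / 6))
              for k in range(6)]

def encode(strengths):
    """Sum each feature's direction, weighted by its strength, into 2 neurons."""
    n1 = sum(s * d[0] for s, d in zip(strengths, DIRECTIONS))
    n2 = sum(s * d[1] for s, d in zip(strengths, DIRECTIONS))
    return (n1, n2)

def readout(neurons):
    """Project the 2 neurons onto every feature direction, clipping negatives."""
    return [max(0.0, neurons[0] * d[0] + neurons[1] * d[1]) for d in DIRECTIONS]

# Sparse input: only "DNA" is active.
neurons = encode([0, 1, 0, 0, 0, 0])
scores = readout(neurons)
best = NAMES[scores.index(max(scores))]

# Polysemanticity: which features have a nonzero component on neuron 1?
neuron1_responds_to = [name for name, d in zip(NAMES, DIRECTIONS)
                       if abs(d[0]) > 1e-9]

print(best, [round(s, 2) for s in scores])
print(len(neuron1_responds_to), "features touch neuron 1")
```

The neighbouring directions pick up a score of about 0.5 even though only "DNA" was switched on. That leakage is the interference superposition trades for capacity, and it stays manageable only while few features are active at once.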

Pulling features apart: sparse autoencoders

If features are hidden in superposition, you need a way to pull them back out. The breakthrough tool is the sparse autoencoder (SAE), a form of dictionary learning. The trick is to re-express a layer's dense, tangled activations using a much wider set of features, while forcing almost all of them to be off at any moment. That pressure to be sparse makes each surviving feature settle onto a single, clean concept.

[Figure: a sparse autoencoder learns a dictionary that re-expresses a dense, polysemantic activation (every neuron a bit on, tangled) as thousands of sparse features, only a few of them on and each readable, such as Golden Gate Bridge, code: semicolons, and DNA sequences.]
The dictionary trick. On the left, a handful of neurons are all partly active, a tangle no one can read. The sparse autoencoder re-expresses that same state as thousands of features of which only a few fire, and those tend to be single, nameable concepts. Run at scale, this has pulled millions of interpretable features out of frontier models.
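As a rough sketch of the shape of the computation, the snippet below implements one sparse autoencoder forward pass and its loss in plain Python. Several things here are assumptions: the weights are hand-set rather than trained, the dictionary has four features instead of thousands, and the sparsity pressure is an L1 penalty, which is one common choice among several. It is not the configuration used on any particular frontier model.

```python
def relu(value):
    return max(0.0, value)

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode a dense activation into features, then decode it back."""
    # Encoder: one pre-activation per dictionary feature, then ReLU.
    features = [
        relu(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(w_enc, b_enc)
    ]
    # Decoder: rebuild the activation as a weighted sum of feature directions.
    x_hat = [
        b_dec[j] + sum(f * w_dec[i][j] for i, f in enumerate(features))
        for j in range(len(x))
    ]
    return features, x_hat

def sae_loss(x, features, x_hat, l1_coeff):
    """Reconstruction error plus an L1 penalty that pushes features to zero."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(f) for f in features)
    return reconstruction + l1_coeff * sparsity

# Hand-set, untrained weights: 4 dictionary features for a 2-number activation.
DICTIONARY = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
W_ENC, B_ENC = DICTIONARY, [0.0, 0.0, 0.0, 0.0]
W_DEC, B_DEC = DICTIONARY, [0.0, 0.0]

x = [2.0, 0.0]
features, x_hat = sae_forward(x, W_ENC, B_ENC, W_DEC, B_DEC)
active = sum(1 for f in features if f > 0)
loss = sae_loss(x, features, x_hat, l1_coeff=0.1)

print(features, x_hat, active, loss)
```

Training adjusts the encoder and decoder weights to lower this loss over many activations. The two terms pull against each other: the first rewards faithful reconstruction, the second rewards using as few features as possible.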

Circuits: how features wire into behavior

Features are the nouns. Circuits are the verbs: small wirings of features and attention components that carry out a specific computation. Information moves along the model's residual stream, a shared channel that each layer reads from and writes back to, and a circuit is a path through it that does one identifiable job.
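The read-then-write-back pattern of the residual stream can be sketched in a few lines. The two layers below (`layer_a`, `layer_b`) and the four-slot stream are hypothetical stand-ins chosen so the arithmetic is easy to follow; real layers are attention and MLP blocks acting on much larger vectors.

```python
def run_residual_stream(stream, layers):
    """Each layer reads the whole stream and adds its update back into it."""
    history = [list(stream)]
    for layer in layers:
        update = layer(stream)
        stream = [s + u for s, u in zip(stream, update)]
        history.append(list(stream))
    return stream, history

# Hypothetical layers. Together they form a tiny "circuit":
# layer_a copies slot 0 into slot 2, then layer_b doubles slot 2 into slot 3.
def layer_a(stream):
    return [0.0, 0.0, stream[0], 0.0]

def layer_b(stream):
    return [0.0, 0.0, 0.0, 2.0 * stream[2]]

final, history = run_residual_stream([1.0, 0.0, 0.0, 0.0], [layer_a, layer_b])
print(history)
```

Nothing is overwritten along the way: the original input is still sitting in slot 0 at the end. Later layers can build on what earlier ones wrote, which is what makes a multi-step path through the stream possible.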

The classic example is the induction head, the circuit behind a model's knack for continuing a pattern it has just seen. It works in two moves: find where the current token appeared before, then predict whatever followed it last time.

[Figure: an induction head on the text "Mr Dursley . . . Mr". Step 1: find where this token appeared before. Step 2: copy whatever came next. Predicted next token: Dursley.]
An induction head, the most studied circuit in interpretability. Seeing the second "Mr", it reaches back to the earlier "Mr", notes that "Dursley" followed it, and predicts "Dursley" again. Nobody designed this; it forms on its own during training, and it is a big part of how models learn in context. Finding it was an early proof that real, nameable algorithms live inside the weights.
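The two moves can be written as an ordinary lookup. This sketch captures only the behavior, not the mechanism: the function name and the exact-match rule are assumptions for illustration, whereas a real induction head does the matching with soft attention weights over learned representations.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the last token's
    most recent earlier occurrence. Returns None if it never appeared before."""
    current = tokens[-1]
    # Move 1: scan backwards for an earlier occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # Move 2: copy whatever came right after it.
            return tokens[i + 1]
    return None

prediction = induction_predict(["Mr", "Dursley", "was", "proud", ".", "Mr"])
no_match = induction_predict(["a", "b", "c"])
print(prediction, no_match)
```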

Proving it is causal: patching and steering

Finding a feature that lights up for a concept is only correlation. To show it actually drives behavior, you intervene: reach in and change the activation, then watch the output move. Two methods do this. Activation patching swaps a piece of one run into another to see if the behavior follows. Feature steering clamps a feature on or off and watches the model bend.
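Activation patching is easiest to see on a toy model. Everything in the sketch below is hypothetical: a two-component "model" whose output depends on component `a` only, plus a clean and a corrupted input. Patching each component's clean activation into the corrupted run shows which one carries the behavior.

```python
def run_model(x, patch=None):
    """Toy model with two named components.
    `patch` maps a component name to an activation to force in its place."""
    patch = patch or {}
    acts = {}
    acts["a"] = patch.get("a", x[0] + x[1])  # component a: sum of the inputs
    acts["b"] = patch.get("b", x[0] - x[1])  # component b: difference of the inputs
    output = 2 * acts["a"]                   # the output reads component a only
    return output, acts

clean_input, corrupt_input = [3, 1], [0, 1]

clean_out, clean_acts = run_model(clean_input)
corrupt_out, _ = run_model(corrupt_input)

def restored_fraction(component):
    """Share of the clean behavior recovered by patching one component."""
    patched_out, _ = run_model(corrupt_input,
                               patch={component: clean_acts[component]})
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)

effect_a = restored_fraction("a")
effect_b = restored_fraction("b")
print(effect_a, effect_b)
```

Component `b` changes between the two runs, so its activation correlates with the change in input, yet patching it restores nothing. Only the intervention separates the component that matters from the one that merely moves.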

[Figure: feature steering with the Golden Gate Bridge feature. Before (feature off): asked "what should I cook tonight?", the model answers "how about a simple pasta with garlic and olive oil?" After (feature clamped to max): the same question gets "something to enjoy with a view of the Golden Gate Bridge, its towers rising through the fog..."]
Steering one feature. With the Golden Gate Bridge feature clamped to its maximum, the model drags every answer back to the bridge, even a question about dinner. That is the move behind Golden Gate Claude, and it is the strongest evidence interpretability offers: not that a feature correlates with a concept, but that turning it changes what the model does.
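Clamping can be sketched as setting the activation's component along a feature direction to a chosen value. The direction `BRIDGE`, the three-word vocabulary in `UNEMBED`, and the numbers are all invented for illustration. This is a simplification that edits the activation directly, rather than going through a trained sparse autoencoder as the real experiment did.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def clamp_feature(activation, direction, target):
    """Set the activation's component along a unit-length direction to `target`,
    leaving everything orthogonal to that direction unchanged."""
    current = dot(activation, direction)
    return [a + (target - current) * d for a, d in zip(activation, direction)]

# Hypothetical unit-length feature direction and output directions.
BRIDGE = [0.6, 0.8, 0.0]
UNEMBED = {
    "pasta": [1.0, 0.0, 0.0],
    "bridge": BRIDGE,
    "soup": [0.0, 0.0, 1.0],
}

def top_word(activation):
    """Pick the word whose output direction best matches the activation."""
    return max(UNEMBED, key=lambda word: dot(activation, UNEMBED[word]))

activation = [2.0, 0.1, 0.5]  # stands in for the state after a dinner question
before = top_word(activation)

steered = clamp_feature(activation, BRIDGE, target=10.0)
after = top_word(steered)

print(before, "->", after)
```

Setting `target=0.0` instead would switch the feature off. Either way, the test is whether the output moves when the feature moves.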

Why it matters

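This is not only elegant; it is one of the central bets in AI safety. We are deploying systems whose inner workings we cannot yet read, and behavior alone is a weak guarantee, since a model can look aligned while computing something else. If we could read internals, several hard problems would become more tractable: detecting deception or dangerous capabilities, debugging failures at their root, and building justified trust instead of guessing from behavior alone.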

What it cannot do yet

Be clear-eyed about how early this is. Extracting millions of features and tracing a handful of circuits is a genuine leap, but it is not a full account of how a frontier model produces any given answer.

None of that makes it less important. It makes it one of the most active and consequential research frontiers in AI.

We built a mind we cannot read. Mechanistic interpretability is the slow, careful work of learning to read it.

Frequently asked questions

What is mechanistic interpretability?

Mechanistic interpretability is the science of reverse-engineering what happens inside a neural network: turning its billions of learned weights into human-understandable parts. The goal is to find the concepts a model represents (features) and the step-by-step computations it runs (circuits), so we can say not just what a model does but how it does it.

Why is it so hard to understand what a neural network is doing?

Because a model is trained, not programmed. Nobody writes the rules; they emerge from billions of numbers tuned by gradient descent. The result works without being legible, the way a brain works without a wiring diagram. Mechanistic interpretability is the attempt to recover that wiring diagram after the fact.

What is a feature in interpretability?

A feature is a direction in a model's internal activation space that stands for a concept: the Golden Gate Bridge, a semicolon in code, a sense of sadness. Features, not individual neurons, are the real units of meaning, because the model usually spreads each feature across many neurons.

What is superposition?

Superposition is how a model packs more features than it has neurons, by storing them as overlapping directions that are only nearly, not exactly, separate. It is why a single neuron lights up for many unrelated things (it is polysemantic) and why you cannot understand a model neuron by neuron.

What is a sparse autoencoder in interpretability?

A sparse autoencoder (SAE), a form of dictionary learning, is the main tool for undoing superposition. It learns to re-express a layer's dense, tangled activations as a much wider set of features where only a few are active at once, and each one tends to be a single clean, interpretable concept.

What is a circuit?

A circuit is a small subgraph of a model that implements a specific behavior: particular features and attention heads wired together to do one job. A famous example is the induction head, a circuit that drives in-context repetition by finding where a token appeared before and predicting what followed it.

What is feature steering or activation patching?

They are causal interventions: instead of only observing activations, you change them and watch the output. Clamp a feature on and the model fixates on it; this is how Golden Gate Claude was made, by turning up a single Golden Gate Bridge feature. Patching swaps activations between runs to prove a part actually causes a behavior, rather than just correlating with it.

Why does mechanistic interpretability matter?

Because we are deploying systems we do not fully understand. Reading a model's internals could let us detect deception or dangerous capabilities, debug failures at their root, and build justified trust instead of guessing from behavior alone. It is one of the main technical bets in AI safety.

Can we fully reverse-engineer a model yet?

Not yet. The field can now extract millions of interpretable features from frontier models and trace some real circuits, which is a large advance, but it is still far from a complete account of how a model produces any given output. It is early, fast-moving, and one of the most important open problems in AI.

Glossary

Mechanistic interpretability
Reverse-engineering a neural network's internals into human-understandable features and circuits, to explain how it works, not just what it outputs.
Feature
A direction in activation space that represents one concept. The real unit of meaning, usually spread across many neurons.
Neuron
A single unit in the network. Individually hard to read, because it takes part in many features at once.
Polysemanticity
The fact that one neuron responds to many unrelated concepts, a direct consequence of superposition.
Monosemanticity
The goal state: a unit that means exactly one thing. Sparse autoencoder features aim for this.
Superposition
Storing more features than there are neurons by using overlapping, nearly-separate directions.
Activation
The vector of numbers a layer produces for a given input. The thing interpretability reads.
Residual stream
The shared channel running through a transformer that each layer reads from and writes to.
Circuit
A small wiring of features and attention components that implements one specific behavior.
Attention head
A component that moves information between positions. Circuits are often built from a few specific heads.
Induction head
A circuit that continues a pattern by finding a token's previous occurrence and copying what followed it.
Sparse autoencoder (SAE)
A model that re-expresses dense activations as a wide, sparse set of interpretable features. A form of dictionary learning.
Activation patching
Swapping activations between runs to test whether a component causes a behavior.
Feature steering
Clamping a feature up or down to change the model's output, proving the feature is causal.

Where to go next

You now have the field in pictures: the black box, features as directions, superposition hiding them, sparse autoencoders pulling them out, circuits wiring them into behavior, and interventions proving cause. Three directions from here.

To understand the models being interpreted, read how reasoning models work and what an AI agent is. Interpretability is also deeply tied to training: the guide to reinforcement learning covers how the behaviors we later try to read get shaped in the first place.

For the daily moves in safety, interpretability research, and the models themselves, the daily briefing reads the wire so you do not have to, and closes each edition with one falsifiable call we settle in public. This guide is part of The Primer, our growing library of ground-up explainers, re-checked against the live landscape each month so the details stay current.
