How Reasoning Models Work: Test-Time Compute and the New Scaling Law

For a decade the recipe for a smarter AI was almost boring in its reliability: make it bigger. More parameters, more data, more training compute, and the model got better, like clockwork. Around 2024 that curve began to bend. The high-quality data ran low, and each doubling bought less. Then a different idea broke the logjam, and it is the reason the frontier moved again: stop only making the model bigger, and start letting it think longer before it answers.

That one shift created reasoning models, the most capable systems of 2026, and a new scaling law to go with them. This guide explains what a reasoning model is, what "thinking" actually means inside one, how reinforcement learning taught these models to reason, and why the trade is more intelligence in exchange for more time and more compute. It assumes you know roughly what an LLM is. If you want that foundation first, start with what an AI agent is or how to run a local LLM.

What you'll learn

Why making models bigger stopped being enough
Test-time compute: the second axis for buying intelligence
What "thinking" is, in tokens, and what a thinking budget does
How RL on verifiable rewards teaches a model to reason
Two ways to spend compute at answer time: depth and breadth
The cost and latency flip, and when a reasoning model is worth it

The wall: why bigger stopped being enough

The original scaling laws were a genuine scientific result: across many orders of magnitude, a language model's capability improved smoothly and predictably as you scaled three things together, the number of parameters, the amount of training data, and the compute spent training. For years you could buy progress with money and patience.

Two things bent that curve. First, returns diminished: each doubling of size and data delivered a smaller gain than the last. Second, the data wall: the supply of high-quality human text is finite, and the largest models had already read most of it. Bigger was still better, but it was getting slow and expensive, and you cannot keep doubling a resource that is running out.

Pretraining did not stop working. It hit diminishing returns, which is different from failing. A stronger base model still helps. The point is that scale alone stopped being the cheapest way to move the frontier, which created room for a second idea.

The new knob: test-time compute

Here is the move that changed everything. Instead of spending all your compute up front during training, spend more of it at the moment the model answers. Let the model generate a long chain of reasoning, working the problem out before it commits. On hard problems, the longer it is allowed to think, the more often it gets the answer right.

This is test-time compute, also called inference-time scaling, and it behaves like its own scaling law. Plot accuracy on a hard benchmark against the amount of compute spent thinking, and the line climbs. It is a second axis, orthogonal to model size: one knob makes the model bigger, the other lets a given model think harder.

Two scaling laws. On the left, capability rises with model size and training data, then flattens as returns diminish. On the right, accuracy on hard problems keeps climbing as the model spends more compute thinking at answer time. Reasoning models are what you get when you train a model to make full use of that second curve.

What "thinking" actually is

"Thinking" sounds mystical. It is not. When a reasoning model thinks, it is generating text, the same way it generates any other output, one token at a time. The difference is that this text is a long, private working-out of the problem: a chain of thought where the model drafts an approach, tries it, checks the result, notices a mistake, and tries again, before writing a clean final answer.

Most products hide this chain or show only a summary, because it is long and messy. But it is the engine. The reasoning tokens in that hidden scratchpad are exactly the compute that test-time scaling spends. More thinking means more of these tokens, which is why you can often set a thinking budget or effort level, low, medium, or high, to trade speed for depth.

Inside a reasoning model. The long private chain on the left is the "thinking," and every line of it is generated text that costs compute. The dial on the right is the thinking budget: turn it up and the model reasons longer, which raises accuracy on hard problems and raises latency and cost in step. The final answer the user sees is usually short.

How models learned to reason: RL on verifiable rewards

Here is the part that still feels like magic and is not. You cannot teach reasoning by showing a model more examples of good reasoning, because there are not enough, and copying steps is not the same as knowing why they work. The breakthrough was to stop teaching the steps and instead reward the outcome.

The method is reinforcement learning from verifiable rewards, or RLVR. Take problems with a checkable answer, math and code are the obvious ones, since you can run the code or check the number. Have the model generate many attempts. Reward the attempts that reach a correct answer, and push the model toward whatever it did to get there. Repeat at scale.

The loop that makes a reasoner. Because the answer is checkable, the reward is trustworthy, and the model can be pushed hard toward correct outcomes without anyone hand-labeling the reasoning. Given only that pressure, models start to check their work and backtrack on their own. Researchers nicknamed the point where a model spontaneously writes "wait, let me reconsider" the aha moment, and DeepSeek's R1 work made the recipe legible in the open.

This is why math and code led the way: they are easy to grade, so the reward signal is honest and cheap. The frontier problem now is extending RLVR to fuzzier domains where "correct" is harder to check. That is the whole reason RL environments and verifiers became a gold rush in 2026: progress is increasingly limited by what you can grade, not by what you can generate.

Two ways to spend compute at test time

Once a model can use extra thinking, there are two distinct ways to spend it, and they stack.

Two ways to spend the budget. Sequential scaling lets a single chain run longer and check itself, adding depth. Parallel scaling samples many independent attempts and then picks a winner, by majority vote, by best-of-N, or by a verifier that scores each one, adding breadth. Frontier systems combine them: sample several long chains, then judge.

The economics flip: smarter, slower, pricier

There is no free lunch. Because the thinking is generated text, a reasoning model can spend many times the tokens of a direct answer, sometimes ten to a hundred times more on a hard problem. You wait for those tokens, and you pay for them. A question that took a second now takes many seconds or minutes, and costs in proportion.

That has a strategic consequence worth sitting with: the cost of frontier AI is shifting from training to inference. The old world spent a fortune once, on a giant training run, then served answers cheaply. The reasoning world spends real compute every single time someone asks something hard. For where that compute comes from and what it costs per token, see the GPU and inference economics guide.

The practical rule in 2026: route by difficulty. Send genuinely hard, multi-step problems to a reasoning model with a high thinking budget. Send everything else to a fast model. Many products now do this automatically, classifying the request and only paying for deep thinking when it will actually change the answer.

Where reasoning models fall down

They are powerful, not magic, and the failure modes are specific.

Overthinking. On easy questions a reasoning model can burn a long chain for no benefit, and sometimes talk itself out of a correct first instinct. More thinking is not always better.
Latency and cost. Seconds to minutes per answer, at many times the token cost. That rules them out for anything that needs to feel instant or runs at huge volume.
The chain is not always faithful. The visible reasoning reads like the model's true logic, but research shows it does not always reflect why the model actually answered. Treat a chain of thought as a useful artifact, not a guaranteed audit trail.
Reward hacking. Train on a checkable reward and a model will optimize exactly that reward, including by finding shortcuts that satisfy the grader without solving the real problem. Good verifiers are the defense, and they are hard to build.
You can only RL what you can grade. The method is strongest where success is checkable. Open-ended, taste-driven work is much harder to reward, which is where progress is slower.

The frontier from here

Three threads are pulling this forward in 2026, and they are worth watching.

First, RL environments: if you can grade a task, you can train a model to get better at it, so the race is on to build rich, verifiable environments for everything from spreadsheets to web browsing. The bottleneck has moved from data to verifiers. Second, reasoning plus agents: an agent that thinks before each action is far more reliable than one that reacts, so the loop from the agents guide increasingly has a reasoning model at its core. Third, extending beyond math and code into domains where "correct" is fuzzy, which is the hard, unsolved part and probably the next big unlock.

The first scaling law said: make it bigger. The second says: let it think. The frontier now moves by pushing both at once.

Frequently asked questions

What is a reasoning model?

A reasoning model is a large language model trained to spend extra computation thinking through a problem, in a long internal chain of reasoning, before it commits to an answer. Instead of replying with the first thing it would say, it works the problem out step by step, checks itself, and can backtrack. Examples in 2026 include OpenAI's o-series, DeepSeek-R1, and the extended-thinking modes in Claude, Gemini, and Qwen.

What is test-time compute?

Test-time compute, also called inference-time scaling, is the idea of making a model smarter by giving it more computation at the moment it answers, rather than only by making the model bigger during training. In practice the model generates a long chain of reasoning before its final answer, and on hard problems, letting it think longer reliably raises accuracy. It is a second way to buy intelligence, separate from scaling the model itself.

How is a reasoning model different from a regular LLM?

A regular LLM answers in roughly one pass: text in, text out, with little deliberation. A reasoning model first produces a long internal chain of reasoning, often hidden from the user, and only then writes the answer. That makes it much stronger on math, code, and multi-step logic, at the cost of more time and more tokens per answer.

How are reasoning models trained?

The key method is reinforcement learning from verifiable rewards (RLVR). The model is given problems with a checkable answer, such as math or code, generates many attempts, and is rewarded for the ones that reach a correct result. Over time it discovers useful habits like checking its work and backtracking, because those habits earn reward. Nobody scripts the steps; the model learns them.

Are reasoning models always better?

No. They shine on hard, multi-step problems in math, code, science, and planning. On simple or open-ended tasks they are slower, more expensive, and can overthink, sometimes talking themselves out of a correct first answer. The practical move in 2026 is to route hard problems to a reasoning model and everything else to a fast one.

What is chain of thought?

Chain of thought is the model writing out its intermediate reasoning as text, step by step, instead of jumping to an answer. Reasoning models take this further: they are trained to produce long chains of thought as their main way of working, and that generated reasoning is the "thinking" that test-time compute pays for.

Why are reasoning models slower and more expensive?

Because the thinking is generated text too. A reasoning model can produce many times the tokens of a direct answer, sometimes ten to a hundred times more, and you wait for and pay for every one. That cost lands on the inference bill, which is why reasoning is shifting the economics of AI from one big training run toward the compute burned on every hard question.

What is the difference between sequential and parallel test-time compute?

Sequential means letting one chain of reasoning run longer and deeper. Parallel means sampling many independent attempts and then picking the best, by majority vote or by a verifier that scores them. Sequential adds depth, parallel adds breadth, and frontier systems often combine the two.

Does the old scaling law still work?

Pretraining scaling still helps, but each doubling of size and data buys less than it used to, and high-quality data is limited. Test-time compute did not replace pretraining; it added a second axis. The frontier now moves by improving both the base model and how much it can usefully think.

What are RL environments?

RL environments are simulated tasks with a clear, checkable reward, used to train models and agents with reinforcement learning. They are a major focus in 2026 because progress is now limited by what you can grade: if you can verify success at a task, you can train a model to get better at it. Building good environments and verifiers has become its own field.

Glossary

Reasoning model: An LLM trained to produce a long internal chain of reasoning before its answer, trading time and compute for accuracy on hard problems.
Test-time compute: Also inference-time scaling. Spending more computation when the model answers, rather than only when it is trained.
Chain of thought (CoT): The model's step-by-step reasoning written out as text. In reasoning models it is the main way the model works, not a one-off trick.
Reasoning tokens: The tokens in the model's thinking, often hidden from the user. They are the compute that test-time scaling spends.
Thinking budget / effort: A control that sets how long the model is allowed to think. Higher means more accuracy on hard problems, more latency, and more cost.
RLVR: Reinforcement learning from verifiable rewards. Training a model by rewarding correct, checkable answers and reinforcing whatever reasoning produced them.
Verifier / reward model: The component that scores an attempt. When the answer is checkable (math, code) the verifier is exact; for fuzzier tasks it is itself a learned model.
Rollout: A single attempt the model generates during RL training, complete with its reasoning and final answer.
Best-of-N / majority vote: Parallel test-time strategies: sample N answers, then keep the most common one (majority vote) or the highest-scored one (best-of-N).
Sequential vs parallel scaling: Two ways to spend test-time compute: one long chain (depth) versus many sampled attempts that are then judged (breadth).
Pretraining scaling law: The original finding that capability improves smoothly with model size, data, and training compute. Still useful, now with diminishing returns.
RL environment: A task with a clear, checkable reward used to train models and agents. The current bottleneck on progress is building good ones.
Reward hacking: When a model optimizes the literal reward in unintended ways, satisfying the grader without solving the real problem.

Where to go next

You now have the shape of the shift: the wall that pretraining hit, the second axis of test-time compute, what thinking is in tokens, how RL on verifiable rewards builds a reasoner, and the cost it all carries. Three directions from here.

Reasoning is most powerful inside an agent, so read what an AI agent is to see the loop that wraps a thinking model in tools and actions. For the money side of all this inference, the GPU and inference economics guide covers what the compute costs and who pays for it. And to run a capable model on your own hardware, the companion guide on how to run a local LLM shows you how.

For the daily moves in models, chips, and tooling, the daily briefing reads the wire so you do not have to, and closes each edition with one falsifiable call we settle in public. This guide is part of The Primer, our growing library of ground-up explainers, re-checked against the live landscape each month so the details stay current.