nextbig.dev
Vancouver, B.C. · Intelligence on AI and the machines that run it
Intermediate · Updated June 2026

How to Do Reinforcement Learning in 2026: A Practical Guide Using Claude

Reinforcement learning used to be a specialist's dark art: unstable, compute-hungry, and bottlenecked on the reward. In 2026 both hard parts got easy. Simpler algorithms and turnkey tools handle the training, and a strong model like Claude can write the pipeline and act as the grader that produces the reward. Here is how to actually run one.

For most of its history, reinforcement learning was the part of machine learning people warned you about. It was unstable, it ate compute, and the hardest part was not the training at all but the reward: deciding, automatically and at scale, whether an output was any good. In 2026 both halves of that problem, the training and the reward, eased at once. The algorithms got simpler and the tooling got turnkey, and a strong model like Claude can now do the two jobs that used to need a research team: write the training pipeline, and act as the grader that produces the reward.

This guide is the practical, honest version. It explains what RL actually is, why it got so much more accessible, how to put Claude in the reward seat, and how to run a real fine-tune on a single GPU without a PhD. It assumes you know roughly what an LLM is. For the model side, the companion guide on how reasoning models work covers the RL recipe that built them.

What you'll learn

  • What reinforcement learning is, in plain terms, for language models
  • Why 2026 made RL dramatically easier: DPO, GRPO, and turnkey tools
  • How to use Claude as the grader (RLAIF / LLM-as-a-judge)
  • A recipe you can run this weekend on one GPU
  • How to make Claude-grading cheap with the right model, batching, and caching
  • The honest catches: reward hacking, grader drift, and the fact that you cannot train Claude itself

What reinforcement learning actually is

Most model training is imitation: you show the model correct examples and it learns to copy them. Reinforcement learning is different. You let the model try, score the attempt, and push it toward whatever scored higher. There are no gold answers to copy, only a reward to climb.

Four words carry the whole idea. The policy is the model you are training. An action is an output it generates. The reward is a number saying how good that output was. The update nudges the policy to make high-reward outputs more likely next time. Repeat that loop and the model gets better at whatever the reward measures.

[Figure: The RL loop (try, score, nudge). 1. The policy: the model you are training, an open-weight model. 2. Sample: it generates outputs, several tries per prompt. 3. Reward: score each output (the hard part: where does the score come from?). 4. Update: nudge toward high reward, then loop.]
The loop behind every RL method. The policy samples outputs, the reward scores them, and the update makes the good ones more likely. The accent box is the historically hard part: producing an honest reward at scale. Most of what got easier in 2026 is about that one box.
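
The four-word loop above can be sketched in a few lines. This is a toy, not a language-model trainer: the "policy" is just one logit per candidate output, the reward values are made up, and the update is a plain REINFORCE step with a running baseline. The function name and hyperparameters are illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_toy_policy(rewards, steps=3000, lr=0.1, seed=0):
    """Try, score, nudge: REINFORCE on a fixed set of candidate outputs.

    rewards[i] is the (hypothetical) score of candidate output i.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)   # the policy: starts uniform
    baseline = 0.0                  # running average of observed reward
    for _ in range(steps):
        probs = softmax(logits)
        # Sample: the policy picks one output according to its probabilities.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        # Reward: score the sampled output.
        reward = rewards[action]
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Update: d log p(action) / d logit_i = 1[i == action] - p_i
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


# Three candidate outputs with made-up rewards; the middle one scores highest.
final_probs = train_toy_policy([0.1, 0.9, 0.4])
print(final_probs)
```

After training, most of the probability mass should sit on the highest-reward output. Real RL on an LLM swaps the logits for model weights and the fixed reward list for a verifier or grader, but the loop has the same shape.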

Why RL used to be hard

Three things made RL on language models a specialist sport. First, the reward: for anything beyond math and code, scoring an output meant paying humans to rank thousands of examples, which was slow and expensive. Second, the algorithm: the classic method, PPO, needs a separate reward model and a value model running alongside the one you are training, and it is famously touchy to tune. Third, the infrastructure: keeping several models in memory and stable across a long run pushed people toward multi-GPU clusters.

None of those walls is gone, but in 2026 each has a cheap way around it. That is the real story: not that RL became powerful (it always was) but that it became accessible.

What changed in 2026

Three shifts did the work.

Claude as the grader

Here is the move at the center of this guide. The reward does not have to come from humans or from a custom reward model you train. For most tasks, it can come from Claude.

You write a rubric: a short description of what a good answer looks like and how to score it. Then, for each output your model produces, you send Claude the rubric and the output and ask for a score, returned as structured JSON so your training code can read it directly. That number is the reward. This is LLM-as-a-judge, and using it to train another model is RLAIF.

[Figure: Reinforcement learning with Claude in the reward seat. Your model (open weights: Qwen, Llama, Gemma) generates several outputs. Claude, the grader, reads your rubric, scores each output, and returns JSON such as {"score": 0.82}. The scores are the reward, one number per output. The update step trains the model with GRPO (online) or DPO (from pairs).]
RL with Claude in the reward seat. Your open model is the policy; Claude is the reward. Each round, the model samples outputs, Claude scores them against your rubric and returns clean JSON, and the update makes the high-scoring behavior more likely. This is the same shape as Anthropic's Constitutional AI, the RLAIF method that trained Claude itself: AI feedback standing in for human labels.
Where this comes from. Using a model's judgment as the training signal is not a hack; it is how Claude was built. Anthropic's Constitutional AI uses a model's feedback, guided by a written set of principles, as the reward instead of human labels at scale. When you put Claude in the grader seat, you are running your own small version of that idea.
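
A minimal sketch of the grader step, under stated assumptions. The rubric wording is taken from this guide; `build_grader_prompt`, `parse_score`, `grade`, and the `call_model` parameter are hypothetical names. `call_model` stands in for whatever client call sends a prompt to Claude and returns the reply text; no real API is invoked here, and a stub plays the grader so the example runs offline.

```python
import json
from typing import Callable

# Rubric first (stable prefix), graded output last.
RUBRIC = (
    "You are grading model outputs for reinforcement learning. "
    "Score the output from 0.0 to 1.0 on how well it meets the rubric, "
    "and return ONLY JSON.\n\n"
    "A good answer is: correct and directly responsive; terse, with no "
    "hedging and no filler openers; concrete, with specifics over "
    "generalities.\n\n"
    'Return only: {"score": <number from 0.0 to 1.0>, '
    '"reason": "one short sentence"}'
)


def build_grader_prompt(prompt: str, output: str) -> str:
    """Assemble the grading request: rubric, then the prompt, then the output."""
    return f"{RUBRIC}\n\nPROMPT:\n{prompt}\n\nOUTPUT TO GRADE:\n{output}"


def parse_score(reply: str) -> float:
    """Read the JSON reply and return the score, rejecting out-of-range values."""
    data = json.loads(reply)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score


def grade(prompt: str, output: str, call_model: Callable[[str], str]) -> float:
    """Return the reward for one output. call_model is a placeholder (assumption)."""
    return parse_score(call_model(build_grader_prompt(prompt, output)))


def fake_model(grader_prompt: str) -> str:
    """Stub grader for this example only; a real run would call Claude here."""
    return '{"score": 0.82, "reason": "terse and direct"}'


reward = grade("What is LoRA?", "A small trainable adapter.", fake_model)
print(reward)
```

Keeping the model call behind a plain callable makes the parsing and range checks testable without network access, and makes it easy to swap in retries or batching later.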

Claude as the engineer

The grader is half of it. The other half is that you no longer have to write the training loop by hand. The same model can build the pipeline for you.

Working in a coding agent (Claude Code, or the Claude Agent SDK if you are wiring it into your own tools), you describe the run you want: a model, a dataset, DPO or GRPO, LoRA, and a grader. The agent scaffolds the script against a library like TRL or Unsloth, then helps you debug the inevitable shape mismatches and out-of-memory errors. The part that used to take a week of fiddling becomes an afternoon of review.

A recipe you can run this weekend

Putting it together, here is the shape of a real, small RL run. The decisions are simpler than they look.

[Figure: Two decisions for a 2026 RL run. Where does the reward come from? If a program can check it: a verifier (RLVR: math, code). If not: a Claude judge (RLAIF: everything else). Which algorithm? If you have better/worse pairs: DPO (simple, very stable). If not: GRPO (online RL). Then: a small open model plus LoRA on one GPU, with Claude writing the script. Tip: if Claude can generate the better/worse pairs for you, DPO is the easiest place to start.]
Two decisions cover most runs. Pick the reward source by whether the task is checkable, and the algorithm by whether you have preference pairs. A common, easy starting point: have Claude generate and label pairs of answers, then train with DPO. No reward model, no RL loop, and it rarely blows up.
  1. Pick a small model and a narrow task. A few-billion-parameter open model and one concrete skill (a tone, a format, a kind of reasoning). Narrow tasks show clear reward gains fast.
  2. Decide the reward. If a program can check the answer, write that verifier. If not, write a rubric and use Claude as the judge, returning a JSON score.
  3. Pick the algorithm. Have preference pairs (or can make them with Claude)? Use DPO. Want the model to explore and improve online? Use GRPO.
  4. Let Claude write the pipeline. In a coding agent, scaffold the run against TRL or Unsloth with LoRA, then review and debug it together.
  5. Train, watch the reward, and check on held-out cases. Reward should climb. Confirm real quality climbs with it on examples the model never trained against.
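
For step 2, when a program can check the answer, the verifier can be very small. The sketch below assumes a hypothetical convention in which the model is asked to end with a line like "Answer: 42"; the convention, the function name, and the tolerance are illustrative choices, not something this guide prescribes.

```python
import re


def verifier_reward(output: str, expected: float, tol: float = 1e-6) -> float:
    """Return 1.0 if the last 'Answer: <number>' in the output matches expected.

    Assumes the task prompt told the model to finish with 'Answer: <number>'.
    Missing or malformed answers score 0.0.
    """
    matches = re.findall(r"Answer:\s*(-?\d+(?:\.\d+)?)", output)
    if not matches:
        return 0.0
    return 1.0 if abs(float(matches[-1]) - expected) <= tol else 0.0


print(verifier_reward("2 + 2 is 4.\nAnswer: 4", expected=4))
print(verifier_reward("I think it is five.\nAnswer: 5", expected=4))
print(verifier_reward("It is probably four.", expected=4))
```

A binary reward like this is free to compute and hard to argue with, which is why checkable tasks are the easiest place to run RL. When no such check exists, the rubric-and-judge route from step 2 takes its place.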

The shortcut: hand it to your model

Everything above is the understanding. Here is the part that makes it real in about ten minutes: you do not have to write any of it yourself. Hand the brief below to a coding agent like Claude Code and it will set up the environment, train a tiny model, and show you a before-and-after on its own. That is the honest promise of RL in 2026: you can run your first one today.

Download the RL starter kit. A self-contained brief written for an LLM. Drop it into Claude Code and say "set this up and show me it worked." It covers a CPU-friendly DPO demo and an optional online-RL track with Claude as the grader.

Prefer to paste? This short version does the same job. Drop it into your coding agent:

You are setting up my first reinforcement-learning demo on an open model. Do the whole thing end to end and show me it worked.

1. Make a fresh Python venv and install: trl, transformers, datasets, peft, accelerate (add bitsandbytes only if a CUDA GPU is present). Tell me if you fall back to CPU.
2. Pick a small instruct model that fits this machine (start with Qwen3-0.6B or a similar sub-1B instruct model; go smaller if memory is tight).
3. Task: teach the model to answer in a terse, direct style with no hedging and no filler openers.
4. Build a tiny preference dataset (about 60 pairs): each has a prompt, a "chosen" terse answer, and a "rejected" hedgy answer. Generate them; keep them short.
5. Train with DPO + LoRA for a few hundred steps. Keep it CPU-friendly if there is no GPU.
6. Before-and-after: run 5 held-out prompts through the base model and the tuned model and print them side by side.
7. Report what changed, the final training loss, and what you would tune next.

Keep it minimal and runnable. Explain each file you create. You are training an open model, not Claude. Stop and ask me only if a step truly cannot proceed.

Two reusable pieces you will lean on again and again. First, the grader rubric that turns Claude into your reward signal:

You are grading model outputs for reinforcement learning. Score the output from 0.0 to 1.0 on how well it meets the rubric, and return ONLY JSON.

A good answer is: correct and directly responsive; terse, with no hedging and no filler openers; concrete, with specifics over generalities.

Return only: {"score": 0.82, "reason": "one short sentence"}  (score is a number from 0.0 to 1.0)

PROMPT:
[[the prompt goes here]]

OUTPUT TO GRADE:
[[the model output goes here]]

And the prompt that generates the preference pairs DPO learns from:

Generate 60 preference pairs to teach a model a terse, no-hedging style. Return a JSON array of objects with keys: prompt, chosen, rejected.
- prompt: a realistic user question (vary the topics)
- chosen: a short, direct answer, no hedging or filler
- rejected: the same answer made worse, padded with filler openers and vague qualifiers
Keep every answer under 80 words.
The point. A guide you read is worth less than a guide you run. The kit is built so the gap between "I understand RL" and "I trained a model" is one paste. If your agent gets stuck, paste the error back to it and let it fix the run.
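
Generated pairs are worth a quick sanity check before training. The sketch below validates the JSON array that the pair-generation prompt asks for: the three keys, non-empty strings, distinct chosen and rejected answers, and the under-80-words limit. The function name and the choice to drop bad rows instead of failing are assumptions.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def load_pairs(raw_json: str, max_words: int = 80) -> list:
    """Parse generated preference pairs and keep only well-formed ones."""
    clean = []
    for pair in json.loads(raw_json):
        if not isinstance(pair, dict):
            continue
        # All three fields must be present as non-empty strings.
        if not all(isinstance(pair.get(k), str) and pair[k].strip()
                   for k in REQUIRED_KEYS):
            continue
        # A pair with identical answers carries no preference signal.
        if pair["chosen"].strip() == pair["rejected"].strip():
            continue
        # The generation prompt asks for answers under max_words words.
        if any(len(pair[k].split()) >= max_words
               for k in ("chosen", "rejected")):
            continue
        clean.append({k: pair[k] for k in REQUIRED_KEYS})
    return clean


sample = json.dumps([
    {"prompt": "What is LoRA?",
     "chosen": "A small trainable adapter on frozen weights.",
     "rejected": "Great question! It could be said that LoRA is, in a sense, "
                 "a kind of adapter."},
    {"prompt": "Missing rejected", "chosen": "Yes."},
    {"prompt": "Same both ways", "chosen": "No.", "rejected": "No."},
])
pairs = load_pairs(sample)
print(len(pairs))
```

The prompt, chosen, rejected layout matches the keys the generation prompt requests, so the cleaned list can be handed to whichever DPO trainer the agent scaffolds.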

Making Claude-grading cheap

If Claude grades every output across thousands of training steps, cost is a fair worry. Three levers, stacked, make it small. They are worth knowing because they turn grading from a budget line into a rounding error.

[Figure: Relative cost per grade. Baseline: the strongest model, grading one by one. Lever 1: use the fast tier (Haiku 4.5), about 5x cheaper than the top model. Lever 2: grade with the Batch API, another 50 percent off. Lever 3: prompt-cache the rubric, so repeated input drops to about a tenth.]
The three levers, stacked. Drop from the strongest model to the fast tier (Claude Haiku 4.5, roughly $1 per million input tokens and $5 per million output) for a large cut; send grades through the Batch API for another 50 percent; and prompt-cache the rubric, which is identical on every grade, so its tokens cost about a tenth of normal input. Together, grading thousands of outputs can land in the single-digit dollars.
Two more grading tips. Ask for the score as structured JSON so your code never has to parse prose, and keep the rubric and instructions first in the prompt (the stable prefix) with the output last, so prompt caching actually hits. For nuanced rubrics, a mid-tier model with adaptive thinking grades better; for simple checks, the fast tier with no thinking is plenty.
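
A back-of-the-envelope estimate makes the stacking concrete. The prices and discounts are the approximate figures quoted in this guide; the token counts per grade are made-up assumptions, and the sketch ignores cache-write cost, cache misses, and any minimum cacheable prompt length, so treat the result as a rough lower bound and not a quote.

```python
def grading_cost_usd(
    n_grades: int,
    rubric_tokens: int,      # stable prefix, identical on every grade
    graded_tokens: int,      # the prompt plus model output being graded
    reply_tokens: int,       # the grader's JSON reply
    input_price: float = 1.0,    # USD per million input tokens (fast tier)
    output_price: float = 5.0,   # USD per million output tokens (fast tier)
    use_cache: bool = True,
    use_batch: bool = True,
    cached_input_factor: float = 0.1,  # cached input at about a tenth
    batch_discount: float = 0.5,       # Batch API at 50 percent off
) -> float:
    """Rough grading cost, assuming the discounts stack as described above."""
    rubric_factor = cached_input_factor if use_cache else 1.0
    rubric_cost = n_grades * rubric_tokens / 1e6 * input_price * rubric_factor
    graded_cost = n_grades * graded_tokens / 1e6 * input_price
    reply_cost = n_grades * reply_tokens / 1e6 * output_price
    total = rubric_cost + graded_cost + reply_cost
    if use_batch:
        total *= 1.0 - batch_discount
    return total


# Hypothetical run: 10,000 grades, 300-token rubric, 200 graded tokens,
# 40-token JSON reply.
plain = grading_cost_usd(10_000, 300, 200, 40, use_cache=False, use_batch=False)
stacked = grading_cost_usd(10_000, 300, 200, 40)
print(f"one by one: ${plain:.2f}, cached + batched: ${stacked:.2f}")
```

With these assumed token counts, the fast tier alone comes to about $7.00 for 10,000 grades, and caching plus batching brings it to about $2.15. The longer the rubric is relative to the graded text, the more the caching lever matters.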

The catches

RL rewards exactly what you measure, which is its power and its danger. Keep these in view.

Where this is going

The frontier is moving toward RL environments: rich, simulated tasks with a built-in reward, used to train models and agents at scale. The bottleneck has shifted from data to verifiers, because you can only reinforce what you can grade, and a model-as-judge is now the most general grader there is. The same lever that makes a weekend fine-tune cheap (a strong model producing the reward) is what makes large-scale agentic RL possible at all.

The hard part of reinforcement learning was never the learning. It was the reward. In 2026, a good model can write that reward for you.

Frequently asked questions

What is reinforcement learning for LLMs?

Reinforcement learning (RL) trains a model by rewarding good outputs instead of showing it correct answers to copy. For a language model, the model (the policy) generates an output, a reward signal scores how good it was, and the model is nudged toward outputs that score higher. It is how raw models are turned into helpful assistants and how reasoning models are taught to think.

What is RLHF, and how is RLAIF different?

RLHF, reinforcement learning from human feedback, trains a reward model on human preference judgments, then uses RL to optimize the model against that reward. RLAIF, reinforcement learning from AI feedback, replaces the human labels with judgments from a strong model. Anthropic's Constitutional AI is the canonical RLAIF method, and the same idea is what lets you put Claude in the reward seat for your own training.

Can you fine-tune or RL-train Claude?

Not its weights. Anthropic does not offer reinforcement learning or weight fine-tuning of Claude to customers; you use Claude through the API. The accessible RL path in 2026 is to train an open-weight model such as Qwen, Llama, or Gemma, and use Claude in two roles around it: as the grader that produces the reward, and as the coding agent that writes the training pipeline.

How do you use Claude as a reward model?

You write a rubric describing what a good answer looks like, then for each model output you send Claude the rubric and the output and ask for a score, returned as structured JSON so it is easy to parse. That score becomes the reward. It is called LLM-as-a-judge, and it is the practical way to get a reward signal for tasks where no program can check the answer.

What is the difference between DPO, GRPO, and PPO?

PPO is the classic, powerful, and fiddly RL algorithm behind original RLHF; it needs a separate reward model and a value model. DPO, direct preference optimization, skips the RL loop entirely and trains directly on pairs of better and worse answers, which is simpler and very stable. GRPO, group relative policy optimization, is a lighter PPO variant that drops the value model and made online RL cheap and popular. Start with DPO if you can make preference pairs; reach for GRPO for online RL.
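
To make the difference concrete, here is a sketch of the quantity at the heart of each of the two lighter methods: the per-pair DPO loss, and the group-relative advantage that lets GRPO do without a value model. Both are written for single numbers in plain Python; real trainers compute them over batches of token log-probabilities, and implementations may differ on details such as which standard deviation they use. The function names and the default beta are assumptions.

```python
import math
from statistics import mean, pstdev


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    The margin is how much more the policy prefers the chosen answer over
    the rejected one, relative to a frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable softplus(-x), equal to -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def grpo_advantages(rewards, eps=1e-8):
    """Score each sampled output against its own group of samples.

    The group mean acts as the baseline, which is why no value model is
    needed. Population standard deviation is used here as a simplification.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# No preference yet: policy matches the reference, loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy now favors the chosen answer more than the reference does.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
# Four samples for one prompt, with grader scores as rewards.
print(grpo_advantages([0.2, 0.8, 0.5, 0.5]))
```

DPO needs only log-probabilities of fixed pairs, so nothing is sampled during training. GRPO samples several outputs per prompt and reinforces the ones that beat their group's average.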

How much compute do you need to do RL on an LLM?

Far less than you would think. With a small open model (a few billion parameters), LoRA or QLoRA to train only a thin adapter, and a method like DPO or GRPO, a single modern GPU is enough for a real run. The expensive, multi-node setups are for frontier-scale work, not for learning or for most practical fine-tunes.
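
The "thin adapter" claim is easy to check with arithmetic. LoRA replaces the update to a d_out x d_in weight matrix with two low-rank factors, so it trains rank * (d_in + d_out) parameters for that matrix. The layer size and rank below are illustrative assumptions, not figures for any particular model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when fine-tuning the whole weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter: B is d_out x r, A is r x d_in."""
    return rank * (d_in + d_out)


# Hypothetical 4096 x 4096 projection with a rank-16 adapter.
full = full_params(4096, 4096)
lora = lora_params(4096, 4096, rank=16)
print(full, lora, lora / full)
```

For this assumed layer the adapter trains under one percent of the parameters of the full matrix, which is where most of the memory saving for gradients and optimizer state comes from.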

Is it expensive to use Claude as a grader?

It can be very cheap. Use the fast tier, Claude Haiku 4.5 (about $1 per million input tokens and $5 per million output), instead of the strongest model; send grades through the Batch API for 50 percent off; and prompt-cache the rubric so the repeated part of every grade costs about a tenth of normal input. Together these can make grading thousands of outputs cost a few dollars.

Do you still need human labels for RL?

Often no. For tasks with a checkable answer (math, code, format), a program verifies the output for free. For everything else, a strong model acting as the judge can replace most human labeling, which is the whole point of RLAIF. Humans are still valuable for writing the rubric, spot-checking the grader, and judging the hardest or highest-stakes cases.

What is reward hacking?

Reward hacking is when the model learns to maximize the reward without actually doing the task: it finds a loophole in the grader. With an AI judge this might mean outputs that sound confident and well-formatted but are wrong. The defenses are a clear rubric, a held-out check the model was not trained against, and watching for scores that climb while real quality does not.
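
The third defense, watching for scores that climb while real quality does not, can be automated as a simple alarm. The sketch below compares the training reward with an independent held-out score (a human spot-check or a separate check the model was not trained against) over the last few evaluations. The function name, window, and thresholds are hypothetical choices.

```python
def reward_hacking_suspected(train_rewards, heldout_scores,
                             window=3, min_gain=0.05):
    """Flag runs where training reward rose but held-out quality did not.

    Both lists hold one value per evaluation checkpoint, oldest first.
    """
    if len(train_rewards) != len(heldout_scores):
        raise ValueError("need one held-out score per training reward")
    if len(train_rewards) <= window:
        return False  # not enough history to compare
    train_gain = train_rewards[-1] - train_rewards[-1 - window]
    heldout_gain = heldout_scores[-1] - heldout_scores[-1 - window]
    return train_gain >= min_gain and heldout_gain <= 0.0


# Reward keeps climbing while held-out quality stalls, then dips.
suspicious = reward_hacking_suspected(
    [0.40, 0.50, 0.60, 0.70, 0.80],
    [0.40, 0.50, 0.50, 0.50, 0.48],
)
# Reward and held-out quality rise together.
healthy = reward_hacking_suspected(
    [0.40, 0.50, 0.60, 0.70, 0.80],
    [0.40, 0.45, 0.50, 0.55, 0.60],
)
print(suspicious, healthy)
```

An alarm like this only tells you to look. Reading a handful of high-scoring outputs by hand is still the quickest way to see what loophole, if any, the model found.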

Glossary

Reinforcement learning (RL)
Training a model by rewarding good outputs and nudging it toward them, rather than showing it answers to imitate.
Policy
The model being trained. It produces the outputs that get scored.
Reward
A number saying how good an output was. The signal the model learns to maximize.
RLHF
Reinforcement learning from human feedback: a reward model trained on human preferences, then RL against it.
RLAIF
Reinforcement learning from AI feedback: a strong model's judgment replaces the human labels. Putting Claude in the reward seat is RLAIF.
RLVR
Reinforcement learning from verifiable rewards: the reward comes from a program that checks the answer (math, code).
Reward model
A model trained to predict a reward score. Classic RLHF trains one; RLAIF and DPO can avoid it.
LLM-as-a-judge
Using a capable model to score or rank outputs against a rubric, for evaluation or as a training reward.
DPO
Direct preference optimization: trains directly on better/worse answer pairs, no reward model and no RL loop. Simple and stable.
GRPO
Group relative policy optimization: a lighter PPO that drops the value model, which made online RL cheap and popular.
PPO
Proximal policy optimization: the classic, powerful RL algorithm behind original RLHF. Effective but fiddly.
LoRA
Low-rank adaptation: train a small adapter instead of the full model, so a fine-tune fits on one GPU.
Preference pair
Two answers to the same prompt, one marked better, the unit DPO learns from. Claude can generate and label these.
Reward hacking
When the model maximizes the reward without doing the task, by exploiting a flaw in the grader.
Constitutional AI
Anthropic's RLAIF method: a model's feedback, guided by written principles, is the reward instead of human labels. How Claude is trained.

Where to go next

You now have the whole picture: the loop, why it got easy, how to put Claude in the reward seat, a runnable recipe, and the costs and traps. Three directions from here.

To see what RL produces at the frontier, read how reasoning models work, which is built on exactly this recipe (RL on verifiable rewards). To put a trained model to work, the guide to AI agents covers the loop that wraps a model in tools and actions. And for the API surface behind using Claude as a grader at scale, our agent interface documents the same tools we expose to machines.

For the daily moves in models, training methods, and tooling, the daily briefing reads the wire so you do not have to, and closes each edition with one falsifiable call we settle in public. This guide is part of The Primer, our growing library of ground-up explainers, re-checked against the live landscape each month so the details stay current.
