# Reinforcement Learning Starter Kit

A brief written for your LLM. Hand this whole file to a coding agent (Claude Code, or any capable model with a terminal) and say: "Set this up and show me it worked."

From The Primer at nextbig.dev/learn/how-to-do-reinforcement-learning

## What you are building

A minimal, runnable reinforcement-learning demo that fine-tunes a small open-weight model and proves it changed, with a clear before-and-after. Two tracks; pick by the hardware available:

- Track A (DPO, easiest, runs on CPU): train on preference pairs. No reward model, no API key.
- Track B (GRPO with Claude as the grader; needs an Anthropic API key, GPU recommended): online RL where Claude scores each output.

Default to Track A unless the user has a GPU and an ANTHROPIC_API_KEY and asks for online RL.

## Ground rules for you, the agent

- Keep it minimal and actually runnable on this machine. Detect GPU vs CPU and scale down to fit.
- Create real files, explain each one, and run the training yourself.
- You are training an OPEN model. You are NOT training Anthropic's Claude. If Track B is used, Claude is only the grader.
- Read any API key from the environment. Never hard-code a secret.
- Stop and ask the user only if a step genuinely cannot proceed.

## Track A: DPO in seven steps

1. Make a fresh Python venv and install: trl, transformers, datasets, peft, accelerate. Add bitsandbytes only if a CUDA GPU is present.
2. Choose a small instruct model that fits memory. Start around Qwen3-0.6B or another sub-1B instruct model; go smaller if needed.
3. Pick the skill to teach: a terse, direct answering style with no hedging and no filler openers.
4. Build a tiny preference dataset, about 60 pairs. Each pair has a prompt, a "chosen" short direct answer, and a "rejected" answer that is the same content padded with hedging and filler. Generate them with the pair prompt in the Appendix.
5. Train with DPO + LoRA for a few hundred steps. Keep sequence lengths short. On CPU, shrink steps and batch size so it finishes in minutes.
6. Evaluate: run 5 held-out prompts (not in training) through the base model and the tuned model. Print them side by side.
7. Report: show the before-and-after, the final training loss, and one or two things you would tune next.

## Track B: GRPO with Claude as the grader (optional)

Same setup, but instead of fixed pairs the model generates several answers per prompt and Claude scores each one; the scores are the reward, trained with the GRPO trainer in trl.

- The reward function calls the Anthropic API with the rubric in the Appendix and parses the JSON score.
- Make grading cheap: use the fast model (Claude Haiku 4.5), send grades through the Batch API where possible, and keep the rubric first in the prompt so prompt caching hits.

## Done means

A terminal run that ends by printing 5 before-and-after pairs where the tuned model is visibly terser, plus the training loss. If the change is not visible, say so honestly and suggest more steps or more pairs.

## If something breaks

- Out of memory: smaller model, shorter max sequence length, smaller batch, or 4-bit loading (QLoRA).
- No GPU: a sub-1B model with a few hundred steps still shows the effect on CPU; just be patient.
- Loss not moving: check that the chosen and rejected answers actually differ in the intended way.

## A word on cost and safety

Track A is free after the model download. Track B costs only the grading calls, which the fast model plus batching plus caching keep small. Review any code before you run it.

## Appendix: copy-paste prompts

### Generate preference pairs (Track A)

    Generate 60 preference pairs to teach a model a terse, no-hedging style.
    Return a JSON array of objects with keys: prompt, chosen, rejected.
    - prompt: a realistic user question (vary the topics)
    - chosen: a short, direct answer, no hedging or filler
    - rejected: the same answer made worse, padded with filler openers and vague qualifiers
    Keep every answer under 80 words.

### Grader rubric (Track B: Claude as judge)

    You are grading model outputs for reinforcement learning. Score the output
    from 0.0 to 1.0 on how well it meets the rubric, and return ONLY JSON.

    A good answer is: correct and directly responsive; terse, with no hedging
    and no filler openers; concrete, with specifics over generalities.

    Return only: {"score": 0.82, "reason": "one short sentence"}
    (score is a number from 0.0 to 1.0)

    PROMPT:
    [[the prompt goes here]]

    OUTPUT TO GRADE:
    [[the model output goes here]]
