# Never Price a Model in Its First Month

> Serving costs for a frontier model collapse by orders of magnitude in the six weeks after launch. Lock your stack in week one and you lock in the peak.

- Published: 2026-06-14
- Author: Oday Brahem
- Canonical URL: https://www.nextbig.dev/blog/never-price-a-model-in-its-first-month

The price you pay to run a frontier model in its first week is the most you will ever pay to run it. Not the floor. The ceiling. Every cost curve we have watched this year says the same thing: the served cost of a given model collapses by one to two orders of magnitude in the weeks after launch, then flattens near a hardware floor. Builders keep treating launch-week economics as the baseline. It is the peak.

So here is the position. Do not lock anything in a model's first month. Not your default model, not your margins, not your GPU reservation. The number that matters is not today's price, it is the slope of the curve under it, and that slope is steepest right after a launch. This week handed us a live test. Anthropic shipped Claude Fable 5 at $10 in and $50 out, Cursor crowned it the new coding default, and SemiAnalysis published a trace showing a 1.6T model fall 100x in less than a month. The lesson is not which model won. The lesson is timing.

## Run the 100x honestly

SemiAnalysis traced DeepSeekV4's 1.6T inference cost from [day 0 to day 43 across GB300 and MI355X](https://semianalysis.substack.com/p/deepseekv4-16t-day-0-to-day-43-performance) and found per-million-token cost dropping 100x in 26 days. Take that apart before you trust it. A clean 100x over 26 days is roughly a 19% cut every single day. No serving stack sustains that. It is a settling curve, not a constant rate. Most of the drop lands early, then the line bends toward a hardware floor and stays there.

That shape is the whole point. The collapse is front-loaded because the easy wins are front-loaded: a better kernel, a quantization pass, a batching scheme, a migration to the right accelerator. The trace shows hardware choice alone swinging the bill by orders of magnitude. None of that work is done on launch day. It is done in the six weeks after, by the inference team racing down the curve while you decide whether to commit.

## The levers are landing in public

You can watch the levers ship in real time. Google open-sourced [DiffusionGemma](https://x.com/GoogleDeepMind/status/2064741061352636762), a 26B diffusion model that denoises 256-token blocks in parallel instead of crawling token by token, and it arrived with [native vLLM support on day one](https://x.com/vllm_project/status/2064753414735900835). Community benchmarks put it near [1,000 tokens per second on a single H100, roughly 4x its autoregressive peers](https://x.com/mervenoyann/status/2064753402064601181). A 4x decode speedup is a 4x cut in GPUs per unit of throughput. That is the curve bending in front of you.

Training moved the same week. Nvidia showed [NVFP4 training Llama 3 up to 1.73x faster than FP8](https://x.com/NVIDIAAI/status/2064105188219134041) with no accuracy loss on Blackwell. Cheaper training feeds cheaper iteration, and cheaper iteration feeds the price you pay downstream. Every one of these wins is portable. Diffusion decoding, four-bit precision, and parallel blocks are not lab secrets. They get applied to whatever weights are hot, which means the next frontier model inherits the curve the last one paid to discover.

## Fable 5 is mispriced by design

Now apply this to the model everyone is evaluating. Fable 5 launched at $10/$50, twice the cost of the model it replaces. SemiAnalysis is already [stress-testing the $200/month coding plans](https://x.com/SemiAnalysis_/status/2064815044085318040) to find the real compute caps, and independent evals are circling the price. One [eval found Fable matches GPT-5.5 on 98% of coding tasks at 2x the cost](https://x.com/bindureddy/status/2064425878080327730), which means routing only the hardest 2% to Fable preserves quality and halves the bill. Cursor's own board lists [Fable 5 Max at 72.9% for $18 a run and Fable 5 High at 70.6% for $10.81](http://cursor.com/evals). Those are launch numbers. They are the most expensive those scores will ever be.

The price is a placeholder, and Anthropic has told us so without saying it. The same weights serve both Fable and the restricted Mythos build, so the serving cost is shared with the flagship, and the company is reportedly moving to own its servers to attack its largest expense. A $50 output price set before that buildout lands is a number waiting to fall. Anyone who signs a fixed-cost integration on it this week is locking the peak into their P&L.

> "The launch-week price of a frontier model is the most you will ever pay to run it. Treat it as a peak, not a baseline."

## The contract is where this hurts

The expensive mistake is not a model default. You can change that with a config edit. The expensive mistake is hardware you committed to on last month's math. Neoclouds are selling multi-year capacity hard. [Crusoe is nearing 5 GW contracted with a 40 GW pipeline](https://x.com/CrusoeAI/status/2064366518901874978), and one widely read post this week argued [xAI now looks more like a datacenter REIT than a frontier lab](https://martinalderson.com/posts/xais-new-rental-business/). That supply is real and useful. The trap is the term sheet.

If you size a reservation on autoregressive decode throughput, and block decoding cuts your tokens-per-GPU need by 4x two months later, you are paying for capacity you no longer use. The settling curve does not care that your contract is signed. Run your reservation math twice: once on today's throughput, once on a 4x decode assumption, and commit only to the spread you would still want if the optimistic number lands. Reserve the floor, buy the rest on demand.

## Waiting is now a real option

Waiting used to mean shipping a worse product. It does not anymore, because the same curve lifts the floor. Stanford data this week put [local models answering 71% of queries accurately, up from 23% in 2023](https://x.com/ClementDelangue/status/2064039913843286318). Apple shipped a [20B model that fits in device RAM](https://x.com/awnihannun/status/2064202168618422396) through aggressive compression. DiffusionGemma is Apache-licensed and runs on consumer GPUs today. The model you can self-host in eight weeks will do what the frontier API did at launch, at a cost you control rather than one you negotiate.

So the "wait six weeks" discipline is not passive. It is a portfolio. Keep one open-weight model warm enough to serve real traffic, route the hardest fraction of prompts to the frontier API, and let the settling curve pull the blended cost down underneath you. The teams that win this year are not the ones running the strongest model everywhere. They are the ones who priced the curve correctly and refused to pay the peak.

## What to do this week

- Do not sign a fixed-price model integration in a model's first 30 days. Benchmark on launch weights, but assume the cost you commit to is the highest you will ever pay.
- Run GPU reservation math on two throughput numbers: today's autoregressive decode, and a 4x block-decode assumption. Reserve the floor, buy the spread on demand.
- Put a provider-abstraction layer in front of every endpoint so changing models is a config edit, not a sprint.
- Benchmark DiffusionGemma on your latency-critical path before your next capacity contract, not after.
- Route by difficulty. Send the cheap 98% to the cheaper model and reserve the frontier call for the 2% that needs it.
- Re-run your eval the week a model turns six weeks old. That is when the real price shows up.

## Our Call

By August 15, 2026, you will be able to serve Fable-5-launch-quality output for under $10 per million tokens, at least 5x below its $50 launch price, through some mix of Anthropic price cuts, third-party hosts, and open-weight models that match its launch benchmarks. Launch week was the peak. This Call is wrong if, on August 15, 2026, the cheapest route to matching Fable 5's June launch benchmark scores still costs more than $10 per million output tokens.

---
Cite as: "Never Price a Model in Its First Month" — nextbig.dev, https://www.nextbig.dev/blog/never-price-a-model-in-its-first-month