Capable AI got cheaper all week. This weekend, builders caught the newest models fumbling the simplest job: using the tools you give them
A holiday-quiet Sunday, and the story that matters is a builder's complaint. This weekend Armin Ronacher caught Anthropic's newest models, Opus 4.8 and Sonnet 5, getting worse at using third-party tools, and a thread on OpenAI's Codex repo reported the latest Codex model degrading inside the harness, even as the models get better at everything else. The week made capable AI cheap; the weekend showed the difficulty moved into the harness around the model. Plus Jim Keller's Atomic Semi rebrands as Fab2 to mass-produce small fabs.
It's the Sunday after the Fourth, the wire is thin, and the story that matters is a builder complaining that the best models just got worse at using his tools.
Independence week is over. Here's the rundown: how the newest models started fumbling other people's tools, why that flips the whole week, and where Jim Keller wants to take chipmaking next.
Armin Ronacher was hacking on his own coding agent, Pi, and watched Anthropic's newest models, Opus 4.8 and Sonnet 5, call his edit tool with fields he never put in the schema. The edits were usually right. The made-up fields got the whole call rejected and retried.
And his older models didn't do that. Simon Willison passed it along with a blunt line: the newest models in the family are worse at this one tool schema than their older siblings.
Be fair about what this is. The models didn't get worse at coding. They got better. That same weekend, Willison shipped a release of his sqlite-utils library mostly written by Claude Fable, for about a hundred and fifty dollars in tokens.
So the regression is narrow, and that's what makes it interesting. Ronacher's best guess is that the labs tuned these models hard on their own harnesses, Anthropic on Claude Code, and that quietly made them worse at everyone else's tools.
Capability went up. The ability to plug into a tool someone else built went down.
And it isn't only Anthropic. The same weekend, a busy issue on OpenAI's Codex repo said its newest model's reasoning had started degrading inside the harness. Two labs, two flagships, the same odd result.
For a week that was all about capable AI getting cheap and portable, this is the counter-current. The model stopped being the scarce, expensive part this month. The scaffolding around it did not.
So the practical read for anyone building is a budget shift. A month ago you watched the model cost. Now the model is nearly free, and the cost moved into the harness: the retries, the schema mismatches, the fallback model you keep wired in for the calls that can't miss.
Pin the model versions your tools were tested against, because newer is no longer strictly better for tool use. That's the strange part. You might hold an older model on purpose.
And one floor down, the independence story kept going. Jim Keller's chip startup, Atomic Semi, rebranded to Fab2 and moved to Texas. The pitch is a fab that builds small fabs, and the tools inside them, and then mass-produces the whole factory.
It's seed-stage and years from proof. But it's the deepest version of the week's theme. Independence from one model, then one chip vendor, ends at the fab. Fab2 wants to make the fab a product instead of a chokepoint.
To the tape. We moved Micron to a long. The memory call we made on June 28th keeps confirming: TrendForce sees DRAM and NAND climbing through the third quarter on AI demand, and Micron is selling into the build-out, not the PC cycle.
We're watching Fab2, low conviction, private, a name not a position. And we're holding Nvidia on watch. Nothing today moves it, but value is migrating up to the harness and down to memory and fabs, and the GPU trade sits in the middle of that squeeze.
The tape is the desk's scorecard, not advice.
Quick break — two from the desk.
One we know well: vote dot direct. If you're on an H O A or a board, it runs your elections digitally — secure, verifiable, no paper, no clipboard in the lobby. Point your council to vote dot direct.
And if this is your ten minutes of A I for the day, get the written edition too. The full wire, free, every morning — leave your email at nextbig dot dev.
Our call: within six months, at least one major lab publicly admits that tuning its models on its own coding agent made them worse at other people's tools, and ships a named fix for it.
What proves us wrong: if by January fifth no lab has owned that regression and shipped a fix, and builders are still working around invented tool fields with retries.
The models are cheap and brilliant now. The bill this weekend was reliability, the boring kind, at the machine interface. That's the rundown, and that's the week.
The wire is holiday-quiet, and the one story on it that matters is a builder's complaint. Armin Ronacher, hacking on an agentic coding tool of his own called Pi, watched Anthropic's newest models, Opus 4.8 and Sonnet 5, call his edit tool with fields he never wrote into the schema. The edits themselves were usually correct. The invented keys were the trouble: Pi rejected the calls and made the model try again. His older models had not done this. Simon Willison, who passed the note along, put the finding in one line: the newest models in the family are worse at this one tool schema than their older siblings.
Be fair about what this is and isn't. The models did not get worse at coding; they got better, and the same weekend proved it: Willison shipped a release of sqlite-utils, his widely used database library, mostly written by Claude Fable, for about $149 in tokens. The regression is narrow, which is what makes it worth reading. Ronacher's best guess is that the labs tuned these models hard on their own coding harnesses, Anthropic on Claude Code, and that the tuning quietly degraded them on everyone else's tools. Capability went up. The ability to plug into a tool someone else built went down.
It is not only Anthropic. The same weekend, a heavily upvoted issue on OpenAI's Codex repository argued that its newest model's reasoning had started clustering in a way that degrades performance inside the harness. Two labs, two flagships, the same odd result: the best model a lab ships is increasingly the one shaped for the lab's own tools, and increasingly clumsy inside anyone else's. For a week that was about capability getting abundant and portable, this is the correction. The model stopped being the scarce, expensive input this month. The scaffolding it plugs into is where the difficulty moved.
For anyone building on these models, the practical read is a budget shift. A month ago the model was the line item you watched. Now it is nearly free, and the cost has slid into the scaffolding around it: the retries, the schema mismatches, the harness you keep running so a brilliant model can be trusted to fill in a form. Pin the model versions your tools were tested against, because newer is no longer strictly better for tool use. Keep an older model wired in as the fallback for the calls that cannot miss.
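The pin-and-fallback advice can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not anyone's actual harness: the model identifiers are placeholders, and `call_model` and `validate` are hypothetical hooks standing in for your own client and schema check.

```python
from typing import Callable

# Hypothetical identifiers. Replace them with the exact model versions
# your tools were tested against; never point at a floating "latest" alias.
PRIMARY_MODEL = "newer-model-pinned-version"
FALLBACK_MODEL = "older-model-pinned-version"

ToolCall = dict[str, object]
CallModel = Callable[[str, str], ToolCall]   # (model_id, prompt) -> tool-call arguments
Validate = Callable[[ToolCall], list[str]]   # returns problems; empty list means accepted


def call_with_fallback(
    prompt: str,
    call_model: CallModel,
    validate: Validate,
    max_retries: int = 2,
) -> tuple[str, ToolCall, int]:
    """Try the pinned primary model, then fall back to the pinned older one.

    Returns (model_used, tool_call, attempts) so the retry cost is visible.
    """
    attempts = 0
    # One initial attempt plus max_retries retries on the primary model.
    for _ in range(max_retries + 1):
        attempts += 1
        call = call_model(PRIMARY_MODEL, prompt)
        if not validate(call):
            return PRIMARY_MODEL, call, attempts
    # The primary model kept producing rejected calls: use the fallback once.
    attempts += 1
    call = call_model(FALLBACK_MODEL, prompt)
    problems = validate(call)
    if problems:
        raise ValueError(f"fallback call rejected: {problems}")
    return FALLBACK_MODEL, call, attempts
```

Returning the attempt count is a deliberate choice in this sketch: it turns the retries into a number you can log and budget for, which is the line item the paragraph above says has replaced model cost.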
Willison spent about $149 and got a working release out of Claude Fable. Ronacher spent his weekend watching a smarter model get rejected for inventing fields that were never in the schema. Both are the frontier right now, and the second is the bill the cheap models are running up while almost nobody writes it down.
The newest models invent fields in tools they were told to use
Armin Ronacher, building an agentic coder called Pi, found that Opus 4.8 and Sonnet 5 call his edit tool with extra keys that aren't in the schema. The edits are usually right; the fabricated fields get the whole call rejected and retried. Older Claude models handled the same schema cleanly. His read is that tuning the models on Anthropic's own Claude Code harness made them worse at third-party tools. The models are more capable and less interoperable at the same time, and for anyone whose product is a harness, that trade is the whole story.
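For readers who have not built a harness, the failure looks roughly like this. The sketch below is an illustration in Python, not Pi's code: the schema and its field names are hypothetical, and a production harness would more likely lean on a JSON Schema validator than a hand-rolled check.

```python
from typing import Any

# Hypothetical schema for an edit tool. The field names are illustrative
# and are not taken from Pi.
EDIT_TOOL_SCHEMA: dict[str, dict[str, type]] = {
    "required": {"path": str, "old_text": str, "new_text": str},
    "optional": {},
}


def validate_tool_call(
    args: dict[str, Any], schema: dict[str, dict[str, type]]
) -> list[str]:
    """Return a list of problems; an empty list means the call is accepted."""
    allowed = {**schema["required"], **schema["optional"]}
    problems = []
    # Strict mode: any key outside the schema rejects the whole call,
    # even when the edit it carries is correct.
    for key in sorted(args):
        if key not in allowed:
            problems.append(f"unknown field: {key}")
    for key in schema["required"]:
        if key not in args:
            problems.append(f"missing field: {key}")
    for key, value in args.items():
        if key in allowed and not isinstance(value, allowed[key]):
            problems.append(f"wrong type for field: {key}")
    return problems
```

A call carrying a valid edit plus one invented key, say `reason`, comes back with a single problem, `unknown field: reason`, and the harness sends the model around again. Dropping unknown keys instead of rejecting them is the obvious lenient alternative; whether it is safe depends on the tool.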
OpenAI's newest Codex model draws the same complaint the same weekend
A heavily upvoted issue on OpenAI's Codex repository argues the latest model's reasoning-token behavior has started degrading its performance inside the coding harness. Take it as community signal, not a lab admission. But it lands on the same shape as Ronacher's Claude finding: the flagship a lab optimizes for its own agent is the one that behaves worst in someone else's. When two labs draw the same complaint in a single weekend, it stops being a quirk and starts being the cost of how frontier models are now trained.
Jim Keller's chip startup now wants to mass-produce the factory itself
Atomic Semi, founded by chip architect Jim Keller and DIY-fab prodigy Sam Zeloof, rebranded to Fab2 and moved to Texas, recasting itself as a "fab fab": a company that builds small semiconductor fabs and every tool inside them, then aims to mass-produce the fabs. It is seed-stage and years from proof, backed by the OpenAI Startup Fund with angels including Naval Ravikant and Nat Friedman. But it is the deepest version of the week's theme. Independence from one model, then one chip vendor, ends at the fab. Fab2 is an attempt to make the fab itself a product instead of a chokepoint.
The memory squeeze the desk flagged a week ago is still climbing
TrendForce now expects DRAM and NAND prices to keep rising through the third quarter on AI demand, even as consumer buyers hit the ceiling of what they will pay. Micron showed its first PCIe Gen6 data-center SSD at Computex, aimed squarely at the AI build-out; DIY makers are hand-threading core memory to route around the crunch, and cheaper Chinese YMTC drives are turning up in retail Lenovo laptops. On June 28 we made the call that the AI-hardware story would stop being a GPU story alone. A week on, the memory line is still paying.
The models keep getting cheaper and smarter, and this weekend two labs' newest flagships drew the same kind of complaint: they got clumsier inside the harness. Armin Ronacher documented Anthropic's Opus 4.8 and Sonnet 5 calling his edit tool with fields that don't exist; a busy issue on OpenAI's Codex repo described a related in-harness regression in its latest model. The likely cause, on Ronacher's read, is that each lab tuned its model on its own harness, which made it worse at everyone else's. The lesson isn't to stop upgrading; it's to stop assuming newer is better for the load-bearing work of tool use, and to treat the harness (the retries, the schema checks, the fallback model) as the part you own. Capability is close to free now. Reliability at the machine interface is not, and that is where the next year of engineering work, and margin, will sit.
Within six months, at least one major AI lab publicly concedes that tuning its models on its own coding agent made them worse at other people's tools, and ships a named fix: a compatibility mode, a published tool-calling contract, or retraining aimed at third-party harnesses.
Two independent reports landed the same weekend. Armin Ronacher showed Opus 4.8 and Sonnet 5 inventing fields in his edit tool, worse than their older siblings, and traced it to first-party-harness tuning; a busy issue on OpenAI's Codex repo described a similar in-harness regression. The failure surfaced the moment the model itself stopped being the scarce, expensive input. Once the harness is the competitive surface, tool interoperability is the pressure that follows, and admitting the regression is cheaper than losing the developers who build on top of you.
If, by January 5, 2027, no major lab has publicly acknowledged that first-party-harness tuning hurt third-party tool use and shipped a specific fix for it, and builders are still papering over invented tool fields with retries, the call is wrong.
The AI-memory thesis we put out on June 28 keeps confirming. TrendForce sees DRAM and NAND climbing through Q3 on AI demand even as consumer buyers tap out, and Micron brought its first PCIe Gen6 data-center SSD to Computex, selling into the build-out rather than the PC cycle. Memory is decoupling from the consumer refresh and re-pricing on AI, and Micron is the cleanest US-listed way to hold that.
When memory tightness is driven by AI capacity rather than PC seasonality, the pricing power lands on the makers, and ASPs and margins follow. The near-term risk is real: the consumer side is at an affordability ceiling, so a demand air-pocket would hit volumes even if AI holds the floor under price.
Fab2 is the long-horizon independence trade, one layer below the chips: a domestic, software-defined attempt to mass-produce small fabs and the tools inside them. It is private, seed-stage, and years from proof, so this is a name to watch, not a position. But the founders and the thesis are serious enough that if it works, it resets who can make chips at all.
The week's whole arc was reducing dependence on any single vendor. The fab is the last and hardest chokepoint, and a credible attempt to commoditize it is worth tracking even at seed stage. The offset is that fab economics have humbled far better-funded efforts, and mass-producing fabs is a much harder problem than building one.
We hold Nvidia on watch. Nothing today moves the core trade: the harness story is orthogonal to silicon, and Fab2 is years out. But both point at the same slow erosion the tape has tracked all week. Value is migrating up the stack toward the harness and down the stack toward memory and fabs, and the GPU-centric trade sits in the middle of that squeeze.
The bull case is that agentic demand lifts all accelerator volume regardless of where margin accrues. The offset is that when the model is cheap and the scarce, defensible work is the scaffolding around it, the pricing power the GPU trade leans on gets shared with layers Nvidia doesn't own.