The Briefing · Sunday, July 5, 2026

Capable AI got cheaper all week. This weekend, builders caught the newest models fumbling the simplest job: using the tools you give them

A holiday-quiet Sunday, and the story that matters is a builder's complaint: this weekend Armin Ronacher and an OpenAI Codex thread both caught the newest AI models (Opus 4.8, Sonnet 5, the latest Codex) getting worse at working inside a tool harness, even as they get better at everything else. The week made capable AI cheap; the weekend showed the difficulty moved into the harness around the model. Plus, Jim Keller's Atomic Semi rebrands as Fab2 to mass-produce small fabs.

By Oday Brahem · written with AI, edited by hand
9 stories analyzed from 300+ curated sources

⏱ 8 min read

The Rundown No. 134 · Audio Edition · 5 min

Oday

It's the Sunday after the Fourth, the wire is thin, and the story that matters is a builder complaining that the best models just got worse at using his tools.

Shannon

Independence week is over. Here's the rundown: how the newest models started fumbling other people's tools, why that flips the whole week, and where Jim Keller wants to take chipmaking next.

Oday

Armin Ronacher was hacking on his own coding agent, Pi, and watched Anthropic's newest models, Opus 4.8 and Sonnet 5, call his edit tool with fields he never put in the schema. The edits were usually right. The made-up fields got the whole call rejected and retried.

Shannon

And his older models didn't do that. Simon Willison passed it along with a blunt line: the newest models in the family are worse at this one tool schema than their older siblings.

Oday

Be fair about what this is. The models didn't get worse at coding. They got better. That same weekend, Willison shipped a release of his sqlite-utils library mostly written by Claude Fable, for about a hundred and fifty dollars in tokens.

Shannon

So the regression is narrow, and that's what makes it interesting. Ronacher's best guess is that the labs tuned these models hard on their own harnesses, Anthropic on Claude Code, and that quietly made them worse at everyone else's tools.

Oday

Capability went up. The ability to plug into a tool someone else built went down.

Shannon

And it isn't only Anthropic. The same weekend, a busy issue on OpenAI's Codex repo said its newest model's reasoning had started degrading inside the harness. Two labs, two flagships, the same odd result.

Oday

For a week that was all about capable AI getting cheap and portable, this is the counter-current. The model stopped being the scarce, expensive part this month. The scaffolding around it did not.

Shannon

So the practical read for anyone building is a budget shift. A month ago you watched the model cost. Now the model is nearly free, and the cost moved into the harness: the retries, the schema mismatches, the fallback model you keep wired in for the calls that can't miss.

Oday

Pin the model versions your tools were tested against, because newer is no longer strictly better for tool use. That's the strange part. You might hold an older model on purpose.

Shannon

And one floor down, the independence story kept going. Jim Keller's chip startup, Atomic Semi, rebranded to Fab2 and moved to Texas. The pitch is a fab that builds small fabs, and the tools inside them, and then mass-produces the whole factory.

Oday

It's seed-stage and years from proof. But it's the deepest version of the week's theme. Independence from one model, then one chip vendor, ends at the fab. Fab2 wants to make the fab a product instead of a chokepoint.

Oday

To the tape. We moved Micron to a long. The memory call we made on June 28th keeps confirming: TrendForce sees DRAM and NAND climbing through the third quarter on AI demand, and Micron is selling into the build-out, not the PC cycle.

Shannon

We're watching Fab2, low conviction, private, a name not a position. And we're holding Nvidia on watch. Nothing today moves it, but value is migrating up to the harness and down to memory and fabs, and the GPU trade sits in the middle of that squeeze.

Oday

The tape is the desk's scorecard, not advice.

Oday

Quick break — two from the desk.

Shannon

One we know well: vote.direct. If you're on an HOA or a board, it runs your elections digitally: secure, verifiable, no paper, no clipboard in the lobby. Point your council to vote.direct.

Oday

And if this is your ten minutes of AI for the day, get the written edition too. The full wire, free, every morning: leave your email at nextbig.dev.

Oday

Our call: within six months, at least one major lab publicly admits that tuning its models on its own coding agent made them worse at other people's tools, and ships a named fix for it.

Shannon

What proves us wrong: if by January fifth no lab has owned that regression and shipped a fix, and builders are still working around invented tool fields with retries.

Oday

The models are cheap and brilliant now. The bill this weekend was reliability, the boring kind, at the machine interface. That's the rundown, and that's the week.

The Big Story

Capable AI got cheaper all week. This weekend, builders caught the newest models fumbling the simplest job: using the tools you give them

The wire is holiday-quiet, and the one story on it that matters is a builder's complaint. Armin Ronacher, hacking on an agentic coding tool of his own called Pi, watched Anthropic's newest models, Opus 4.8 and Sonnet 5, call his edit tool with fields he never wrote into the schema. The edits themselves were usually correct. The invented keys were the trouble: Pi rejected the calls and made the model try again. His older models had not done this. Simon Willison, who passed the note along, put the finding in one line: the newest models in the family are worse at this one tool schema than their older siblings.

Be fair about what this is and isn't. The models did not get worse at coding; they got better, and the same weekend proved it: Willison shipped a release of sqlite-utils, his widely used database library, mostly written by Claude Fable, for about $149 in tokens. The regression is narrow, which is what makes it worth reading. Ronacher's best guess is that the labs tuned these models hard on their own coding harnesses, Anthropic on Claude Code, and that the tuning quietly degraded them on everyone else's tools. Capability went up. The ability to plug into a tool someone else built went down.

It is not only Anthropic. The same weekend, a heavily upvoted issue on OpenAI's Codex repository argued that its newest model's reasoning had started clustering in a way that degrades performance inside the harness. Two labs, two flagships, the same odd result: the best model a lab ships is increasingly the one shaped for the lab's own tools, and increasingly clumsy inside anyone else's. For a week that was about capability getting abundant and portable, this is the correction. The model stopped being the scarce, expensive input this month. The scaffolding it plugs into is where the difficulty moved.

For anyone building on these models, the practical read is a budget shift. A month ago the model was the line item you watched. Now it is nearly free, and the cost has slid into the scaffolding around it: the retries, the schema mismatches, the harness you keep running so a brilliant model can be trusted to fill in a form. Pin the model versions your tools were tested against, because newer is no longer strictly better for tool use. Keep an older model wired in as the fallback for the calls that cannot miss.

Willison spent about $149 and got a working release out of Claude Fable. Ronacher spent his weekend watching a smarter model get rejected for inventing fields that were never in the schema. Both are the frontier right now, and the second is the bill the cheap models are running up while almost nobody writes it down.

@mitsuhiko Read source
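The pin-and-fallback advice above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not anyone's production harness: the model identifiers, the three-field edit schema, and the `fake_model` stand-in are all hypothetical, and a real harness would replace `fake_model` with an actual API call.

```python
from typing import Callable

# Hypothetical identifiers. In practice, pin the exact dated model versions
# your tools were tested against rather than a floating "latest" alias.
PRIMARY_MODEL = "newest-model-pinned"
FALLBACK_MODEL = "older-model-pinned"

# Hypothetical edit-tool schema: the only argument names the harness accepts.
ALLOWED_FIELDS = {"path", "old_text", "new_text"}


def is_valid(args: dict) -> bool:
    """Accept a call only if its keys match the schema exactly."""
    return set(args) == ALLOWED_FIELDS


def call_with_fallback(
    call_model: Callable[[str], dict], max_attempts: int = 2
) -> tuple[str, dict, int]:
    """Try the primary model, then the fallback, counting rejected calls.

    Returns (model_used, tool_args, rejected_calls).
    """
    rejected = 0
    for model in (PRIMARY_MODEL, FALLBACK_MODEL):
        for _ in range(max_attempts):
            args = call_model(model)
            if is_valid(args):
                return model, args, rejected
            rejected += 1  # each rejection is a wasted round trip
    raise RuntimeError(f"no valid tool call after {rejected} rejected attempts")


def fake_model(model: str) -> dict:
    """Stand-in for a real API call: the primary always invents a field."""
    args = {"path": "app.py", "old_text": "foo", "new_text": "bar"}
    if model == PRIMARY_MODEL:
        args["explanation"] = "renamed variable"  # not in the schema
    return args


model_used, tool_args, rejected_calls = call_with_fallback(fake_model)
print(model_used, rejected_calls)
```

With this stand-in, the primary model burns both of its attempts and the call lands on the fallback. The `rejected_calls` counter is the number worth logging: it is the harness cost the paragraph above describes, and it tells you when a newer model is ready to be promoted.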

The Harness Became the Bottleneck

The newest models invent fields in tools they were told to use

Armin Ronacher, building an agentic coder called Pi, found that Opus 4.8 and Sonnet 5 call his edit tool with extra keys that aren't in the schema. The edits are usually right; the fabricated fields get the whole call rejected and retried. Older Claude models handled the same schema cleanly. His read is that tuning the models on Anthropic's own Claude Code harness made them worse at third-party tools. The models are more capable and less interoperable at the same time, and for anyone whose product is a harness, that trade is the whole story.

@mitsuhiko Read source
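The failure Ronacher describes is easy to reproduce in miniature. A minimal sketch, assuming a strict validator and a made-up three-field edit schema (the field names here are illustrative and are not Pi's real schema): a call whose edit is correct still gets rejected as soon as it carries one key the schema never declared.

```python
from dataclasses import dataclass

# Hypothetical edit-tool schema: field name -> required Python type.
# These names are illustrative, not Pi's actual schema.
EDIT_TOOL_SCHEMA = {"path": str, "old_text": str, "new_text": str}


@dataclass
class ValidationResult:
    ok: bool
    errors: list[str]


def validate_tool_call(args: dict, schema: dict = EDIT_TOOL_SCHEMA) -> ValidationResult:
    """Strictly validate a tool call: every schema field required, no extras."""
    errors = []
    # Keys the model invented that the schema never declared.
    for name in sorted(set(args) - set(schema)):
        errors.append(f"unexpected field: {name}")
    # Declared fields that are absent or carry the wrong type.
    for name, expected in schema.items():
        if name not in args:
            errors.append(f"missing field: {name}")
        elif not isinstance(args[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    return ValidationResult(ok=not errors, errors=errors)


good_call = {"path": "app.py", "old_text": "foo", "new_text": "bar"}
bad_call = {**good_call, "explanation": "renamed variable"}  # invented key

print(validate_tool_call(good_call).ok)
print(validate_tool_call(bad_call).errors)
```

Strictness is a design choice. A harness could instead drop unknown keys and accept the call, trading a silent mismatch for fewer retries; rejecting keeps the schema honest but pays for it in round trips.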

OpenAI's newest Codex model draws the same complaint the same weekend

A heavily upvoted issue on OpenAI's Codex repository argues the latest model's reasoning-token behavior has started degrading its performance inside the coding harness. Take it as community signal, not a lab admission. But it lands on the same shape as Ronacher's Claude finding: the flagship a lab optimizes for its own agent is the one that behaves worst in someone else's. When two labs draw the same complaint in a single weekend, it stops being a quirk and starts being the cost of how frontier models are now trained.

@github Read source

Independence, All the Way Down to the Fab

Jim Keller's chip startup now wants to mass-produce the factory itself

Atomic Semi, founded by chip architect Jim Keller and DIY-fab prodigy Sam Zeloof, rebranded to Fab2 and moved to Texas, recasting itself as a "fab fab": a company that builds small semiconductor fabs and every tool inside them, then aims to mass-produce the fabs. It is seed-stage and years from proof, backed by the OpenAI Startup Fund with angels including Naval Ravikant and Nat Friedman. But it is the deepest version of the week's theme. Independence from one model, then one chip vendor, ends at the fab. Fab2 is an attempt to make the fab itself a product instead of a chokepoint.

@tomshardware Read source

The memory squeeze the desk flagged a week ago is still climbing

TrendForce now expects DRAM and NAND prices to keep rising through the third quarter on AI demand, even as consumer buyers hit the ceiling of what they will pay. Micron showed its first PCIe Gen6 data-center SSD at Computex, aimed squarely at the AI build-out; DIY makers are hand-threading core memory to route around the crunch, and cheaper Chinese YMTC drives are turning up in retail Lenovo laptops. On June 28 we called it: the AI-hardware story would stop being a GPU story alone. A week on, the memory line is still paying.

@tomshardware Read source

Quick Hits

Amazon stops taking new Mechanical Turk customers, twenty years after it turned human labeling into an API

@techcrunch

Alibaba reportedly bans employees from using Claude Code, classifying it high-risk software

@techcrunch

EU Council fast-tracks Chat Control, reviving mandatory scanning of private messages

@heise

Google's July 4 ad imagines the Declaration of Independence written with Gemini. It did not go over well

@verge

The Takeaway

The models keep getting cheaper and smarter, and this weekend two labs' newest flagships were caught stumbling at the same small, expensive job: working inside a tool harness. Armin Ronacher documented Anthropic's Opus 4.8 and Sonnet 5 calling his edit tool with fields that don't exist; a busy thread reported in-harness degradation in OpenAI's latest Codex model. The likely cause, on Ronacher's read, is that each lab tuned its model on its own harness, which made it worse at everyone else's. The lesson isn't to stop upgrading; it's to stop assuming newer is better for the load-bearing work of tool use, and to treat the harness (the retries, the schema checks, the fallback model) as the part you own. Capability is close to free now. Reliability at the machine interface is not, and that is where the next year of engineering work, and margin, will sit.

The Call C-20260705

Within six months, at least one major AI lab publicly concedes that tuning its models on its own coding agent made them worse at other people's tools, and ships a named fix: a compatibility mode, a published tool-calling contract, or retraining aimed at third-party harnesses.

The case

Two independent reports landed the same weekend. Armin Ronacher showed Opus 4.8 and Sonnet 5 inventing fields in his edit tool, worse than their older siblings, and traced it to first-party-harness tuning; a busy issue on OpenAI's Codex repo described a similar in-harness regression. The failure surfaced the moment the model itself stopped being the scarce, expensive input. Once the harness is the competitive surface, tool interoperability is the pressure that follows, and admitting the regression is cheaper than losing the developers who build on top of you.

What proves us wrong

If, by January 5, 2027, no major lab has publicly acknowledged that first-party-harness tuning hurt third-party tool use and shipped a specific fix for it, and builders are still papering over invented tool fields with retries, the call is wrong.

Settles by January 5, 2027

The Tape T-20260705

▲ Long MU Micron medium conviction

The AI-memory thesis we put out on June 28 keeps confirming. TrendForce sees DRAM and NAND climbing through Q3 on AI demand even as consumer buyers tap out, and Micron brought its first PCIe Gen6 data-center SSD to Computex, selling into the build-out rather than the PC cycle. Memory is decoupling from the consumer refresh and re-pricing on AI, and Micron is the cleanest US-listed way to hold that.

When memory tightness is driven by AI capacity rather than PC seasonality, the pricing power lands on the makers, and ASPs and margins follow. The near-term risk is real: the consumer side is at an affordability ceiling, so a demand air-pocket would hit volumes even if AI holds the floor under price.

Wrong if DRAM and NAND pricing rolls over before the fourth quarter, or Micron's next report shows AI demand failing to offset the consumer softness, leaving revenue and margins flat to down. Settles 6 months

◆ Watch Private Fab2 (Atomic Semi) low conviction

Fab2 is the long-horizon independence trade, one layer below the chips: a domestic, software-defined attempt to mass-produce small fabs and the tools inside them. It is private, seed-stage, and years from proof, so this is a name to watch, not a position. But the founders and the thesis are serious enough that if it works, it resets who can make chips at all.

The week's whole arc was reducing dependence on any single vendor. The fab is the last and hardest chokepoint, and a credible attempt to commoditize it is worth tracking even at seed stage. The offset is that fab economics have humbled far better-funded efforts, and mass-producing fabs is a much harder problem than building one.

Wrong if Fab2 fails to ship a working, repeatable fab on its stated path, or stalls at the prototype stage the way most novel-fab efforts have. Settles 18 months

◆ Watch NVDA Nvidia low conviction

We hold the watch. Nothing today moves the core trade: the harness story is orthogonal to silicon, and Fab2 is years out. But both point at the same slow erosion the tape has tracked all week. Value is migrating up the stack toward the harness and down the stack toward memory and fabs, and the GPU-centric trade sits in the middle of that squeeze.

The bull case is that agentic demand lifts all accelerator volume regardless of where margin accrues. The offset is that when the model is cheap and the scarce, defensible work is the scaffolding around it, the pricing power the GPU trade leans on gets shared with layers Nvidia doesn't own.

Wrong if Nvidia's next two quarters show accelerating data-center revenue and holding margins, with no visible share loss to AMD or in-house inference silicon. Settles 9 months

Desk signals from the day's verified wire — falsifiable, dated, settled in public. Analysis, not individualized investment advice.
