# Capable AI got cheaper all week. This weekend, builders caught the newest models fumbling the simplest job: using the tools you give them

> A holiday-quiet Sunday, and the story that matters is a builder's complaint: this weekend Armin Ronacher and an OpenAI Codex thread both caught the newest AI models — Opus 4.8, Sonnet 5, the latest Codex — getting worse at using third-party tools, even as they get better at everything else. The week made capable AI cheap; the weekend showed the difficulty moved into the harness around the model. Plus Jim Keller's Atomic Semi rebrands as Fab2 to mass-produce small fabs.

- Published: Sunday, July 5, 2026 (2026-07-05)
- Publisher: nextbig.dev — daily AI & compute briefing, written by Oday Brahem with nextbig.dev's AI agent
- Sources analyzed: 9 articles from 300+ curated accounts
- Canonical URL: https://www.nextbig.dev/daily/2026-07-05

## The Big Story

### Capable AI got cheaper all week. This weekend, builders caught the newest models fumbling the simplest job: using the tools you give them

The wire is holiday-quiet, and the one story on it that matters is a builder's complaint. Armin Ronacher, hacking on an agentic coding tool of his own called Pi, watched Anthropic's newest models, Opus 4.8 and Sonnet 5, call his edit tool with fields he never wrote into the schema. The edits themselves were usually correct. The invented keys were the trouble: Pi rejected the calls and made the model try again. His older models had not done this. Simon Willison, who passed the note along, put the finding in one line: the newest models in the family are worse at this one tool schema than their older siblings.

Be fair about what this is and isn't. The models did not get worse at coding; they got better, and the same weekend proved it: Willison shipped a release of sqlite-utils, his widely used database library, mostly written by Claude Fable, for about $149 in tokens. The regression is narrow, which is what makes it worth reading. Ronacher's best guess is that the labs tuned these models hard on their own coding harnesses, Anthropic on Claude Code, and that the tuning quietly degraded them on everyone else's tools. Capability went up. The ability to plug into a tool someone else built went down.

It is not only Anthropic. The same weekend, a heavily upvoted issue on OpenAI's Codex repository argued that its newest model's reasoning had started clustering in a way that degrades performance inside the harness. Two labs, two flagships, the same odd result: the best model a lab ships is increasingly the one shaped for the lab's own tools, and increasingly clumsy inside anyone else's. For a week that was about capability getting abundant and portable, this is the correction. The model stopped being the scarce, expensive input this month. The scaffolding it plugs into is where the difficulty moved.

For anyone building on these models, the practical read is a budget shift. A month ago the model was the line item you watched. Now it is nearly free, and the cost has slid into the scaffolding around it: the retries, the schema mismatches, the harness you keep running so a brilliant model can be trusted to fill in a form. Pin the model versions your tools were tested against, because newer is no longer strictly better for tool use. Keep an older model wired in as the fallback for the calls that cannot miss.
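One way to picture the pin-and-fallback advice is the sketch below. It is an illustration under stated assumptions, not anyone's shipped harness: the model identifiers are hypothetical placeholders, `call_model` stands in for whatever client you already use, and `is_valid` is whatever schema check your tool runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical identifiers: substitute the exact versioned model IDs
# your tools were actually tested against.
PINNED_PRIMARY = "primary-model-pinned"
PINNED_FALLBACK = "older-model-pinned"


@dataclass
class ToolCall:
    name: str
    arguments: dict


# Stand-in for your own client: (model_id, prompt) -> ToolCall
ModelFn = Callable[[str, str], ToolCall]


def call_with_fallback(
    call_model: ModelFn,
    prompt: str,
    is_valid: Callable[[ToolCall], bool],
    max_retries: int = 2,
) -> Tuple[ToolCall, List[str]]:
    """Try the pinned primary model, then fall back to the older pinned model.

    Returns the first valid tool call plus the list of model IDs attempted,
    so the retry cost is visible instead of silently absorbed.
    """
    attempts: List[str] = []

    # Give the primary model a bounded number of tries.
    for _ in range(max_retries):
        attempts.append(PINNED_PRIMARY)
        call = call_model(PINNED_PRIMARY, prompt)
        if is_valid(call):
            return call, attempts

    # Last resort: one try on the older model kept wired in as the fallback.
    attempts.append(PINNED_FALLBACK)
    call = call_model(PINNED_FALLBACK, prompt)
    if is_valid(call):
        return call, attempts

    raise ValueError(f"no valid tool call after attempts: {attempts}")
```

Returning the attempt list is a deliberate choice in this sketch: logging it is the cheapest way to put a number on the retries the paragraph above describes.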

Willison spent about $149 and got a working release out of Claude Fable. Ronacher spent his weekend watching a smarter model get rejected for inventing fields that were never in the schema. Both are the frontier right now, and the second is the bill the cheap models are running up while almost nobody writes it down.

Source: @mitsuhiko — https://lucumr.pocoo.org/2026/7/4/better-models-worse-tools/

## The Harness Became the Bottleneck

### The newest models invent fields in tools they were told to use

Armin Ronacher, building an agentic coder called Pi, found that Opus 4.8 and Sonnet 5 call his edit tool with extra keys that aren't in the schema. The edits are usually right; the fabricated fields get the whole call rejected and retried. Older Claude models handled the same schema cleanly. His read is that tuning the models on Anthropic's own Claude Code harness made them worse at third-party tools. The models are more capable and less interoperable at the same time, and for anyone whose product is a harness, that trade is the whole story.
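The rejection described above can be sketched as a strict argument check. This is a minimal illustration, not Pi's actual code: the schema layout and the field names (`path`, `old_text`, `new_text`) are assumptions made up for the example.

```python
from typing import Dict, List

# Hypothetical edit-tool schema; the real tool's fields are not given in the source.
EDIT_TOOL_SCHEMA: Dict[str, Dict[str, type]] = {
    "required": {"path": str, "old_text": str, "new_text": str},
    "optional": {},
}


def validate_arguments(arguments: dict, schema: Dict[str, Dict[str, type]]) -> List[str]:
    """Return a list of human-readable errors; an empty list means the call is valid."""
    allowed = {**schema["required"], **schema["optional"]}
    errors: List[str] = []

    # Keys the model invented: present in the call, absent from the schema.
    for key in sorted(set(arguments) - set(allowed)):
        errors.append(f"unexpected field: {key}")

    # Keys the schema demands but the call omitted.
    for key in schema["required"]:
        if key not in arguments:
            errors.append(f"missing required field: {key}")

    # Basic type check on every recognized key that was supplied.
    for key, expected in allowed.items():
        if key in arguments and not isinstance(arguments[key], expected):
            errors.append(f"field {key} must be {expected.__name__}")

    return errors


def strip_unknown(arguments: dict, schema: Dict[str, Dict[str, type]]) -> dict:
    """Lenient alternative: drop invented keys instead of rejecting the call."""
    allowed = {**schema["required"], **schema["optional"]}
    return {k: v for k, v in arguments.items() if k in allowed}
```

A harness can feed the error list back to the model as the retry prompt, or, since the edits themselves were usually right, call something like `strip_unknown` first and accept the edit. The first keeps the contract strict; the second trades strictness for fewer retries.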

Source: @mitsuhiko — https://lucumr.pocoo.org/2026/7/4/better-models-worse-tools/

### OpenAI's newest Codex model draws the same complaint the same weekend

A heavily upvoted issue on OpenAI's Codex repository argues the latest model's reasoning-token behavior has started degrading its performance inside the coding harness. Take it as community signal, not a lab admission. But it has the same shape as Ronacher's Claude finding: the flagship a lab optimizes for its own agent is the one that behaves worst in someone else's. When two labs draw the same complaint in a single weekend, it stops being a quirk and starts being the cost of how frontier models are now trained.

Source: @github — https://github.com/openai/codex/issues/30364

## Independence, All the Way Down to the Fab

### Jim Keller's chip startup now wants to mass-produce the factory itself

Atomic Semi, founded by chip architect Jim Keller and DIY-fab prodigy Sam Zeloof, rebranded to Fab2 and moved to Texas, recasting itself as a "fab fab": a company that builds small semiconductor fabs and every tool inside them, then aims to mass-produce the fabs. It is seed-stage and years from proof, backed by the OpenAI Startup Fund with angels including Naval Ravikant and Nat Friedman. But it is the deepest version of the week's theme. Independence from one model, then one chip vendor, ends at the fab. Fab2 is an attempt to make the fab itself a product instead of a chokepoint.

Source: @tomshardware — https://www.tomshardware.com/tech-industry/atomic-semi-rebrands-as-fab2-and-shifts-operations-to-texas

### The memory squeeze the desk flagged a week ago is still climbing

TrendForce now expects DRAM and NAND prices to keep rising through the third quarter on AI demand, even as consumer buyers hit the ceiling of what they will pay. Micron showed its first PCIe Gen6 data-center SSD at Computex, aimed squarely at the AI build-out; DIY makers are hand-threading core memory to route around the crunch, and cheaper Chinese YMTC drives are turning up in retail Lenovo laptops. On June 28 we called it: the AI-hardware story would stop being a GPU story alone. A week on, the memory line is still paying.

Source: @tomshardware — https://www.tomshardware.com/pc-components/ram/memory-price-surge-begins-to-cool-as-consumers-hit-affordability-limit-ai-demand-still-keeps-dram-and-nand-prices-climbing-through-q3-2026

## Quick Hits

- Amazon stops taking new Mechanical Turk customers, twenty years after it turned human labeling into an API (@techcrunch) — https://techcrunch.com/2026/07/05/amazon-will-stop-accepting-new-customers-for-mechanical-turk/
- Alibaba reportedly bans employees from using Claude Code, classifying it high-risk software (@techcrunch) — https://techcrunch.com/2026/07/04/alibaba-reportedly-bans-employees-from-using-claude-code/
- EU Council fast-tracks Chat Control, reviving mandatory scanning of private messages (@heise) — https://www.heise.de/en/news/Chat-Control-1-0-EU-Council-forces-messenger-scans-via-fast-track-11353659.html
- Google's July 4 ad imagines the Declaration of Independence written with Gemini. It did not go over well (@verge) — https://www.theverge.com/ai-artificial-intelligence/961468/google-ai-commercial-founding-fathers-declaration-of-independence

## The Takeaway

The models keep getting cheaper and smarter, and this weekend two labs' newest flagships were caught doing the same small, expensive thing: stumbling inside the harness built around them. Armin Ronacher documented Anthropic's Opus 4.8 and Sonnet 5 calling his edit tool with fields that don't exist; a busy thread reported a comparable in-harness regression in OpenAI's latest Codex model. The likely cause is that each lab tuned its model on its own harness, which made it worse at everyone else's. The lesson isn't to stop upgrading; it's to stop assuming newer is better for the load-bearing work of tool use, and to treat the harness (the retries, the schema checks, the fallback model) as the part you own. Capability is close to free now. Reliability at the machine interface is not, and that is where the next year of engineering work, and margin, will sit.

## The Call

Within six months, at least one major AI lab publicly concedes that tuning its models on its own coding agent made them worse at other people's tools, and ships a named fix: a compatibility mode, a published tool-calling contract, or retraining aimed at third-party harnesses.

The case: Two independent reports landed the same weekend. Armin Ronacher showed Opus 4.8 and Sonnet 5 inventing fields in his edit tool, worse than their older siblings, and traced it to first-party-harness tuning; a busy issue on OpenAI's Codex repo described a similar in-harness regression. The failure surfaced the moment the model itself stopped being the scarce, expensive input. Once the harness is the competitive surface, tool interoperability is the pressure that follows, and admitting the regression is cheaper than losing the developers who build on top of you.

What proves us wrong: If, by January 5, 2027, no major lab has publicly acknowledged that first-party-harness tuning hurt third-party tool use and shipped a specific fix for it, and builders are still papering over invented tool fields with retries, the call is wrong.

Settles: by January 5, 2027

## The Tape

The market desk's signals from the day's verified wire. Falsifiable analysis, settled in public — not individualized investment advice.

### LONG MU (Micron) — medium conviction

The AI-memory thesis we put out on June 28 keeps confirming. TrendForce sees DRAM and NAND climbing through Q3 on AI demand even as consumer buyers tap out, and Micron brought its first PCIe Gen6 data-center SSD to Computex, selling into the build-out rather than the PC cycle. Memory is decoupling from the consumer refresh and re-pricing on AI, and Micron is the cleanest US-listed way to hold that.

The mechanism: When memory tightness is driven by AI capacity rather than PC seasonality, the pricing power lands on the makers, and ASPs and margins follow. The near-term risk is real: the consumer side is at an affordability ceiling, so a demand air-pocket would hit volumes even if AI holds the floor under price.

Wrong if: DRAM and NAND pricing rolls over before the fourth quarter, or Micron's next report shows AI demand failing to offset the consumer softness, leaving revenue and margins flat to down.

Settles: 6 months

### WATCH Fab2 (Atomic Semi) — low conviction

Fab2 is the long-horizon independence trade, one layer below the chips: a domestic, software-defined attempt to mass-produce small fabs and the tools inside them. It is private, seed-stage, and years from proof, so this is a name to watch, not a position. But the founders and the thesis are serious enough that if it works, it resets who can make chips at all.

The mechanism: The week's whole arc was reducing dependence on any single vendor. The fab is the last and hardest chokepoint, and a credible attempt to commoditize it is worth tracking even at seed stage. The offset is that fab economics have humbled far better-funded efforts, and mass-producing fabs is a much harder problem than building one.

Wrong if: Fab2 fails to ship a working, repeatable fab on its stated path, or stalls at the prototype stage the way most novel-fab efforts have.

Settles: 18 months

### WATCH NVDA (Nvidia) — low conviction

We hold the watch. Nothing today moves the core trade: the harness story is orthogonal to silicon, and Fab2 is years out. But both point at the same slow erosion the tape has tracked all week. Value is migrating up the stack toward the harness and down the stack toward memory and fabs, and the GPU-centric trade sits in the middle of that squeeze.

The mechanism: The bull case is that agentic demand lifts all accelerator volume regardless of where margin accrues. The offset is that when the model is cheap and the scarce, defensible work is the scaffolding around it, the pricing power the GPU trade leans on gets shared with layers Nvidia doesn't own.

Wrong if: Nvidia's next two quarters show accelerating data-center revenue and holding margins, with no visible share loss to AMD or in-house inference silicon.

Settles: 9 months

---
Cite as: "nextbig.dev Daily AI Briefing, 2026-07-05" — https://www.nextbig.dev/daily/2026-07-05