The AI Hardware Stack, Explained: From GPUs and HBM to Neoclouds

Every answer a model gives you starts as electricity moving through physical parts: a chip doing the math, stacked memory feeding it, packaging holding it together, optics wiring thousands of them into one machine, and a datacenter with the power to run it all. The news calls these parts GB200, HBM, CoWoS, RISC-V. This guide explains what each one is, what it decides, and why a builder who has never touched a server should still know the difference.

You do not need an electrical-engineering background. By the end you will be able to read an AI-infrastructure headline and know immediately whether it touches your cost, your latency, or your ability to get capacity at all.

What you'll learn

The five layers of the AI hardware stack, from a single chip to a datacenter
What a GPU, HBM memory, and CoWoS packaging each do, and which one is really the bottleneck
How thousands of chips are wired into one machine with NVLink and optics, and why co-packaged optics is coming
What a neocloud is, and when it beats a hyperscaler
Why RISC-V and custom silicon are the gamble against NVIDIA, plus a plain glossary of every term

The AI hardware stack, in one map

"AI infrastructure" sounds like analyst territory, but underneath the branding it is five physical layers stacked on top of each other. Each layer answers a different question, and almost every infrastructure headline is really about one of them.

The stack, bottom to top. A headline about a chip (layer 1) eventually changes your invoice; a headline about power (layer 4) decides what is even possible eighteen months out. Naming the layer is the first step to knowing whether the news matters to you.

The accelerator: GPUs and the Blackwell generation

At the bottom of the stack is the accelerator: the chip that does the matrix math a model is made of. Almost always this is an NVIDIA GPU, though Google's TPUs, Amazon's Trainium, and a handful of startups build alternatives. A GPU is good at AI because it does thousands of simple calculations in parallel, which is exactly the shape of the work.

The current NVIDIA generation is Blackwell. You will see it in two product names:

GB200: a "superchip" that pairs two Blackwell GPUs with a Grace CPU. It is the building block most new capacity in 2025–2026 is made of.
GB300: "Blackwell Ultra," the mid-cycle upgrade with more and faster memory, tuned for the heavier demands of reasoning and inference.

In the datacenter these arrive as a GB200 NVL72: a single rack that wires 72 Blackwell GPUs together so tightly they behave like one enormous GPU (more on how, below). When you read about "racks" or "allocations," this is the unit being bought and sold.

What comes next: Rubin. NVIDIA's next architecture after Blackwell is Rubin (paired with a new CPU as "Vera Rubin"), expected to use next-generation HBM4 memory. Every generation resets the price-per-performance expectation, so capacity plans across the industry move with its timing. A new generation slipping or shipping early is itself market-moving news.

HBM: the bottleneck nobody sees

Here is the counterintuitive part. The headline number for a chip is its compute, measured in FLOPS (floating-point operations per second). But for most AI work, the chip is not waiting on math. It is waiting on memory: moving the model's weights and the running activations in and out fast enough to keep the math units fed. This is the "memory wall," and it is why the most important part beside the GPU is its memory.
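To make the memory wall concrete, here is a minimal back-of-the-envelope sketch. It compares how long the math takes with how long the memory traffic takes when a model generates one token for a single request. Every number in it is a hypothetical round figure chosen for illustration, not the spec of any real chip, and the "2 operations per parameter" rule is a common rough approximation rather than an exact count.

```python
def step_time_seconds(flops, bytes_moved, peak_flops_per_s, mem_bandwidth_bytes_per_s):
    """Return (compute_time, memory_time, bound) for one step of work."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / mem_bandwidth_bytes_per_s
    bound = "memory" if memory_time > compute_time else "compute"
    return compute_time, memory_time, bound


# Hypothetical numbers for illustration only, not real hardware specs.
PARAMS = 70e9            # model parameters
BYTES_PER_PARAM = 2      # 16-bit weights
PEAK_FLOPS = 1e15        # assumed peak compute: 1 PFLOP/s
HBM_BANDWIDTH = 4e12     # assumed memory bandwidth: 4 TB/s

# Rough approximation: generating one token for one request reads every
# weight once and does about 2 floating-point operations per parameter.
flops = 2 * PARAMS
bytes_moved = PARAMS * BYTES_PER_PARAM

compute_t, memory_t, bound = step_time_seconds(
    flops, bytes_moved, PEAK_FLOPS, HBM_BANDWIDTH
)
print(f"compute: {compute_t * 1e3:.2f} ms, memory: {memory_t * 1e3:.2f} ms -> {bound}-bound")
```

With these assumed figures the math takes about 0.14 ms and the memory traffic about 35 ms, so the chip spends most of the step waiting on data. Batching many requests together changes the ratio, which is one reason serving systems batch aggressively.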

That memory is HBM (High-Bandwidth Memory): DRAM stacked in tall towers and placed right next to the GPU so data has the shortest possible trip. HBM is what gives a modern accelerator its enormous bandwidth, and it is made by only a few companies (SK Hynix, Samsung, Micron). Because it is hard to make and in heavy demand, HBM supply, not GPU logic, is often the real reason a chip is scarce or expensive.

A GPU die with HBM stacks beside it, all sitting on a shared interposer. The wide arrows are the bandwidth that keeps the math units busy. Compute has outrun memory for years, so how much HBM a chip has, and how fast, is frequently what matters most.
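Capacity matters as well as speed. A rough sketch of how many accelerators it takes just to hold a model's weights in HBM follows. The model size, the HBM capacity per GPU, and the usable fraction (the rest is assumed to be reserved for activations and caches) are all hypothetical values for illustration.

```python
import math


def min_gpus_for_weights(params, bytes_per_param, hbm_gb_per_gpu, usable_fraction=0.8):
    """Return (weight size in GB, minimum GPU count needed to hold the weights)."""
    weight_gb = params * bytes_per_param / 1e9
    usable_gb = hbm_gb_per_gpu * usable_fraction
    return weight_gb, math.ceil(weight_gb / usable_gb)


# Hypothetical example: a 400-billion-parameter model with 16-bit weights
# on accelerators assumed to carry 192 GB of HBM each.
weight_gb, gpus = min_gpus_for_weights(
    params=400e9, bytes_per_param=2, hbm_gb_per_gpu=192
)
print(f"{weight_gb:.0f} GB of weights -> at least {gpus} GPUs")
```

Under these assumptions the weights alone come to 800 GB and need at least 6 GPUs before any real work is done, which is why a generation with more HBM per chip can shrink a deployment.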

CoWoS: why packaging caps supply

Look again at that interposer under the die and memory. Getting a GPU and several HBM stacks to sit together and communicate at full speed is its own hard problem, solved by advanced packaging. The dominant method is TSMC's CoWoS (Chip-on-Wafer-on-Substrate), which mounts the die and the memory on a shared piece of silicon so the connections between them are short, wide, and fast.

CoWoS matters to builders for one blunt reason: there is not enough of it. Packaging capacity has been one of the tightest links in the entire AI supply chain. When a company says it cannot get enough GPUs, the constraint is often not the chip fab but the packaging line behind it. Watch CoWoS capacity and you are watching the real ceiling on how many accelerators reach the market.
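The "real ceiling" idea is just a minimum over the inputs: finished accelerators are capped by whichever of GPU dies, HBM stacks, or packaging slots runs out first. Here is a minimal sketch with made-up quantities. The figure of 8 HBM stacks per package and all supply counts are assumptions for illustration, not industry data.

```python
def shippable_accelerators(gpu_dies, hbm_stacks, packaging_slots,
                           dies_per_package=1, stacks_per_package=8):
    """Return (finished units, name of the input that limits them)."""
    limits = {
        "gpu_dies": gpu_dies // dies_per_package,
        "hbm": hbm_stacks // stacks_per_package,
        "packaging": packaging_slots,
    }
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


# Hypothetical supply for one period.
units, bottleneck = shippable_accelerators(
    gpu_dies=120_000, hbm_stacks=800_000, packaging_slots=90_000
)
print(f"{units} accelerators can ship; the limit is {bottleneck}")
```

In this made-up scenario there are enough dies for 120,000 units and enough HBM for 100,000, but only 90,000 packaging slots, so 90,000 is all that reaches the market. Adding more wafers at the fab would change nothing.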

Wiring chips into one machine: NVLink and optics

No single GPU can train a frontier model. The work is split across thousands of them, which means the wiring between chips is as important as the chips themselves. There are two scales of connection, and a third one arriving.

Three scales of wiring. Inside a rack, NVLink lets dozens of GPUs share work as if they were one ("scale-up"). Across the room, optical links tie racks into a full cluster ("scale-out"). As clusters hit tens of thousands of GPUs, ordinary networking burns too much power, which is why co-packaged optics, moving the light onto the chip package itself, is the next frontier.

NVLink is NVIDIA's high-speed link between GPUs in the same rack. It is what makes a GB200 NVL72 act like one machine instead of 72 separate cards. Scale-out networking (InfiniBand or high-end Ethernet, carried over optical links) connects racks across the datacenter floor. Co-packaged optics is the emerging shift that builds the optics directly into the switch package, and it is the leading candidate for how the next, larger clusters get built without the power bill running away.
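A small sketch of the scale-up versus scale-out split: given a GPU's position in a cluster of 72-GPU racks, it works out how many racks are needed and which kind of link connects two GPUs. The two bandwidth constants are hypothetical placeholders chosen only to show that the link inside a rack is assumed to be much faster than the link between racks. They are not measured or published figures.

```python
GPUS_PER_RACK = 72  # rack size of the NVL72 described above


def racks_needed(total_gpus, gpus_per_rack=GPUS_PER_RACK):
    """Round up to whole racks."""
    return -(-total_gpus // gpus_per_rack)


def link_between(gpu_a, gpu_b, gpus_per_rack=GPUS_PER_RACK):
    """Classify the path between two GPUs identified by cluster-wide index."""
    same_rack = gpu_a // gpus_per_rack == gpu_b // gpus_per_rack
    return "scale-up" if same_rack else "scale-out"


def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical per-GPU bandwidths, for illustration only.
SCALE_UP_BW = 900e9   # bytes/s inside a rack (assumed)
SCALE_OUT_BW = 50e9   # bytes/s between racks (assumed)

print(racks_needed(10_000), "racks for 10,000 GPUs")
print("GPU 0 -> GPU 71:", link_between(0, 71))
print("GPU 71 -> GPU 72:", link_between(71, 72))

payload = 10e9  # 10 GB of data to move
print(f"inside rack:   {transfer_seconds(payload, SCALE_UP_BW):.3f} s")
print(f"between racks: {transfer_seconds(payload, SCALE_OUT_BW):.3f} s")
```

The point of the sketch is the shape, not the numbers: software that keeps its chattiest traffic inside a rack pays the cheap price, and anything that crosses racks pays the expensive one.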

Where it all runs: neoclouds vs hyperscalers

You almost certainly rent this hardware rather than own it. There are two kinds of provider to rent from and one way to own, and the choice is a real cost decision.

Neoclouds: providers built specifically to rent GPUs (CoreWeave, Lambda, Crusoe, Nebius and others). Because accelerators are their entire business, they often beat the big clouds on raw GPU price and on availability. For pure training or inference at scale, this is increasingly where the capacity is.
Hyperscalers: AWS, Azure, Google Cloud. You may pay more per GPU-hour, but you get the surrounding storage, networking, security, and managed services, plus any existing contract. Breadth, not price, is the draw.
On-prem: your own hardware. Maximum control and, at steady high utilization, the lowest long-run cost, in exchange for capex and the job of running it.

The three places to rent (or own) AI compute. Neoclouds increasingly win on price and availability for pure GPU work; hyperscalers win on everything around the GPU; on-prem wins on control and, at steady high utilization, long-run cost.

The cost rarely turns on the sticker price. Egress fees, how predictable your utilization is, and the cost of switching providers usually decide more than the per-hour rate. For the full decision frame, see the GPU and infra economics playbook.
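Here is a minimal sketch of why the sticker price can mislead. It adds egress and an amortized switching cost to the compute bill, then divides by the hours you actually use. Both providers, all prices, and all volumes are hypothetical and chosen only to show the effect.

```python
def monthly_cost(gpus, hourly_rate, hours_reserved, egress_tb,
                 egress_price_per_tb, switching_cost=0.0, months_amortized=12):
    """Total monthly bill: compute + egress + amortized one-off switching cost."""
    compute = gpus * hourly_rate * hours_reserved
    egress = egress_tb * egress_price_per_tb
    switching = switching_cost / months_amortized
    return compute + egress + switching


def cost_per_useful_gpu_hour(total_monthly_cost, gpus, hours_reserved, utilization):
    """Spread the bill over the GPU-hours that did real work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return total_monthly_cost / (gpus * hours_reserved * utilization)


# Hypothetical provider A: lower sticker price, but data must be moved out
# of an existing cloud (egress) and the team pays a one-off migration cost.
cost_a = monthly_cost(gpus=64, hourly_rate=2.00, hours_reserved=720,
                      egress_tb=500, egress_price_per_tb=90.0,
                      switching_cost=60_000)

# Hypothetical provider B: higher sticker price, data already lives there.
cost_b = monthly_cost(gpus=64, hourly_rate=2.50, hours_reserved=720,
                      egress_tb=500, egress_price_per_tb=0.0)

for name, cost in (("A", cost_a), ("B", cost_b)):
    useful = cost_per_useful_gpu_hour(cost, gpus=64, hours_reserved=720, utilization=0.6)
    print(f"provider {name}: ${cost:,.0f}/month, ${useful:.2f} per useful GPU-hour")
```

In this made-up case the provider with the cheaper hourly rate ends up with the larger bill, and at 60% utilization both cost far more per useful hour than the sticker suggests.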

The challengers: RISC-V and custom silicon

NVIDIA's dominance, and its margins, create a powerful incentive to build alternatives. Two threads are worth tracking.

Custom silicon (ASICs): the largest buyers design their own accelerators to escape both the price and the supply queue. Google's TPU, Amazon's Trainium, and chips from Meta and others are all moves to own the silicon, on the thesis that it beats renting NVIDIA's.

RISC-V: an open, royalty-free instruction set architecture, the basic vocabulary a processor speaks, that anyone can build on without licensing x86 (Intel, AMD) or Arm. For AI it lowers the cost of designing custom CPUs and accelerators, which is why startups such as Rivos and a wave of others build on it. RISC-V progress is a leading indicator of where pricing pressure on the incumbents shows up next. Further out, frontier candidates like Extropic's thermodynamic computing aim at a different physics of computation entirely. That work is mostly research today, but it is the kind of story that tells you where the field thinks the limits are.

The hard limit: power

Zoom all the way out and the binding constraint is not chips at all. It is electricity. A frontier cluster draws as much power as a small city, and you cannot run accelerators you cannot power and cool. This is why AI news now includes power-purchase agreements, grid connections, and even nuclear deals: the companies building the largest systems are securing energy years ahead. When you read about a datacenter "buildout," the gating question is almost always how many megawatts it can actually get, and when.
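The megawatt question converts directly into a GPU count. The sketch below assumes a site's power is split between IT equipment and overhead such as cooling, expressed as PUE (power usage effectiveness, total facility power divided by IT power). The per-GPU wattage, which is meant to include each GPU's share of CPUs and networking, and the PUE value are hypothetical round numbers.

```python
def gpus_supported(site_megawatts, watts_per_gpu_all_in, pue):
    """How many GPUs a site can power once cooling and overhead are paid for."""
    if pue < 1:
        raise ValueError("PUE cannot be below 1")
    it_watts = site_megawatts * 1e6 / pue
    return int(it_watts // watts_per_gpu_all_in)


# Hypothetical site: 100 MW of grid power, 1,500 W per GPU all-in, PUE of 1.25.
count = gpus_supported(site_megawatts=100, watts_per_gpu_all_in=1500, pue=1.25)
print(f"{count:,} GPUs")
```

Under these assumptions a 100 MW site powers roughly 53,000 GPUs, no matter how many more the operator could afford to buy.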

A builder's glossary

The terms that recur in AI-infrastructure news, in one place. You do not need to memorize them; you need to recognize which layer each one belongs to.

Term	What it is	Why it matters
GPU / accelerator	The chip that does AI math in parallel	The base unit of compute; almost always NVIDIA
Blackwell	NVIDIA's current GPU generation	What most new 2025–2026 capacity is built on
GB200 / GB300	Grace + Blackwell "superchips" (GB300 = Blackwell Ultra)	The building blocks sold into datacenters
NVL72	A rack of 72 GPUs linked as one machine	The unit bought, sold, and "allocated"
HBM	High-Bandwidth Memory stacked beside the GPU	Often the real bottleneck and the cause of scarcity
CoWoS	TSMC advanced packaging for die + HBM	A tight supply constraint behind GPU shortages
NVLink	NVIDIA's GPU-to-GPU link inside a rack	Makes many GPUs act as one ("scale-up")
Co-packaged optics	Optical networking built into the package	How the next, larger clusters stay power-feasible
Neocloud	A GPU-first cloud (CoreWeave, Lambda, Crusoe…)	Often the cheapest, most available capacity
RISC-V	An open, royalty-free chip instruction set	Lowers the cost of building NVIDIA alternatives
Rubin	NVIDIA's next architecture after Blackwell	Resets price/performance; its timing moves plans

How to follow AI hardware news

You now have the map: the chip and its memory, the packaging that holds them, the wiring that scales them, the clouds that rent them, and the power that limits them all. The reason it is worth knowing is leverage: when a headline lands, you can place it on the stack and tell instantly whether it changes your cost, your speed, or your access to capacity.
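Placing a headline on the stack can be sketched as a simple keyword lookup. The layer names follow the map in the paragraph above. The keyword lists are an illustrative starting point drawn from terms in this guide, not an exhaustive or authoritative taxonomy, and plain substring matching will misfire on some real headlines.

```python
# Keywords per layer, taken from terms used in this guide (illustrative only).
LAYER_KEYWORDS = {
    "chip and memory": ["gpu", "blackwell", "gb200", "gb300", "rubin",
                        "hbm", "tpu", "trainium", "risc-v"],
    "packaging": ["cowos", "packaging", "interposer"],
    "wiring": ["nvlink", "infiniband", "ethernet", "optics"],
    "clouds": ["neocloud", "hyperscaler", "coreweave", "lambda", "crusoe",
               "nebius", "aws", "azure", "google cloud"],
    "power": ["megawatt", "gigawatt", "power", "grid", "nuclear"],
}


def layers_for(headline):
    """Return every layer whose keywords appear in the headline, in stack order."""
    text = headline.lower()
    return [layer for layer, words in LAYER_KEYWORDS.items()
            if any(word in text for word in words)]


for headline in (
    "TSMC expands CoWoS packaging capacity",
    "Neocloud signs nuclear deal to power new GB300 racks",
    "Local bakery opens second shop",
):
    print(headline, "->", layers_for(headline) or "not an infrastructure story")
```

A headline that lands on several layers at once, like the second example, is usually the one worth reading closely.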

For what each move actually costs, read the GPU and infra economics playbook. For how to follow the daily flow of chip, datacenter, and pricing news without drowning, see the guide to AI infrastructure news. And for the moves themselves, the daily briefing reads the wire every morning and closes each edition with one falsifiable call we settle in public.

This guide is part of The Primer, our growing library of ground-up explainers. We re-check every one against the live landscape each month, so the names and generations stay current.

Frequently asked questions

What is a neocloud?

A neocloud is a cloud provider built specifically to rent out GPUs for AI, rather than a general-purpose cloud that also happens to have GPUs. Providers like CoreWeave, Lambda, Crusoe, and Nebius buy accelerators in bulk and rent them by the hour, often at lower prices and with better availability than the big hyperscalers because GPUs are their whole business. For many teams training or serving models, a neocloud is where the capacity actually is.

What is HBM (high-bandwidth memory), and why does it matter?

HBM is the stacked memory that sits right next to a GPU and feeds it data. Modern AI is bottlenecked less by raw math and more by how fast you can move weights and activations in and out of memory, so HBM bandwidth often decides real-world speed. HBM is also hard to manufacture and in short supply, which is why it, not the GPU logic itself, is frequently the reason a chip is scarce or expensive.

What is CoWoS?

CoWoS (Chip-on-Wafer-on-Substrate) is TSMC's advanced packaging that places a GPU die and its HBM memory stacks together on a single silicon interposer so they can talk at enormous bandwidth. It is essential for every high-end AI accelerator, and CoWoS capacity is one of the hardest bottlenecks in the supply chain: when people say they cannot get GPUs, packaging is often the real reason.

What is the difference between GB200 and GB300?

Both are NVIDIA Grace Blackwell systems that pair Blackwell GPUs with a Grace CPU. GB200 is the original Grace Blackwell superchip; GB300 (Blackwell Ultra) is the mid-cycle upgrade with more and faster HBM memory and higher performance, aimed especially at reasoning and inference workloads. In rack form they appear as the GB200 NVL72 and GB300 NVL72, which wire 72 GPUs together to act as one machine.

What are co-packaged optics?

Co-packaged optics (CPO) move the optical components that send data between machines off separate pluggable modules and directly into the switch or chip package. As clusters grow to tens of thousands of GPUs, the power and cost of conventional networking become a limit, and CPO is the leading way to push past it. It is an emerging shift, and a leading indicator of how large the next generation of clusters can get.

What is RISC-V, and why does it matter for AI?

RISC-V is an open, royalty-free instruction set architecture: the basic vocabulary a chip uses, but free for anyone to build on, unlike x86 (Intel/AMD) or Arm. For AI it matters because it lets companies design custom CPUs and accelerators without licensing someone else's design, which is part of the broader effort to escape NVIDIA's pricing and margins. It is a leading indicator of where cost pressure on the incumbents comes from next.

What is NVIDIA Rubin?

Rubin is NVIDIA's next GPU architecture after Blackwell, expected to use next-generation HBM4 memory and pair with a new CPU (the combination is referred to as Vera Rubin). Like every generation before it, it resets expectations for price and performance, and capacity plans across the industry shift with its timing, so a slip or an early ship is itself news.

Neocloud vs hyperscaler: which is cheaper?

Neoclouds often win on raw GPU price and on availability, because renting accelerators is their entire business. Hyperscalers (AWS, Azure, Google Cloud) usually win on breadth: the surrounding storage, networking, security, and managed services, plus existing contracts. The right answer follows the workload; for pure training or inference at scale a neocloud is frequently cheaper, but egress fees, switching cost, and what else you run alongside the GPUs can flip the math.

As a builder, how much of this hardware do I actually need to know?

You do not need to design chips. You need to read a headline and know which part it is about, because that tells you whether it touches your cost, your speed, or your ability to get capacity at all. Knowing that HBM and packaging gate supply, that neoclouds are where affordable capacity lives, and that a new generation like Rubin resets pricing is enough to make better infrastructure decisions without becoming a hardware engineer.