Inside the chip
How AI silicon actually works
A bottom-up walk through how an AI chip is built — starting from AND gates and ending at the GPU-vs-TPU architectural split. Notes from a conversation between Dwarkesh Patel and Reiner Pope, CEO of MatX. Every section here distills one big idea from the transcript and the diagrams that make it stick.
The one idea that runs the whole thing
Every level of chip design is the same fight: maximize compute relative to communication. From the precision of a single multiplier, to the size of a systolic array, to the layout of a whole datacenter — you are always trying to do more arithmetic per byte you move. That's it. That's the whole show.
At the gate
Multiplier area scales quadratically with bit-width, so halving precision more than doubles throughput. This is why FP4 is so much faster than FP8.
At the core
A systolic array bakes a 2D loop of MACs into hardware so the weight matrix sits in place while activations flow through.
At the chip
A scratchpad replaces a cache so memory access is deterministic and software, not hardware, controls movement.
Logic gates → multiply-accumulate
The atomic op of AI
Look inside any matrix multiply and you find a triple for loop over rows, columns, and the shared inner dimension.
Every step is one multiply-accumulate: a multiply, then an add into a running sum. So the whole chip can be optimized around that one operation.
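A minimal sketch of that triple loop, in plain Python with nested lists. The function name `matmul` and the loop variable names are illustrative, not from the conversation.

```python
def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix using explicit MACs."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    for i in range(m):              # rows of a
        for j in range(n):          # columns of b
            acc = 0                 # the running sum (accumulator)
            for t in range(k):      # shared inner dimension
                acc += a[i][t] * b[t][j]   # one multiply-accumulate
            out[i][j] = acc
    return out


result = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The only arithmetic in the whole routine is the single `acc += a * b` line, which is the operation the hardware is built around.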
Multiply small, accumulate big
Reiner's example: a 4-bit × 4-bit multiply, accumulating into an 8-bit running sum.
Why the asymmetry? Two reasons:
- The product of two N-bit numbers needs 2N bits to hold without loss.
- You sum many of these — rounding errors pile up in the accumulator, not the multiplier.
So a low-precision multiply feeding a higher-precision add is close to a free lunch on error.
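The two reasons above can be checked with a few lines of integer arithmetic. This is a sketch that assumes unsigned integers and worst-case operands; the helper names are made up for illustration.

```python
def max_unsigned(bits):
    """Largest value an unsigned integer of this width can hold."""
    return (1 << bits) - 1


def product_bits(n):
    """Bits needed to hold the largest product of two n-bit values."""
    return (max_unsigned(n) * max_unsigned(n)).bit_length()


def accumulator_bits(n, count):
    """Bits needed to hold the sum of `count` worst-case n-bit products."""
    return (count * max_unsigned(n) * max_unsigned(n)).bit_length()


four_bit_product = product_bits(4)          # 15 * 15 = 225 fits in 8 bits
sum_of_sixteen = accumulator_bits(4, 16)    # 16 * 225 = 3600 needs 12 bits
```

A single 4-bit × 4-bit product fits in 8 bits, but summing many of them needs a wider running sum. That is why the accumulator, not the multiplier, gets the extra bits.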
The full adder is a 3→2 compressor
Coming from software you'd assume a "full adder" adds two 32-bit numbers. It doesn't.
It takes three single bits, counts them, and writes the count in binary as two bits.
The right output bit is the column sum. The left bit is the carry into the next column.
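As a sketch, the 3→2 compressor is just "count the ones, write the count in binary". The function name and the `(carry, sum)` return order are choices made here for illustration.

```python
def full_adder(a, b, c):
    """Take three single bits, return (carry, sum) as two bits."""
    count = a + b + c            # how many of the inputs are 1 (0..3)
    carry = count >> 1           # left bit: goes into the next column
    column_sum = count & 1       # right bit: stays in this column
    return carry, column_sum


# Exhaustive check: the two output bits always encode the input count.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            carry, column_sum = full_adder(a, b, c)
            assert 2 * carry + column_sum == a + b + c
```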
Dadda multiplier
To do the big column sum from the multiplication above, you tile full adders across the partial-product grid.
- Each adder eats 3 bits, emits 2 bits — net −1 bit.
- The 4-bit × 4-bit example starts with 24 input bits (16 partial-product bits plus the 8-bit accumulator) and ends with 8 output bits.
- So it needs exactly 24 − 8 = 16 full adders.
- Generalizes to p × q full adders for a p-bit × q-bit MAC.
It's the standard area-efficient multiplier construction.
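The counting argument in the list above can be written down directly. This sketch assumes the accumulator is p + q bits wide, as in the 4-bit × 4-bit example, and that every reduction step is a full adder; the function name is hypothetical.

```python
def full_adders_for_mac(p, q):
    """Count full adders for a p-bit x q-bit MAC by bit counting."""
    partial_product_bits = p * q        # one AND gate output per bit pair
    accumulator_bits = p + q            # running sum fed back in
    input_bits = partial_product_bits + accumulator_bits
    output_bits = p + q                 # the new running sum
    # Each full adder turns 3 bits into 2, so it removes exactly one bit.
    return input_bits - output_bits


adders_4x4 = full_adders_for_mac(4, 4)   # 24 - 8 = 16
adders_8x8 = full_adders_for_mac(8, 8)   # 80 - 16 = 64
```

The accumulator bits cancel out, which is why the count comes out to exactly p × q.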
The quadratic scaling insight. Halving precision doesn't just double throughput; it more than doubles it, because multiplier area scales as p × q. This is the single biggest reason low-precision arithmetic has worked so well for neural nets. Nvidia even acknowledged this on B300 by quoting FP4 ≈ 3× FP8 instead of the historical 2× ratio. By the p × q argument it should be 4×, since (8 × 8) / (4 × 4) = 4.
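A back-of-envelope version of the 4× claim, assuming multiplier area is exactly p × q and ignoring everything else on the chip (exponent handling, data movement, power limits). The budget of 1024 units is an arbitrary illustrative number.

```python
def multiplier_area(p, q):
    """Relative area of a p-bit x q-bit multiplier, in full-adder units."""
    return p * q


def multipliers_per_budget(area_budget, bits):
    """How many square bits x bits multipliers fit in a fixed area."""
    return area_budget // multiplier_area(bits, bits)


budget = 1024                                    # hypothetical area budget
eight_bit = multipliers_per_budget(budget, 8)    # 1024 // 64 = 16
four_bit = multipliers_per_budget(budget, 4)     # 1024 // 16 = 64
speedup = four_bit / eight_bit                   # 4.0
```

Under these assumptions the same silicon holds four times as many 4-bit multipliers as 8-bit ones. A quoted 3× suggests other costs eat part of the gain.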
The hidden tax: muxes and data movement
Where does the MAC live?
You drop the multiply-accumulate unit next to a register file. The MAC reads three registers — two operands and the accumulator — does its thing, writes back.
But which registers? The MAC doesn't always read the same three slots. So you need a mux in front of each input to select.
A mux is a hardware switch
To pick "register #3" out of 8, hardware does the dumb thing: AND every entry with a one-hot mask, then OR everything together.
Selecting a register is not free. It looks like nothing in software but it's a real chunk of silicon.
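The AND-then-OR trick above looks like this as a sketch. The 8-bit register width and the helper names are assumptions for illustration. In silicon every AND and OR here is a physical gate, repeated for each bit of each register.

```python
def one_hot(index, size):
    """Select lines: exactly one entry is 1."""
    return [1 if i == index else 0 for i in range(size)]


def mux(registers, index, width=8):
    """Pick registers[index] using only AND and OR, as hardware would."""
    all_ones = (1 << width) - 1
    result = 0
    for value, select in zip(registers, one_hot(index, len(registers))):
        gated = value & (all_ones if select else 0)  # AND with the select bit
        result |= gated                              # OR everything together
    return result


register_file = [10, 20, 30, 40, 50, 60, 70, 80]
picked = mux(register_file, 3)
```

Every register passes through the gating logic on every read, even though only one value survives.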
Almost all the area in a classic CUDA core is just moving bytes — not doing arithmetic. ~7/8 of the gates feed the muxes that read and write the register file. This is the problem statement that motivated Tensor Cores and, before them, systolic arrays.
Systolic arrays: tilting the ratio
Bake two loops into hardware
A single MAC bakes one level of the triple loop into silicon. A systolic array bakes two: an entire matrix-vector multiply becomes one fixed-function block.
The unit goes from scalar op to tile of ops. Larger granularity means the same register-file tax is amortized over way more arithmetic.
Quadratic compute, linear comm
An x × y systolic array does x × y multiply-accumulates per cycle. But the data flowing in and out only scales as x (or x + y).
The bigger the array, the better the ratio. Older TPUs ran 128 × 128.
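A rough sketch of why bigger is better, assuming x values stream in and y values stream out each cycle (the x + y case). The function name is made up for illustration.

```python
def macs_per_value_moved(x, y):
    """Compute-to-communication ratio of an x by y systolic array."""
    macs_per_cycle = x * y            # every cell does one MAC per cycle
    values_moved_per_cycle = x + y    # assumed: x inputs in, y outputs out
    return macs_per_cycle / values_moved_per_cycle


small = macs_per_value_moved(8, 8)        # 64 / 16 = 4.0
large = macs_per_value_moved(128, 128)    # 16384 / 256 = 64.0
```

Going from 8 × 8 to 128 × 128 multiplies the side length by 16 and the ratio by 16 as well: compute grows with the area, communication only with the edge.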
How big should the array be?
A huge systolic array means more amortization. But it also means less flexibility for the register file and other ops.
Reiner's framing: set a budget — e.g. 10% of die area on data movement, 90% on the array — and size everything from there.
Bigger register files = more application performance but less array.
Splittable systolic arrays
Reiner mentions MatX has a "splittable systolic array" — big arrays that can also operate as several small ones.
It's the obvious compromise between TPU's coarse granularity and GPU's many-small-cores layout. We'll come back to this in the GPU-vs-TPU section.
Clock cycles & pipeline registers
100 billion transistors, in lockstep
Chips are massively parallel. To avoid software-style synchronization (mutexes and locks are far too slow), everything pauses simultaneously about once every nanosecond.
That moment is the clock tick, and the time between ticks is the clock cycle. It is mediated by registers: tiny storage devices that latch whatever value is on their input wire at the tick.
Logic must finish before the tick
If your "cloud of logic" between two registers takes longer than the clock period, you lose. The signal hasn't settled.
So a major job in chip design is making the longest path through any cloud of logic as short as possible.
Designers leave ~25% timing margin so the chip basically never misses.
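A sketch of how the longest path sets the clock. It assumes the ~25% margin is added on top of the longest register-to-register delay, which is one plausible reading of the slack figure. The delays are hypothetical numbers, not measurements.

```python
def clock_period_ns(path_delays_ns, margin=0.25):
    """Clock period: longest register-to-register delay plus margin."""
    return max(path_delays_ns) * (1.0 + margin)


def max_clock_ghz(path_delays_ns, margin=0.25):
    """Highest safe clock frequency (1 / ns = GHz)."""
    return 1.0 / clock_period_ns(path_delays_ns, margin)


# Hypothetical delays (ns) for three clouds of logic.
delays = [0.5, 0.8, 0.6]
# Same design after a pipeline register splits the 0.8 ns cloud in half.
pipelined = [0.5, 0.4, 0.4, 0.6]

before = max_clock_ghz(delays)       # limited by the 0.8 ns path
after = max_clock_ghz(pipelined)     # now limited by the 0.6 ns path
```

Only the slowest cloud matters. Splitting it with an extra register raises the clock, at the cost of one more register and one more cycle of latency.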
Latency vs throughput is a real knob. You can push clock speed arbitrarily high by stuffing pipeline registers everywhere — but past a point, almost all your area is registers, not logic. Same energy lesson as last episode's batch-size talk: high clock / low batch favors latency. Lower clock / wider arrays favor throughput.
FPGA vs ASIC
FPGA
- First unit: ~$10K.
- Reconfigurable in the field — change the design any time.
- Built from LUTs + registers + a giant mesh of muxes.
- ~10× more expensive in area and energy than ASIC.
- Great when you change the workload often (e.g. HFT, prototyping).
ASIC
- First unit: ~$30M (a full tape-out).
- Frozen at fabrication. No changing the logic.
- Custom polysilicon and wires — minimum gates for the job.
- ~10× cheaper & more efficient than the equivalent FPGA.
- Worth it once volume + stability justify the NRE cost.
The LUT: a 4→1 truth table in silicon
A typical FPGA "lookup table" has 4 input bits, 1 output bit. Inside it is a 16-entry table stored in configuration memory.
By writing different 16-bit patterns into that memory, the LUT becomes AND, OR, XOR, NAND, a 3-way majority, a 4-way parity — anything.
That's where the "field-programmable" comes from: muxes route signals between LUTs, LUTs configure into any gate. It's muxes all the way down.
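A small sketch of a 4-input LUT: the "design" is nothing but a 16-bit configuration word, and evaluating the gate is a table lookup. Function names and the bit ordering of the address are choices made here for illustration.

```python
def make_lut(func):
    """Build the 16-bit configuration word for a 4-input boolean function."""
    config = 0
    for address in range(16):
        a, b, c, d = [(address >> i) & 1 for i in range(4)]
        if func(a, b, c, d):
            config |= 1 << address     # set this truth-table entry to 1
    return config


def lut_read(config, a, b, c, d):
    """Evaluate the LUT: the four inputs form an address into the table."""
    address = a | (b << 1) | (c << 2) | (d << 3)
    return (config >> address) & 1


AND4 = make_lut(lambda a, b, c, d: a & b & c & d)
PARITY4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
MAJORITY3 = make_lut(lambda a, b, c, d: int(a + b + c >= 2))  # ignores d
```

The same `lut_read` hardware serves every function. Only the stored 16 bits change, which is what reconfiguring in the field amounts to.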
Programmability has a cost
An ASIC implements a 4-way AND with literally 3 AND gates.
An FPGA implements the same thing with one LUT — which internally is ~32 gates of muxes selecting from a 16-entry table.
That's the ~10× tax. Plus the routing muxes between LUTs cost area and add wire delay.
Cache vs scratchpad: who decides what's hot?
CPU / Cache
- One "read memory" instruction. Hardware decides if data is in cache.
- Cache is ~100× faster than DDR — programs need it to run at reasonable speed.
- But hit/miss depends on ambient environment: other programs, recent accesses, replacement policy.
- Non-deterministic latency.
TPU / Scratchpad
- Two distinct instructions: "read scratchpad" and "read HBM".
- Software is responsible for placing data in the right tier.
- Same idea, totally different control surface.
- Deterministic latency — by construction.
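A toy model of the two control surfaces compared above. The latencies (1 and 100 cycles, echoing the ~100× gap), the direct-mapped cache, and the class names are all simplifying assumptions, not a description of any real CPU or TPU.

```python
FAST, SLOW = 1, 100   # hypothetical latencies in cycles


class ToyCache:
    """One 'read' instruction; hardware decides hit or miss."""

    def __init__(self, lines=4):
        self.lines = lines
        self.tags = {}                      # slot -> address held there

    def read(self, address):
        slot = address % self.lines
        hit = self.tags.get(slot) == address
        self.tags[slot] = address           # hardware fills or evicts
        return FAST if hit else SLOW


class ToyScratchpad:
    """Two instructions; software decides what is resident."""

    def __init__(self):
        self.resident = set()

    def copy_from_hbm(self, address):       # the "read HBM" path
        self.resident.add(address)
        return SLOW

    def read(self, address):                # the "read scratchpad" path
        if address not in self.resident:
            raise KeyError("software never placed this address")
        return FAST


cache = ToyCache()
latencies = [cache.read(0), cache.read(0), cache.read(4), cache.read(0)]
```

With the cache, the latency of `read(0)` depends on what ran before it. With the scratchpad, each instruction always costs the same, so the schedule is known at compile time.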
FPGAs win on latency because cache is gone
Reiner notes you could build a CPU with deterministic latency — and some chips do (Groq, TPU compute cores). It's actually a simpler starting point. CPUs are non-deterministic because someone added caches and branch prediction. You can take them out — you just give up performance per cycle to do it.
This is why HFT shops like Jane Street reach for FPGAs: predictable per-packet latency matters more than peak throughput.
Why CPU cores are bigger than GPU cores
~100 cores × big & complicated
A modern CPU has ~100 cores doing ~16-way SIMD, which is on the order of 1,000-way parallelism. But each core is huge.
Where does the die go? Mostly:
- Cache hierarchy
- Register files
- Branch predictor (the GPU-killer)
- ALUs (small fraction)
Many small SMs, no branch predictor
A GPU rips out a lot of CPU baggage, most importantly the branch predictor, and shrinks the register files. Result: way more cores and less area per ALU.
The trade is that GPUs are bad at branchy serial code. They're great at SIMD throughput where you don't need to guess where the next instruction lives.
It predicts the future ~5 cycles ahead
A single instruction takes ~5 ns to process: read, decode, evaluate, write back. To run at 1–2 GHz, the CPU must keep pipelining new instructions while old ones finish. But if an old instruction is a branch (an if), the CPU doesn't yet know which way to go.
The branch predictor guesses — based on history, target tables, and pattern detection — and the pipeline runs ahead speculatively. On a misprediction, the speculative work is thrown away. That's why a tight branchy loop on a CPU benefits enormously from a sophisticated predictor.
GPUs don't have one because they don't need one. They rely on having so many threads in flight that they can just switch to ready work while a branch resolves.
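To make "guesses based on history" concrete, here is a sketch of a classic 2-bit saturating counter. It is a textbook scheme used here as a stand-in: real predictors are far more elaborate, and nothing in the conversation says which design any particular CPU uses.

```python
class TwoBitPredictor:
    """Counter 0-1 predicts not taken, 2-3 predicts taken."""

    def __init__(self):
        self.counter = 2                    # start at "weakly taken"

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def count_mispredictions(outcomes):
    """Run a stream of branch outcomes through one predictor."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1                     # speculative work thrown away
        predictor.update(taken)
    return misses


# A loop branch: taken 9 times, then falls through, repeated 3 times.
loop_branch = ([True] * 9 + [False]) * 3
loop_misses = count_mispredictions(loop_branch)
```

On the loop pattern the predictor only misses at each loop exit, so 27 of 30 guesses are right and the pipeline stays full almost all the time.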
Brains vs chips
Where the analogy holds
- Memory ↔ compute co-location: brains do it natively, but systolic arrays do too — the weight sits where the math happens.
- Sparsity: brains are unstructured sparse. Chips can do structured sparsity but pay a tax for unstructured.
- Clock speed: brain runs at maybe kilohertz. Chips run at gigahertz.
Why slow doesn't mean efficient
Most of a chip's energy is in switching bits 0↔1 — charging and discharging tiny capacitors. Static idle power is much smaller.
So if you ran a GPU at 1 MHz instead of 1 GHz, you'd draw ~1,000× less power. But you'd also do ~1,000× less work per second, so per operation you don't save much.
The brain isn't more efficient simply by being slow. Something else is going on — likely a combination of co-location, sparsity, and analog computation.
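The power-versus-energy distinction can be sketched with the standard CMOS dynamic power relation, P = activity × C × V² × f. The capacitance and voltage values are placeholders, and holding voltage fixed while the clock drops is a simplifying assumption.

```python
import math


def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """Switching power in watts: activity * C * V^2 * f."""
    return activity * capacitance * voltage ** 2 * frequency


def energy_per_cycle(capacitance, voltage, frequency, activity=1.0):
    """Energy spent per clock cycle in joules: power / frequency."""
    return dynamic_power(capacitance, voltage, frequency, activity) / frequency


C = 1e-9      # hypothetical switched capacitance, farads
V = 0.8       # hypothetical supply voltage, volts

power_ratio = dynamic_power(C, V, 1e9) / dynamic_power(C, V, 1e6)
same_energy = math.isclose(energy_per_cycle(C, V, 1e9),
                           energy_per_cycle(C, V, 1e6))
```

Frequency cancels out of the per-cycle energy, so slowing the clock alone cuts power but not the cost of each operation.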
A GPU is just a bunch of tiny TPUs
Lots of small tiles → flexibility
- Many SMs, each with their own tensor core, vector ALUs, register file.
- Lots of perimeter between matrix and vector units → tons of cross-bandwidth.
- Great when there isn't one giant matmul — when work is uneven or branchy.
- Pays the register-file tax over many small tiles.
Few big tiles → amortization
- One or two huge MXUs + one vector unit.
- All data movement between matrix and vector squeezes through narrow perimeter — lower bandwidth there.
- But each register-file dollar is spread over a much bigger tile.
- Wins when the workload is one big matmul. Suffers when work doesn't fit the shape.
MatX's pitch (hinted at): a splittable systolic array that behaves like a big TPU tile when the matmul is big, and like a stack of small GPU-style tiles when it isn't. Best of both granularities, ideally without the worst of either.
The mental model in 7 lines
The compute / comm ratio is everything
At every level of the stack — from the bit-width of a multiplier to the floor plan of a datacenter — the optimization is more arithmetic per byte moved. That's what motivates low precision, systolic arrays, scratchpads, and the GPU-vs-TPU layout debate.
The whole conversation, compressed
- The atomic op of AI hardware is the multiply-accumulate.
- Multiplier area scales as p × q, so low precision wins quadratically.
- In a vanilla core, most of the area moves data, not arithmetic.
- Systolic arrays fix this by baking a 2D loop into hardware and parking weights in place.
- Clock speed is set by the longest path — pipeline registers buy speed at an area cost.
- FPGAs trade ~10× efficiency for in-field reconfigurability via LUTs and routing muxes.
- GPUs are many tiny TPUs; TPUs are few big GPU-tiles. MatX wants splittable.
Open questions the transcript points at
- How much of FP4 vs FP8 should a chip dedicate? Equal die area? Equal power budget? Customer-driven?
- How big should one systolic array be before perimeter bandwidth kills you?
- Can splittable systolic arrays really get TPU's amortization and GPU's intra-chip bandwidth?
- What's the analog-computation / co-location story that lets brains run at kilohertz and still beat silicon on perception?