INSIDE THE CHIP // notes from Reiner Pope on how AI silicon actually works

00

The one idea that runs the whole thing

level 1

◇at the gate

Multiplier area scales quadratically with bit-width, so halving precision more than doubles throughput. This is why FP4 is so much faster than FP8.

level 2

◇at the core

A systolic array bakes a 2D loop of MACs into hardware so the weight matrix sits in place while activations flow through.

level 3

◇at the chip

A scratchpad replaces a cache so memory access is deterministic and software, not hardware, controls movement.

01

Logic gates → multiply-accumulate

primitives

why MAC?

The atomic op of AI

Look inside any matrix multiply and you find a triple for loop:

// matrix multiply, three nested loops
for i:
  for j:
    for k:
      out[i,k] += A[i,j] * B[j,k]

Every step is one multiply-accumulate — a multiply, an add into a running sum. So the whole chip can be optimized around that one operation.
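
To make the loop concrete, here is a minimal pure-Python sketch of the same triple loop, keeping the index order of the pseudocode above (A is indexed [i][j], B is indexed [j][k]). The name `matmul_mac` and the list-of-lists layout are assumptions for illustration only.

```python
def matmul_mac(A, B):
    """Multiply A (I x J) by B (J x K) using one multiply-accumulate per step."""
    I, J, K = len(A), len(B), len(B[0])
    out = [[0] * K for _ in range(I)]
    for i in range(I):
        for j in range(J):
            for k in range(K):
                # The multiply-accumulate: one multiply, one add into a running sum.
                out[i][k] += A[i][j] * B[j][k]
    return out


print(matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```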

precision asymmetry

Multiply small, accumulate big

Reiner's example: a 4-bit × 4-bit multiply, accumulating into an 8-bit running sum.

Why the asymmetry? Two reasons:

The product of two N-bit numbers needs 2N bits to hold without loss.
You sum many of these — rounding errors pile up in the accumulator, not the multiplier.

So low-precision multiply plus higher-precision accumulate is nearly a free lunch in terms of error.
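
A quick way to check the 2N-bit claim is to count bits directly. This is a rough sketch that assumes unsigned integers; the helper `bits_needed` and the choice of 16 summed terms are illustrative, not from the talk.

```python
def bits_needed(value):
    """Number of bits needed to store a non-negative integer."""
    return max(1, value.bit_length())


N = 4
max_operand = 2**N - 1           # 15 for an unsigned 4-bit value
max_product = max_operand ** 2   # 225

print(bits_needed(max_product))          # 8, i.e. 2N bits
# Summing many worst-case products needs extra headroom in the accumulator.
print(bits_needed(16 * max_product))     # 12
```

The multiplier inputs stay narrow, and only the running sum has to grow.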

// FIG-01 — 4-bit × 4-bit long multiplication, accumulator on top

building block

The full adder is a 3→2 compressor

Coming from software you'd assume a "full adder" adds two 32-bit numbers. It doesn't.

It takes three single bits, counts them, and writes the count in binary as two bits.

// truth table sample
in = 1 1 1 → out = 1 1 (count=3)
in = 1 0 1 → out = 1 0 (count=2)
in = 0 1 0 → out = 0 1 (count=1)
in = 0 0 0 → out = 0 0 (count=0)

The right output bit is the column sum. The left bit is the carry into the next column.
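
The "count the bits" view and the usual gate-level view give the same answer. A small sketch, with function names chosen here for illustration:

```python
def full_adder(a, b, c):
    """3->2 compressor: count three input bits, return (carry, sum)."""
    count = a + b + c
    carry = count >> 1   # left output bit: carry into the next column
    total = count & 1    # right output bit: the column sum
    return carry, total


def full_adder_gates(a, b, c):
    """Same function written as gates: XOR for the sum, majority for the carry."""
    total = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return carry, total


for bits in [(1, 1, 1), (1, 0, 1), (0, 1, 0), (0, 0, 0)]:
    print(bits, "->", full_adder(*bits))
```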

algorithm

Dadda multiplier

To do the big column sum from the multiplication above, you tile full adders across the partial-product grid.

Each adder eats 3 bits, emits 2 bits — net −1 bit.
Started with 24 input bits (16 partial-product bits plus the 8-bit accumulator), ended with 8 output bits.
So you needed exactly 24 − 8 = 16 full adders.
Generalizes to p × q full adders for a p-bit × q-bit MAC.

It's the standard area-efficient multiplier construction.

The quadratic scaling insight. Halving precision doesn't just double throughput, it more than doubles it, because area scales as p × q. This is the single biggest reason low-precision arithmetic has worked so well for neural nets. Nvidia even acknowledged this on B300 by quoting FP4 ≈ 3× FP8 instead of the historical 2× ratio. By pure p × q scaling, halving both operand widths would give 4×.
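
The counting argument can be written down in a few lines. This is an idealized sketch: it assumes a (p + q)-bit accumulator, treats every adder as a 3→2 compressor, ignores half adders and overflow, and models FP8 and FP4 as plain 8-bit and 4-bit integer multipliers, which real floating-point formats are not. The function name is made up for this example.

```python
def full_adders_for_mac(p, q):
    """Idealized full-adder count for a p-bit x q-bit multiply-accumulate."""
    partial_product_bits = p * q
    accumulator_bits = p + q
    input_bits = partial_product_bits + accumulator_bits
    output_bits = p + q
    # Each full adder removes exactly one bit (3 in, 2 out).
    return input_bits - output_bits


print(full_adders_for_mac(4, 4))                              # 16
print(full_adders_for_mac(8, 8) / full_adders_for_mac(4, 4))  # 4.0
```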

02

The hidden tax: muxes and data movement

communication

old-school CUDA core / CPU

Where does the MAC live?

You drop the multiply-accumulate unit next to a register file. The MAC reads three registers — two operands and the accumulator — does its thing, writes back.

But which registers? The MAC doesn't always read the same three slots. So you need a mux in front of each input to select.

what is a mux

A mux is a hardware switch

To pick "register #3" out of 8, hardware does the dumb thing: AND every entry with a one-hot mask, then OR everything together.

// n-input, p-bit mux
ANDs = n × p
ORs = (n − 1) × p

Selecting a register is not free. It looks like nothing in software but it's a real chunk of silicon.
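
Here is a small software model of that AND-then-OR selection, plus the gate count from the formula above. The register values and the 32-bit width in the example are assumptions chosen for illustration.

```python
def mux_select(registers, index, width):
    """Pick registers[index] the way the hardware does: AND with a one-hot mask, then OR."""
    all_ones = (1 << width) - 1
    result = 0
    for slot, value in enumerate(registers):
        mask = all_ones if slot == index else 0  # one-hot select line fanned out to every bit
        result |= value & mask
    return result


def mux_gate_count(n, p):
    """Return (ANDs, ORs) for an n-input, p-bit mux."""
    return n * p, (n - 1) * p


regs = [10, 20, 30, 40, 50, 60, 70, 80]
print(mux_select(regs, 3, width=32))  # 40
print(mux_gate_count(8, 32))          # (256, 224)
```

Under these assumptions, one read port over 8 registers of 32 bits already costs a few hundred gates, and the MAC needs three of them.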

// FIG-02 — Where the gates actually go in a CUDA-style core

Almost all the area in a classic CUDA core is just moving bytes — not doing arithmetic. ~7/8 of the gates feed the muxes that read and write the register file. This is the problem statement that motivated Tensor Cores and, before them, systolic arrays.

03

Systolic arrays: tilting the ratio

tensor cores

the move

Bake two loops into hardware

A single MAC bakes one level of the triple loop into silicon. A systolic array bakes two: an entire matrix-vector multiply becomes one fixed-function block.

The unit goes from scalar op to tile of ops. Larger granularity means the same register-file tax is amortized over way more arithmetic.
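
A toy functional model of that fixed-function block, assuming a weight-stationary layout where each cell keeps one weight. It is not cycle-accurate and the class name is hypothetical; it only shows which two loops live in hardware.

```python
class WeightStationaryArray:
    """Toy model of a systolic array: weights stay put, activations pass through."""

    def __init__(self, weights):
        # Loaded once; cell (r, c) holds weights[r][c] for many inputs.
        self.weights = weights

    def matvec(self, activations):
        rows, cols = len(self.weights), len(self.weights[0])
        partial_sums = [0] * cols        # partial sums travel down each column
        for r in range(rows):            # activation r enters row r
            for c in range(cols):        # each cell does one MAC and passes the sum on
                partial_sums[c] += activations[r] * self.weights[r][c]
        return partial_sums


array = WeightStationaryArray([[1, 2], [3, 4]])
print(array.matvec([10, 1]))  # [13, 24]
```

In silicon both loops run in parallel across the grid; here they are ordinary Python loops.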

scaling property

Quadratic compute, linear comm

An x × y systolic array does x × y multiply-accumulates per cycle. But the data flowing in and out only scales as x (or x + y).

compute ∝ x · y (quadratic)
i/o wires ∝ x (linear)
ratio → y× better as it grows

The bigger the array, the better the ratio. Older TPUs ran 128 × 128.
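
The ratio is easy to tabulate. This sketch counts I/O as x + y values per cycle, which is one of the two options mentioned above; the function name is an assumption.

```python
def systolic_ratio(x, y):
    """Return (MACs per cycle, I/O values per cycle, MACs per I/O value)."""
    macs_per_cycle = x * y
    io_values = x + y
    return macs_per_cycle, io_values, macs_per_cycle / io_values


for size in (8, 32, 128):
    print(size, systolic_ratio(size, size))
# 128 -> (16384, 256, 64.0)
```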

// FIG-03 — 2×2 systolic array: weights stay, activations flow

sizing decision

How big should the array be?

A huge systolic array means more amortization. But it also means less flexibility for the register file and other ops.

Reiner's framing: set a budget — e.g. 10% of die area on data movement, 90% on the array — and size everything from there.

Bigger register files = more application performance but less array.

MatX hint

Splittable systolic arrays

Reiner mentions MatX has a "splittable systolic array" — big arrays that can also operate as several small ones.

It's the obvious compromise between TPU's coarse granularity and GPU's many-small-cores layout. We'll come back to this in §09.

04

Clock cycles & pipeline registers

timing

why a clock?

100 billion transistors, in lockstep

Chips are massively parallel. To avoid software-style synchronization (mutexes, locks — way too slow), every nanosecond everything pauses simultaneously.

That moment is the clock tick, and the interval from one tick to the next is a clock cycle. It is mediated by registers: tiny storage devices that latch whatever value is on their input wire at the tick.

the constraint

Logic must finish before the tick

If your "cloud of logic" between two registers takes longer than the clock period, you lose. The signal hasn't settled.

So a major job in chip design is making the longest path through any cloud of logic as short as possible.

Designers leave ~25% timing margin so the chip basically never misses.
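
As a rough sketch, the timing constraint is just an inequality. How the ~25% margin is applied is an assumption here (the path delay is padded by 25% before comparing against the clock period), and the delays are made-up numbers.

```python
def meets_timing(path_delay_ns, clock_period_ns, margin=0.25):
    """True if the padded logic delay fits inside one clock period."""
    return path_delay_ns * (1 + margin) <= clock_period_ns


def max_clock_ghz(critical_path_ns, margin=0.25):
    """Fastest clock the longest path allows once the margin is included."""
    return 1.0 / (critical_path_ns * (1 + margin))


print(meets_timing(0.7, 1.0))  # True:  0.875 ns fits in 1 ns
print(meets_timing(0.9, 1.0))  # False: 1.125 ns does not
print(max_clock_ghz(0.8))
```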

// FIG-04 — Pipeline register insertion: trade area for clock speed

Latency vs throughput is a real knob. You can push clock speed arbitrarily high by stuffing pipeline registers everywhere — but past a point, almost all your area is registers, not logic. Same energy lesson as last episode's batch-size talk: high clock / low batch favors latency. Lower clock / wider arrays favor throughput.
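
A toy model of that knob, under loudly stated assumptions: the logic splits into equal stages, every stage boundary adds one register with a fixed delay and a fixed area, and all the numbers are hypothetical.

```python
def pipeline_tradeoff(logic_delay_ns, logic_area, stages,
                      register_delay_ns=0.05, register_area=1.0):
    """Return (clock in GHz, fraction of total area spent on pipeline registers)."""
    period_ns = logic_delay_ns / stages + register_delay_ns
    clock_ghz = 1.0 / period_ns
    total_register_area = stages * register_area
    register_share = total_register_area / (logic_area + total_register_area)
    return clock_ghz, register_share


for stages in (1, 4, 16, 64):
    ghz, share = pipeline_tradeoff(4.0, 16.0, stages)
    print(f"{stages:3d} stages: {ghz:5.2f} GHz, {share:.0%} of area is registers")
```

In this model the clock keeps rising with more stages, while the share of area left for logic keeps shrinking.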

05

FPGA vs ASIC

reconfigurability

FPGA

First unit: ~$10K.
Reconfigurable in the field — change the design any time.
Built from LUTs + registers + a giant mesh of muxes.
~10× more expensive in area and energy than ASIC.
Great when you change the workload often (e.g. HFT, prototyping).

vs

ASIC

First unit: ~$30M (a full tape-out).
Frozen at fabrication. No changing the logic.
Custom polysilicon and wires — minimum gates for the job.
~10× cheaper & more efficient than the equivalent FPGA.
Worth it once volume + stability justify the NRE cost.

primitive

The LUT: a 4→1 truth table in silicon

A typical FPGA "lookup table" has 4 input bits, 1 output bit. Inside it is a 16-entry table stored in configuration memory.

By writing different 16-bit patterns into that memory, the LUT becomes AND, OR, XOR, NAND, a 3-way majority, a 4-way parity — anything.

That's where the "field-programmable" comes from: muxes route signals between LUTs, LUTs configure into any gate. It's muxes all the way down.
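
A 4-input LUT is small enough to model directly. In this sketch the class `LUT4` and its `from_function` helper are hypothetical names; the bit ordering (input a is the lowest index bit) is also an assumption.

```python
class LUT4:
    """4-input, 1-output lookup table configured by a 16-bit pattern."""

    def __init__(self, config_bits):
        self.config_bits = config_bits & 0xFFFF

    @classmethod
    def from_function(cls, fn):
        """Build the 16-bit pattern by evaluating fn on every input combination."""
        config = 0
        for index in range(16):
            a, b, c, d = [(index >> bit) & 1 for bit in range(4)]
            config |= (fn(a, b, c, d) & 1) << index
        return cls(config)

    def __call__(self, a, b, c, d):
        index = a | (b << 1) | (c << 2) | (d << 3)
        return (self.config_bits >> index) & 1


and4 = LUT4.from_function(lambda a, b, c, d: a & b & c & d)
parity4 = LUT4.from_function(lambda a, b, c, d: a ^ b ^ c ^ d)
print(hex(and4.config_bits))     # 0x8000
print(hex(parity4.config_bits))  # 0x6996
```

The same hardware becomes a different gate purely by changing the 16 stored bits.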

why ~10× costlier

Programmability has a cost

An ASIC implements a 4-way AND with literally 3 AND gates.

An FPGA implements the same thing with one LUT — which internally is ~32 gates of muxes selecting from a 16-entry table.

That's the ~10× tax. Plus the routing muxes between LUTs cost area and add wire delay.
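
The ~32-gate figure lines up with the AND/OR mux formula from §02 if the LUT is modeled as a 16-input, 1-bit mux. That modeling choice is an assumption, and the sketch ignores configuration memory and routing.

```python
def mux_gates(n, p):
    """Total gates for an n-input, p-bit AND/OR mux: n*p ANDs plus (n-1)*p ORs."""
    return n * p + (n - 1) * p


lut_gates = mux_gates(16, 1)   # 16 ANDs + 15 ORs
asic_gates = 3                 # a 4-way AND built from three 2-input AND gates

print(lut_gates, asic_gates, round(lut_gates / asic_gates, 1))  # 31 3 10.3
```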

// FIG-05 — 4-input LUT: a programmable truth table

06

Cache vs scratchpad: who decides what's hot?

memory model

CPU / Cache

One "read memory" instruction. Hardware decides if data is in cache.
Cache is ~100× faster than DDR — programs need it to run at reasonable speed.
But hit/miss depends on ambient environment: other programs, recent accesses, replacement policy.
Non-deterministic latency.

vs

TPU / Scratchpad

Two distinct instructions: "read scratchpad" and "read HBM".
Software is responsible for placing data in the right tier.
Same idea, totally different control surface.
Deterministic latency — by construction.
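
The two control surfaces can be contrasted with a toy model. Everything here is illustrative: the class and method names are hypothetical, the cache uses a simple LRU policy as a stand-in for whatever a real chip does, and the 1 ns / 100 ns latencies just echo the ~100× gap quoted above.

```python
from collections import OrderedDict

FAST_NS, SLOW_NS = 1, 100  # hypothetical latencies


class Cache:
    """Hardware-managed: one read(), latency depends on earlier accesses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def read(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)      # refresh LRU position
            return FAST_NS
        self.lines[addr] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)    # evict the least recently used line
        return SLOW_NS


class Scratchpad:
    """Software-managed: separate operations, each with a fixed latency."""

    def __init__(self):
        self.resident = set()

    def load_from_hbm(self, addr):
        self.resident.add(addr)               # software decides what is resident
        return SLOW_NS

    def read_scratchpad(self, addr):
        assert addr in self.resident, "software must place the data first"
        return FAST_NS


cache = Cache(capacity=2)
print([cache.read(a) for a in [0, 1, 0, 2, 1]])  # [100, 100, 1, 100, 100]
```

The cache latencies depend on the access history, while each scratchpad operation costs the same every time it is issued.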

key insight

FPGAs win on latency because cache is gone

Reiner notes you could build a CPU with deterministic latency — and some chips do (Groq, TPU compute cores). It's actually a simpler starting point. CPUs are non-deterministic because someone added caches and branch prediction. You can take them out — you just give up performance per cycle to do it.

This is why HFT shops like Jane Street reach for FPGAs: predictable per-packet latency matters more than peak throughput.

07

Why CPU cores are bigger than GPU cores

architecture

CPU

~100 cores × big & complicated

A modern CPU has ~100 cores doing ~16-way SIMD = ~1,000-way parallelism. But each core is huge.

Where does the die go? Mostly:

Cache hierarchy
Register files
Branch predictor (the GPU-killer)
ALUs (small fraction)

GPU

Many small SMs, no branch predictor

A GPU rips out a lot of CPU baggage, most importantly the branch predictor, and shrinks the register files. Result: way more cores, and a larger share of area spent on ALUs.

The trade is that GPUs are bad at branchy serial code. They're great at SIMD throughput where you don't need to guess where the next instruction lives.

what does a branch predictor do?

It predicts the future ~5 cycles ahead

A single instruction takes ~5 ns to process: read, decode, evaluate, write back. To run at 1–2 GHz, the CPU must keep pipelining new instructions while old ones finish. But if an old instruction is a branch (an if), the CPU doesn't yet know which way to go.

The branch predictor guesses — based on history, target tables, and pattern detection — and the pipeline runs ahead speculatively. On a misprediction, the speculative work is thrown away. That's why a tight branchy loop on a CPU benefits enormously from a sophisticated predictor.

GPUs don't have one because they don't need one. They rely on having so many threads in flight that they can just switch to ready work while a branch resolves.
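
For a feel of what "guessing from history" means, here is a classic 2-bit saturating counter. It is a textbook technique swapped in for illustration, far simpler than the predictors in real CPUs, and it tracks only a single branch.

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counter = 2  # start at "weakly taken"

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def count_mispredictions(outcomes):
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1   # speculative work would be thrown away here
        predictor.update(taken)
    return misses


# A loop branch: taken 9 times, then falls through once, repeated 10 times.
loop_pattern = ([True] * 9 + [False]) * 10
print(count_mispredictions(loop_pattern))  # 10 misses out of 100 branches
```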

08

Brains vs chips

analogies & limits

structural differences

Where the analogy holds

Memory ↔ compute co-location: brains do it natively, but systolic arrays do too — the weight sits where the math happens.
Sparsity: brains are unstructured sparse. Chips can do structured sparsity but pay a tax for unstructured.
Clock speed: brain runs at maybe kilohertz. Chips run at gigahertz.

energy

Why slow doesn't mean efficient

Most of a chip's energy is in switching bits 0↔1 — charging and discharging tiny capacitors. Static idle power is much smaller.

So if you ran a GPU at 1 MHz instead of 1 GHz, you'd use ~1,000× less power. But you'd also do ~1,000× less work per second. Per operation, you don't save much energy.

The brain isn't more efficient simply by being slow. Something else is going on — likely a combination of co-location, sparsity, and analog computation.
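
The arithmetic behind this uses the standard CMOS switching model, where dynamic power is roughly C · V² · f. The sketch assumes the supply voltage is held fixed when the clock slows down, and the capacitance and voltage values are made up.

```python
def dynamic_power_watts(capacitance_f, voltage_v, frequency_hz):
    """Dynamic power ~ C * V^2 * f: energy per switch times switches per second."""
    return capacitance_f * voltage_v ** 2 * frequency_hz


def energy_per_cycle_joules(capacitance_f, voltage_v):
    """Energy per cycle ~ C * V^2; the clock frequency does not appear."""
    return capacitance_f * voltage_v ** 2


C, V = 1e-9, 0.8  # hypothetical switched capacitance (farads) and supply voltage
fast = dynamic_power_watts(C, V, 1e9)
slow = dynamic_power_watts(C, V, 1e6)

print(fast / slow)                    # about 1000: power drops with the clock
print(energy_per_cycle_joules(C, V))  # identical at 1 MHz and 1 GHz
```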

09

A GPU is just a bunch of tiny TPUs

tile sizing

// FIG-06 — Floor plan comparison: GPU's many small tiles vs TPU's few big tiles

GPU

Lots of small tiles → flexibility

Many SMs, each with their own tensor core, vector ALUs, register file.
Lots of perimeter between matrix and vector units → tons of cross-bandwidth.
Great when there isn't one giant matmul — when work is uneven or branchy.
Pays the register-file tax over many small tiles.

TPU

Few big tiles → amortization

One or two huge MXUs + one vector unit.
All data movement between matrix and vector units squeezes through a narrow perimeter, so bandwidth there is lower.
But each register-file dollar is spread over a much bigger tile.
Wins when the workload is one big matmul. Suffers when work doesn't fit the shape.

MatX's pitch (hinted at): a splittable systolic array that behaves like a big TPU tile when the matmul is big, and like a stack of small GPU-style tiles when it isn't. Best of both granularities, ideally without the worst of either.

10

The mental model in 7 lines

summary

×4

FP4 vs FP8 (theoretical)

7/8

cost in data movement

128²

classic TPU MXU size

10×

FPGA tax vs ASIC

remember this

The compute / comm ratio is everything

At every level of the stack — from the bit-width of a multiplier to the floor plan of a datacenter — the optimization is more arithmetic per byte moved. That's what motivates low precision, systolic arrays, scratchpads, and the GPU-vs-TPU layout debate.

7 lines

The whole conversation, compressed

The atomic op of AI hardware is the multiply-accumulate.
Multiplier area scales as p × q, so low precision wins quadratically.
In a vanilla core, most of the area moves data, not arithmetic.
Systolic arrays fix this by baking a 2D loop into hardware and parking weights in place.
Clock speed is set by the longest path — pipeline registers buy speed at an area cost.
FPGAs trade ~10× efficiency for in-field reconfigurability via LUTs and routing muxes.
GPUs are many tiny TPUs; TPUs are few big GPU-tiles. MatX wants splittable.

further

Open questions the transcript points at

How should a chip split its resources between FP4 and FP8? Equal die area? Equal power budget? Customer-driven?
How big should one systolic array be before perimeter bandwidth kills you?
Can splittable systolic arrays really get TPU's amortization and GPU's intra-chip bandwidth?
What's the analog-computation / co-location story that lets brains run at kilohertz and still beat silicon on perception?
