SYS // CHIP-NOTES v1.0 SRC: Dwarkesh × Reiner Pope (MatX)
STATUS: NOMINAL NODE: 3nm
// TRANSCRIPT BREAKDOWN

inside the chip
how AI silicon actually works

A bottom-up walk through how an AI chip is built — starting from AND gates and ending at the GPU-vs-TPU architectural split. Notes from a conversation between Dwarkesh Patel and Reiner Pope, CEO of MatX. Every section here distills one big idea from the transcript and the diagrams that make it stick.

primitive: multiply-accumulate
building block: full adder (3→2)
unit cell: systolic array
enemy #1: data movement cost
scaling law: compute ∝ p × q
00

The one idea that runs the whole thing

META

Every level of chip design is the same fight: maximize compute relative to communication. From the precision of a single multiplier, to the size of a systolic array, to the layout of a whole datacenter — you are always trying to do more arithmetic per byte you move. That's it. That's the whole show.

level 1

at the gate

Multiplier area scales quadratically with bit-width, so halving precision more than doubles throughput. This is why FP4 is so much faster than FP8.

level 2

at the core

A systolic array bakes a 2D loop of MACs into hardware so the weight matrix sits in place while activations flow through.

level 3

at the chip

A scratchpad replaces a cache so memory access is deterministic and software, not hardware, controls movement.

01

Logic gates → multiply-accumulate

primitives
why MAC?

The atomic op of AI

Look inside any matrix multiply and you find a triple for loop:

// matrix multiply, three nested loops
for i:
  for j:
    for k:
      out[i,k] += A[i,j] * B[j,k]

Every step is one multiply-accumulate: a multiply, then an add into a running sum. So the whole chip can be optimized around that one operation.
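A minimal runnable sketch of that loop nest, using plain Python lists. The helper name `matmul_macs` is made up for illustration; it counts one MAC per innermost step.

```python
def matmul_macs(A, B):
    """Multiply A (n x m) by B (m x r) with explicit loops and count the MACs."""
    n, m, r = len(A), len(B), len(B[0])
    out = [[0] * r for _ in range(n)]
    macs = 0
    for i in range(n):
        for j in range(m):
            for k in range(r):
                out[i][k] += A[i][j] * B[j][k]  # one multiply-accumulate
                macs += 1
    return out, macs


result, macs = matmul_macs([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, macs)  # [[19, 22], [43, 50]] 8
```

An n × m by m × r multiply always costs n · m · r MACs, which is why the MAC is the unit worth optimizing.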

precision asymmetry

Multiply small, accumulate big

Reiner's example: a 4-bit × 4-bit multiply, accumulating into an 8-bit running sum.

Why the asymmetry? Two reasons:

  • The product of two N-bit numbers needs 2N bits to hold without loss.
  • You sum many of these — rounding errors pile up in the accumulator, not the multiplier.

So low-precision multiply + higher-precision add is a free lunch in error.

// FIG-01 — 4-bit × 4-bit long multiplication, accumulator on top
A (4-bit):              1 0 0 1
B (4-bit):            × 1 0 1 0
16 AND gates → partial products:
          0 0 0 0
        1 0 0 1 ·
      0 0 0 0 · ·
    1 0 0 1 · · ·
+ accumulator (8-bit):  0 1 1 0 1 0 1 1
5-way column sum → 16 full adders
result (8-bit):         1 1 0 0 0 1 0 1
p × q ANDs = 16
16 partial-product bits + 8 accumulator bits = 24 input bits
p × q full adders = 16
SCALING LAW: p-bit × q-bit MAC → p×q AND gates + p×q full adders → area ≈ O(p·q)
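The long multiplication can be sketched bit by bit: one AND per pair of operand bits, each partial-product bit weighted by its column, then everything summed into the accumulator. The function name `mac_4x4` and the wrap-around on overflow are assumptions for illustration; the operands are the ones from the figure (9 × 10 + 107 = 197).

```python
def mac_4x4(a, b, acc):
    """4-bit x 4-bit multiply added into an 8-bit accumulator (wraps mod 256)."""
    assert 0 <= a < 16 and 0 <= b < 16 and 0 <= acc < 256
    total = acc
    for j in range(4):              # one partial-product row per bit of b
        for i in range(4):          # one AND gate per (bit of a, bit of b) pair
            bit = ((a >> i) & 1) & ((b >> j) & 1)
            total += bit << (i + j)  # bit lands in column i + j
    return total & 0xFF


out = mac_4x4(0b1001, 0b1010, 0b01101011)
print(format(out, "08b"))  # 11000101
```

The largest 4-bit product is 15 × 15 = 225, which fits in 8 bits. That is the "2N bits to hold without loss" point from the list above.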
building block

The full adder is a 3→2 compressor

Coming from software you'd assume a "full adder" adds two 32-bit numbers. It doesn't.

It takes three single bits, counts them, and writes the count in binary as two bits.

// truth table sample
in = 1 1 1 → out = 1 1 (count=3)
in = 1 0 1 → out = 1 0 (count=2)
in = 0 1 0 → out = 0 1 (count=1)
in = 0 0 0 → out = 0 0 (count=0)

The right output bit is the column sum. The left bit is the carry into the next column.
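As a small sketch, the full adder can be written with the usual gate-level formulas: XOR for the sum bit, majority for the carry. The function name is illustrative.

```python
def full_adder(a, b, c):
    """3 -> 2 compressor: returns (carry, sum) with 2*carry + sum == a + b + c."""
    s = a ^ b ^ c                        # low bit of the count (column sum)
    carry = (a & b) | (a & c) | (b & c)  # high bit: set when two or more inputs are 1
    return carry, s


for bits in [(1, 1, 1), (1, 0, 1), (0, 1, 0), (0, 0, 0)]:
    print(bits, "->", full_adder(*bits))
```

The printed pairs are (carry, sum), matching the left and right output bits in the truth table.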

algorithm

Dadda multiplier

To do the big column sum from the multiplication above, you tile full adders across the partial-product grid.

  • Each adder eats 3 bits, emits 2 bits — net −1 bit.
  • Started with 24 input bits, ended with 8 output bits.
  • So you needed exactly 24 − 8 = 16 full adders.
  • Generalizes to p × q full adders for a p-bit × q-bit MAC.

It's the standard area-efficient multiplier construction.
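The counting argument above fits in a few lines. This sketch assumes, as the text does, that every reduction step is a full adder (no half adders) and that the accumulator is p + q bits wide, generalizing the 4 × 4 → 8 example. The function name is made up.

```python
def full_adders_needed(p, q):
    """Counting argument for a p-bit x q-bit MAC with a (p+q)-bit accumulator."""
    partial_product_bits = p * q       # one AND-gate output per pair of operand bits
    accumulator_bits = p + q           # assumed accumulator width
    input_bits = partial_product_bits + accumulator_bits
    output_bits = p + q                # the updated running sum
    return input_bits - output_bits    # each full adder removes exactly one bit


print(full_adders_needed(4, 4))  # 16
print(full_adders_needed(8, 8))  # 64
```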

The quadratic scaling insight. Halving precision doesn't just double throughput; it more than doubles it, because area scales as p × q. This is the single biggest reason low-precision arithmetic has worked so well for neural nets. Nvidia even acknowledged this on B300 by quoting FP4 ≈ 3× FP8 instead of the historical 2× ratio. Under the idealized p × q model, halving both widths would give 4×.
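To see where the idealized 4× comes from, here is a rough sketch that fixes a gate budget and asks how many MAC units fit at each width. It treats FP8 and FP4 as plain 8-bit and 4-bit multipliers and weights an AND gate and a full adder equally; both are simplifications, and the names are illustrative.

```python
def mac_area(p, q):
    """Idealized area in gate units: p*q AND gates plus p*q full adders."""
    return 2 * p * q


def macs_per_budget(budget, bits):
    """How many bits x bits MAC units fit in a fixed gate budget."""
    return budget // mac_area(bits, bits)


budget = 1000 * mac_area(8, 8)     # enough area for 1000 8-bit MACs
print(macs_per_budget(budget, 8))  # 1000
print(macs_per_budget(budget, 4))  # 4000
```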

02

The hidden tax: muxes and data movement

communication
old-school CUDA core / CPU

Where does the MAC live?

You drop the multiply-accumulate unit next to a register file. The MAC reads three registers — two operands and the accumulator — does its thing, writes back.

But which registers? The MAC doesn't always read the same three slots. So you need a mux in front of each input to select.

what is a mux

A mux is a hardware switch

To pick "register #3" out of 8, hardware does the dumb thing: AND every entry with a one-hot mask, then OR everything together.

// n-input, p-bit mux
ANDs = n × p
ORs = (n − 1) × p

Selecting a register is not free. It looks like nothing in software but it's a real chunk of silicon.
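A sketch of the AND-then-OR selection, plus the gate count from the formula above. The register values are the ones shown in FIG-02; the function names are made up, and the decoder that produces the one-hot mask is not counted.

```python
def mux(entries, select, width):
    """Pick entries[select] using only AND and OR, the way a one-hot mux does."""
    all_ones = (1 << width) - 1
    out = 0
    for index, entry in enumerate(entries):
        mask = all_ones if index == select else 0  # one-hot select, widened to p bits
        out |= entry & mask                        # p ANDs per entry, then OR-reduce
    return out


def mux_gate_count(n, p):
    """Gate count for an n-input, p-bit mux."""
    return {"ands": n * p, "ors": (n - 1) * p}


regs = [0b0110, 0b1010, 0b1101, 0b0001, 0b1011, 0b0100, 0b1111, 0b0010]
print(format(mux(regs, 3, 4), "04b"))  # 0001
print(mux_gate_count(8, 4))            # {'ands': 32, 'ors': 28}
```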

// FIG-02 — Where the gates actually go in a CUDA-style core
REGISTER FILE (8 entries × p bits): R0 0110 · R1 1010 · R2 1101 · R3 0001 · R4 1011 · R5 0100 · R6 1111 · R7 0010
register file → MUX A, MUX B, MUX C → MAC (multiply-add, p × q gates) → writeback
⚠ AREA BUDGET: 3 muxes × 8 inputs × p bits = 24p ANDs vs. MAC = ~4p gates → 7/8 of cost is just MOVING DATA

Almost all the area in a classic CUDA core is just moving bytes — not doing arithmetic. ~7/8 of the gates feed the muxes that read and write the register file. This is the problem statement that motivated Tensor Cores and, before them, systolic arrays.

03

Systolic arrays: tilting the ratio

tensor cores
the move

Bake two loops into hardware

A single MAC bakes one level of the triple loop into silicon. A systolic array bakes two: an entire matrix-vector multiply becomes one fixed-function block.

The unit goes from scalar op to tile of ops. Larger granularity means the same register-file tax is amortized over way more arithmetic.

scaling property

Quadratic compute, linear comm

An x × y systolic array does x × y multiply-accumulates per cycle. But the data flowing in and out only scales as x (or x + y).

compute ∝ x · y (quadratic)
i/o wires ∝ x (linear)
ratio ∝ y → improves as the array grows

The bigger the array, the better the ratio. Older TPUs ran 128 × 128.
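A quick way to see the trend is to compute MACs per cycle divided by I/O wires. This sketch uses x + y for the I/O width, the more conservative of the two options above. The function name is illustrative.

```python
def compute_to_io_ratio(x, y):
    """MACs per cycle divided by I/O width for an x-by-y systolic array."""
    macs_per_cycle = x * y
    io_wires = x + y
    return macs_per_cycle / io_wires


for size in (2, 32, 128, 256):
    print(size, compute_to_io_ratio(size, size))
```

For a square array the ratio is size / 2, so a 128 × 128 array gets 64 MACs per I/O wire where a 2 × 2 array gets 1.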

// FIG-03 — 2×2 systolic array: weights stay, activations flow
activations stream in → 7, 3
cells: w=0 MAC · w=1 MAC · w=3 MAC · w=2 MAC
↓ output vector (column dot-products): 21, 13
// WEIGHTS stay put. Loaded once, reused thousands of times. → huge compute reuse
// ACTIVATIONS flow top → bottom. Only x wires of input bandwidth. → linear i/o cost
// PARTIAL SUMS accumulate down columns → column dot-products fall out at the bottom edge.
// LOADING WEIGHTS trickled in row by row as a daisy chain: slow but cheap, since it happens rarely.
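A functional (not cycle-accurate) sketch of the weight-stationary idea: weights are parked in the grid, each activation is shared along its row, and partial sums move down the columns. The weight placement below is an assumption: it is one assignment of the figure's four weights that reproduces its outputs 21 and 13.

```python
def systolic_matvec(weights, activations):
    """weights[r][c] is parked in the cell at row r, column c."""
    rows, cols = len(weights), len(weights[0])
    partial = [0] * cols                  # partial sums entering the top row
    for r in range(rows):                 # sums move down one row at a time
        a = activations[r]                # activation shared by every cell in row r
        partial = [partial[c] + weights[r][c] * a for c in range(cols)]
    return partial                        # column dot-products at the bottom edge


weights = [[3, 1],
           [0, 2]]
print(systolic_matvec(weights, [7, 3]))  # [21, 13]
```

The weights never move inside the loop; only the activations and partial sums do. That is the reuse the figure is pointing at.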
sizing decision

How big should the array be?

A huge systolic array means more amortization. But it also means less flexibility for the register file and other ops.

Reiner's framing: set a budget — e.g. 10% of die area on data movement, 90% on the array — and size everything from there.

Bigger register files = more application performance but less array.

MatX hint

Splittable systolic arrays

Reiner mentions MatX has a "splittable systolic array" — big arrays that can also operate as several small ones.

It's the obvious compromise between TPU's coarse granularity and GPU's many-small-cores layout. We'll come back to this in §09.

04

Clock cycles & pipeline registers

timing
why a clock?

100 billion transistors, in lockstep

Chips are massively parallel. To avoid software-style synchronization (mutexes, locks, all far too slow), the whole chip synchronizes at the same instant, roughly once per nanosecond.

That instant is the clock tick, and the interval between ticks is the clock cycle. It is mediated by registers: tiny storage devices that latch whatever value is on their input wire at the tick.

the constraint

Logic must finish before the tick

If your "cloud of logic" between two registers takes longer than the clock period, you lose. The signal hasn't settled.

So a major job in chip design is making the longest path through any cloud of logic as short as possible.

Designers leave ~25% timing margin so the chip basically never misses.

// FIG-04 — Pipeline register insertion: trade area for clock speed
BEFORE: long logic, 1 GHz max
  R → logic cloud (delay = T) → R
  f_max ≈ 1/T
AFTER: split with pipeline register, 2 GHz max
  R → half-logic → R (inserted register) → half-logic → R
  f_max ≈ 2/T (twice the speed, +1 register area)
THE HARD CASE: feedback loop
  A running sum reads its own value and adds: R → + → R, with the output fed back into the adder.
  You can't just insert a pipeline reg; it would split the sum into "evens" and "odds".
  → feedback loops set the chip's max clock.
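Both halves of the figure can be sketched in a few lines. `f_max_ghz` is the idealized stages / T relation and ignores register overhead. `accumulate` is a toy model (names and the `adder_latency` parameter are assumptions) in which each add sees the running sum from `adder_latency` cycles earlier.

```python
def f_max_ghz(logic_delay_ns, stages):
    """Idealized max clock when a logic cloud of total delay T is cut into equal stages."""
    return stages / logic_delay_ns


def accumulate(xs, adder_latency=1):
    """Running sum where each add sees the value produced adder_latency cycles ago."""
    outputs = []
    for t, x in enumerate(xs):
        prev = outputs[t - adder_latency] if t >= adder_latency else 0
        outputs.append(prev + x)
    return outputs


print(f_max_ghz(1.0, 1), f_max_ghz(1.0, 2))            # 1.0 2.0
print(accumulate([1, 2, 3, 4, 5, 6], adder_latency=1))  # [1, 3, 6, 10, 15, 21]
print(accumulate([1, 2, 3, 4, 5, 6], adder_latency=2))  # [1, 2, 4, 6, 9, 12]
```

With a latency of 2 the last two values are 9 = 1 + 3 + 5 and 12 = 2 + 4 + 6: two interleaved sums instead of one, which is the "evens" and "odds" split.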

Latency vs throughput is a real knob. You can push clock speed arbitrarily high by stuffing pipeline registers everywhere — but past a point, almost all your area is registers, not logic. Same energy lesson as last episode's batch-size talk: high clock / low batch favors latency. Lower clock / wider arrays favor throughput.

05

FPGA vs ASIC

reconfigurability

FPGA

  • First unit: ~$10K.
  • Reconfigurable in the field — change the design any time.
  • Built from LUTs + registers + a giant mesh of muxes.
  • ~10× more expensive in area and energy than ASIC.
  • Great when you change the workload often (e.g. HFT, prototyping).
vs

ASIC

  • First unit: ~$30M (a full tape-out).
  • Frozen at fabrication. No changing the logic.
  • Custom polysilicon and wires — minimum gates for the job.
  • ~10× cheaper & more efficient than the equivalent FPGA.
  • Worth it once volume + stability justify the NRE cost.
primitive

The LUT: a 4→1 truth table in silicon

A typical FPGA "lookup table" has 4 input bits, 1 output bit. Inside it is a 16-entry table stored in configuration memory.

By writing different 16-bit patterns into that memory, the LUT becomes AND, OR, XOR, NAND, a 3-way majority, a 4-way parity — anything.

That's where the "field-programmable" comes from: muxes route signals between LUTs, LUTs configure into any gate. It's muxes all the way down.
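A sketch of a 4-input LUT as a 16-bit configuration word. The bit ordering (input `a` as the most significant index bit) and the helper names are assumptions; real FPGA families define their own conventions.

```python
def make_lut4(fn):
    """Build the 16-bit configuration word for a 4-input boolean function."""
    config = 0
    for index in range(16):
        a, b, c, d = (index >> 3) & 1, (index >> 2) & 1, (index >> 1) & 1, index & 1
        if fn(a, b, c, d):
            config |= 1 << index
    return config


def lut4(config, a, b, c, d):
    """Evaluate the LUT: the four inputs select one bit of the stored table."""
    index = (a << 3) | (b << 2) | (c << 1) | d
    return (config >> index) & 1


AND4 = make_lut4(lambda a, b, c, d: a & b & c & d)
PARITY4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
print(hex(AND4), hex(PARITY4))  # 0x8000 0x6996
print(lut4(AND4, 1, 1, 1, 1), lut4(PARITY4, 1, 0, 1, 1))  # 1 1
```

The same `lut4` evaluator serves every function; only the 16-bit word changes.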

why the 10× tax

Programmability has a cost

An ASIC implements a 4-way AND with literally 3 AND gates.

An FPGA implements the same thing with one LUT — which internally is ~32 gates of muxes selecting from a 16-entry table.

That's the ~10× tax. Plus the routing muxes between LUTs cost area and add wire delay.

// FIG-05 — 4-input LUT: a programmable truth table
inputs a, b, c, d: each driven by an 8→1 mux that selects from nearby LUTs / registers
16-ENTRY TRUTH TABLE (entries shown in the figure):
  0000→0   0100→1   1000→0
  0001→1   0101→0   1001→1
  0010→1   0110→1   1010→1
  0011→0   0111→0   1011→0
  (remaining entries not shown)
programmable 16-bit memory defines which gate this LUT is
cost ≈ 32 gates per LUT (vs 1 gate in ASIC) → OUT
FIELD-PROGRAMMABLE: "field" = deployed in the wild, not at fab time
CONFIG: 16 bits per LUT + mux selector bits
06

Cache vs scratchpad: who decides what's hot?

memory model

CPU / Cache

  • One "read memory" instruction. Hardware decides if data is in cache.
  • Cache is ~100× faster than DDR — programs need it to run at reasonable speed.
  • But hit/miss depends on ambient environment: other programs, recent accesses, replacement policy.
  • Non-deterministic latency.
vs

TPU / Scratchpad

  • Two distinct instructions: "read scratchpad" and "read HBM".
  • Software is responsible for placing data in the right tier.
  • Same idea, totally different control surface.
  • Deterministic latency — by construction.
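A toy contrast between the two control surfaces. Everything here is illustrative: the class names, the LRU policy, and the latency constants are assumptions, and only the rough 100× gap comes from the text.

```python
from collections import OrderedDict

HIT_NS, MISS_NS = 1, 100  # made-up latencies; only the ~100x gap is from the text


class ToyCache:
    """One read() call; hardware state decides whether it is fast or slow."""

    def __init__(self, memory, capacity):
        self.memory, self.capacity = memory, capacity
        self.lines = OrderedDict()

    def read(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)       # mark as most recently used
            return self.lines[addr], HIT_NS
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)     # evict the least recently used line
        self.lines[addr] = self.memory[addr]
        return self.lines[addr], MISS_NS


class ToyScratchpad:
    """Two explicit operations; software decides what lives in the fast tier."""

    def __init__(self, memory):
        self.memory, self.pad = memory, {}

    def copy_from_hbm(self, addr):
        self.pad[addr] = self.memory[addr]
        return MISS_NS

    def read_scratchpad(self, addr):
        return self.pad[addr], HIT_NS


memory = {addr: addr * 10 for addr in range(8)}
cache = ToyCache(memory, capacity=2)
print([cache.read(a)[1] for a in (0, 0, 1, 2, 0)])  # [100, 1, 100, 100, 100]
```

The same `read(0)` costs 100, then 1, then 100 again depending on what ran in between. The scratchpad read costs the same every time, because the slow copy is a separate, explicit step.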
key insight

FPGAs win on latency because cache is gone

Reiner notes you could build a CPU with deterministic latency — and some chips do (Groq, TPU compute cores). It's actually a simpler starting point. CPUs are non-deterministic because someone added caches and branch prediction. You can take them out — you just give up performance per cycle to do it.

This is why HFT shops like Jane Street reach for FPGAs: predictable per-packet latency matters more than peak throughput.

07

Why CPU cores are bigger than GPU cores

architecture
CPU

~100 cores × big & complicated

A modern CPU has ~100 cores doing ~16-way SIMD = ~1,000-way parallelism. But each core is huge.

Where does the die go? Mostly:

  • Cache hierarchy
  • Register files
  • Branch predictor (the GPU-killer)
  • ALUs (small fraction)
GPU

Many small SMs, no branch predictor

A GPU rips out a lot of CPU baggage, most importantly the branch predictor, and shrinks the register files. Result: way more cores, and less area spent per ALU.

The trade is that GPUs are bad at branchy serial code. They're great at SIMD throughput where you don't need to guess where the next instruction lives.

what does a branch predictor do?

It predicts the future ~5 cycles ahead

A single instruction takes ~5 cycles (a few nanoseconds) to process: read, decode, evaluate, write back. To run at 1–2 GHz, the CPU must keep pipelining new instructions while old ones finish. But if an old instruction is a branch (an if), the CPU doesn't yet know which way to go.

The branch predictor guesses — based on history, target tables, and pattern detection — and the pipeline runs ahead speculatively. On a misprediction, the speculative work is thrown away. That's why a tight branchy loop on a CPU benefits enormously from a sophisticated predictor.

GPUs don't have one because they don't need one. They rely on having so many threads in flight that they can just switch to ready work while a branch resolves.
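To make "guesses based on history" concrete, here is the textbook 2-bit saturating counter, a far simpler scheme than anything in a modern CPU and used here only as a stand-in. The function name and the starting state are assumptions.

```python
def count_mispredictions(outcomes, state=2):
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Nudge the counter toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


loop_branch = ([True] * 9 + [False]) * 10  # a 10-iteration loop, run 10 times
alternating = [True, False] * 50
print(count_mispredictions(loop_branch))   # 10
print(count_mispredictions(alternating))   # 50
```

The loop branch is mispredicted only at each loop exit (10 of 100). The alternating pattern defeats this simple scheme half the time, which is the kind of case the pattern detection mentioned above is for.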

08

Brains vs chips

analogies & limits
structural differences

Where the analogy holds

  • Memory ↔ compute co-location: brains do it natively, but systolic arrays do too — the weight sits where the math happens.
  • Sparsity: brains are unstructured sparse. Chips can do structured sparsity but pay a tax for unstructured.
  • Clock speed: brain runs at maybe kilohertz. Chips run at gigahertz.
energy

Why slow doesn't mean efficient

Most of a chip's energy is in switching bits 0↔1 — charging and discharging tiny capacitors. Static idle power is much smaller.

So if you ran a GPU at 1 MHz instead of 1 GHz, you'd draw ~1,000× less power. But you'd also do ~1,000× less work per second. The energy per operation barely changes.
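The standard switching-power relation P ≈ C · V² · f makes the point numerically. The capacitance and voltage below are placeholder values chosen only for illustration, and the sketch holds voltage fixed while frequency changes.

```python
def dynamic_power_watts(c_switched_farads, v_volts, f_hz):
    """Switching power: capacitance charged and discharged f times per second."""
    return c_switched_farads * v_volts ** 2 * f_hz


def energy_per_cycle_joules(c_switched_farads, v_volts, f_hz):
    """Energy per clock cycle is power divided by cycles per second."""
    return dynamic_power_watts(c_switched_farads, v_volts, f_hz) / f_hz


C, V = 1e-9, 0.8  # placeholder values, not measurements of any real chip
for f in (1e9, 1e6):
    print(f, dynamic_power_watts(C, V, f), energy_per_cycle_joules(C, V, f))
```

Power drops by 1,000× along with frequency, but energy per cycle stays at C · V² in both cases.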

The brain isn't more efficient simply by being slow. Something else is going on — likely a combination of co-location, sparsity, and analog computation.

09

A GPU is just a bunch of tiny TPUs

tile sizing
// FIG-06 — Floor plan comparison: GPU's many small tiles vs TPU's few big tiles
GPU: many small SMs around an L2
  L2 cache, surrounded by SM · SM · SM · SM · SM · SM
  each SM ≈ small TPU: tensor core + vector unit
TPU: few big matrix units (MXUs) with one vector unit
  MXU (big systolic array) · vector unit · MXU (big systolic array)
  amortizes the register-file tax across a much larger tile
GPU

Lots of small tiles → flexibility

  • Many SMs, each with their own tensor core, vector ALUs, register file.
  • Lots of perimeter between matrix and vector units → tons of cross-bandwidth.
  • Great when there isn't one giant matmul — when work is uneven or branchy.
  • Pays the register-file tax over many small tiles.
TPU

Few big tiles → amortization

  • One or two huge MXUs + one vector unit.
  • All data movement between matrix and vector squeezes through narrow perimeter — lower bandwidth there.
  • But each register-file dollar is spread over a much bigger tile.
  • Wins when the workload is one big matmul. Suffers when work doesn't fit the shape.

MatX's pitch (hinted at): a splittable systolic array that behaves like a big TPU tile when the matmul is big, and like a stack of small GPU-style tiles when it isn't. Best of both granularities, ideally without the worst of either.

10

The mental model in 7 lines

summary
×4: FP4 vs FP8 (theoretical)
7/8: cost in data movement
128²: classic TPU MXU size
10×: FPGA tax vs ASIC
remember this

The compute / comm ratio is everything

At every level of the stack — from the bit-width of a multiplier to the floor plan of a datacenter — the optimization is more arithmetic per byte moved. That's what motivates low precision, systolic arrays, scratchpads, and the GPU-vs-TPU layout debate.

7 lines

The whole conversation, compressed

  • The atomic op of AI hardware is the multiply-accumulate.
  • Multiplier area scales as p × q, so low precision wins quadratically.
  • In a vanilla core, most of the area moves data, not arithmetic.
  • Systolic arrays fix this by baking a 2D loop into hardware and parking weights in place.
  • Clock speed is set by the longest path — pipeline registers buy speed at an area cost.
  • FPGAs trade ~10× efficiency for in-field reconfigurability via LUTs and routing muxes.
  • GPUs are many tiny TPUs; TPUs are few big GPU-tiles. MatX wants splittable.
further

Open questions the transcript points at

  • How should a chip split its resources between FP4 and FP8? Equal die area? Equal power budget? Customer-driven?
  • How big should one systolic array be before perimeter bandwidth kills you?
  • Can splittable systolic arrays really get TPU's amortization and GPU's intra-chip bandwidth?
  • What's the analog-computation / co-location story that lets brains run at kilohertz and still beat silicon on perception?