<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>INSIDE THE CHIP // notes from Reiner Pope on how AI silicon actually works</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@300;400;500;700;800&family=Major+Mono+Display&family=Space+Grotesk:wght@400;500;700&display=swap" rel="stylesheet">
<style>
/* ===========================================================
INSIDE THE CHIP — single file, schematic aesthetic
palette: phosphor + magenta + amber on near-black
=========================================================== */
:root {
--bg-0: #07080c;
--bg-1: #0c0e15;
--bg-2: #11141d;
--bg-3: #161a25;
--line: #1f2332;
--line-2: #2b3145;
--txt: #cbd1de;
--txt-dim: #7a8194;
--txt-mute: #4d5468;
--phos: #5cf2a4; /* phosphor green */
--cyan: #38d9ff;
--mag: #ff3d8a;
--amb: #ffb547;
--red: #ff5a5a;
--vio: #b48cff;
--shadow-hard: 4px 4px 0 0 #000;
--shadow-neon: 0 0 0 1px var(--line-2), 0 0 30px -10px var(--phos), 6px 6px 0 0 #000;
}
* { box-sizing: border-box; margin: 0; padding: 0; }
html, body {
background: var(--bg-0);
color: var(--txt);
font-family: 'JetBrains Mono', ui-monospace, monospace;
font-size: 14px;
line-height: 1.6;
-webkit-font-smoothing: antialiased;
overflow-x: hidden;
}
/* ----------- circuit board background ------------ */
body::before {
content: '';
position: fixed;
inset: 0;
pointer-events: none;
background:
radial-gradient(circle at 20% 10%, rgba(92,242,164,0.06), transparent 40%),
radial-gradient(circle at 80% 80%, rgba(255,61,138,0.05), transparent 45%),
linear-gradient(var(--line) 1px, transparent 1px) 0 0/40px 40px,
linear-gradient(90deg, var(--line) 1px, transparent 1px) 0 0/40px 40px;
background-color: var(--bg-0);
opacity: 0.6;
z-index: 0;
}
body::after {
/* scanlines */
content: '';
position: fixed;
inset: 0;
pointer-events: none;
background: repeating-linear-gradient(
0deg,
rgba(0,0,0,0.0) 0px,
rgba(0,0,0,0.0) 2px,
rgba(0,0,0,0.15) 3px,
rgba(0,0,0,0.0) 4px
);
z-index: 1;
mix-blend-mode: multiply;
}
main { position: relative; z-index: 2; }
/* ================= TOP BAR ================= */
.topbar {
display: flex;
align-items: center;
justify-content: space-between;
padding: 14px 28px;
border-bottom: 1px solid var(--line-2);
background: rgba(7,8,12,0.85);
backdrop-filter: blur(8px);
position: sticky;
top: 0;
z-index: 100;
font-size: 11px;
letter-spacing: 0.18em;
text-transform: uppercase;
}
.topbar .left { display: flex; gap: 24px; align-items: center; }
.topbar .dot {
width: 8px; height: 8px;
background: var(--phos);
border-radius: 50%;
box-shadow: 0 0 12px var(--phos);
animation: blink 1.4s infinite;
}
@keyframes blink { 50% { opacity: 0.3; } }
.topbar .right { color: var(--txt-mute); display: flex; gap: 18px; }
.topbar .right span { color: var(--phos); }
/* ================= HERO ================= */
.hero {
padding: 80px 28px 40px;
position: relative;
max-width: 1400px;
margin: 0 auto;
}
.hero .tag {
display: inline-block;
border: 1px solid var(--phos);
color: var(--phos);
padding: 4px 10px;
font-size: 10px;
letter-spacing: 0.25em;
margin-bottom: 26px;
box-shadow: 3px 3px 0 0 #000;
}
.hero h1 {
font-family: 'Major Mono Display', monospace;
font-size: clamp(48px, 8vw, 120px);
line-height: 0.92;
letter-spacing: -0.02em;
color: var(--txt);
margin-bottom: 22px;
text-shadow: 0 0 40px rgba(92,242,164,0.15);
}
.hero h1 .sub {
display: block;
font-family: 'Major Mono Display', monospace;
color: var(--phos);
font-size: 0.45em;
margin-top: 8px;
text-shadow: 0 0 20px rgba(92,242,164,0.5);
}
.hero .lede {
max-width: 720px;
font-size: 16px;
color: var(--txt-dim);
margin-top: 28px;
line-height: 1.7;
}
.hero .lede strong { color: var(--txt); font-weight: 500; }
.hero .lede .h { color: var(--amb); }
.hero .meta {
margin-top: 36px;
display: grid;
grid-template-columns: repeat(auto-fit, minmax(160px, 1fr));
gap: 14px;
max-width: 900px;
}
.meta .item {
border: 1px solid var(--line-2);
background: var(--bg-1);
padding: 14px 16px;
box-shadow: 4px 4px 0 0 #000;
}
.meta .item .k { font-size: 10px; color: var(--txt-mute); letter-spacing: 0.2em; text-transform: uppercase; }
.meta .item .v { font-size: 15px; color: var(--phos); margin-top: 6px; font-weight: 500; }
/* ================= SECTION FRAME ================= */
section.chunk {
padding: 60px 28px;
max-width: 1400px;
margin: 0 auto;
position: relative;
}
.chunk-head {
display: grid;
grid-template-columns: auto 1fr auto;
align-items: center;
gap: 18px;
margin-bottom: 36px;
padding-bottom: 18px;
border-bottom: 1px dashed var(--line-2);
}
.chunk-head .num {
font-family: 'Major Mono Display', monospace;
font-size: 42px;
color: var(--mag);
text-shadow: 0 0 18px rgba(255,61,138,0.4);
line-height: 1;
}
.chunk-head h2 {
font-family: 'Space Grotesk', sans-serif;
font-size: clamp(24px, 3vw, 36px);
font-weight: 700;
letter-spacing: -0.01em;
color: var(--txt);
}
.chunk-head .pill {
font-size: 10px;
letter-spacing: 0.22em;
text-transform: uppercase;
color: var(--txt-mute);
border: 1px solid var(--line-2);
padding: 4px 10px;
background: var(--bg-1);
}
/* ================= CARDS ================= */
.grid {
display: grid;
gap: 22px;
}
.grid.cols-2 { grid-template-columns: repeat(auto-fit, minmax(360px, 1fr)); }
.grid.cols-3 { grid-template-columns: repeat(auto-fit, minmax(280px, 1fr)); }
.card {
background: linear-gradient(180deg, var(--bg-2) 0%, var(--bg-1) 100%);
border: 1px solid var(--line-2);
padding: 24px;
position: relative;
box-shadow: 6px 6px 0 0 #000, 0 0 0 1px rgba(255,255,255,0.02) inset;
transition: transform 0.2s ease, box-shadow 0.2s ease, border-color 0.2s ease;
}
.card:hover {
transform: translate(-2px, -2px);
box-shadow: 8px 8px 0 0 #000, 0 0 0 1px var(--phos) inset;
border-color: var(--phos);
}
.card .badge {
position: absolute;
top: -10px;
left: 18px;
background: var(--bg-0);
border: 1px solid var(--line-2);
padding: 2px 10px;
font-size: 10px;
letter-spacing: 0.2em;
color: var(--txt-mute);
text-transform: uppercase;
}
.card h3 {
font-family: 'Space Grotesk', sans-serif;
font-size: 20px;
font-weight: 700;
margin-bottom: 14px;
color: var(--txt);
letter-spacing: -0.01em;
}
.card h3 .glyph { color: var(--phos); margin-right: 8px; }
.card p { color: var(--txt-dim); margin-bottom: 10px; }
.card p:last-child { margin-bottom: 0; }
.card strong { color: var(--txt); font-weight: 500; }
.card .hi { color: var(--phos); }
.card .hi-m { color: var(--mag); }
.card .hi-a { color: var(--amb); }
.card .hi-c { color: var(--cyan); }
/* card flavors */
.card.accent-phos { border-color: rgba(92,242,164,0.35); }
.card.accent-phos .badge { color: var(--phos); border-color: var(--phos); }
.card.accent-mag { border-color: rgba(255,61,138,0.35); }
.card.accent-mag .badge { color: var(--mag); border-color: var(--mag); }
.card.accent-amb { border-color: rgba(255,181,71,0.35); }
.card.accent-amb .badge { color: var(--amb); border-color: var(--amb); }
.card.accent-cyan { border-color: rgba(56,217,255,0.35); }
.card.accent-cyan .badge { color: var(--cyan); border-color: var(--cyan); }
.card ul { list-style: none; padding: 0; margin: 8px 0; }
.card ul li {
padding: 6px 0 6px 22px;
position: relative;
color: var(--txt-dim);
border-bottom: 1px dotted var(--line);
}
.card ul li:last-child { border-bottom: none; }
.card ul li::before {
content: '▸';
position: absolute;
left: 0;
color: var(--phos);
font-size: 10px;
top: 9px;
}
/* ================= INLINE CODE/FORMULA ================= */
.formula {
background: var(--bg-0);
border: 1px solid var(--line-2);
border-left: 3px solid var(--phos);
padding: 14px 18px;
font-family: 'JetBrains Mono', monospace;
font-size: 13px;
color: var(--phos);
margin: 14px 0;
overflow-x: auto;
box-shadow: 3px 3px 0 0 #000;
}
.formula .c { color: var(--txt-mute); }
.formula .v { color: var(--amb); }
.formula .o { color: var(--mag); }
.formula .n { color: var(--cyan); }
code.inline {
background: var(--bg-0);
border: 1px solid var(--line);
padding: 1px 6px;
color: var(--amb);
font-size: 12px;
}
/* ================= QUOTE/INSIGHT ================= */
.insight {
border: 1px solid var(--mag);
background: linear-gradient(135deg, rgba(255,61,138,0.06), transparent);
padding: 20px 24px;
margin: 20px 0;
position: relative;
box-shadow: 5px 5px 0 0 #000;
}
.insight::before {
content: '!! INSIGHT';
position: absolute;
top: -10px;
left: 16px;
background: var(--bg-0);
color: var(--mag);
padding: 2px 10px;
font-size: 10px;
letter-spacing: 0.25em;
border: 1px solid var(--mag);
}
.insight p { color: var(--txt); font-size: 15px; line-height: 1.7; }
.insight p strong { color: var(--mag); }
/* ================= SVG DIAGRAM WRAPPER ================= */
.diagram {
background: var(--bg-0);
border: 1px solid var(--line-2);
padding: 22px;
margin: 22px 0;
box-shadow: 6px 6px 0 0 #000;
overflow-x: auto;
}
.diagram .label {
font-size: 10px;
letter-spacing: 0.25em;
color: var(--txt-mute);
text-transform: uppercase;
margin-bottom: 12px;
}
.diagram svg { display: block; margin: 0 auto; max-width: 100%; height: auto; }
/* ================= TABLE ================= */
.tbl {
width: 100%;
border-collapse: separate;
border-spacing: 0;
font-size: 13px;
margin: 16px 0;
}
.tbl th, .tbl td {
padding: 12px 14px;
text-align: left;
border-bottom: 1px solid var(--line);
}
.tbl th {
background: var(--bg-0);
color: var(--txt-mute);
font-size: 10px;
letter-spacing: 0.2em;
text-transform: uppercase;
font-weight: 500;
border-bottom: 1px solid var(--line-2);
}
.tbl td .pos { color: var(--phos); }
.tbl td .neg { color: var(--mag); }
.tbl td .neu { color: var(--amb); }
/* ================= TOC ================= */
.toc {
position: fixed;
top: 90px;
right: 20px;
width: 220px;
background: var(--bg-1);
border: 1px solid var(--line-2);
box-shadow: 6px 6px 0 0 #000;
padding: 16px;
font-size: 11px;
z-index: 50;
display: none;
}
.toc h4 {
font-size: 10px;
letter-spacing: 0.25em;
color: var(--txt-mute);
text-transform: uppercase;
margin-bottom: 12px;
border-bottom: 1px dashed var(--line-2);
padding-bottom: 10px;
}
.toc a {
display: block;
padding: 6px 0;
color: var(--txt-dim);
text-decoration: none;
border-left: 2px solid transparent;
padding-left: 8px;
transition: all 0.15s;
}
.toc a:hover { color: var(--phos); border-left-color: var(--phos); }
.toc a .num { color: var(--mag); margin-right: 8px; font-size: 10px; }
@media (min-width: 1280px) { .toc { display: block; } }
/* ================= FOOTER ================= */
footer {
border-top: 1px solid var(--line-2);
padding: 40px 28px;
margin-top: 60px;
text-align: center;
color: var(--txt-mute);
font-size: 11px;
letter-spacing: 0.15em;
}
footer .ascii {
color: var(--phos);
font-size: 10px;
line-height: 1.3;
margin: 20px 0;
white-space: pre;
font-family: 'JetBrains Mono', monospace;
opacity: 0.5;
}
/* ============ SVG common text style ============ */
svg text { font-family: 'JetBrains Mono', monospace; font-size: 11px; }
/* ============ tag chips inside cards ============ */
.chips { display: flex; flex-wrap: wrap; gap: 6px; margin-top: 12px; }
.chips span {
font-size: 10px;
padding: 3px 8px;
border: 1px solid var(--line-2);
color: var(--txt-mute);
letter-spacing: 0.1em;
background: var(--bg-0);
}
/* ============ side-by-side comparison ============ */
.vs {
display: grid;
grid-template-columns: 1fr auto 1fr;
gap: 18px;
align-items: stretch;
margin: 22px 0;
}
.vs .side {
border: 1px solid var(--line-2);
background: var(--bg-1);
padding: 20px;
box-shadow: 5px 5px 0 0 #000;
}
.vs .side h4 {
font-family: 'Space Grotesk', sans-serif;
font-size: 18px;
margin-bottom: 12px;
letter-spacing: -0.01em;
}
.vs .side.left { border-left: 3px solid var(--cyan); }
.vs .side.left h4 { color: var(--cyan); }
.vs .side.right { border-left: 3px solid var(--mag); }
.vs .side.right h4 { color: var(--mag); }
.vs .side ul li::before { color: currentColor; }
.vs .side.left ul li::before { color: var(--cyan); }
.vs .side.right ul li::before { color: var(--mag); }
.vs .divider {
align-self: center;
font-family: 'Major Mono Display', monospace;
font-size: 28px;
color: var(--amb);
text-shadow: 0 0 20px rgba(255,181,71,0.5);
}
@media (max-width: 700px) {
.vs { grid-template-columns: 1fr; }
.vs .divider { text-align: center; }
}
/* ============ "you said / he said" Q&A block ============ */
.qa {
border: 1px solid var(--line-2);
background: var(--bg-1);
padding: 18px 22px;
margin: 14px 0;
box-shadow: 4px 4px 0 0 #000;
}
.qa .q { color: var(--cyan); margin-bottom: 10px; font-size: 13px; }
.qa .q::before { content: '>> '; color: var(--cyan); }
.qa .a { color: var(--txt-dim); padding-left: 20px; border-left: 2px solid var(--phos); }
.qa .a strong { color: var(--phos); }
/* ============ stacked banner stat ============ */
.stats {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(180px, 1fr));
gap: 14px;
margin: 22px 0;
}
.stat {
background: var(--bg-1);
border: 1px solid var(--line-2);
padding: 18px;
text-align: center;
box-shadow: 4px 4px 0 0 #000;
}
.stat .big {
font-family: 'Major Mono Display', monospace;
font-size: 32px;
color: var(--phos);
text-shadow: 0 0 20px rgba(92,242,164,0.4);
margin-bottom: 4px;
}
.stat .lbl { font-size: 10px; color: var(--txt-mute); letter-spacing: 0.18em; text-transform: uppercase; }
.stat.mag .big { color: var(--mag); text-shadow: 0 0 20px rgba(255,61,138,0.4); }
.stat.amb .big { color: var(--amb); text-shadow: 0 0 20px rgba(255,181,71,0.4); }
.stat.cyan .big { color: var(--cyan); text-shadow: 0 0 20px rgba(56,217,255,0.4); }
/* small details */
::selection { background: var(--phos); color: var(--bg-0); }
a { color: var(--cyan); text-decoration: none; border-bottom: 1px dotted var(--cyan); }
a:hover { color: var(--phos); border-bottom-color: var(--phos); }
/* small screens fix for TOC offset */
@media (max-width: 1279px) {
.chunk, .hero { padding-left: 20px; padding-right: 20px; }
}
</style>
</head>
<body>
<!-- ============================ TOP BAR ============================ -->
<div class="topbar">
<div class="left">
<span class="dot"></span>
<span>SYS // CHIP-NOTES v1.0</span>
<span style="color:var(--txt-mute)">SRC: Dwarkesh × Reiner Pope (MatX)</span>
</div>
<div class="right">
<span>STATUS: <span>NOMINAL</span></span>
<span>NODE: 3nm</span>
</div>
</div>
<!-- ============================ TOC ============================ -->
<nav class="toc">
<h4>// SECTIONS</h4>
<a href="#s0"><span class="num">00</span>The Big Idea</a>
<a href="#s1"><span class="num">01</span>Logic Gates → MAC</a>
<a href="#s2"><span class="num">02</span>Mux + Data Movement</a>
<a href="#s3"><span class="num">03</span>Systolic Arrays</a>
<a href="#s4"><span class="num">04</span>Clock Cycles</a>
<a href="#s5"><span class="num">05</span>FPGA vs ASIC</a>
<a href="#s6"><span class="num">06</span>Cache vs Scratchpad</a>
<a href="#s7"><span class="num">07</span>CPU vs GPU Cores</a>
<a href="#s8"><span class="num">08</span>Brains vs Chips</a>
<a href="#s9"><span class="num">09</span>GPU = tiny TPUs</a>
<a href="#s10"><span class="num">10</span>Key Takeaways</a>
</nav>
<main>
<!-- ============================ HERO ============================ -->
<section class="hero">
<span class="tag">// TRANSCRIPT BREAKDOWN</span>
<h1>
inside<br>the chip
<span class="sub">how AI silicon actually works</span>
</h1>
<p class="lede">
A bottom-up walk through how an AI chip is built — starting from <strong>AND gates</strong> and
ending at the <span class="h">GPU-vs-TPU architectural split</span>. Notes from a conversation
between Dwarkesh Patel and <strong>Reiner Pope</strong>, CEO of <strong>MatX</strong>. Every
section here distills one big idea from the transcript and the diagrams that make it stick.
</p>
<div class="meta">
<div class="item"><div class="k">primitive</div><div class="v">multiply-accumulate</div></div>
<div class="item"><div class="k">building block</div><div class="v">full adder (3→2)</div></div>
<div class="item"><div class="k">unit cell</div><div class="v">systolic array</div></div>
<div class="item"><div class="k">enemy #1</div><div class="v">data movement cost</div></div>
<div class="item"><div class="k">scaling law</div><div class="v">compute ∝ p × q</div></div>
</div>
</section>
<!-- ============================ S0 — BIG IDEA ============================ -->
<section class="chunk" id="s0">
<div class="chunk-head">
<div class="num">00</div>
<h2>The one idea that runs the whole thing</h2>
<div class="pill">META</div>
</div>
<div class="insight">
<p>
Every level of chip design is the same fight: <strong>maximize compute relative to
communication</strong>. From the precision of a single multiplier, to the size of a systolic
array, to the layout of a whole datacenter — you are always trying to do more arithmetic per
byte you move. That's it. That's the whole show.
</p>
</div>
<div class="grid cols-3">
<div class="card accent-phos">
<span class="badge">level 1</span>
<h3><span class="glyph">◇</span>at the gate</h3>
<p>Multiplier area scales <span class="hi">quadratically</span> with bit-width, so halving precision can more than double throughput per unit of silicon. This is why FP4 can be so much faster than FP8.</p>
</div>
<div class="card accent-mag">
<span class="badge">level 2</span>
<h3><span class="glyph">◇</span>at the core</h3>
<p>A systolic array bakes a 2D loop of MACs into hardware so the weight matrix sits <span class="hi-m">in place</span> while activations flow through.</p>
</div>
<div class="card accent-amb">
<span class="badge">level 3</span>
<h3><span class="glyph">◇</span>at the chip</h3>
<p>A scratchpad replaces a cache so memory access is <span class="hi-a">deterministic</span> and software, not hardware, controls movement.</p>
</div>
</div>
</section>
<!-- ============================ S1 — LOGIC GATES TO MAC ============================ -->
<section class="chunk" id="s1">
<div class="chunk-head">
<div class="num">01</div>
<h2>Logic gates → multiply-accumulate</h2>
<div class="pill">primitives</div>
</div>
<div class="grid cols-2">
<div class="card">
<span class="badge">why MAC?</span>
<h3>The atomic op of AI</h3>
<p>Look inside any matrix multiply and you find a triple <code class="inline">for</code> loop:</p>
<div class="formula">
<span class="c">// matrix multiply, three nested loops</span>
<span class="o">for</span> i: <span class="o">for</span> j: <span class="o">for</span> k:
out[i,k] <span class="o">+=</span> A[i,j] <span class="o">*</span> B[j,k]
</div>
<p>Every step is one <strong class="hi">multiply-accumulate</strong>: a multiply, then an add into a running sum. So the whole chip can be optimized around that one operation.</p>
</div>
<div class="card accent-phos">
<span class="badge">precision asymmetry</span>
<h3>Multiply small, accumulate big</h3>
<p>Reiner's example: a <span class="hi">4-bit × 4-bit</span> multiply, accumulating into an <span class="hi">8-bit</span> running sum.</p>
<p>Why the asymmetry? Two reasons:</p>
<ul>
<li>The product of two N-bit numbers needs 2N bits to hold without loss.</li>
<li>You sum many of these — rounding errors pile up in the <strong>accumulator</strong>, not the multiplier.</li>
</ul>
<p>So <span class="hi-m">low-precision multiply + higher-precision add</span> is close to a free lunch: you keep the cheap multiplier and give up very little accuracy.</p>
</div>
</div>
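<div class="card" style="margin: 22px 0">
<span class="badge">sketch</span>
<h3>Counting MACs in Python</h3>
<p>A minimal sketch of the two cards above, not anything taken from the transcript: the triple loop written out so each inner step is visibly one multiply-accumulate, plus a check that a 4-bit × 4-bit product fits in 8 bits. The names <code class="inline">matmul_with_mac_count</code> and <code class="inline">product_bits</code> are made up for illustration, and operands are assumed to be unsigned integers.</p>
<pre class="formula">
```python
def matmul_with_mac_count(a, b):
    """Multiply a (I x J) by b (J x K), counting each multiply-accumulate."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    mac_count = 0
    for i in range(rows):
        for j in range(inner):
            for k in range(cols):
                out[i][k] += a[i][j] * b[j][k]  # one MAC: multiply, then add
                mac_count += 1
    return out, mac_count


def product_bits(p, q):
    """Bits needed to hold the largest unsigned p-bit times q-bit product."""
    largest = (2 ** p - 1) * (2 ** q - 1)
    return largest.bit_length()


result, macs = matmul_with_mac_count([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, macs)
print(product_bits(4, 4))
```
</pre>
<p>A 2 × 2 by 2 × 2 multiply takes 2 · 2 · 2 = 8 MACs, and the largest 4-bit product (15 × 15 = 225) needs exactly 8 bits.</p>
</div>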
<!-- DIAGRAM: long multiplication -->
<div class="diagram">
<div class="label">// FIG-01 — 4-bit × 4-bit long multiplication, accumulator on top</div>
<svg viewBox="0 0 720 360" xmlns="http://www.w3.org/2000/svg">
<defs>
<pattern id="grid1" width="20" height="20" patternUnits="userSpaceOnUse">
<path d="M 20 0 L 0 0 0 20" fill="none" stroke="#1f2332" stroke-width="0.5"/>
</pattern>
</defs>
<rect width="720" height="360" fill="url(#grid1)"/>
<!-- multiplier example -->
<g font-family="JetBrains Mono" font-size="16" fill="#cbd1de">
<!-- top number A = 1001 -->
<text x="240" y="50" fill="#7a8194" font-size="11">A (4-bit)</text>
<text x="320" y="50" font-size="20" fill="#5cf2a4">1 0 0 1</text>
<!-- B = 1010 -->
<text x="240" y="80" fill="#7a8194" font-size="11">B (4-bit)</text>
<text x="320" y="80" font-size="20" fill="#38d9ff">× 1 0 1 0</text>
<line x1="240" y1="92" x2="450" y2="92" stroke="#2b3145"/>
<!-- partial products (16 ANDs) -->
<text x="40" y="120" fill="#7a8194" font-size="11">16 AND gates → partial products</text>
<text x="320" y="120" fill="#ffb547">0 0 0 0</text>
<text x="305" y="142" fill="#ffb547">1 0 0 1 ·</text>
<text x="290" y="164" fill="#ffb547">0 0 0 0 · ·</text>
<text x="275" y="186" fill="#ffb547">1 0 0 1 · · ·</text>
<!-- 8-bit accumulator -->
<text x="40" y="218" fill="#7a8194" font-size="11">+ accumulator (8-bit)</text>
<text x="245" y="218" fill="#ff3d8a">0 1 1 0 1 0 1 1</text>
<line x1="240" y1="232" x2="470" y2="232" stroke="#2b3145"/>
<!-- 5-way sum -->
<text x="40" y="262" fill="#7a8194" font-size="11">5-way column sum → 16 full adders</text>
<text x="245" y="266" fill="#5cf2a4" font-size="20">1 1 0 0 0 1 0 1</text>
</g>
<!-- right column callouts -->
<g font-family="JetBrains Mono" font-size="11" fill="#7a8194">
<text x="510" y="55">p × q ANDs</text>
<text x="510" y="70" fill="#5cf2a4">= 16</text>
<text x="510" y="160">+ (p + q) accumulator bits</text>
<text x="510" y="175">= 24 input bits</text>
<text x="510" y="260">p × q full adders</text>
<text x="510" y="275" fill="#ff3d8a">= 16</text>
</g>
<!-- formula box -->
<g transform="translate(40, 295)">
<rect width="640" height="50" fill="#0c0e15" stroke="#5cf2a4" stroke-width="1"/>
<text x="20" y="22" fill="#7a8194" font-size="11">SCALING LAW</text>
<text x="20" y="40" fill="#5cf2a4" font-size="13">
p-bit × q-bit MAC → p×q AND gates + p×q full adders → area ≈ O(p·q)
</text>
</g>
</svg>
</div>
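<div class="card" style="margin: 22px 0">
<span class="badge">sketch</span>
<h3>The figure's arithmetic, step by step</h3>
<p>As a rough check of the long multiplication, the sketch below rebuilds the partial products one multiplier bit at a time and then adds the accumulator. It uses the figure's operands (A = 1001, B = 1010, accumulator 01101011). The function name is hypothetical, and plain integer arithmetic stands in for the AND gates and adders.</p>
<pre class="formula">
```python
def mac_by_partial_products(a, b, acc, bits=4):
    """Compute a * b + acc the long-multiplication way, one row per bit of b."""
    rows = []
    for i in range(bits):
        b_bit = (b >> i) % 2              # i-th bit of the multiplier
        rows.append(a * b_bit * 2 ** i)   # a gated by that bit, shifted left i places
    return rows, sum(rows) + acc


rows, total = mac_by_partial_products(0b1001, 0b1010, 0b01101011)
print([format(r, "08b") for r in rows])
print(format(total, "08b"))
```
</pre>
<p>For these operands the sum is 9 × 10 + 107 = 197, which is 11000101 in binary.</p>
</div>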
<!-- FULL ADDER -->
<div class="grid cols-2">
<div class="card accent-cyan">
<span class="badge">building block</span>
<h3>The full adder is a <span class="hi-c">3→2 compressor</span></h3>
<p>Coming from software you'd assume a "full adder" adds two 32-bit numbers. <strong>It doesn't.</strong></p>
<p>It takes <strong>three single bits</strong>, counts them, and writes the count in binary as <strong>two bits</strong>.</p>
<div class="formula">
<span class="c">// truth table sample</span>
in = 1 1 1 → out = 1 1 <span class="c">(count=3)</span>
in = 1 0 1 → out = 1 0 <span class="c">(count=2)</span>
in = 0 1 0 → out = 0 1 <span class="c">(count=1)</span>
in = 0 0 0 → out = 0 0 <span class="c">(count=0)</span>
</div>
<p>The right output bit is the column sum. The left bit is the <strong>carry</strong> into the next column.</p>
</div>
<div class="card accent-amb">
<span class="badge">algorithm</span>
<h3>Dadda multiplier</h3>
<p>To do the big column sum from the multiplication above, you tile <strong>full adders</strong> across the partial-product grid.</p>
<ul>
<li>Each adder eats <span class="hi-a">3 bits</span>, emits <span class="hi-a">2 bits</span> — net <strong>−1 bit</strong>.</li>
<li>You start with 24 input bits (16 partial-product bits + 8 accumulator bits) and end with 8 output bits.</li>
<li>So you needed exactly <span class="hi">24 − 8 = 16</span> full adders.</li>
<li>Generalizes to <span class="hi">p × q</span> full adders for a p-bit × q-bit MAC.</li>
</ul>
<p>It's a standard area-efficient multiplier construction.</p>
</div>
</div>
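<div class="card" style="margin: 22px 0">
<span class="badge">sketch</span>
<h3>Full adder and adder count</h3>
<p>A small sketch of the counting argument above, under the notes' simplified model: every reduction step is a full adder that removes exactly one bit, and the accumulator is assumed to be p + q bits wide. Real multipliers also use half adders and a final carry-propagate adder, which this ignores. Both function names are made up for illustration.</p>
<pre class="formula">
```python
def full_adder(a, b, c):
    """3-to-2 compressor: count three input bits, return (carry, sum)."""
    count = a + b + c
    return count // 2, count % 2


def full_adders_needed(p, q):
    """Adder count when every full adder removes exactly one bit."""
    input_bits = p * q + (p + q)   # partial-product bits plus the accumulator
    output_bits = p + q            # width of the final sum
    return input_bits - output_bits


print(full_adder(1, 0, 1))
print(full_adders_needed(4, 4))
```
</pre>
<p>In this model the count always simplifies to p × q, which is where the quadratic area law comes from.</p>
</div>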
<!-- Quadratic scaling insight -->
<div class="insight">
<p>
<strong>The quadratic scaling insight.</strong> Halving precision doesn't just double throughput
— it more than doubles it, because area scales as <code class="inline">p × q</code>. This is
the single biggest reason low-precision arithmetic has worked so well for neural nets.
Nvidia even acknowledged this on B300 by quoting <span class="hi">FP4 ≈ 3× FP8</span> instead
of the historical 2× ratio. In the idealized <code class="inline">p × q</code> area model it would be 4×.
</p>
</div>
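<div class="card" style="margin: 22px 0">
<span class="badge">sketch</span>
<h3>The 4× figure as arithmetic</h3>
<p>The ratio can be worked out directly from the area law. This sketch assumes the idealized model only: area is proportional to p × q, and formats are treated as plain 8-bit and 4-bit integer multipliers. It ignores exponent handling, wiring and everything else on a real chip, so the number is an upper-bound intuition, not a hardware spec. The helper names are hypothetical.</p>
<pre class="formula">
```python
def relative_mac_area(p, q):
    """Idealized MAC area in arbitrary units, proportional to p * q."""
    return p * q


def macs_per_area_gain(wide_bits, narrow_bits):
    """How many narrow MACs fit in the area of one wide MAC, in this model."""
    wide = relative_mac_area(wide_bits, wide_bits)
    narrow = relative_mac_area(narrow_bits, narrow_bits)
    return wide / narrow


print(macs_per_area_gain(8, 4))
print(macs_per_area_gain(16, 8))
```
</pre>
<p>Each halving of both operand widths gives 4× in this model, compared with the 2× you would expect if cost were linear in bits.</p>
</div>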
</section>
<!-- ============================ S2 — MUX / DATA MOVEMENT ============================ -->
<section class="chunk" id="s2">
<div class="chunk-head">
<div class="num">02</div>
<h2>The hidden tax: muxes and data movement</h2>
<div class="pill">communication</div>
</div>
<div class="grid cols-2">
<div class="card">
<span class="badge">old-school CUDA core / CPU</span>
<h3>Where does the MAC live?</h3>
<p>You drop the multiply-accumulate unit next to a <strong>register file</strong>. The MAC reads three registers — two operands and the accumulator — does its thing, writes back.</p>
<p>But which registers? The MAC doesn't always read the same three slots. So you need a <strong class="hi-m">mux</strong> in front of each input to <em>select</em>.</p>
</div>
<div class="card accent-mag">
<span class="badge">what is a mux</span>
<h3>A mux is a switch statement in silicon</h3>
<p>To pick "register #3" out of 8, hardware does the dumb thing: <strong>AND every entry with a one-hot mask, then OR everything together</strong>.</p>
<div class="formula">
<span class="c">// n-input, p-bit mux</span>
ANDs = <span class="v">n × p</span>
ORs = <span class="v">(n − 1) × p</span>
</div>
<p>Selecting a register is not free. It looks like nothing in software but it's a real chunk of silicon.</p>
</div>
</div>
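<div class="card" style="margin: 22px 0">
<span class="badge">sketch</span>
<h3>A mux in a few lines</h3>
<p>To make the one-hot trick concrete, here is a minimal software model of the mux described above, with the gate-count formulas alongside it. The register values are the ones drawn in FIG-02. The function names are illustrative, and multiplying by a 0/1 select line stands in for ANDing every bit of the entry with that line.</p>
<pre class="formula">
```python
def mux_select(registers, index):
    """Pick registers[index] the hardware way: AND with a one-hot mask, then OR."""
    selected = 0
    for i, value in enumerate(registers):
        select_line = 1 if i == index else 0   # one-hot: exactly one line is 1
        selected |= value * select_line         # gate the entry, then OR it in
    return selected


def mux_gate_count(n, p):
    """Gate counts for an n-input, p-bit mux built from 2-input gates."""
    return {"and": n * p, "or": (n - 1) * p}


register_file = [0b0110, 0b1010, 0b1101, 0b0001, 0b1011, 0b0100, 0b1111, 0b0010]
print(format(mux_select(register_file, 3), "04b"))
print(mux_gate_count(8, 4))
```
</pre>
<p>Note that every register is touched on every read, even though only one value comes out. That is the cost the formulas count.</p>
</div>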
<!-- DIAGRAM: mux cost vs MAC -->
<div class="diagram">
<div class="label">// FIG-02 — Where the gates actually go in a CUDA-style core</div>
<svg viewBox="0 0 720 320" xmlns="http://www.w3.org/2000/svg">
<!-- register file -->
<g>
<rect x="40" y="40" width="120" height="220" fill="#0c0e15" stroke="#2b3145"/>
<text x="100" y="32" text-anchor="middle" fill="#7a8194" font-size="11">REGISTER FILE</text>
<text x="100" y="252" text-anchor="middle" fill="#7a8194" font-size="11">8 entries × p bits</text>
<g font-family="JetBrains Mono" font-size="11" fill="#cbd1de">
<text x="55" y="60">R0 0110</text>
<text x="55" y="80">R1 1010</text>
<text x="55" y="100">R2 1101</text>
<text x="55" y="120">R3 0001</text>
<text x="55" y="140">R4 1011</text>
<text x="55" y="160">R5 0100</text>
<text x="55" y="180">R6 1111</text>
<text x="55" y="200">R7 0010</text>
</g>
</g>
<!-- 3 muxes -->
<g>
<rect x="220" y="60" width="80" height="50" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.5"/>
<text x="260" y="92" text-anchor="middle" fill="#ff3d8a" font-size="13">MUX A</text>
<rect x="220" y="130" width="80" height="50" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.5"/>
<text x="260" y="162" text-anchor="middle" fill="#ff3d8a" font-size="13">MUX B</text>
<rect x="220" y="200" width="80" height="50" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.5"/>
<text x="260" y="232" text-anchor="middle" fill="#ff3d8a" font-size="13">MUX C</text>
</g>
<!-- wires from regfile to muxes (a bunch of them, indicating bandwidth) -->
<g stroke="#2b3145" stroke-width="0.5" fill="none">
<path d="M160 65 L 220 85"/><path d="M160 85 L 220 85"/>
<path d="M160 105 L 220 85"/><path d="M160 125 L 220 85"/>
<path d="M160 145 L 220 85"/><path d="M160 165 L 220 85"/>
<path d="M160 185 L 220 85"/><path d="M160 205 L 220 85"/>
<path d="M160 65 L 220 155"/><path d="M160 85 L 220 155"/>
<path d="M160 105 L 220 155"/><path d="M160 125 L 220 155"/>
<path d="M160 145 L 220 155"/><path d="M160 165 L 220 155"/>
<path d="M160 185 L 220 155"/><path d="M160 205 L 220 155"/>
<path d="M160 65 L 220 225"/><path d="M160 85 L 220 225"/>
<path d="M160 105 L 220 225"/><path d="M160 125 L 220 225"/>
<path d="M160 145 L 220 225"/><path d="M160 165 L 220 225"/>
<path d="M160 185 L 220 225"/><path d="M160 205 L 220 225"/>
</g>
<!-- MAC unit -->
<g>
<rect x="360" y="120" width="120" height="80" fill="#0c0e15" stroke="#5cf2a4" stroke-width="2"/>
<text x="420" y="148" text-anchor="middle" fill="#5cf2a4" font-size="14" font-weight="bold">MAC</text>
<text x="420" y="168" text-anchor="middle" fill="#5cf2a4" font-size="11">multiply-add</text>
<text x="420" y="188" text-anchor="middle" fill="#7a8194" font-size="10">p × q gates</text>
</g>
<!-- wires mux→MAC -->
<g stroke="#5cf2a4" stroke-width="1" fill="none">
<path d="M300 85 L 360 140"/>
<path d="M300 155 L 360 160"/>
<path d="M300 225 L 360 180"/>
</g>
<!-- writeback -->
<path d="M480 160 L 540 160 L 540 60 L 160 60" stroke="#38d9ff" stroke-width="1" fill="none" stroke-dasharray="3,2"/>
<text x="550" y="100" fill="#38d9ff" font-size="10">writeback</text>
<!-- COST BANNER -->
<g transform="translate(40, 280)">
<rect width="640" height="34" fill="#0c0e15" stroke="#ff3d8a"/>
<text x="18" y="14" fill="#ff3d8a" font-size="10">⚠ AREA BUDGET</text>
<text x="18" y="28" fill="#cbd1de" font-size="12">3 muxes × 8 inputs × p bits = 24p ANDs vs. MAC = ~4p gates → 7/8 of cost is just MOVING DATA</text>
</g>
</svg>
</div>
<div class="insight">
<p>
<strong>Almost all the area in a classic CUDA core is just moving bytes</strong> — not doing
arithmetic. ~7/8 of the gates feed the muxes that read and write the register file. This is
the problem statement that motivated <span class="hi">Tensor Cores</span> and, before them,
<span class="hi">systolic arrays</span>.
</p>
</div>
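<div class="card" style="margin: 22px 0">
<span class="badge">sketch</span>
<h3>Checking the area budget</h3>
<p>A back-of-envelope version of the area budget, using only the counts shown in FIG-02: three 8-input, p-bit muxes against a MAC of roughly 4p gates. How gates are weighted is an assumption here. Counting only the mux AND gates gives 6/7, counting the OR gates as well gives 45/49, and the ~7/8 quoted above sits between the two. The function name and the <code class="inline">count_ors</code> flag are hypothetical.</p>
<pre class="formula">
```python
from fractions import Fraction


def data_movement_share(num_muxes, n, p, mac_gates, count_ors=False):
    """Fraction of gates spent on operand muxes rather than on the MAC."""
    mux_gates = num_muxes * n * p                 # AND gates
    if count_ors:
        mux_gates += num_muxes * (n - 1) * p      # OR gates
    return Fraction(mux_gates, mux_gates + mac_gates)


p = 4
print(data_movement_share(3, 8, p, 4 * p))
print(data_movement_share(3, 8, p, 4 * p, count_ors=True))
```
</pre>
<p>Either way the conclusion is the same: under these counts the MAC itself is a small minority of the gates.</p>
</div>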
</section>
<!-- ============================ S3 — SYSTOLIC ARRAYS ============================ -->
<section class="chunk" id="s3">
<div class="chunk-head">
<div class="num">03</div>
<h2>Systolic arrays: tilting the ratio</h2>
<div class="pill">tensor cores</div>
</div>
<div class="grid cols-2">
<div class="card">
<span class="badge">the move</span>
<h3>Bake two loops into hardware</h3>
<p>A single MAC bakes <strong>one</strong> level of the triple loop into silicon. A systolic array bakes <strong>two</strong>: an entire matrix-vector multiply becomes one fixed-function block.</p>
<p>The unit goes from <em>scalar op</em> to <em>tile of ops</em>. Larger granularity means the same <strong>register-file tax</strong> is amortized over way more arithmetic.</p>
</div>
<div class="card accent-phos">
<span class="badge">scaling property</span>
<h3>Quadratic compute, linear comm</h3>
<p>An <span class="hi">x × y</span> systolic array does <strong>x × y</strong> multiply-accumulates per cycle. But the data flowing in and out only scales as <strong>x</strong> (or x + y).</p>
<div class="formula">
compute ∝ <span class="v">x · y</span> <span class="c">(quadratic)</span>
i/o wires ∝ <span class="v">x</span> <span class="c">(linear)</span>
ratio → <span class="v">y</span>× better as it grows
</div>
<p>The bigger the array, the better the ratio. Older TPUs ran <strong>128 × 128</strong>.</p>
</div>
</div>
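<div class="card">
<span class="badge">sketch</span>
<p>The scaling claim above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming one MAC per cell per cycle and counting x input words plus y output words at the array edge; <code class="inline">systolicRatio</code> is a hypothetical helper, not something from the conversation.</p>

```javascript
// Hypothetical helper: per-cycle arithmetic vs. edge traffic for an x-by-y array.
// Assumes one MAC per cell per cycle, x words in along one edge, y words out along another.
function systolicRatio(x, y) {
  const macsPerCycle = x * y;    // every cell does one multiply-accumulate
  const ioWordsPerCycle = x + y; // traffic only crosses the edges
  return { macsPerCycle, ioWordsPerCycle, macsPerWord: macsPerCycle / ioWordsPerCycle };
}

for (const n of [2, 8, 32, 128]) {
  const r = systolicRatio(n, n);
  console.log(n + "x" + n + ": " + r.macsPerCycle + " MACs, " +
    r.ioWordsPerCycle + " words, " + r.macsPerWord + " MACs per word");
}
```

<p>For a square array the ratio is n/2, so a 128 × 128 array does 64 MACs for every word that crosses its edge.</p>
</div>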
<!-- DIAGRAM: systolic array -->
<div class="diagram">
<div class="label">// FIG-03 — 2×2 systolic array: weights stay, activations flow</div>
<svg viewBox="0 0 720 380" xmlns="http://www.w3.org/2000/svg">
<!-- input vector top -->
<g>
<text x="200" y="30" fill="#7a8194" font-size="11">activations stream in →</text>
<rect x="200" y="40" width="60" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="1.5"/>
<text x="230" y="65" text-anchor="middle" fill="#38d9ff" font-size="16">7</text>
<rect x="280" y="40" width="60" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="1.5"/>
<text x="310" y="65" text-anchor="middle" fill="#38d9ff" font-size="16">3</text>
</g>
<!-- 2x2 MAC grid with stored weights -->
<g>
<!-- top-left -->
<rect x="200" y="110" width="60" height="60" fill="#0c0e15" stroke="#5cf2a4" stroke-width="2"/>
<text x="230" y="135" text-anchor="middle" fill="#5cf2a4" font-size="11">w=0</text>
<text x="230" y="155" text-anchor="middle" fill="#cbd1de" font-size="10">MAC</text>
<!-- top-right -->
<rect x="280" y="110" width="60" height="60" fill="#0c0e15" stroke="#5cf2a4" stroke-width="2"/>
<text x="310" y="135" text-anchor="middle" fill="#5cf2a4" font-size="11">w=1</text>
<text x="310" y="155" text-anchor="middle" fill="#cbd1de" font-size="10">MAC</text>
<!-- bottom-left -->
<rect x="200" y="180" width="60" height="60" fill="#0c0e15" stroke="#5cf2a4" stroke-width="2"/>
<text x="230" y="205" text-anchor="middle" fill="#5cf2a4" font-size="11">w=3</text>
<text x="230" y="225" text-anchor="middle" fill="#cbd1de" font-size="10">MAC</text>
<!-- bottom-right -->
<rect x="280" y="180" width="60" height="60" fill="#0c0e15" stroke="#5cf2a4" stroke-width="2"/>
<text x="310" y="205" text-anchor="middle" fill="#5cf2a4" font-size="11">w=2</text>
<text x="310" y="225" text-anchor="middle" fill="#cbd1de" font-size="10">MAC</text>
</g>
<!-- flow arrows: activations top -> down -->
<g stroke="#38d9ff" stroke-width="1.5" fill="none" marker-end="url(#arrCyan)">
<path d="M230 80 L 230 110"/>
<path d="M310 80 L 310 110"/>
<path d="M230 170 L 230 180"/>
<path d="M310 170 L 310 180"/>
</g>
<!-- partial sums flow down -->
<g stroke="#ff3d8a" stroke-width="1.5" fill="none" marker-end="url(#arrMag)">
<path d="M230 240 L 230 270"/>
<path d="M310 240 L 310 270"/>
</g>
<!-- output -->
<g>
<rect x="200" y="270" width="60" height="40" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.5"/>
<text x="230" y="295" text-anchor="middle" fill="#ff3d8a" font-size="16">9</text>
<rect x="280" y="270" width="60" height="40" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.5"/>
<text x="310" y="295" text-anchor="middle" fill="#ff3d8a" font-size="16">13</text>
<text x="200" y="335" fill="#7a8194" font-size="11">↓ output vector (column dot-products)</text>
</g>
<!-- right side commentary -->
<g font-family="JetBrains Mono" font-size="11">
<text x="400" y="50" fill="#7a8194">// WEIGHTS</text>
<text x="400" y="68" fill="#cbd1de">stay put. loaded once,</text>
<text x="400" y="84" fill="#cbd1de">reused thousands of times.</text>
<text x="400" y="100" fill="#5cf2a4">→ huge compute reuse</text>
<text x="400" y="135" fill="#7a8194">// ACTIVATIONS</text>
<text x="400" y="153" fill="#cbd1de">flow top → bottom. only</text>
<text x="400" y="169" fill="#cbd1de">x wires of input bandwidth.</text>
<text x="400" y="185" fill="#38d9ff">→ linear i/o cost</text>
<text x="400" y="220" fill="#7a8194">// PARTIAL SUMS</text>
<text x="400" y="238" fill="#cbd1de">accumulate down columns →</text>
<text x="400" y="254" fill="#cbd1de">column dot-products fall out</text>
<text x="400" y="270" fill="#ff3d8a">at the bottom edge.</text>
<text x="400" y="305" fill="#7a8194">// LOADING WEIGHTS</text>
<text x="400" y="323" fill="#cbd1de">trickled in row by row as a</text>
<text x="400" y="339" fill="#cbd1de">daisy chain — slow but cheap,</text>
<text x="400" y="355" fill="#ffb547">since it happens rarely.</text>
</g>
<!-- defs for arrowheads -->
<defs>
<marker id="arrCyan" markerWidth="6" markerHeight="6" refX="3" refY="3" orient="auto">
<path d="M0,0 L6,3 L0,6 z" fill="#38d9ff"/>
</marker>
<marker id="arrMag" markerWidth="6" markerHeight="6" refX="3" refY="3" orient="auto">
<path d="M0,0 L6,3 L0,6 z" fill="#ff3d8a"/>
</marker>
</defs>
</svg>
</div>
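<div class="card">
<span class="badge">sketch</span>
<p>A weight-stationary array can be modeled functionally in a few lines. This sketch ignores the cycle-by-cycle skew of a real systolic array and only shows the dataflow: weights are parked in cells, activation r is applied to row r, and partial sums accumulate down each column. The matrix values and the name <code class="inline">systolicMatVec</code> are made up for illustration.</p>

```javascript
// Weight-stationary sketch: weights[r][c] is parked in the cell at row r, column c.
// Activation r is applied to row r; partial sums accumulate down each column.
function systolicMatVec(weights, activations) {
  const cols = weights[0].length;
  const partial = new Array(cols).fill(0); // partial sums entering the top row
  weights.forEach((row, r) => {
    row.forEach((w, c) => {
      partial[c] += w * activations[r];    // one MAC per cell
    });
  });
  return partial; // column dot-products leaving the bottom edge
}

const weights = [[1, 2], [3, 4]];              // made-up values, loaded once
console.log(systolicMatVec(weights, [5, 6]));  // [ 23, 34 ]
console.log(systolicMatVec(weights, [1, 0]));  // [ 1, 2 ], same weights reused
```

<p>The weights array is never touched between calls, which is the reuse the figure is pointing at: only the activation vector changes.</p>
</div>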
<div class="grid cols-2">
<div class="card accent-amb">
<span class="badge">sizing decision</span>
<h3>How big should the array be?</h3>
<p>A huge systolic array means more amortization. But it also leaves <strong>less area and flexibility</strong> for the register file and other ops.</p>
<p>Reiner's framing: set a budget — e.g. <span class="hi-a">10% of die area on data movement, 90% on the array</span> — and size everything from there.</p>
<p>Bigger register files = more application performance but less array.</p>
</div>
<div class="card accent-mag">
<span class="badge">MatX hint</span>
<h3>Splittable systolic arrays</h3>
<p>Reiner mentions MatX has a "<strong class="hi-m">splittable systolic array</strong>" — big arrays that can also operate as several small ones.</p>
<p>It's the obvious compromise between TPU's coarse granularity and GPU's many-small-cores layout. We'll come back to this in §09.</p>
</div>
</div>
</section>
<!-- ============================ S4 — CLOCK CYCLES ============================ -->
<section class="chunk" id="s4">
<div class="chunk-head">
<div class="num">04</div>
<h2>Clock cycles & pipeline registers</h2>
<div class="pill">timing</div>
</div>
<div class="grid cols-2">
<div class="card">
<span class="badge">why a clock?</span>
<h3>100 billion transistors, in lockstep</h3>
<p>Chips are <strong>massively</strong> parallel. To avoid software-style synchronization (mutexes, locks — way too slow), every nanosecond <strong>everything pauses simultaneously</strong>.</p>
<p>That moment is the <span class="hi">clock cycle</span>. It is mediated by registers: tiny storage devices that latch whatever value is on their input wire at the tick.</p>
</div>
<div class="card accent-cyan">
<span class="badge">the constraint</span>
<h3>Logic must finish before the tick</h3>
<p>If your "cloud of logic" between two registers takes longer than the clock period, you lose. The signal hasn't settled.</p>
<p>So a major job in chip design is making the <strong>longest path</strong> through any cloud of logic as short as possible.</p>
<p>Designers leave ~25% timing margin so the chip basically never misses a tick.</p>
</div>
</div>
<!-- DIAGRAM: pipeline register insertion -->
<div class="diagram">
<div class="label">// FIG-04 — Pipeline register insertion: trade area for clock speed</div>
<svg viewBox="0 0 720 280" xmlns="http://www.w3.org/2000/svg">
<!-- BEFORE -->
<g>
<text x="40" y="30" fill="#7a8194" font-size="11">BEFORE — long logic, 1 GHz max</text>
<rect x="40" y="50" width="20" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="2"/>
<text x="50" y="76" text-anchor="middle" fill="#38d9ff" font-size="13">R</text>
<path d="M60 70 L 80 70 L 80 50 L 220 50 L 240 70 L 220 90 L 80 90 L 80 70 Z"
fill="#0c0e15" stroke="#5cf2a4" stroke-width="1.5"/>
<text x="155" y="74" text-anchor="middle" fill="#5cf2a4" font-size="12">logic cloud (delay = T)</text>
<rect x="240" y="50" width="20" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="2"/>
<text x="250" y="76" text-anchor="middle" fill="#38d9ff" font-size="13">R</text>
<text x="40" y="115" fill="#7a8194" font-size="10">f_max ≈ 1/T</text>
</g>
<!-- AFTER: split with pipeline reg -->
<g transform="translate(0, 130)">
<text x="40" y="30" fill="#7a8194" font-size="11">AFTER — split with pipeline register, 2 GHz max</text>
<rect x="40" y="50" width="20" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="2"/>
<text x="50" y="76" text-anchor="middle" fill="#38d9ff" font-size="13">R</text>
<path d="M60 70 L 80 70 L 80 50 L 130 50 L 150 70 L 130 90 L 80 90 L 80 70 Z"
fill="#0c0e15" stroke="#5cf2a4" stroke-width="1.5"/>
<text x="105" y="74" text-anchor="middle" fill="#5cf2a4" font-size="11">half-logic</text>
<rect x="150" y="50" width="20" height="40" fill="#0c0e15" stroke="#ffb547" stroke-width="2"/>
<text x="160" y="76" text-anchor="middle" fill="#ffb547" font-size="13">R</text>
<path d="M170 70 L 190 70 L 190 50 L 240 50 L 260 70 L 240 90 L 190 90 L 190 70 Z"
fill="#0c0e15" stroke="#5cf2a4" stroke-width="1.5"/>
<text x="215" y="74" text-anchor="middle" fill="#5cf2a4" font-size="11">half-logic</text>
<rect x="260" y="50" width="20" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="2"/>
<text x="270" y="76" text-anchor="middle" fill="#38d9ff" font-size="13">R</text>
<text x="160" y="40" text-anchor="middle" fill="#ffb547" font-size="10">↑ inserted register</text>
<text x="40" y="115" fill="#7a8194" font-size="10">f_max ≈ 2/T (twice the speed, +1 register area)</text>
</g>
<!-- right side: feedback loop case -->
<g transform="translate(340, 0)">
<text x="0" y="30" fill="#7a8194" font-size="11">THE HARD CASE — feedback loop</text>
<text x="0" y="50" fill="#cbd1de" font-size="11">A running sum: reads its own value and adds.</text>
<g transform="translate(0, 70)">
<rect x="60" y="20" width="20" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="2"/>
<text x="70" y="46" text-anchor="middle" fill="#38d9ff" font-size="13">R</text>
<path d="M80 40 L 110 40 L 110 20 L 160 20 L 180 40 L 160 60 L 110 60 L 110 40 Z"
fill="#0c0e15" stroke="#5cf2a4" stroke-width="1.5"/>
<text x="135" y="44" text-anchor="middle" fill="#5cf2a4" font-size="12">+</text>
<path d="M180 40 L 210 40 L 210 100 L 60 100 L 60 60" fill="none" stroke="#ff3d8a" stroke-width="1.5" stroke-dasharray="3,2" marker-end="url(#arrMag2)"/>
<text x="210" y="116" text-anchor="end" fill="#ff3d8a" font-size="10">feedback</text>
</g>
<text x="0" y="200" fill="#cbd1de" font-size="11" font-style="italic">You can't just insert a pipeline reg —</text>
<text x="0" y="216" fill="#cbd1de" font-size="11" font-style="italic">it would split the sum into "evens" and "odds".</text>
<text x="0" y="240" fill="#ff3d8a" font-size="11">→ feedback loops set the chip's max clock.</text>
</g>
<defs>
<marker id="arrMag2" markerWidth="6" markerHeight="6" refX="3" refY="3" orient="auto">
<path d="M0,0 L6,3 L0,6 z" fill="#ff3d8a"/>
</marker>
</defs>
</svg>
</div>
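<div class="card">
<span class="badge">sketch</span>
<p>The figure's f_max ≈ 2/T treats registers as free. A slightly less idealized sketch, assuming each pipeline register adds a fixed delay <code class="inline">regOverheadNs</code> (a made-up 0.05 ns here), shows why adding stages gives diminishing returns. The function name and all numbers are illustrative assumptions.</p>

```javascript
// Assumed model: a logic cloud with total delay T (ns) cut into n equal stages,
// plus a fixed per-register overhead (ns). Both numbers below are illustrative.
function fMaxGHz(totalDelayNs, stages, regOverheadNs) {
  const periodNs = totalDelayNs / stages + regOverheadNs;
  return 1 / periodNs; // 1 / ns = GHz
}

const T = 1.0;
const tReg = 0.05;
for (const n of [1, 2, 4, 8, 16]) {
  console.log(n + " stage(s): " + fMaxGHz(T, n, tReg).toFixed(2) + " GHz");
}
```

<p>Under this model the clock can never exceed 1 / regOverheadNs however many registers are inserted, while register area keeps growing with every stage.</p>
</div>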
<div class="insight">
<p>
<strong>Latency vs throughput is a real knob.</strong> On feed-forward logic you can keep pushing
clock speed by stuffing in pipeline registers, but past a point almost all your area is
registers, not logic, and feedback loops cap the clock anyway. <em>Same energy lesson as last episode's batch-size talk: high clock
/ low batch favors latency. Lower clock / wider arrays favor throughput.</em>
</p>
</div>
</section>
<!-- ============================ S5 — FPGA vs ASIC ============================ -->
<section class="chunk" id="s5">
<div class="chunk-head">
<div class="num">05</div>
<h2>FPGA vs ASIC</h2>
<div class="pill">reconfigurability</div>
</div>
<div class="vs">
<div class="side left">
<h4>FPGA</h4>
<ul>
<li>First unit: <strong>~$10K</strong>.</li>
<li>Reconfigurable in the field — change the design any time.</li>
<li>Built from <strong>LUTs + registers + a giant mesh of muxes</strong>.</li>
<li>~10× more expensive in area and energy than ASIC.</li>
<li>Great when you change the workload often (e.g. HFT, prototyping).</li>
</ul>
</div>
<div class="divider">vs</div>
<div class="side right">
<h4>ASIC</h4>
<ul>
<li>First unit: <strong>~$30M</strong> (a full tape-out).</li>
<li>Frozen at fabrication. No changing the logic.</li>
<li>Custom polysilicon and wires — minimum gates for the job.</li>
<li>~10× cheaper & more efficient than the equivalent FPGA.</li>
<li>Worth it once volume + stability justify the NRE cost.</li>
</ul>
</div>
</div>
<div class="grid cols-2">
<div class="card accent-cyan">
<span class="badge">primitive</span>
<h3>The LUT: a 4→1 truth table in silicon</h3>
<p>A typical FPGA "lookup table" has <strong>4 input bits, 1 output bit</strong>. Inside it is a 16-entry table stored in configuration memory.</p>
<p>By writing different 16-bit patterns into that memory, the LUT becomes AND, OR, XOR, NAND, a 3-way majority, a 4-way parity — anything.</p>
<p>That's where the "field-programmable" comes from: muxes route signals between LUTs, LUTs configure into any gate. <strong class="hi-c">It's muxes all the way down.</strong></p>
</div>
<div class="card accent-mag">
<span class="badge">why the 10× tax</span>
<h3>Programmability has a cost</h3>
<p>An ASIC implements a 4-way AND with literally <strong>3 AND gates</strong>.</p>
<p>An FPGA implements the same thing with one LUT — which internally is ~<strong>32 gates</strong> of muxes selecting from a 16-entry table.</p>
<p>That's the ~10× tax. Plus the routing muxes between LUTs cost area and add wire delay.</p>
</div>
</div>
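<div class="card">
<span class="badge">sketch</span>
<p>A 4-input LUT is small enough to model directly. In this sketch the 16-bit configuration memory is a plain integer, and the four inputs form an index that picks one bit out of it. The bit ordering (input a as the most significant index bit) and the names <code class="inline">lut4</code>, <code class="inline">AND4</code> and <code class="inline">PARITY4</code> are assumptions for illustration, not a description of any vendor's LUT.</p>

```javascript
// A 4-input LUT modeled as a 16-bit configuration word:
// bit i of config is the output for input index i.
function lut4(config, a, b, c, d) {
  const index = (a << 3) | (b << 2) | (c << 1) | d; // inputs select one table entry
  return (config >> index) & 1;
}

const AND4 = 0x8000;    // only entry 1111 holds a 1
const PARITY4 = 0x6996; // entries with an odd number of 1 inputs hold a 1

console.log(lut4(AND4, 1, 1, 1, 1), lut4(AND4, 1, 0, 1, 1));       // 1 0
console.log(lut4(PARITY4, 1, 0, 0, 0), lut4(PARITY4, 1, 1, 0, 0)); // 1 0
```

<p>The same function becomes a different gate purely by changing the configuration word, which is the whole trick behind reconfigurability.</p>
</div>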
<!-- DIAGRAM: LUT -->
<div class="diagram">
<div class="label">// FIG-05 — 4-input LUT: a programmable truth table</div>
<svg viewBox="0 0 720 240" xmlns="http://www.w3.org/2000/svg">
<!-- inputs -->
<g font-family="JetBrains Mono" font-size="11" fill="#cbd1de">
<text x="30" y="60">a →</text>
<text x="30" y="100">b →</text>
<text x="30" y="140">c →</text>
<text x="30" y="180">d →</text>
</g>
<!-- 4 input muxes from "soup" -->
<g>
<rect x="80" y="45" width="50" height="30" fill="#0c0e15" stroke="#38d9ff"/>
<text x="105" y="64" text-anchor="middle" fill="#38d9ff" font-size="10">mux8→1</text>
<rect x="80" y="85" width="50" height="30" fill="#0c0e15" stroke="#38d9ff"/>
<text x="105" y="104" text-anchor="middle" fill="#38d9ff" font-size="10">mux8→1</text>
<rect x="80" y="125" width="50" height="30" fill="#0c0e15" stroke="#38d9ff"/>
<text x="105" y="144" text-anchor="middle" fill="#38d9ff" font-size="10">mux8→1</text>
<rect x="80" y="165" width="50" height="30" fill="#0c0e15" stroke="#38d9ff"/>
<text x="105" y="184" text-anchor="middle" fill="#38d9ff" font-size="10">mux8→1</text>
</g>
<text x="80" y="220" fill="#7a8194" font-size="10">↑ select from nearby LUTs / registers</text>
<!-- LUT body -->
<g>
<rect x="200" y="50" width="240" height="160" fill="#0c0e15" stroke="#5cf2a4" stroke-width="2"/>
<text x="320" y="42" text-anchor="middle" fill="#5cf2a4" font-size="11">16-ENTRY TRUTH TABLE</text>
<!-- truth table -->
<g font-family="JetBrains Mono" font-size="10" fill="#cbd1de">
<text x="218" y="72">0000→0</text><text x="288" y="72">0100→1</text><text x="358" y="72">1000→0</text>
<text x="218" y="92">0001→1</text><text x="288" y="92">0101→0</text><text x="358" y="92">1001→1</text>
<text x="218" y="112">0010→1</text><text x="288" y="112">0110→1</text><text x="358" y="112">1010→1</text>
<text x="218" y="132">0011→0</text><text x="288" y="132">0111→0</text><text x="358" y="132">1011→0</text>
<text x="218" y="152">— program-</text><text x="288" y="152">able 16-</text><text x="358" y="152">bit memory</text>
<text x="218" y="172">defines</text><text x="288" y="172">which gate</text><text x="358" y="172">this LUT is</text>
</g>
<text x="320" y="200" text-anchor="middle" fill="#ffb547" font-size="10">cost ≈ 32 gates per LUT (vs 3 ANDs in ASIC)</text>
</g>
<!-- output -->
<g>
<rect x="490" y="115" width="40" height="30" fill="#0c0e15" stroke="#ff3d8a"/>
<text x="510" y="134" text-anchor="middle" fill="#ff3d8a" font-size="11">OUT</text>
</g>
<!-- wires -->
<g stroke="#2b3145" stroke-width="0.8" fill="none">
<path d="M130 60 L 200 70"/>
<path d="M130 100 L 200 110"/>
<path d="M130 140 L 200 150"/>
<path d="M130 180 L 200 190"/>
<path d="M440 130 L 490 130"/>
</g>
<text x="550" y="80" fill="#7a8194" font-size="10">FIELD-PROGRAMMABLE</text>
<text x="550" y="98" fill="#cbd1de" font-size="10">"field" = deployed in</text>
<text x="550" y="114" fill="#cbd1de" font-size="10">the wild, not at fab time</text>
<text x="550" y="148" fill="#7a8194" font-size="10">CONFIG</text>
<text x="550" y="166" fill="#cbd1de" font-size="10">16-bit per LUT +</text>
<text x="550" y="182" fill="#cbd1de" font-size="10">mux selector bits</text>
</svg>
</div>
</section>
<!-- ============================ S6 — Cache vs Scratchpad ============================ -->
<section class="chunk" id="s6">
<div class="chunk-head">
<div class="num">06</div>
<h2>Cache vs scratchpad: who decides what's hot?</h2>
<div class="pill">memory model</div>
</div>
<div class="vs">
<div class="side left">
<h4>CPU / Cache</h4>
<ul>
<li>One "read memory" instruction. <strong>Hardware decides</strong> if data is in cache.</li>
<li>Cache is ~100× faster than DDR — programs need it to run at reasonable speed.</li>
<li>But hit/miss depends on ambient environment: other programs, recent accesses, replacement policy.</li>
<li><strong class="hi-c">Non-deterministic latency.</strong></li>
</ul>
</div>
<div class="divider">vs</div>
<div class="side right">
<h4>TPU / Scratchpad</h4>
<ul>
<li>Two distinct instructions: <strong>"read scratchpad"</strong> and <strong>"read HBM"</strong>.</li>
<li>Software is responsible for placing data in the right tier.</li>
<li>Same idea, totally different control surface.</li>
<li><strong class="hi-m">Deterministic latency — by construction.</strong></li>
</ul>
</div>
</div>
<div class="card">
<span class="badge">key insight</span>
<h3>FPGAs win on latency because cache is gone</h3>
<p>Reiner notes you <em>could</em> build a CPU with deterministic latency — and some chips do (Groq, TPU compute cores). It's actually a simpler starting point. CPUs are non-deterministic because someone added caches and branch prediction. You can take them out — you just give up performance per cycle to do it.</p>
<p>This is why HFT shops like Jane Street reach for FPGAs: predictable per-packet latency matters more than peak throughput.</p>
</div>
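<div class="card">
<span class="badge">sketch</span>
<p>The difference in control surface can be shown with a toy model. Below, the cache is a tiny direct-mapped cache whose latency depends on what was accessed before, while the scratchpad exposes two operations with fixed costs. The latencies (1 and 100 cycles), the direct-mapped policy, and every name here are illustrative assumptions.</p>

```javascript
// Illustrative latencies in cycles; real numbers vary by chip.
const HIT = 1;
const MISS = 100;

// Cache: one read(), and the hardware decides hit or miss from history.
// Modeled as a direct-mapped cache with `lines` slots (an assumption).
function makeCache(lines) {
  const tags = new Array(lines).fill(null);
  return function read(addr) {
    const slot = addr % lines;
    if (tags[slot] === addr) return HIT;
    tags[slot] = addr; // fill on miss, evicting whatever was there
    return MISS;
  };
}

// Scratchpad: two distinct operations, each with a fixed cost.
const readScratchpad = () => HIT;
const readHBM = () => MISS;

const read = makeCache(4);
console.log(read(7), read(7));            // 100 1: same instruction, different latency
read(3);                                  // address 3 maps to the same slot as 7
console.log(read(7));                     // 100: evicted, so a miss again
console.log(readScratchpad(), readHBM()); // 1 100, every time
```

<p>With the scratchpad, the latency is visible in the program text; with the cache, it depends on everything that ran before.</p>
</div>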
</section>
<!-- ============================ S7 — CPU vs GPU CORES ============================ -->
<section class="chunk" id="s7">
<div class="chunk-head">
<div class="num">07</div>
<h2>Why CPU cores are bigger than GPU cores</h2>
<div class="pill">architecture</div>
</div>
<div class="grid cols-2">
<div class="card accent-cyan">
<span class="badge">CPU</span>
<h3>~100 cores × big & complicated</h3>
<p>A modern CPU has ~100 cores doing ~16-way SIMD = ~1,000-way parallelism. But each core is <strong>huge</strong>.</p>
<p>Where does the die go? Mostly:</p>
<ul>
<li><strong>Cache hierarchy</strong></li>
<li><strong>Register files</strong></li>
<li><span class="hi-c">Branch predictor</span> (the GPU-killer)</li>
<li>ALUs (small fraction)</li>
</ul>
</div>
<div class="card accent-mag">
<span class="badge">GPU</span>
<h3>Many small SMs, no branch predictor</h3>
<p>A GPU rips out a lot of CPU baggage, most importantly the branch predictor, and shrinks the register files. Result: way more cores, with a larger share of the area going to ALUs.</p>
<p>The trade is that GPUs are bad at branchy serial code. They're great at SIMD throughput where you don't need to guess where the next instruction lives.</p>
</div>
</div>
<div class="card">
<span class="badge">what does a branch predictor do?</span>
<h3>It predicts the future ~5 cycles ahead</h3>
<p>A single instruction takes ~5 ns to process: read, decode, evaluate, write back. To run at 1–2 GHz, the CPU must <strong>keep pipelining new instructions</strong> while old ones finish. But if an old instruction is a branch (an <code class="inline">if</code>), the CPU doesn't yet know which way to go.</p>
<p>The branch predictor guesses — based on history, target tables, and pattern detection — and the pipeline runs ahead speculatively. On a misprediction, the speculative work is thrown away. That's why a tight branchy loop on a CPU benefits enormously from a sophisticated predictor.</p>
<p>GPUs don't have one because they don't need one. They rely on having <strong>so many threads in flight</strong> that they can just switch to ready work while a branch resolves.</p>
</div>
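<div class="card">
<span class="badge">sketch</span>
<p>To make "guessing from history" concrete, here is the textbook 2-bit saturating counter, about the simplest history-based predictor there is. It is a teaching sketch and not what any particular CPU ships; real predictors are far more elaborate. The names <code class="inline">makePredictor</code> and <code class="inline">mispredictions</code> are hypothetical.</p>

```javascript
// Textbook 2-bit saturating counter: states 0,1 predict "not taken"; 2,3 predict "taken".
function makePredictor() {
  let state = 2; // start weakly "taken" (arbitrary choice)
  return {
    predict: () => state >= 2,
    update(taken) {
      state = taken ? Math.min(3, state + 1) : Math.max(0, state - 1);
    },
  };
}

// Count mispredictions over a sequence of branch outcomes.
function mispredictions(outcomes) {
  const p = makePredictor();
  let misses = 0;
  for (const taken of outcomes) {
    if (p.predict() !== taken) misses++;
    p.update(taken);
  }
  return misses;
}

const loopBranch = Array(9).fill(true).concat([false]); // loop runs 9 times, then exits
console.log(mispredictions(loopBranch));                 // 1: only the exit is missed
const alternating = [true, false, true, false, true, false];
console.log(mispredictions(alternating));                // 3: this simple scheme misses every "false"
```

<p>A loop branch is almost free to predict, while an alternating pattern defeats this simple scheme, which is one reason real predictors add pattern detection on top.</p>
</div>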
</section>
<!-- ============================ S8 — BRAINS VS CHIPS ============================ -->
<section class="chunk" id="s8">
<div class="chunk-head">
<div class="num">08</div>
<h2>Brains vs chips</h2>
<div class="pill">analogies & limits</div>
</div>
<div class="grid cols-2">
<div class="card accent-amb">
<span class="badge">structural differences</span>
<h3>Where the analogy holds</h3>
<ul>
<li><strong>Memory ↔ compute co-location:</strong> brains do it natively, but systolic arrays do too — the weight sits where the math happens.</li>
<li><strong>Sparsity:</strong> brains are <em>unstructured</em> sparse. Chips can do structured sparsity but pay a tax for unstructured.</li>
<li><strong>Clock speed:</strong> the brain runs at maybe kilohertz rates; chips run at gigahertz.</li>
</ul>
</div>
<div class="card accent-phos">
<span class="badge">energy</span>
<h3>Why slow doesn't mean efficient</h3>
<p>Most of a chip's energy is in <span class="hi">switching bits 0↔1</span> — charging and discharging tiny capacitors. Static idle power is much smaller.</p>
<p>So if you ran a GPU at 1 MHz instead of 1 GHz, you'd draw ~1,000× less power. But you'd also do ~1,000× less work per second. <strong>Per operation, you don't save much.</strong></p>
<p>The brain isn't more efficient simply by being slow. Something else is going on — likely a combination of co-location, sparsity, and analog computation.</p>
</div>
</div>
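<div class="card">
<span class="badge">sketch</span>
<p>The "slow doesn't mean efficient" argument follows from the standard CMOS dynamic-power approximation, P = α · C · V² · f. The sketch below uses made-up values for the activity factor, switched capacitance and voltage, and holds voltage fixed as the clock changes, which is an assumption; real chips can also lower voltage at lower clocks, and that is deliberately left out here.</p>

```javascript
// Standard CMOS dynamic-power approximation: P = alpha * C * V^2 * f.
// alpha (activity factor), C (switched capacitance) and V (voltage) are made-up values,
// and voltage is held fixed as the clock changes (an assumption).
function dynamicPowerWatts(alpha, capFarads, volts, hz) {
  return alpha * capFarads * volts * volts * hz;
}

function energyPerCycleJoules(alpha, capFarads, volts, hz) {
  return dynamicPowerWatts(alpha, capFarads, volts, hz) / hz; // f cancels out
}

const alpha = 0.1;
const C = 1e-6;
const V = 0.8;
for (const f of [1e9, 1e6]) {
  console.log(f + " Hz: " + dynamicPowerWatts(alpha, C, V, f) + " W, " +
    energyPerCycleJoules(alpha, C, V, f) + " J per cycle");
}
```

<p>Power drops by about 1,000× between the two clocks, but energy per cycle stays the same, so slowing down alone does not make each operation cheaper.</p>
</div>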
</section>
<!-- ============================ S9 — GPU = tiny TPUs ============================ -->
<section class="chunk" id="s9">
<div class="chunk-head">
<div class="num">09</div>
<h2>A GPU is just a bunch of tiny TPUs</h2>
<div class="pill">tile sizing</div>
</div>
<!-- DIAGRAM: gpu vs tpu floorplan -->
<div class="diagram">
<div class="label">// FIG-06 — Floor plan comparison: GPU's many small tiles vs TPU's few big tiles</div>
<svg viewBox="0 0 720 320" xmlns="http://www.w3.org/2000/svg">
<!-- GPU side -->
<g>
<text x="40" y="30" fill="#7a8194" font-size="11">GPU — many small SMs around an L2</text>
<rect x="40" y="50" width="300" height="240" fill="#0c0e15" stroke="#38d9ff" stroke-width="2"/>
<!-- top row of SMs -->
<g fill="#0c0e15" stroke="#38d9ff" stroke-width="0.8">
<rect x="55" y="65" width="40" height="40"/>
<rect x="100" y="65" width="40" height="40"/>
<rect x="145" y="65" width="40" height="40"/>
<rect x="190" y="65" width="40" height="40"/>
<rect x="235" y="65" width="40" height="40"/>
<rect x="280" y="65" width="40" height="40"/>
</g>
<!-- L2 -->
<rect x="55" y="115" width="265" height="90" fill="#11141d" stroke="#7a8194"/>
<text x="187" y="165" text-anchor="middle" fill="#7a8194" font-size="13">L2 cache</text>
<!-- bottom row of SMs -->
<g fill="#0c0e15" stroke="#38d9ff" stroke-width="0.8">
<rect x="55" y="215" width="40" height="40"/>
<rect x="100" y="215" width="40" height="40"/>
<rect x="145" y="215" width="40" height="40"/>
<rect x="190" y="215" width="40" height="40"/>
<rect x="235" y="215" width="40" height="40"/>
<rect x="280" y="215" width="40" height="40"/>
</g>
<text x="75" y="92" fill="#38d9ff" font-size="9">SM</text>
<text x="120" y="92" fill="#38d9ff" font-size="9">SM</text>
<text x="165" y="92" fill="#38d9ff" font-size="9">SM</text>
<text x="210" y="92" fill="#38d9ff" font-size="9">SM</text>
<text x="255" y="92" fill="#38d9ff" font-size="9">SM</text>
<text x="300" y="92" fill="#38d9ff" font-size="9">SM</text>
<text x="40" y="305" fill="#cbd1de" font-size="10">each SM ≈ small TPU: tensor core + vector unit</text>
</g>
<!-- TPU side -->
<g>
<text x="380" y="30" fill="#7a8194" font-size="11">TPU — few big matrix units (MXUs) with one vector unit</text>
<rect x="380" y="50" width="300" height="240" fill="#0c0e15" stroke="#ff3d8a" stroke-width="2"/>
<!-- top MXU -->
<rect x="395" y="65" width="270" height="80" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.2"/>
<text x="530" y="110" text-anchor="middle" fill="#ff3d8a" font-size="13">MXU (big systolic array)</text>
<!-- vector unit -->
<rect x="395" y="155" width="270" height="40" fill="#11141d" stroke="#ffb547"/>
<text x="530" y="180" text-anchor="middle" fill="#ffb547" font-size="12">vector unit</text>
<!-- bottom MXU -->
<rect x="395" y="205" width="270" height="70" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.2"/>
<text x="530" y="245" text-anchor="middle" fill="#ff3d8a" font-size="13">MXU (big systolic array)</text>
<text x="380" y="305" fill="#cbd1de" font-size="10">amortizes the register-file tax across a much larger tile</text>
</g>
</svg>
</div>
<div class="grid cols-2">
<div class="card accent-cyan">
<span class="badge">GPU</span>
<h3>Lots of small tiles → flexibility</h3>
<ul>
<li>Many SMs, each with their own tensor core, vector ALUs, register file.</li>
<li>Lots of <strong>perimeter</strong> between matrix and vector units → tons of cross-bandwidth.</li>
<li>Great when there isn't one giant matmul — when work is uneven or branchy.</li>
<li>Pays the register-file tax over many small tiles.</li>
</ul>
</div>
<div class="card accent-mag">
<span class="badge">TPU</span>
<h3>Few big tiles → amortization</h3>
<ul>
<li>One or two huge MXUs + one vector unit.</li>
<li>All data movement between matrix and vector squeezes through narrow perimeter — <strong>lower bandwidth</strong> there.</li>
<li>But each register-file dollar is spread over a much bigger tile.</li>
<li>Wins when the workload is one big matmul. Suffers when work doesn't fit the shape.</li>
</ul>
</div>
</div>
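<div class="card">
<span class="badge">sketch</span>
<p>The perimeter argument can be put in numbers with a toy floor-plan model. It assumes square tiles, MAC count proportional to tile area, and matrix-to-vector bandwidth proportional to tile perimeter; the tile sizes and the <code class="inline">floorplan</code> helper are illustrative assumptions, not real chip dimensions.</p>

```javascript
// Assumed toy model: square tiles of side s cells; MACs scale with area (s * s),
// matrix-to-vector bandwidth scales with tile perimeter (4 * s).
function floorplan(tileSide, tileCount) {
  const macs = tileCount * tileSide * tileSide;
  const perimeter = tileCount * 4 * tileSide;
  return { macs, perimeter, macsPerEdgeCell: macs / perimeter };
}

const oneBig = floorplan(128, 1);    // TPU-style: one large array
const manySmall = floorplan(32, 16); // GPU-style: sixteen small arrays
console.log(oneBig);    // same MAC count ...
console.log(manySmall); // ... but 4x the perimeter
```

<p>Both layouts hold 16,384 MACs, but the sixteen small tiles expose four times as much edge: more cross-bandwidth, and four times less arithmetic amortized per edge cell.</p>
</div>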
<div class="insight">
<p>
<strong>MatX's pitch (hinted at):</strong> a <span class="hi-m">splittable systolic array</span> that
behaves like a big TPU tile when the matmul is big, and like a stack of small GPU-style tiles when
it isn't. Best of both granularities, ideally without the worst of either.
</p>
</div>
</section>
<!-- ============================ S10 — TAKEAWAYS ============================ -->
<section class="chunk" id="s10">
<div class="chunk-head">
<div class="num">10</div>
<h2>The mental model in 7 lines</h2>
<div class="pill">summary</div>
</div>
<div class="stats">
<div class="stat"><div class="big">×4</div><div class="lbl">FP4 vs FP8 (theoretical)</div></div>
<div class="stat mag"><div class="big">7/8</div><div class="lbl">cost in data movement</div></div>
<div class="stat amb"><div class="big">128²</div><div class="lbl">classic TPU MXU size</div></div>
<div class="stat cyan"><div class="big">10×</div><div class="lbl">FPGA tax vs ASIC</div></div>
</div>
<div class="grid cols-2">
<div class="card accent-phos">
<span class="badge">remember this</span>
<h3>The compute / comm ratio is everything</h3>
<p>At every level of the stack — from the bit-width of a multiplier to the floor plan of a datacenter — the optimization is <strong>more arithmetic per byte moved</strong>. That's what motivates low precision, systolic arrays, scratchpads, and the GPU-vs-TPU layout debate.</p>
</div>
<div class="card accent-amb">
<span class="badge">7 lines</span>
<h3>The whole conversation, compressed</h3>
<ul>
<li>The atomic op of AI hardware is the <strong class="hi-a">multiply-accumulate</strong>.</li>
<li>Multiplier area scales as <strong class="hi-a">p × q</strong>, so low precision wins quadratically.</li>
<li>In a vanilla core, <strong class="hi-a">most of the area moves data</strong> rather than doing arithmetic.</li>
<li>Systolic arrays fix this by baking a 2D loop into hardware and parking weights in place.</li>
<li>Clock speed is set by the longest path — pipeline registers buy speed at an area cost.</li>
<li>FPGAs trade ~10× efficiency for in-field reconfigurability via LUTs and routing muxes.</li>
<li>GPUs are many tiny TPUs; TPUs are few big GPU-tiles. MatX wants <strong class="hi-a">splittable</strong>.</li>
</ul>
</div>
</div>
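<div class="card">
<span class="badge">sketch</span>
<p>The ×4 figure falls straight out of the p × q area model. The sketch below treats FP8 and FP4 as plain 8-bit and 4-bit operands, which is a simplification (a real floating-point multiplier only multiplies the mantissas), and <code class="inline">multiplierArea</code> is a hypothetical helper in arbitrary units.</p>

```javascript
// Area model from the notes: multiplier area grows as p * q (operand bit-widths).
const multiplierArea = (p, q) => p * q;

const fp8 = multiplierArea(8, 8); // treating FP8 as 8-bit operands (a simplification)
const fp4 = multiplierArea(4, 4);
console.log(fp8 / fp4); // 4: four FP4-sized multipliers fit in the area of one FP8-sized one
```

<p>Halving both operand widths quarters the area, which is where "low precision wins quadratically" comes from.</p>
</div>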
<div class="card" style="margin-top: 22px;">
<span class="badge">further</span>
<h3>Open questions the transcript points at</h3>
<ul>
<li>How should a chip split its resources between FP4 and FP8? Equal die area? Equal power budget? Customer-driven?</li>
<li>How big should one systolic array be before perimeter bandwidth kills you?</li>
<li>Can splittable systolic arrays really get TPU's amortization <strong>and</strong> GPU's intra-chip bandwidth?</li>
<li>What's the analog-computation / co-location story that lets brains run at kilohertz and still beat silicon on perception?</li>
</ul>
</div>
</section>
<footer>
<div class="ascii"> ╭──────────────────────────────────────────────╮
│ END OF TRANSMISSION │
│ built from notes // Dwarkesh × Reiner Pope │
╰──────────────────────────────────────────────╯</div>
<div>// CHIP-NOTES v1.0 — single-file html, no deps, vibes intact</div>
</footer>
</main>
<script>
// smooth-scroll for TOC
document.querySelectorAll('.toc a').forEach(a => {
a.addEventListener('click', e => {
e.preventDefault();
const t = document.querySelector(a.getAttribute('href'));
if (t) t.scrollIntoView({ behavior: 'smooth', block: 'start' });
});
});
// highlight current section in TOC
const sections = document.querySelectorAll('section.chunk');
const tocLinks = document.querySelectorAll('.toc a');
const observer = new IntersectionObserver((entries) => {
entries.forEach(entry => {
if (entry.isIntersecting) {
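// clear both highlight styles so earlier links don't keep a lit border
tocLinks.forEach(l => { l.style.color = ''; l.style.borderLeftColor = ''; });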
const active = document.querySelector(`.toc a[href="#${entry.target.id}"]`);
if (active) {
active.style.color = 'var(--phos)';
active.style.borderLeftColor = 'var(--phos)';
}
}
});
}, { threshold: 0.2, rootMargin: '-30% 0px -50% 0px' });
sections.forEach(s => observer.observe(s));
</script>
</body>
</html>