Inside The Chip — Lalo Adrian Morales

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>INSIDE THE CHIP // notes from Reiner Pope on how AI silicon actually works</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@300;400;500;700;800&family=Major+Mono+Display&family=Space+Grotesk:wght@400;500;700&display=swap" rel="stylesheet">
<style>
  /* ===========================================================
     INSIDE THE CHIP — single file, schematic aesthetic
     palette: phosphor + magenta + amber on near-black
     =========================================================== */

  :root {
    --bg-0: #07080c;
    --bg-1: #0c0e15;
    --bg-2: #11141d;
    --bg-3: #161a25;
    --line: #1f2332;
    --line-2: #2b3145;
    --txt: #cbd1de;
    --txt-dim: #7a8194;
    --txt-mute: #4d5468;
    --phos: #5cf2a4;     /* phosphor green */
    --cyan: #38d9ff;
    --mag:  #ff3d8a;
    --amb:  #ffb547;
    --red:  #ff5a5a;
    --vio:  #b48cff;
    --shadow-hard: 4px 4px 0 0 #000;
    --shadow-neon: 0 0 0 1px var(--line-2), 0 0 30px -10px var(--phos), 6px 6px 0 0 #000;
  }

  * { box-sizing: border-box; margin: 0; padding: 0; }

  html, body {
    background: var(--bg-0);
    color: var(--txt);
    font-family: 'JetBrains Mono', ui-monospace, monospace;
    font-size: 14px;
    line-height: 1.6;
    -webkit-font-smoothing: antialiased;
    overflow-x: hidden;
  }

  /* ----------- circuit board background ------------ */
  body::before {
    content: '';
    position: fixed;
    inset: 0;
    pointer-events: none;
    background:
      radial-gradient(circle at 20% 10%, rgba(92,242,164,0.06), transparent 40%),
      radial-gradient(circle at 80% 80%, rgba(255,61,138,0.05), transparent 45%),
      linear-gradient(var(--line) 1px, transparent 1px) 0 0/40px 40px,
      linear-gradient(90deg, var(--line) 1px, transparent 1px) 0 0/40px 40px;
    background-color: var(--bg-0);
    opacity: 0.6;
    z-index: 0;
  }
  body::after {
    /* scanlines */
    content: '';
    position: fixed;
    inset: 0;
    pointer-events: none;
    background: repeating-linear-gradient(
      0deg,
      rgba(0,0,0,0.0) 0px,
      rgba(0,0,0,0.0) 2px,
      rgba(0,0,0,0.15) 3px,
      rgba(0,0,0,0.0) 4px
    );
    z-index: 1;
    mix-blend-mode: multiply;
  }

  main { position: relative; z-index: 2; }

  /* ================= TOP BAR ================= */
  .topbar {
    display: flex;
    align-items: center;
    justify-content: space-between;
    padding: 14px 28px;
    border-bottom: 1px solid var(--line-2);
    background: rgba(7,8,12,0.85);
    backdrop-filter: blur(8px);
    position: sticky;
    top: 0;
    z-index: 100;
    font-size: 11px;
    letter-spacing: 0.18em;
    text-transform: uppercase;
  }
  .topbar .left { display: flex; gap: 24px; align-items: center; }
  .topbar .dot {
    width: 8px; height: 8px;
    background: var(--phos);
    border-radius: 50%;
    box-shadow: 0 0 12px var(--phos);
    animation: blink 1.4s infinite;
  }
  @keyframes blink { 50% { opacity: 0.3; } }
  .topbar .right { color: var(--txt-mute); display: flex; gap: 18px; }
  .topbar .right span { color: var(--phos); }

  /* ================= HERO ================= */
  .hero {
    padding: 80px 28px 40px;
    position: relative;
    max-width: 1400px;
    margin: 0 auto;
  }
  .hero .tag {
    display: inline-block;
    border: 1px solid var(--phos);
    color: var(--phos);
    padding: 4px 10px;
    font-size: 10px;
    letter-spacing: 0.25em;
    margin-bottom: 26px;
    box-shadow: 3px 3px 0 0 #000;
  }
  .hero h1 {
    font-family: 'Major Mono Display', monospace;
    font-size: clamp(48px, 8vw, 120px);
    line-height: 0.92;
    letter-spacing: -0.02em;
    color: var(--txt);
    margin-bottom: 22px;
    text-shadow: 0 0 40px rgba(92,242,164,0.15);
  }
  .hero h1 .sub {
    display: block;
    font-family: 'Major Mono Display', monospace;
    color: var(--phos);
    font-size: 0.45em;
    margin-top: 8px;
    text-shadow: 0 0 20px rgba(92,242,164,0.5);
  }
  .hero .lede {
    max-width: 720px;
    font-size: 16px;
    color: var(--txt-dim);
    margin-top: 28px;
    line-height: 1.7;
  }
  .hero .lede strong { color: var(--txt); font-weight: 500; }
  .hero .lede .h { color: var(--amb); }

  .hero .meta {
    margin-top: 36px;
    display: grid;
    grid-template-columns: repeat(auto-fit, minmax(160px, 1fr));
    gap: 14px;
    max-width: 900px;
  }
  .meta .item {
    border: 1px solid var(--line-2);
    background: var(--bg-1);
    padding: 14px 16px;
    box-shadow: 4px 4px 0 0 #000;
  }
  .meta .item .k { font-size: 10px; color: var(--txt-mute); letter-spacing: 0.2em; text-transform: uppercase; }
  .meta .item .v { font-size: 15px; color: var(--phos); margin-top: 6px; font-weight: 500; }

  /* ================= SECTION FRAME ================= */
  section.chunk {
    padding: 60px 28px;
    max-width: 1400px;
    margin: 0 auto;
    position: relative;
  }
  .chunk-head {
    display: grid;
    grid-template-columns: auto 1fr auto;
    align-items: center;
    gap: 18px;
    margin-bottom: 36px;
    padding-bottom: 18px;
    border-bottom: 1px dashed var(--line-2);
  }
  .chunk-head .num {
    font-family: 'Major Mono Display', monospace;
    font-size: 42px;
    color: var(--mag);
    text-shadow: 0 0 18px rgba(255,61,138,0.4);
    line-height: 1;
  }
  .chunk-head h2 {
    font-family: 'Space Grotesk', sans-serif;
    font-size: clamp(24px, 3vw, 36px);
    font-weight: 700;
    letter-spacing: -0.01em;
    color: var(--txt);
  }
  .chunk-head .pill {
    font-size: 10px;
    letter-spacing: 0.22em;
    text-transform: uppercase;
    color: var(--txt-mute);
    border: 1px solid var(--line-2);
    padding: 4px 10px;
    background: var(--bg-1);
  }

  /* ================= CARDS ================= */
  .grid {
    display: grid;
    gap: 22px;
  }
  .grid.cols-2 { grid-template-columns: repeat(auto-fit, minmax(360px, 1fr)); }
  .grid.cols-3 { grid-template-columns: repeat(auto-fit, minmax(280px, 1fr)); }

  .card {
    background: linear-gradient(180deg, var(--bg-2) 0%, var(--bg-1) 100%);
    border: 1px solid var(--line-2);
    padding: 24px;
    position: relative;
    box-shadow: 6px 6px 0 0 #000, 0 0 0 1px rgba(255,255,255,0.02) inset;
    transition: transform 0.2s ease, box-shadow 0.2s ease, border-color 0.2s ease;
  }
  .card:hover {
    transform: translate(-2px, -2px);
    box-shadow: 8px 8px 0 0 #000, 0 0 0 1px var(--phos) inset;
    border-color: var(--phos);
  }
  .card .badge {
    position: absolute;
    top: -10px;
    left: 18px;
    background: var(--bg-0);
    border: 1px solid var(--line-2);
    padding: 2px 10px;
    font-size: 10px;
    letter-spacing: 0.2em;
    color: var(--txt-mute);
    text-transform: uppercase;
  }
  .card h3 {
    font-family: 'Space Grotesk', sans-serif;
    font-size: 20px;
    font-weight: 700;
    margin-bottom: 14px;
    color: var(--txt);
    letter-spacing: -0.01em;
  }
  .card h3 .glyph { color: var(--phos); margin-right: 8px; }
  .card p { color: var(--txt-dim); margin-bottom: 10px; }
  .card p:last-child { margin-bottom: 0; }
  .card strong { color: var(--txt); font-weight: 500; }
  .card .hi { color: var(--phos); }
  .card .hi-m { color: var(--mag); }
  .card .hi-a { color: var(--amb); }
  .card .hi-c { color: var(--cyan); }

  /* card flavors */
  .card.accent-phos { border-color: rgba(92,242,164,0.35); }
  .card.accent-phos .badge { color: var(--phos); border-color: var(--phos); }
  .card.accent-mag  { border-color: rgba(255,61,138,0.35); }
  .card.accent-mag  .badge { color: var(--mag); border-color: var(--mag); }
  .card.accent-amb  { border-color: rgba(255,181,71,0.35); }
  .card.accent-amb  .badge { color: var(--amb); border-color: var(--amb); }
  .card.accent-cyan { border-color: rgba(56,217,255,0.35); }
  .card.accent-cyan .badge { color: var(--cyan); border-color: var(--cyan); }

  .card ul { list-style: none; padding: 0; margin: 8px 0; }
  .card ul li {
    padding: 6px 0 6px 22px;
    position: relative;
    color: var(--txt-dim);
    border-bottom: 1px dotted var(--line);
  }
  .card ul li:last-child { border-bottom: none; }
  .card ul li::before {
    content: '▸';
    position: absolute;
    left: 0;
    color: var(--phos);
    font-size: 10px;
    top: 9px;
  }

  /* ================= INLINE CODE/FORMULA ================= */
  .formula {
    background: var(--bg-0);
    border: 1px solid var(--line-2);
    border-left: 3px solid var(--phos);
    padding: 14px 18px;
    font-family: 'JetBrains Mono', monospace;
    font-size: 13px;
    color: var(--phos);
    margin: 14px 0;
    overflow-x: auto;
    white-space: pre;   /* keep the line breaks in multi-line formulas */
    box-shadow: 3px 3px 0 0 #000;
  }
  .formula .c { color: var(--txt-mute); }
  .formula .v { color: var(--amb); }
  .formula .o { color: var(--mag); }
  .formula .n { color: var(--cyan); }

  code.inline {
    background: var(--bg-0);
    border: 1px solid var(--line);
    padding: 1px 6px;
    color: var(--amb);
    font-size: 12px;
  }

  /* ================= QUOTE/INSIGHT ================= */
  .insight {
    border: 1px solid var(--mag);
    background: linear-gradient(135deg, rgba(255,61,138,0.06), transparent);
    padding: 20px 24px;
    margin: 20px 0;
    position: relative;
    box-shadow: 5px 5px 0 0 #000;
  }
  .insight::before {
    content: '!! INSIGHT';
    position: absolute;
    top: -10px;
    left: 16px;
    background: var(--bg-0);
    color: var(--mag);
    padding: 2px 10px;
    font-size: 10px;
    letter-spacing: 0.25em;
    border: 1px solid var(--mag);
  }
  .insight p { color: var(--txt); font-size: 15px; line-height: 1.7; }
  .insight p strong { color: var(--mag); }

  /* ================= SVG DIAGRAM WRAPPER ================= */
  .diagram {
    background: var(--bg-0);
    border: 1px solid var(--line-2);
    padding: 22px;
    margin: 22px 0;
    box-shadow: 6px 6px 0 0 #000;
    overflow-x: auto;
  }
  .diagram .label {
    font-size: 10px;
    letter-spacing: 0.25em;
    color: var(--txt-mute);
    text-transform: uppercase;
    margin-bottom: 12px;
  }
  .diagram svg { display: block; margin: 0 auto; max-width: 100%; height: auto; }

  /* ================= TABLE ================= */
  .tbl {
    width: 100%;
    border-collapse: separate;
    border-spacing: 0;
    font-size: 13px;
    margin: 16px 0;
  }
  .tbl th, .tbl td {
    padding: 12px 14px;
    text-align: left;
    border-bottom: 1px solid var(--line);
  }
  .tbl th {
    background: var(--bg-0);
    color: var(--txt-mute);
    font-size: 10px;
    letter-spacing: 0.2em;
    text-transform: uppercase;
    font-weight: 500;
    border-bottom: 1px solid var(--line-2);
  }
  .tbl td .pos { color: var(--phos); }
  .tbl td .neg { color: var(--mag); }
  .tbl td .neu { color: var(--amb); }

  /* ================= TOC ================= */
  .toc {
    position: fixed;
    top: 90px;
    right: 20px;
    width: 220px;
    background: var(--bg-1);
    border: 1px solid var(--line-2);
    box-shadow: 6px 6px 0 0 #000;
    padding: 16px;
    font-size: 11px;
    z-index: 50;
    display: none;
  }
  .toc h4 {
    font-size: 10px;
    letter-spacing: 0.25em;
    color: var(--txt-mute);
    text-transform: uppercase;
    margin-bottom: 12px;
    border-bottom: 1px dashed var(--line-2);
    padding-bottom: 10px;
  }
  .toc a {
    display: block;
    padding: 6px 0;
    color: var(--txt-dim);
    text-decoration: none;
    border-left: 2px solid transparent;
    padding-left: 8px;
    transition: all 0.15s;
  }
  .toc a:hover { color: var(--phos); border-left-color: var(--phos); }
  .toc a .num { color: var(--mag); margin-right: 8px; font-size: 10px; }
  @media (min-width: 1280px) { .toc { display: block; } }

  /* ================= FOOTER ================= */
  footer {
    border-top: 1px solid var(--line-2);
    padding: 40px 28px;
    margin-top: 60px;
    text-align: center;
    color: var(--txt-mute);
    font-size: 11px;
    letter-spacing: 0.15em;
  }
  footer .ascii {
    color: var(--phos);
    font-size: 10px;
    line-height: 1.3;
    margin: 20px 0;
    white-space: pre;
    font-family: 'JetBrains Mono', monospace;
    opacity: 0.5;
  }

  /* ============ SVG common text style ============ */
  svg text { font-family: 'JetBrains Mono', monospace; font-size: 11px; }

  /* ============ tag chips inside cards ============ */
  .chips { display: flex; flex-wrap: wrap; gap: 6px; margin-top: 12px; }
  .chips span {
    font-size: 10px;
    padding: 3px 8px;
    border: 1px solid var(--line-2);
    color: var(--txt-mute);
    letter-spacing: 0.1em;
    background: var(--bg-0);
  }

  /* ============ side-by-side comparison ============ */
  .vs {
    display: grid;
    grid-template-columns: 1fr auto 1fr;
    gap: 18px;
    align-items: stretch;
    margin: 22px 0;
  }
  .vs .side {
    border: 1px solid var(--line-2);
    background: var(--bg-1);
    padding: 20px;
    box-shadow: 5px 5px 0 0 #000;
  }
  .vs .side h4 {
    font-family: 'Space Grotesk', sans-serif;
    font-size: 18px;
    margin-bottom: 12px;
    letter-spacing: -0.01em;
  }
  .vs .side.left  { border-left: 3px solid var(--cyan); }
  .vs .side.left h4 { color: var(--cyan); }
  .vs .side.right { border-left: 3px solid var(--mag); }
  .vs .side.right h4 { color: var(--mag); }
  .vs .side ul li::before { color: currentColor; }
  .vs .side.left ul li::before { color: var(--cyan); }
  .vs .side.right ul li::before { color: var(--mag); }
  .vs .divider {
    align-self: center;
    font-family: 'Major Mono Display', monospace;
    font-size: 28px;
    color: var(--amb);
    text-shadow: 0 0 20px rgba(255,181,71,0.5);
  }
  @media (max-width: 700px) {
    .vs { grid-template-columns: 1fr; }
    .vs .divider { text-align: center; }
  }

  /* ============ "you said / he said" Q&A block ============ */
  .qa {
    border: 1px solid var(--line-2);
    background: var(--bg-1);
    padding: 18px 22px;
    margin: 14px 0;
    box-shadow: 4px 4px 0 0 #000;
  }
  .qa .q { color: var(--cyan); margin-bottom: 10px; font-size: 13px; }
  .qa .q::before { content: '>> '; color: var(--cyan); }
  .qa .a { color: var(--txt-dim); padding-left: 20px; border-left: 2px solid var(--phos); }
  .qa .a strong { color: var(--phos); }

  /* ============ stacked banner stat ============ */
  .stats {
    display: grid;
    grid-template-columns: repeat(auto-fit, minmax(180px, 1fr));
    gap: 14px;
    margin: 22px 0;
  }
  .stat {
    background: var(--bg-1);
    border: 1px solid var(--line-2);
    padding: 18px;
    text-align: center;
    box-shadow: 4px 4px 0 0 #000;
  }
  .stat .big {
    font-family: 'Major Mono Display', monospace;
    font-size: 32px;
    color: var(--phos);
    text-shadow: 0 0 20px rgba(92,242,164,0.4);
    margin-bottom: 4px;
  }
  .stat .lbl { font-size: 10px; color: var(--txt-mute); letter-spacing: 0.18em; text-transform: uppercase; }
  .stat.mag .big { color: var(--mag); text-shadow: 0 0 20px rgba(255,61,138,0.4); }
  .stat.amb .big { color: var(--amb); text-shadow: 0 0 20px rgba(255,181,71,0.4); }
  .stat.cyan .big { color: var(--cyan); text-shadow: 0 0 20px rgba(56,217,255,0.4); }

  /* small details */
  ::selection { background: var(--phos); color: var(--bg-0); }
  a { color: var(--cyan); text-decoration: none; border-bottom: 1px dotted var(--cyan); }
  a:hover { color: var(--phos); border-bottom-color: var(--phos); }

  /* small screens fix for TOC offset */
  @media (max-width: 1279px) {
    .chunk, .hero { padding-left: 20px; padding-right: 20px; }
  }
</style>
</head>
<body>

<!-- ============================ TOP BAR ============================ -->
<div class="topbar">
  <div class="left">
    <span class="dot"></span>
    <span>SYS // CHIP-NOTES v1.0</span>
    <span style="color:var(--txt-mute)">SRC: Dwarkesh × Reiner Pope (MatX)</span>
  </div>
  <div class="right">
    <span>STATUS: <span>NOMINAL</span></span>
    <span>NODE: 3nm</span>
  </div>
</div>

<!-- ============================ TOC ============================ -->
<nav class="toc">
  <h4>// SECTIONS</h4>
  <a href="#s0"><span class="num">00</span>The Big Idea</a>
  <a href="#s1"><span class="num">01</span>Logic Gates → MAC</a>
  <a href="#s2"><span class="num">02</span>Mux + Data Movement</a>
  <a href="#s3"><span class="num">03</span>Systolic Arrays</a>
  <a href="#s4"><span class="num">04</span>Clock Cycles</a>
  <a href="#s5"><span class="num">05</span>FPGA vs ASIC</a>
  <a href="#s6"><span class="num">06</span>Cache vs Scratchpad</a>
  <a href="#s7"><span class="num">07</span>CPU vs GPU Cores</a>
  <a href="#s8"><span class="num">08</span>Brains vs Chips</a>
  <a href="#s9"><span class="num">09</span>GPU = tiny TPUs</a>
  <a href="#s10"><span class="num">10</span>Key Takeaways</a>
</nav>

<main>

<!-- ============================ HERO ============================ -->
<section class="hero">
  <span class="tag">// TRANSCRIPT BREAKDOWN</span>
  <h1>
    inside<br>the chip
    <span class="sub">how AI silicon actually works</span>
  </h1>
  <p class="lede">
    A bottom-up walk through how an AI chip is built — starting from <strong>AND gates</strong> and
    ending at the <span class="h">GPU-vs-TPU architectural split</span>. Notes from a conversation
    between Dwarkesh Patel and <strong>Reiner Pope</strong>, CEO of <strong>MatX</strong>. Every
    section here distills one big idea from the transcript and the diagrams that make it stick.
  </p>

  <div class="meta">
    <div class="item"><div class="k">primitive</div><div class="v">multiply-accumulate</div></div>
    <div class="item"><div class="k">building block</div><div class="v">full adder (3→2)</div></div>
    <div class="item"><div class="k">unit cell</div><div class="v">systolic array</div></div>
    <div class="item"><div class="k">enemy #1</div><div class="v">data movement cost</div></div>
    <div class="item"><div class="k">scaling law</div><div class="v">MAC area ∝ p × q</div></div>
  </div>
</section>

<!-- ============================ S0 — BIG IDEA ============================ -->
<section class="chunk" id="s0">
  <div class="chunk-head">
    <div class="num">00</div>
    <h2>The one idea that runs the whole thing</h2>
    <div class="pill">META</div>
  </div>

  <div class="insight">
    <p>
      Every level of chip design is the same fight: <strong>maximize compute relative to
      communication</strong>. From the precision of a single multiplier, to the size of a systolic
      array, to the layout of a whole datacenter — you are always trying to do more arithmetic per
      byte you move. That's it. That's the whole show.
    </p>
  </div>

  <div class="grid cols-3">
    <div class="card accent-phos">
      <span class="badge">level 1</span>
      <h3><span class="glyph">◇</span>at the gate</h3>
      <p>Multiplier area scales <span class="hi">quadratically</span> with bit-width, so halving precision more than doubles throughput per unit of area. This is a big part of why FP4 is so much faster than FP8.</p>
    </div>
    <div class="card accent-mag">
      <span class="badge">level 2</span>
      <h3><span class="glyph">◇</span>at the core</h3>
      <p>A systolic array bakes a 2D loop of MACs into hardware so the weight matrix sits <span class="hi-m">in place</span> while activations flow through.</p>
    </div>
    <div class="card accent-amb">
      <span class="badge">level 3</span>
      <h3><span class="glyph">◇</span>at the chip</h3>
      <p>A scratchpad replaces a cache so memory access is <span class="hi-a">deterministic</span> and software, not hardware, controls movement.</p>
    </div>
  </div>
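  <div class="card" style="margin-top: 22px;">
    <span class="badge">sketch</span>
    <p>One way to put a number on "arithmetic per byte you move" is arithmetic intensity. The sketch below is a rough illustration rather than anything from the conversation: the function name is hypothetical, and it assumes the ideal case where each input matrix is read once and the output is written once.</p>
<pre class="formula">
```python
def matmul_arithmetic_intensity(m: int, n: int, k: int, bytes_per_element: float) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matmul.

    Assumption: ideal reuse, so each operand is read once and the
    output is written once.
    """
    flops = 2 * m * n * k  # one multiply plus one add per MAC
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / bytes_moved


if __name__ == "__main__":
    for size in (128, 1024, 8192):
        intensity = matmul_arithmetic_intensity(size, size, size, bytes_per_element=1.0)
        print(f"{size} x {size} x {size}: {intensity:.1f} FLOPs per byte")
```
</pre>
    <p>Under these assumptions, larger matrices and narrower number formats both raise the ratio, which is the same trade the three levels above are chasing.</p>
  </div>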
</section>

<!-- ============================ S1 — LOGIC GATES TO MAC ============================ -->
<section class="chunk" id="s1">
  <div class="chunk-head">
    <div class="num">01</div>
    <h2>Logic gates → multiply-accumulate</h2>
    <div class="pill">primitives</div>
  </div>

  <div class="grid cols-2">
    <div class="card">
      <span class="badge">why MAC?</span>
      <h3>The atomic op of AI</h3>
      <p>Look inside any matrix multiply and you find a triple <code class="inline">for</code> loop:</p>
      <div class="formula">
<span class="c">// matrix multiply, three nested loops</span>
<span class="o">for</span> i: <span class="o">for</span> j: <span class="o">for</span> k:
  out[i,k] <span class="o">+=</span> A[i,j] <span class="o">*</span> B[j,k]
      </div>
      <p>Every step is one <strong class="hi">multiply-accumulate</strong>: a multiply followed by an add into a running sum. So the whole chip can be optimized around that one operation.</p>
    </div>

    <div class="card accent-phos">
      <span class="badge">precision asymmetry</span>
      <h3>Multiply small, accumulate big</h3>
      <p>Reiner's example: a <span class="hi">4-bit × 4-bit</span> multiply, accumulating into an <span class="hi">8-bit</span> running sum.</p>
      <p>Why the asymmetry? Two reasons:</p>
      <ul>
        <li>The product of two N-bit numbers needs 2N bits to hold without loss.</li>
        <li>You sum many of these — rounding errors pile up in the <strong>accumulator</strong>, not the multiplier.</li>
      </ul>
      <p>So <span class="hi-m">low-precision multiply + higher-precision add</span> is close to a free lunch: the multiplier shrinks while the wider accumulator keeps rounding error in check.</p>
    </div>
  </div>
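  <div class="card" style="margin: 22px 0;">
    <span class="badge">sketch</span>
    <p>The two cards above can be checked with a few lines of Python. This is a minimal sketch, assuming plain lists of unsigned integers; the helper names are hypothetical. It counts one MAC per inner-loop step and shows how many bits a product and a running sum need.</p>
<pre class="formula">
```python
def matmul_with_mac_count(a, b):
    """Triple-loop matmul over lists of lists; returns (out, mac_count)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    macs = 0
    for i in range(rows):
        for j in range(inner):
            for k in range(cols):
                out[i][k] += a[i][j] * b[j][k]  # one multiply-accumulate
                macs += 1
    return out, macs


def bits_needed(value: int) -> int:
    """Bits required to hold a non-negative integer without loss."""
    return max(1, value.bit_length())


if __name__ == "__main__":
    out, macs = matmul_with_mac_count([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(out, macs)
    largest_4bit = 15
    print(bits_needed(largest_4bit * largest_4bit))      # one 4-bit x 4-bit product
    print(bits_needed(2 * largest_4bit * largest_4bit))  # sum of two such products
```
</pre>
    <p>The largest 4-bit product already fills 8 bits, and adding a second one spills into a ninth, which is why the accumulator is the place to spend precision.</p>
  </div>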

  <!-- DIAGRAM: long multiplication -->
  <div class="diagram">
    <div class="label">// FIG-01 — 4-bit × 4-bit long multiplication, accumulator on top</div>
    <svg viewBox="0 0 720 360" xmlns="http://www.w3.org/2000/svg">
      <defs>
        <pattern id="grid1" width="20" height="20" patternUnits="userSpaceOnUse">
          <path d="M 20 0 L 0 0 0 20" fill="none" stroke="#1f2332" stroke-width="0.5"/>
        </pattern>
      </defs>
      <rect width="720" height="360" fill="url(#grid1)"/>

      <!-- multiplier example -->
      <g font-family="JetBrains Mono" font-size="16" fill="#cbd1de">
        <!-- top number A = 1001 -->
        <text x="240" y="50" fill="#7a8194" font-size="11">A (4-bit)</text>
        <text x="320" y="50" font-size="20" fill="#5cf2a4">1 0 0 1</text>
        <!-- B = 1010 -->
        <text x="240" y="80" fill="#7a8194" font-size="11">B (4-bit)</text>
        <text x="320" y="80" font-size="20" fill="#38d9ff">× 1 0 1 0</text>

        <line x1="240" y1="92" x2="450" y2="92" stroke="#2b3145"/>

        <!-- partial products (16 ANDs) -->
        <text x="40" y="120" fill="#7a8194" font-size="11">16 AND gates → partial products</text>
        <text x="320" y="120" fill="#ffb547">0 0 0 0</text>
        <text x="305" y="142" fill="#ffb547">1 0 0 1 ·</text>
        <text x="290" y="164" fill="#ffb547">0 0 0 0 · ·</text>
        <text x="275" y="186" fill="#ffb547">1 0 0 1 · · ·</text>

        <!-- 8-bit accumulator -->
        <text x="40" y="218" fill="#7a8194" font-size="11">+ accumulator (8-bit)</text>
        <text x="245" y="218" fill="#ff3d8a">0 1 1 0 1 0 1 1</text>

        <line x1="240" y1="232" x2="470" y2="232" stroke="#2b3145"/>

        <!-- 5-way sum -->
        <text x="40" y="262" fill="#7a8194" font-size="11">5-way column sum → 16 full adders</text>

        <text x="245" y="266" fill="#5cf2a4" font-size="20">1 1 0 0 0 1 0 1</text>
      </g>

      <!-- right column callouts -->
      <g font-family="JetBrains Mono" font-size="11" fill="#7a8194">
        <text x="510" y="55">p × q ANDs</text>
        <text x="510" y="70" fill="#5cf2a4">= 16</text>

        <text x="510" y="160">+ (p + q) accumulator bits</text>
        <text x="510" y="175">= 24 input bits</text>

        <text x="510" y="260">p × q full adders</text>
        <text x="510" y="275" fill="#ff3d8a">= 16</text>
      </g>

      <!-- formula box -->
      <g transform="translate(40, 295)">
        <rect width="640" height="50" fill="#0c0e15" stroke="#5cf2a4" stroke-width="1"/>
        <text x="20" y="22" fill="#7a8194" font-size="11">SCALING LAW</text>
        <text x="20" y="40" fill="#5cf2a4" font-size="13">
          p-bit × q-bit MAC  →  p×q AND gates  +  p×q full adders  → area ≈ O(p·q)
        </text>
      </g>
    </svg>
  </div>
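  <div class="card" style="margin: 22px 0;">
    <span class="badge">sketch</span>
    <p>The long multiplication in FIG-01 can be replayed in software. The sketch below is an illustration with hypothetical function names, assuming unsigned operands: it builds each partial-product row with bitwise ANDs, shifts it into place, and adds everything onto the accumulator.</p>
<pre class="formula">
```python
def to_bits(value: int, width: int) -> list:
    """Little-endian bits: index 0 is the least significant bit."""
    return [(value >> i) % 2 for i in range(width)]


def mac_by_long_multiplication(a: int, b: int, acc: int, p: int = 4, q: int = 4):
    """Compute a * b + acc from AND-gate partial products.

    Returns (result, and_gate_count). The result is kept to p + q bits,
    matching an accumulator of that width.
    """
    a_bits, b_bits = to_bits(a, p), to_bits(b, q)
    total = acc
    and_gates = 0
    for shift, b_bit in enumerate(b_bits):
        # One row of partial products: every bit of A ANDed with one bit of B.
        row = [a_bit & b_bit for a_bit in a_bits]
        and_gates += len(row)
        row_value = sum(bit * 2**i for i, bit in enumerate(row))
        total += row_value * 2**shift
    return total % 2 ** (p + q), and_gates


if __name__ == "__main__":
    result, and_gates = mac_by_long_multiplication(0b1001, 0b1010, 0b01101011)
    print(format(result, "08b"), and_gates)  # prints: 11000101 16
```
</pre>
    <p>With A = 1001 (9), B = 1010 (10) and an accumulator of 01101011 (107), the result is 9 × 10 + 107 = 197, using p × q = 16 AND gates.</p>
  </div>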

  <!-- FULL ADDER -->
  <div class="grid cols-2">
    <div class="card accent-cyan">
      <span class="badge">building block</span>
      <h3>The full adder is a <span class="hi-c">3→2 compressor</span></h3>
      <p>Coming from software you'd assume a "full adder" adds two 32-bit numbers. <strong>It doesn't.</strong></p>
      <p>It takes <strong>three single bits</strong>, counts them, and writes the count in binary as <strong>two bits</strong>.</p>
      <div class="formula">
<span class="c">// truth table sample</span>
in = 1 1 1  →  out = 1 1   <span class="c">(count=3)</span>
in = 1 0 1  →  out = 1 0   <span class="c">(count=2)</span>
in = 0 1 0  →  out = 0 1   <span class="c">(count=1)</span>
in = 0 0 0  →  out = 0 0   <span class="c">(count=0)</span>
      </div>
      <p>The right output bit is the column sum. The left bit is the <strong>carry</strong> into the next column.</p>
    </div>

    <div class="card accent-amb">
      <span class="badge">algorithm</span>
      <h3>Dadda multiplier</h3>
      <p>To do the big column sum from the multiplication above, you tile <strong>full adders</strong> across the partial-product grid.</p>
      <ul>
        <li>Each adder eats <span class="hi-a">3 bits</span>, emits <span class="hi-a">2 bits</span> — net <strong>−1 bit</strong>.</li>
        <li>You start with 24 input bits (16 partial-product bits plus 8 accumulator bits) and end with 8 output bits.</li>
        <li>So you need exactly <span class="hi">24 − 8 = 16</span> full adders.</li>
        <li>Generalizes to <span class="hi">p × q</span> full adders for a p-bit × q-bit MAC.</li>
      </ul>
      <p>It's a standard area-efficient multiplier construction.</p>
    </div>
  </div>
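  <div class="card" style="margin: 22px 0;">
    <span class="badge">sketch</span>
    <p>Both cards above reduce to a few lines of arithmetic. This is a minimal sketch with hypothetical helper names: a full adder modeled as a bit counter, plus the idealized counting argument from the list. Real reduction trees also use half adders and drop the top carry, so treat the count as the estimate given in the text, not an exact gate list.</p>
<pre class="formula">
```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple:
    """3-to-2 compressor: count three input bits, return (carry, sum)."""
    count = a + b + c
    return count // 2, count % 2


def full_adders_needed(p: int, q: int) -> int:
    """Idealized count: each full adder removes exactly one bit."""
    input_bits = p * q + (p + q)  # partial products plus accumulator
    output_bits = p + q
    return input_bits - output_bits  # simplifies to p * q


if __name__ == "__main__":
    for a, b, c in product((0, 1), repeat=3):
        carry, total = full_adder(a, b, c)
        print(f"in = {a} {b} {c}  out = {carry} {total}")
    print(full_adders_needed(4, 4))  # prints: 16
```
</pre>
    <p>The two output bits always read as the binary count of the inputs, so 2 × carry + sum equals a + b + c for all eight input combinations.</p>
  </div>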

  <!-- Quadratic scaling insight -->
  <div class="insight">
    <p>
      <strong>The quadratic scaling insight.</strong> Halving precision doesn't just double throughput.
      It more than doubles it, because multiplier area scales as <code class="inline">p × q</code>. This is
      the single biggest reason low-precision arithmetic has worked so well for neural nets.
      Nvidia even acknowledged this on B300 by quoting <span class="hi">FP4 ≈ 3× FP8</span> instead
      of the historical 2× ratio. By the <code class="inline">p × q</code> area model alone, halving both operand widths would give 4×.
    </p>
  </div>
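  <div class="card" style="margin-top: 22px;">
    <span class="badge">sketch</span>
    <p>The 4× figure follows directly from the p × q model. The sketch below is a back-of-the-envelope illustration with hypothetical names. It treats the whole bit-width as multiplier width, which is an assumption: real floating-point formats split their bits between exponent and mantissa, so the true ratio can differ.</p>
<pre class="formula">
```python
def mac_cells(p: int, q: int) -> int:
    """Unit cells in a p-bit x q-bit MAC: one AND gate plus one full adder each."""
    return p * q


def relative_density(bits: int, baseline_bits: int = 8) -> float:
    """How many MACs fit in the area of one baseline MAC (square operands)."""
    return mac_cells(baseline_bits, baseline_bits) / mac_cells(bits, bits)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits}-bit x {bits}-bit: {relative_density(bits):.2f}x MACs per unit area vs 8-bit")
```
</pre>
    <p>Going from 8-bit to 4-bit operands gives 64 / 16 = 4 times as many MACs in the same area under this model.</p>
  </div>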
</section>

<!-- ============================ S2 — MUX / DATA MOVEMENT ============================ -->
<section class="chunk" id="s2">
  <div class="chunk-head">
    <div class="num">02</div>
    <h2>The hidden tax: muxes and data movement</h2>
    <div class="pill">communication</div>
  </div>

  <div class="grid cols-2">
    <div class="card">
      <span class="badge">old-school CUDA core / CPU</span>
      <h3>Where does the MAC live?</h3>
      <p>You drop the multiply-accumulate unit next to a <strong>register file</strong>. The MAC reads three registers — two operands and the accumulator — does its thing, writes back.</p>
      <p>But which registers? The MAC doesn't always read the same three slots. So you need a <strong class="hi-m">mux</strong> in front of each input to <em>select</em>.</p>
    </div>

    <div class="card accent-mag">
      <span class="badge">what is a mux</span>
      <h3>A mux is a hardware switch</h3>
      <p>To pick "register #3" out of 8, hardware does the dumb thing: <strong>AND every entry with a one-hot mask, then OR everything together</strong>.</p>
      <div class="formula">
<span class="c">// n-input, p-bit mux</span>
ANDs = <span class="v">n × p</span>
ORs  = <span class="v">(n − 1) × p</span>
      </div>
      <p>Selecting a register is not free. It looks like nothing in software but it's a real chunk of silicon.</p>
    </div>
  </div>
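  <div class="card" style="margin: 22px 0;">
    <span class="badge">sketch</span>
    <p>The AND-then-OR construction can be mimicked in software to see where the gate counts come from. This is a sketch under stated assumptions: the function name is hypothetical, registers are unsigned integers, and the one-hot mask is built in Python, not by a decoder circuit.</p>
<pre class="formula">
```python
def mux(entries: list, select: int, width: int) -> tuple:
    """Pick entries[select] with a one-hot mask.

    Returns (value, and_gates, or_gates) for an n-input, width-bit mux.
    """
    n = len(entries)
    one_hot = [1 if i == select else 0 for i in range(n)]
    result = 0
    for entry, enable in zip(entries, one_hot):
        mask = (2**width - 1) * enable    # enable bit copied to every bit position
        result = result | (entry & mask)  # AND with the mask, OR into the output
    and_gates = n * width
    or_gates = (n - 1) * width
    return result, and_gates, or_gates


if __name__ == "__main__":
    registers = [0b0110, 0b1010, 0b1101, 0b0001, 0b1011, 0b0100, 0b1111, 0b0010]
    value, and_gates, or_gates = mux(registers, select=3, width=4)
    print(format(value, "04b"), and_gates, or_gates)  # prints: 0001 32 28
```
</pre>
    <p>Selecting one 4-bit register out of eight costs 32 ANDs and 28 ORs in this model, and a MAC with three inputs needs three such muxes.</p>
  </div>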

  <!-- DIAGRAM: mux cost vs MAC -->
  <div class="diagram">
    <div class="label">// FIG-02 — Where the gates actually go in a CUDA-style core</div>
    <svg viewBox="0 0 720 320" xmlns="http://www.w3.org/2000/svg">
      <!-- register file -->
      <g>
        <rect x="40" y="40" width="120" height="220" fill="#0c0e15" stroke="#2b3145"/>
        <text x="100" y="32" text-anchor="middle" fill="#7a8194" font-size="11">REGISTER FILE</text>
        <text x="100" y="252" text-anchor="middle" fill="#7a8194" font-size="11">8 entries × p bits</text>
        <g font-family="JetBrains Mono" font-size="11" fill="#cbd1de">
          <text x="55" y="60">R0  0110</text>
          <text x="55" y="80">R1  1010</text>
          <text x="55" y="100">R2  1101</text>
          <text x="55" y="120">R3  0001</text>
          <text x="55" y="140">R4  1011</text>
          <text x="55" y="160">R5  0100</text>
          <text x="55" y="180">R6  1111</text>
          <text x="55" y="200">R7  0010</text>
        </g>
      </g>

      <!-- 3 muxes -->
      <g>
        <rect x="220" y="60" width="80" height="50" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.5"/>
        <text x="260" y="92" text-anchor="middle" fill="#ff3d8a" font-size="13">MUX A</text>
        <rect x="220" y="130" width="80" height="50" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.5"/>
        <text x="260" y="162" text-anchor="middle" fill="#ff3d8a" font-size="13">MUX B</text>
        <rect x="220" y="200" width="80" height="50" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.5"/>
        <text x="260" y="232" text-anchor="middle" fill="#ff3d8a" font-size="13">MUX C</text>
      </g>

      <!-- wires from regfile to muxes (a bunch of them, indicating bandwidth) -->
      <g stroke="#2b3145" stroke-width="0.5" fill="none">
        <path d="M160 65 L 220 85"/><path d="M160 85 L 220 85"/>
        <path d="M160 105 L 220 85"/><path d="M160 125 L 220 85"/>
        <path d="M160 145 L 220 85"/><path d="M160 165 L 220 85"/>
        <path d="M160 185 L 220 85"/><path d="M160 205 L 220 85"/>

        <path d="M160 65 L 220 155"/><path d="M160 85 L 220 155"/>
        <path d="M160 105 L 220 155"/><path d="M160 125 L 220 155"/>
        <path d="M160 145 L 220 155"/><path d="M160 165 L 220 155"/>
        <path d="M160 185 L 220 155"/><path d="M160 205 L 220 155"/>

        <path d="M160 65 L 220 225"/><path d="M160 85 L 220 225"/>
        <path d="M160 105 L 220 225"/><path d="M160 125 L 220 225"/>
        <path d="M160 145 L 220 225"/><path d="M160 165 L 220 225"/>
        <path d="M160 185 L 220 225"/><path d="M160 205 L 220 225"/>
      </g>

      <!-- MAC unit -->
      <g>
        <rect x="360" y="120" width="120" height="80" fill="#0c0e15" stroke="#5cf2a4" stroke-width="2"/>
        <text x="420" y="148" text-anchor="middle" fill="#5cf2a4" font-size="14" font-weight="bold">MAC</text>
        <text x="420" y="168" text-anchor="middle" fill="#5cf2a4" font-size="11">multiply-add</text>
        <text x="420" y="188" text-anchor="middle" fill="#7a8194" font-size="10">p × q gates</text>
      </g>

      <!-- wires mux→MAC -->
      <g stroke="#5cf2a4" stroke-width="1" fill="none">
        <path d="M300 85 L 360 140"/>
        <path d="M300 155 L 360 160"/>
        <path d="M300 225 L 360 180"/>
      </g>

      <!-- writeback -->
      <path d="M480 160 L 540 160 L 540 60 L 160 60" stroke="#38d9ff" stroke-width="1" fill="none" stroke-dasharray="3,2"/>
      <text x="550" y="100" fill="#38d9ff" font-size="10">writeback</text>

      <!-- COST BANNER -->
      <g transform="translate(40, 280)">
        <rect width="640" height="34" fill="#0c0e15" stroke="#ff3d8a"/>
        <text x="18" y="14" fill="#ff3d8a" font-size="10">⚠ AREA BUDGET</text>
        <text x="18" y="28" fill="#cbd1de" font-size="12">3 muxes × 8 inputs × p bits = 24p ANDs  vs.  MAC = ~4p gates  →  ~7/8 of cost is just MOVING DATA</text>
      </g>
    </svg>
  </div>

  <div class="insight">
    <p>
      <strong>Almost all the area in a classic CUDA core is just moving bytes</strong> — not doing
      arithmetic. ~7/8 of the gates feed the muxes that read and write the register file. This is
      the problem statement that motivated <span class="hi">Tensor Cores</span> and, before them,
      <span class="hi">systolic arrays</span>.
    </p>
  </div>
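  <div class="card">
    <span class="badge">sketch</span>
    <h3>Checking the area budget</h3>
    <p>The arithmetic in FIG-02 can be reproduced in a few lines. This is a rough sketch, assuming the figure's own estimates (one AND gate per mux input per bit, and about 4p gates for the MAC). The function name and its parameters are made up for illustration and do not come from the conversation.</p>
<pre>
```python
from fractions import Fraction


def data_movement_share(num_muxes: int, reg_entries: int, p_bits: int,
                        mac_gates_per_bit: int = 4) -> Fraction:
    """Fraction of gates spent on operand muxes rather than arithmetic.

    Assumes the figure's rough counts: one AND gate per mux input per bit,
    and about mac_gates_per_bit * p gates for the MAC itself.
    """
    mux_gates = num_muxes * reg_entries * p_bits
    mac_gates = mac_gates_per_bit * p_bits
    return Fraction(mux_gates, mux_gates + mac_gates)


share = data_movement_share(num_muxes=3, reg_entries=8, p_bits=4)
print(share, round(float(share), 3))  # 6/7 0.857
```
</pre>
    <p>With these counts the share comes out at 6/7 (about 86%), in the same ballpark as the ~7/8 quoted above. Doubling the register file to 16 entries pushes it to 12/13, which is the point: the tax grows with the storage you mux over, not with the math.</p>
  </div>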
</section>

<!-- ============================ S3 — SYSTOLIC ARRAYS ============================ -->
<section class="chunk" id="s3">
  <div class="chunk-head">
    <div class="num">03</div>
    <h2>Systolic arrays: tilting the ratio</h2>
    <div class="pill">tensor cores</div>
  </div>

  <div class="grid cols-2">
    <div class="card">
      <span class="badge">the move</span>
      <h3>Bake two loops into hardware</h3>
      <p>A single MAC bakes <strong>one</strong> level of the triple loop into silicon. A systolic array bakes <strong>two</strong>: an entire matrix-vector multiply becomes one fixed-function block.</p>
      <p>The unit goes from <em>scalar op</em> to <em>tile of ops</em>. Larger granularity means the same <strong>register-file tax</strong> is amortized over way more arithmetic.</p>
    </div>

    <div class="card accent-phos">
      <span class="badge">scaling property</span>
      <h3>Quadratic compute, linear comm</h3>
      <p>An <span class="hi">x × y</span> systolic array does <strong>x × y</strong> multiply-accumulates per cycle. But the data flowing in and out only scales as <strong>x</strong> (or x + y).</p>
      <div class="formula">
compute   ∝  <span class="v">x · y</span>     <span class="c">(quadratic)</span>
i/o wires ∝  <span class="v">x</span>         <span class="c">(linear)</span>
ratio     →  <span class="v">y</span>× better as it grows
      </div>
      <p>The bigger the array, the better the ratio. Older TPUs ran <strong>128 × 128</strong>.</p>
    </div>
  </div>
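  <div class="card">
    <span class="badge">sketch</span>
    <h3>The ratio, as numbers</h3>
    <p>To make the quadratic-vs-linear claim concrete, here is a small sketch. It assumes the I/O cost is simply the number of edge wires, counted as x + y (the parenthetical case above), and ignores bit widths. The helper name <code class="inline">macs_per_io_wire</code> is hypothetical.</p>
<pre>
```python
def macs_per_io_wire(x: int, y: int) -> float:
    """MACs per cycle divided by edge wires, for an x-by-y array."""
    macs = x * y          # every cell does one multiply-accumulate per cycle
    io_wires = x + y      # activations enter on one edge, sums leave on another
    return macs / io_wires


for n in (2, 128, 256):
    print(n, macs_per_io_wire(n, n))
# 2 1.0
# 128 64.0
# 256 128.0
```
</pre>
    <p>Doubling the side of a square array doubles the arithmetic you get per wire, which is why the pressure is always toward bigger arrays.</p>
  </div>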

  <!-- DIAGRAM: systolic array -->
  <div class="diagram">
    <div class="label">// FIG-03 — 2×2 systolic array: weights stay, activations flow</div>
    <svg viewBox="0 0 720 380" xmlns="http://www.w3.org/2000/svg">

      <!-- input vector top -->
      <g>
        <text x="200" y="30" fill="#7a8194" font-size="11">activations stream in →</text>
        <rect x="200" y="40" width="60" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="1.5"/>
        <text x="230" y="65" text-anchor="middle" fill="#38d9ff" font-size="16">7</text>
        <rect x="280" y="40" width="60" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="1.5"/>
        <text x="310" y="65" text-anchor="middle" fill="#38d9ff" font-size="16">3</text>
      </g>

      <!-- 2x2 MAC grid with stored weights -->
      <g>
        <!-- top-left -->
        <rect x="200" y="110" width="60" height="60" fill="#0c0e15" stroke="#5cf2a4" stroke-width="2"/>
        <text x="230" y="135" text-anchor="middle" fill="#5cf2a4" font-size="11">w=0</text>
        <text x="230" y="155" text-anchor="middle" fill="#cbd1de" font-size="10">MAC</text>

        <!-- top-right -->
        <rect x="280" y="110" width="60" height="60" fill="#0c0e15" stroke="#5cf2a4" stroke-width="2"/>
        <text x="310" y="135" text-anchor="middle" fill="#5cf2a4" font-size="11">w=1</text>
        <text x="310" y="155" text-anchor="middle" fill="#cbd1de" font-size="10">MAC</text>

        <!-- bottom-left -->
        <rect x="200" y="180" width="60" height="60" fill="#0c0e15" stroke="#5cf2a4" stroke-width="2"/>
        <text x="230" y="205" text-anchor="middle" fill="#5cf2a4" font-size="11">w=3</text>
        <text x="230" y="225" text-anchor="middle" fill="#cbd1de" font-size="10">MAC</text>

        <!-- bottom-right -->
        <rect x="280" y="180" width="60" height="60" fill="#0c0e15" stroke="#5cf2a4" stroke-width="2"/>
        <text x="310" y="205" text-anchor="middle" fill="#5cf2a4" font-size="11">w=2</text>
        <text x="310" y="225" text-anchor="middle" fill="#cbd1de" font-size="10">MAC</text>
      </g>

      <!-- flow arrows: activations top -> down -->
      <g stroke="#38d9ff" stroke-width="1.5" fill="none" marker-end="url(#arrCyan)">
        <path d="M230 80 L 230 110"/>
        <path d="M310 80 L 310 110"/>
        <path d="M230 170 L 230 180"/>
        <path d="M310 170 L 310 180"/>
      </g>

      <!-- partial sums flow down -->
      <g stroke="#ff3d8a" stroke-width="1.5" fill="none" marker-end="url(#arrMag)">
        <path d="M230 240 L 230 270"/>
        <path d="M310 240 L 310 270"/>
      </g>

      <!-- output -->
      <g>
        <rect x="200" y="270" width="60" height="40" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.5"/>
        <text x="230" y="295" text-anchor="middle" fill="#ff3d8a" font-size="16">9</text>
        <rect x="280" y="270" width="60" height="40" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.5"/>
        <text x="310" y="295" text-anchor="middle" fill="#ff3d8a" font-size="16">13</text>
        <text x="200" y="335" fill="#7a8194" font-size="11">↓ output vector (column dot-products)</text>
      </g>

      <!-- right side commentary -->
      <g font-family="JetBrains Mono" font-size="11">
        <text x="400" y="50" fill="#7a8194">// WEIGHTS</text>
        <text x="400" y="68" fill="#cbd1de">stay put. loaded once,</text>
        <text x="400" y="84" fill="#cbd1de">reused thousands of times.</text>
        <text x="400" y="100" fill="#5cf2a4">→ huge compute reuse</text>

        <text x="400" y="135" fill="#7a8194">// ACTIVATIONS</text>
        <text x="400" y="153" fill="#cbd1de">flow top → bottom. only</text>
        <text x="400" y="169" fill="#cbd1de">x wires of input bandwidth.</text>
        <text x="400" y="185" fill="#38d9ff">→ linear i/o cost</text>

        <text x="400" y="220" fill="#7a8194">// PARTIAL SUMS</text>
        <text x="400" y="238" fill="#cbd1de">accumulate down columns →</text>
        <text x="400" y="254" fill="#cbd1de">column dot-products fall out</text>
        <text x="400" y="270" fill="#ff3d8a">at the bottom edge.</text>

        <text x="400" y="305" fill="#7a8194">// LOADING WEIGHTS</text>
        <text x="400" y="323" fill="#cbd1de">trickled in row by row as a</text>
        <text x="400" y="339" fill="#cbd1de">daisy chain — slow but cheap,</text>
        <text x="400" y="355" fill="#ffb547">since it happens rarely.</text>
      </g>

      <!-- defs for arrowheads -->
      <defs>
        <marker id="arrCyan" markerWidth="6" markerHeight="6" refX="3" refY="3" orient="auto">
          <path d="M0,0 L6,3 L0,6 z" fill="#38d9ff"/>
        </marker>
        <marker id="arrMag" markerWidth="6" markerHeight="6" refX="3" refY="3" orient="auto">
          <path d="M0,0 L6,3 L0,6 z" fill="#ff3d8a"/>
        </marker>
      </defs>
    </svg>
  </div>
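  <div class="card">
    <span class="badge">sketch</span>
    <h3>FIG-03 as a loop</h3>
    <p>A functional sketch of the weight-stationary idea, using the weights and activations drawn in FIG-03. It assumes activation i is shared by every cell in row i and that partial sums run down the columns. It is not cycle-accurate: the timing skew a real array uses is ignored, and <code class="inline">systolic_matvec</code> is an illustrative name.</p>
<pre>
```python
def systolic_matvec(weights, activations):
    """Column dot-products of a stationary weight grid with one input vector."""
    rows, cols = len(weights), len(weights[0])
    partial = [0] * cols                 # partial sums entering the top edge
    for r in range(rows):                # walk down the rows of MAC cells
        a = activations[r]               # activation shared along row r
        partial = [partial[c] + weights[r][c] * a for c in range(cols)]
    return partial                       # what falls out of the bottom edge


W = [[0, 1],
     [3, 2]]
print(systolic_matvec(W, [7, 3]))  # [9, 13]
```
</pre>
    <p>The weights in <code class="inline">W</code> never move. Streaming a new activation vector through reuses all of them, which is where the amortization comes from.</p>
  </div>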

  <div class="grid cols-2">
    <div class="card accent-amb">
      <span class="badge">sizing decision</span>
      <h3>How big should the array be?</h3>
      <p>A huge systolic array means more amortization. But it also leaves <strong>less area and flexibility</strong> for the register file and other ops.</p>
      <p>Reiner's framing: set a budget — e.g. <span class="hi-a">10% of die area on data movement, 90% on the array</span> — and size everything from there.</p>
      <p>Bigger register files = more application performance but less array.</p>
    </div>
    <div class="card accent-mag">
      <span class="badge">MatX hint</span>
      <h3>Splittable systolic arrays</h3>
      <p>Reiner mentions MatX has a "<strong class="hi-m">splittable systolic array</strong>" — big arrays that can also operate as several small ones.</p>
      <p>It's the obvious compromise between TPU's coarse granularity and GPU's many-small-cores layout. We'll come back to this in §09.</p>
    </div>
  </div>
</section>

<!-- ============================ S4 — CLOCK CYCLES ============================ -->
<section class="chunk" id="s4">
  <div class="chunk-head">
    <div class="num">04</div>
    <h2>Clock cycles & pipeline registers</h2>
    <div class="pill">timing</div>
  </div>

  <div class="grid cols-2">
    <div class="card">
      <span class="badge">why a clock?</span>
      <h3>100 billion transistors, in lockstep</h3>
      <p>Chips are <strong>massively</strong> parallel. Software-style synchronization (mutexes, locks) would be way too slow, so instead, roughly once every nanosecond, <strong>everything pauses simultaneously</strong>.</p>
      <p>That moment is the <span class="hi">clock cycle</span>. It is mediated by registers: tiny storage devices that latch whatever value is on their input wire at the tick.</p>
    </div>

    <div class="card accent-cyan">
      <span class="badge">the constraint</span>
      <h3>Logic must finish before the tick</h3>
      <p>If your "cloud of logic" between two registers takes longer than the clock period, you lose. The signal hasn't settled.</p>
      <p>So a major job in chip design is making the <strong>longest path</strong> through any cloud of logic as short as possible.</p>
      <p>Designers leave ~25% timing margin (slack) so the chip basically never misses.</p>
    </div>
  </div>

  <!-- DIAGRAM: pipeline register insertion -->
  <div class="diagram">
    <div class="label">// FIG-04 — Pipeline register insertion: trade area for clock speed</div>
    <svg viewBox="0 0 720 280" xmlns="http://www.w3.org/2000/svg">
      <!-- BEFORE -->
      <g>
        <text x="40" y="30" fill="#7a8194" font-size="11">BEFORE — long logic, 1 GHz max</text>
        <rect x="40" y="50" width="20" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="2"/>
        <text x="50" y="76" text-anchor="middle" fill="#38d9ff" font-size="13">R</text>

        <path d="M60 70 L 80 70 L 80 50 L 220 50 L 240 70 L 220 90 L 80 90 L 80 70 Z"
              fill="#0c0e15" stroke="#5cf2a4" stroke-width="1.5"/>
        <text x="155" y="74" text-anchor="middle" fill="#5cf2a4" font-size="12">logic cloud (delay = T)</text>

        <rect x="240" y="50" width="20" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="2"/>
        <text x="250" y="76" text-anchor="middle" fill="#38d9ff" font-size="13">R</text>

        <text x="40" y="115" fill="#7a8194" font-size="10">f_max ≈ 1/T</text>
      </g>

      <!-- AFTER: split with pipeline reg -->
      <g transform="translate(0, 130)">
        <text x="40" y="30" fill="#7a8194" font-size="11">AFTER — split with pipeline register, 2 GHz max</text>
        <rect x="40" y="50" width="20" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="2"/>
        <text x="50" y="76" text-anchor="middle" fill="#38d9ff" font-size="13">R</text>

        <path d="M60 70 L 80 70 L 80 50 L 130 50 L 150 70 L 130 90 L 80 90 L 80 70 Z"
              fill="#0c0e15" stroke="#5cf2a4" stroke-width="1.5"/>
        <text x="105" y="74" text-anchor="middle" fill="#5cf2a4" font-size="11">half-logic</text>

        <rect x="150" y="50" width="20" height="40" fill="#0c0e15" stroke="#ffb547" stroke-width="2"/>
        <text x="160" y="76" text-anchor="middle" fill="#ffb547" font-size="13">R</text>

        <path d="M170 70 L 190 70 L 190 50 L 240 50 L 260 70 L 240 90 L 190 90 L 190 70 Z"
              fill="#0c0e15" stroke="#5cf2a4" stroke-width="1.5"/>
        <text x="215" y="74" text-anchor="middle" fill="#5cf2a4" font-size="11">half-logic</text>

        <rect x="260" y="50" width="20" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="2"/>
        <text x="270" y="76" text-anchor="middle" fill="#38d9ff" font-size="13">R</text>

        <text x="160" y="40" text-anchor="middle" fill="#ffb547" font-size="10">↑ inserted register</text>
        <text x="40" y="115" fill="#7a8194" font-size="10">f_max ≈ 2/T  (twice the speed, +1 register area)</text>
      </g>

      <!-- right side: feedback loop case -->
      <g transform="translate(340, 0)">
        <text x="0" y="30" fill="#7a8194" font-size="11">THE HARD CASE — feedback loop</text>
        <text x="0" y="50" fill="#cbd1de" font-size="11">A running sum: reads its own value and adds.</text>

        <g transform="translate(0, 70)">
          <rect x="60" y="20" width="20" height="40" fill="#0c0e15" stroke="#38d9ff" stroke-width="2"/>
          <text x="70" y="46" text-anchor="middle" fill="#38d9ff" font-size="13">R</text>

          <path d="M80 40 L 110 40 L 110 20 L 160 20 L 180 40 L 160 60 L 110 60 L 110 40 Z"
                fill="#0c0e15" stroke="#5cf2a4" stroke-width="1.5"/>
          <text x="135" y="44" text-anchor="middle" fill="#5cf2a4" font-size="12">+</text>

          <path d="M180 40 L 210 40 L 210 100 L 60 100 L 60 60" fill="none" stroke="#ff3d8a" stroke-width="1.5" stroke-dasharray="3,2" marker-end="url(#arrMag2)"/>
          <text x="210" y="116" text-anchor="end" fill="#ff3d8a" font-size="10">feedback</text>
        </g>

        <text x="0" y="200" fill="#cbd1de" font-size="11" font-style="italic">You can't just insert a pipeline reg —</text>
        <text x="0" y="216" fill="#cbd1de" font-size="11" font-style="italic">it would split the sum into "evens" and "odds".</text>
        <text x="0" y="240" fill="#ff3d8a" font-size="11">→ feedback loops set the chip's max clock.</text>
      </g>

      <defs>
        <marker id="arrMag2" markerWidth="6" markerHeight="6" refX="3" refY="3" orient="auto">
          <path d="M0,0 L6,3 L0,6 z" fill="#ff3d8a"/>
        </marker>
      </defs>
    </svg>
  </div>
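  <div class="card">
    <span class="badge">sketch</span>
    <h3>Why the feedback loop splits into evens and odds</h3>
    <p>The hard case in FIG-04 is easy to see in a toy simulation. This sketch models the accumulator as an adder whose output passes through a chain of registers before coming back to its input. The function and parameter names are invented for illustration.</p>
<pre>
```python
def accumulate(values, feedback_registers=1):
    """Running sum whose adder output passes through a chain of registers
    before it is fed back to the adder's input."""
    chain = [0] * feedback_registers
    for v in values:
        total = chain[0] + v           # adder sees the oldest stored value
        chain = chain[1:] + [total]    # clock tick: shift the chain by one
    return chain


print(accumulate([1, 2, 3, 4, 5, 6]))                        # [21]
print(accumulate([1, 2, 3, 4, 5, 6], feedback_registers=2))  # [9, 12]
```
</pre>
    <p>With one register you get the real running sum, 21. With an extra pipeline register in the loop you get two interleaved sums instead: 1 + 3 + 5 = 9 and 2 + 4 + 6 = 12. The hardware is faster but it is no longer computing the same thing.</p>
  </div>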

  <div class="insight">
    <p>
      <strong>Latency vs throughput is a real knob.</strong> Outside of feedback loops, you can keep
      raising clock speed by stuffing pipeline registers everywhere, but past a point almost all your
      area is registers, not logic. <em>Same energy lesson as last episode's batch-size talk: high clock
      / low batch favors latency. Lower clock / wider arrays favor throughput.</em>
    </p>
  </div>
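  <div class="card">
    <span class="badge">sketch</span>
    <h3>Diminishing returns from more stages</h3>
    <p>A back-of-envelope version of the f_max ≈ 1/T rule from FIG-04. The sketch assumes the logic can be cut into equal stages and that each register adds a small fixed delay. The 0.05 ns overhead is a made-up figure for illustration, not a number from the conversation.</p>
<pre>
```python
def f_max_ghz(logic_delay_ns: float, stages: int = 1,
              register_overhead_ns: float = 0.0) -> float:
    """Highest clock (GHz) if the logic is cut into equal pipeline stages."""
    period_ns = logic_delay_ns / stages + register_overhead_ns
    return 1.0 / period_ns


print(f_max_ghz(1.0))                                   # 1.0
print(f_max_ghz(1.0, stages=2))                         # 2.0
print(round(f_max_ghz(1.0, stages=20,
                      register_overhead_ns=0.05), 2))   # 10.0
```
</pre>
    <p>One split doubles the clock. Twenty splits with even a small per-register overhead only reach 10×, while adding 19 registers per path. That is the "almost all your area is registers" regime.</p>
  </div>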
</section>

<!-- ============================ S5 — FPGA vs ASIC ============================ -->
<section class="chunk" id="s5">
  <div class="chunk-head">
    <div class="num">05</div>
    <h2>FPGA vs ASIC</h2>
    <div class="pill">reconfigurability</div>
  </div>

  <div class="vs">
    <div class="side left">
      <h4>FPGA</h4>
      <ul>
        <li>First unit: <strong>~$10K</strong>.</li>
        <li>Reconfigurable in the field — change the design any time.</li>
        <li>Built from <strong>LUTs + registers + a giant mesh of muxes</strong>.</li>
        <li>~10× more expensive in area and energy than ASIC.</li>
        <li>Great when you change the workload often (e.g. HFT, prototyping).</li>
      </ul>
    </div>
    <div class="divider">vs</div>
    <div class="side right">
      <h4>ASIC</h4>
      <ul>
        <li>First unit: <strong>~$30M</strong> (a full tape-out).</li>
        <li>Frozen at fabrication. No changing the logic.</li>
        <li>Custom polysilicon and wires — minimum gates for the job.</li>
        <li>~10× cheaper & more efficient than the equivalent FPGA.</li>
        <li>Worth it once volume + stability justify the NRE cost.</li>
      </ul>
    </div>
  </div>

  <div class="grid cols-2">
    <div class="card accent-cyan">
      <span class="badge">primitive</span>
      <h3>The LUT: a 4→1 truth table in silicon</h3>
      <p>A typical FPGA "lookup table" has <strong>4 input bits, 1 output bit</strong>. Inside it is a 16-entry table stored in configuration memory.</p>
      <p>By writing different 16-bit patterns into that memory, the LUT becomes AND, OR, XOR, NAND, a 3-way majority, a 4-way parity — anything.</p>
      <p>That's where the "field-programmable" comes from: muxes route signals between LUTs, LUTs configure into any gate. <strong class="hi-c">It's muxes all the way down.</strong></p>
    </div>

    <div class="card accent-mag">
      <span class="badge">why the 10× tax</span>
      <h3>Programmability has a cost</h3>
      <p>An ASIC implements a 4-way AND with literally <strong>3 AND gates</strong>.</p>
      <p>An FPGA implements the same thing with one LUT — which internally is ~<strong>32 gates</strong> of muxes selecting from a 16-entry table.</p>
      <p>That's the ~10× tax. Plus the routing muxes between LUTs cost area and add wire delay.</p>
    </div>
  </div>

  <!-- DIAGRAM: LUT -->
  <div class="diagram">
    <div class="label">// FIG-05 — 4-input LUT: a programmable truth table</div>
    <svg viewBox="0 0 720 240" xmlns="http://www.w3.org/2000/svg">
      <!-- inputs -->
      <g font-family="JetBrains Mono" font-size="11" fill="#cbd1de">
        <text x="30" y="60">a →</text>
        <text x="30" y="100">b →</text>
        <text x="30" y="140">c →</text>
        <text x="30" y="180">d →</text>
      </g>

      <!-- 4 input muxes from "soup" -->
      <g>
        <rect x="80" y="45" width="50" height="30" fill="#0c0e15" stroke="#38d9ff"/>
        <text x="105" y="64" text-anchor="middle" fill="#38d9ff" font-size="10">mux8→1</text>
        <rect x="80" y="85" width="50" height="30" fill="#0c0e15" stroke="#38d9ff"/>
        <text x="105" y="104" text-anchor="middle" fill="#38d9ff" font-size="10">mux8→1</text>
        <rect x="80" y="125" width="50" height="30" fill="#0c0e15" stroke="#38d9ff"/>
        <text x="105" y="144" text-anchor="middle" fill="#38d9ff" font-size="10">mux8→1</text>
        <rect x="80" y="165" width="50" height="30" fill="#0c0e15" stroke="#38d9ff"/>
        <text x="105" y="184" text-anchor="middle" fill="#38d9ff" font-size="10">mux8→1</text>
      </g>

      <text x="80" y="220" fill="#7a8194" font-size="10">↑ select from nearby LUTs / registers</text>

      <!-- LUT body -->
      <g>
        <rect x="200" y="50" width="240" height="160" fill="#0c0e15" stroke="#5cf2a4" stroke-width="2"/>
        <text x="320" y="42" text-anchor="middle" fill="#5cf2a4" font-size="11">16-ENTRY TRUTH TABLE</text>
        <!-- truth table -->
        <g font-family="JetBrains Mono" font-size="10" fill="#cbd1de">
          <text x="218" y="72">0000→0</text><text x="288" y="72">0100→1</text><text x="358" y="72">1000→0</text>
          <text x="218" y="92">0001→1</text><text x="288" y="92">0101→0</text><text x="358" y="92">1001→1</text>
          <text x="218" y="112">0010→1</text><text x="288" y="112">0110→1</text><text x="358" y="112">1010→1</text>
          <text x="218" y="132">0011→0</text><text x="288" y="132">0111→0</text><text x="358" y="132">1011→0</text>
          <text x="218" y="152">— program-</text><text x="288" y="152">able 16-</text><text x="358" y="152">bit memory</text>
          <text x="218" y="172">defines</text><text x="288" y="172">which gate</text><text x="358" y="172">this LUT is</text>
        </g>
        <text x="320" y="200" text-anchor="middle" fill="#ffb547" font-size="10">cost ≈ 32 gates per LUT (vs 3 gates in ASIC)</text>
      </g>

      <!-- output -->
      <g>
        <rect x="490" y="115" width="40" height="30" fill="#0c0e15" stroke="#ff3d8a"/>
        <text x="510" y="134" text-anchor="middle" fill="#ff3d8a" font-size="11">OUT</text>
      </g>

      <!-- wires -->
      <g stroke="#2b3145" stroke-width="0.8" fill="none">
        <path d="M130 60 L 200 70"/>
        <path d="M130 100 L 200 110"/>
        <path d="M130 140 L 200 150"/>
        <path d="M130 180 L 200 190"/>
        <path d="M440 130 L 490 130"/>
      </g>

      <text x="550" y="80" fill="#7a8194" font-size="10">FIELD-PROGRAMMABLE</text>
      <text x="550" y="98" fill="#cbd1de" font-size="10">"field" = deployed in</text>
      <text x="550" y="114" fill="#cbd1de" font-size="10">the wild, not at fab time</text>
      <text x="550" y="148" fill="#7a8194" font-size="10">CONFIG</text>
      <text x="550" y="166" fill="#cbd1de" font-size="10">16-bit per LUT +</text>
      <text x="550" y="182" fill="#cbd1de" font-size="10">mux selector bits</text>
    </svg>
  </div>
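  <div class="card">
    <span class="badge">sketch</span>
    <h3>A LUT is just a 16-bit number</h3>
    <p>The "write a different 16-bit pattern, get a different gate" idea can be shown directly. This is a software sketch, assuming input <code class="inline">a</code> is the most significant index bit (the bit order is a choice made here, not something the notes specify). Both function names are hypothetical.</p>
<pre>
```python
def program_lut4(fn) -> int:
    """Build the 16-bit configuration word for any 4-input boolean function."""
    config = 0
    for index in range(16):
        a, b, c, d = (index // 8) % 2, (index // 4) % 2, (index // 2) % 2, index % 2
        config += fn(a, b, c, d) * 2 ** index   # bit `index` holds the output
    return config


def lut4(config: int, a: int, b: int, c: int, d: int) -> int:
    """Evaluate the LUT: the four inputs select one bit of the config word."""
    index = a * 8 + b * 4 + c * 2 + d
    return (config // 2 ** index) % 2


and4 = program_lut4(lambda a, b, c, d: a * b * c * d)
parity4 = program_lut4(lambda a, b, c, d: (a + b + c + d) % 2)
print(hex(and4), hex(parity4))                            # 0x8000 0x6996
print(lut4(and4, 1, 1, 1, 1), lut4(parity4, 1, 0, 1, 1))  # 1 1
```
</pre>
    <p>Same <code class="inline">lut4</code> hardware, two different config words, two different gates. The selection step inside <code class="inline">lut4</code> is the 16-to-1 mux that the gate-count tax pays for.</p>
  </div>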
</section>

<!-- ============================ S6 — Cache vs Scratchpad ============================ -->
<section class="chunk" id="s6">
  <div class="chunk-head">
    <div class="num">06</div>
    <h2>Cache vs scratchpad: who decides what's hot?</h2>
    <div class="pill">memory model</div>
  </div>

  <div class="vs">
    <div class="side left">
      <h4>CPU / Cache</h4>
      <ul>
        <li>One "read memory" instruction. <strong>Hardware decides</strong> if data is in cache.</li>
        <li>Cache is ~100× faster than DDR — programs need it to run at reasonable speed.</li>
        <li>But hit/miss depends on the ambient environment: other programs, recent accesses, replacement policy.</li>
        <li><strong class="hi-c">Non-deterministic latency.</strong></li>
      </ul>
    </div>
    <div class="divider">vs</div>
    <div class="side right">
      <h4>TPU / Scratchpad</h4>
      <ul>
        <li>Two distinct instructions: <strong>"read scratchpad"</strong> and <strong>"read HBM"</strong>.</li>
        <li>Software is responsible for placing data in the right tier.</li>
        <li>Same idea, totally different control surface.</li>
        <li><strong class="hi-m">Deterministic latency — by construction.</strong></li>
      </ul>
    </div>
  </div>
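  <div class="card">
    <span class="badge">sketch</span>
    <h3>Same read, different latency</h3>
    <p>A toy model of the two control surfaces. It assumes an LRU replacement policy and latencies of 1 ns and 100 ns, both picked only to match the "~100×" figure above. Real caches and real scratchpad instructions are more involved, and every name here is invented for illustration.</p>
<pre>
```python
from collections import OrderedDict

FAST_NS, SLOW_NS = 1, 100   # illustrative latencies, about 100x apart


class LruCache:
    """Hardware-managed: one read() call, latency depends on past accesses."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()

    def read(self, addr: int) -> int:
        if addr in self.lines:
            self.lines.move_to_end(addr)      # mark as most recently used
            return FAST_NS
        self.lines[addr] = True               # miss: fetch and install
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)    # evict least recently used
        return SLOW_NS


def read_scratchpad(addr: int) -> int:
    return FAST_NS    # software put the data here, so the cost is known


def read_hbm(addr: int) -> int:
    return SLOW_NS


cache = LruCache(capacity=2)
print([cache.read(a) for a in (0, 0, 1, 2, 0)])  # [100, 1, 100, 100, 100]
print(read_scratchpad(0), read_hbm(0))           # 1 100
```
</pre>
    <p>Address 0 costs 100, then 1, then 100 again, depending on what else touched the cache in between. The scratchpad version has no such history: the instruction you chose tells you the latency.</p>
  </div>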

  <div class="card">
    <span class="badge">key insight</span>
    <h3>FPGAs win on latency because cache is gone</h3>
    <p>Reiner notes you <em>could</em> build a CPU with deterministic latency — and some chips do (Groq, TPU compute cores). It's actually a simpler starting point. CPUs are non-deterministic because someone added caches and branch prediction. You can take them out — you just give up performance per cycle to do it.</p>
    <p>This is why HFT shops like Jane Street reach for FPGAs: predictable per-packet latency matters more than peak throughput.</p>
  </div>
</section>

<!-- ============================ S7 — CPU vs GPU CORES ============================ -->
<section class="chunk" id="s7">
  <div class="chunk-head">
    <div class="num">07</div>
    <h2>Why CPU cores are bigger than GPU cores</h2>
    <div class="pill">architecture</div>
  </div>

  <div class="grid cols-2">
    <div class="card accent-cyan">
      <span class="badge">CPU</span>
      <h3>~100 cores × big & complicated</h3>
      <p>A modern CPU has ~100 cores doing ~16-way SIMD, so on the order of 1,000-way parallelism. But each core is <strong>huge</strong>.</p>
      <p>Where does the die go? Mostly:</p>
      <ul>
        <li><strong>Cache hierarchy</strong></li>
        <li><strong>Register files</strong></li>
        <li><span class="hi-c">Branch predictor</span> (the GPU-killer)</li>
        <li>ALUs (small fraction)</li>
      </ul>
    </div>
    <div class="card accent-mag">
      <span class="badge">GPU</span>
      <h3>Many small SMs, no branch predictor</h3>
      <p>A GPU rips out a lot of CPU baggage, most importantly the branch predictor, and shrinks the register files. Result: way more cores and less area per ALU.</p>
      <p>The trade is that GPUs are bad at branchy serial code. They're great at SIMD throughput where you don't need to guess where the next instruction lives.</p>
    </div>
  </div>

  <div class="card">
    <span class="badge">what does a branch predictor do?</span>
    <h3>It predicts the future ~5 cycles ahead</h3>
    <p>A single instruction takes ~5 ns to process: read, decode, evaluate, write back. To run at 1–2 GHz, the CPU must <strong>keep pipelining new instructions</strong> while old ones finish. But if an old instruction is a branch (an <code class="inline">if</code>), the CPU doesn't yet know which way to go.</p>
    <p>The branch predictor guesses — based on history, target tables, and pattern detection — and the pipeline runs ahead speculatively. On a misprediction, the speculative work is thrown away. That's why a tight branchy loop on a CPU benefits enormously from a sophisticated predictor.</p>
    <p>GPUs don't have one because they don't need one. They rely on having <strong>so many threads in flight</strong> that they can just switch to ready work while a branch resolves.</p>
  </div>
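  <div class="card">
    <span class="badge">sketch</span>
    <h3>What a misprediction costs</h3>
    <p>To put numbers on "speculative work is thrown away", here is a deliberately crude model. It assumes the simplest possible predictor (guess whatever the branch did last time) and a flat 5-cycle flush per miss, matching the ~5 cycles above. Real predictors are far more sophisticated, and the function names are hypothetical.</p>
<pre>
```python
def mispredictions(outcomes, first_guess=True):
    """Count wrong guesses for a predictor that repeats the last outcome."""
    guess, misses = first_guess, 0
    for taken in outcomes:
        if guess != taken:
            misses += 1
        guess = taken
    return misses


def wasted_cycles(outcomes, flush_penalty=5):
    """Cycles of speculative work thrown away, at ~5 cycles per miss."""
    return mispredictions(outcomes) * flush_penalty


loop_branch = [True] * 99 + [False]       # a 100-iteration loop
coin_flip = [True, False] * 50            # alternates every time
print(wasted_cycles(loop_branch), wasted_cycles(coin_flip))  # 5 495
```
</pre>
    <p>The loop branch is nearly free even with this naive predictor. The alternating branch defeats it almost every time, which is the kind of pattern the history and pattern-detection machinery exists to catch.</p>
  </div>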
</section>

<!-- ============================ S8 — BRAINS VS CHIPS ============================ -->
<section class="chunk" id="s8">
  <div class="chunk-head">
    <div class="num">08</div>
    <h2>Brains vs chips</h2>
    <div class="pill">analogies & limits</div>
  </div>

  <div class="grid cols-2">
    <div class="card accent-amb">
      <span class="badge">structural differences</span>
      <h3>Where the analogy holds, and where it breaks</h3>
      <ul>
        <li><strong>Memory ↔ compute co-location:</strong> brains do it natively, but systolic arrays do too — the weight sits where the math happens.</li>
        <li><strong>Sparsity:</strong> brains are <em>unstructured</em> sparse. Chips can do structured sparsity but pay a tax for unstructured.</li>
        <li><strong>Clock speed:</strong> brain runs at maybe kilohertz. Chips run at gigahertz.</li>
      </ul>
    </div>

    <div class="card accent-phos">
      <span class="badge">energy</span>
      <h3>Why slow doesn't mean efficient</h3>
      <p>Most of a chip's energy is in <span class="hi">switching bits 0↔1</span> — charging and discharging tiny capacitors. Static idle power is much smaller.</p>
      <p>So if you ran a GPU at 1 MHz instead of 1 GHz, you'd draw ~1,000× less power. But you'd also do ~1,000× less work per second. <strong>Per operation, you don't save much.</strong></p>
      <p>The brain isn't more efficient simply by being slow. Something else is going on — likely a combination of co-location, sparsity, and analog computation.</p>
    </div>
  </div>
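  <div class="card">
    <span class="badge">sketch</span>
    <h3>Power scales with clock, energy per op doesn't</h3>
    <p>The argument above follows from the usual switching-power relation, power ≈ C · V² · f (up to an activity factor). The sketch below assumes voltage stays fixed and uses made-up capacitance and voltage values purely for illustration. The function names are hypothetical.</p>
<pre>
```python
def dynamic_power_w(switched_capacitance_f: float, voltage_v: float,
                    clock_hz: float) -> float:
    """Switching power: energy per cycle (C * V^2) times cycles per second."""
    return switched_capacitance_f * voltage_v ** 2 * clock_hz


def energy_per_cycle_j(switched_capacitance_f: float, voltage_v: float) -> float:
    """Energy to charge and discharge the switched capacitance once."""
    return switched_capacitance_f * voltage_v ** 2


C, V = 1e-9, 0.8                      # made-up values, for illustration only
fast = dynamic_power_w(C, V, 1e9)     # 1 GHz
slow = dynamic_power_w(C, V, 1e6)     # 1 MHz
print(round(fast / slow))             # 1000
print(energy_per_cycle_j(C, V))       # same at either clock speed
```
</pre>
    <p>Clock frequency drops out of the per-cycle energy entirely. Real chips can usually run at a lower voltage when clocked slower, which this sketch ignores, but the clock alone buys nothing per operation.</p>
  </div>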
</section>

<!-- ============================ S9 — GPU = tiny TPUs ============================ -->
<section class="chunk" id="s9">
  <div class="chunk-head">
    <div class="num">09</div>
    <h2>A GPU is just a bunch of tiny TPUs</h2>
    <div class="pill">tile sizing</div>
  </div>

  <!-- DIAGRAM: gpu vs tpu floorplan -->
  <div class="diagram">
    <div class="label">// FIG-06 — Floor plan comparison: GPU's many small tiles vs TPU's few big tiles</div>
    <svg viewBox="0 0 720 320" xmlns="http://www.w3.org/2000/svg">
      <!-- GPU side -->
      <g>
        <text x="40" y="30" fill="#7a8194" font-size="11">GPU — many small SMs around an L2</text>
        <rect x="40" y="50" width="300" height="240" fill="#0c0e15" stroke="#38d9ff" stroke-width="2"/>

        <!-- top row of SMs -->
        <g fill="#0c0e15" stroke="#38d9ff" stroke-width="0.8">
          <rect x="55" y="65" width="40" height="40"/>
          <rect x="100" y="65" width="40" height="40"/>
          <rect x="145" y="65" width="40" height="40"/>
          <rect x="190" y="65" width="40" height="40"/>
          <rect x="235" y="65" width="40" height="40"/>
          <rect x="280" y="65" width="40" height="40"/>
        </g>
        <!-- L2 -->
        <rect x="55" y="115" width="265" height="90" fill="#11141d" stroke="#7a8194"/>
        <text x="187" y="165" text-anchor="middle" fill="#7a8194" font-size="13">L2 cache</text>
        <!-- bottom row of SMs -->
        <g fill="#0c0e15" stroke="#38d9ff" stroke-width="0.8">
          <rect x="55" y="215" width="40" height="40"/>
          <rect x="100" y="215" width="40" height="40"/>
          <rect x="145" y="215" width="40" height="40"/>
          <rect x="190" y="215" width="40" height="40"/>
          <rect x="235" y="215" width="40" height="40"/>
          <rect x="280" y="215" width="40" height="40"/>
        </g>

        <text x="75" y="92" fill="#38d9ff" font-size="9">SM</text>
        <text x="120" y="92" fill="#38d9ff" font-size="9">SM</text>
        <text x="165" y="92" fill="#38d9ff" font-size="9">SM</text>
        <text x="210" y="92" fill="#38d9ff" font-size="9">SM</text>
        <text x="255" y="92" fill="#38d9ff" font-size="9">SM</text>
        <text x="300" y="92" fill="#38d9ff" font-size="9">SM</text>

        <text x="40" y="305" fill="#cbd1de" font-size="10">each SM ≈ small TPU: tensor core + vector unit</text>
      </g>

      <!-- TPU side -->
      <g>
        <text x="380" y="30" fill="#7a8194" font-size="11">TPU — few big matrix units (MXUs) with one vector unit</text>
        <rect x="380" y="50" width="300" height="240" fill="#0c0e15" stroke="#ff3d8a" stroke-width="2"/>

        <!-- top MXU -->
        <rect x="395" y="65" width="270" height="80" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.2"/>
        <text x="530" y="110" text-anchor="middle" fill="#ff3d8a" font-size="13">MXU (big systolic array)</text>

        <!-- vector unit -->
        <rect x="395" y="155" width="270" height="40" fill="#11141d" stroke="#ffb547"/>
        <text x="530" y="180" text-anchor="middle" fill="#ffb547" font-size="12">vector unit</text>

        <!-- bottom MXU -->
        <rect x="395" y="205" width="270" height="70" fill="#0c0e15" stroke="#ff3d8a" stroke-width="1.2"/>
        <text x="530" y="245" text-anchor="middle" fill="#ff3d8a" font-size="13">MXU (big systolic array)</text>

        <text x="380" y="305" fill="#cbd1de" font-size="10">amortizes the register-file tax across a much larger tile</text>
      </g>
    </svg>
  </div>

  <div class="grid cols-2">
    <div class="card accent-cyan">
      <span class="badge">GPU</span>
      <h3>Lots of small tiles → flexibility</h3>
      <ul>
        <li>Many SMs, each with their own tensor core, vector ALUs, register file.</li>
        <li>Lots of <strong>perimeter</strong> between matrix and vector units → tons of cross-bandwidth.</li>
        <li>Great when there isn't one giant matmul — when work is uneven or branchy.</li>
        <li>Pays the register-file tax over many small tiles.</li>
      </ul>
    </div>

    <div class="card accent-mag">
      <span class="badge">TPU</span>
      <h3>Few big tiles → amortization</h3>
      <ul>
        <li>One or two huge MXUs + one vector unit.</li>
        <li>All data movement between matrix and vector squeezes through narrow perimeter — <strong>lower bandwidth</strong> there.</li>
        <li>But each register-file dollar is spread over a much bigger tile.</li>
        <li>Wins when the workload is one big matmul. Suffers when work doesn't fit the shape.</li>
      </ul>
    </div>
  </div>
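  <div class="card">
    <span class="badge">sketch</span>
    <h3>Perimeter per MAC, small tiles vs big tiles</h3>
    <p>The perimeter argument can be checked with simple geometry. This sketch assumes a fixed MAC budget split into equal square tiles and treats total tile edge as a stand-in for both matrix-to-vector bandwidth and data-movement overhead. The budget of 256 × 256 MACs and the tile counts are arbitrary illustration values.</p>
<pre>
```python
import math


def tiling_stats(total_macs: int, num_tiles: int):
    """Split a MAC budget into equal square tiles; return (side, edge-per-MAC)."""
    side = math.isqrt(total_macs // num_tiles)
    perimeter = 4 * side * num_tiles       # total tile edge, where data crosses
    return side, perimeter / total_macs


print(tiling_stats(256 * 256, 1))    # (256, 0.015625)
print(tiling_stats(256 * 256, 64))   # (32, 0.125)
```
</pre>
    <p>Cutting one big tile into 64 small ones gives 8× more edge per MAC. That is the GPU side of the trade: more cross-bandwidth and flexibility, paid for with more data-movement hardware per unit of arithmetic.</p>
  </div>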

  <div class="insight">
    <p>
      <strong>MatX's pitch (hinted at):</strong> a <span class="hi-m">splittable systolic array</span> that
      behaves like a big TPU tile when the matmul is big, and like a stack of small GPU-style tiles when
      it isn't. Best of both granularities, ideally without the worst of either.
    </p>
  </div>
</section>

<!-- ============================ S10 — TAKEAWAYS ============================ -->
<section class="chunk" id="s10">
  <div class="chunk-head">
    <div class="num">10</div>
    <h2>The mental model in 7 lines</h2>
    <div class="pill">summary</div>
  </div>

  <div class="stats">
    <div class="stat"><div class="big">×4</div><div class="lbl">FP4 vs FP8 (theoretical)</div></div>
    <div class="stat mag"><div class="big">7/8</div><div class="lbl">cost in data movement</div></div>
    <div class="stat amb"><div class="big">128²</div><div class="lbl">classic TPU MXU size</div></div>
    <div class="stat cyan"><div class="big">10×</div><div class="lbl">FPGA tax vs ASIC</div></div>
  </div>

  <div class="grid cols-2">
    <div class="card accent-phos">
      <span class="badge">remember this</span>
      <h3>The compute / comm ratio is everything</h3>
      <p>At every level of the stack — from the bit-width of a multiplier to the floor plan of a datacenter — the optimization is <strong>more arithmetic per byte moved</strong>. That's what motivates low precision, systolic arrays, scratchpads, and the GPU-vs-TPU layout debate.</p>
    </div>

    <div class="card accent-amb">
      <span class="badge">7 lines</span>
      <h3>The whole conversation, compressed</h3>
      <ul>
        <li>The atomic op of AI hardware is the <strong class="hi-a">multiply-accumulate</strong>.</li>
        <li>Multiplier area scales as <strong class="hi-a">p × q</strong> (the bit-widths of the two operands), so low precision wins quadratically.</li>
        <li>In a vanilla core, <strong class="hi-a">most of the area moves data</strong> rather than doing arithmetic.</li>
        <li>Systolic arrays fix this by baking a 2D loop into hardware and parking weights in place.</li>
        <li>Clock speed is set by the longest path — pipeline registers buy speed at an area cost.</li>
        <li>FPGAs trade ~10× efficiency for in-field reconfigurability via LUTs and routing muxes.</li>
        <li>GPUs are many tiny TPUs; TPUs are a few big GPU tiles. MatX wants a <strong class="hi-a">splittable</strong> array.</li>
      </ul>
    </div>
  </div>
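
  <div class="card" style="margin-top: 22px;">
    <span class="badge">sketch</span>
    <h3>Where the ×4 comes from</h3>
    <p>
      The p × q rule and the ×4 stat above can be checked with a few lines of arithmetic. This is a minimal
      sketch that assumes multiplier area is exactly <code>p * q</code> and treats the format width as the
      multiplier width, ignoring exponent logic, adders, and wiring. The function names
      <code>multiplierArea</code> and <code>densityGain</code> are hypothetical, chosen for illustration.
    </p>
<pre>
```javascript
// Assumption: area of a p-bit by q-bit multiplier is modeled as p * q.
function multiplierArea(p, q) {
  return p * q;
}

// How many narrow multipliers fit in the area of one wide multiplier.
function densityGain(wideBits, narrowBits) {
  return multiplierArea(wideBits, wideBits) / multiplierArea(narrowBits, narrowBits);
}

const fp8ToFp4 = densityGain(8, 4);   // 64 / 16
const fp16ToFp8 = densityGain(16, 8); // 256 / 64

console.log(fp8ToFp4);  // 4
console.log(fp16ToFp8); // 4
```
</pre>
    <p>
      Each halving of bit-width gives a theoretical 4× in multipliers per unit area, which is why the gain
      is quadratic rather than linear. Shipping hardware will land below that ceiling.
    </p>
  </div>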

  <div class="card" style="margin-top: 22px;">
    <span class="badge">further</span>
    <h3>Open questions the transcript points at</h3>
    <ul>
      <li>How should a chip split its resources between FP4 and FP8: by equal die area, by equal power budget, or by customer demand?</li>
      <li>How big should one systolic array be before perimeter bandwidth kills you?</li>
      <li>Can splittable systolic arrays really get the TPU's amortization <strong>and</strong> the GPU's matrix-to-vector bandwidth?</li>
      <li>What's the analog-computation / co-location story that lets brains run at kilohertz and still beat silicon on perception?</li>
    </ul>
  </div>
</section>

<footer>
  <div class="ascii">  ╭──────────────────────────────────────────────╮
  │   END OF TRANSMISSION                        │
  │   built from notes // Dwarkesh × Reiner Pope │
  ╰──────────────────────────────────────────────╯</div>
  <div>// CHIP-NOTES v1.0 — single-file html, no deps, vibes intact</div>
</footer>

</main>

<script>
  // smooth-scroll for TOC
  document.querySelectorAll('.toc a').forEach(a => {
    a.addEventListener('click', e => {
      e.preventDefault();
      const t = document.querySelector(a.getAttribute('href'));
      if (t) t.scrollIntoView({ behavior: 'smooth', block: 'start' });
    });
  });

  // highlight current section in TOC
  const sections = document.querySelectorAll('section.chunk');
  const tocLinks = document.querySelectorAll('.toc a');
  const observer = new IntersectionObserver((entries) => {
    entries.forEach(entry => {
      if (entry.isIntersecting) {
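        // clear both highlight properties so earlier links do not keep a lit border
        tocLinks.forEach(l => { l.style.color = ''; l.style.borderLeftColor = ''; });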
        const active = document.querySelector(`.toc a[href="#${entry.target.id}"]`);
        if (active) {
          active.style.color = 'var(--phos)';
          active.style.borderLeftColor = 'var(--phos)';
        }
      }
    });
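  // threshold 0: fire as soon as a section enters the band left by rootMargin.
  // A ratio threshold such as 0.2 can never be reached by a section taller than the viewport.
  }, { threshold: 0, rootMargin: '-30% 0px -50% 0px' });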
  sections.forEach(s => observer.observe(s));
</script>

</body>
</html>