Inside the Compiler

🎯 What is a Compiler?

A compiler is a translator that converts human-readable code into machine instructions that your computer's hardware can execute directly.

Your Code (Python, C++, Java) → COMPILER → Machine Instructions (1s and 0s)

Why Do We Need Compilers?

High-level languages let programmers express complex ideas quickly and clearly. But computers only understand machine code - sequences of numbers representing basic operations. Machine code is nearly impossible for humans to read or write directly, so compilers bridge this gap.

🏗️ The Three Stages

Every compiler follows a similar structure, broken into three main stages:

Frontend

Syntax checking, tokenization, parsing

Middle

Analysis & optimization

Backend

Code generation

Click any stage to learn more

💡 Key Insight

Think of a compiler like a translator converting a novel from English to Japanese:

Frontend = Understanding the English (grammar, vocabulary, sentence structure)
Middle = Deciding the best way to express each idea in Japanese
Backend = Actually writing out the Japanese text

🔄 Complete Compilation Pipeline

Follow your code's journey from source to executable:

Source Code
int x = 5 + 3;

→

Lexer
Tokenize

→

Parser
Build AST

↓

Optimizer
Improve Code

↓

Code Generator
Create Instructions

→

Executable
Machine Code

📊 Data Structures Along the Way

🏷️ Tokens

Smallest meaningful units of code

Created by the lexer. Examples: int, x, =, 5, +, 3, ;

Each token has a type (keyword, identifier, operator, etc.) and a value.

🌳 Abstract Syntax Tree

Hierarchical program structure

A tree showing how code elements relate. For x = 5 + 3:

Assignment (=) ├── Variable: x └── Add (+) ├── Number: 5 └── Number: 3

📋 Symbol Table

Registry of all identifiers

Tracks every variable, function, and class name along with their types, scopes, and memory locations.

Allows the compiler to know "what is x?" at any point in the code.

🔢 Intermediate Representation

Platform-independent code form

A simplified version of your code that's easier to optimize, but not yet specific to any CPU architecture.

Like a universal language between your code and machine code.

📥 Compiler Frontend

The frontend's job is to understand your code and check it for errors.

🔤 Lexical Analyzer (Lexer)

The lexer reads your raw source code character by character and groups them into tokens.

What are Tokens?

Tokens are the smallest meaningful units in a programming language - like words in a sentence.

Example: How the lexer sees int count = 42;

KEYWORD: int IDENTIFIER: count OPERATOR: = NUMBER: 42 SEMICOLON: ;

Token Types

Keywords - Reserved words like if, while, return
Identifiers - Names you create: variables, functions, classes
Literals - Values like numbers (42) or strings ("hello")
Operators - Symbols like +, -, =, ==
Punctuation - Structural symbols like ;, {, }

📐 Parser

The parser takes tokens and checks if they follow the grammar rules of the language.

Grammar Rules Examples

Assignment Rule: IDENTIFIER = EXPRESSION ; If Statement: if ( CONDITION ) { STATEMENTS } Function Call: IDENTIFIER ( ARGUMENTS )

What the Parser Catches

Missing semicolons: int x = 5 ← where's the ;?
Unbalanced brackets: if (x > 5 { } ← missing )
Invalid assignments: 5 = x; ← can't assign to a number!

The Output: Abstract Syntax Tree (AST)

If the code is valid, the parser builds a tree structure that represents the program's logic.

AST for: result = (a + b) * 2;

Assignment
├── result
└── Multiply (*)
    ├── Add (+)
    │   ├── a
    │   └── b
    └── 2

📋 Symbol Table

While parsing, the compiler builds a symbol table - a database of all identifiers in your program.

For this code:

int globalCount = 0;

void calculateSum(int a, int b) {
    int result = a + b;
    globalCount++;
}

Name	Type	Scope	Details
globalCount	int	global	initialized to 0
calculateSum	function	global	returns void, 2 params
a	int	calculateSum	parameter
b	int	calculateSum	parameter
result	int	calculateSum	local variable

Why Symbol Tables Matter

Type checking - Is x + "hello" valid?
Scope resolution - Which x are we referring to?
Memory allocation - Where should each variable live?

⚡ Middle Stage (Optimizer)

The middle stage makes your code faster and smaller without changing what it does.

This stage is optional but almost always used in production compilers.

🎯 Optimization Goals

🏃 Faster Execution

Reduce CPU cycles needed

Rearrange operations to take advantage of CPU features like pipelining, caching, and parallel execution units.

📦 Smaller Code

Reduce executable size

Remove redundant code, combine similar operations, and eliminate unused functions to shrink the final binary.

💾 Less Memory

Reduce RAM usage

Reuse memory locations, eliminate unnecessary temporary variables, and optimize data layout.

🔧 Common Optimizations

Dead Code Elimination

Remove code that never runs or variables that are never used.

Before

int x = 5;
int y = 10;  // never used!
int z = x * 2;
return z;

After

int x = 5;
int z = x * 2;
return z;

Constant Folding

Calculate constant expressions at compile time instead of runtime.

Before

int seconds = 60 * 60 * 24;
int area = 3.14159 * 10 * 10;

After

int seconds = 86400;
int area = 314.159;

Loop Optimization

Move calculations outside loops when possible.

Before

for (i = 0; i < n; i++) {
    x = y * z;  // same every iteration!
    arr[i] = x + i;
}

After

x = y * z;  // moved outside
for (i = 0; i < n; i++) {
    arr[i] = x + i;
}

Strength Reduction

Replace expensive operations with cheaper equivalents.

Before

x = y * 2;
z = w / 4;
p = q * 8;

After (using bit shifts)

x = y << 1;   // shift left = *2
z = w >> 2;   // shift right = /4
p = q << 3;   // shift left 3 = *8

📤 Compiler Backend

The backend transforms the optimized intermediate code into actual machine instructions for a specific CPU.

🎯 Backend Responsibilities

📍 Memory Layout

Organize data in RAM

Decide where each variable lives - in registers (fast), on the stack (function-local), or in the heap (dynamic).

Memory Layout: ┌─────────────┐ High Address │ Stack │ ← Local variables, function calls ├─────────────┤ │ ↓ │ │ ↑ │ ├─────────────┤ │ Heap │ ← Dynamic allocations (malloc/new) ├─────────────┤ │ Data │ ← Global/static variables ├─────────────┤ │ Code │ ← Your program instructions └─────────────┘ Low Address

⚙️ Instruction Selection

Choose CPU operations

Match each operation in your code to specific machine instructions available on the target CPU (x86, ARM, etc.).

Different CPUs have different instruction sets!

📝 Register Allocation

Assign CPU registers

Registers are the fastest storage on a CPU, but there are only a few (8-32 typically). The compiler decides which variables get registers vs. memory.

🔧 Code Generation Example

See how high-level code becomes machine instructions:

C Code

int sum = a + b;

x86 Assembly (simplified)

MOV  EAX, [a]     ; Load value of 'a' into register EAX
ADD  EAX, [b]     ; Add value of 'b' to EAX
MOV  [sum], EAX   ; Store result in memory location 'sum'

Instruction Templates

The backend uses templates for common patterns:

IF Statement Template: CMP condition ; Compare JZ else_label ; Jump if zero (false) [if-body code] JMP end_label else_label: [else-body code] end_label: WHILE Loop Template: loop_start: CMP condition JZ loop_end ; Exit if false [loop-body code] JMP loop_start ; Repeat loop_end:

🛠️ Additional Compiler Tools

Compilers work with other tools to create final executables.

🔗 Linkers

Linkers combine separately compiled pieces into a single executable.

What Gets Linked?

Your compiled code - The .o or .obj files from your source
Standard libraries - printf, malloc, math functions
Third-party libraries - Any external code you use

When you write printf("Hello"), you don't include the actual printf code:

Your Code: Standard Library: ┌──────────────┐ ┌──────────────┐ │ main.o │ │ libc.a │ │ │ │ │ │ call printf ─┼───────────┼→ printf code │ │ │ LINKER │ │ └──────────────┘ fills └──────────────┘ this gap! ↓ Final Executable: ┌──────────────┐ │ main code │ │ printf code │ ← copied or referenced └──────────────┘

Static vs Dynamic Linking

Static Linking

Library code is copied INTO your executable.

✅ Self-contained, ❌ Larger file

Dynamic Linking

Library code is loaded at runtime (.dll, .so).

✅ Smaller file, ❌ Needs library installed

📝 Assemblers

Assemblers translate assembly language into machine code.

Assembly vs Machine Code

Assembly (Human Readable): Machine Code (CPU Reads): MOV EAX, 5 → B8 05 00 00 00 ADD EAX, 3 → 83 C0 03

What Assemblers Do

Translate mnemonics - MOV, ADD, JMP → numeric opcodes
Resolve labels - Convert names like loop_start to addresses
Manage symbol table - Track function/variable locations
Handle directives - .data, .text, memory alignment

💡 Assembly is much simpler to translate than high-level languages because there's almost a 1:1 mapping between assembly instructions and machine code.

🎮 Interactive Lexer Demo

Enter some code and watch it get tokenized!

int count = 42;
if (count > 10) {
    return count * 2;
}

🎯 What is a Compiler?

Why Do We Need Compilers?

🏗️ The Three Stages

Frontend

Middle

Backend

💡 Key Insight

🔄 Complete Compilation Pipeline

📊 Data Structures Along the Way

🏷️ Tokens

🌳 Abstract Syntax Tree

📋 Symbol Table

🔢 Intermediate Representation

📥 Compiler Frontend

🔤 Lexical Analyzer (Lexer)

What are Tokens?

Token Types

📐 Parser

Grammar Rules Examples

What the Parser Catches

The Output: Abstract Syntax Tree (AST)

📋 Symbol Table

Why Symbol Tables Matter

⚡ Middle Stage (Optimizer)

🎯 Optimization Goals

🏃 Faster Execution

📦 Smaller Code

💾 Less Memory

🔧 Common Optimizations

Dead Code Elimination

Before

After

Constant Folding

Before

After

Loop Optimization

Before

After

Strength Reduction

Before

After (using bit shifts)

📤 Compiler Backend

🎯 Backend Responsibilities

📍 Memory Layout

⚙️ Instruction Selection

📝 Register Allocation

🔧 Code Generation Example

C Code

x86 Assembly (simplified)

Instruction Templates

🛠️ Additional Compiler Tools

🔗 Linkers

What Gets Linked?

Static vs Dynamic Linking

Static Linking

Dynamic Linking

📝 Assemblers

Assembly vs Machine Code

What Assemblers Do

🎮 Interactive Lexer Demo

Tokens

Symbol Table

🌳 AST Builder

Abstract Syntax Tree

📝 Test Your Knowledge

Q1: What does a lexer produce?

Q2: What does the parser check for?

Q3: What does a symbol table track?

Q4: Which compiler stage is optional but improves performance?

Q5: What does "constant folding" optimization do?

Q6: What does a linker do?

Q7: Why is assembly-to-machine-code translation simpler than compilation?

Q8: What is the output of the compiler frontend?