⚙️ Inside the Compiler

CSE 230 - Section 2.1 Interactive Study Guide

🎯 What is a Compiler?

A compiler is a translator that converts human-readable code into machine instructions that your computer's hardware can execute directly.

Your Code (Python, C++, Java) → COMPILER → Machine Instructions (1s and 0s)

Why Do We Need Compilers?

High-level languages let programmers express complex ideas quickly and clearly. But computers only understand machine code - sequences of numbers representing basic operations. Machine code is nearly impossible for humans to read or write directly, so compilers bridge this gap.

🏗️ The Three Stages

Every compiler follows a similar structure, broken into three main stages:

Frontend

Syntax checking, tokenization, parsing

Middle

Analysis & optimization

Backend

Code generation

Click any stage to learn more

💡 Key Insight

Think of a compiler like a translator converting a novel from English to Japanese:

  • Frontend = Understanding the English (grammar, vocabulary, sentence structure)
  • Middle = Deciding the best way to express each idea in Japanese
  • Backend = Actually writing out the Japanese text

🔄 Complete Compilation Pipeline

Follow your code's journey from source to executable:

Source Code
int x = 5 + 3;
Lexer
Tokenize
Parser
Build AST
Optimizer
Improve Code
Code Generator
Create Instructions
Executable
Machine Code

📊 Data Structures Along the Way

🏷️ Tokens

Smallest meaningful units of code

Created by the lexer. Examples: int, x, =, 5, +, 3, ;

Each token has a type (keyword, identifier, operator, etc.) and a value.

🌳 Abstract Syntax Tree

Hierarchical program structure

A tree showing how code elements relate. For x = 5 + 3:

Assignment (=) ├── Variable: x └── Add (+) ├── Number: 5 └── Number: 3

📋 Symbol Table

Registry of all identifiers

Tracks every variable, function, and class name along with their types, scopes, and memory locations.

Allows the compiler to know "what is x?" at any point in the code.

🔢 Intermediate Representation

Platform-independent code form

A simplified version of your code that's easier to optimize, but not yet specific to any CPU architecture.

Like a universal language between your code and machine code.

📥 Compiler Frontend

The frontend's job is to understand your code and check it for errors.

🔤 Lexical Analyzer (Lexer)

The lexer reads your raw source code character by character and groups them into tokens.

What are Tokens?

Tokens are the smallest meaningful units in a programming language - like words in a sentence.

Example: How the lexer sees int count = 42;

KEYWORD: int IDENTIFIER: count OPERATOR: = NUMBER: 42 SEMICOLON: ;

Token Types

  • Keywords - Reserved words like if, while, return
  • Identifiers - Names you create: variables, functions, classes
  • Literals - Values like numbers (42) or strings ("hello")
  • Operators - Symbols like +, -, =, ==
  • Punctuation - Structural symbols like ;, {, }

📐 Parser

The parser takes tokens and checks if they follow the grammar rules of the language.

Grammar Rules Examples

Assignment Rule: IDENTIFIER = EXPRESSION ; If Statement: if ( CONDITION ) { STATEMENTS } Function Call: IDENTIFIER ( ARGUMENTS )

What the Parser Catches

  • Missing semicolons: int x = 5 ← where's the ;?
  • Unbalanced brackets: if (x > 5 { } ← missing )
  • Invalid assignments: 5 = x; ← can't assign to a number!

The Output: Abstract Syntax Tree (AST)

If the code is valid, the parser builds a tree structure that represents the program's logic.

AST for: result = (a + b) * 2;

Assignment
├── result
└── Multiply (*)
    ├── Add (+)
    │   ├── a
    │   └── b
    └── 2

📋 Symbol Table

While parsing, the compiler builds a symbol table - a database of all identifiers in your program.

For this code:

int globalCount = 0;

void calculateSum(int a, int b) {
    int result = a + b;
    globalCount++;
}
Name Type Scope Details
globalCount int global initialized to 0
calculateSum function global returns void, 2 params
a int calculateSum parameter
b int calculateSum parameter
result int calculateSum local variable

Why Symbol Tables Matter

  • Type checking - Is x + "hello" valid?
  • Scope resolution - Which x are we referring to?
  • Memory allocation - Where should each variable live?

⚡ Middle Stage (Optimizer)

The middle stage makes your code faster and smaller without changing what it does.

This stage is optional but almost always used in production compilers.

🎯 Optimization Goals

🏃 Faster Execution

Reduce CPU cycles needed

Rearrange operations to take advantage of CPU features like pipelining, caching, and parallel execution units.

📦 Smaller Code

Reduce executable size

Remove redundant code, combine similar operations, and eliminate unused functions to shrink the final binary.

💾 Less Memory

Reduce RAM usage

Reuse memory locations, eliminate unnecessary temporary variables, and optimize data layout.

🔧 Common Optimizations

Dead Code Elimination

Remove code that never runs or variables that are never used.

Before
int x = 5;
int y = 10;  // never used!
int z = x * 2;
return z;
After
int x = 5;
int z = x * 2;
return z;

Constant Folding

Calculate constant expressions at compile time instead of runtime.

Before
int seconds = 60 * 60 * 24;
int area = 3.14159 * 10 * 10;
After
int seconds = 86400;
int area = 314.159;

Loop Optimization

Move calculations outside loops when possible.

Before
for (i = 0; i < n; i++) {
    x = y * z;  // same every iteration!
    arr[i] = x + i;
}
After
x = y * z;  // moved outside
for (i = 0; i < n; i++) {
    arr[i] = x + i;
}

Strength Reduction

Replace expensive operations with cheaper equivalents.

Before
x = y * 2;
z = w / 4;
p = q * 8;
After (using bit shifts)
x = y << 1;   // shift left = *2
z = w >> 2;   // shift right = /4
p = q << 3;   // shift left 3 = *8

📤 Compiler Backend

The backend transforms the optimized intermediate code into actual machine instructions for a specific CPU.

🎯 Backend Responsibilities

📍 Memory Layout

Organize data in RAM

Decide where each variable lives - in registers (fast), on the stack (function-local), or in the heap (dynamic).

Memory Layout: ┌─────────────┐ High Address │ Stack │ ← Local variables, function calls ├─────────────┤ │ ↓ │ │ ↑ │ ├─────────────┤ │ Heap │ ← Dynamic allocations (malloc/new) ├─────────────┤ │ Data │ ← Global/static variables ├─────────────┤ │ Code │ ← Your program instructions └─────────────┘ Low Address

⚙️ Instruction Selection

Choose CPU operations

Match each operation in your code to specific machine instructions available on the target CPU (x86, ARM, etc.).

Different CPUs have different instruction sets!

📝 Register Allocation

Assign CPU registers

Registers are the fastest storage on a CPU, but there are only a few (8-32 typically). The compiler decides which variables get registers vs. memory.

🔧 Code Generation Example

See how high-level code becomes machine instructions:

C Code
int sum = a + b;
x86 Assembly (simplified)
MOV  EAX, [a]     ; Load value of 'a' into register EAX
ADD  EAX, [b]     ; Add value of 'b' to EAX
MOV  [sum], EAX   ; Store result in memory location 'sum'

Instruction Templates

The backend uses templates for common patterns:

IF Statement Template: CMP condition ; Compare JZ else_label ; Jump if zero (false) [if-body code] JMP end_label else_label: [else-body code] end_label: WHILE Loop Template: loop_start: CMP condition JZ loop_end ; Exit if false [loop-body code] JMP loop_start ; Repeat loop_end:

🛠️ Additional Compiler Tools

Compilers work with other tools to create final executables.

🔗 Linkers

Linkers combine separately compiled pieces into a single executable.

What Gets Linked?

  • Your compiled code - The .o or .obj files from your source
  • Standard libraries - printf, malloc, math functions
  • Third-party libraries - Any external code you use

When you write printf("Hello"), you don't include the actual printf code:

Your Code: Standard Library: ┌──────────────┐ ┌──────────────┐ │ main.o │ │ libc.a │ │ │ │ │ │ call printf ─┼───────────┼→ printf code │ │ │ LINKER │ │ └──────────────┘ fills └──────────────┘ this gap! ↓ Final Executable: ┌──────────────┐ │ main code │ │ printf code │ ← copied or referenced └──────────────┘

Static vs Dynamic Linking

Static Linking

Library code is copied INTO your executable.

✅ Self-contained, ❌ Larger file

Dynamic Linking

Library code is loaded at runtime (.dll, .so).

✅ Smaller file, ❌ Needs library installed

📝 Assemblers

Assemblers translate assembly language into machine code.

Assembly vs Machine Code

Assembly (Human Readable): Machine Code (CPU Reads): MOV EAX, 5 → B8 05 00 00 00 ADD EAX, 3 → 83 C0 03

What Assemblers Do

  • Translate mnemonics - MOV, ADD, JMP → numeric opcodes
  • Resolve labels - Convert names like loop_start to addresses
  • Manage symbol table - Track function/variable locations
  • Handle directives - .data, .text, memory alignment

💡 Assembly is much simpler to translate than high-level languages because there's almost a 1:1 mapping between assembly instructions and machine code.

🎮 Interactive Lexer Demo

Enter some code and watch it get tokenized!

Tokens
Symbol Table

🌳 AST Builder

Enter a simple expression to see its syntax tree:

Abstract Syntax Tree

                    

📝 Test Your Knowledge

See how well you understand compiler internals!

Q1: What does a lexer produce?

Q2: What does the parser check for?

Q3: What does a symbol table track?

Q4: Which compiler stage is optional but improves performance?

Q5: What does "constant folding" optimization do?

Q6: What does a linker do?

Q7: Why is assembly-to-machine-code translation simpler than compilation?

Q8: What is the output of the compiler frontend?