How Programs Are Executed by a Computer System
Basic Components of a CPU Design
The Central Processing Unit (CPU) is the brain of the computer. It consists of several fundamental components that work together to execute instructions:
Control Unit (CU)
The "manager" of the CPU. It fetches instructions from memory, decodes them to understand what operation to perform, and coordinates all other components to execute the instruction. Think of it as the conductor of an orchestra.
Arithmetic Logic Unit (ALU)
The "calculator" of the CPU. Performs all arithmetic operations (add, subtract, multiply, divide) and logical operations (AND, OR, NOT, XOR, comparisons). Every computation flows through here.
Registers
Ultra-fast, small storage locations inside the CPU. Used to hold data currently being processed, addresses, and status information. Much faster than RAM because they're on the CPU chip itself.
Program Counter (PC)
A special register that holds the memory address of the next instruction to execute. After each instruction fetch, it typically increments to point to the next sequential instruction (unless a jump/branch occurs).
┌─────────────────────────────────────────────────────────────┐
│ CPU │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Control Unit │ │
│ │ ┌─────────┐ ┌─────────────┐ ┌──────────────┐ │ │
│ │ │ PC │ │ Instruction │ │ Control │ │ │
│ │ │ Counter │──│ Register │──│ Signals │────┼───┼──▶
│ │ └─────────┘ └─────────────┘ └──────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────┼────────────────────────────┐ │
│ │ Register File │ │ │
│ │ ┌────┐ ┌────┐ ┌────┐ │ ┌────┐ ┌────┐ ┌────┐ │ │
│ │ │ R0 │ │ R1 │ │ R2 │ │ │ R3 │ │... │ │ Rn │ │ │
│ │ └────┘ └────┘ └────┘ │ └────┘ └────┘ └────┘ │ │
│ └─────────────────────────┼────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────▼────────────────────────────┐ │
│ │ ALU │ │
│ │ ADD SUB MUL DIV AND OR XOR NOT │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
◀────────┴────────▶
Memory Bus
The Program Compilation Process
When you write code in a high-level language like C or Python, it can't run directly on hardware. The compilation process transforms human-readable code into machine instructions the CPU understands.
Source Code (.c)
│
▼
┌─────────────────┐
│ PREPROCESSOR │ → Handles #include, #define, macros
└────────┬────────┘ Produces expanded source code
│
▼
┌─────────────────┐
│ COMPILER │ → Converts C to Assembly language
└────────┬────────┘ Syntax checking, optimization
│
▼
┌─────────────────┐
│ ASSEMBLER │ → Converts Assembly to Object code
└────────┬────────┘ Machine instructions (binary)
│
▼
┌─────────────────┐
│ LINKER │ → Combines object files + libraries
└────────┬────────┘ Resolves external references
│
▼
Executable File
Stage Details
- Preprocessor: Text substitution phase. Expands macros, includes header files, processes conditional compilation directives. Output is still C code.
- Compiler: The heavy lifting. Performs lexical analysis (tokenization), parsing (syntax tree), semantic analysis (type checking), optimization, and code generation. Output is assembly language.
- Assembler: Translates assembly mnemonics into binary machine code. Creates object files containing machine instructions plus symbol tables for linking.
- Linker: Combines multiple object files and libraries. Resolves symbol references (like function calls to library functions). Produces final executable with proper memory addresses.
The Instruction Cycle (Fetch-Decode-Execute)
Every instruction the CPU executes follows the same fundamental cycle. This continuous loop is the heartbeat of computation.
┌─────────────────────────────────────────────────────┐
│ │
▼ │
┌─────────┐ ┌──────────┐ ┌───────────┐ │
│ FETCH │ ──▶ │ DECODE │ ──▶ │ EXECUTE │ ─────────┘
└─────────┘ └──────────┘ └───────────┘
│ │ │
│ │ │
Read instruction Determine Perform the
from memory at what operation operation:
address in PC, and operands, ALU computation,
increment PC setup control memory access,
signals or branch
1. Fetch: The CPU reads the instruction from memory at the address stored in the Program Counter. The instruction is loaded into the Instruction Register. The PC is incremented to point to the next instruction.
2. Decode: The Control Unit examines the instruction to determine: What operation? Which registers? What addressing mode? It generates control signals to coordinate the execution.
3. Execute: The actual work happens. This might involve the ALU performing arithmetic, data being read/written to memory, or the PC being modified for branches/jumps.
Some architectures add additional stages:
- Memory Access: For load/store instructions, access RAM to read or write data
- Write Back: Store the result of the operation back into a register
Hardware Factors Impacting Performance
Measuring and Comparing Computer Performance
Performance can be measured in several ways, each telling a different part of the story:
Total time to complete a task. Lower is better. This is what end users care about most.
Tasks completed per unit time. Higher is better. Important for servers handling many requests.
Comparing two systems:
Key Metrics
- MIPS (Million Instructions Per Second): Raw instruction throughput. Can be misleading because different instructions do different amounts of work.
- FLOPS (Floating Point Operations Per Second): Better for scientific computing comparisons. Measures actual computational work.
- Benchmarks (SPEC, Geekbench): Standardized programs that measure real-world performance across various workloads.
Propagation Delay
Propagation delay is the time it takes for a signal to travel through a logic gate or circuit. It's the fundamental speed limit of digital circuits.
Propagation delay is affected by:
- Gate complexity: More transistors = more delay
- Capacitive load: More outputs to drive = slower switching
- Wire length: Signals take time to travel, even at near light speed
- Temperature: Higher temps generally increase delay
- Voltage: Lower voltage = slower transitions
The critical path is the longest path through combinational logic between registers. It determines the minimum clock period and thus the maximum clock frequency. All design optimization focuses on reducing critical path delay.
Clock Speed and Performance
Clock speed (frequency) determines how many cycles per second the CPU runs. Higher clock = more operations per second, but there are tradeoffs.
Where CPI = Cycles Per Instruction (average)
Why Not Just Crank Up Clock Speed?
- Power consumption: Power scales roughly with frequency cubed (P ∝ f³). Double the clock, roughly 8× the heat.
- Heat dissipation: Can't cool it fast enough. This is why we hit the "frequency wall" around 4-5 GHz.
- Signal integrity: At high frequencies, wires start acting like antennas and transmission lines.
- Timing margins: Less room for error in meeting setup/hold times.
Computer Architecture's Influence on Performance
Architecture decisions fundamentally impact what's possible:
| Architectural Feature | Performance Impact |
|---|---|
| Word Size (32-bit vs 64-bit) | More data processed per instruction, larger address space |
| Register Count | More registers = fewer memory accesses = faster |
| Cache Size & Levels | Larger/more cache = fewer slow main memory accesses |
| Pipeline Depth | More stages = higher clock possible, but hazard penalties |
| Superscalar Width | Execute multiple instructions per cycle |
| Out-of-Order Execution | Better utilization by reordering independent instructions |
| Branch Prediction | Reduce pipeline stalls from branches |
Architecture vs. Instruction Set Relationship
The Instruction Set Architecture (ISA) is the interface between software and hardware — the "contract" that defines what instructions exist, their formats, registers, and addressing modes.
CISC (Complex Instruction Set)
x86
- Many complex instructions
- Variable instruction lengths
- Instructions can access memory directly
- Fewer instructions needed per program
- Complex decode logic
RISC (Reduced Instruction Set)
ARM MIPS RISC-V
- Simple, uniform instructions
- Fixed instruction length
- Load/Store architecture
- More instructions per program
- Simpler decode, easier pipelining
Modern "CISC" processors (like x86) actually translate complex instructions into RISC-like micro-operations internally. The ISA is preserved for compatibility, but execution is RISC-style for performance.
Predicting Assembly Program Execution Time
To calculate execution time, you need to account for instruction mix, cycles per instruction type, and clock speed.
Example Calculation
Given a program with:
- 100 ALU instructions (1 cycle each)
- 30 load instructions (3 cycles each — memory access)
- 20 store instructions (3 cycles each)
- 10 branch instructions (2 cycles each)
- Clock frequency: 2 GHz
Actual execution time is affected by cache hits/misses (huge impact!), branch prediction accuracy, pipeline hazards, memory bandwidth, and out-of-order execution effects. These make precise prediction difficult.
Data Representation, Instruction Sets, and Addressing Modes
Binary and Hexadecimal Number Systems
Binary (Base 2)
Computers use binary because transistors have two states: on (1) and off (0). Each digit is a "bit".
Position: 7 6 5 4 3 2 1 0
Value: 128 64 32 16 8 4 2 1
2⁷ 2⁶ 2⁵ 2⁴ 2³ 2² 2¹ 2⁰
Example: 10110101₂ = 128 + 32 + 16 + 4 + 1 = 181₁₀
Hexadecimal (Base 16)
Hex is a compact way to represent binary. Each hex digit represents exactly 4 bits (a "nibble").
| Decimal | Binary | Hex | Decimal | Binary | Hex |
|---|---|---|---|---|---|
| 0 | 0000 | 0 | 8 | 1000 | 8 |
| 1 | 0001 | 1 | 9 | 1001 | 9 |
| 2 | 0010 | 2 | 10 | 1010 | A |
| 3 | 0011 | 3 | 11 | 1011 | B |
| 4 | 0100 | 4 | 12 | 1100 | C |
| 5 | 0101 | 5 | 13 | 1101 | D |
| 6 | 0110 | 6 | 14 | 1110 | E |
| 7 | 0111 | 7 | 15 | 1111 | F |
Binary Addition and Subtraction
Binary Addition
Works just like decimal addition, but carry occurs at 2 instead of 10.
Rules: Example: 0101 + 0011 = 1000 (5 + 3 = 8)
0 + 0 = 0
0 + 1 = 1 0 1 0 1 (5)
1 + 0 = 1 + 0 0 1 1 (3)
1 + 1 = 10 ---------
(0 carry 1) ¹ ¹
1 0 0 0 (8)
Binary Subtraction (Using Two's Complement)
Instead of subtracting, we add the negative. To negate in two's complement: invert all bits, then add 1.
To compute 5 - 3 using 4 bits:
Step 1: Convert 3 to -3 (two's complement)
3 = 0011
Invert: 1100
Add 1: 1101 = -3
Step 2: Add 5 + (-3)
0101 (5)
+ 1101 (-3)
------
¹ 0010 (2) ← discard carry, answer is 2 ✓
Decimal to Binary Conversion
Method: Repeated Division by 2
Divide by 2, record remainder. Repeat until quotient is 0. Read remainders bottom-to-top.
Convert 181 to binary: 181 ÷ 2 = 90 remainder 1 ↑ 90 ÷ 2 = 45 remainder 0 │ 45 ÷ 2 = 22 remainder 1 │ Read upward 22 ÷ 2 = 11 remainder 0 │ 11 ÷ 2 = 5 remainder 1 │ 5 ÷ 2 = 2 remainder 1 │ 2 ÷ 2 = 1 remainder 0 │ 1 ÷ 2 = 0 remainder 1 │ Result: 181₁₀ = 10110101₂
Binary to Decimal
Multiply each bit by its place value and sum:
Decimal to Hexadecimal Conversion
Method 1: Repeated Division by 16
Convert 450 to hex: 450 ÷ 16 = 28 remainder 2 ↑ 28 ÷ 16 = 1 remainder 12 (C) │ Read upward 1 ÷ 16 = 0 remainder 1 │ Result: 450₁₀ = 0x1C2
Method 2: Convert to Binary First, Then Group by 4
Signed Binary Representation (Two's Complement)
Two's complement is the standard for representing signed integers. The MSB (leftmost bit) indicates sign: 0 = positive, 1 = negative.
| 4-bit Binary | Unsigned Value | Signed Value (Two's Complement) |
|---|---|---|
| 0000 | 0 | 0 |
| 0001 | 1 | +1 |
| 0111 | 7 | +7 (max positive) |
| 1000 | 8 | -8 (min negative) |
| 1111 | 15 | -1 |
| 1110 | 14 | -2 |
Range for n-bit Two's Complement
For 8 bits: -128 to +127
For 32 bits: -2,147,483,648 to +2,147,483,647
Converting Negative Numbers
To represent -5 in 8-bit two's complement: 1. Start with +5: 00000101 2. Invert all bits: 11111010 3. Add 1: 11111011 ← This is -5 To verify: add 5 + (-5) should equal 0 00000101 + 11111011 ---------- 100000000 ← Discard 9th bit carry = 00000000 ✓
Big-Endian vs Little-Endian
Endianness describes byte ordering in multi-byte data. Consider the 32-bit value 0x12345678:
Address: 0x100 0x101 0x102 0x103
Big-Endian: 12 34 56 78
↑ Most significant byte first
(Network byte order, used by some RISC)
Little-Endian: 78 56 34 12
↑ Least significant byte first
(x86, ARM default)
Like reading left-to-right. MSB at lowest address. Used in: network protocols, some RISC processors, Java bytecode.
LSB at lowest address. Used in: x86, x64, ARM (usually), most modern PCs. Easier for hardware to extend values.
Overflow and Underflow
Overflow
Occurs when the result of an operation is too large to represent in the available bits.
8-bit signed addition overflow example: 01111111 (+127, max positive) + 00000001 (+1) ---------- 10000000 (interpreted as -128, not +128!) Overflow detection (signed): If two positive numbers produce negative result → overflow If two negative numbers produce positive result → overflow
Underflow
Occurs when the result is too negative (below minimum representable value).
8-bit signed subtraction underflow: 10000000 (-128, min negative) - 00000001 (-1) ---------- 01111111 (+127, wrapped around!)
Overflow/underflow can cause security vulnerabilities (integer overflow attacks), incorrect calculations, and program crashes. Always validate ranges and use appropriate data types!
Assembly to Machine Language Translation
Assembly mnemonics are human-readable representations of machine code. Each instruction has a binary encoding defined by the ISA.
MIPS R-Type Instruction Format Example
Instruction: ADD $t0, $s1, $s2 (add registers s1 and s2, store in t0)
R-Type Format (32 bits total):
┌────────┬───────┬───────┬───────┬───────┬────────┐
│ opcode │ rs │ rt │ rd │ shamt │ funct │
│ 6 bits │5 bits │5 bits │5 bits │5 bits │ 6 bits │
└────────┴───────┴───────┴───────┴───────┴────────┘
For ADD $t0, $s1, $s2:
opcode = 000000 (R-type)
rs = 10001 ($s1 = register 17)
rt = 10010 ($s2 = register 18)
rd = 01000 ($t0 = register 8)
shamt = 00000 (not a shift)
funct = 100000 (ADD function)
Machine code: 000000 10001 10010 01000 00000 100000
= 0x02324020
I-Type and J-Type Formats
I-Type (Immediate): LW, SW, BEQ, ADDI, etc. ┌────────┬───────┬───────┬──────────────────┐ │ opcode │ rs │ rt │ immediate │ │ 6 bits │5 bits │5 bits │ 16 bits │ └────────┴───────┴───────┴──────────────────┘ J-Type (Jump): J, JAL ┌────────┬─────────────────────────────────┐ │ opcode │ address │ │ 6 bits │ 26 bits │ └────────┴─────────────────────────────────┘
Assembly Language Programming
Mathematical Expressions in Assembly
High-level math expressions must be broken down into individual operations using registers.
Example: Compute result = (a + b) * (c - d)
# Assume: a in $s0, b in $s1, c in $s2, d in $s3
# Result will be in $s4
add $t0, $s0, $s1 # $t0 = a + b
sub $t1, $s2, $s3 # $t1 = c - d
mul $s4, $t0, $t1 # $s4 = (a+b) * (c-d)
Example: Array Element Access (array[i] = array[i] + 5)
# Assume: base address of array in $s0, i in $s1
sll $t0, $s1, 2 # $t0 = i * 4 (word offset)
add $t0, $t0, $s0 # $t0 = address of array[i]
lw $t1, 0($t0) # $t1 = array[i]
addi $t1, $t1, 5 # $t1 = array[i] + 5
sw $t1, 0($t0) # array[i] = $t1
Decision Structures (If-Else)
Simple If Statement
// C code:
if (a == b) {
c = d + e;
}
# a=$s0, b=$s1, c=$s2, d=$s3, e=$s4
bne $s0, $s1, skip # if a != b, skip the body
add $s2, $s3, $s4 # c = d + e
skip:
# continue...
If-Else Statement
// C code:
if (a < b) {
c = 1;
} else {
c = 0;
}
# a=$s0, b=$s1, c=$s2
slt $t0, $s0, $s1 # $t0 = 1 if a < b, else 0
beq $t0, $zero, else # if $t0==0 (a >= b), go to else
addi $s2, $zero, 1 # c = 1
j done # skip else block
else:
add $s2, $zero, $zero # c = 0
done:
Loops
While Loop
// C code: sum numbers from 1 to n
int sum = 0;
int i = 1;
while (i <= n) {
sum = sum + i;
i = i + 1;
}
# n=$s0, sum=$s1, i=$s2
add $s1, $zero, $zero # sum = 0
addi $s2, $zero, 1 # i = 1
loop:
slt $t0, $s0, $s2 # $t0 = 1 if n < i
bne $t0, $zero, done # if n < i, exit loop
add $s1, $s1, $s2 # sum = sum + i
addi $s2, $s2, 1 # i = i + 1
j loop # repeat
done:
For Loop (Array Sum)
# Sum array elements: for(i=0; i
# arr base=$s0, n=$s1, sum in $s2
add $s2, $zero, $zero # sum = 0
add $t0, $zero, $zero # i = 0
add $t1, $s0, $zero # $t1 = current address
loop:
slt $t2, $t0, $s1 # $t2 = 1 if i < n
beq $t2, $zero, done # exit if i >= n
lw $t3, 0($t1) # $t3 = arr[i]
add $s2, $s2, $t3 # sum += arr[i]
addi $t1, $t1, 4 # next address
addi $t0, $t0, 1 # i++
j loop
done:
Procedures (Functions)
Procedures in assembly require careful management of the stack, return addresses, and registers.
MIPS Calling Convention
$a0-$a3: Arguments to procedure$v0-$v1: Return values$ra: Return address (set byjal)$sp: Stack pointer$s0-$s7: Saved registers (callee must preserve)$t0-$t9: Temporary registers (caller-saved)
Simple Leaf Procedure (No Nested Calls)
// C code:
int square(int x) {
return x * x;
}
square:
mul $v0, $a0, $a0 # $v0 = x * x
jr $ra # return
# Calling the function:
addi $a0, $zero, 5 # arg = 5
jal square # call square(5)
# result now in $v0
Non-Leaf Procedure (Makes Nested Calls)
# int factorial(int n) { if(n<=1) return 1; return n * factorial(n-1); }
factorial:
addi $sp, $sp, -8 # allocate stack space
sw $ra, 4($sp) # save return address
sw $a0, 0($sp) # save argument n
slti $t0, $a0, 2 # $t0 = 1 if n < 2
beq $t0, $zero, recurse
addi $v0, $zero, 1 # base case: return 1
addi $sp, $sp, 8 # restore stack
jr $ra
recurse:
addi $a0, $a0, -1 # n - 1
jal factorial # recursive call
lw $a0, 0($sp) # restore n
lw $ra, 4($sp) # restore return address
addi $sp, $sp, 8 # deallocate stack
mul $v0, $a0, $v0 # return n * factorial(n-1)
jr $ra
High Memory ┌─────────────────────┐ │ Previous frames │ ├─────────────────────┤ │ Saved $ra │ ← $sp + 4 (after allocation) ├─────────────────────┤ │ Saved $a0 (n) │ ← $sp (after allocation) ├─────────────────────┤ │ (next frame...) │ └─────────────────────┘ Low Memory
Hardware Performance Techniques
Increasing Clock Rate
- More cycles per second = more instructions executed
- Direct performance improvement for single-threaded code
- Simple conceptually - just make everything faster
- Power consumption increases dramatically (P ∝ f³)
- Heat generation becomes unmanageable
- Requires faster transistors and better cooling
- Hit practical limits around 4-5 GHz ("frequency wall")
- Timing margins shrink, reliability concerns
Decreasing Silicon Feature Size
Feature size refers to the smallest elements that can be fabricated on a chip (measured in nanometers). Smaller = more transistors, lower power, higher speeds.
- More transistors per chip (Moore's Law)
- Lower capacitance = faster switching
- Lower voltage operation possible
- Reduced power consumption per transistor
- Smaller die size = lower cost per chip
- Leakage current increases (quantum tunneling)
- Manufacturing complexity and cost skyrocket
- Heat density increases (same power, smaller area)
- Reliability issues (electromigration, variability)
- Approaching physical limits (atomic scale)
Pipelined Datapath
Pipelining divides instruction execution into stages, allowing multiple instructions to be "in flight" simultaneously. Like an assembly line.
Time → 1 2 3 4 5 6 7 8
┌────┬────┬────┬────┬────┬────┬────┬────┐
Instr1│ IF │ ID │ EX │ MEM│ WB │ │ │ │
├────┼────┼────┼────┼────┼────┼────┼────┤
Instr2│ │ IF │ ID │ EX │ MEM│ WB │ │ │
├────┼────┼────┼────┼────┼────┼────┼────┤
Instr3│ │ │ IF │ ID │ EX │ MEM│ WB │ │
├────┼────┼────┼────┼────┼────┼────┼────┤
Instr4│ │ │ │ IF │ ID │ EX │ MEM│ WB │
└────┴────┴────┴────┴────┴────┴────┴────┘
IF = Instruction Fetch MEM = Memory Access
ID = Instruction Decode WB = Write Back
EX = Execute
- Higher throughput (1 instruction completing per cycle at steady state)
- Better hardware utilization
- Enables higher clock frequencies (shorter critical path per stage)
- Pipeline hazards (data, control, structural)
- Increased complexity
- Latency for single instruction unchanged
- Branch mispredictions cause pipeline flushes
- Power overhead from pipeline registers
Branch Prediction
Branch prediction guesses the outcome of conditional branches before they're resolved, keeping the pipeline full.
Static Prediction Methods
- Always Predict Not Taken: Simple, but wrong ~50% for general code
- Always Predict Taken: Better for loops (backward branches often taken)
- Backward Taken, Forward Not Taken (BTFNT): Exploits loop behavior
Dynamic Prediction Methods
- 1-Bit Predictor: Remember last outcome. Problem: double misprediction at loop boundaries.
- 2-Bit Saturating Counter: Needs two wrong predictions to change. Much better for loops.
- Correlating Predictors: Use history of recent branches to predict (captures patterns).
- Tournament Predictors: Multiple predictors compete; choose best per-branch.
Taken
┌───────────────────────────────┐
│ │
▼ Taken │
┌─────────┐ ─────────▶ ┌─────────┐
│ Strongly│ │ Weakly │
│ Taken │ ◀───────── │ Taken │
└─────────┘ Not Taken └─────────┘
│ ▲
│ │
│ Not Taken Taken │
▼ │
┌─────────┐ ─────────▶ ┌─────────┐
│ Strongly│ │ Weakly │
│Not Taken│ ◀───────── │Not Taken│
└─────────┘ Not Taken └─────────┘
▲ │
│ │
└───────────────────────────────┘
Not Taken
Comparing Branch Prediction Performance
| Predictor Type | Accuracy | Hardware Cost | Use Case |
|---|---|---|---|
| Static (BTFNT) | ~65% | None | Simple embedded |
| 1-Bit Local | ~80% | Low | Basic processors |
| 2-Bit Local | ~85% | Low | Most common |
| Correlating/Global | ~92% | Medium | Desktop CPUs |
| Tournament/Hybrid | ~95%+ | High | High-performance |
| Neural (TAGE-SC-L) | ~97%+ | Very High | Modern CPUs |
Datapath Implementation
Single-Cycle Datapath
In a single-cycle design, every instruction completes in exactly one clock cycle. The clock period must be long enough for the slowest instruction.
┌─────┐ ┌─────────┐
PC ────▶│ I │─── Instr ──────▶│ Control │──── Control Signals
│ Mem │ └─────────┘
└─────┘ │
│ ▼
│ ┌──────────────────────────────────────┐
│ │ Register File │
│ │ ┌────┐ ┌────┐ │
│ │ │Read│───▶ Data1 ───▶┐ │Write│ │
│ │ │Reg1│ │ │Data │ │
│ │ └────┘ │ └────┘ │
│ │ ┌────┐ │ ▲ │
│ │ │Read│───▶ Data2 ──┐ │ │ │
│ │ │Reg2│ │ │ │ │
│ │ └────┘ │ │ │ │
│ └─────────────────────┼─┼─────┼───────┘
│ │ │ │
│ ▼ ▼ │
│ ┌─────┐ │
│ │ ALU │───┼───────────▶ Result
│ └─────┘ │
│ │ │
│ ▼ │
│ ┌───────┐ │
│ │ Data │──┘
│ │Memory │
│ └───────┘
│
▼
┌─────────┐
│ PC + 4 │───▶ Next PC
└─────────┘
Clock period = time for slowest instruction (usually load: fetch + decode + ALU + memory + write-back). Faster instructions (like ADD) must wait the same time, wasting cycles.
Pipelined Datapath
Pipelining breaks execution into stages with pipeline registers between them. Each stage works on a different instruction simultaneously.
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ IF │──▶│ ID │──▶│ EX │──▶│ MEM │──▶│ WB │
│ │ │ │ │ │ │ │ │ │
│ Fetch │ │ Decode │ │Execute │ │ Memory │ │ Write │
│ Instr │ │ & Read │ │ ALU │ │ Access │ │ Back │
│ │ │ Regs │ │ │ │ │ │ │
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
│ │ │ │
▼ ▼ ▼ ▼
IF/ID ID/EX EX/MEM MEM/WB
Register Register Register Register
Pipeline Hazards
- Structural Hazards: Hardware resource conflict (e.g., single memory for instruction and data). Solution: separate I-cache and D-cache.
- Data Hazards: Instruction needs data not yet available. Solutions: forwarding/bypassing, stalling.
- Control Hazards: Branch/jump changes flow. Solutions: branch prediction, delayed branches.
Comparison
| Aspect | Single-Cycle | Pipelined |
|---|---|---|
| Clock Period | Long (slowest instruction) | Short (slowest stage) |
| Throughput | 1 instr per (long) cycle | ~1 instr per (short) cycle |
| Latency | 1 cycle | N cycles (N stages) |
| Complexity | Simple | Complex (hazard handling) |
| Hardware | Less | More (pipeline registers) |
Memory Hierarchy
Types of Memory
┌───────┐
│ CPU │
│ Regs │ ← Fastest, smallest, most expensive
└───┬───┘ ~1 cycle access
│
┌────┴────┐
│ L1 │ ← ~4 cycles, 32-64 KB
│ Cache │
└────┬────┘
│
┌─────┴─────┐
│ L2 │ ← ~10-20 cycles, 256 KB - 1 MB
│ Cache │
└─────┬─────┘
│
┌──────┴──────┐
│ L3 │ ← ~40-75 cycles, 4-64 MB
│ Cache │
└──────┬──────┘
│
┌─────────┴─────────┐
│ Main Memory │ ← ~100-300 cycles, 8-128 GB
│ (RAM) │
└─────────┬─────────┘
│
┌────────────┴────────────┐
│ Secondary Storage │ ← millions of cycles
│ (SSD / HDD) │ TB capacity
└─────────────────────────┘
| Memory Type | Technology | Speed | Volatility | Use |
|---|---|---|---|---|
| Registers | Flip-flops | ~0.25 ns | Volatile | Current operands |
| SRAM (Cache) | 6 transistors/bit | ~1-10 ns | Volatile | L1/L2/L3 cache |
| DRAM (RAM) | 1 transistor + capacitor | ~50-100 ns | Volatile | Main memory |
| Flash/SSD | Floating-gate transistors | ~25-100 μs | Non-volatile | Storage |
| HDD | Magnetic platters | ~5-10 ms | Non-volatile | Bulk storage |
How Different Memory Types Are Used
- Registers: Hold data actively being computed. Compiler allocates variables here when possible.
- L1 Cache: Split into I-cache (instructions) and D-cache (data). Holds most recently accessed data. Checked first on every memory access.
- L2/L3 Cache: Larger, slower backup. L3 often shared among cores. Catches misses from L1.
- Main Memory (RAM): Holds active programs and data. OS manages allocation. Accessed on cache miss.
- Virtual Memory: Uses disk as "backup" for RAM. Pages swapped in/out as needed.
- Storage: Persistent file storage. Programs and data loaded from here into RAM.
Temporal Locality: Recently accessed data likely to be accessed again soon.
Spatial Locality: Data near recently accessed data likely to be accessed soon.
Caches exploit both by keeping recent data and fetching in blocks (cache lines).
Calculating Cache Miss Delays
Cache performance significantly impacts overall system performance.
Example Calculation
Given:
- L1 cache hit time: 1 cycle
- L1 miss rate: 5%
- L2 cache hit time: 10 cycles
- L2 miss rate (of L1 misses): 20%
- Main memory access time: 200 cycles
For a Program
Given a program with 1000 memory references:
- 950 L1 hits (95%): 950 × 1 = 950 cycles
- 40 L2 hits (50 L1 misses × 80%): 40 × 10 = 400 cycles
- 10 Main memory accesses: 10 × 200 = 2000 cycles
Multiprocessor Architectures
Why Multiprocessors?
The transition to multiprocessor (multi-core) architectures happened because single-processor performance improvements hit fundamental limits:
The Power Wall
Power consumption and heat generation grew faster than performance. Increasing clock speed beyond ~4 GHz became impractical due to cooling limitations. Power ≈ Capacitance × Voltage² × Frequency.
The Memory Wall
Processor speed improved much faster than memory speed. CPUs spend increasing time waiting for data. Even with caches, memory latency limits single-thread performance.
The ILP Wall
Instruction-Level Parallelism has diminishing returns. Finding independent instructions to execute simultaneously becomes harder. Out-of-order execution complexity grows exponentially.
Instead of making one core faster, use multiple cores. Trade single-thread performance for multi-thread throughput. Let software exploit parallelism through threads/processes.
Types of Parallelism
Thread-Level Parallelism (TLP)
Multiple threads execute simultaneously on different cores. Requires parallel software design. Good for server workloads, scientific computing.
Data-Level Parallelism (DLP)
Same operation on multiple data elements (SIMD). GPUs excel at this. Good for graphics, machine learning, signal processing.
Multiprocessor Types
| Type | Description | Example |
|---|---|---|
| SMP (Symmetric) | All processors equal, shared memory | Typical desktop/laptop CPUs |
| NUMA | Non-uniform memory access times | Multi-socket servers |
| Heterogeneous | Different processor types (CPU + GPU) | Modern SoCs, game consoles |
| Cluster | Networked independent computers | Data centers, HPC |
Challenges
- Cache Coherence: Keeping caches consistent when multiple cores access shared data
- Synchronization: Coordinating access to shared resources (locks, barriers)
- Amdahl's Law: Speedup limited by sequential portion of code. If 10% is sequential, max speedup is 10× regardless of core count.
- Programming Difficulty: Parallel programming is harder than sequential. Race conditions, deadlocks, load balancing.
Where P = parallel fraction, N = number of processors