01

How Programs Execute on a Computer

Basic Components of CPU Design

┌─────────────────────────────────────────────────────────────────┐ │ CPU │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ Control │ │ ALU │ │ Registers │ │ │ │ Unit │◄──►│ (Arithmetic │◄──►│ (Fast │ │ │ │ (Fetches & │ │ Logic Unit) │ │ Storage) │ │ │ │ Decodes) │ │ │ │ │ │ │ └──────┬───────┘ └──────────────┘ └──────────────┘ │ │ │ ▲ ▲ │ │ ▼ │ │ │ │ ┌──────────────┐ │ │ │ │ │Program Counter│ └───────────────────┘ │ │ │ (PC) │ │ │ │ └──────────────┘ ▼ │ │ ┌──────────────┐ │ │ │ Data Bus │ │ └───────────────────────────┴──────────────┴───────────────────────┘ │ ▼ ┌──────────────┐ │ Memory │ └──────────────┘
  • Control Unit (CU) — Directs operations, fetches instructions, decodes opcodes, generates control signals
  • Arithmetic Logic Unit (ALU) — Performs arithmetic (+, -, ×, ÷) and logical (AND, OR, XOR, NOT) operations
  • Registers — Ultra-fast storage inside CPU (general purpose, special purpose like PC, SP, IR)
  • Program Counter (PC) — Holds address of next instruction to execute
  • Instruction Register (IR) — Holds currently executing instruction
  • Memory Address Register (MAR) — Holds address for memory access
  • Memory Data Register (MDR) — Holds data being transferred to/from memory
  • Stack Pointer (SP) — Points to top of the stack in memory

Program Compilation Process

┌─────────────────┐ │ Source Code │ (.c, .cpp, .java) │ (High-Level) │ └────────┬────────┘ │ ▼ [Preprocessor] ┌─────────────────┐ │ Expanded Code │ (includes, macros resolved) └────────┬────────┘ │ ▼ [Compiler] ┌─────────────────┐ │ Assembly Code │ (.s, .asm) └────────┬────────┘ │ ▼ [Assembler] ┌─────────────────┐ │ Object Code │ (.o, .obj) - machine code + symbols └────────┬────────┘ │ ▼ [Linker] ┌─────────────────┐ │ Executable │ (.exe, .out) - complete binary └────────┬────────┘ │ ▼ [Loader] ┌─────────────────┐ │ Process in │ Running in RAM │ Memory │ └─────────────────┘
The compiler translates high-level to assembly. The assembler converts assembly to machine code. The linker combines object files and resolves external references.

Stages of the Instruction Cycle (Fetch-Decode-Execute)

FETCH
Get instruction from memory at PC address
DECODE
Interpret opcode & identify operands
EXECUTE
Perform the operation (ALU, branch, etc.)
MEMORY
Access memory if needed (load/store)
WRITEBACK
Write result to register

Detailed Breakdown

Stage Actions Hardware Used
FETCH MAR ← PC; IR ← Memory[MAR]; PC ← PC + 4 PC, MAR, MDR, IR, Memory
DECODE Extract opcode, read source registers, sign-extend immediate Control Unit, Register File
EXECUTE ALU computes result or calculates branch target address ALU, Branch Unit
MEMORY Load: read from memory; Store: write to memory Data Memory, Cache
WRITEBACK Write ALU result or loaded data to destination register Register File
02

Hardware Factors & Performance

Measuring & Comparing Performance

CPU Performance Equation
CPU Time = Instructions × CPI × Clock Period
CPU Time = (Instruction Count × CPI) / Clock Rate
  • Instruction Count — Total instructions executed (depends on ISA, compiler)
  • CPI (Cycles Per Instruction) — Average clock cycles per instruction
  • Clock Rate (Hz) — Cycles per second (e.g., 3 GHz = 3×10⁹ cycles/sec)
  • Clock Period — Time for one cycle = 1/Clock Rate
MIPS (Million Instructions Per Second)
MIPS = Instruction Count / (Execution Time × 10⁶)
MIPS = Clock Rate / (CPI × 10⁶)
⚠ MIPS Pitfalls
MIPS can be misleading — different ISAs have different instruction complexities. A CISC machine might do more work per instruction than a RISC machine.

Propagation Delay

Definition: Time for a signal to travel through a circuit component from input to stable output.

Input ──┐ ┌── Output │ Component │ └──────────────┘ │←── t_pd ────→│ t_pd = propagation delay
  • Gate Delay — Time through a single logic gate (typically 1-10 ns)
  • Wire Delay — Signal travel time through interconnects
  • Critical Path — Longest delay path determines max clock speed
Maximum Clock Frequency
f_max = 1 / t_critical_path

Clock Period ≥ t_propagation + t_setup
Reducing propagation delay allows higher clock frequencies but may require more power or better silicon technology.

Influence of Clock Speed on Performance

Clock speed (frequency) determines how many cycles execute per second.

✓ Higher Clock Speed Benefits
  • More operations per second
  • Faster instruction throughput
  • Better single-thread performance
✗ Higher Clock Speed Costs
  • Increased power consumption (P ∝ f)
  • More heat generation
  • Higher voltage requirements
  • Diminishing returns
Power-Frequency Relationship
P_dynamic = α × C × V² × f

Where: α = activity factor, C = capacitance,
V = voltage, f = frequency

Computer Architecture & Performance

Architecture Types

Aspect RISC CISC
Instructions Simple, fixed-length Complex, variable-length
CPI ~1 (ideally) Variable (1-20+)
Memory Access Load/Store only Memory operands allowed
Registers Many (32+) Fewer
Pipelining Easy to pipeline Harder to pipeline
Examples ARM, MIPS, RISC-V x86, x86-64

Relationship Between Architecture & Instruction Set

Instruction Set Architecture (ISA) is the interface between hardware and software — the "contract" specifying:

  • Available instructions and their encoding
  • Registers and their purposes
  • Addressing modes
  • Data types and sizes
  • Memory model

Microarchitecture is how the ISA is implemented in hardware:

  • Pipeline depth and organization
  • Cache sizes and hierarchy
  • Branch prediction strategy
  • Out-of-order execution
  • Superscalar width
Same ISA (e.g., x86-64) can have different microarchitectures (Intel Skylake vs AMD Zen) with vastly different performance characteristics.

Predicting Execution Time

Given CPU architecture details, predict execution time by considering:

Datapath Design
  • Single-cycle: CPI = 1, but slow clock
  • Multi-cycle: Variable CPI, faster clock
  • Pipelined: CPI ≈ 1, fast clock, hazards add stalls
Branch Prediction Impact
  • Misprediction penalty = pipeline depth
  • 90% accuracy with 10-stage pipeline:
  • Extra CPI = 0.10 × 10 = 1.0 cycles/branch
Effective CPI Calculation
CPI_effective = CPI_base + (branch_freq × mispredict_rate × penalty)
+ (mem_access_freq × miss_rate × miss_penalty)

Example: CPI = 1.0 + (0.2 × 0.1 × 10) + (0.3 × 0.05 × 100)
= 1.0 + 0.2 + 1.5 = 2.7 cycles/instruction
03

Data Representation & Addressing Modes

Binary & Hexadecimal Number Systems

Decimal Binary Hex Decimal Binary Hex
000000810008
100011910019
200102101010A
300113111011B
401004121100C
501015131101D
601106141110E
701117151111F
Each hex digit = exactly 4 binary bits. This is why hex is used — compact representation of binary.
Powers of 2 (Memorize These!)
2⁰=1 2¹=2 2²=4 2³=8 2⁴=16 2⁵=32
2⁶=64 2⁷=128 2⁸=256 2⁹=512 2¹⁰=1024
2¹⁶=65,536 2³²=4,294,967,296

Binary Addition & Subtraction

Binary Addition Rules
0 + 0 = 0 0 + 1 = 1 1 + 0 = 1 1 + 1 = 0 (carry 1) 1 + 1 + 1 = 1 (carry 1)
Example: 1011 + 0110
1 1 1 ← carries 1 0 1 1 + 0 1 1 0 --------- 1 0 0 0 1 = 17₁₀ (11 + 6 = 17 ✓)
Binary Subtraction (Using 2's Complement)

A - B = A + (-B) = A + (2's complement of B)

Example: 1010 - 0011 (10 - 3 = 7)
2's complement of 0011: Invert: 1100 Add 1: 1101 Now add: 1010 + 1101 1 1 1 0 1 0 + 1 1 0 1 --------- 1 0 1 1 1 (ignore overflow bit) = 0111 = 7₁₀ ✓

Decimal ↔ Binary Conversion

Decimal to Binary (Divide by 2)
Convert 45₁₀ to binary:
45 ÷ 2 = 22 remainder 1 ↑ 22 ÷ 2 = 11 remainder 0 │ 11 ÷ 2 = 5 remainder 1 │ Read 5 ÷ 2 = 2 remainder 1 │ bottom 2 ÷ 2 = 1 remainder 0 │ to top 1 ÷ 2 = 0 remainder 1 │ Answer: 101101₂
Binary to Decimal (Positional Weights)
Convert 101101₂ to decimal:
Position: 5 4 3 2 1 0 Binary: 1 0 1 1 0 1 Weight: 32 16 8 4 2 1 = 1×32 + 0×16 + 1×8 + 1×4 + 0×2 + 1×1 = 32 + 8 + 4 + 1 = 45₁₀ ✓

Decimal ↔ Hexadecimal Conversion

Decimal to Hex (Divide by 16)
Convert 255₁₀ to hex:
255 ÷ 16 = 15 remainder 15 (F) ↑ 15 ÷ 16 = 0 remainder 15 (F) │ Answer: FF₁₆ (or 0xFF)
Hex to Decimal
Convert 2A3₁₆ to decimal:
2A3₁₆ = 2×16² + A×16¹ + 3×16⁰ = 2×256 + 10×16 + 3×1 = 512 + 160 + 3 = 675₁₀
Binary ↔ Hex (Group by 4)
Binary to Hex: 11010110₂
1101 0110 D 6 → D6₁₆

Signed Binary Representation

Two's Complement (Standard Method)

For n-bit number, range is: -2^(n-1) to 2^(n-1)-1

8-bit range: -128 to +127

Converting -45 to 8-bit two's complement:
Step 1: +45 in binary = 00101101 Step 2: Invert all bits = 11010010 Step 3: Add 1 = 11010011 -45₁₀ = 11010011₂
1
1
0
1
0
0
1
1

MSB (sign bit): 1 = negative, 0 = positive

Two's Complement Values (8-bit)
Binary Unsigned Signed
0000000000
000000011+1
01111111127+127
10000000128-128
10000001129-127
11111111255-1
Reading negative: 11010011₂
MSB is 1 → negative number Invert: 00101100 Add 1: 00101101 = 45 Answer: -45

Big-Endian vs Little-Endian

How multi-byte values are stored in memory.

Example: Store 0x12345678 at address 0x100

Big-Endian (MSB First) Address: 0x100 0x101 0x102 0x103 Data: 12 34 56 78 [MSB] [LSB] Little-Endian (LSB First) Address: 0x100 0x101 0x102 0x103 Data: 78 56 34 12 [LSB] [MSB]
Type Description Used By
Big-Endian Most significant byte at lowest address Network protocols, MIPS, SPARC
Little-Endian Least significant byte at lowest address x86, x86-64, ARM (configurable)

Data Overflow & Underflow

⚠ Overflow
Occurs when result is too large for the number of bits.
Unsigned Overflow
8-bit max = 255 (11111111) 255 + 1 = 256... but 256 = 100000000 (9 bits!) Result wraps to 0 (overflow)
Signed Overflow Detection
  • Positive + Positive = Negative → Overflow!
  • Negative + Negative = Positive → Overflow!
  • Positive + Negative → Cannot overflow
Example: 8-bit signed, 100 + 50 01100100 (100) + 00110010 (50) = 10010110 = -106 in signed! (overflow)
Underflow is the opposite — result too small (negative overflow). Same detection rules apply.

Assembly Language to Machine Language Translation

The assembler converts human-readable assembly into binary machine code using the ISA specification.

MIPS Instruction Formats (32-bit)
R-Type (Register): ┌────────┬───────┬───────┬───────┬───────┬────────┐ │ opcode │ rs │ rt │ rd │ shamt │ funct │ │ 6 bits │ 5 bits│ 5 bits│ 5 bits│ 5 bits│ 6 bits │ └────────┴───────┴───────┴───────┴───────┴────────┘ I-Type (Immediate): ┌────────┬───────┬───────┬─────────────────────────┐ │ opcode │ rs │ rt │ immediate │ │ 6 bits │ 5 bits│ 5 bits│ 16 bits │ └────────┴───────┴───────┴─────────────────────────┘ J-Type (Jump): ┌────────┬────────────────────────────────────────┐ │ opcode │ address │ │ 6 bits │ 26 bits │ └────────┴────────────────────────────────────────┘
Translation Example
# Assembly instruction: add $t0, $s1, $s2 # Register numbers: # $t0 = $8, $s1 = $17, $s2 = $18 # R-Type encoding: # opcode=0, rs=$s1=17, rt=$s2=18, rd=$t0=8, shamt=0, funct=32 # Binary: # 000000 10001 10010 01000 00000 100000 # Hex: 0x02324020
Field Value Binary
opcode0000000
rs ($s1)1710001
rt ($s2)1810010
rd ($t0)801000
shamt000000
funct (add)32100000

Addressing Modes

Mode Description Example Effective Address
Immediate Operand is part of instruction addi $t0, $t1, 5 N/A (value = 5)
Register Operand is in a register add $t0, $t1, $t2 N/A (value in $t2)
Base/Displacement Base register + offset lw $t0, 8($s0) EA = $s0 + 8
PC-Relative PC + offset (for branches) beq $t0, $t1, label EA = PC + 4 + offset×4
Pseudo-Direct Upper PC bits + address field j target EA = PC[31:28] | addr | 00
04

Assembly Language Programming

MIPS Register Convention

Name Number Usage Preserved?
$zero$0Constant 0N/A
$at$1Assembler temporaryNo
$v0-$v1$2-$3Return valuesNo
$a0-$a3$4-$7ArgumentsNo
$t0-$t7$8-$15TemporariesNo
$s0-$s7$16-$23Saved registersYes
$t8-$t9$24-$25More temporariesNo
$gp$28Global pointerYes
$sp$29Stack pointerYes
$fp$30Frame pointerYes
$ra$31Return addressYes

Essential MIPS Instructions

Arithmetic
add $d, $s, $t # $d = $s + $t addi $t, $s, imm # $t = $s + imm sub $d, $s, $t # $d = $s - $t mul $d, $s, $t # $d = $s × $t (low 32 bits) div $s, $t # lo = $s / $t, hi = $s % $t mflo $d # $d = lo (quotient) mfhi $d # $d = hi (remainder)
Logical
and $d, $s, $t # $d = $s AND $t andi $t, $s, imm # $t = $s AND imm or $d, $s, $t # $d = $s OR $t ori $t, $s, imm # $t = $s OR imm xor $d, $s, $t # $d = $s XOR $t nor $d, $s, $t # $d = NOT($s OR $t) sll $d, $t, shamt # $d = $t << shamt srl $d, $t, shamt # $d = $t >> shamt (logical) sra $d, $t, shamt # $d = $t >> shamt (arithmetic)

Memory Access Instructions

# Load instructions lw $t, offset($s) # Load word: $t = Mem[$s + offset] lh $t, offset($s) # Load halfword (sign-extended) lhu $t, offset($s) # Load halfword (zero-extended) lb $t, offset($s) # Load byte (sign-extended) lbu $t, offset($s) # Load byte (zero-extended) # Store instructions sw $t, offset($s) # Store word: Mem[$s + offset] = $t sh $t, offset($s) # Store halfword sb $t, offset($s) # Store byte # Load address (pseudo-instruction) la $t, label # $t = address of label
MIPS uses byte addressing. Words are 4 bytes, so word addresses must be divisible by 4 (aligned).

Mathematical Expressions in Assembly

Example: f = (g + h) - (i + j)

Assume: g→$s1, h→$s2, i→$s3, j→$s4, f→$s0

# f = (g + h) - (i + j) add $t0, $s1, $s2 # $t0 = g + h add $t1, $s3, $s4 # $t1 = i + j sub $s0, $t0, $t1 # f = $t0 - $t1
Example: f = 2*g + h/4
# f = 2*g + h/4 sll $t0, $s1, 1 # $t0 = g * 2 (shift left) sra $t1, $s2, 2 # $t1 = h / 4 (shift right) add $s0, $t0, $t1 # f = $t0 + $t1
Multiply/divide by powers of 2 using shifts: sll by n = multiply by 2^n, srl/sra by n = divide by 2^n

Decisions (Conditional Branching)

# Branch instructions beq $s, $t, label # if ($s == $t) goto label bne $s, $t, label # if ($s != $t) goto label # Set on less than slt $d, $s, $t # $d = ($s < $t) ? 1 : 0 slti $t, $s, imm # $t = ($s < imm) ? 1 : 0 sltu $d, $s, $t # unsigned comparison # Unconditional jump j label # goto label
If-Else Example

C Code:

if (i == j) f = g + h; else f = g - h;

MIPS Assembly:

bne $s3, $s4, Else add $s0, $s1, $s2 j Exit Else: sub $s0, $s1, $s2 Exit:

Loops in Assembly

While Loop

C Code:

while (save[i] == k) i += 1;

MIPS: (i→$s3, k→$s5, save base→$s6)

Loop: sll $t1, $s3, 2 # $t1 = i * 4 add $t1, $t1, $s6 # $t1 = &save[i] lw $t0, 0($t1) # $t0 = save[i] bne $t0, $s5, Exit # if != k, exit addi $s3, $s3, 1 # i++ j Loop Exit:
For Loop

C Code:

// Sum array elements int sum = 0; for (int i = 0; i < n; i++) sum += arr[i];

MIPS:

li $t0, 0 # sum = 0 li $t1, 0 # i = 0 Loop: bge $t1, $s1, Exit # if i >= n, exit sll $t2, $t1, 2 # $t2 = i * 4 add $t2, $t2, $s0 # $t2 = &arr[i] lw $t3, 0($t2) # $t3 = arr[i] add $t0, $t0, $t3 # sum += arr[i] addi $t1, $t1, 1 # i++ j Loop Exit:

Procedures (Functions)

Procedure Call Steps
  1. Place arguments in $a0-$a3
  2. Transfer control: jal ProcedureLabel
  3. Acquire storage (if needed)
  4. Perform procedure's task
  5. Place result in $v0-$v1
  6. Return: jr $ra
# Caller code li $a0, 5 # arg1 = 5 li $a1, 3 # arg2 = 3 jal multiply # call function move $s0, $v0 # save result
Procedure Implementation
# int multiply(int a, int b) multiply: # Leaf procedure - no stack needed mul $v0, $a0, $a1 # $v0 = a * b jr $ra # return
Stack Management for Non-Leaf Procedures
# int factorial(int n) factorial: # Prologue - save registers addi $sp, $sp, -8 # allocate stack space sw $ra, 4($sp) # save return address sw $a0, 0($sp) # save n # Base case: if (n <= 1) return 1 slti $t0, $a0, 2 # $t0 = (n < 2) beq $t0, $zero, recurse li $v0, 1 # return 1 j fact_exit recurse: # Recursive case: return n * factorial(n-1) addi $a0, $a0, -1 # n - 1 jal factorial # factorial(n-1) lw $a0, 0($sp) # restore n mul $v0, $a0, $v0 # n * factorial(n-1) fact_exit: # Epilogue - restore registers lw $ra, 4($sp) # restore return address addi $sp, $sp, 8 # deallocate stack jr $ra # return
Stack Layout During Recursive Call: High Address ┌──────────────┐ │ ... │ ├──────────────┤ │ saved $ra │ ← $sp + 4 ├──────────────┤ │ saved $a0 │ ← $sp ├──────────────┤ │ ... │ └──────────────┘ Low Address Stack grows downward (toward lower addresses)

System Calls (SPIM/MARS)

Service $v0 Arguments Result
Print int1$a0 = integer-
Print string4$a0 = address-
Read int5-$v0 = integer
Read string8$a0=buffer, $a1=len-
Exit10--
Print char11$a0 = char-
Read char12-$v0 = char
# Print "Hello" and read an integer .data msg: .asciiz "Enter a number: " .text li $v0, 4 # print_string la $a0, msg syscall li $v0, 5 # read_int syscall move $s0, $v0 # save input in $s0
05

Hardware Performance Techniques

Increasing Clock Rate

✓ Advantages
  • More cycles per second
  • Faster single-thread execution
  • Direct performance boost
  • No software changes needed
✗ Disadvantages
  • Power increases linearly with freq
  • Heat dissipation problems
  • Signal integrity issues
  • Diminishing returns (power wall)

Decreasing Silicon Feature Size

✓ Advantages
  • More transistors per chip
  • Lower capacitance → less power
  • Faster switching speed
  • Lower voltage operation
✗ Disadvantages
  • Increased leakage current
  • Manufacturing complexity
  • Quantum effects at small scales
  • Higher defect rates
  • Extreme cost of fab facilities
Moore's Law: Transistor count doubles ~every 2 years. Currently at 3-5nm process nodes.

Pipelined Datapath

✓ Advantages
  • Higher throughput (instructions/second)
  • Better hardware utilization
  • Allows higher clock frequency
  • Ideally CPI approaches 1
✗ Disadvantages
  • Pipeline hazards cause stalls
  • Increased latency per instruction
  • More complex control logic
  • Branch misprediction penalty
Pipeline Hazards
Hazard Type Cause Solution
Structural Hardware resource conflict Duplicate resources, stall
Data Instruction depends on prior result Forwarding, stall, reorder
Control Branch outcome unknown Branch prediction, delay slot
Pipeline Execution (5-stage): Cycle: 1 2 3 4 5 6 7 8 9 Instr 1: [IF][ID][EX][MEM][WB] Instr 2: [IF][ID][EX][MEM][WB] Instr 3: [IF][ID][EX][MEM][WB] Instr 4: [IF][ID][EX][MEM][WB] Instr 5: [IF][ID][EX][MEM][WB] Without pipeline: 5 instructions × 5 cycles = 25 cycles With pipeline: 5 instructions in 9 cycles (after filling)

Branch Prediction Methods

Method Description Typical Accuracy Cost
Static: Always Not Taken Predict branch not taken ~30-40% None
Static: Always Taken Predict branch taken ~60-70% None
Static: Backward Taken Backward=taken (loops), Forward=not taken ~65-75% Low
1-bit Dynamic Remember last outcome ~80-85% Medium
2-bit Saturating Counter 4 states: strongly/weakly taken/not ~85-92% Medium
Correlating/Two-Level Uses global/local branch history ~93-97% High
Tournament Multiple predictors, choose best ~95-98% Very High
2-bit Saturating Counter State Machine
Taken Taken ┌────────────┐ ┌────────────┐ │ ▼ │ ▼ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Strongly│ │ Weakly │ │ Weakly │ │ Strongly│ │ Not │◄─────│ Not │◄─────│ Taken │◄─────│ Taken │ │ Taken │ Not │ Taken │ Not │ │ Not │ │ │ 00 │Taken │ 01 │Taken │ 10 │Taken │ 11 │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │ │ Predict: NOT TAKEN Predict: TAKEN │ └───────────────────────────────────────────────────┘
Misprediction Penalty
Extra cycles = misprediction_rate × pipeline_stages_to_flush

Example: 5% mispredict, 10-stage pipeline
Penalty = 0.05 × 10 = 0.5 cycles per branch
06

Datapath Design

Single-Cycle vs Pipelined Design

Aspect Single-Cycle Multi-Cycle Pipelined
CPI 1 Variable (3-5) ~1 (ideal)
Clock Period Longest instruction Longest stage Longest stage
Throughput Low Medium High
Hardware Duplicated (ALU, memory) Shared Pipeline registers
Control Simple FSM-based Complex (hazards)
Single-Cycle Datapath
┌─────────┐ PC ──────►│Instruction│──────────────────────────────────────┐ │ Memory │ │ └─────────┬─┘ │ │ │ ▼ │ ┌─────────────────┐ │ │ Control Unit │────────────────────┐ │ └─────────────────┘ │ │ │ │ │ ┌──────────────┼──────────────┐ │ │ │ │ │ │ │ ▼ ▼ ▼ │ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │ │ Register│ │ Sign │ │ Register│ │ │ │ Read │ │ Extend │ │ Write │ │ │ └────┬────┘ └────┬────┘ └────▲────┘ │ │ │ │ │ │ │ └──────┬──────┘ │ │ │ │ │ │ │ ▼ │ │ │ ┌─────────┐ │ │ │ │ ALU │──────────────┤ │ │ └────┬────┘ │ │ │ │ │ │ │ ▼ │ │ │ ┌─────────┐ │ │ │ │ Data │─────────────┘ │ │ │ Memory │ │ │ └─────────┘ │ │ │ │ Clock: All operations complete in ONE cycle │
Pipelined Datapath
IF/ID ID/EX EX/MEM MEM/WB │ │ │ │ ▼ ▼ ▼ ▼ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ Instruction│ │ Decode │ │ Execute │ │ Memory │ │ WriteBack│ │ Fetch │──│ + Reg │──│ (ALU) │──│ Access │──│ (to Reg) │ │ │ │ Read │ │ │ │ │ │ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │ │ │ │ └─────────────┴─────────────┴─────────────┴─────────────┘ Pipeline Registers (store intermediate results)
Performance Comparison
Single-cycle time: T_single = longest_instruction_time
Pipelined time: T_pipelined = longest_stage_time + register_overhead

Speedup ≈ number_of_stages (ideal, no hazards)
Actual speedup = stages / (1 + stall_cycles_per_instruction)
07

Memory Hierarchy

Types of Memory

Registers (~1ns)
L1 Cache (~2-4ns)
L2 Cache (~10ns)
L3 Cache (~40ns)
Main Memory DRAM (~100ns)
Level Size Speed Cost/GB
Registers~1 KB< 1 ns$$$$$
L1 Cache32-64 KB~2 ns$$$$
L2 Cache256 KB-1 MB~10 ns$$$
L3 Cache8-64 MB~40 ns$$
DRAM8-128 GB~100 ns$
SSD256 GB-8 TB~100 μs¢
HDD1-20 TB~10 ms¢¢

How Memory Types Are Used

  • Registers: Currently executing instruction's operands
  • L1 Cache: Split I-cache and D-cache, most recently used code/data
  • L2 Cache: Unified, backs up L1 misses
  • L3 Cache: Shared among cores, last level before DRAM
  • DRAM: Running programs, OS, active data
  • Disk/SSD: Virtual memory, persistent storage
Principle of Locality: Temporal (recently used → likely reused) and Spatial (nearby addresses → likely accessed soon)

Calculating Cache Miss Delays

Average Memory Access Time (AMAT)
AMAT = Hit_time + (Miss_rate × Miss_penalty)

For multi-level cache:
AMAT = L1_hit_time + L1_miss_rate × (L2_hit_time + L2_miss_rate × L2_miss_penalty)
Example Calculation
Given:
  • L1 hit time = 1 cycle
  • L1 miss rate = 5%
  • L2 hit time = 10 cycles
  • L2 miss rate = 20% (of L1 misses)
  • Memory access = 100 cycles

Calculate AMAT: AMAT = 1 + 0.05 × (10 + 0.20 × 100) = 1 + 0.05 × (10 + 20) = 1 + 0.05 × 30 = 1 + 1.5 = 2.5 cycles
CPI with Memory Stalls
CPI_total = CPI_base + Memory_stall_cycles

Memory_stall_cycles = (Reads/Instruction × Read_miss_rate × Miss_penalty)
+ (Writes/Instruction × Write_miss_rate × Miss_penalty)

Cache Organization

Type Description Pros Cons
Direct Mapped Each block maps to exactly one cache line Simple, fast lookup Conflict misses
Fully Associative Block can go anywhere in cache No conflict misses Expensive lookup
Set Associative Block maps to a set, can go in any line within set Balance of both Moderate complexity
Address breakdown for cache access: │◄────────────── Address ──────────────►│ ┌────────────────┬───────────────┬──────┐ │ Tag │ Index │Offset│ │ │ │ │ └────────────────┴───────────────┴──────┘ │ │ │ │ │ └─► Byte within block │ └─────────────► Which set/line └─────────────────────────────► Verify correct block For direct-mapped cache with 2^n blocks, block size 2^m bytes: - Offset bits = m (selects byte in block) - Index bits = n (selects cache line) - Tag bits = address_bits - n - m
08

Transition to Multiprocessor Architectures

Why Multiprocessors?

The Power Wall

Around 2004, single-core frequency scaling hit physical limits:

  • Power consumption grew faster than performance gains
  • Heat dissipation became unmanageable
  • Voltage couldn't decrease further without reliability issues
  • Dennard scaling ended
The ILP Wall

Diminishing returns from single-thread optimizations:

  • Branch prediction accuracy plateaued
  • Memory latency gap widened
  • Out-of-order windows limited by power
  • Instruction-level parallelism exhausted
Solution: Instead of making one core faster, use multiple cores working in parallel. Performance scales with parallelizable workload.
Multiprocessor Types
Type Description Examples
SMP (Symmetric) Shared memory, equal access time Multi-core desktops
NUMA Shared memory, non-uniform access time Multi-socket servers
Cluster Distributed memory, message passing Supercomputers, cloud
Amdahl's Law
Speedup = 1 / ((1 - P) + P/N)

Where: P = parallelizable fraction, N = number of processors

Example: 90% parallelizable, 4 cores
Speedup = 1 / (0.1 + 0.9/4) = 1 / 0.325 = 3.08×
⚠ Challenges of Multiprocessing
  • Cache Coherence: Multiple caches must stay synchronized
  • Synchronization: Locks, barriers, atomic operations needed
  • Load Balancing: Work must be evenly distributed
  • Communication Overhead: Data sharing has latency cost
  • Software Complexity: Parallel programming is hard!