How Programs Execute on a Computer
Basic Components of CPU Design
- Control Unit (CU) — Directs operations, fetches instructions, decodes opcodes, generates control signals
- Arithmetic Logic Unit (ALU) — Performs arithmetic (+, -, ×, ÷) and logical (AND, OR, XOR, NOT) operations
- Registers — Ultra-fast storage inside CPU (general purpose, special purpose like PC, SP, IR)
- Program Counter (PC) — Holds address of next instruction to execute
- Instruction Register (IR) — Holds currently executing instruction
- Memory Address Register (MAR) — Holds address for memory access
- Memory Data Register (MDR) — Holds data being transferred to/from memory
- Stack Pointer (SP) — Points to top of the stack in memory
Program Compilation Process
Stages of the Instruction Cycle (Fetch-Decode-Execute)
Detailed Breakdown
| Stage | Actions | Hardware Used |
|---|---|---|
FETCH |
MAR ← PC; IR ← Memory[MAR]; PC ← PC + 4 | PC, MAR, MDR, IR, Memory |
DECODE |
Extract opcode, read source registers, sign-extend immediate | Control Unit, Register File |
EXECUTE |
ALU computes result or calculates branch target address | ALU, Branch Unit |
MEMORY |
Load: read from memory; Store: write to memory | Data Memory, Cache |
WRITEBACK |
Write ALU result or loaded data to destination register | Register File |
Hardware Factors & Performance
Measuring & Comparing Performance
CPU Time = (Instruction Count × CPI) / Clock Rate
- Instruction Count — Total instructions executed (depends on ISA, compiler)
- CPI (Cycles Per Instruction) — Average clock cycles per instruction
- Clock Rate (Hz) — Cycles per second (e.g., 3 GHz = 3×10⁹ cycles/sec)
- Clock Period — Time for one cycle = 1/Clock Rate
MIPS = Clock Rate / (CPI × 10⁶)
Propagation Delay
Definition: Time for a signal to travel through a circuit component from input to stable output.
- Gate Delay — Time through a single logic gate (typically 1-10 ns)
- Wire Delay — Signal travel time through interconnects
- Critical Path — Longest delay path determines max clock speed
Clock Period ≥ t_propagation + t_setup
Influence of Clock Speed on Performance
Clock speed (frequency) determines how many cycles execute per second.
- More operations per second
- Faster instruction throughput
- Better single-thread performance
- Increased power consumption (P ∝ f)
- More heat generation
- Higher voltage requirements
- Diminishing returns
Where: α = activity factor, C = capacitance,
V = voltage, f = frequency
Computer Architecture & Performance
Architecture Types
| Aspect | RISC | CISC |
|---|---|---|
| Instructions | Simple, fixed-length | Complex, variable-length |
| CPI | ~1 (ideally) | Variable (1-20+) |
| Memory Access | Load/Store only | Memory operands allowed |
| Registers | Many (32+) | Fewer |
| Pipelining | Easy to pipeline | Harder to pipeline |
| Examples | ARM, MIPS, RISC-V | x86, x86-64 |
Relationship Between Architecture & Instruction Set
Instruction Set Architecture (ISA) is the interface between hardware and software — the "contract" specifying:
- Available instructions and their encoding
- Registers and their purposes
- Addressing modes
- Data types and sizes
- Memory model
Microarchitecture is how the ISA is implemented in hardware:
- Pipeline depth and organization
- Cache sizes and hierarchy
- Branch prediction strategy
- Out-of-order execution
- Superscalar width
Predicting Execution Time
Given CPU architecture details, predict execution time by considering:
- Single-cycle: CPI = 1, but slow clock
- Multi-cycle: Variable CPI, faster clock
- Pipelined: CPI ≈ 1, fast clock, hazards add stalls
- Misprediction penalty = pipeline depth
- 90% accuracy with 10-stage pipeline:
- Extra CPI = 0.10 × 10 = 1.0 cycles/branch
+ (mem_access_freq × miss_rate × miss_penalty)
Example: CPI = 1.0 + (0.2 × 0.1 × 10) + (0.3 × 0.05 × 100)
= 1.0 + 0.2 + 1.5 = 2.7 cycles/instruction
Data Representation & Addressing Modes
Binary & Hexadecimal Number Systems
| Decimal | Binary | Hex | Decimal | Binary | Hex |
|---|---|---|---|---|---|
| 0 | 0000 | 0 | 8 | 1000 | 8 |
| 1 | 0001 | 1 | 9 | 1001 | 9 |
| 2 | 0010 | 2 | 10 | 1010 | A |
| 3 | 0011 | 3 | 11 | 1011 | B |
| 4 | 0100 | 4 | 12 | 1100 | C |
| 5 | 0101 | 5 | 13 | 1101 | D |
| 6 | 0110 | 6 | 14 | 1110 | E |
| 7 | 0111 | 7 | 15 | 1111 | F |
2⁶=64 2⁷=128 2⁸=256 2⁹=512 2¹⁰=1024
2¹⁶=65,536 2³²=4,294,967,296
Binary Addition & Subtraction
0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 0 (carry 1)
1 + 1 + 1 = 1 (carry 1)
1 1 1 ← carries
1 0 1 1
+ 0 1 1 0
---------
1 0 0 0 1 = 17₁₀ (11 + 6 = 17 ✓)
A - B = A + (-B) = A + (2's complement of B)
2's complement of 0011:
Invert: 1100
Add 1: 1101
Now add: 1010 + 1101
1 1
1 0 1 0
+ 1 1 0 1
---------
1 0 1 1 1 (ignore overflow bit)
= 0111 = 7₁₀ ✓
Decimal ↔ Binary Conversion
45 ÷ 2 = 22 remainder 1 ↑
22 ÷ 2 = 11 remainder 0 │
11 ÷ 2 = 5 remainder 1 │ Read
5 ÷ 2 = 2 remainder 1 │ bottom
2 ÷ 2 = 1 remainder 0 │ to top
1 ÷ 2 = 0 remainder 1 │
Answer: 101101₂
Position: 5 4 3 2 1 0
Binary: 1 0 1 1 0 1
Weight: 32 16 8 4 2 1
= 1×32 + 0×16 + 1×8 + 1×4 + 0×2 + 1×1
= 32 + 8 + 4 + 1 = 45₁₀ ✓
Decimal ↔ Hexadecimal Conversion
255 ÷ 16 = 15 remainder 15 (F) ↑
15 ÷ 16 = 0 remainder 15 (F) │
Answer: FF₁₆ (or 0xFF)
2A3₁₆ = 2×16² + A×16¹ + 3×16⁰
= 2×256 + 10×16 + 3×1
= 512 + 160 + 3
= 675₁₀
1101 0110
D 6 → D6₁₆
Signed Binary Representation
For n-bit number, range is: -2^(n-1) to 2^(n-1)-1
8-bit range: -128 to +127
Step 1: +45 in binary = 00101101
Step 2: Invert all bits = 11010010
Step 3: Add 1 = 11010011
-45₁₀ = 11010011₂
MSB (sign bit): 1 = negative, 0 = positive
| Binary | Unsigned | Signed |
|---|---|---|
00000000 | 0 | 0 |
00000001 | 1 | +1 |
01111111 | 127 | +127 |
10000000 | 128 | -128 |
10000001 | 129 | -127 |
11111111 | 255 | -1 |
MSB is 1 → negative number
Invert: 00101100
Add 1: 00101101 = 45
Answer: -45
Big-Endian vs Little-Endian
How multi-byte values are stored in memory.
Example: Store 0x12345678 at address 0x100
| Type | Description | Used By |
|---|---|---|
| Big-Endian | Most significant byte at lowest address | Network protocols, MIPS, SPARC |
| Little-Endian | Least significant byte at lowest address | x86, x86-64, ARM (configurable) |
Data Overflow & Underflow
8-bit max = 255 (11111111)
255 + 1 = 256... but 256 = 100000000 (9 bits!)
Result wraps to 0 (overflow)
- Positive + Positive = Negative → Overflow!
- Negative + Negative = Positive → Overflow!
- Positive + Negative → Cannot overflow
01100100 (100)
+ 00110010 (50)
= 10010110 = -106 in signed! (overflow)
Assembly Language to Machine Language Translation
The assembler converts human-readable assembly into binary machine code using the ISA specification.
# Assembly instruction:
add $t0, $s1, $s2
# Register numbers:
# $t0 = $8, $s1 = $17, $s2 = $18
# R-Type encoding:
# opcode=0, rs=$s1=17, rt=$s2=18, rd=$t0=8, shamt=0, funct=32
# Binary:
# 000000 10001 10010 01000 00000 100000
# Hex: 0x02324020| Field | Value | Binary |
|---|---|---|
| opcode | 0 | 000000 |
| rs ($s1) | 17 | 10001 |
| rt ($s2) | 18 | 10010 |
| rd ($t0) | 8 | 01000 |
| shamt | 0 | 00000 |
| funct (add) | 32 | 100000 |
Addressing Modes
| Mode | Description | Example | Effective Address |
|---|---|---|---|
| Immediate | Operand is part of instruction | addi $t0, $t1, 5 |
N/A (value = 5) |
| Register | Operand is in a register | add $t0, $t1, $t2 |
N/A (value in $t2) |
| Base/Displacement | Base register + offset | lw $t0, 8($s0) |
EA = $s0 + 8 |
| PC-Relative | PC + offset (for branches) | beq $t0, $t1, label |
EA = PC + 4 + offset×4 |
| Pseudo-Direct | Upper PC bits + address field | j target |
EA = PC[31:28] | addr | 00 |
Assembly Language Programming
MIPS Register Convention
| Name | Number | Usage | Preserved? |
|---|---|---|---|
$zero | $0 | Constant 0 | N/A |
$at | $1 | Assembler temporary | No |
$v0-$v1 | $2-$3 | Return values | No |
$a0-$a3 | $4-$7 | Arguments | No |
$t0-$t7 | $8-$15 | Temporaries | No |
$s0-$s7 | $16-$23 | Saved registers | Yes |
$t8-$t9 | $24-$25 | More temporaries | No |
$gp | $28 | Global pointer | Yes |
$sp | $29 | Stack pointer | Yes |
$fp | $30 | Frame pointer | Yes |
$ra | $31 | Return address | Yes |
Essential MIPS Instructions
add $d, $s, $t # $d = $s + $t
addi $t, $s, imm # $t = $s + imm
sub $d, $s, $t # $d = $s - $t
mul $d, $s, $t # $d = $s × $t (low 32 bits)
div $s, $t # lo = $s / $t, hi = $s % $t
mflo $d # $d = lo (quotient)
mfhi $d # $d = hi (remainder)and $d, $s, $t # $d = $s AND $t
andi $t, $s, imm # $t = $s AND imm
or $d, $s, $t # $d = $s OR $t
ori $t, $s, imm # $t = $s OR imm
xor $d, $s, $t # $d = $s XOR $t
nor $d, $s, $t # $d = NOT($s OR $t)
sll $d, $t, shamt # $d = $t << shamt
srl $d, $t, shamt # $d = $t >> shamt (logical)
sra $d, $t, shamt # $d = $t >> shamt (arithmetic)Memory Access Instructions
# Load instructions
lw $t, offset($s) # Load word: $t = Mem[$s + offset]
lh $t, offset($s) # Load halfword (sign-extended)
lhu $t, offset($s) # Load halfword (zero-extended)
lb $t, offset($s) # Load byte (sign-extended)
lbu $t, offset($s) # Load byte (zero-extended)
# Store instructions
sw $t, offset($s) # Store word: Mem[$s + offset] = $t
sh $t, offset($s) # Store halfword
sb $t, offset($s) # Store byte
# Load address (pseudo-instruction)
la $t, label # $t = address of labelMathematical Expressions in Assembly
Assume: g→$s1, h→$s2, i→$s3, j→$s4, f→$s0
# f = (g + h) - (i + j)
add $t0, $s1, $s2 # $t0 = g + h
add $t1, $s3, $s4 # $t1 = i + j
sub $s0, $t0, $t1 # f = $t0 - $t1# f = 2*g + h/4
sll $t0, $s1, 1 # $t0 = g * 2 (shift left)
sra $t1, $s2, 2 # $t1 = h / 4 (shift right)
add $s0, $t0, $t1 # f = $t0 + $t1sll by n = multiply by 2^n, srl/sra by n = divide by 2^n
Decisions (Conditional Branching)
# Branch instructions
beq $s, $t, label # if ($s == $t) goto label
bne $s, $t, label # if ($s != $t) goto label
# Set on less than
slt $d, $s, $t # $d = ($s < $t) ? 1 : 0
slti $t, $s, imm # $t = ($s < imm) ? 1 : 0
sltu $d, $s, $t # unsigned comparison
# Unconditional jump
j label # goto labelC Code:
if (i == j)
f = g + h;
else
f = g - h;MIPS Assembly:
bne $s3, $s4, Else
add $s0, $s1, $s2
j Exit
Else: sub $s0, $s1, $s2
Exit:Loops in Assembly
C Code:
while (save[i] == k)
i += 1;MIPS: (i→$s3, k→$s5, save base→$s6)
Loop: sll $t1, $s3, 2 # $t1 = i * 4
add $t1, $t1, $s6 # $t1 = &save[i]
lw $t0, 0($t1) # $t0 = save[i]
bne $t0, $s5, Exit # if != k, exit
addi $s3, $s3, 1 # i++
j Loop
Exit:C Code:
// Sum array elements
int sum = 0;
for (int i = 0; i < n; i++)
sum += arr[i];MIPS:
li $t0, 0 # sum = 0
li $t1, 0 # i = 0
Loop: bge $t1, $s1, Exit # if i >= n, exit
sll $t2, $t1, 2 # $t2 = i * 4
add $t2, $t2, $s0 # $t2 = &arr[i]
lw $t3, 0($t2) # $t3 = arr[i]
add $t0, $t0, $t3 # sum += arr[i]
addi $t1, $t1, 1 # i++
j Loop
Exit:Procedures (Functions)
- Place arguments in $a0-$a3
- Transfer control:
jal ProcedureLabel - Acquire storage (if needed)
- Perform procedure's task
- Place result in $v0-$v1
- Return:
jr $ra
# Caller code
li $a0, 5 # arg1 = 5
li $a1, 3 # arg2 = 3
jal multiply # call function
move $s0, $v0 # save result# int multiply(int a, int b)
multiply:
# Leaf procedure - no stack needed
mul $v0, $a0, $a1 # $v0 = a * b
jr $ra # return# int factorial(int n)
factorial:
# Prologue - save registers
addi $sp, $sp, -8 # allocate stack space
sw $ra, 4($sp) # save return address
sw $a0, 0($sp) # save n
# Base case: if (n <= 1) return 1
slti $t0, $a0, 2 # $t0 = (n < 2)
beq $t0, $zero, recurse
li $v0, 1 # return 1
j fact_exit
recurse:
# Recursive case: return n * factorial(n-1)
addi $a0, $a0, -1 # n - 1
jal factorial # factorial(n-1)
lw $a0, 0($sp) # restore n
mul $v0, $a0, $v0 # n * factorial(n-1)
fact_exit:
# Epilogue - restore registers
lw $ra, 4($sp) # restore return address
addi $sp, $sp, 8 # deallocate stack
jr $ra # returnSystem Calls (SPIM/MARS)
| Service | $v0 | Arguments | Result |
|---|---|---|---|
| Print int | 1 | $a0 = integer | - |
| Print string | 4 | $a0 = address | - |
| Read int | 5 | - | $v0 = integer |
| Read string | 8 | $a0=buffer, $a1=len | - |
| Exit | 10 | - | - |
| Print char | 11 | $a0 = char | - |
| Read char | 12 | - | $v0 = char |
# Print "Hello" and read an integer
.data
msg: .asciiz "Enter a number: "
.text
li $v0, 4 # print_string
la $a0, msg
syscall
li $v0, 5 # read_int
syscall
move $s0, $v0 # save input in $s0Hardware Performance Techniques
Increasing Clock Rate
- More cycles per second
- Faster single-thread execution
- Direct performance boost
- No software changes needed
- Power increases linearly with freq
- Heat dissipation problems
- Signal integrity issues
- Diminishing returns (power wall)
Decreasing Silicon Feature Size
- More transistors per chip
- Lower capacitance → less power
- Faster switching speed
- Lower voltage operation
- Increased leakage current
- Manufacturing complexity
- Quantum effects at small scales
- Higher defect rates
- Extreme cost of fab facilities
Pipelined Datapath
- Higher throughput (instructions/second)
- Better hardware utilization
- Allows higher clock frequency
- Ideally CPI approaches 1
- Pipeline hazards cause stalls
- Increased latency per instruction
- More complex control logic
- Branch misprediction penalty
| Hazard Type | Cause | Solution |
|---|---|---|
| Structural | Hardware resource conflict | Duplicate resources, stall |
| Data | Instruction depends on prior result | Forwarding, stall, reorder |
| Control | Branch outcome unknown | Branch prediction, delay slot |
Branch Prediction Methods
| Method | Description | Typical Accuracy | Cost |
|---|---|---|---|
| Static: Always Not Taken | Predict branch not taken | ~30-40% | None |
| Static: Always Taken | Predict branch taken | ~60-70% | None |
| Static: Backward Taken | Backward=taken (loops), Forward=not taken | ~65-75% | Low |
| 1-bit Dynamic | Remember last outcome | ~80-85% | Medium |
| 2-bit Saturating Counter | 4 states: strongly/weakly taken/not | ~85-92% | Medium |
| Correlating/Two-Level | Uses global/local branch history | ~93-97% | High |
| Tournament | Multiple predictors, choose best | ~95-98% | Very High |
Example: 5% mispredict, 10-stage pipeline
Penalty = 0.05 × 10 = 0.5 cycles per branch
Datapath Design
Single-Cycle vs Pipelined Design
| Aspect | Single-Cycle | Multi-Cycle | Pipelined |
|---|---|---|---|
| CPI | 1 | Variable (3-5) | ~1 (ideal) |
| Clock Period | Longest instruction | Longest stage | Longest stage |
| Throughput | Low | Medium | High |
| Hardware | Duplicated (ALU, memory) | Shared | Pipeline registers |
| Control | Simple | FSM-based | Complex (hazards) |
Pipelined time: T_pipelined = longest_stage_time + register_overhead
Speedup ≈ number_of_stages (ideal, no hazards)
Actual speedup = stages / (1 + stall_cycles_per_instruction)
Memory Hierarchy
Types of Memory
| Level | Size | Speed | Cost/GB |
|---|---|---|---|
| Registers | ~1 KB | < 1 ns | $$$$$ |
| L1 Cache | 32-64 KB | ~2 ns | $$$$ |
| L2 Cache | 256 KB-1 MB | ~10 ns | $$$ |
| L3 Cache | 8-64 MB | ~40 ns | $$ |
| DRAM | 8-128 GB | ~100 ns | $ |
| SSD | 256 GB-8 TB | ~100 μs | ¢ |
| HDD | 1-20 TB | ~10 ms | ¢¢ |
How Memory Types Are Used
- Registers: Currently executing instruction's operands
- L1 Cache: Split I-cache and D-cache, most recently used code/data
- L2 Cache: Unified, backs up L1 misses
- L3 Cache: Shared among cores, last level before DRAM
- DRAM: Running programs, OS, active data
- Disk/SSD: Virtual memory, persistent storage
Calculating Cache Miss Delays
For multi-level cache:
AMAT = L1_hit_time + L1_miss_rate × (L2_hit_time + L2_miss_rate × L2_miss_penalty)
- L1 hit time = 1 cycle
- L1 miss rate = 5%
- L2 hit time = 10 cycles
- L2 miss rate = 20% (of L1 misses)
- Memory access = 100 cycles
Calculate AMAT:
AMAT = 1 + 0.05 × (10 + 0.20 × 100)
= 1 + 0.05 × (10 + 20)
= 1 + 0.05 × 30
= 1 + 1.5
= 2.5 cycles
Memory_stall_cycles = (Reads/Instruction × Read_miss_rate × Miss_penalty)
+ (Writes/Instruction × Write_miss_rate × Miss_penalty)
Cache Organization
| Type | Description | Pros | Cons |
|---|---|---|---|
| Direct Mapped | Each block maps to exactly one cache line | Simple, fast lookup | Conflict misses |
| Fully Associative | Block can go anywhere in cache | No conflict misses | Expensive lookup |
| Set Associative | Block maps to a set, can go in any line within set | Balance of both | Moderate complexity |
Transition to Multiprocessor Architectures
Why Multiprocessors?
Around 2004, single-core frequency scaling hit physical limits:
- Power consumption grew faster than performance gains
- Heat dissipation became unmanageable
- Voltage couldn't decrease further without reliability issues
- Dennard scaling ended
Diminishing returns from single-thread optimizations:
- Branch prediction accuracy plateaued
- Memory latency gap widened
- Out-of-order windows limited by power
- Instruction-level parallelism exhausted
| Type | Description | Examples |
|---|---|---|
| SMP (Symmetric) | Shared memory, equal access time | Multi-core desktops |
| NUMA | Shared memory, non-uniform access time | Multi-socket servers |
| Cluster | Distributed memory, message passing | Supercomputers, cloud |
Where: P = parallelizable fraction, N = number of processors
Example: 90% parallelizable, 4 cores
Speedup = 1 / (0.1 + 0.9/4) = 1 / 0.325 = 3.08×
- Cache Coherence: Multiple caches must stay synchronized
- Synchronization: Locks, barriers, atomic operations needed
- Load Balancing: Work must be evenly distributed
- Communication Overhead: Data sharing has latency cost
- Software Complexity: Parallel programming is hard!