CSE 230 | Computer Organization & Assembly Language

How Programs Execute on a Computer

Basic Components of CPU Design

┌─────────────────────────────────────────────────────────────────┐ │ CPU │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ Control │ │ ALU │ │ Registers │ │ │ │ Unit │◄──►│ (Arithmetic │◄──►│ (Fast │ │ │ │ (Fetches & │ │ Logic Unit) │ │ Storage) │ │ │ │ Decodes) │ │ │ │ │ │ │ └──────┬───────┘ └──────────────┘ └──────────────┘ │ │ │ ▲ ▲ │ │ ▼ │ │ │ │ ┌──────────────┐ │ │ │ │ │Program Counter│ └───────────────────┘ │ │ │ (PC) │ │ │ │ └──────────────┘ ▼ │ │ ┌──────────────┐ │ │ │ Data Bus │ │ └───────────────────────────┴──────────────┴───────────────────────┘ │ ▼ ┌──────────────┐ │ Memory │ └──────────────┘

Control Unit (CU) — Directs operations, fetches instructions, decodes opcodes, generates control signals
Arithmetic Logic Unit (ALU) — Performs arithmetic (+, -, ×, ÷) and logical (AND, OR, XOR, NOT) operations
Registers — Ultra-fast storage inside CPU (general purpose, special purpose like PC, SP, IR)
Program Counter (PC) — Holds address of next instruction to execute
Instruction Register (IR) — Holds currently executing instruction
Memory Address Register (MAR) — Holds address for memory access
Memory Data Register (MDR) — Holds data being transferred to/from memory
Stack Pointer (SP) — Points to top of the stack in memory

Program Compilation Process

┌─────────────────┐ │ Source Code │ (.c, .cpp, .java) │ (High-Level) │ └────────┬────────┘ │ ▼ [Preprocessor] ┌─────────────────┐ │ Expanded Code │ (includes, macros resolved) └────────┬────────┘ │ ▼ [Compiler] ┌─────────────────┐ │ Assembly Code │ (.s, .asm) └────────┬────────┘ │ ▼ [Assembler] ┌─────────────────┐ │ Object Code │ (.o, .obj) - machine code + symbols └────────┬────────┘ │ ▼ [Linker] ┌─────────────────┐ │ Executable │ (.exe, .out) - complete binary └────────┬────────┘ │ ▼ [Loader] ┌─────────────────┐ │ Process in │ Running in RAM │ Memory │ └─────────────────┘

★ The compiler translates high-level to assembly. The assembler converts assembly to machine code. The linker combines object files and resolves external references.

Stages of the Instruction Cycle (Fetch-Decode-Execute)

FETCH

Get instruction from memory at PC address

DECODE

Interpret opcode & identify operands

EXECUTE

Perform the operation (ALU, branch, etc.)

MEMORY

Access memory if needed (load/store)

WRITEBACK

Write result to register

Detailed Breakdown

Stage	Actions	Hardware Used
`FETCH`	MAR ← PC; IR ← Memory[MAR]; PC ← PC + 4	PC, MAR, MDR, IR, Memory
`DECODE`	Extract opcode, read source registers, sign-extend immediate	Control Unit, Register File
`EXECUTE`	ALU computes result or calculates branch target address	ALU, Branch Unit
`MEMORY`	Load: read from memory; Store: write to memory	Data Memory, Cache
`WRITEBACK`	Write ALU result or loaded data to destination register	Register File

Hardware Factors & Performance

Measuring & Comparing Performance

CPU Performance Equation

CPU Time = Instructions × CPI × Clock Period
CPU Time = (Instruction Count × CPI) / Clock Rate

Instruction Count — Total instructions executed (depends on ISA, compiler)
CPI (Cycles Per Instruction) — Average clock cycles per instruction
Clock Rate (Hz) — Cycles per second (e.g., 3 GHz = 3×10⁹ cycles/sec)
Clock Period — Time for one cycle = 1/Clock Rate

MIPS (Million Instructions Per Second)

MIPS = Instruction Count / (Execution Time × 10⁶)
MIPS = Clock Rate / (CPI × 10⁶)

⚠ MIPS Pitfalls

MIPS can be misleading — different ISAs have different instruction complexities. A CISC machine might do more work per instruction than a RISC machine.

Propagation Delay

Definition: Time for a signal to travel through a circuit component from input to stable output.

Input ──┐ ┌── Output │ Component │ └──────────────┘ │←── t_pd ────→│ t_pd = propagation delay

Gate Delay — Time through a single logic gate (typically 1-10 ns)
Wire Delay — Signal travel time through interconnects
Critical Path — Longest delay path determines max clock speed

Maximum Clock Frequency

f_max = 1 / t_critical_path

Clock Period ≥ t_propagation + t_setup

★ Reducing propagation delay allows higher clock frequencies but may require more power or better silicon technology.

Influence of Clock Speed on Performance

Clock speed (frequency) determines how many cycles execute per second.

✓ Higher Clock Speed Benefits

More operations per second
Faster instruction throughput
Better single-thread performance

✗ Higher Clock Speed Costs

Increased power consumption (P ∝ f)
More heat generation
Higher voltage requirements
Diminishing returns

Power-Frequency Relationship

P_dynamic = α × C × V² × f

Where: α = activity factor, C = capacitance,
V = voltage, f = frequency

Computer Architecture & Performance

Architecture Types

Aspect	RISC	CISC
Instructions	Simple, fixed-length	Complex, variable-length
CPI	~1 (ideally)	Variable (1-20+)
Memory Access	Load/Store only	Memory operands allowed
Registers	Many (32+)	Fewer
Pipelining	Easy to pipeline	Harder to pipeline
Examples	ARM, MIPS, RISC-V	x86, x86-64

Relationship Between Architecture & Instruction Set

Instruction Set Architecture (ISA) is the interface between hardware and software — the "contract" specifying:

Available instructions and their encoding
Registers and their purposes
Addressing modes
Data types and sizes
Memory model

Microarchitecture is how the ISA is implemented in hardware:

Pipeline depth and organization
Cache sizes and hierarchy
Branch prediction strategy
Out-of-order execution
Superscalar width

★ Same ISA (e.g., x86-64) can have different microarchitectures (Intel Skylake vs AMD Zen) with vastly different performance characteristics.

Predicting Execution Time

Given CPU architecture details, predict execution time by considering:

Datapath Design

Single-cycle: CPI = 1, but slow clock
Multi-cycle: Variable CPI, faster clock
Pipelined: CPI ≈ 1, fast clock, hazards add stalls

Branch Prediction Impact

Misprediction penalty = pipeline depth
90% accuracy with 10-stage pipeline:
Extra CPI = 0.10 × 10 = 1.0 cycles/branch

Effective CPI Calculation

CPI_effective = CPI_base + (branch_freq × mispredict_rate × penalty)
+ (mem_access_freq × miss_rate × miss_penalty)

Example: CPI = 1.0 + (0.2 × 0.1 × 10) + (0.3 × 0.05 × 100)
= 1.0 + 0.2 + 1.5 = 2.7 cycles/instruction

Data Representation & Addressing Modes

Binary & Hexadecimal Number Systems

Decimal	Binary	Hex	Decimal	Binary	Hex
0	0000	0	8	1000	8
1	0001	1	9	1001	9
2	0010	2	10	1010	A
3	0011	3	11	1011	B
4	0100	4	12	1100	C
5	0101	5	13	1101	D
6	0110	6	14	1110	E
7	0111	7	15	1111	F

★ Each hex digit = exactly 4 binary bits. This is why hex is used — compact representation of binary.

Powers of 2 (Memorize These!)

2⁰=1 2¹=2 2²=4 2³=8 2⁴=16 2⁵=32
2⁶=64 2⁷=128 2⁸=256 2⁹=512 2¹⁰=1024
2¹⁶=65,536 2³²=4,294,967,296

Binary Addition & Subtraction

Binary Addition Rules

+ 0 = 0
+ 1 = 1
+ 0 = 1
+ 1 = 0 (carry 1)
+ 1 + 1 = 1 (carry 1)

Example: 1011 + 0110


    1 1 1     ← carries
      1 0 1 1
    + 0 1 1 0
    ---------
    1 0 0 0 1  = 17₁₀ (11 + 6 = 17 ✓)

Binary Subtraction (Using 2's Complement)

A - B = A + (-B) = A + (2's complement of B)

Example: 1010 - 0011 (10 - 3 = 7)


  2's complement of 0011:
    Invert: 1100
    Add 1:  1101

  Now add: 1010 + 1101
      1 1
        1 0 1 0
      + 1 1 0 1
      ---------
      1 0 1 1 1  (ignore overflow bit)
      = 0111 = 7₁₀ ✓

Decimal ↔ Binary Conversion

Decimal to Binary (Divide by 2)

Convert 45₁₀ to binary:


  45 ÷ 2 = 22 remainder 1  ↑
  22 ÷ 2 = 11 remainder 0  │
  11 ÷ 2 = 5  remainder 1  │ Read
   5 ÷ 2 = 2  remainder 1  │ bottom
   2 ÷ 2 = 1  remainder 0  │ to top
   1 ÷ 2 = 0  remainder 1  │
   
  Answer: 101101₂

Binary to Decimal (Positional Weights)

Convert 101101₂ to decimal:


  Position:  5   4   3   2   1   0
  Binary:    1   0   1   1   0   1
  Weight:   32  16   8   4   2   1
  
  = 1×32 + 0×16 + 1×8 + 1×4 + 0×2 + 1×1
  = 32 + 8 + 4 + 1 = 45₁₀ ✓

Decimal ↔ Hexadecimal Conversion

Decimal to Hex (Divide by 16)

Convert 255₁₀ to hex:


  255 ÷ 16 = 15 remainder 15 (F)  ↑
   15 ÷ 16 = 0  remainder 15 (F)  │
   
  Answer: FF₁₆ (or 0xFF)

Hex to Decimal

Convert 2A3₁₆ to decimal:


  2A3₁₆ = 2×16² + A×16¹ + 3×16⁰
        = 2×256 + 10×16 + 3×1
        = 512 + 160 + 3
        = 675₁₀

Binary ↔ Hex (Group by 4)

Binary to Hex: 11010110₂


  1101 0110
    D    6   → D6₁₆

Signed Binary Representation

Two's Complement (Standard Method)

For n-bit number, range is: -2^(n-1) to 2^(n-1)-1

8-bit range: -128 to +127

Converting -45 to 8-bit two's complement:


  Step 1: +45 in binary = 00101101
  Step 2: Invert all bits = 11010010
  Step 3: Add 1          = 11010011
  
  -45₁₀ = 11010011₂

MSB (sign bit): 1 = negative, 0 = positive

Two's Complement Values (8-bit)

Binary	Unsigned	Signed
`00000000`	0	0
`00000001`	1	+1
`01111111`	127	+127
`10000000`	128	-128
`10000001`	129	-127
`11111111`	255	-1

Reading negative: 11010011₂


  MSB is 1 → negative number
  Invert: 00101100
  Add 1:  00101101 = 45
  Answer: -45

Big-Endian vs Little-Endian

How multi-byte values are stored in memory.

Example: Store 0x12345678 at address 0x100

Big-Endian (MSB First) Address: 0x100 0x101 0x102 0x103 Data: 12 34 56 78 [MSB] [LSB] Little-Endian (LSB First) Address: 0x100 0x101 0x102 0x103 Data: 78 56 34 12 [LSB] [MSB]

Type	Description	Used By
Big-Endian	Most significant byte at lowest address	Network protocols, MIPS, SPARC
Little-Endian	Least significant byte at lowest address	x86, x86-64, ARM (configurable)

Data Overflow & Underflow

⚠ Overflow

Occurs when result is too large for the number of bits.

Unsigned Overflow


  8-bit max = 255 (11111111)
  255 + 1 = 256... but 256 = 100000000 (9 bits!)
  Result wraps to 0 (overflow)

Signed Overflow Detection

Positive + Positive = Negative → Overflow!
Negative + Negative = Positive → Overflow!
Positive + Negative → Cannot overflow

Example: 8-bit signed, 100 + 50


  01100100 (100)
+ 00110010 (50)
= 10010110 = -106 in signed! (overflow)

★ Underflow is the opposite — result too small (negative overflow). Same detection rules apply.

Assembly Language to Machine Language Translation

The assembler converts human-readable assembly into binary machine code using the ISA specification.

MIPS Instruction Formats (32-bit)

R-Type (Register): ┌────────┬───────┬───────┬───────┬───────┬────────┐ │ opcode │ rs │ rt │ rd │ shamt │ funct │ │ 6 bits │ 5 bits│ 5 bits│ 5 bits│ 5 bits│ 6 bits │ └────────┴───────┴───────┴───────┴───────┴────────┘ I-Type (Immediate): ┌────────┬───────┬───────┬─────────────────────────┐ │ opcode │ rs │ rt │ immediate │ │ 6 bits │ 5 bits│ 5 bits│ 16 bits │ └────────┴───────┴───────┴─────────────────────────┘ J-Type (Jump): ┌────────┬────────────────────────────────────────┐ │ opcode │ address │ │ 6 bits │ 26 bits │ └────────┴────────────────────────────────────────┘

Translation Example

# Assembly instruction:
add $t0, $s1, $s2

# Register numbers:
# $t0 = $8, $s1 = $17, $s2 = $18

# R-Type encoding:
# opcode=0, rs=$s1=17, rt=$s2=18, rd=$t0=8, shamt=0, funct=32

# Binary:
# 000000 10001 10010 01000 00000 100000

# Hex: 0x02324020

Field	Value	Binary
opcode	0	000000
rs ($s1)	17	10001
rt ($s2)	18	10010
rd ($t0)	8	01000
shamt	0	00000
funct (add)	32	100000

Addressing Modes

Mode	Description	Example	Effective Address
Immediate	Operand is part of instruction	`addi $t0, $t1, 5`	N/A (value = 5)
Register	Operand is in a register	`add $t0, $t1, $t2`	N/A (value in $t2)
Base/Displacement	Base register + offset	`lw $t0, 8($s0)`	EA = $s0 + 8
PC-Relative	PC + offset (for branches)	`beq $t0, $t1, label`	EA = PC + 4 + offset×4
Pseudo-Direct	Upper PC bits + address field	`j target`	EA = PC[31:28] \| addr \| 00

Assembly Language Programming

MIPS Register Convention

Name	Number	Usage	Preserved?
`$zero`	$0	Constant 0	N/A
`$at`	$1	Assembler temporary	No
`$v0-$v1`	$2-$3	Return values	No
`$a0-$a3`	$4-$7	Arguments	No
`$t0-$t7`	$8-$15	Temporaries	No
`$s0-$s7`	$16-$23	Saved registers	Yes
`$t8-$t9`	$24-$25	More temporaries	No
`$gp`	$28	Global pointer	Yes
`$sp`	$29	Stack pointer	Yes
`$fp`	$30	Frame pointer	Yes
`$ra`	$31	Return address	Yes

Essential MIPS Instructions

Arithmetic

add  $d, $s, $t    # $d = $s + $t
addi $t, $s, imm   # $t = $s + imm
sub  $d, $s, $t    # $d = $s - $t
mul  $d, $s, $t    # $d = $s × $t (low 32 bits)
div  $s, $t        # lo = $s / $t, hi = $s % $t
mflo $d            # $d = lo (quotient)
mfhi $d            # $d = hi (remainder)

Logical

and  $d, $s, $t    # $d = $s AND $t
andi $t, $s, imm   # $t = $s AND imm
or   $d, $s, $t    # $d = $s OR $t
ori  $t, $s, imm   # $t = $s OR imm
xor  $d, $s, $t    # $d = $s XOR $t
nor  $d, $s, $t    # $d = NOT($s OR $t)
sll  $d, $t, shamt # $d = $t << shamt
srl  $d, $t, shamt # $d = $t >> shamt (logical)
sra  $d, $t, shamt # $d = $t >> shamt (arithmetic)

Memory Access Instructions

# Load instructions
lw  $t, offset($s)  # Load word: $t = Mem[$s + offset]
lh  $t, offset($s)  # Load halfword (sign-extended)
lhu $t, offset($s)  # Load halfword (zero-extended)
lb  $t, offset($s)  # Load byte (sign-extended)
lbu $t, offset($s)  # Load byte (zero-extended)

# Store instructions
sw  $t, offset($s)  # Store word: Mem[$s + offset] = $t
sh  $t, offset($s)  # Store halfword
sb  $t, offset($s)  # Store byte

# Load address (pseudo-instruction)
la  $t, label       # $t = address of label

★ MIPS uses byte addressing. Words are 4 bytes, so word addresses must be divisible by 4 (aligned).

Mathematical Expressions in Assembly

Example: f = (g + h) - (i + j)

Assume: g→$s1, h→$s2, i→$s3, j→$s4, f→$s0

# f = (g + h) - (i + j)
add $t0, $s1, $s2   # $t0 = g + h
add $t1, $s3, $s4   # $t1 = i + j
sub $s0, $t0, $t1   # f = $t0 - $t1

Example: f = 2*g + h/4

# f = 2*g + h/4
sll $t0, $s1, 1     # $t0 = g * 2 (shift left)
sra $t1, $s2, 2     # $t1 = h / 4 (shift right)
add $s0, $t0, $t1   # f = $t0 + $t1

★ Multiply/divide by powers of 2 using shifts: sll by n = multiply by 2^n, srl/sra by n = divide by 2^n

Decisions (Conditional Branching)

# Branch instructions
beq $s, $t, label   # if ($s == $t) goto label
bne $s, $t, label   # if ($s != $t) goto label

# Set on less than
slt  $d, $s, $t     # $d = ($s < $t) ? 1 : 0
slti $t, $s, imm    # $t = ($s < imm) ? 1 : 0
sltu $d, $s, $t     # unsigned comparison

# Unconditional jump
j   label           # goto label

If-Else Example

C Code:

if (i == j)
    f = g + h;
else
    f = g - h;

MIPS Assembly:

      bne $s3, $s4, Else
      add $s0, $s1, $s2
      j   Exit
Else: sub $s0, $s1, $s2
Exit:

Loops in Assembly

While Loop

C Code:

while (save[i] == k)
    i += 1;

MIPS: (i→$s3, k→$s5, save base→$s6)

Loop: sll  $t1, $s3, 2    # $t1 = i * 4
      add  $t1, $t1, $s6   # $t1 = &save[i]
      lw   $t0, 0($t1)    # $t0 = save[i]
      bne  $t0, $s5, Exit # if != k, exit
      addi $s3, $s3, 1    # i++
      j    Loop
Exit:

For Loop

C Code:

// Sum array elements
int sum = 0;
for (int i = 0; i < n; i++)
    sum += arr[i];

MIPS:

      li   $t0, 0         # sum = 0
      li   $t1, 0         # i = 0
Loop: bge  $t1, $s1, Exit # if i >= n, exit
      sll  $t2, $t1, 2    # $t2 = i * 4
      add  $t2, $t2, $s0   # $t2 = &arr[i]
      lw   $t3, 0($t2)    # $t3 = arr[i]
      add  $t0, $t0, $t3   # sum += arr[i]
      addi $t1, $t1, 1    # i++
      j    Loop
Exit:

Procedures (Functions)

Procedure Call Steps

Place arguments in $a0-$a3
Transfer control: jal ProcedureLabel
Acquire storage (if needed)
Perform procedure's task
Place result in $v0-$v1
Return: jr $ra

# Caller code
      li   $a0, 5       # arg1 = 5
      li   $a1, 3       # arg2 = 3
      jal  multiply    # call function
      move $s0, $v0    # save result

Procedure Implementation

# int multiply(int a, int b)
multiply:
      # Leaf procedure - no stack needed
      mul  $v0, $a0, $a1  # $v0 = a * b
      jr   $ra           # return

Stack Management for Non-Leaf Procedures

# int factorial(int n)
factorial:
      # Prologue - save registers
      addi $sp, $sp, -8    # allocate stack space
      sw   $ra, 4($sp)     # save return address
      sw   $a0, 0($sp)     # save n
      
      # Base case: if (n <= 1) return 1
      slti $t0, $a0, 2     # $t0 = (n < 2)
      beq  $t0, $zero, recurse
      li   $v0, 1          # return 1
      j    fact_exit
      
recurse:
      # Recursive case: return n * factorial(n-1)
      addi $a0, $a0, -1    # n - 1
      jal  factorial       # factorial(n-1)
      lw   $a0, 0($sp)     # restore n
      mul  $v0, $a0, $v0   # n * factorial(n-1)
      
fact_exit:
      # Epilogue - restore registers
      lw   $ra, 4($sp)     # restore return address
      addi $sp, $sp, 8     # deallocate stack
      jr   $ra             # return

Stack Layout During Recursive Call: High Address ┌──────────────┐ │ ... │ ├──────────────┤ │ saved $ra │ ← $sp + 4 ├──────────────┤ │ saved $a0 │ ← $sp ├──────────────┤ │ ... │ └──────────────┘ Low Address Stack grows downward (toward lower addresses)

System Calls (SPIM/MARS)

Service	$v0	Arguments	Result
Print int	1	$a0 = integer	-
Print string	4	$a0 = address	-
Read int	5	-	$v0 = integer
Read string	8	$a0=buffer, $a1=len	-
Exit	10	-	-
Print char	11	$a0 = char	-
Read char	12	-	$v0 = char

# Print "Hello" and read an integer
.data
msg:  .asciiz "Enter a number: "

.text
      li   $v0, 4        # print_string
      la   $a0, msg
      syscall
      
      li   $v0, 5        # read_int
      syscall
      move $s0, $v0     # save input in $s0

Hardware Performance Techniques

Increasing Clock Rate

✓ Advantages

More cycles per second
Faster single-thread execution
Direct performance boost
No software changes needed

✗ Disadvantages

Power increases linearly with freq
Heat dissipation problems
Signal integrity issues
Diminishing returns (power wall)

Decreasing Silicon Feature Size

✓ Advantages

More transistors per chip
Lower capacitance → less power
Faster switching speed
Lower voltage operation

✗ Disadvantages

Increased leakage current
Manufacturing complexity
Quantum effects at small scales
Higher defect rates
Extreme cost of fab facilities

★ Moore's Law: Transistor count doubles ~every 2 years. Currently at 3-5nm process nodes.

Pipelined Datapath

✓ Advantages

Higher throughput (instructions/second)
Better hardware utilization
Allows higher clock frequency
Ideally CPI approaches 1

✗ Disadvantages

Pipeline hazards cause stalls
Increased latency per instruction
More complex control logic
Branch misprediction penalty

Pipeline Hazards

Hazard Type	Cause	Solution
Structural	Hardware resource conflict	Duplicate resources, stall
Data	Instruction depends on prior result	Forwarding, stall, reorder
Control	Branch outcome unknown	Branch prediction, delay slot

Pipeline Execution (5-stage): Cycle: 1 2 3 4 5 6 7 8 9 Instr 1: [IF][ID][EX][MEM][WB] Instr 2: [IF][ID][EX][MEM][WB] Instr 3: [IF][ID][EX][MEM][WB] Instr 4: [IF][ID][EX][MEM][WB] Instr 5: [IF][ID][EX][MEM][WB] Without pipeline: 5 instructions × 5 cycles = 25 cycles With pipeline: 5 instructions in 9 cycles (after filling)

Branch Prediction Methods

Method	Description	Typical Accuracy	Cost
Static: Always Not Taken	Predict branch not taken	~30-40%	None
Static: Always Taken	Predict branch taken	~60-70%	None
Static: Backward Taken	Backward=taken (loops), Forward=not taken	~65-75%	Low
1-bit Dynamic	Remember last outcome	~80-85%	Medium
2-bit Saturating Counter	4 states: strongly/weakly taken/not	~85-92%	Medium
Correlating/Two-Level	Uses global/local branch history	~93-97%	High
Tournament	Multiple predictors, choose best	~95-98%	Very High

2-bit Saturating Counter State Machine

Taken Taken ┌────────────┐ ┌────────────┐ │ ▼ │ ▼ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Strongly│ │ Weakly │ │ Weakly │ │ Strongly│ │ Not │◄─────│ Not │◄─────│ Taken │◄─────│ Taken │ │ Taken │ Not │ Taken │ Not │ │ Not │ │ │ 00 │Taken │ 01 │Taken │ 10 │Taken │ 11 │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │ │ Predict: NOT TAKEN Predict: TAKEN │ └───────────────────────────────────────────────────┘

Misprediction Penalty

Extra cycles = misprediction_rate × pipeline_stages_to_flush

Example: 5% mispredict, 10-stage pipeline
Penalty = 0.05 × 10 = 0.5 cycles per branch

Datapath Design

Single-Cycle vs Pipelined Design

Aspect	Single-Cycle	Multi-Cycle	Pipelined
CPI	1	Variable (3-5)	~1 (ideal)
Clock Period	Longest instruction	Longest stage	Longest stage
Throughput	Low	Medium	High
Hardware	Duplicated (ALU, memory)	Shared	Pipeline registers
Control	Simple	FSM-based	Complex (hazards)

Single-Cycle Datapath

┌─────────┐ PC ──────►│Instruction│──────────────────────────────────────┐ │ Memory │ │ └─────────┬─┘ │ │ │ ▼ │ ┌─────────────────┐ │ │ Control Unit │────────────────────┐ │ └─────────────────┘ │ │ │ │ │ ┌──────────────┼──────────────┐ │ │ │ │ │ │ │ ▼ ▼ ▼ │ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │ │ Register│ │ Sign │ │ Register│ │ │ │ Read │ │ Extend │ │ Write │ │ │ └────┬────┘ └────┬────┘ └────▲────┘ │ │ │ │ │ │ │ └──────┬──────┘ │ │ │ │ │ │ │ ▼ │ │ │ ┌─────────┐ │ │ │ │ ALU │──────────────┤ │ │ └────┬────┘ │ │ │ │ │ │ │ ▼ │ │ │ ┌─────────┐ │ │ │ │ Data │─────────────┘ │ │ │ Memory │ │ │ └─────────┘ │ │ │ │ Clock: All operations complete in ONE cycle │

Pipelined Datapath

IF/ID ID/EX EX/MEM MEM/WB │ │ │ │ ▼ ▼ ▼ ▼ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ Instruction│ │ Decode │ │ Execute │ │ Memory │ │ WriteBack│ │ Fetch │──│ + Reg │──│ (ALU) │──│ Access │──│ (to Reg) │ │ │ │ Read │ │ │ │ │ │ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │ │ │ │ └─────────────┴─────────────┴─────────────┴─────────────┘ Pipeline Registers (store intermediate results)

Performance Comparison

Single-cycle time: T_single = longest_instruction_time
Pipelined time: T_pipelined = longest_stage_time + register_overhead

Speedup ≈ number_of_stages (ideal, no hazards)
Actual speedup = stages / (1 + stall_cycles_per_instruction)

Memory Hierarchy

Types of Memory

Registers (~1ns)

L1 Cache (~2-4ns)

L2 Cache (~10ns)

L3 Cache (~40ns)

Main Memory DRAM (~100ns)

Level	Size	Speed	Cost/GB
Registers	~1 KB	< 1 ns	$$$$$
L1 Cache	32-64 KB	~2 ns	$$$$
L2 Cache	256 KB-1 MB	~10 ns	$$$
L3 Cache	8-64 MB	~40 ns	$$
DRAM	8-128 GB	~100 ns	$
SSD	256 GB-8 TB	~100 μs	¢
HDD	1-20 TB	~10 ms	¢¢

How Memory Types Are Used

Registers: Currently executing instruction's operands
L1 Cache: Split I-cache and D-cache, most recently used code/data
L2 Cache: Unified, backs up L1 misses
L3 Cache: Shared among cores, last level before DRAM
DRAM: Running programs, OS, active data
Disk/SSD: Virtual memory, persistent storage

★ Principle of Locality: Temporal (recently used → likely reused) and Spatial (nearby addresses → likely accessed soon)

Calculating Cache Miss Delays

Average Memory Access Time (AMAT)

AMAT = Hit_time + (Miss_rate × Miss_penalty)

For multi-level cache:
AMAT = L1_hit_time + L1_miss_rate × (L2_hit_time + L2_miss_rate × L2_miss_penalty)

Example Calculation

Given:

L1 hit time = 1 cycle
L1 miss rate = 5%
L2 hit time = 10 cycles
L2 miss rate = 20% (of L1 misses)
Memory access = 100 cycles

Calculate AMAT:


AMAT = 1 + 0.05 × (10 + 0.20 × 100)
     = 1 + 0.05 × (10 + 20)
     = 1 + 0.05 × 30
     = 1 + 1.5
     = 2.5 cycles

CPI with Memory Stalls

CPI_total = CPI_base + Memory_stall_cycles

Memory_stall_cycles = (Reads/Instruction × Read_miss_rate × Miss_penalty)
+ (Writes/Instruction × Write_miss_rate × Miss_penalty)

Cache Organization

Type	Description	Pros	Cons
Direct Mapped	Each block maps to exactly one cache line	Simple, fast lookup	Conflict misses
Fully Associative	Block can go anywhere in cache	No conflict misses	Expensive lookup
Set Associative	Block maps to a set, can go in any line within set	Balance of both	Moderate complexity

Address breakdown for cache access: │◄────────────── Address ──────────────►│ ┌────────────────┬───────────────┬──────┐ │ Tag │ Index │Offset│ │ │ │ │ └────────────────┴───────────────┴──────┘ │ │ │ │ │ └─► Byte within block │ └─────────────► Which set/line └─────────────────────────────► Verify correct block For direct-mapped cache with 2^n blocks, block size 2^m bytes: - Offset bits = m (selects byte in block) - Index bits = n (selects cache line) - Tag bits = address_bits - n - m

Transition to Multiprocessor Architectures

Why Multiprocessors?

The Power Wall

Around 2004, single-core frequency scaling hit physical limits:

Power consumption grew faster than performance gains
Heat dissipation became unmanageable
Voltage couldn't decrease further without reliability issues
Dennard scaling ended

The ILP Wall

Diminishing returns from single-thread optimizations:

Branch prediction accuracy plateaued
Memory latency gap widened
Out-of-order windows limited by power
Instruction-level parallelism exhausted

★ Solution: Instead of making one core faster, use multiple cores working in parallel. Performance scales with parallelizable workload.

Multiprocessor Types

Type	Description	Examples
SMP (Symmetric)	Shared memory, equal access time	Multi-core desktops
NUMA	Shared memory, non-uniform access time	Multi-socket servers
Cluster	Distributed memory, message passing	Supercomputers, cloud

Amdahl's Law

Speedup = 1 / ((1 - P) + P/N)

Where: P = parallelizable fraction, N = number of processors

Example: 90% parallelizable, 4 cores
Speedup = 1 / (0.1 + 0.9/4) = 1 / 0.325 = 3.08×

⚠ Challenges of Multiprocessing

Cache Coherence: Multiple caches must stay synchronized
Synchronization: Locks, barriers, atomic operations needed
Load Balancing: Work must be evenly distributed
Communication Overhead: Data sharing has latency cost
Software Complexity: Parallel programming is hard!