Computer Architecture & Organization

How Programs Are Executed by a Computer System

Basic Components of a CPU Design

The Central Processing Unit (CPU) is the brain of the computer. It consists of several fundamental components that work together to execute instructions:

Control Unit (CU)

The "manager" of the CPU. It fetches instructions from memory, decodes them to understand what operation to perform, and coordinates all other components to execute the instruction. Think of it as the conductor of an orchestra.

Arithmetic Logic Unit (ALU)

The "calculator" of the CPU. Performs all arithmetic operations (add, subtract, multiply, divide) and logical operations (AND, OR, NOT, XOR, comparisons). Every computation flows through here.

Registers

Ultra-fast, small storage locations inside the CPU. Used to hold data currently being processed, addresses, and status information. Much faster than RAM because they're on the CPU chip itself.

Program Counter (PC)

A special register that holds the memory address of the next instruction to execute. After each instruction fetch, it typically increments to point to the next sequential instruction (unless a jump/branch occurs).

Simplified CPU Architecture

┌─────────────────────────────────────────────────────────────┐
│                          CPU                                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                   Control Unit                        │   │
│  │    ┌─────────┐  ┌─────────────┐  ┌──────────────┐    │   │
│  │    │   PC    │  │ Instruction │  │   Control    │    │   │
│  │    │ Counter │──│  Register   │──│   Signals    │────┼───┼──▶
│  │    └─────────┘  └─────────────┘  └──────────────┘    │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │                                 │
│  ┌─────────────────────────┼────────────────────────────┐   │
│  │        Register File    │                            │   │
│  │   ┌────┐ ┌────┐ ┌────┐ │ ┌────┐ ┌────┐ ┌────┐       │   │
│  │   │ R0 │ │ R1 │ │ R2 │ │ │ R3 │ │... │ │ Rn │       │   │
│  │   └────┘ └────┘ └────┘ │ └────┘ └────┘ └────┘       │   │
│  └─────────────────────────┼────────────────────────────┘   │
│                            │                                 │
│  ┌─────────────────────────▼────────────────────────────┐   │
│  │                       ALU                             │   │
│  │         ADD  SUB  MUL  DIV  AND  OR  XOR  NOT        │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                             │
                    ◀────────┴────────▶
                         Memory Bus

The Program Compilation Process

When you write code in a high-level language like C or Python, it can't run directly on hardware. The compilation process transforms human-readable code into machine instructions the CPU understands.

Compilation Pipeline

Source Code (.c)
      │
      ▼
┌─────────────────┐
│  PREPROCESSOR   │  →  Handles #include, #define, macros
└────────┬────────┘     Produces expanded source code
         │
         ▼
┌─────────────────┐
│    COMPILER     │  →  Converts C to Assembly language
└────────┬────────┘     Syntax checking, optimization
         │
         ▼
┌─────────────────┐
│   ASSEMBLER     │  →  Converts Assembly to Object code
└────────┬────────┘     Machine instructions (binary)
         │
         ▼
┌─────────────────┐
│     LINKER      │  →  Combines object files + libraries
└────────┬────────┘     Resolves external references
         │
         ▼
  Executable File

Stage Details

Preprocessor: Text substitution phase. Expands macros, includes header files, processes conditional compilation directives. Output is still C code.
Compiler: The heavy lifting. Performs lexical analysis (tokenization), parsing (syntax tree), semantic analysis (type checking), optimization, and code generation. Output is assembly language.
Assembler: Translates assembly mnemonics into binary machine code. Creates object files containing machine instructions plus symbol tables for linking.
Linker: Combines multiple object files and libraries. Resolves symbol references (like function calls to library functions). Produces final executable with proper memory addresses.

The Instruction Cycle (Fetch-Decode-Execute)

Every instruction the CPU executes follows the same fundamental cycle. This continuous loop is the heartbeat of computation.

Instruction Cycle

        ┌─────────────────────────────────────────────────────┐
        │                                                     │
        ▼                                                     │
   ┌─────────┐      ┌──────────┐      ┌───────────┐          │
   │  FETCH  │ ──▶  │  DECODE  │ ──▶  │  EXECUTE  │ ─────────┘
   └─────────┘      └──────────┘      └───────────┘
        │                │                  │
        │                │                  │
   Read instruction   Determine         Perform the
   from memory at     what operation    operation:
   address in PC,     and operands,     ALU computation,
   increment PC       setup control     memory access,
                      signals           or branch

The Three Stages Explained

1. Fetch: The CPU reads the instruction from memory at the address stored in the Program Counter. The instruction is loaded into the Instruction Register. The PC is incremented to point to the next instruction.

2. Decode: The Control Unit examines the instruction to determine: What operation? Which registers? What addressing mode? It generates control signals to coordinate the execution.

3. Execute: The actual work happens. This might involve the ALU performing arithmetic, data being read/written to memory, or the PC being modified for branches/jumps.

Some architectures add additional stages:

Memory Access: For load/store instructions, access RAM to read or write data
Write Back: Store the result of the operation back into a register

Hardware Factors Impacting Performance

Measuring and Comparing Computer Performance

Performance can be measured in several ways, each telling a different part of the story:

Execution Time

Total time to complete a task. Lower is better. This is what end users care about most.

Throughput

Tasks completed per unit time. Higher is better. Important for servers handling many requests.

Performance = 1 / Execution Time

Comparing two systems:

Speedup = Performance_new / Performance_old = Time_old / Time_new

Key Metrics

MIPS (Million Instructions Per Second): Raw instruction throughput. Can be misleading because different instructions do different amounts of work.
FLOPS (Floating Point Operations Per Second): Better for scientific computing comparisons. Measures actual computational work.
Benchmarks (SPEC, Geekbench): Standardized programs that measure real-world performance across various workloads.

Propagation Delay

Propagation delay is the time it takes for a signal to travel through a logic gate or circuit. It's the fundamental speed limit of digital circuits.

t_pd = Time for output to change after input changes

Propagation delay is affected by:

Gate complexity: More transistors = more delay
Capacitive load: More outputs to drive = slower switching
Wire length: Signals take time to travel, even at near light speed
Temperature: Higher temps generally increase delay
Voltage: Lower voltage = slower transitions

Critical Path

The critical path is the longest path through combinational logic between registers. It determines the minimum clock period and thus the maximum clock frequency. All design optimization focuses on reducing critical path delay.

Clock Speed and Performance

Clock speed (frequency) determines how many cycles per second the CPU runs. Higher clock = more operations per second, but there are tradeoffs.

Clock Period (T) = 1 / Frequency (f)

CPU Time = Instruction Count × CPI × Clock Period

CPU Time = (Instruction Count × CPI) / Clock Frequency

Where CPI = Cycles Per Instruction (average)

Why Not Just Crank Up Clock Speed?

Power consumption: Power scales roughly with frequency cubed (P ∝ f³). Double the clock, roughly 8× the heat.
Heat dissipation: Can't cool it fast enough. This is why we hit the "frequency wall" around 4-5 GHz.
Signal integrity: At high frequencies, wires start acting like antennas and transmission lines.
Timing margins: Less room for error in meeting setup/hold times.

Computer Architecture's Influence on Performance

Architecture decisions fundamentally impact what's possible:

Architectural Feature	Performance Impact
Word Size (32-bit vs 64-bit)	More data processed per instruction, larger address space
Register Count	More registers = fewer memory accesses = faster
Cache Size & Levels	Larger/more cache = fewer slow main memory accesses
Pipeline Depth	More stages = higher clock possible, but hazard penalties
Superscalar Width	Execute multiple instructions per cycle
Out-of-Order Execution	Better utilization by reordering independent instructions
Branch Prediction	Reduce pipeline stalls from branches

Architecture vs. Instruction Set Relationship

The Instruction Set Architecture (ISA) is the interface between software and hardware — the "contract" that defines what instructions exist, their formats, registers, and addressing modes.

CISC (Complex Instruction Set)

x86

Many complex instructions
Variable instruction lengths
Instructions can access memory directly
Fewer instructions needed per program
Complex decode logic

RISC (Reduced Instruction Set)

ARM MIPS RISC-V

Simple, uniform instructions
Fixed instruction length
Load/Store architecture
More instructions per program
Simpler decode, easier pipelining

Key Insight

Modern "CISC" processors (like x86) actually translate complex instructions into RISC-like micro-operations internally. The ISA is preserved for compatibility, but execution is RISC-style for performance.

Predicting Assembly Program Execution Time

To calculate execution time, you need to account for instruction mix, cycles per instruction type, and clock speed.

Execution Time = Σ (Instruction_count_i × CPI_i) × Clock_period

Example Calculation

Given a program with:

100 ALU instructions (1 cycle each)
30 load instructions (3 cycles each — memory access)
20 store instructions (3 cycles each)
10 branch instructions (2 cycles each)
Clock frequency: 2 GHz

Total Cycles = (100 × 1) + (30 × 3) + (20 × 3) + (10 × 2) = 100 + 90 + 60 + 20 = 270 cycles

Clock Period = 1 / (2 × 10⁹) = 0.5 ns

Execution Time = 270 × 0.5 ns = 135 ns

Real-World Complications

Actual execution time is affected by cache hits/misses (huge impact!), branch prediction accuracy, pipeline hazards, memory bandwidth, and out-of-order execution effects. These make precise prediction difficult.

Data Representation, Instruction Sets, and Addressing Modes

Binary and Hexadecimal Number Systems

Binary (Base 2)

Computers use binary because transistors have two states: on (1) and off (0). Each digit is a "bit".

Binary Place Values

Position:    7      6      5      4      3      2      1      0
Value:      128    64     32     16      8      4      2      1
             2⁷     2⁶     2⁵     2⁴     2³     2²     2¹     2⁰

Example: 10110101₂ = 128 + 32 + 16 + 4 + 1 = 181₁₀

Hexadecimal (Base 16)

Hex is a compact way to represent binary. Each hex digit represents exactly 4 bits (a "nibble").

Decimal	Binary	Hex	Decimal	Binary	Hex
0	0000	0	8	1000	8
1	0001	1	9	1001	9
2	0010	2	10	1010	A
3	0011	3	11	1011	B
4	0100	4	12	1100	C
5	0101	5	13	1101	D
6	0110	6	14	1110	E
7	0111	7	15	1111	F

Example: 0xB5 = 1011 0101₂ = 181₁₀

Binary Addition and Subtraction

Binary Addition

Works just like decimal addition, but carry occurs at 2 instead of 10.

Rules:          Example: 0101 + 0011 = 1000  (5 + 3 = 8)
0 + 0 = 0                   
0 + 1 = 1           0 1 0 1      (5)
1 + 0 = 1         + 0 0 1 1      (3)
1 + 1 = 10        ---------
  (0 carry 1)     ¹ ¹
                    1 0 0 0      (8)

Binary Subtraction (Using Two's Complement)

Instead of subtracting, we add the negative. To negate in two's complement: invert all bits, then add 1.

To compute 5 - 3 using 4 bits:

Step 1: Convert 3 to -3 (two's complement)
        3 = 0011
        Invert: 1100
        Add 1:  1101 = -3

Step 2: Add 5 + (-3)
          0101    (5)
        + 1101   (-3)
        ------
       ¹ 0010    (2)  ← discard carry, answer is 2 ✓

Decimal to Binary Conversion

Method: Repeated Division by 2

Divide by 2, record remainder. Repeat until quotient is 0. Read remainders bottom-to-top.

Convert 181 to binary:

181 ÷ 2 = 90 remainder 1  ↑
 90 ÷ 2 = 45 remainder 0  │
 45 ÷ 2 = 22 remainder 1  │  Read upward
 22 ÷ 2 = 11 remainder 0  │
 11 ÷ 2 = 5  remainder 1  │
  5 ÷ 2 = 2  remainder 1  │
  2 ÷ 2 = 1  remainder 0  │
  1 ÷ 2 = 0  remainder 1  │

Result: 181₁₀ = 10110101₂

Binary to Decimal

Multiply each bit by its place value and sum:

10110101₂ = 1×128 + 0×64 + 1×32 + 1×16 + 0×8 + 1×4 + 0×2 + 1×1 = 181₁₀

Decimal to Hexadecimal Conversion

Method 1: Repeated Division by 16

Convert 450 to hex:

450 ÷ 16 = 28 remainder 2   ↑
 28 ÷ 16 = 1  remainder 12 (C)  │  Read upward
  1 ÷ 16 = 0  remainder 1   │

Result: 450₁₀ = 0x1C2

Method 2: Convert to Binary First, Then Group by 4

450₁₀ = 111000010₂ = 0001 1100 0010₂ = 0x1C2

Signed Binary Representation (Two's Complement)

Two's complement is the standard for representing signed integers. The MSB (leftmost bit) indicates sign: 0 = positive, 1 = negative.

4-bit Binary	Unsigned Value	Signed Value (Two's Complement)
0000	0	0
0001	1	+1
0111	7	+7 (max positive)
1000	8	-8 (min negative)
1111	15	-1
1110	14	-2

Range for n-bit Two's Complement

Range: -2^(n-1) to +2^(n-1) - 1

For 8 bits: -128 to +127
For 32 bits: -2,147,483,648 to +2,147,483,647

Converting Negative Numbers

To represent -5 in 8-bit two's complement:

1. Start with +5:     00000101
2. Invert all bits:   11111010
3. Add 1:             11111011  ← This is -5

To verify: add 5 + (-5) should equal 0
  00000101
+ 11111011
----------
 100000000  ← Discard 9th bit carry = 00000000 ✓

Big-Endian vs Little-Endian

Endianness describes byte ordering in multi-byte data. Consider the 32-bit value 0x12345678:

Memory Layout Comparison

Address:        0x100   0x101   0x102   0x103

Big-Endian:      12      34      56      78
                 ↑ Most significant byte first
                 (Network byte order, used by some RISC)

Little-Endian:   78      56      34      12
                 ↑ Least significant byte first
                 (x86, ARM default)

Big-Endian

Like reading left-to-right. MSB at lowest address. Used in: network protocols, some RISC processors, Java bytecode.

Little-Endian

LSB at lowest address. Used in: x86, x64, ARM (usually), most modern PCs. Easier for hardware to extend values.

Overflow and Underflow

Overflow

Occurs when the result of an operation is too large to represent in the available bits.

8-bit signed addition overflow example:

  01111111   (+127, max positive)
+ 00000001   (+1)
----------
  10000000   (interpreted as -128, not +128!)

Overflow detection (signed): 
If two positive numbers produce negative result → overflow
If two negative numbers produce positive result → overflow

Underflow

Occurs when the result is too negative (below minimum representable value).

8-bit signed subtraction underflow:

  10000000   (-128, min negative)
- 00000001   (-1)
----------
  01111111   (+127, wrapped around!)

Consequences

Overflow/underflow can cause security vulnerabilities (integer overflow attacks), incorrect calculations, and program crashes. Always validate ranges and use appropriate data types!

Assembly to Machine Language Translation

Assembly mnemonics are human-readable representations of machine code. Each instruction has a binary encoding defined by the ISA.

MIPS R-Type Instruction Format Example

Instruction: ADD $t0, $s1, $s2   (add registers s1 and s2, store in t0)

R-Type Format (32 bits total):
┌────────┬───────┬───────┬───────┬───────┬────────┐
│ opcode │   rs  │   rt  │   rd  │ shamt │ funct  │
│ 6 bits │5 bits │5 bits │5 bits │5 bits │ 6 bits │
└────────┴───────┴───────┴───────┴───────┴────────┘

For ADD $t0, $s1, $s2:
opcode = 000000 (R-type)
rs = 10001 ($s1 = register 17)
rt = 10010 ($s2 = register 18)
rd = 01000 ($t0 = register 8)
shamt = 00000 (not a shift)
funct = 100000 (ADD function)

Machine code: 000000 10001 10010 01000 00000 100000
            = 0x02324020

I-Type and J-Type Formats

I-Type (Immediate): LW, SW, BEQ, ADDI, etc.
┌────────┬───────┬───────┬──────────────────┐
│ opcode │   rs  │   rt  │    immediate     │
│ 6 bits │5 bits │5 bits │     16 bits      │
└────────┴───────┴───────┴──────────────────┘

J-Type (Jump): J, JAL
┌────────┬─────────────────────────────────┐
│ opcode │           address               │
│ 6 bits │           26 bits               │
└────────┴─────────────────────────────────┘

Assembly Language Programming

Mathematical Expressions in Assembly

High-level math expressions must be broken down into individual operations using registers.

Example: Compute result = (a + b) * (c - d)

# Assume: a in $s0, b in $s1, c in $s2, d in $s3
# Result will be in $s4

add  $t0, $s0, $s1    # $t0 = a + b
sub  $t1, $s2, $s3    # $t1 = c - d
mul  $s4, $t0, $t1    # $s4 = (a+b) * (c-d)

Example: Array Element Access (array[i] = array[i] + 5)

# Assume: base address of array in $s0, i in $s1

sll  $t0, $s1, 2       # $t0 = i * 4 (word offset)
add  $t0, $t0, $s0    # $t0 = address of array[i]
lw   $t1, 0($t0)       # $t1 = array[i]
addi $t1, $t1, 5      # $t1 = array[i] + 5
sw   $t1, 0($t0)       # array[i] = $t1

Decision Structures (If-Else)

Simple If Statement

// C code:
if (a == b) {
    c = d + e;
}

# a=$s0, b=$s1, c=$s2, d=$s3, e=$s4

      bne  $s0, $s1, skip    # if a != b, skip the body
      add  $s2, $s3, $s4    # c = d + e
skip:
      # continue...

If-Else Statement

// C code:
if (a < b) {
    c = 1;
} else {
    c = 0;
}

# a=$s0, b=$s1, c=$s2

      slt  $t0, $s0, $s1    # $t0 = 1 if a < b, else 0
      beq  $t0, $zero, else # if $t0==0 (a >= b), go to else
      addi $s2, $zero, 1   # c = 1
      j    done             # skip else block
else:
      add  $s2, $zero, $zero # c = 0
done:

Loops

While Loop

// C code: sum numbers from 1 to n
int sum = 0;
int i = 1;
while (i <= n) {
    sum = sum + i;
    i = i + 1;
}

# n=$s0, sum=$s1, i=$s2

        add  $s1, $zero, $zero  # sum = 0
        addi $s2, $zero, 1     # i = 1
loop:
        slt  $t0, $s0, $s2     # $t0 = 1 if n < i
        bne  $t0, $zero, done  # if n < i, exit loop
        add  $s1, $s1, $s2     # sum = sum + i
        addi $s2, $s2, 1      # i = i + 1
        j    loop              # repeat
done:

For Loop (Array Sum)

# Sum array elements: for(i=0; i
# arr base=$s0, n=$s1, sum in $s2

        add  $s2, $zero, $zero  # sum = 0
        add  $t0, $zero, $zero  # i = 0
        add  $t1, $s0, $zero    # $t1 = current address
loop:
        slt  $t2, $t0, $s1     # $t2 = 1 if i < n
        beq  $t2, $zero, done  # exit if i >= n
        lw   $t3, 0($t1)        # $t3 = arr[i]
        add  $s2, $s2, $t3     # sum += arr[i]
        addi $t1, $t1, 4      # next address
        addi $t0, $t0, 1      # i++
        j    loop
done:

Procedures (Functions)

Procedures in assembly require careful management of the stack, return addresses, and registers.

MIPS Calling Convention

$a0-$a3: Arguments to procedure
$v0-$v1: Return values
$ra: Return address (set by jal)
$sp: Stack pointer
$s0-$s7: Saved registers (callee must preserve)
$t0-$t9: Temporary registers (caller-saved)

Simple Leaf Procedure (No Nested Calls)

// C code:
int square(int x) {
    return x * x;
}

square:
        mul  $v0, $a0, $a0    # $v0 = x * x
        jr   $ra              # return

# Calling the function:
        addi $a0, $zero, 5   # arg = 5
        jal  square           # call square(5)
        # result now in $v0

Non-Leaf Procedure (Makes Nested Calls)

# int factorial(int n) { if(n<=1) return 1; return n * factorial(n-1); }

factorial:
        addi $sp, $sp, -8    # allocate stack space
        sw   $ra, 4($sp)      # save return address
        sw   $a0, 0($sp)      # save argument n
        
        slti $t0, $a0, 2     # $t0 = 1 if n < 2
        beq  $t0, $zero, recurse
        
        addi $v0, $zero, 1   # base case: return 1
        addi $sp, $sp, 8     # restore stack
        jr   $ra
        
recurse:
        addi $a0, $a0, -1    # n - 1
        jal  factorial        # recursive call
        
        lw   $a0, 0($sp)      # restore n
        lw   $ra, 4($sp)      # restore return address
        addi $sp, $sp, 8     # deallocate stack
        
        mul  $v0, $a0, $v0    # return n * factorial(n-1)
        jr   $ra

Stack Frame Layout

High Memory
┌─────────────────────┐
│  Previous frames    │
├─────────────────────┤
│  Saved $ra          │ ← $sp + 4 (after allocation)
├─────────────────────┤
│  Saved $a0 (n)      │ ← $sp (after allocation)
├─────────────────────┤
│  (next frame...)    │
└─────────────────────┘
Low Memory

Hardware Performance Techniques

Increasing Clock Rate

Advantages
More cycles per second = more instructions executed
Direct performance improvement for single-threaded code
Simple conceptually - just make everything faster

Disadvantages
Power consumption increases dramatically (P ∝ f³)
Heat generation becomes unmanageable
Requires faster transistors and better cooling
Hit practical limits around 4-5 GHz ("frequency wall")
Timing margins shrink, reliability concerns

Decreasing Silicon Feature Size

Feature size refers to the smallest elements that can be fabricated on a chip (measured in nanometers). Smaller = more transistors, lower power, higher speeds.

Advantages
More transistors per chip (Moore's Law)
Lower capacitance = faster switching
Lower voltage operation possible
Reduced power consumption per transistor
Smaller die size = lower cost per chip

Disadvantages
Leakage current increases (quantum tunneling)
Manufacturing complexity and cost skyrocket
Heat density increases (same power, smaller area)
Reliability issues (electromigration, variability)
Approaching physical limits (atomic scale)

Pipelined Datapath

Pipelining divides instruction execution into stages, allowing multiple instructions to be "in flight" simultaneously. Like an assembly line.

5-Stage Pipeline

Time →   1    2    3    4    5    6    7    8
      ┌────┬────┬────┬────┬────┬────┬────┬────┐
Instr1│ IF │ ID │ EX │ MEM│ WB │    │    │    │
      ├────┼────┼────┼────┼────┼────┼────┼────┤
Instr2│    │ IF │ ID │ EX │ MEM│ WB │    │    │
      ├────┼────┼────┼────┼────┼────┼────┼────┤
Instr3│    │    │ IF │ ID │ EX │ MEM│ WB │    │
      ├────┼────┼────┼────┼────┼────┼────┼────┤
Instr4│    │    │    │ IF │ ID │ EX │ MEM│ WB │
      └────┴────┴────┴────┴────┴────┴────┴────┘

IF = Instruction Fetch    MEM = Memory Access
ID = Instruction Decode   WB = Write Back
EX = Execute

Advantages
Higher throughput (1 instruction completing per cycle at steady state)
Better hardware utilization
Enables higher clock frequencies (shorter critical path per stage)

Disadvantages
Pipeline hazards (data, control, structural)
Increased complexity
Latency for single instruction unchanged
Branch mispredictions cause pipeline flushes
Power overhead from pipeline registers

Branch Prediction

Branch prediction guesses the outcome of conditional branches before they're resolved, keeping the pipeline full.

Static Prediction Methods

Always Predict Not Taken: Simple, but wrong ~50% for general code
Always Predict Taken: Better for loops (backward branches often taken)
Backward Taken, Forward Not Taken (BTFNT): Exploits loop behavior

Dynamic Prediction Methods

1-Bit Predictor: Remember last outcome. Problem: double misprediction at loop boundaries.
2-Bit Saturating Counter: Needs two wrong predictions to change. Much better for loops.
Correlating Predictors: Use history of recent branches to predict (captures patterns).
Tournament Predictors: Multiple predictors compete; choose best per-branch.

2-Bit Saturating Counter

                    Taken
       ┌───────────────────────────────┐
       │                               │
       ▼           Taken               │
  ┌─────────┐   ─────────▶    ┌─────────┐
  │ Strongly│                 │ Weakly  │
  │  Taken  │   ◀─────────    │  Taken  │
  └─────────┘    Not Taken    └─────────┘
       │                               ▲
       │                               │
       │ Not Taken          Taken      │
       ▼                               │
  ┌─────────┐   ─────────▶    ┌─────────┐
  │ Strongly│                 │ Weakly  │
  │Not Taken│   ◀─────────    │Not Taken│
  └─────────┘    Not Taken    └─────────┘
       ▲                               │
       │                               │
       └───────────────────────────────┘
                  Not Taken

Comparing Branch Prediction Performance

Predictor Type	Accuracy	Hardware Cost	Use Case
Static (BTFNT)	~65%	None	Simple embedded
1-Bit Local	~80%	Low	Basic processors
2-Bit Local	~85%	Low	Most common
Correlating/Global	~92%	Medium	Desktop CPUs
Tournament/Hybrid	~95%+	High	High-performance
Neural (TAGE-SC-L)	~97%+	Very High	Modern CPUs

Datapath Implementation

Single-Cycle Datapath

In a single-cycle design, every instruction completes in exactly one clock cycle. The clock period must be long enough for the slowest instruction.

Simplified Single-Cycle Datapath

        ┌─────┐                  ┌─────────┐
PC ────▶│ I   │─── Instr ──────▶│ Control │──── Control Signals
        │ Mem │                  └─────────┘
        └─────┘                       │
           │                          ▼
           │    ┌──────────────────────────────────────┐
           │    │          Register File               │
           │    │  ┌────┐                   ┌────┐    │
           │    │  │Read│───▶ Data1 ───▶┐   │Write│   │
           │    │  │Reg1│               │   │Data │   │
           │    │  └────┘               │   └────┘   │
           │    │  ┌────┐               │     ▲       │
           │    │  │Read│───▶ Data2 ──┐ │     │       │
           │    │  │Reg2│             │ │     │       │
           │    │  └────┘             │ │     │       │
           │    └─────────────────────┼─┼─────┼───────┘
           │                          │ │     │
           │                          ▼ ▼     │
           │                       ┌─────┐   │
           │                       │ ALU │───┼───────────▶ Result
           │                       └─────┘   │
           │                          │      │
           │                          ▼      │
           │                      ┌───────┐  │
           │                      │ Data  │──┘
           │                      │Memory │
           │                      └───────┘
           │
           ▼
      ┌─────────┐
      │ PC + 4  │───▶ Next PC
      └─────────┘

Single-Cycle Limitations

Clock period = time for slowest instruction (usually load: fetch + decode + ALU + memory + write-back). Faster instructions (like ADD) must wait the same time, wasting cycles.

Pipelined Datapath

Pipelining breaks execution into stages with pipeline registers between them. Each stage works on a different instruction simultaneously.

5-Stage Pipeline Datapath

┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐
│   IF   │──▶│   ID   │──▶│   EX   │──▶│  MEM   │──▶│   WB   │
│        │   │        │   │        │   │        │   │        │
│ Fetch  │   │ Decode │   │Execute │   │ Memory │   │ Write  │
│ Instr  │   │ & Read │   │  ALU   │   │ Access │   │  Back  │
│        │   │  Regs  │   │        │   │        │   │        │
└────────┘   └────────┘   └────────┘   └────────┘   └────────┘
     │            │            │            │
     ▼            ▼            ▼            ▼
  IF/ID        ID/EX        EX/MEM      MEM/WB
  Register     Register     Register    Register

Pipeline Hazards

Structural Hazards: Hardware resource conflict (e.g., single memory for instruction and data). Solution: separate I-cache and D-cache.
Data Hazards: Instruction needs data not yet available. Solutions: forwarding/bypassing, stalling.
Control Hazards: Branch/jump changes flow. Solutions: branch prediction, delayed branches.

Comparison

Aspect	Single-Cycle	Pipelined
Clock Period	Long (slowest instruction)	Short (slowest stage)
Throughput	1 instr per (long) cycle	~1 instr per (short) cycle
Latency	1 cycle	N cycles (N stages)
Complexity	Simple	Complex (hazard handling)
Hardware	Less	More (pipeline registers)

Memory Hierarchy

Types of Memory

Memory Hierarchy Pyramid

                    ┌───────┐
                    │  CPU  │
                    │ Regs  │  ← Fastest, smallest, most expensive
                    └───┬───┘     ~1 cycle access
                        │
                   ┌────┴────┐
                   │   L1    │  ← ~4 cycles, 32-64 KB
                   │  Cache  │
                   └────┬────┘
                        │
                  ┌─────┴─────┐
                  │    L2     │  ← ~10-20 cycles, 256 KB - 1 MB
                  │   Cache   │
                  └─────┬─────┘
                        │
                 ┌──────┴──────┐
                 │     L3      │  ← ~40-75 cycles, 4-64 MB
                 │    Cache    │
                 └──────┬──────┘
                        │
              ┌─────────┴─────────┐
              │    Main Memory    │  ← ~100-300 cycles, 8-128 GB
              │       (RAM)       │
              └─────────┬─────────┘
                        │
           ┌────────────┴────────────┐
           │   Secondary Storage     │  ← millions of cycles
           │     (SSD / HDD)         │     TB capacity
           └─────────────────────────┘

Memory Type	Technology	Speed	Volatility	Use
Registers	Flip-flops	~0.25 ns	Volatile	Current operands
SRAM (Cache)	6 transistors/bit	~1-10 ns	Volatile	L1/L2/L3 cache
DRAM (RAM)	1 transistor + capacitor	~50-100 ns	Volatile	Main memory
Flash/SSD	Floating-gate transistors	~25-100 μs	Non-volatile	Storage
HDD	Magnetic platters	~5-10 ms	Non-volatile	Bulk storage

How Different Memory Types Are Used

Registers: Hold data actively being computed. Compiler allocates variables here when possible.
L1 Cache: Split into I-cache (instructions) and D-cache (data). Holds most recently accessed data. Checked first on every memory access.
L2/L3 Cache: Larger, slower backup. L3 often shared among cores. Catches misses from L1.
Main Memory (RAM): Holds active programs and data. OS manages allocation. Accessed on cache miss.
Virtual Memory: Uses disk as "backup" for RAM. Pages swapped in/out as needed.
Storage: Persistent file storage. Programs and data loaded from here into RAM.

Principle of Locality

Temporal Locality: Recently accessed data likely to be accessed again soon.
Spatial Locality: Data near recently accessed data likely to be accessed soon.
Caches exploit both by keeping recent data and fetching in blocks (cache lines).

Calculating Cache Miss Delays

Cache performance significantly impacts overall system performance.

Average Memory Access Time (AMAT) = Hit Time + (Miss Rate × Miss Penalty)

Example Calculation

Given:

L1 cache hit time: 1 cycle
L1 miss rate: 5%
L2 cache hit time: 10 cycles
L2 miss rate (of L1 misses): 20%
Main memory access time: 200 cycles

L2 AMAT = 10 + (0.20 × 200) = 10 + 40 = 50 cycles

Overall AMAT = 1 + (0.05 × 50) = 1 + 2.5 = 3.5 cycles average

For a Program

Given a program with 1000 memory references:

950 L1 hits (95%): 950 × 1 = 950 cycles
40 L2 hits (50 L1 misses × 80%): 40 × 10 = 400 cycles
10 Main memory accesses: 10 × 200 = 2000 cycles

Total = 950 + 400 + 2000 = 3350 cycles for 1000 references = 3.35 cycles/access

Multiprocessor Architectures

Why Multiprocessors?

The transition to multiprocessor (multi-core) architectures happened because single-processor performance improvements hit fundamental limits:

The Power Wall

Power consumption and heat generation grew faster than performance. Increasing clock speed beyond ~4 GHz became impractical due to cooling limitations. Power ≈ Capacitance × Voltage² × Frequency.

The Memory Wall

Processor speed improved much faster than memory speed. CPUs spend increasing time waiting for data. Even with caches, memory latency limits single-thread performance.

The ILP Wall

Instruction-Level Parallelism has diminishing returns. Finding independent instructions to execute simultaneously becomes harder. Out-of-order execution complexity grows exponentially.

The Solution: Parallelism

Instead of making one core faster, use multiple cores. Trade single-thread performance for multi-thread throughput. Let software exploit parallelism through threads/processes.

Types of Parallelism

Thread-Level Parallelism (TLP)

Multiple threads execute simultaneously on different cores. Requires parallel software design. Good for server workloads, scientific computing.

Data-Level Parallelism (DLP)

Same operation on multiple data elements (SIMD). GPUs excel at this. Good for graphics, machine learning, signal processing.

Multiprocessor Types

Type	Description	Example
SMP (Symmetric)	All processors equal, shared memory	Typical desktop/laptop CPUs
NUMA	Non-uniform memory access times	Multi-socket servers
Heterogeneous	Different processor types (CPU + GPU)	Modern SoCs, game consoles
Cluster	Networked independent computers	Data centers, HPC

Challenges

Cache Coherence: Keeping caches consistent when multiple cores access shared data
Synchronization: Coordinating access to shared resources (locks, barriers)
Amdahl's Law: Speedup limited by sequential portion of code. If 10% is sequential, max speedup is 10× regardless of core count.
Programming Difficulty: Parallel programming is harder than sequential. Race conditions, deadlocks, load balancing.

Amdahl's Law: Speedup = 1 / ((1 - P) + P/N)

Where P = parallel fraction, N = number of processors