Floating Point & Rounding

01. Non-Integer Numbers

Binary integers are easy to represent (unsigned or two's complement). But what about 3.1415 or 0.005? In base 10 (decimal), digits to the right of the decimal point represent powers of 1/10 ($10^{-1}, 10^{-2}, \dots$).

In binary (base 2), we do the same. Digits to the right of the binary point represent powers of 1/2 ($2^{-1}, 2^{-2}, \dots$).

... $2^2$ | $2^1$ | $2^0$ . $2^{-1}$ | $2^{-2}$ | $2^{-3}$ ...
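As a minimal sketch of these positional weights, the Python snippet below evaluates a binary string that contains a binary point. The function name `binary_to_value` is made up for this illustration and handles unsigned input only.

```python
def binary_to_value(bits: str) -> float:
    """Evaluate an unsigned binary string such as '101.011' by positional weights."""
    whole, _, frac = bits.partition(".")
    value = 0.0
    # Digits left of the point carry weights 2^0, 2^1, 2^2, ... (right to left).
    for power, digit in enumerate(reversed(whole)):
        value += int(digit) * 2 ** power
    # Digits right of the point carry weights 2^-1, 2^-2, 2^-3, ...
    for position, digit in enumerate(frac, start=1):
        value += int(digit) * 2 ** -position
    return value

print(binary_to_value("101.011"))  # 4 + 1 + 0.25 + 0.125 = 5.375
```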

There are two main ways to handle this in computer architecture:

Fixed Point

We lock the binary point at a specific column (e.g., the last 4 bits are always the fraction).

PRO: Simple math and logic.
CON: Inflexible. You lose range. If you need tiny numbers, you can't have huge numbers.
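The trade-off can be sketched in Python, assuming an unsigned 8-bit word whose last 4 bits are the fraction. The names `to_fixed` and `from_fixed` and the choice of 8 bits are assumptions made for this illustration.

```python
FRACTION_BITS = 4            # assumed: the last 4 bits are the fraction
SCALE = 1 << FRACTION_BITS   # 2^4 = 16 steps per unit

def to_fixed(x: float) -> int:
    """Encode x as an unsigned 8-bit fixed-point integer (round to nearest step)."""
    raw = round(x * SCALE)
    if not 0 <= raw <= 0xFF:
        raise OverflowError(f"{x} does not fit in 8 bits with {FRACTION_BITS} fraction bits")
    return raw

def from_fixed(raw: int) -> float:
    """Decode the stored integer back to a real value."""
    return raw / SCALE

raw = to_fixed(3.1415)
print(f"{raw:08b}", from_fixed(raw))   # 00110010 3.125
print(from_fixed(0xFF))                # largest value: 15.9375
print(from_fixed(to_fixed(0.005)))     # 0.0, too small for a step of 0.0625
```

With this layout the largest value is 15.9375 and the smallest nonzero step is 0.0625. Moving the point buys range or precision, but not both.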

Floating Point (IEEE 754)

The binary point "floats" using scientific notation ($1.xxx \times 2^{exp}$).

PRO: Huge dynamic range (tiny to massive).
CON: More complex hardware; precision varies based on magnitude.
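To see the $1.xxx \times 2^{exp}$ form for ordinary numbers, the sketch below uses Python's standard `math.frexp`. The helper name `normalize` is hypothetical, and it assumes a nonzero finite input.

```python
import math

def normalize(x: float) -> tuple[float, int]:
    """Return (m, e) with abs(x) == m * 2**e and 1 <= m < 2, for nonzero finite x."""
    m, e = math.frexp(abs(x))   # frexp gives 0.5 <= m < 1
    return m * 2, e - 1         # rescale so the leading digit is 1

print(normalize(3.0))     # (1.5, 1)
print(normalize(0.005))   # (1.28, -8)
print(normalize(1000.0))  # (1.953125, 9)
```

Tiny and large values keep the same 1.xxx form. Only the exponent changes.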

02. Anatomy of IEEE 754

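We will use the hypothetical 8-bit model from your text to explain the ideas behind the standard 32-bit format, which has the same three fields but wider (8 exponent bits with a bias of 127, and 23 significand bits). The model divides its bits into three distinct groups: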

1. Sign Bit (1 bit): 0 = Positive, 1 = Negative.
2. Exponent (4 bits): Determines the scale (how far the binary point moves). It uses a Bias of 7.
   Actual Exp = BinaryValue - 7
3. Significand (3 bits): The precision. It assumes a leading "1" before the binary point.
   Significand value = 1 + Signif, where Signif = BinaryValue / 8 (the three bits weigh 1/2, 1/4, 1/8)

Value = $(-1)^S \times (1 + \text{Signif}) \times 2^{(Exp - 7)}$
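The formula above can be sketched as a small Python decoder. The function name `decode8` is made up for this illustration. It assumes every exponent pattern is a normal number, so the special cases of real IEEE 754 (zero, subnormals, infinity, NaN) are ignored.

```python
BIAS = 7

def decode8(bits: str) -> float:
    """Decode an 8-bit pattern such as '0 0111 000' (spaces optional).

    Assumption: all exponent patterns are treated as normal numbers.
    """
    b = bits.replace(" ", "")
    if len(b) != 8 or set(b) - {"0", "1"}:
        raise ValueError("expected exactly 8 binary digits")
    sign = -1.0 if b[0] == "1" else 1.0
    exponent = int(b[1:5], 2) - BIAS      # 4 exponent bits, biased by 7
    fraction = int(b[5:8], 2) / 8         # 3 fraction bits: weights 1/2, 1/4, 1/8
    return sign * (1 + fraction) * 2.0 ** exponent

print(decode8("0 0111 000"))   # 1.0
print(decode8("0 0111 111"))   # 1.875
print(decode8("1 1000 100"))   # -3.0
```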

03. Interactive Bit Lab

This lab models the 8-bit example: flipping any bit (0/1) recalculates the floating point value in real time. It shows three fields, each with its own readout: the sign bit (initially Positive), the exponent with Bias 7 (initially $2^0$), and the significand 1.xxx (initially 1.0).

Try These Text Examples:

Example 1 (1.0): 0 0111 000 (Exp is 7-7=0, Signif is 0, so $1.0 \times 2^0$)
Example 2 (1.875): 0 0111 111 (Exp is 7-7=0, Signif is 0.875, so $1.875 \times 2^0$)
Example 3 (-3.0): 1 1000 100 (Sign is negative, Exp is 8-7=1, Signif is 0.5, so $-1.5 \times 2^1$)
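Going the other way, from a value to its bit pattern, can be sketched as follows. The function name `encode8` is hypothetical. It assumes a nonzero finite input and rounds the fraction to the nearest of the 8 available steps, and it does not model the special cases of real IEEE 754.

```python
import math

BIAS = 7

def encode8(x: float) -> str:
    """Encode x as 'S EEEE FFF', rounding the fraction to 3 bits."""
    if x == 0 or not math.isfinite(x):
        raise ValueError("this sketch handles only nonzero finite values")
    sign = "1" if x < 0 else "0"
    m, e = math.frexp(abs(x))        # abs(x) == m * 2**e with 0.5 <= m < 1
    m, e = m * 2, e - 1              # rescale so that 1 <= m < 2
    fraction = round((m - 1) * 8)    # nearest multiple of 1/8
    if fraction == 8:                # rounding carried into the next power of two
        fraction, e = 0, e + 1
    stored = e + BIAS
    if not 0 <= stored <= 15:
        raise OverflowError(f"exponent {e} does not fit in 4 bits")
    return f"{sign} {stored:04b} {fraction:03b}"

print(encode8(1.0))     # 0 0111 000
print(encode8(1.875))   # 0 0111 111
print(encode8(-3.0))    # 1 1000 100
print(encode8(3.1415))  # 0 1000 101
```

The last pattern decodes to 3.25, not 3.1415. With only 3 fraction bits, 3.25 is the nearest value the model can store.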

04. Rounding Error

Because we have finite bits (8 in our model, 32 in standard), we cannot represent every number on the infinite number line. We can only store specific "dots".

Key Concept: The gaps between representable numbers get larger as the numbers get bigger.

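Near zero, the representable values are densely packed, so the absolute error is small. At high magnitudes they are sparse, so the absolute error grows. The relative error stays roughly constant, because the gap scales with the magnitude.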

[Figure: number line of value magnitude from 0 to MAX, with representable values dense near 0 and sparse near MAX]
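The widening gaps can be computed directly for the 8-bit model. This is a sketch that assumes all 16 exponent patterns are normal numbers, and the function name `gap` is made up for this illustration.

```python
BIAS = 7
FRACTION_BITS = 3

def gap(stored_exponent: int) -> float:
    """Spacing between neighbouring values that share one stored exponent."""
    # One step of the 3-bit fraction is 1/8, scaled by 2^(exponent - bias).
    return 2.0 ** (stored_exponent - BIAS) / 2 ** FRACTION_BITS

for stored in (0, 7, 15):
    low = 2.0 ** (stored - BIAS)   # smallest magnitude with this exponent
    print(low, gap(stored))
# 0.0078125 0.0009765625
# 1.0 0.125
# 256.0 32.0
```

In each row the gap is exactly 1/8 of the smallest value in that range. The absolute spacing grows with magnitude while the relative spacing does not.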

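With 32-bit IEEE 754, the maximum relative rounding error for a normalized value is about 0.000006% ($2^{-24} \approx 6 \times 10^{-8}$). That is sufficient for most engineering work, but dangerous if you ignore it in comparison logic, for example when testing two computed values for exact equality.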
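A short Python sketch of both points. Python's own `float` is 64-bit, so the hypothetical helper `to_float32` uses the standard `struct` module to round a value to 32 bits. The tolerance-based comparison with `math.isclose` is one common remedy, not the only one.

```python
import math
import struct

def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

x = 0.1
stored = to_float32(x)
relative_error = abs(stored - x) / x
print(relative_error <= 2 ** -24)      # True: within the single-precision bound

# Exact equality is fragile once rounding has occurred.
print(0.1 + 0.2 == 0.3)                # False
print(math.isclose(0.1 + 0.2, 0.3))    # True (default rel_tol is 1e-09)
```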