The Design Problem

With 64-bit integers, we only get 2^64 = 18,446,744,073,709,551,616 distinct values. But quantities in physics and mathematics can easily go well beyond this.

We need to find something that is not only efficient in storage space, but also in computational time, because we only have so many resources.

Choice 1: Fixed Point

Represent a number x as ±(I.F) where:

  • I is the integer part of |x|, represented using p radix-β digits
  • F is the fractional part of |x|, represented using q radix-β digits

Represent x as a (p + q)-digit radix-β string that we write from most-significant to least-significant digit as the triple (sign; the digits of I; the digits of F)

Thus, in a fixed point system with (β = 2, p = 15, q = 16, plus a sign bit):

  • The value −5.5 would be represented as (1; 000 0000 0000 0101; 1000 0000 0000 0000), or 0x80058000
  • The value 0.1 would be represented as (0; 000 0000 0000 0000; 0001 1001 1001 1001), or 0x00001999 (approximate, since 0.1 has no finite binary expansion)
  • The range would be |x| ≤ 2^15 − 2^−16 ≈ 32768, along with a fixed resolution of 2^−16
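The two encodings above can be reproduced with a short sketch (assuming the 1-sign/15-integer/16-fraction bit layout described here; `fx_encode` is a hypothetical helper, and extra fraction bits are truncated rather than rounded):

```python
def fx_encode(x: float) -> int:
    """Sign-magnitude fixed point: 1 sign bit, 15 integer bits, 16 fraction bits."""
    sign = 1 if x < 0 else 0
    mag = int(abs(x) * (1 << 16))   # scale by 2^16, truncate what doesn't fit
    assert mag < (1 << 31), "out of range"
    return (sign << 31) | mag

print(hex(fx_encode(-5.5)))  # 0x80058000
print(hex(fx_encode(0.1)))   # 0x1999 (the truncated 0x00001999)
```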

The problem with this method is that many bits are wasted: small values leave all the integer digits at zero, while the range and resolution stay fixed no matter the magnitude of x.

For this reason, we move to using Choice 2:

Choice 2: Floating Point

Extend the idea that is used in scientific notation:

  • Represent x as ±S × 10^E with 1 ≤ S < 10, where
  • S: significand/mantissa; E: exponent
  • The representation of a nonzero value is then unique

Why unique? Because the constraint 1 ≤ S < 10 forces a single choice of exponent for every nonzero value.

From Base-10 to Base-2

Represent x as ±S × 2^E with 1 ≤ S < 2, where S = (1.b₁b₂b₃…)₂

  • We cannot store an infinite number of bits after the binary point, so we keep p bits in all, of which p − 1 are stored explicitly (the leading 1 is implied: the hidden bit)

We will call the set of representable numbers FP(p, q)

Parameters for FP(p, q)

The place value of bit bᵢ is 2^−i

Precision: the number of bits in the significand. Precision = p

  • The value 1 has the single representation (1.00…0)₂ × 2^0
  • The smallest positive normalized number greater than 1 has representation (1.00…01)₂ × 2^0 and value 1 + 2^−(p−1)

Machine Epsilon ε: the gap between this number and 1: ε = 2^−(p−1)

For an arbitrary normalized FP number x = ±(1.b₁…bₚ₋₁)₂ × 2^E, the unit in the last place ulp(x) = 2^(E−(p−1)) gives the gap between x and the next larger (smaller) normalized FP number, for x > 0 (for x < 0), if such a number exists

Ex: Toy FP System

Consider a system where all normalized FP numbers have the form ±(1.b₁b₂)₂ × 2^E, with bᵢ ∈ {0, 1} and with E ∈ {−1, 0, 1}. What values can this system handle?

| significand | E = −1 | E = 0 | E = 1 |
| ----------- | ------ | ----- | ----- |
| (1.00)₂     | 0.5    | 1     | 2     |
| (1.01)₂     | 0.625  | 1.25  | 2.5   |
| (1.10)₂     | 0.75   | 1.5   | 3     |
| (1.11)₂     | 0.875  | 1.75  | 3.5   |
| ulp         | 0.125  | 0.25  | 0.5   |
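The table can be checked by brute-force enumeration (a sketch; powers of two and the significand fractions are all exact in binary floats):

```python
# All positive values (1.b1 b2)_2 * 2^E for E in {-1, 0, 1}
values = sorted((1 + b1 / 2 + b2 / 4) * 2.0**E
                for b1 in (0, 1) for b2 in (0, 1) for E in (-1, 0, 1))
print(values)  # 12 values, from 0.5 up to 3.5
```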

The Representation of E

Suppose we represent E using q bits

  • Must handle both positive and negative values of E

Define bias = 2^(q−1) − 1, and store the exponent as ebits_q(E) = E + bias, an unsigned q-bit integer

  • We are biasing, but why bias?
  • We bias because it is much, much easier to compare numbers stored in a biased representation (they order the same way as plain unsigned integers) than it is to compare two's-complement values.

When you are encoding, you need to add the bias. Then, when you are decoding, subtract the bias.
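A sketch of the encode/decode pair for q = 4 (the names `ebits`/`ebits_inv` mirror the notation used in these notes; the loop checks the comparison property that motivates biasing):

```python
Q = 4
BIAS = 2**(Q - 1) - 1   # 7 when q = 4

def ebits(E: int) -> int:
    return E + BIAS          # encode: add the bias

def ebits_inv(bits: int) -> int:
    return bits - BIAS       # decode: subtract the bias

# Biased exponents order exactly like the exponents they encode,
# so hardware can compare them as unsigned integers:
for a in range(-6, 8):
    for b in range(-6, 8):
        assert (ebits(a) < ebits(b)) == (a < b)
```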

Normalized FP Numbers

Minimum representable exponent: E_min = 1 − bias (from exponent field 0…01; the all-zeros field is reserved)

Minimum normalized positive FP number: N_min = (1.00…0)₂ × 2^E_min = 2^E_min

Maximum representable exponent: E_max = (2^q − 2) − bias = bias (from exponent field 1…10; the all-ones field is reserved)

Maximum normalized positive FP number: N_max = (1.11…1)₂ × 2^E_max = (2 − ε) × 2^E_max

ε is machine epsilon 2^−(p−1).

So, in all, we need 1 + (p − 1) + q = p + q bits.
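For the FP(4, 4) system used in the examples below, these parameters work out as follows (a sketch):

```python
p, q = 4, 4
bias = 2**(q - 1) - 1            # 7
E_min = 1 - bias                 # -6   (exponent field 0001)
E_max = (2**q - 2) - bias        # 7    (exponent field 1110)
eps = 2.0**-(p - 1)              # 0.125, machine epsilon
N_min = 2.0**E_min               # 0.015625, smallest normalized positive
N_max = (2 - eps) * 2.0**E_max   # 240.0,    largest normalized positive
```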

How to interpret the mantissa

The mantissa can be interpreted using a simple property. Reading each bit bᵢ from the left (ignore endianness, and instead look at this in human reading form), we simply multiply the bit by 2^−i, sum, and add the hidden 1.

Example:

Given the binary representation 0 0001 001 of a floating point FP(4, 4) number x, what is x?

We know that there are 4 precision and exponent bits each (so 3 stored mantissa bits plus the hidden bit).

  1. See that the sign bit is 0, so this number will be positive
  2. The exponent is E = 1 − 7 = −6 (because the bias is 2^(4−1) − 1 = 7)
  3. The mantissa is (1.001)₂ = 1 + 2^−3 = 1.125
  4. Combining all of these values: x = +1.125 × 2^−6 = 0.017578125

Note that step 2 was also us finding E_min! The equation for that is E_min = ebits_q^−1(0^(q−1)1) = 1 − bias.
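The four steps above can be sketched as a tiny decoder (normalized numbers only; subnormals come later):

```python
def fp_decode(bits: str, p: int = 4, q: int = 4) -> float:
    """Decode a normalized pattern written as 'sign exponent mantissa'."""
    s, e, m = bits.split()
    bias = 2**(q - 1) - 1
    E = int(e, 2) - bias                  # step 2: subtract the bias
    sig = 1 + int(m, 2) / 2**(p - 1)      # step 3: hidden bit is 1
    return (-1)**int(s) * sig * 2.0**E    # step 4: combine sign, significand, exponent

print(fp_decode("0 0001 001"))  # 0.017578125 = 1.125 * 2**-6
```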

How to Represent Zero

Since zero cannot be written in the normalized form (1.b₁b₂…)₂ × 2^E, we have a special case

The value 0 has two representations, +0 and −0: the sign bit is 0 or 1, and all exponent and mantissa bits are 0

Gaps at Zero

Because the spacing of normalized numbers keeps halving as they shrink but they can never actually reach zero, there is a disproportionately large gap between 0 and ±2^E_min. We fill it by taking equal steps from 2^E_min down to zero.

Subnormal Numbers

Extend ebits_q^−1 by defining ebits_q^−1(0^q) = E_min; note that this is not equal to 0 − bias = −bias

Now, if the exponent field is 0^q, then treat the hidden bit as 0 rather than 1 in calculating the value.

Essentially, we interpret the mantissa as we do regularly, except we set the hidden bit to 0, so it becomes 0.<mantissa>, and then we multiply by the smallest exponent we can represent, which is given by:

E_min = ebits_q^−1(0^(q−1)1) = 1 − bias = −(bias − 1)

So, how can we actually interpret a subnormal number?

Interpreting Subnormal Numbers

Given the binary representation 0 0000 101 of a floating point FP(4, 4) number x, what is x?

We know that there are 4 precision and exponent bits each.

  1. See that the sign bit is 0, so this number will be positive
  2. The exponent is all zeroes, so it must be a subnormal number (of course it is, that’s the point of this problem). Because this is our special case, we set the exponent to E_min (−6)
    1. The bias can be found by doing 2^(q−1) − 1 = 2^3 − 1 = 7
    2. So: we do E_min = 1 − bias = 1 − 7 = −6
  3. Because the exponent is all zeroes, the hidden bit also becomes 0, therefore: The mantissa is (0.101)₂ = 0.625
  4. Combining all of these values: x = +0.625 × 2^−6 = 0.009765625
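Extending the earlier decoder sketch with the all-zeros-exponent special case handles both worked examples:

```python
def fp_decode(bits: str, p: int = 4, q: int = 4) -> float:
    """Decode 'sign exponent mantissa', normalized or subnormal."""
    s, e, m = bits.split()
    bias = 2**(q - 1) - 1
    if int(e, 2) == 0:                    # subnormal (or zero)
        E = 1 - bias                      # E_min, not -bias
        sig = int(m, 2) / 2**(p - 1)      # hidden bit is 0
    else:                                 # normalized
        E = int(e, 2) - bias
        sig = 1 + int(m, 2) / 2**(p - 1)  # hidden bit is 1
    return (-1)**int(s) * sig * 2.0**E

print(fp_decode("0 0000 101"))  # 0.009765625 = 0.625 * 2**-6
```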

Infinities

Similar to expressing 0, we need to somehow express infinity (but it’s a concept, not an actual number, so how?)

We point to infinity, but don’t actually write it out. How?

We write all 1’s in the exponent field, with all 0’s in the mantissa.

The pattern (0; 1^q; 0^(p−1)) will represent the value +∞

The pattern (1; 1^q; 0^(p−1)) will represent the value −∞

Not a Number (NaN)

What are the results of expressions like 1/0 or −1/0?

  • These situations are well defined in terms of limits: 1/0 → +∞ and −1/0 → −∞

But: 0/0, ∞ − ∞, 0 × ∞; or the square root of a negative number?

  • These are “ill-defined” or undefined
  • These results must become NaN’s, and they must propagate throughout operations: 1 + NaN = NaN

Any representation with exponent field all 1’s but significand ≠ 0 is a NaN.

Multiple NaN’s are unordered, but they may be used to represent what kind of illegal operation you hit (e.g., dividing by 0, negative square root, etc.)
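IEEE-754 doubles behave this way, which can be checked directly (note that Python raises exceptions for `1/0` and `math.sqrt(-1)` rather than returning ±∞/NaN, so the demo builds the special values explicitly):

```python
import math

nan = float("nan")
assert math.isnan(1 + nan)               # NaNs propagate through arithmetic
assert not (nan == nan)                  # NaNs are unordered, even against themselves
assert math.isnan(math.inf - math.inf)   # inf - inf is ill-defined -> NaN
assert math.isnan(0.0 * math.inf)        # 0 * inf is ill-defined -> NaN
print("all NaN checks passed")
```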

Rounding

Why?

We have defined the finite subset FP(p, q) of ℝ ∪ {±∞}. How can we reasonably represent an arbitrary extended real number x, ignoring NaN’s?

Defining

We define x to be within the normalized range if N_min ≤ |x| ≤ N_max

If x ∉ FP(p, q), then one of the following must hold:

  1. x is not within the normalized range
  2. x is within the normalized range but its representation requires more than p significand bits to represent exactly (imagine x = 1/3 = (1.0101…)₂ × 2^−2)

Sandwiching

We want to sandwich x between two representable numbers, because that means we can round it up or down towards a value.

Therefore, we define:

  • x⁻ = max{y ∈ FP(p, q) : y ≤ x}
  • x⁺ = min{y ∈ FP(p, q) : y ≥ x}

Basically floor and ceiling functions, restricted to FP(p, q)

With these definitions, we can say that x⁻ ≤ x ≤ x⁺

  • The range is tight
  • The range collapses to the single point x⁻ = x = x⁺ iff x ∈ FP(p, q)

Representing x⁻ and x⁺

If x is a positive real number in the normalized range whose exact, possibly infinite representation is

x = (1.b₁b₂…bₚ₋₁bₚbₚ₊₁…)₂ × 2^E

Then:

x⁻ = (1.b₁b₂…bₚ₋₁)₂ × 2^E (truncate after p significand bits) and x⁺ = x⁻ + 2^(E−(p−1)) (one ulp above x⁻)

If x ∈ FP(p, q): x⁻ = x = x⁺

There is still a corner case where x⁺ will be re-encoded (adding one ulp to significand (1.11…1)₂ carries over to (1.00…0)₂ × 2^(E+1)), but the result will remain a normalized FP number

If x ∉ FP(p, q), then either the lsb of x⁻ is 0 and the lsb of x⁺ is 1, or vice versa.
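A sketch of the truncation rule for x = 1/3 in FP(4, 4) (here 1/3 = (1.0101…)₂ × 2^−2, so E = −2):

```python
from math import floor

p, E = 4, -2
x = 1 / 3                        # (1.0101...)_2 * 2^-2, infinitely many significand bits
# Keep p significand bits: scale to an integer count of ulps, truncate, scale back
sig_lo = floor(x / 2.0**E * 2**(p - 1)) / 2**(p - 1)   # (1.010)_2 = 1.25
x_lo = sig_lo * 2.0**E           # x^- = 0.3125
x_hi = x_lo + 2.0**(E - (p - 1)) # x^+ = x^- + one ulp = 0.34375
assert x_lo <= x <= x_hi         # the sandwich holds
```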

Corner Cases

If x > N_max, then define x⁻ = N_max and x⁺ = +∞

If 0 < x < N_min, then x⁻ is either a subnormal number or zero, and x⁺ is either a subnormal or N_min

Reflect around 0 for negative numbers.

Round Function

For x ∈ ℝ, if x ∈ FP(p, q), then round(x) = x

Otherwise, round(x) depends on the mode:

  • Round down (RD): round(x) = x⁻
  • Round up (RU): round(x) = x⁺
  • Round-to-zero (RTZ): round(x) = x⁻ if x > 0, else x⁺
  • Round-to-nearest, with tiebreak to even (RTE): whichever of x⁻ or x⁺ is closer to x
    • If there is a tie, choose the one whose significand has lsb = 0
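The four modes can be sketched over any finite sorted set of representable values, here the toy (1.b₁b₂)₂ × 2^E system from earlier. The tie-break exploits the fact that x⁻ and x⁺ are one ulp apart, so exactly one of them has an even significand when counted in ulps:

```python
def round_fp(x, fp, mode):
    """Round real x into the finite sorted set fp under the given mode."""
    if x in fp:
        return x
    lo = max(y for y in fp if y <= x)    # x^-
    hi = min(y for y in fp if y >= x)    # x^+
    if mode == "RD":  return lo
    if mode == "RU":  return hi
    if mode == "RTZ": return lo if x > 0 else hi
    # RTE: round to nearest, ties to the neighbor with even lsb
    if x - lo != hi - x:
        return lo if x - lo < hi - x else hi
    ulp = hi - lo
    return lo if (lo / ulp) % 2 == 0 else hi

fp = sorted((1 + b1 / 2 + b2 / 4) * 2.0**E
            for b1 in (0, 1) for b2 in (0, 1) for E in (-1, 0, 1))
print(round_fp(1.3, fp, "RD"), round_fp(1.3, fp, "RU"))  # 1.25 1.5
print(round_fp(1.375, fp, "RTE"))                         # 1.5 (tie -> lsb 0)
```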

Rounding Arithmetic

Note: After every single arithmetic operation, a round is performed.

Floating Point Exceptions and Responses

  1. Invalid operation: Set result to NaN
  2. Division by zero: Set result to ±∞
  3. Overflow: Set result to ±∞ or ±N_max (depending on rounding mode)
  4. Underflow (subnormal): Set result to ±0, ±N_min, or a subnormal
  5. Inexact (common): Set result to the correctly rounded value described above.
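Python doubles exhibit most of these responses directly (a quick sketch; `5e-324` is the smallest positive subnormal double, and `1e308` is near the double-precision N_max):

```python
import math

assert math.isinf(1e308 * 10)     # overflow -> +inf
tiny = 5e-324                     # smallest positive subnormal double
assert tiny / 2 == 0.0            # underflow -> 0
assert 0.1 + 0.2 != 0.3           # inexact: every result is rounded
assert math.isnan(math.inf * 0)   # invalid operation -> NaN
print("all exception-response checks passed")
```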