The Design Problem

With 64-bit integers, we only get 2^64 = 18,446,744,073,709,551,616 distinct values. But quantities in physics and mathematics can easily go well beyond this.

We need to find something that is not only efficient in storage space, but also in computational time, because we only have so many resources.

Choice 1: Fixed Point

Represent a number x as ±(I.F) where:

  • I is the integer part of |x|, represented using p radix-β digits
  • F is the fractional part of |x|, represented using q radix-β digits

Represent x as a (p + q)-digit radix-β string that we write from most-significant to least-significant digit as the triple (sign; the digits of I; the digits of F)

Thus, in a fixed point system with (β = 2, p = 15, q = 16, plus a sign bit):

  • The value −5.5 would be represented as (1; 000 0000 0000 0101; 1000 0000 0000 0000), or 0x80058000
  • The value 0.1 would be represented as (0; 000 0000 0000 0000; 0001 1001 1001 1001), or 0x00001999 (approximate, since 0.1 has no finite binary expansion)
  • The range would be |x| ≤ 2^15 − 2^−16 ≈ 32768, along with a fixed resolution of 2^−16
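The two encodings above can be reproduced with a short sketch (assuming the 1-sign/15-integer/16-fraction bit layout described here; `fx_encode` is a hypothetical helper, and extra fraction bits are truncated rather than rounded):

```python
def fx_encode(x: float) -> int:
    """Sign-magnitude fixed point: 1 sign bit, 15 integer bits, 16 fraction bits."""
    sign = 1 if x < 0 else 0
    mag = int(abs(x) * (1 << 16))   # scale by 2^16, truncate what doesn't fit
    assert mag < (1 << 31), "out of range"
    return (sign << 31) | mag

print(hex(fx_encode(-5.5)))  # 0x80058000
print(hex(fx_encode(0.1)))   # 0x1999 (the truncated 0x00001999)
```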

The problem with this method is that many bits are wasted: small values leave all the integer digits at zero, while the range and resolution stay fixed no matter the magnitude of x.

For this reason, we move to using Choice 2:

Choice 2: Floating Point

Extend the idea that is used in scientific notation:

  • Represent x as ±S × 10^E with 1 ≤ S < 10, where
  • S: significand/mantissa; E: exponent
  • The representation of a nonzero value is then unique

Why unique? Because the constraint 1 ≤ S < 10 forces a single choice of exponent for every nonzero value.

From Base-10 to Base-2

Represent x as ±S × 2^E with 1 ≤ S < 2, where S = (1.b₁b₂b₃…)₂

  • We cannot store an infinite number of bits after the binary point, so we keep p bits in all, of which p − 1 are stored explicitly (the leading 1 is implied: the hidden bit)

We will call the set of representable numbers FP(p, q)

Parameters for FP(p, q)

The place value of bit bᵢ is 2^−i

Precision: the number of bits in the significand. Precision = p

  • The value 1 has the single representation (1.00…0)₂ × 2^0
  • The smallest positive normalized number greater than 1 has representation (1.00…01)₂ × 2^0 and value 1 + 2^−(p−1)

Machine Epsilon ε: the gap between this number and 1: ε = 2^−(p−1)

For an arbitrary normalized FP number x = ±(1.b₁…bₚ₋₁)₂ × 2^E, the unit in the last place ulp(x) = 2^(E−(p−1)) gives the gap between x and the next larger (smaller) normalized FP number, for x > 0 (for x < 0), if such a number exists

Ex: Toy FP System

Consider a system where all normalized FP numbers have the form ±(1.b₁b₂)₂ × 2^E, with bᵢ ∈ {0, 1} and with E ∈ {−1, 0, 1}. What values can this system handle?

| significand | E = −1 | E = 0 | E = 1 |
| ----------- | ------ | ----- | ----- |
| (1.00)₂     | 0.5    | 1     | 2     |
| (1.01)₂     | 0.625  | 1.25  | 2.5   |
| (1.10)₂     | 0.75   | 1.5   | 3     |
| (1.11)₂     | 0.875  | 1.75  | 3.5   |
| ulp         | 0.125  | 0.25  | 0.5   |
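The table can be checked by brute-force enumeration (a sketch; powers of two and the significand fractions are all exact in binary floats):

```python
# All positive values (1.b1 b2)_2 * 2^E for E in {-1, 0, 1}
values = sorted((1 + b1 / 2 + b2 / 4) * 2.0**E
                for b1 in (0, 1) for b2 in (0, 1) for E in (-1, 0, 1))
print(values)  # 12 values, from 0.5 up to 3.5
```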

The Representation of E

Suppose we represent E using q bits

  • Must handle both positive and negative values of E

Define bias = 2^(q−1) − 1, and store the exponent as ebits_q(E) = E + bias, an unsigned q-bit integer

  • We are biasing, but why bias?
  • We bias because it is much, much easier to compare numbers stored in a biased representation (they order the same way as plain unsigned integers) than it is to compare two's-complement values.

When you are encoding, you need to add the bias. Then, when you are decoding, subtract the bias.
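A sketch of the encode/decode pair for q = 4 (the names `ebits`/`ebits_inv` mirror the notation used in these notes; the loop checks the comparison property that motivates biasing):

```python
Q = 4
BIAS = 2**(Q - 1) - 1   # 7 when q = 4

def ebits(E: int) -> int:
    return E + BIAS          # encode: add the bias

def ebits_inv(bits: int) -> int:
    return bits - BIAS       # decode: subtract the bias

# Biased exponents order exactly like the exponents they encode,
# so hardware can compare them as unsigned integers:
for a in range(-6, 8):
    for b in range(-6, 8):
        assert (ebits(a) < ebits(b)) == (a < b)
```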

Normalized FP Numbers

Minimum representable exponent: E_min = 1 − bias (from exponent field 0…01; the all-zeros field is reserved)

Minimum normalized positive FP number: N_min = (1.00…0)₂ × 2^E_min = 2^E_min

Maximum representable exponent: E_max = (2^q − 2) − bias = bias (from exponent field 1…10; the all-ones field is reserved)

Maximum normalized positive FP number: N_max = (1.11…1)₂ × 2^E_max = (2 − ε) × 2^E_max

ε is machine epsilon 2^−(p−1).

So, in all, we need 1 + (p − 1) + q = p + q bits.
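For the FP(4, 4) system used in the examples below, these parameters work out as follows (a sketch):

```python
p, q = 4, 4
bias = 2**(q - 1) - 1            # 7
E_min = 1 - bias                 # -6   (exponent field 0001)
E_max = (2**q - 2) - bias        # 7    (exponent field 1110)
eps = 2.0**-(p - 1)              # 0.125, machine epsilon
N_min = 2.0**E_min               # 0.015625, smallest normalized positive
N_max = (2 - eps) * 2.0**E_max   # 240.0,    largest normalized positive
```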

How to interpret the mantissa

The mantissa can be interpreted using a simple property. Reading each bit bᵢ from the left (ignore endianness, and instead look at this in human reading form), we simply multiply the bit by 2^−i, sum, and add the hidden 1.

Example:

Given the binary representation 0 0001 001 of a floating point FP(4, 4) number x, what is x?

We know that there are 4 precision and exponent bits each (so 3 stored mantissa bits plus the hidden bit).

  1. See that the sign bit is 0, so this number will be positive
  2. The exponent is E = 1 − 7 = −6 (because the bias is 2^(4−1) − 1 = 7)
  3. The mantissa is (1.001)₂ = 1 + 2^−3 = 1.125
  4. Combining all of these values: x = +1.125 × 2^−6 = 0.017578125

Note that step 2 was also us finding E_min! The equation for that is E_min = ebits_q^−1(0^(q−1)1) = 1 − bias.
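The four steps above can be sketched as a tiny decoder (normalized numbers only; subnormals come later):

```python
def fp_decode(bits: str, p: int = 4, q: int = 4) -> float:
    """Decode a normalized pattern written as 'sign exponent mantissa'."""
    s, e, m = bits.split()
    bias = 2**(q - 1) - 1
    E = int(e, 2) - bias                  # step 2: subtract the bias
    sig = 1 + int(m, 2) / 2**(p - 1)      # step 3: hidden bit is 1
    return (-1)**int(s) * sig * 2.0**E    # step 4: combine sign, significand, exponent

print(fp_decode("0 0001 001"))  # 0.017578125 = 1.125 * 2**-6
```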

How to Represent Zero

Since zero cannot be written in the normalized form (1.b₁b₂…)₂ × 2^E, we have a special case

The value 0 has two representations, +0 and −0: the sign bit is 0 or 1, and all exponent and mantissa bits are 0

Gaps at Zero

Because the spacing of normalized numbers keeps halving as they shrink but they can never actually reach zero, there is a disproportionately large gap between 0 and ±2^E_min. We fill it by taking equal steps from 2^E_min down to zero.

Subnormal Numbers

Extend ebits_q^−1 by defining ebits_q^−1(0^q) = E_min; note that this is not equal to 0 − bias = −bias

Now, if the exponent field is 0^q, then treat the hidden bit as 0 rather than 1 in calculating the value.

Essentially, we interpret the mantissa as we do regularly, except we set the hidden bit to 0, so it becomes 0.<mantissa>, and then we multiply by the smallest exponent we can represent, which is given by:

E_min = ebits_q^−1(0^(q−1)1) = 1 − bias = −(bias − 1)

So, how can we actually interpret a subnormal number?

Interpreting Subnormal Numbers

Given the binary representation 0 0000 101 of a floating point FP(4, 4) number x, what is x?

We know that there are 4 precision and exponent bits each.

  1. See that the sign bit is 0, so this number will be positive
  2. The exponent is all zeroes, so it must be a subnormal number (of course it is, that’s the point of this problem). Because this is our special case, we set the exponent to E_min (−6)
    1. The bias can be found by doing 2^(q−1) − 1 = 2^3 − 1 = 7
    2. So: we do E_min = 1 − bias = 1 − 7 = −6
  3. Because the exponent is all zeroes, the hidden bit also becomes 0, therefore: The mantissa is (0.101)₂ = 0.625
  4. Combining all of these values: x = +0.625 × 2^−6 = 0.009765625
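Extending the earlier decoder sketch with the all-zeros-exponent special case handles both worked examples:

```python
def fp_decode(bits: str, p: int = 4, q: int = 4) -> float:
    """Decode 'sign exponent mantissa', normalized or subnormal."""
    s, e, m = bits.split()
    bias = 2**(q - 1) - 1
    if int(e, 2) == 0:                    # subnormal (or zero)
        E = 1 - bias                      # E_min, not -bias
        sig = int(m, 2) / 2**(p - 1)      # hidden bit is 0
    else:                                 # normalized
        E = int(e, 2) - bias
        sig = 1 + int(m, 2) / 2**(p - 1)  # hidden bit is 1
    return (-1)**int(s) * sig * 2.0**E

print(fp_decode("0 0000 101"))  # 0.009765625 = 0.625 * 2**-6
```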

Infinities

Similar to expressing 0, we need to somehow express infinity (but it’s a concept, not an actual number, so how?)

We point to infinity, but don’t actually write it out. How?

We write all 1’s in the exponent field, with all 0’s in the mantissa.

The pattern (0; 1^q; 0^(p−1)) will represent the value +∞

The pattern (1; 1^q; 0^(p−1)) will represent the value −∞

Not a Number (NaN)

What are the results of expressions like 1/0 or −1/0?

  • These situations are well defined in terms of limits: 1/0 → +∞ and −1/0 → −∞

But: 0/0, ∞ − ∞, 0 × ∞; or the square root of a negative number?

  • These are “ill-defined” or undefined
  • These results must become NaN’s, and they must propagate throughout operations: 1 + NaN = NaN

Any representation with exponent field all 1’s but significand ≠ 0 is a NaN.

Multiple NaN’s are unordered, but they may be used to represent what kind of illegal operation you hit (e.g., dividing by 0, negative square root, etc.)
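IEEE-754 doubles behave this way, which can be checked directly (note that Python raises exceptions for `1/0` and `math.sqrt(-1)` rather than returning ±∞/NaN, so the demo builds the special values explicitly):

```python
import math

nan = float("nan")
assert math.isnan(1 + nan)               # NaNs propagate through arithmetic
assert not (nan == nan)                  # NaNs are unordered, even against themselves
assert math.isnan(math.inf - math.inf)   # inf - inf is ill-defined -> NaN
assert math.isnan(0.0 * math.inf)        # 0 * inf is ill-defined -> NaN
print("all NaN checks passed")
```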

Rounding

Why?

We have defined the finite subset FP(p, q) of ℝ ∪ {±∞}. How can we reasonably represent an arbitrary extended real number x, ignoring NaN’s?

Defining

We define x to be within the normalized range if N_min ≤ |x| ≤ N_max

If x ∉ FP(p, q), then one of the following must hold:

  1. x is not within the normalized range
  2. x is within the normalized range but its representation requires more than p significand bits to represent exactly (imagine x = 1/3 = (1.0101…)₂ × 2^−2)

Sandwiching

We want to sandwich x between two representable numbers, because that means we can round it up or down towards a value.

Therefore, we define:

  • x⁻ = max{y ∈ FP(p, q) : y ≤ x}
  • x⁺ = min{y ∈ FP(p, q) : y ≥ x}

Basically floor and ceiling functions, restricted to FP(p, q)

With these definitions, we can say that x⁻ ≤ x ≤ x⁺

  • The range is tight
  • The range collapses to the single point x⁻ = x = x⁺ iff x ∈ FP(p, q)

Representing x⁻ and x⁺

If x is a positive real number in the normalized range whose exact, possibly infinite representation is

x = (1.b₁b₂…bₚ₋₁bₚbₚ₊₁…)₂ × 2^E

Then:

x⁻ = (1.b₁b₂…bₚ₋₁)₂ × 2^E (truncate after p significand bits) and x⁺ = x⁻ + 2^(E−(p−1)) (one ulp above x⁻)

If x ∈ FP(p, q): x⁻ = x = x⁺

There is still a corner case where x⁺ will be re-encoded (adding one ulp to significand (1.11…1)₂ carries over to (1.00…0)₂ × 2^(E+1)), but the result will remain a normalized FP number

If x ∉ FP(p, q), then either the lsb of x⁻ is 0 and the lsb of x⁺ is 1, or vice versa.
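A sketch of the truncation rule for x = 1/3 in FP(4, 4) (here 1/3 = (1.0101…)₂ × 2^−2, so E = −2):

```python
from math import floor

p, E = 4, -2
x = 1 / 3                        # (1.0101...)_2 * 2^-2, infinitely many significand bits
# Keep p significand bits: scale to an integer count of ulps, truncate, scale back
sig_lo = floor(x / 2.0**E * 2**(p - 1)) / 2**(p - 1)   # (1.010)_2 = 1.25
x_lo = sig_lo * 2.0**E           # x^- = 0.3125
x_hi = x_lo + 2.0**(E - (p - 1)) # x^+ = x^- + one ulp = 0.34375
assert x_lo <= x <= x_hi         # the sandwich holds
```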

Corner Cases

If x > N_max, then define x⁻ = N_max and x⁺ = +∞

If 0 < x < N_min, then x⁻ is either a subnormal number or zero, and x⁺ is either a subnormal or N_min

Reflect around 0 for negative numbers.

Round Function

For x ∈ ℝ, if x ∈ FP(p, q), then round(x) = x

Otherwise, round(x) depends on the mode:

  • Round down (RD): round(x) = x⁻
  • Round up (RU): round(x) = x⁺
  • Round-to-zero (RTZ): round(x) = x⁻ if x > 0, else x⁺
  • Round-to-nearest, with tiebreak to even (RTE): whichever of x⁻ or x⁺ is closer to x
    • If there is a tie, choose the one whose significand has lsb = 0
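The four modes can be sketched over any finite sorted set of representable values, here the toy (1.b₁b₂)₂ × 2^E system from earlier. The tie-break exploits the fact that x⁻ and x⁺ are one ulp apart, so exactly one of them has an even significand when counted in ulps:

```python
def round_fp(x, fp, mode):
    """Round real x into the finite sorted set fp under the given mode."""
    if x in fp:
        return x
    lo = max(y for y in fp if y <= x)    # x^-
    hi = min(y for y in fp if y >= x)    # x^+
    if mode == "RD":  return lo
    if mode == "RU":  return hi
    if mode == "RTZ": return lo if x > 0 else hi
    # RTE: round to nearest, ties to the neighbor with even lsb
    if x - lo != hi - x:
        return lo if x - lo < hi - x else hi
    ulp = hi - lo
    return lo if (lo / ulp) % 2 == 0 else hi

fp = sorted((1 + b1 / 2 + b2 / 4) * 2.0**E
            for b1 in (0, 1) for b2 in (0, 1) for E in (-1, 0, 1))
print(round_fp(1.3, fp, "RD"), round_fp(1.3, fp, "RU"))  # 1.25 1.5
print(round_fp(1.375, fp, "RTE"))                         # 1.5 (tie -> lsb 0)
```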

Rounding Arithmetic

Note: After every single arithmetic operation, a round is performed.

Floating Point Exceptions and Responses

  1. Invalid operation: Set result to NaN
  2. Division by zero: Set result to ±∞
  3. Overflow: Set result to ±∞ or ±N_max (depending on rounding mode)
  4. Underflow (subnormal): Set result to ±0, ±N_min, or a subnormal
  5. Inexact (common): Set result to the correctly rounded value described above.
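Python doubles exhibit most of these responses directly (a quick sketch; `5e-324` is the smallest positive subnormal double, and `1e308` is near the double-precision N_max):

```python
import math

assert math.isinf(1e308 * 10)     # overflow -> +inf
tiny = 5e-324                     # smallest positive subnormal double
assert tiny / 2 == 0.0            # underflow -> 0
assert 0.1 + 0.2 != 0.3           # inexact: every result is rounded
assert math.isnan(math.inf * 0)   # invalid operation -> NaN
print("all exception-response checks passed")
```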