The Design Problem
With 64-bit integers, we only get $2^{64} = 18{,}446{,}744{,}073{,}709{,}551{,}616$ distinct values. But quantities in physics and mathematics can easily go well beyond this.
We need a representation that is efficient not only in storage space but also in computational time, because we only have so many resources.
Choice 1: Fixed Point
Represent a number $x$ as $x = \pm(I + F)$ where:
- $I$ is the integer part of $x$, represented using $p$ radix-2 digits
- $F$ is the fractional part of $x$, represented using $q$ radix-2 digits

Represent $x$ as a $(1 + p + q)$-digit radix-2 string that we write from most-significant to least-significant digit as the triple (sign; the digits of $I$; the digits of $F$)
Thus, in a fixed point system with $(p, q) = (15, 16)$:
- The value $-5.5$ would be represented as (1; 000 0000 0000 0101; 1000 0000 0000 0000), or 0x80058000
- The value $0.1$ (approximate) would be represented as (0; 000 0000 0000 0000; 0001 1001 1001 1001), or 0x00001999
- The range would be $(-2^{15}, 2^{15})$ along with a fixed resolution of $2^{-16}$
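The layout above can be sketched in Python. `encode_fixed` is a hypothetical helper name (not from these notes), and the fraction is truncated rather than rounded, matching the 0x00001999 encoding shown above.

```python
def encode_fixed(x, p=15, q=16):
    """Pack x into a (1 + p + q)-bit word: sign, integer part, fraction."""
    sign = 1 if x < 0 else 0
    mag = abs(x)
    integer = int(mag)                  # p-bit integer part
    frac = int((mag - integer) * 2**q)  # q-bit fractional part, truncated
    return (sign << (p + q)) | (integer << q) | frac

print(hex(encode_fixed(-5.5)))  # 0x80058000
print(hex(encode_fixed(0.1)))   # 0x1999 (only an approximation of 0.1)
```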
The problem with this method is that there are a lot of wasted bits: for values that are very small or very large in magnitude, most of the bits are zero and carry no information.
For this reason, we move on to Choice 2:
Choice 2: Floating Point
Extend the idea that is used in scientific notation:
- Represent $x$ as $x = \pm m \times 10^E$ with $1 \le m < 10$, where $x \ne 0$
- $m$: significand/mantissa; $E$: exponent
- The representation of the value $x$ is unique

Why? Because the normalization condition $1 \le m < 10$ forces exactly one nonzero digit before the decimal point, which pins down $E$ (and hence $m$).
From Base-10 to Base-2
Represent $x$ as $x = \pm m \times 2^E$ with $1 \le m < 2$, where $x \ne 0$
- We cannot keep an infinite number of bits after the binary point, so $m = (1.b_1 b_2 \ldots b_{p-1})_2$, having $p$ bits in all, of which only $p - 1$ must be stored explicitly (the leading 1 is implicit)
We will call the set of representable numbers $\mathbb{F}$
Parameters for $\mathbb{F}$
The place value of the last significand bit $b_{p-1}$ is $2^{-(p-1)}$
Precision: the number of bits in the significand. Precision $= p$
- The value $1$ has 1 representation: $(1.00\ldots0)_2 \times 2^0$
- The smallest positive normalized number greater than 1 has representation $(1.00\ldots01)_2 \times 2^0$ and value $1 + 2^{-(p-1)}$

Machine Epsilon $\epsilon_{\text{mach}}$: the gap between this number and 1: $\epsilon_{\text{mach}} = 2^{-(p-1)}$
For an arbitrary normalized FP number $x = \pm(1.b_1 \ldots b_{p-1})_2 \times 2^E$, the unit in the last place $\text{ulp}(x) = 2^{E - (p-1)} = \epsilon_{\text{mach}} \cdot 2^E$ gives the gap between $x$ and the next larger (smaller) normalized FP number, for $x > 0$ (for $x < 0$), if such a number exists
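These definitions can be checked directly against Python floats, which are IEEE-754 doubles with $p = 53$, so $\epsilon_{\text{mach}} = 2^{-52}$ (`math.ulp` requires Python 3.9+):

```python
import math

# Python floats are IEEE-754 doubles: p = 53, so eps_mach = 2**-(p-1) = 2**-52.
eps = 2.0 ** -52
print(1.0 + eps > 1.0)        # True: eps is the gap between 1.0 and the next FP number
print(1.0 + eps / 2 == 1.0)   # True: anything smaller is rounded away
print(math.ulp(1.0) == eps)   # True: ulp(1.0) = 2**(0 - 52)
print(math.ulp(8.0))          # 2**-49: ulp scales with the exponent, eps * 2**E
```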
Ex: Toy FP System
Consider a system where all normalized FP numbers have the form $\pm(1.b_1 b_2)_2 \times 2^E$, with $b_i \in \{0, 1\}$ and with $E \in \{-1, 0, 1\}$. What values can this system handle?
| $m$  | $E = -1$ | $E = 0$ | $E = 1$ |
| ---- | -------- | ------- | ------- |
| 1.00 | 0.5      | 1       | 2       |
| 1.01 | 0.625    | 1.25    | 2.5     |
| 1.10 | 0.75     | 1.5     | 3       |
| 1.11 | 0.875    | 1.75    | 3.5     |
| ulp  | 0.125    | 0.25    | 0.5     |
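The table can be regenerated by brute force; `toy_values` is an illustrative name, not from the notes:

```python
# Enumerate the toy system: significands (1.b1 b2)_2, exponents E in {-1, 0, 1}.
def toy_values():
    vals = {}
    for E in (-1, 0, 1):
        for bits in range(4):            # b1 b2 = 00, 01, 10, 11
            m = 1 + bits / 4             # (1.b1 b2)_2 = 1 + bits * 2**-2
            vals.setdefault(E, []).append(m * 2**E)
    return vals

for E, column in sorted(toy_values().items()):
    ulp = 2 ** (E - 2)                   # ulp = 2**(E - (p-1)) with p = 3
    print(f"E={E:+d}: {column}, ulp={ulp}")
```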
The Representation of $E$
Suppose we represent $E$ using $q$ bits
- Must handle both positive and negative values of $E$

Define the biased exponent $e = E + \text{bias}$, with $\text{bias} = 2^{q-1} - 1$
- We are biasing. Why bias?
- We bias because it is much, much easier to compare numbers stored in a biased representation (a plain unsigned comparison of the bit patterns suffices) than it is to compare two's-complement values.

When you are encoding, you add the bias. Then, when you decode, subtract the bias.
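A minimal sketch of this encode/decode round trip for $q = 4$; the helper names are illustrative, not from the notes:

```python
Q = 4
BIAS = 2 ** (Q - 1) - 1   # bias = 2**(q-1) - 1 = 7

def encode_exp(E):
    return E + BIAS        # stored (biased) exponent e

def decode_exp(e):
    return e - BIAS        # recover the true exponent E

# Comparing true exponents (-6 < 2) reduces to an unsigned comparison of the
# stored patterns (1 < 9) -- this is what makes bias nicer than two's complement.
print(encode_exp(-6) < encode_exp(2))   # True
```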
Normalized FP Numbers
Minimum representable exponent: $E_{\min} = 1 - \text{bias}$ (the all-zeros exponent field is reserved)
Minimum normalized positive FP number: $N_{\min} = 2^{E_{\min}}$
Maximum representable exponent: $E_{\max} = \text{bias}$ (the all-ones exponent field is reserved)
Maximum normalized positive FP number: $N_{\max} = (2 - \epsilon) \times 2^{E_{\max}}$
$\epsilon$ is machine epsilon $2^{-(p-1)}$.
So, in all, we need $1 + (p - 1) + q$ bits.
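These parameters can be computed for any $(p, q)$; `fp_params` is a hypothetical helper, shown here for the $\text{FP}(4, 4)$ system used in the examples below and for IEEE double ($p = 53$, $q = 11$):

```python
def fp_params(p, q):
    """Derived parameters of an FP(p, q) system."""
    bias = 2 ** (q - 1) - 1
    e_min, e_max = 1 - bias, bias
    eps = 2.0 ** -(p - 1)                  # machine epsilon
    n_min = 2.0 ** e_min                   # smallest normalized positive number
    n_max = (2 - eps) * 2.0 ** e_max       # largest normalized number
    return {"bias": bias, "E_min": e_min, "E_max": e_max,
            "eps": eps, "N_min": n_min, "N_max": n_max}

print(fp_params(4, 4))    # bias 7, E_min -6, E_max 7, N_min 2**-6
print(fp_params(53, 11))  # IEEE double: N_max is about 1.8e308
```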
How to interpret the mantissa
The mantissa can be interpreted using a simple property. Reading each bit $b_i$ from the left (ignore endianness, and instead look at this from a human reading form), we simply multiply the bit by $2^{-i}$ and sum the results (the hidden bit contributes $1 \cdot 2^0$).
Example:
Given the binary representation `0 0001 001`
of a floating point $\text{FP}(p, q)$ number $x$ with $p = q = 4$, what is $x$?
We know that there are 4 precision and exponent bits each, so $\text{bias} = 2^{4-1} - 1 = 7$.
- See that the sign bit is 0, so this number will be positive
- The exponent is $E = e - \text{bias} = 1 - 7 = -6$ (because the bias is $7$)
- The mantissa is $(1.001)_2 = 1 + 2^{-3} = 1.125$
- Combining all of these values: $x = +1.125 \times 2^{-6} = 0.017578125$

Note that this also exercised the minimum exponent, $E_{\min} = 1 - \text{bias} = -6$; zeroing the mantissa bits would give $N_{\min} = 2^{E_{\min}} = 2^{-6}$ itself.
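The decoding steps above can be sketched as a small function; `decode_normal` is a hypothetical name, and this version handles only normalized numbers:

```python
def decode_normal(bits, p=4, q=4):
    """bits: a string like '0 0001 001' = sign, exponent field, mantissa field."""
    s, e_field, m_field = bits.split()
    bias = 2 ** (q - 1) - 1
    E = int(e_field, 2) - bias                  # decode: subtract the bias
    m = 1 + int(m_field, 2) * 2.0 ** -(p - 1)   # hidden bit is 1
    return (-1) ** int(s) * m * 2.0 ** E

print(decode_normal("0 0001 001"))   # 1.125 * 2**-6 = 0.017578125
```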
How to Represent Zero
Since zero cannot be written as $\pm(1.b_1 \ldots b_{p-1})_2 \times 2^E$, we have a special case
The value 0 has two representations, $+0$ and $-0$: sign bit 0 or 1, with all exponent and mantissa bits zero
Gaps at Zero
Normalized numbers approach zero only by halving: each step down in exponent halves the spacing, but the sequence never actually reaches zero, leaving a disproportionately large gap between $N_{\min}$ and 0. So we decide instead to take equal steps from $N_{\min}$ down to zero.
Subnormal Numbers
Extend $\mathbb{F}$ by giving the all-zeros exponent field ($e = 0$) a special meaning; note that the exponent it stands for is not equal to $0 - \text{bias}$
Now, if $e = 0$, then treat the hidden bit as 0 rather than 1 in calculating the value.
Essentially, we interpret the mantissa as we do regularly, except we set the hidden bit to 0, so it becomes $(0.b_1 b_2 \ldots b_{p-1})_2$, and then we multiply by the smallest exponent we can represent, which is given by:
$$E_{\min} = \text{ebits}_q^{-1}(0^{q-1}1) = 1 - \text{bias} = -(\text{bias} - 1)$$
So, how can we actually interpret a subnormal number?
Interpreting Subnormal Numbers
Given the binary representation `0 0000 101`
of a floating point $\text{FP}(p, q)$ number $x$ with $p = q = 4$, what is $x$?
We know that there are 4 precision and exponent bits each, so $\text{bias} = 7$.
- See that the sign bit is 0, so this number will be positive
- The exponent field is all zeroes, so it must be a subnormal number (of course it is, that's the point of this problem). Because this is our special case, we set the exponent to $E_{\min}$
- $E_{\min}$ can be found by doing $E_{\min} = 1 - \text{bias}$
- So: we do $E_{\min} = 1 - 7 = -6$
- Because the exponent field is all zeroes, the hidden bit also becomes 0, therefore: the mantissa is $(0.101)_2 = 2^{-1} + 2^{-3} = 0.625$
- Combining all of these values: $x = +0.625 \times 2^{-6} = 0.009765625$
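Extending the earlier decoding sketch to cover the subnormal case: when the exponent field is all zeros, use $E_{\min}$ and a hidden bit of 0. `decode_fp` is an illustrative name, and this sketch deliberately ignores the all-ones (infinity/NaN) patterns:

```python
def decode_fp(bits, p=4, q=4):
    """bits: a string like '0 0000 101' = sign, exponent field, mantissa field."""
    s, e_field, m_field = bits.split()
    bias = 2 ** (q - 1) - 1
    frac = int(m_field, 2) * 2.0 ** -(p - 1)
    if int(e_field, 2) == 0:                    # subnormal (or zero)
        E, hidden = 1 - bias, 0                 # E_min = 1 - bias, hidden bit 0
    else:                                       # normalized
        E, hidden = int(e_field, 2) - bias, 1
    return (-1) ** int(s) * (hidden + frac) * 2.0 ** E

print(decode_fp("0 0000 101"))   # 0.625 * 2**-6 = 0.009765625
print(decode_fp("0 0001 001"))   # normalized case still works: 0.017578125
```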
Infinities
Similar to expressing 0, we need to somehow express infinity (but it's a concept, not an actual number, so how?)
We point to infinity, but don't actually write it out. How? We write all 1's in the exponent field.
The pattern (0; $1\ldots1$; $0\ldots0$) will represent the value $+\infty$
The pattern (1; $1\ldots1$; $0\ldots0$) will represent the value $-\infty$
Not a Number (NaN)
What are the results of expressions like $1/0$ and $-1/0$?
- These situations are well defined in terms of limits: $1/0 = +\infty$ and $-1/0 = -\infty$

But what about $0/0$, $\infty - \infty$, $0 \times \infty$, or the square root of a negative number?
- These are "ill-defined" or undefined
- These results must become NaNs, and they must propagate throughout operations: $1 + \text{NaN} = \text{NaN}$

Any representation with $e = 1\ldots1$ but significand $\ne 0$ is a NaN.
Multiple NaNs are unordered, but they may be used to represent what kind of illegal operation you hit (e.g., $0/0$, negative square root, etc.)
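These semantics are directly observable through Python floats, which follow IEEE arithmetic:

```python
import math

print(math.inf + 1)                 # inf: infinities absorb finite values
nan = math.inf - math.inf           # ill-defined form, becomes NaN
print(math.isnan(nan))              # True
print(math.isnan(1 + nan))          # True: NaN propagates through operations
print(nan == nan, nan < 1, nan > 1) # False False False: NaNs are unordered
```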
Rounding
Why?
We have defined the finite subset $\mathbb{F}$ of $\mathbb{R}$. How can we reasonably represent an arbitrary extended real number $x \in \mathbb{R} \cup \{-\infty, +\infty\}$, ignoring NaNs?
Defining the Normalized Range
We define $x$ to be within the normalized range if $N_{\min} \le |x| \le N_{\max}$
If $x \notin \mathbb{F}$, then one of the following must hold:
- $x$ is not within the normalized range
- $x$ is within the normalized range but its representation requires more than $p$ significand bits to represent exactly (imagine a value like $0.1$, whose binary expansion never terminates)
Sandwiching
We want to sandwich $x$ between two FP numbers, because that means that we can round it up or down towards a representable value.
Therefore, we define:
$$x_- = \max\{y \in \mathbb{F} : y \le x\}, \qquad x_+ = \min\{y \in \mathbb{F} : y \ge x\}$$
Basically floor and ceiling functions onto $\mathbb{F}$
With these definitions, we can say that $x_- \le x \le x_+$
- The range is tight
- The range collapses to the single point $x_- = x = x_+$ iff $x \in \mathbb{F}$
Representing $x_-$ and $x_+$
If $x$ is a positive real number in the normalized range whose exact, possibly infinite representation is $x = (1.b_1 b_2 \ldots b_{p-1} b_p b_{p+1} \ldots)_2 \times 2^E$
Then: $x_- = (1.b_1 b_2 \ldots b_{p-1})_2 \times 2^E$, i.e., truncate the significand after $p - 1$ fraction bits
If $x \notin \mathbb{F}$: $x_+ = x_- + \text{ulp}(x_-)$
There is still a corner case where the significand of $x_-$ is all 1's, so adding a ulp wraps it and $x_+$ will be re-encoded with exponent $E + 1$, but the result will remain a normalized FP number
If $x \notin \mathbb{F}$, then either the lsb of $x_-$ is 0 and the lsb of $x_+$ is 1, or vice versa.
Corner Cases
If $x > N_{\max}$, then define $x_+ = +\infty$ and $x_- = N_{\max}$
If $0 < x < N_{\min}$, then $x_-$ is either a subnormal number, or zero, and $x_+$ is either a subnormal or $N_{\min}$
Reflect around 0 for negative numbers.
Round Function
For $x \in \mathbb{R}$, if $x \in \mathbb{F}$, then $\text{round}(x) = x$
Otherwise, $\text{round}(x)$ depends on the mode:
- Round down (RD): $\text{round}(x) = x_-$
- Round up (RU): $\text{round}(x) = x_+$
- Round-to-zero (RTZ): $x_-$ if $x > 0$, $x_+$ if $x < 0$
- Round-to-nearest, with tiebreak to even (RTE): whichever of $x_-$ or $x_+$ is closer to $x$
  - If there is a tie, choose the one whose significand has lsb = 0
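The four modes can be sketched over the toy system from earlier ($p = 3$, $E \in \{-1, 0, 1\}$), standing in for $\mathbb{F}$; the helper names are illustrative, and this sketch assumes a positive $x$ within the normalized range:

```python
# Each grid entry is (value, integer significand); the significand's parity
# gives the lsb needed for the RTE tiebreak.
SIGS = range(4, 8)   # integer significands 4..7, i.e., (1.b1 b2)_2 = s/4
GRID = sorted((s * 2**E / 4, s) for E in (-1, 0, 1) for s in SIGS)

def round_toy(x, mode):
    """Round positive x (within range) under one of the four modes."""
    lo = max((v, s) for v, s in GRID if v <= x)   # x_minus
    hi = min((v, s) for v, s in GRID if v >= x)   # x_plus
    if mode == "RD":  return lo[0]
    if mode == "RU":  return hi[0]
    if mode == "RTZ": return lo[0]                # x > 0: toward zero is down
    # RTE: nearest value; on a tie, the significand with lsb = 0 wins
    if x - lo[0] != hi[0] - x:
        return lo[0] if x - lo[0] < hi[0] - x else hi[0]
    return lo[0] if lo[1] % 2 == 0 else hi[0]

print(round_toy(1.3, "RD"), round_toy(1.3, "RU"))  # 1.25 1.5
print(round_toy(1.375, "RTE"))  # tie between 1.25 and 1.5 -> 1.5 (lsb 0)
```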
Rounding Arithmetic
Note: after every single arithmetic operation, the result is rounded; e.g., the computed sum of $x, y \in \mathbb{F}$ is $\text{round}(x + y)$.
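This per-operation rounding is why the classic surprise below happens in ordinary Python floats (IEEE doubles):

```python
# 0.1 and 0.2 are already rounded on input, and their sum is rounded again.
a = 0.1 + 0.2
print(a == 0.3)           # False: the rounded sum differs from round(0.3)
print(a)                  # 0.30000000000000004
print(a - 0.3)            # about 5.5e-17, an ulp-scale discrepancy
```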
Floating Point Exceptions and Responses
- Invalid operation: Set result to NaN
- Division by zero: Set result to $\pm\infty$
- Overflow: Set result to $\pm\infty$ or $\pm N_{\max}$ (depending on rounding mode)
- Underflow (subnormal): Set result to $\pm 0$, $\pm N_{\min}$, or a subnormal
- Inexact (common): Set result to the correctly rounded value described above.
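Several of these default responses are visible through Python floats. (Python itself raises `ZeroDivisionError` for `1.0 / 0.0`, so division by zero is shown here via the overflow and invalid-operation paths instead.)

```python
import math

print(1e308 * 10)               # inf: overflow becomes +infinity
print(-1e308 * 10)              # -inf
print(5e-324 / 2)               # 0.0: halving the smallest subnormal underflows to zero
print(math.isnan(math.inf * 0)) # True: invalid operation yields NaN
```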