Floating Point Numbers

Written by Mike James

The place value system uses increasing powers of the “radix” with each place you move to the left, e.g. if the radix is ten then the powers are:

`   10^3   10^2   10^1   10^0`
`   1000   100    10     1`

and the number 1234, for example, is just:

1000*1 + 100*2 + 10*3 + 1*4

To extend this system to fractions we need to invent the decimal point – or more generally the “radix” point.

Places to the right of the decimal point correspond to negative powers of the radix. That is:

`   10^3  10^2  10^1  10^0  .  10^-1  10^-2`
`   1000  100   10    1        1/10   1/100`

and the number 1234.56 is

1000*1 + 100*2 + 10*3 + 1*4 + 1/10*5 + 1/100*6
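The expansion above is easy to check with a short sketch. The helper name `place_value` is my own, purely for illustration:

```python
# Evaluate a numeral digit by digit using place values.
# Digits left of the point get non-negative powers of the radix;
# digits right of the point get negative powers.
def place_value(int_digits, frac_digits, radix=10):
    value = 0.0
    for i, d in enumerate(reversed(int_digits)):
        value += d * radix ** i          # ..., radix^2, radix^1, radix^0
    for i, d in enumerate(frac_digits, start=1):
        value += d * radix ** -i         # radix^-1, radix^-2, ...
    return value

print(place_value([1, 2, 3, 4], [5, 6]))   # approximately 1234.56
```

The same function works for any radix, which is the point of the place value system.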

You can see that you need to know the mathematical facts that 10^0 is 1 and 10^-x is 1/10^x for this to make perfect sense.

Fractions are easy enough to write down as long as you know how to find the sum of the negative radix powers, i.e. 1/10, 1/100, 1/1000 and so on, that equals the fraction.

It is sometimes the case that it is impossible to find an exact place value representation of a rational fraction. For example, 1/3 is 0.3333r, where the r means “recurring” and the threes go on forever. Clearly though, as the negative radix powers get smaller and smaller, we very quickly get as close as we like to any rational fraction.
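A quick numerical illustration of how the partial sums of 0.3333r close in on 1/3, with each extra digit shrinking the error by a factor of ten:

```python
# Partial sums of the recurring expansion 0.333... converge to 1/3.
third = 1 / 3
approx = 0.0
for k in range(1, 8):
    approx += 3 * 10 ** -k          # add the next digit's contribution
    print(k, abs(third - approx))   # the error shrinks tenfold each step
```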

The same ideas apply to fractions in any radix place value system. In binary, for example, everything works in the same way, but we call the “point” a binary point and the powers to the right are powers of two, e.g.

`   2^3  2^2  2^1  2^0  .  2^-1  2^-2`
`   8    4    2    1       1/2   1/4`

Notice that now the rational fractions have to be expressed as sums of 1/2, 1/4, 1/8, 1/16 and so on.

What this means is that the rational fractions that don’t have exact representations in binary are different to those in decimal. For example, in decimal 1/5 is exactly 0.2 but in binary it is 0.0011r where the “r” once again means recurring.
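You can see this effect in any language that uses IEEE binary floating point. In Python, for example, the standard `fractions.Fraction` type reveals the exact value that actually gets stored when you write 0.2:

```python
from fractions import Fraction

# 1/5 is exactly 0.2 in decimal but recurring in binary, so the stored
# double is only the nearest representable binary fraction to 1/5.
stored = Fraction(0.2)        # the exact rational value of the double
print(stored)                 # a ratio with a large power-of-two denominator
print(stored == Fraction(1, 5))   # False - it isn't exactly 1/5
print(f"{0.2:.20f}")          # printing more digits exposes the mismatch
```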

Fractional conversion

If you want to know how to convert a decimal fraction into a binary one the algorithm is fairly simple.

You multiply the decimal value by 2; if the result is one or greater, the first bit in the binary fraction is a one, and if not it is a zero.

If it was a one you subtract one and repeat the operation for the next bit.

In pseudo code, using v for the fraction and the string b for the binary fraction, the algorithm is:

```
v = 1 / 5
b = "0."
Do
  v = v * 2
  If v >= 1 Then
    b = b + "1"
    v = v - 1
  Else
    b = b + "0"
  End If
Loop
```

Notice that the loop never ends. In a real implementation you would loop round according to the number of bits you needed in the result - this example does infinite precision conversions!
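For reference, here is a minimal Python version of the same algorithm with the loop bounded by a bit count, so it terminates. The function name `dec_to_bin_frac` is my own:

```python
def dec_to_bin_frac(v, bits):
    """Convert a fraction 0 <= v < 1 to a binary-fraction string,
    truncated to the given number of bits."""
    b = "0."
    for _ in range(bits):
        v = v * 2
        if v >= 1:        # the doubled value carried past the point
            b += "1"
            v -= 1        # remove the bit we just recorded
        else:
            b += "0"
    return b

print(dec_to_bin_frac(1 / 5, 12))   # 0.001100110011 - the 0011 pattern recurs
```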

A binary point

Even though various people played around with binary as far back as the 17th century, the idea of using a binary point wasn’t something that seemed obvious.

For example, when Leibniz asked Bernoulli to calculate pi in binary he simply multiplied the decimal value by 10^35 and then converted it to binary in the usual fashion.

The good news is that all of the integer arithmetic methods work with binary fractions but there is a little problem. You have to assume that all of the numbers in the calculation have their binary point in the same place.

That is, if you are using 16-bit arithmetic you might use 8 bits to the left of the binary point and 8 bits to the right.

When you add, subtract, multiply or divide such numbers the binary point is automatically positioned in the same place in the result. This is called “fixed point” binary representation and it is the simplest type of non-integer arithmetic that computers use. It used to be common in mini-computers but today it has almost been completely superseded by the alternative “floating point” representation.

Put another way, if you are happy with always having the binary (or any radix) point in a fixed specified position then you can work with fractional values as if they were integers. All of your hardware stays the same and all you have to do is to remember to put the binary point back into the result at the fixed location.

Fixed point arithmetic is simple, so simple that early machines and many early microprocessors only supported fixed point. Basically if a computer had an arithmetic unit (ALU) that worked with integers then the programmer could use the very same hardware to do fixed point arithmetic just by specifying where the binary point was.

For a long time this was the only sort of arithmetic that was supported by the hardware.
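As a rough sketch of the idea, here is a toy fixed-point format in Python with 8 bits either side of the binary point, in the spirit of the 16-bit example above. All the names and the scaling convention are illustrative, not any particular hardware's format:

```python
# Toy fixed-point format: values stored as integers scaled by 2**8,
# i.e. 8 bits to the left of the binary point and 8 to the right.
SCALE = 1 << 8   # 256

def to_fixed(x):
    return round(x * SCALE)    # place the binary point by scaling up

def from_fixed(f):
    return f / SCALE           # put the binary point back at the end

a = to_fixed(3.25)             # stored as the integer 832
b = to_fixed(1.5)              # stored as the integer 384

add = a + b                    # addition is plain integer addition
mul = (a * b) >> 8             # multiplication needs one rescale step

print(from_fixed(add))         # 4.75
print(from_fixed(mul))         # 4.875
```

Addition and subtraction need no adjustment at all, which is why an ordinary integer ALU can do fixed-point arithmetic; only multiplication and division need the extra shift to keep the point in the agreed position.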
