Floating Point Numbers
Written by Mike James   
Thursday, 23 August 2018
Article Index
Floating Point Numbers
The Binary Point

Fractional conversion

If you want to know how to convert a decimal fraction into a binary one the algorithm is fairly simple.

You multiply the decimal value by 2 and if the result is one or greater the first bit in the binary fraction is a one and if not it is a zero.

If it was a one you subtract one and repeat the operation for the next bit.

In pseudo code using v for the fraction and the string b for the binary fraction,the algorithm is:

 v = 1 / 5
 b = "0."
   v = v * 2
  If v >= 1 Then
   b = b + "1"
   v = v - 1
   b = b + "0"
  End If

Notice that the loop never ends. In a real implementation you would loop round according to the number of bits you needed in the result - this example does infinite precision conversions!

A binary point

Even though various people played around with binary back in the early 19th century, the idea of using a binary point wasn’t something that seemed obvious.

For example, when Leibniz asked Bernoulli to calculate pi in binary he simply multiplied the decimal value by 1035 and then converted it to binary in the usual fashion.

The good news is that all of the integer arithmetic methods work with binary fractions but there is a little problem. You have to assume that all of the numbers in the calculation have their binary point in the same place.

That is, if you are using 16-bit arithmetic you might use 8 bits to the left of the binary point and 8 bits to the right.

When you add, subtract, multiply or divide such numbers the binary point is automatically positioned in the same place in the result. This is called “fixed point” binary representation and it is the simplest type of non-integer arithmetic that computers use. It used to be common in mini-computers but today it has almost been completely superseded by the alternative “floating point” representation.

That is, if you are happy with always having the binary (or any radix) point in a fixed specified position then you can work with fractional values as if they were integers. All of your hardware stays the same and all you have to do is to remember to put the binary point back into the result at the fixed location.

Fixed point arithmetic is simple, so simple that early machines and many early microprocessors only supported fixed point. Basically if a computer had an arithmetic unit (ALU) that worked with integers then the programmer could use the very same hardware to do fixed point arithmetic just by specifying where the binary point was.

For a long time this was the only sort of arithmetic that was supported by the hardware.

Floating point binary

Fixed point may be simple but it is very limited in the range of numbers it can represent. If a calculation resulted in big scale changes i.e. results very much smaller or bigger than the initial set of numbers then the calculation would fail because or under or overflow.

A better scheme if you want trouble free calculations is to use a floating point representation.

In this case the radix point is allowed to move during the calculation and extra bits are allocated to keep track of where it is i.e. the binary point "floats".

The advantage of this approach is clear if you consider multiplying a value such as 123.4 by 1000. If the hardware (decimal in this case!) can only hold four digits then the result is an overflow error

i.e. 123.4 * 1000 = 123400

which truncates to 3400 which is clearly not the right answer.

If the hardware uses the floating point approach it can simply record the shift in the decimal point four places to the right.

You can think of this as a way of allowing a much larger range of numbers to be represented but with a fixed number of digits’ precision.

A floating point number is represented by two parts – an exponent and a fractional part.

The fractional part is just a fixed-point representation of the number – usually with the radix point to the immediate left of the first bit making its value less than 1.

The exponent is a scale factor which determines the true magnitude of the number.

In decimal we are used to this scheme as scientific notation, standard form or exponential notation. For example, Avogadro’s number is usually written as 6.02252 x 1023 and the 23 is the exponent and the 6.02252 is the fractional part – notice that in standard form the fractional part is always less than 10 and more than 1. In floating point representation it is usual for the fractional part to be normalised to be just less than 1.

Floating point in this form was known to the Babylonian’s in 1800 BC but computers took a long time to get round to using it. It was independently proposed by Leonardo Torres y Quevedo at the end of the nineteenth century, by Konrad Zuse in 1936 and George Stibitz in 1939. The first relay computers, the Harvard Mark II for example, had floating point hardware but it was too costly to include in the ENIAC, though it was considered.

In binary, floating point is just the binary equivalent of standard form in decimal. The exponent is the power of two by which you have to multiply the fraction to get the true magnitude. At this point you might want to write floating point off as trivial but there are some subtleties.

For example when the fractional part is zero what should the exponent be set to?

Clearly there is more than one representation for zero. By convention the exponent is made as negative as it can be, i.e. as small as possible in the representation of zero. If two's complement were used this would result in a zero that didn’t have all its bits set to zero and we don't like this for many obvious reasons.

To achieve this a small change is needed to use a biased exponent by adding the largest negative value to it.

For example, if the exponent is six bits in size, the two's complement notation range is –32 to +31.

If instead of two's complement a simple positive representation is used then 0, i.e. all bits zero, represents –32, 32 represents 0 and 63 represents 31. The same range is covered but now the representation of zero has all bits set to zero even if it does now mean 0x2-32.

The value you subtract from the exponent to obtain its true value is called the “bias” – 32 in our example.






Last Updated ( Thursday, 23 August 2018 )