Applying C - Fixed Point Arithmetic
Monday, 11 May 2020
Article Index
Applying C - Fixed Point Arithmetic
Rescaling
Printing Fixed Point

## Rescaling

Let’s consider the way the four operations of arithmetic behave under fixed point with the binary point at position s.

Suppose we have two variables containing variables a and b with the binary point at bit s, then:

`a+b;`

and:

`a-b;`

Both have the binary point at bit s as:

`a*2s + b*2s = (a+b)* 2s`

and:

`a*2s -b*2s = (a-b)* 2s`

Things are not so simple for multiplication:

`a*b`

is the same as:

`a*2s*b*2s = (a*b)*22s`

and hence a*b has the binary point at bit 2s. To keep the binary point at s you have to rescale by 2s. That is, you implement a*b as:

`a*b/2s`

or:

`a*b>>s`

That is, to multiply two fixed-point binary numbers you need to perform an s-bit shift right to restore the binary point to its fixed position.

Notice that you are throwing away s fractional bits by truncation. In some cases rounding is needed to keep all of the precision possible.

There are many different ways of rounding, but rounding up is easy to implement. All you have to do is take the lower s bits that are about to be thrown away and, if they are greater than ½ of the new lowest-order bit, add one to the low-order bit. You don’t need to use a conditional to do this. Simply take the fractional bits about to be discarded, shift left by one bit and add to the value then truncate. That is, if s is 8:

```value = value+((0xFF & value)<<1);
value>> =8;```

is a rounded result.

Notice that it is possible for the multiplication to overflow, even if in principle the result could be represented – more on this later.

Division has the same problem as multiplication in that it moves the binary point, but in the other direction.

That is:

`a/b`

is the same as:

`a*2s/(b*2s) = (a/b)`

and the binary point has been shifted to the far left. In this case we have to perform a scaling to return the binary point to the same position.

`(a*b)*2s`

or:

`(a/b)<<s;`

Notice that as the result has no fractional bits, a simple rescaling after the division will have all zeros as the fractional part. The fractional part of the division has been lost during the implicit shift right that occurs during division. We can avoid the loss of precision by performing the scaling before the division:

`a*2s/b`

which gives an answer with the binary point at s and preserves the fractional part of the result.

That is, implement:

`a/b`

as:

`(a<<s)/b;`

and not as:

`(a/b)<<s;`

Notice that, despite this being a division which makes the result smaller you can still generate an overflow by the shift left.

This is nearly all there is to know about fixed point arithmetic. Other functions such as square, square root and so on have to analyzed in the same way and their results have to be scaled to bring the binary point back to the same position.

## Temperature Conversion

Let’s look at an example. Consider the problem of a temperature measuring sensor returning a fixed point value in Celsius and we need to convert it to Fahrenheit. The usual formula is to multiple by 9, divide by 5 and then add 32 and for the sake of a fixed point demonstration let’s do exactly this using two digits precision with the binary point at bit 8, i.e. using a scale factor of 28.

Our first problem is how to get some example values into the program. Suppose the temperature is 43.25C, how do we convert this decimal value into fixed point? We can implement the scale factor by a shift of the integer part, but this doesn’t work for the fractional part. However, the compiler will evaluate a floating point expression for you and assign it as an integer value. So we can get the fixed point value into temp using:

`int temp = 43.25*(1<<8);`

For the sake of a more interesting example, we can store the 5, 9 and 32 in variables in the same fixed point format.

```int nine = 9.00*(1<<8);
int five = 5.00*(1<<8);
int constant = 32.00*(1<<8);```

Now we can proceed to the fixed point calculations broken out one instruction at a time to make things clear. First we multiply by nine and scale the result to bring the binary point back to its original position:

`temp = temp*nine >>8;`

next we can divide by five, this time scaling temp to ensure we get a fractional part in the answer:

`temp = (temp<<8)/five;`

Finally we add the constant and, of course, no scaling is needed:

`temp = temp+constant;`

Of course, in practice, as the 9, the 5 and the 32 are constants, we wouldn’t normally have scaled them before use and so the correctly scaled fixed-point calculation would be written:

`temp = temp*9/5+(32<<8);`

As 9/5 is 1.8 exactly in decimal you might think that a better way to do the calculation is to simply multiply by 1.8 and cut out the costly integer division.

You can do this quite easily:

```int fact = 1.8*(1<<8);
int temp = 43.25*(1<<8);
int temp = (temp*fact >>8)+(32<<8);```

and it works, but it isn’t accurate as the previous method. This is puzzling as 9/5 is exactly 1.80, which can be represented to two decimal places using a scale factor of 8, but this doesn’t take into account that in binary 1.8 is not exactly representable. If you work it out you will find that 1.8 is:

`1.1100110011001100…`

To get the same precision as the multiply by 9 divide by 5 method, you need to go to a scale factor of 210. Notice that if you go to 216, i.e. Q16.16, with a signed int, then the calculation overflows.

Arithmetic can be tricky.

Last Updated ( Saturday, 16 May 2020 )