Applying C - Floating Point
Written by Harry Fairhead   
Tuesday, 21 May 2024

Floating Point Reconsidered

This is by no means all you need to know about floating point arithmetic. What really matters is that you don't take it for granted that you will get the right answer when you make use of it. It should now be obvious why:

float a = 0.1f;
float total = 0.0f;
for(int i = 0; i < 1000; i++){
    total = total + a;
}
printf("%.7f\n", total);
printf("%d\n", total == 100.0f);

prints 99.9990463 and 0. The value 0.1 isn't exactly representable as a binary fraction - it is the recurring binary fraction 0.000110011001100... This is the reason for the usual advice "don't test floating point numbers for equality". However, in this case the problem is more general in that the same error arises with the corresponding fixed point value. That is, it is more to do with binary fractions than it is to do with floating point representation.

If you think that such a small error could never make a difference, consider the error in the Patriot missile system. The system used an integer timing register which was incremented at intervals of 0.1 seconds. However, the integers were converted to decimal numbers by multiplying by the binary approximation of 0.1. After 100 hours, an error of approximately 0.3433 seconds was present in the conversion. As a result, an Iraqi Scud missile could not be accurately targeted and was allowed to detonate on a barracks, killing 28 people.

The recommended way of testing for equality between floating point values is to use something like:

if(fabsf(total - 100.0f)/100.0f <= FLT_EPSILON)...

FLT_EPSILON is a macro that gives you the accuracy of a float. There are other useful constants defined in float.h. The idea is that if two numbers differ by less than the accuracy of the representation then they can be considered equal. Of course, in practice numbers computed in different ways accumulate errors that are larger than the representational error. In the case of the example above with a = 0.1, the two numbers are very much further apart than FLT_EPSILON due to the inability to represent 0.1 in binary. In practice it is usual to include a factor, K, that summarizes the errors in the computation, something like:

if(fabsf(total - 100.0f)/100.0f <= K*FLT_EPSILON)...

To get our example to test equal, K has to be 80 or more. However, a small change to the computation and K has to be bigger. Run the loop ten thousand times and compare the result to 1000 and K has to be even bigger.

The point is that there is no single way of setting a reasonable interval that works for a range of computations. You have to analyze the computation to find out what it is safe to regard as being equal. This leads us into the realm of numerical analysis.

If you are applying any formula, it is always worth checking the best way to compute it. It is rare that the form given in a textbook is the best way to compute a quantity. For example, the mean is traditionally computed using:

float total = 0.0f;
int n = 1000000;
for (int i = 0; i < n; i++) {
    total = total + (float)i;
}
total = total/(float)n;

This forms a total and then divides by the number of items. The problem with this is that the total gets very big and we lose precision by adding comparatively small values to it. If you try it, you will discover that instead of the exact answer, 499999.50, the result is 499940.375000.

Using the alternative iterative method, which keeps the size of the running estimate down:

total = 0;
for (int i = 0; i < n; i++) {
    total = total + ((float)i - total)/(i + 1);
}

gives a result of 499999.500000, which is the exact answer.

There are even better methods of computing the mean - see Kahan Summation and Pairwise Summation.

In many cases you can't avoid a detailed analysis of a calculation but it helps to have an idea of why things go wrong when you are using floating point. Imagine that you are working with three significant digits. For addition everything is fine as long as the exponents allow the digits to interact. For example consider:
1.23 × 10² + 4.67 × 10³

written out like this:

 123 +
4670
----
4793

Normalizing this gives 4.79 × 10³ and you can see that, ignoring rounding etc., only two digits of each value "overlapped" in the sum. If the exponents differ by 4 then none of the digits of the smaller value are involved in the sum. For example, 1.23 × 10² + 4.67 × 10⁶ =

    123 +
4670000
-------
4670123

and after normalizing the result to three significant digits we have 4.67 × 10⁶. Clearly, for addition and subtraction, if you are working with floating point numbers with a precision of d digits then accuracy goes down as the difference between the exponents approaches d. This is the sense in which you need to be careful about floating point arithmetic involving large and small numbers.

There are no similar problems with multiplication and division, apart from the accumulation of errors if operations are performed in succession.

Finally, if possible always use double or larger floating point types. Whereas float has about 7 decimal digits of precision, double has about 15, and this provides useful latitude.


  • Floating point arithmetic is so easy to use that we simply expect an arithmetic expression to be worked out correctly – this isn’t true.

  • Modern floating point hardware is almost as fast as integer operations and using double precision values doesn’t have an overhead, except for division.

  • Floating point arithmetic can give very wrong answers if the two operands differ by a large amount. The necessary normalization can reduce non-zero quantities to zero and the loss of precision can make results close to random.

  • A confusing factor is the use of extended precision during a calculation to minimize this loss of precision. This always gives a result that is as accurate, or more accurate, than if extended precision wasn’t used, but it can result in quantities that are supposed to be equal not testing as equal.

  • Standard floating point has two special values, NaN, not a number, and inf, infinity. The rules that govern how these are used in an expression are reasonable, but not foolproof. You can get a very wrong answer without even knowing that a special value is involved.

  • There are some standard ways of detecting special values and problems with floating point, but only in C99 and later. In practice, the results vary according to architecture.

  • You can cast integer to float and float to integer types. Everything works as you would expect, but casting to an integer type that is too small to hold the integer part of a float is undefined behavior.

  • Implementing floating point calculations is difficult and in many cases you need to find out how other people have tackled the problem. There are often optimized ways of computing the formulae you find in textbooks.

Now available as a paperback or ebook from Amazon.

Applying C For The IoT With Linux



