Applying C - Floating Point |

Written by Harry Fairhead | |||||

Tuesday, 21 May 2024 | |||||

Page 4 of 4
## Floating Point ReconsideredThis is by no means all you need to know about floating point arithmetic. What really matters is that you don't take it for granted that you will get the right answer when you make use of it. It should now be obvious why: float a = 0.1f; float total; for(int i = 0;i<1000;i++){ total = total+a; } printf("%.7f\n", total); printf("%d\n",total == 100.0); prints 99.9990463 and 0, 0.1. isn't exactly representable as a binary fraction it is 0.0001100110011.. This is the reason for the usual advice of "don't test floating point numbers for equality". However, in this case it is more general in that the same problem arises with the corresponding fixed point value. That is, it is more to do with binary fractions than it is to do with floating point representation. If you think that such a small error could never make a difference, consider the error in the Patriot missile system. The system used an integer timing register which was incremented at intervals of 0.1 seconds. However, the integers were converted to decimal numbers by multiplying by the binary approximation of 0.1. After 100 hours, an error of approximately 0.3433 seconds was present in the conversion. As a result, an Iraqi Scud missile could not be accurately targeted and was allowed to detonate on a barracks, killing 28 people. The recommended way of testing for equality between floating point values is to use something like: if(fabsf((total - 100.0))/100.0 <=FLT_EPSILON))... FLT_EPSILON is a macro that gives you the accuracy of a float. There are other useful constants defined in float.h. The idea is that if two numbers differ by less than the accuracy of the representation then they can be considered equal. Of course in practice numbers computed in different ways accumulate errors that are larger than the representational error. In the case of the example above with a = 0.1, the two numbers are very much further apart than FLT_EPSILON due to the inability to represent 0.1 in binary. In practice is usual to include a factor that summarizes the errors in the computation something like: if(fabsf((total - 100.0))/100.0 <=K*FLT_EPSILON))... To get our example to test equal, K has to be 80 or more. However, a small change and K has to be bigger. Run the loop a thousand times and compare the result to 1000 and K has to be even bigger. The point is that there is no single way of setting a reasonable interval that works for a range of computations. You have to analyze the computation to find out what it is safe to regard as being equal. This leads us into the realm of numerical analysis. If you applying any formula then it is always worth checking what the best way to compute it is. It is rare that the form given in a textbook is the best way to compute a quantity. For example, the mean is traditionally computed using: float total; int n = 1000000; for (int i = 0; i < n; i++) { total = total + (float)i; } total = total/(float)n; printf("%f\n",total); This forms a total and then divides by the number of items. The problem with this is that the total gets very big and we lose precision by adding comparatively small values to it. If you try it, you will discover that instead of 500000.00 the result is 499940.375000. Using the alternative iterative method, which keeps the size of the running estimate down: total = 0; for (int i = 0; i < n; i++) { total = total + ((float)i-total)/(i+1); } printf("%f\n",total); gives a result of 499999.500000 which is only wrong by 0.5. There are even better methods of computing the mean - see Kahan Summation and Pairwise Summation. In many cases you can't avoid a detailed analysis of a calculation but it helps to have an idea of why things go wrong when you are using floating point. Imagine that you are working with three significant digits. For addition everything is fine as long as the exponents allow the digits to interact. For example consider: written out like this: 123 + 4670 4793 Normalizing this gives 4.79x10 123 + 4670000 4670123 and after normalizing the result we have 4.67x10 There are no similar problems with multiplication and division, apart from the accumulation of errors if operations are performed in succession. Finally, if possible always use double or larger floating point types. Whereas float has 7 decimal digits of precision, double has 15 digits and this provides useful latitude. ## Summary-
Floating point arithmetic is so easy to use that we simply expect an arithmetic expression to be worked out correctly – this isn’t true. -
Modern floating point hardware is almost as fast as integer operations and using double precision values doesn’t have an overhead, except for division. -
Floating point arithmetic can give very wrong answers if the two operands differ by a large amount. The necessary normalization can reduce non-zero quantities to zero and the loss of precision can make results close to random. -
A confusing factor is the use of extended precision during a calculation to minimize this loss of precision. This always gives a result that is as accurate, or more accurate, than if extended precision wasn’t used, but it can result in quantities that are supposed to be equal not testing as equal. -
Standard floating point has two special values, NaN, not a number, and inf, infinity. The rules that govern how these are used in an expression are reasonable, but not foolproof. You can get a very wrong answer without even knowing that a special value is involved. -
There are some standard ways of detecting special values and problems with floating point, but only in C99 and later. In practice, the results vary according to architecture. -
*You can cast integer to float and float to integer types. Everything works as you would expect, but casting to an integer type that is too small to hold the integer part of a float is undefined behavior.* -
Implementing floating point calculations is difficult an in many cases you need to find out how other people have tackled the problem. There are often optimized ways of computing the formulae you find in text books.
## Now available as a paperback or ebook from Amazon.
- C,IoT, POSIX & LINUX
- Kernel Mode, User Mode & Syscall
- Execution, Permissions & Systemd
Extract Running Programs With Systemd - Signals & Exceptions
Extract Signals - Integer Arithmetic
Extract: Basic Arithmetic As Bit Operations Extract: BCD Arithmetic ***NEW - Fixed Point
Extract: Simple Fixed Point Arithmetic - Floating Point
- File Descriptors
Extract: Simple File Descriptors Extract: Pipes - The Pseudo-File System
Extract: The Pseudo File System Extract: Memory Mapped Files - Graphics
Extract: framebuffer - Sockets
Extract: Sockets The Client Extract: Socket Server - Threading
Extract: Pthreads Extract: Condition Variables Extract: Deadline Scheduling - Cores Atomics & Memory Management
Extract: Applying C - Cores - Interupts & Polling
Extract: Interrupts & Polling - Assembler
Extract: Assembler
Also see the companion book: Fundamental C <ASIN:1871962609> <ASIN:1871962617> ## Related ArticlesRemote C/C++ Development With NetBeans Getting Started With C/C++ On The Micro:bit To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.
## Comments
or email your comment to: comments@i-programmer.info |
|||||

Last Updated ( Tuesday, 21 May 2024 ) |