Applying C - Floating Point

Written by Harry Fairhead

Tuesday, 21 May 2024

Floating point arithmetic solves all of your problems - except when it doesn't. It really is simple once it is explained. This extract is from my book on using C in an IoT context.
Also see the companion book: Fundamental C.

Fixed point may be simple, but it is very limited in the range of numbers it can represent. If a calculation involves big changes of scale, i.e. results very much smaller or bigger than the initial set of numbers, then it fails due to underflow or overflow unless you are paying a great deal of attention to the fine detail. A better scheme, if you want trouble-free calculations, is to use a floating point representation.

## Pros and Cons of Floating Point

Floating point is easy to use. You feed it the numeric values and the expression and simply expect it to get the right answer. You can usually forget about overflow and other problems and just rely on the FPU to get on with it. This is one of the many things that programmers believe about floating point, and it is mostly wrong. Floating point is flexible and easy to use, but unless you know what you are doing you can get almost random values back from a calculation.

Not so long ago floating point was the exception rather than the rule for small machines. Even processors that had floating point hardware were often rendered unusable by a lack of software support. For example, Raspbian, the Linux distribution for the Raspberry Pi, took some years to make floating point available. In particular, floating point on ARM processors was a mess of confusingly different types of hardware. Today there are still some processors that don't implement floating point hardware to save cost and power, including the Arduino Uno and most of the PIC range of processors.

Another big change is that today's FPUs are fast. Only a few years ago, floating point hardware incurred a significant overhead in communicating with the CPU, making floating point much slower than integer arithmetic, and the default practice was to use fixed point wherever speed was important. Today, FPUs are much better integrated with the CPU and they are optimized.
In most cases you can expect a speed penalty of only 20 to 30%. This means that if it takes as few as two integer operations to implement an alternative to floating point, the alternative runs slower. If you have a modern FPU, use it.

In this chapter we look at some of the aspects of floating point that are important to the general programmer. The whole subject is very big and leads into issues of numerical analysis and exactly how to do a computation. Here the aim is to make you aware of some of the subtle problems that you can encounter with floating point - it's stranger than you might imagine. In particular, unless your calculation involves moderate values and only a few decimal places of accuracy matter, you can't simply supply an arithmetic expression and expect it to give you the right answer. Floating point arithmetic can go very wrong unless you understand it - and even then it can still go wrong. Computers working with numbers is a complete field of study in its own right - numerical analysis - and there is no way that a single chapter can even touch on the subject. What this chapter is about is the way floating point works and some of the problems that arise in simple computations.

## The Floating Idea

Floating point allows the precision and magnitude of the representation to change as the computation proceeds. You can do the same thing with fixed point by varying the position of the binary point to accommodate the result, and this can be considered a primitive form of floating point. Of course, as you move the fixed point you lose precision to gain an increase in magnitude. So it is with floating point, but there are generally many more bits allocated to the problem. In floating point the binary point is allowed to move during the calculation, i.e. the binary point "floats", but extra bits have to be allocated to keep track of where it is.
The advantage of this approach is clear if you consider multiplying a value such as 123.4 by 1000. If the hardware (decimal in this case) can only hold four digits then the result is an overflow error. That is, 123.4 * 1000 = 123400 truncates to 3400, which is clearly not the right answer. If the hardware uses the floating point approach, it can simply record the shift of the decimal point three places to the right. You can think of this as a way of allowing a much larger range of numbers to be represented, but with a fixed number of digits of precision. Notice that the number of digits of precision remains the same, but the percentage accuracy changes.

A floating point number is represented by two parts - an exponent and a fractional part. The fractional part is just a fixed-point representation of the number - usually with the fractional point to the immediate left of the first bit, making its value less than 1. The exponent is a scale factor which determines the true magnitude of the number. In decimal we are used to this scheme as scientific notation; binary floating point is just the binary equivalent of decimal standard form. The exponent is the power of two by which you have to multiply the fraction to get the true magnitude.

At this point you might want to write floating point off as trivial, but there are some subtleties. For example, when the fractional part is zero, what should the exponent be set to? Clearly there is more than one representation for zero. By convention, the exponent is made as negative as it can be, i.e. as small as possible, in the representation of zero. If two's-complement were used, this would result in a zero that didn't have all its bits set to zero, and this is to be avoided for obvious reasons. To achieve this a small change is needed: a biased, rather than two's-complement, exponent is used, i.e. the stored exponent is the true exponent plus the magnitude of the most negative value.
For example, if the exponent is six bits in size, the two's-complement range is -32 to +31. If, instead of two's-complement, a simple biased representation is used, then we have to subtract 32 from the stored exponent to get the signed value. In this case a stored exponent of 0 represents -32, 32 represents 0, and 63 represents 31. The same range is covered, but now the representation of zero has all its bits set to 0 and it corresponds to 0 x 2^-32.
