Fundamental C - Dependent v Independent & Undefined Behavior
Written by Harry Fairhead
Monday, 03 December 2018
Lots has been written about undefined behavior in C, but not much about the reasons why it exists. This extract, from a forthcoming book on programming C in an IoT context, provides a very helpful explanation.
Fundamental C: Getting Closer To The Machine
One of the big problems with modern C is that there are two distinct types of programmer wanting very different things from it.
Applications programmers, and to an extent systems programmers, would like C to be platform-independent.
This basically means that they are free to write their programs without considering the architecture of the machine the programs will run on. To a lesser extent they can also ignore differences between operating systems, but in practice only between different flavors of Linux. If the target machine is running Windows then, no matter how machine-independent the language is, your code has to take the differences into account.
It is clear that there are different levels of platform independence, but a significant number of users are convinced that nothing in the C language should be machine-dependent. As a result, the C standards identify behavior that is likely to be machine-dependent and mark it as undefined behavior.
As the behavior is undefined it should never occur in a correct program.
The big problem is that undefined behavior has been taken to mean “any behavior at all”. This has given compiler writers a free hand to make anything at all happen as the result of a program that has undefined behavior and, as will be explained later, this means optimization can have very unexpected consequences. It is true to say that undefined behavior is one of the biggest problems C programmers have at the moment – at least in theory. In practice things are rarely so extreme, but you need to know about this as early as possible.
By contrast there is a second group of programmers who don’t want the language to be machine-independent. They want C to allow the natural machine behavior in each situation. The fact that this means that the language is defined in a different way depending on the platform the program is running on is seen as the price you have to pay to gain access to the machine’s natural behavior.
To illustrate, if I’m coding on a specific machine and signed overflow occurs then the result depends on the hardware used to perform arithmetic and what representation of negative numbers it uses. Just letting the machine do what is natural is perfectly acceptable and even desirable. It is certainly far preferable to the actual solution, which is to make signed integer overflow officially undefined behavior. This means that the compiler writer can legally change the code to make anything happen. As a result undefined behavior is also known as “nasal demons”, after a public suggestion that even demons flying out of your nose would be an acceptable interpretation of such a program.
A more reasonable interpretation is that any program that contains undefined behavior is an incorrect program. This would be fine if compilers flagged undefined behavior as errors rather than optimizing it away in ways that leave the programmer mystified.
Less controversially but equally troubling is the way the compiler writers can assume that in a correctly written program undefined behavior will never occur. For example, by having signed overflow as undefined behavior the compiler is free to move the value in question to a machine register that is perhaps bigger than the variable that has been assigned to hold the value. The result is that the program runs faster, and the optimization is deemed valid because signed overflow is undefined behavior and therefore cannot happen in a correct program.
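A minimal sketch of what this assumption means in practice. The function names here are hypothetical and chosen only for illustration. The naive test relies on signed overflow wrapping around, so a compiler is entitled to assume it never fires; the checked versions compare against the limits before doing any arithmetic, so no overflow ever occurs.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Naive test: only "works" if x + 1 wraps around at INT_MAX.
   Signed overflow is undefined, so a compiler may assume that
   x + 1 < x is never true and reduce this to "return false". */
bool will_overflow_naive(int x)
{
    return x + 1 < x;
}

/* Defined for every x: no arithmetic is performed at all. */
bool will_overflow_checked(int x)
{
    return x == INT_MAX;
}

/* Adds a and b only when the result fits in an int.
   Returns false, leaving *sum untouched, if it would overflow. */
bool checked_add(int a, int b, int *sum)
{
    assert(sum != NULL);
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *sum = a + b;
    return true;
}
```

Whether a particular compiler actually removes the naive test depends on the compiler and the optimization level, which is part of what makes this kind of bug hard to track down.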
Of course, low-level programmers often make use of signed overflow as part of a correct algorithm and when it doesn’t happen as expected, because the value is in a register that is big enough not to overflow, then things don’t go according to plan. It seems that the requirement of the low-level programmer to have the machine do exactly what the program tells it is overruled by the need of the compiler writers to optimize the output code.
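If an algorithm really does need wraparound, one way to get it without relying on undefined behavior is to do the arithmetic in an unsigned type, where wraparound is defined by the standard. The following is a sketch rather than a definitive recipe; the function names are hypothetical, and it assumes the platform provides the exact-width types `uint32_t` and `int32_t`.

```c
#include <stdint.h>
#include <string.h>

/* Unsigned arithmetic wraps modulo 2^32 by definition. */
uint32_t wrapping_add_u32(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Signed wraparound: do the sum in unsigned, then copy the bit
   pattern back. int32_t is required to be two's complement with
   no padding bits, so the copied bits have a well-defined value. */
int32_t wrapping_add_i32(int32_t a, int32_t b)
{
    uint32_t bits = (uint32_t)a + (uint32_t)b;
    int32_t result;
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

With this approach `wrapping_add_i32(INT32_MAX, 1)` gives `INT32_MIN` because the code asks for it explicitly, not because a register happened to be the right size.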
This battle has been going on for longer than undefined behavior.
For example, low-level programmers often use empty loops to “busy wait”, i.e. to use up some time. This is fine, but when an optimizing compiler scrutinizes the code it decides that the empty loop isn’t doing anything and removes it. The result isn’t an optimized program; it is a broken program.
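A common way to keep such a loop alive is to make the counter `volatile`, so that every read and write of it counts as an observable side effect that the compiler has to preserve. This is only a sketch: `busy_wait` is a hypothetical helper, and how long each iteration takes still depends entirely on the machine.

```c
/* Spins for the requested number of iterations.
   Because i is volatile, the compiler must perform every access
   to it and so cannot delete the otherwise empty loop.
   Returns the number of iterations actually performed. */
unsigned long busy_wait(unsigned long iterations)
{
    volatile unsigned long i;
    for (i = 0; i < iterations; i++) {
        /* deliberately empty */
    }
    return i;
}
```

A call such as `busy_wait(100000UL)` then survives optimization, although the number of iterations needed for a given delay has to be calibrated for each target.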
If all of this seems silly, then most programmers would agree. The problem is that the two groups don’t see the problem in the same way and each regards the other as trying to ruin a perfectly good language.
For the low-level programmer working close to the machine’s hardware, the idea that C should be machine-independent is not an obvious thought. It might not even be possible. For example, the C standard could mandate that signed arithmetic overflow was done in a particular way. This would make C machine-independent but it would put a big overhead on any machine that did signed arithmetic in a different way.
There are times when it seems that the low-level programmer is doing battle with the compiler writers and the language designers and there is a very real sense in which this is true.
There are many idioms used by low-level programmers that have been in use for a long time and yet are undefined behavior according to the latest standards. For example, low-level programmers often use pointers of different types to the same area of memory so as to get at the internals of a representation. This is often called "type punning", but how it works is machine-dependent and so it is undefined behavior even if it is a fundamental low-level idiom.
When there are alternative ways of implementing the same behavior then it is a good idea to make use of them. For example, you can avoid pointer-based type punning by using unions, which treat the same area of memory as multiple types. When there is no alternative, you have no choice but to carry on using the idiom no matter what the standards say.
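As an illustration of the union alternative, here is a sketch that extracts the bit pattern of a `float`. The function names are hypothetical, and the code assumes a 32-bit `float`, which the static assertion checks. A `memcpy` version is included for comparison, as copying bytes is another widely used way to avoid casting pointers between types.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t),
              "this sketch assumes a 32-bit float");

/* Union punning: write one member, read the other. */
uint32_t float_bits_union(float f)
{
    union {
        float    f;
        uint32_t u;
    } pun;
    pun.f = f;
    return pun.u;
}

/* Byte copy: moves the representation without any pointer cast. */
uint32_t float_bits_memcpy(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

On a machine that uses IEEE 754 single precision, both functions return `0x3F800000` for `1.0f`. The bit pattern itself is still machine-dependent, but the code no longer relies on accessing memory through a pointer of the wrong type.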
As C continues to evolve, this problem is likely to get worse not better. The tension between the two groups threatens to tear the language apart.
Last Updated (Wednesday, 13 March 2019)