This may seem like a very complicated way of thinking about a very simple code but it now opens up the possibility of constructing more advanced codes.

For example, the key to the parity checking code is that every valid code is surrounded by invalid codes at one unit’s distance and this is why one-bit errors are detected.

To detect two-bit errors all we need to do is increase the radius of the sphere of invalid codes from one unit to two units.

In the three-bit code case this leaves us with only two valid codes – 000 and 111 – anything else is an invalid code that was generated by changing one or two bits from the original code words. Notice that 000 and 111 are three units away from each other. Now we have a code that can detect a one- or two-bit error with certainty but, of course, it can’t detect a three-bit error!

A two-bit error detecting code

Problems start to happen when you increase the number of bits in the code.

If the code has n bits then it labels the 2n corners of an n-dimensional hypercube. All we have to do to make a code that detects m-bit errors is to surround each valid code word with invalid codes out to a distance of m units.

Although it might seem like strange terminology, you can refer to the surrounding group of invalid codes as a “sphere of radius m” - after all they all the same distance away from the valid center code.

It is also clear that error detection carries an increasing overhead. A three-bit code that detects one-bit errors has four valid codes and four invalid codes. A three-bit code that detects two-bit errors has only two valid codes and six invalid ones.

Error detecting codes don’t carry information efficiently – you have to use more bits than strictly necessary for the information content.

This is called “adding redundancy” and follows the general principle of information theory that in the presence of noise adding redundancy increases reliability.

Error correction

Now we make the small but amazing step from error detection to error correction.

I’d better tell you now that we already have constructed an error correction code!

Consider the three-bit code that can detect up to two-bit errors. In this code the only valid words are 000 and 111. Now consider what it means if you retrieve 001. This is clearly an illegal code and so an error has occurred, but if you assume that only a one-bit error has occurred then it is also obvious that the true code word was 000.

Why?

Because the only valid code word within 1 unit’s distance of 001 is 000 – to get to 111 you have to change two bits. In other words you pick the valid code closest to the invalid code you have received.

The yellow incorrect code word is closest to 000

So what this means is that a code that has invalid words surrounding it, forming a sphere of radius two, can detect errors in up to two bits and can correct one-bit errors.

The general principle isn’t difficult to see.

Create a code in which each valid code word is surrounded by a sphere of radius m and you can detect up to m-bit errors and correct m-1 bit errors.

The correct algorithm is simply to assign any incorrect code to the closest correct code.

Easy really!

Real codes

Of course this isn’t how error detecting/correcting codes are implemented in practice! It is the theory but making it work turns out to be much more difficult.

The additional problems are many. For example to have a high probability of dealing with errors you have to use a large number of bits. You can make any unreliable storage or transmission medium as reliable as you like by throwing enough bits at it.

This means that you might have to use a huge table of valid and invalid code words, which is just not practical. In addition whatever code you use it generally has to be quick to implement.

For example, if an error occurs on a CD then the electronics corrects it in real time so that the data flow is uninterrupted. All this means that real codes have to be simple and regular.

There is also the small problem of burst errors. Until now we have been assuming that the probability of a bit being affected by noise is the same no matter where it is in the data word. In practice noise tends to come in bursts that affect a group of bits.

What we really want our codes to do is protect against burst errors of m bits in a row rather than m bits anywhere in the word. Such burst error correcting codes are more efficient but how to create them is a difficult problem.

For example, if you want to use a code that can detect a burst error of b bits then you can make use of a Cyclic Redundancy Checksum (CRC) of b additional bits. The CRC is computed in a simple way as a function of the data bits, and checking the resulting data word is equally simple, but the theory of how you arrive at a particular CRC computation is very advanced and to understand it would take us into mathematical areas such as Galois theory, linear spaces and so on.

The same sort of theory can be extended to create error-correcting codes based on CRC computations.

If you want to learn more about error detection and correction codes then you have to find out first about modern mathematics.

If you want a comprehensive and mathematical approach then Error-Correcting Codes by W. Wesley Peterson is recommended.

For a more general and short but still mathematical introduction try: A First Course in Coding Theory by Raymond Hill

Monte Carlo methods are powerful ways of getting answers using random numbers to problems that really don't seem to have anything much to do with randomness. For example, you can find Pi and multiply [ ... ]