This xkcd cartoon provides an ideal excuse to explain Kolmogorov complexity. It is an interesting topic and one that gets right to the heart of programming of how programming relates to ideas like information and entropy.

Kolmogorov Directions

More cartoon fun at xkcd a webcomic of romance,sarcasm, math, and language

Suppose I give you a string like 111111... which goes on for say one hundred ones in the same way. The length of the string is 100 characters but you can write a short program that generates it very easily -

repeat 100 print "1" end repeat

Now consider the string "231048322087232.." and so on for one hundred digits. This is supposed to be a random string and you would be hard pressed to create a program that could print it that was shorter than it is. In other words, there is no way to specify this random-looking string other than to quote it.

This observation of the difference between these two strings is what leads to the idea of Kolmogorov complexity. The first string can be generated by a program with roughly 30 characters, and so you could say it has 30 bytes of information, but the second string needs a program of at least the hundred characters to quote the number as a literal and so it has 100 bytes of information.

You can already see that this is a nice idea but it has problems. Clearly the number of bytes needed for a program that prints one hundred ones isn't a well-defined number - it depends on the programming language you use. However, in any programming language we can define the Kolomogorov complexity as the length of the smallest program that generates the string in question. So complexity is defined with respect to a given description language.

For infinite strings things are a little more interesting because, if you don't have a program that will generate the string, you essentially don't have the string in any practical sense. That is, without a program that generates the digits of an infinite sequence you can't actually define the string. This is also the connection between irrational numbers and non-computable numbers.

An irrational number is an infinite sequence of numbers. Some irrationals have programs that generate them and as such their Kolmogorov complexity is a lot less than infinity. However, as there are only a countable number of programs and there are an uncountable number of irrationals (see The Programmer's Guide To The Transfinite) there has to be a lot of irrational numbers that don't have programs that generate them and hence that have an infinite Kolmogorov complexity. Put simply, there aren't enough programs to compute all of the irrationals.

The second difficulty inherent in the measure of Kolmogorov complexity is that given a random-looking string you can't really be sure that there isn't a simple program that generates it. This situation is slightly worse than it seems because you can prove that the Kolmogorov complexity of a string is itself a non-computable function. That is, there is no program (or function) that can take a string and output its Kolmogorov complexity. The proof of this is essentially in the long tradition of such non-computable proofs, i.e. proof by contradiction. If you assume that you have a program that will work out the Kolmogorov complexity of any old string then you can use this to generate the string using a smaller program and hence you get a contradiction.

However, this isn't to say that all is lost. You can estimate the Kolmogorov complexity for any string fairly easily. If a string is of length L and you run it through a compression algorithm you get a representation which is L-C in length where C is the number of characters removed in the compression. You can see that this compression factor is related to the length of a program that generates the string, i.e. you can generate the string from a description that is only L-C characters plus the decompression program.

Any string which cannot be compressed by even one character is called incompressible. There have to be incompressible strings because, by another counting argument, there are 2^{n} strings of length n but only 2^{n}-1 strings of length less than n. That is, there aren't enough shorter strings to represent all of the strings of length n.

Again we can go further and prove that most strings aren't significantly compressible, i.e. the majority of strings have a high Kolomogorov complexity. The theory says that if you pick a string of length n at random then the probably that it is compressible by c is given by 1-2^{1-c}+2^{-n}.

Plotting the probability of compressing a string by c characters for a string length of 50 you can see at once that most strings are fairly incompressible once you reach a C of about 5.

Armed with all of these ideas you can see that a string is algorithmically random if there is no program shorter than it that can generate it - and most strings are algorithmically random. If complexity is related to randomness then it should also be connected to entropy. If you use a Markov source to generate strings then their complexity divided by their length converges in a reasonable sense to the entropy of the source.

It is also worth spending a moment to consider what all this means for the generation of random sequences. A pseudo random number generator is usually a small program, which means that that the numbers it produces might well be statistically random but they are clearly not algorithmically random as their Kolmogorov complexity is low.

So how could we create an algorithmic random number generator?

You can probably notice the paradox in attempting to build a random number generator that is algorithmic but larger than the sequence it produces. Perhaps this is the key to the difference between pseudo random and physically random numbers.

The annoying thing about Kolmogorov complexity is that it is easy to understand and yet in any given case you really can't measure it in any absolute sense. Even so, it seems to say a lot about the nature of the physical world and our attempts to describe it. You will notice that at the core of this understanding is the idea of a program that generates the sequence of behavior. It is tempting to speculate that many of the problems we have with, say, modern physics is due to there simply not being enough programs to go around or, what amounts to the same thing, too many random sequences.

So to return to the xkcd cartoon.

The minimum form of the directions to get to a location gives its Kolmogorov complexity and is the most compact representation of the information. In this case what is offered is a program that generates the directions, but as you can appreciate this is not the most useful form of "how to get there" from the point of view of a typical human. One more reason we should all learn to program....

You may not have heard of SMFs, Storage Mapping Functions, but you are likely to have used them. They tend to be overlooked because there are more exciting methods of implementing storage, such as has [ ... ]