Page 1 of 2 Information theory – perhaps one of the most remarkable inventions of the twentieth century  naturally leads on to the consideration of how information can be coded and hence coding theory.
In Information Theory we learned that if a message or symbol occurs with probability p then the information contained in that symbol or message is log_{2}(p) bits.
For example, consider the amount of information contained in a single letter from an alphabet. Assuming there are 26 letters in an alphabet and assuming they are all used equally often the probability of receiving any one particular letter is 1/26 which gives log_{2}(1/26)=4.7 bits per letter.
It is also obvious from other elementary considerations that five bits are more than enough to represent 26 symbols. Quite simply five bits allow you to count up to more than 26 and so you can assign one letter of the alphabet to each number. In fact, five bits is enough to represent 32 symbols because you can count up to 31 i.e. 11111=31.
Using all five bits you can associate say A with 0 and Z with 25 and the values from 26 to 31 are just wasted. It seems a shame to have wasted bits just because we can’t quite find a way to use 4.7 bits – or can we.
Can we split the bit?
Average information
Before we go on to consider “splitting the bit” we need to look more carefully at the calculation of the number of bits of information in one symbol from an alphabet of 26.
Clearly the general case is that a symbol from an alphabet of n symbols has information equal to –log_{2}(n) bits assuming that every symbol has equal probability.
Of course what is wrong with this analysis is that the letters of the alphabet are not all equally likely to occur. For example, if you don’t receive a letter “e” you would be surprised – count the number in this sentence. However your surprise at seeing the letter “z” should be higher – again count the number in some other sentence than this one!
Empirical studies can provide us with the actual rates that letters occur – just count them in a book for example. If you add the “Space” character to the set of symbols you will discover that is by far the most probable character with a probability of 0.185 of occurring, followed by “e” at 0.100, “t” at 0.080 and so on.. down to “z” which has a probability of only 0.0005. It makes you wonder why we bother with “z” at all!
But, given that “z” is so unlikely its information content is very high –log_{2}(0.0005)=10.96 bits. In other words “z” contains nearly 11bits of information compared to just over 3 bits for “e”.
If you find this argument difficult to follow just imagine that you are placing bets on the next symbol to appear. You expect the next symbol to be a Space character so you are very surprised when it turns out to be a z. Your betting instincts tell you what the expected behaviour of the world is and when something happens that isn't expected you gains some information.
Taken to the extreme it seems almost obvious. The sun rises every morning and so the news that it has risen this morning is really no news  no information. Now consider the opposite event, that is some information.
We can define an average information that we expect to get from a single character taken from the alphabet.
If the i^{th} symbol occurs with probability p_{i} then the information it contains is –log_{2}(p_{i}) bits and the average information contained in one symbol, averaged over all symbols is:
Average information= –p_{1}log_{2}(p_{1}) – p_{2}log_{2}(p_{2}) –p_{3}log_{2}(p_{3}) … –p_{n}log_{2}(p_{n})
You can see that this is reasonable because each symbol occurs with probability p_{i }and each time it occurs it provides –log_{2}(pi) bits of information. Notice that while “z” carries 11 bits of information the fact that it doesn’t occur very often means that its contribution to the average information is only about 0.0005x11 or 0.0055 bits per symbol.
Applying this formula to the letters of the alphabet and their probabilities gives an average information of 4.08 bits per symbol which should be compared to 4.76 bits per symbol for an alphabet of 27 equally likely characters. Notice that the average information in 27 equally likely characters is also 4.76 bits – to see why try working it out.
There is also a nice theorem which says that the average information per symbol is largest if all of the symbols are equally likely. This means that if any of the symbols of an alphabet are used more or less often than others the average information per symbol is lower and this observation is the key to data compression techniques.
Make it equal
If you have a set of symbols then the most efficient way of using them is to make sure that the average amount of information per symbol is maximised.
From our discussion so far it should now be clear that the way to do this is to ensure that each symbol is used equally often – but how?
The answer is to use a code that forces the symbols to occur equally often even if they don’t want to!
Consider the standard alphabet and divide it into two groups of letters so that the probability of a letter belonging to either group is the same i.e. 0.5. To do this we have to include a mix of likely and not so likely letters to get the right probability. Now with this division of the alphabet we can begin to construct our code by using a zero to indicate that the letter is in the first group and a one to indicate that it is in the second group.
Clearly the first bit of our code has an equal chance of being a one or a zero and we can continue on in this way by subdividing each of the two groups into equally likely subsets and assigning 0 to one of them and a 1 to the other
Now when we receive a set of data bits we use them one after another to find which group the letter is in much like a well known “yes/no” question and answer game.
Each group is equally likely and hence each symbol used in the code is also equally likely (only 8 letters shown in the diagram).
This code was invented by Shannon and Fanno and named after them. If you can construct such a code it is optimal in the sense that each symbol in the code carries the maximum amount of information.
The problem is that you can’t always divide the set into two equally probable groups – not even approximately!
There is a better code.
<ASIN:0521852293>
<ASIN:0252725484>
<ASIN:0198538030>
