Page 1 of 2
So you know what a bit is – or do you? How much information does a bit carry? What is this "information" stuff anyway? The answers are, unsurprisingly, all contained in the subject called Information Theory, which was invented by one man.
The theory of information is based on our ability to measure it and the basic unit of information is the bit.
This much is easy to say but it raises a whole set of interesting follow on questions.
You can take any set of symbols – the word “the” or the letter “A” for example and specify the amount of information it contains in bits.
How does this work?
How can you say how many bits of information the letter “A” contains?
This is all a little mysterious, although we use the same ideas every day. For example, if you have ever compressed a file then you are making use of information theory. The most amazing thing about information theory is that it is a relatively recent discovery (1948), and it is almost entirely the work of one man, Claude Shannon, who more-or-less simply wrote it all down one day.
Claude Shannon (1916-2001)
Suppose I tell you that the sun rose this morning.
I hope not. There is a sense in which a statement such as this conveys very little information. You almost don’t need me to tell you it, you already know it. In this sense, if you were already 100% certain that the sun rose this morning and you could even say that me telling you so, carried no information at all!
On the other hand, if I told you that the sun wasn’t going to rise tomorrow then you would presumably be shocked at this unlikely news – never mind about your horror or any other emotion that’s for the psychologists to sort out. We are only interested in the information content of the statement not its implications for the end of the world.
From this point of view it seems reasonable to relate the information in a message to the probability of receiving it. For example, if I am to tell you whether or not the sun rose each morning then on the mornings that I tell you it did you get little information compared to any morning I tell you it didn't. The relative likelihood of a message is related to the amount of information it contains.
This was the key idea that got Shannon started on information theory.
He reasoned like this:
If the probability of message A is p then we can assume that the information contained in the message is some function of p – call it I(p) say.
Now what properties does I(p) have to have?
Suppose I send you the message – now you have I(p)’s worth of information. If I now send you another message B with probability q you now have I(p)+I(q) of information.
But suppose you send me message A and B at the same time. The probability of getting both messages at the same time is p times q.
For example, the probability of getting a head is 0.5 the probability of getting two heads is 0.5 times 0.5 i.e. 0.25.
As it really doesn’t matter if I get message A and then message B or if they arrive together – the information they provide should be the same.
So we are forced to the conclusion that
If you know some simple maths you will recognise the basic property of logarithms. That is
What all this means is that if you accept that the information carried by a symbol or a message is a function of its probability then that function has to be a log or something like a log.
What Shannon did next was to decide that the information in a message A which occurs with probability p should be
where the logarithm is taken to the base 2 instead of the more usual base 10 or base e.
You can have two reactions to this sort of argument – “obvious!” or “what?”.
(Skip this section if you are happy with all things logarithmic).
Part of the problem here is the use of a negative logarithm so a brief reminder of what logs are all about is sure to help.
The log of a number x to a base b is just the number that you have to raise b to get x.
That is if
by=x (or b^y=x)
then y is the log of x to the base b.
For logs to the base two we simply have to find the number that we raise 2 by to get the specified value.
So log2(4)=2 because 22=4 (or 2^2=4) and continuing in this way you can see that
1 0 (20=1)
2 1 (21=2)
4 2 (22=4)
8 3 (23=8)
16 4 (24=16)
The only problem you might have is with 20=1 which is so just because it makes overall sense.
Clearly powers of 2 have nice simple logs to the base 2 and values in between powers of two have logs which aren’t quite so neat.
For example, log2(9) is approximately 3.17 because 23.17 is 9.
We need to know one more thing about logs and this is how to take the log of a fraction.
If you raise a number to a negative power then that’s the same as raising one over the number to the same power.
For example, 2-1=1/21=1/2 i.e. .5.
You should be able to see that logs of fractions are negative and, given probabilities are always less than one, –log2(p) is positive – hence the reason for including the minus sign in the definition.
n log2(n) -log2(n)
1/2 -1 1
1/4 -2 2
1/8 -3 3
1/16 -4 4