|
Page 1 of 2
The theory of information is based on our ability to measure it and the basic unit of information is the bit.
You can take any set of symbols – the word “the” or the letter “A” for example and specify the amount of information it contains in bits.
How does this work?
How can you say how many bits of information the letter “A” contains?
This is all a little mysterious although we use the same ideas every day. For example, if you have ever compressed a file then you are making use of information theory. The most amazing thing about information theory is that it is a relatively recent discovery, 1948, and it is almost entirely the work of one man – Claude Shannon - who more-or-less simply wrote it all down one day.

Claude Shannon (1916-2001)
Surprise, surprise
Suppose I tell you that the sun rose this morning.
Surprised?
I hope not. There is a sense in which a statement such as this conveys very little information. You almost don’t need me to tell you it to know it and in this sense, if you were already 100% certain that the sun rose this morning, you could even say that it carried no information at all!
On the other hand if I told you that the sun wasn’t going to rise tomorrow then you would presumably be shocked at this unlikely news – never mind about your horror or any other emotion that’s for the psychologists to sort out. We are only interested in information.
From this point of view it seems reasonable to relate the information in a message to its probability of occurring. This was the key idea that got Shannon started on information theory.
He reasoned like this:
If the probability of receiving a message A is p then we can assume that the information contained in the message is some function of p – call it I(p) say.
Now what properties does I(p) have to have?
Suppose I send you the message – now you have I(p)’s worth of information. If I now send you another message B with probability q you now have I(p)+I(q) of information.
But suppose you send me message A and B at the same time. The probability of getting both messages is p times q. For example, the probability of getting a head is 0.5 the probability of getting two heads is 0.5 times 0.5 i.e. 0.25. It really doesn’t matter if I get message A and then message B or if they arrive together – the information they provide should be the same. So we are forced to the conclusion that
I(pq)=I(p)+I(q)
If you know some simple maths you will recognise the basic property of logarithms. That is
log(xy)=log(x)+log(y)
What all this means is that if you accept that the information carried by a symbol or a message is a function of its probability then that function has to be a log or something like a log.
What Shannon did next was to decide that the information in a message A which occurs with probability p should be
I(A)=-log2(p)
where the logarithm is taken to the base 2 instead of the more usual base 10 or base e.
You can have two reactions to this sort of argument – “obvious!” or “what?”.
Logs
(Skip this section if you are happy with all things logarithmic).
Part of the problem here is the use of a negative logarithm so a brief reminder of what logs are all about is sure to help.
The log of a number x to a base b is just the number that you have to raise b to get x.
That is if by=x (or b^y=x) then y is the log of x to the base b.
For logs to the base two we simply have to find the number that we raise 2 by to get the specified value. So log2(4)=2 because 22=4 (or 2^2=4) and continuing in this way you can see that
n log2(n) 1 0 (20=1) 2 1 (21=2) 4 2 (22=4) 8 3 (23=8) 16 4 (24=16)
The only problem you might have is with 20=1 which is so just because it makes overall sense.
Clearly powers of 2 have nice simple logs to the base 2 and values in between powers of two have logs which aren’t quite so neat.
For example, log2(9) is approximately 3.17 because 23.17 is 9.
We need to know one more thing about logs and this is how to take the log of a fraction. If you raise a number to a negative power then that’s the same as raising one over the number to the same power. For example, 2-1=1/21=1/2 i.e. .5.
You should be able to see that logs of fractions are negative and, given probabilities are always less than one, –log2(p) is positive – hence the reason for including the minus sign in the definition.
n log(n) 1/2 -1 1/4 -2 1/8 -3 1/16 -4
<ASIN:0143038397>
<ASIN:0471399744>
<ASIN:0307275175>
<ASIN:0309101921>
|