Information Theory
Written by Alex Armstrong   
Thursday, 27 April 2017
Article Index
Information Theory
Bits and compression
So you know what a bit is – or do you? How much information does a bit carry? What is this "information" stuff anyway? The answers are, unsurprisingly, all contained in the subject called Information Theory, which was invented by one man.


How much data can you store on a 1GByte drive?

The answer is obvious - 1GByte of course!

But what does this exactly mean? How much music, photos, video or books does this equate to? What does a Gigabyte mean in terms of the real world. To answer these questions in any meaningful way we need a theory of information and more importantly we need to define a way to measure information.

Until quite recently it wasn't even clear that something called information existed. For a long time we just muddled along and put things down in books, listened to music and didn't really think about how it all worked. It was only when we started to send messages between distant points did it really become apparent that information was a quantity that we might measure.

Today we know that the amount of information in something is measured in bits and this gives you the amount of storage you need to store it and the amount of communications bandwidth we need to transmit it.  

This much is easy to say but it raises a whole set of interesting follow on questions.

You can take any set of symbols – the word “the” or the letter “A” for example and specify the amount of information it contains in bits.

How does this work?

How can you say how many bits of information the letter “A” contains?

This is all a little mysterious, although we use the same ideas every day. For example, if you have ever compressed a file then you are making use of information theory. The most amazing thing about information theory is that it is a relatively recent discovery (1948), and it is almost entirely the work of one man, Claude Shannon, who more-or-less simply wrote it all down one day.



Claude Shannon 
(April 30, 1916 - February 24, 2001)

Surprise, surprise

Suppose I tell you that the sun rose this morning.


I hope not.

There is a sense in which a statement such as this conveys very little information. You almost don’t need me to tell you it, you already know it. In this sense, if you were already 100% certain that the sun rose this morning and you could even say that me telling you so, carried no information at all!

On the other hand, if I told you that the sun wasn’t going to rise tomorrow then you would presumably be shocked at this unlikely news – never mind about your horror or any other emotion that’s for the psychologists to sort out. We are only interested in the information content of the statement not its implications for the end of the world.

From this point of view it seems reasonable to relate the information in a message to the probability of receiving it. For example, if I am to tell you whether or not the sun rose each morning then on the mornings that I tell you it did you get little information compared to any morning I tell you it didn't. The relative likelihood of a message is related to the amount of information it contains.

This was the key idea that got Shannon started on information theory.

He reasoned like this:

If the probability of message A is p then we can assume that the information contained in the message is some function of p – call it I(p) say.

Now what properties does I(p) have to have?

Suppose I send you the message – now you have I(p)’s worth of information. If I now send you another message B with probability q you now have I(p)+I(q) of information.

But suppose you send me message A and B at the same time. The probability of getting both messages at the same time is p times q.

For example, the probability of getting a head is 0.5 the probability of getting two heads is 0.5 times 0.5 i.e. 0.25.

As it really doesn’t matter if I get message A and then message B or if they arrive together – the information they provide should be the same.

So we are forced to the conclusion that


If you know some simple maths you will recognise the basic property of logarithms. That is


What all this means is that if you accept that the information carried by a symbol or a message is a function of its probability then that function has to be a log or something like a log. The reason is that when things happen together their probabilities multiply but the information they carry has to add. 

What Shannon did next was to decide that the information in a message A which occurs with probability p should be


where the logarithm is taken to the base 2 instead of the more usual base 10 or base e.

You can have two reactions to this sort of argument –





(Skip this section if you are happy with all things logarithmic).

Part of the problem here is the use of a negative logarithm so a brief reminder of what logs are all about is sure to help.

The log of a number x to a base b is just the number that you have to raise b to get x.

That is if

by=x (or b^y=x)

then y is the log of x to the base b.

For logs to the base two we simply have to find the number that we raise 2 by to get the specified value.

So log2(4)=2 because 22=4 (or 2^2=4) and continuing in this way you can see that

   n    log2(n)
1 0 (20=1)
2 1 (21=2)
4 2 (22=4)
8 3 (23=8)
16 4 (24=16)

The only problem you might have is with 20=1 which is so just because it makes overall sense.

Clearly powers of 2 have nice simple logs to the base 2 and values in between powers of two have logs which aren’t quite so neat.

For example, log2(9) is approximately 3.17 because 23.17 is almost 9.

We need to know one more thing about logs and this is how to take the log of a fraction.

If you raise a number to a negative power then that’s the same as raising one over the number to the same power.

For example, 2-1=1/21=1/2 i.e. .5.

The reason this works is that powers add and you can deduce what a negative power must be by looking at expressions like:

2+1 x 2-1= 21-1=20=1

and you can see from this that 

2-1=1/21= 1/2

and more generally 


This also means that that logs of fractions are negative and, given probabilities are always less than one, log2(p) is negative – hence the reason for including the minus sign in the definition. Put simply the minus is that to make the log of probabilities positive. 

  n     log2n   -log2n 
1/2 -1 1
1/4 -2 2
1/8 -3 3
1/16 -4 4






Last Updated ( Thursday, 27 April 2017 )