Information Theory
Written by Mike James   
Thursday, 06 June 2019
Article Index
Information Theory
Logs & Bits
A 2-bit Tree
So you know what a bit is – or do you? How much information does a bit carry? What is this "information" stuff anyway? The answers are, unsurprisingly, all contained in the subject called Information Theory, which was invented by one man.

A Programmers Guide To Theory
First Draft

There is a more recent version of this:

Now available as a paperback and ebook from Amazon.

A Programmers Guide To Theory - NP & Co-NP



  1. What Is Computable?
  2. Finite State Machines
  3. What is a Turing Machine? 
  4. Computational Complexity
  5. Non-computable numbers
  6. The Transfinite
  7. Axiom Of Choice
  8. Lambda Calculus
  9. Grammar and Torture
  10. Reverse Polish Notation - RPN
  11. Introduction to Boolean Logic
  12. Confronting The Unprovable - Gödel And All That
  13. The Programmer's Guide to Fractals
  14. The Programmer's Guide to Chaos*
  15. Prime Numbers And Primality Testing
  16. Cellular Automata - The How and Why
  17. Information Theory
  18. Coding Theory
  19. Kolmogorov Complexity

*To be revised

How much data can you store on a 1GByte drive?

The answer is obvious - 1GByte of course!

But what does this exactly mean?

How much music, photos, video or books does this equate to?

What does a Gigabyte mean in terms of the real world?

To answer these questions in any meaningful way we need a theory of information and more importantly we need to define a way to measure information.

Until quite recently it wasn't even clear that something called information existed. For a long time we just muddled along and put things down in books, listened to music and didn't really think about how it all worked. It was only when we started to send messages between distant points did it really become apparent that information was a quantity that we might measure.

Today we know that the amount of information in something is measured in bits and this gives you the amount of storage you need to store it and the amount of communications bandwidth we need to transmit it.  

This much is easy to say but it raises a whole set of interesting follow on questions.

You can take any set of symbols – the word “the” or the letter “A” for example and specify the amount of information it contains in bits.

How does this work?

How can you say how many bits of information the letter “A” contains?

This is all a little mysterious, although we use the same ideas every day. For example, if you have ever compressed a file then you are making use of information theory. The most amazing thing about information theory is that it is a relatively recent discovery (1948), and it is almost entirely the work of one man, Claude Shannon, who more-or-less simply wrote it all down one day.



Claude Shannon 
(April 30, 1916 - February 24, 2001)

Surprise, surprise

Suppose I tell you that the sun rose this morning.


I hope not.

There is a sense in which a statement such as this conveys very little information. You almost don’t need me to tell you it, you already know it. In this sense, if you were already 100% certain that the sun rose this morning and you could even say that me telling you so, carried no information at all!

On the other hand, if I told you that the sun wasn’t going to rise tomorrow then you would presumably be shocked at this unlikely news – never mind about your horror or any other emotion that’s for the psychologists to sort out. We are only interested in the information content of the statement not its implications for the end of the world.

From this point of view it seems reasonable to relate the information in a message to the probability of receiving it. For example, if I am to tell you whether or not the sun rose each morning then on the mornings that I tell you it did you get little information compared to any morning I tell you it didn't. The relative likelihood of a message is related to the amount of information it contains.

This was the key idea that got Shannon started on information theory.

He reasoned like this:

If the probability of message A is p then we can assume that the information contained in the message is some function of p – call it I(p) say.

Now what properties does I(p) have to have?

Suppose I send you the message – now you have I(p)’s worth of information. If I now send you another message B with probability q you now have I(p)+I(q) of information.

But suppose you send me message A and B at the same time. The probability of getting both messages at the same time is p times q.

For example, the probability of getting a head is 0.5 the probability of getting two heads is 0.5 times 0.5 i.e. 0.25.

As it really doesn’t matter if I get message A and then message B or if they arrive together – the information they provide should be the same.

So we are forced to the conclusion that


If you know some simple maths you will recognise the basic property of logarithms. That is


What all this means is that if you accept that the information carried by a symbol or a message is a function of its probability then that function has to be a log or something like a log. The reason is that when things happen together their probabilities multiply but the information they carry has to add. 

What Shannon did next was to decide that the information in a message A which occurs with probability p should be


where the logarithm is taken to the base 2 instead of the more usual base 10 or base e.

You can have two reactions to this sort of argument –








Last Updated ( Thursday, 06 June 2019 )