Programmer's Guide To Theory - Information Theory
Written by Mike James   
Monday, 27 July 2020
Article Index
Programmer's Guide To Theory - Information Theory
So you know what a bit is – or do you? How much information does a bit carry? What is this "information" stuff anyway? The answers are, unsurprisingly, all contained in the subject called Information Theory, which, surprisingly, was invented by one man, Claude Shannon.

This is an extract from my recent book on theory:

A Programmers Guide To Theory

Now available as a paperback and ebook from Amazon.



  1. What Is Computer Science?
    Part I What Is Computable?
  2. What Is Computation?
  3. The Halting Problem
  4. Finite State Machines
    Extract 1: Finite State Machines
  5. Practical Grammar
  6. Numbers, Infinity and Computation
    Extract 1: Numbers 
    Extract 2: Aleph Zero The First Transfinite
    Extract 3: In Search Of Aleph-One
    Extract 4: Transcendental Numbers
  7. Kolmogorov Complexity and Randomness
    Extract 1:Kolmogorov Complexity 
  8. The Algorithm of Choice 
  9. Gödel’s Incompleteness Theorem
  10. Lambda Calculus ***NEW!
    Part II Bits, Codes and Logic
  11. Information Theory 
  12. Coding Theory – Splitting the Bit
  13. Error Correction 
  14. Boolean Logic
    Part III Computational Complexity
  15. How Hard Can It Be?
    Extract 1: Where Do The Big Os Come From
  16. Recursion
    Extract 1: What Is Recursion
    Extract 2: Why Recursion
  17. NP Versus P Algorithms
    Extract 1: NP & Co-NP
    Extract 2: NP Complete



There has been an undercurrent in a lot of the earlier discussion (in previous chapters of the book) about algorithms only being able to capture the regularity in infinite things. If something infinite doesn’t have enough regularity then it cannot be generated by a finite algorithm. There is something about information and complexity that we are just not making quite explicit. These ideas are part of a subject known as computational information, which has been discussed in Chapter 7, but there is another theory of information that is based on how often something happens. It is this theory of information that gives rise to the idea of the bit as the basic unit of information and leads to questions such as:

How much information does a bit carry?

What is this "information" stuff anyway?

The answers are all contained in the subject called, unsurprisingly, Information Theory, which was invented by one man a surprisingly short time ago.

Let us try an easier question. How much data can you store on a 1GByte drive? The answer is obvious - 1GByte of course! But what exactly does this mean? How much music, photos, video or books does this equate to? What does a Gigabyte mean in terms of the real world? To answer these questions in any meaningful way we need a theory of information and more importantly we need to define a way to measure information.

Until quite recently it wasn't even clear that something called information existed. For a long time we just muddled along and put things down in books, listened to music and didn't really think about how it all worked. It was only when we started to send messages between distant points did it really become apparent that information was a quantity that we might measure.

Today we know that the amount of information in something is measured in bits and this gives you the amount of storage you need to store it and the amount of communications bandwidth we need to transmit it. This much is easy to say, but it raises a whole set of interesting questions.

You can take any set of symbols – the word “the” or the letter “A”, for example - and specify the amount of information it contains in bits. How does this work? How can you say how many bits of information the letter “A” contains?

This is all a little mysterious, although we use the same ideas every day. For example, if you have ever compressed a file then you are making use of information theory. The most amazing thing about information theory is that it is a relatively recent discovery (1948), and it is almost entirely the work of one man, Claude Shannon, who more-or-less simply wrote it all down one day. 


Claude Shannon
1916 - 2001

Surprise, Sunrise!

Suppose I tell you that the sun rose this morning. Surprised? I hope not. There is a sense in which a statement such as this conveys very little information. You don’t need me to tell you it, you already know it. In this sense, if you were already 100% certain that the sun rose this morning you could even say that me telling you so carried no information at all!

On the other hand, if I told you that the sun wasn’t going to rise tomorrow then you would presumably be shocked at this unlikely news – never mind about your horror or any other emotion, that’s for psychologists to sort out. We are only interested in the information content of the statement, not its implications for the end of the world. From this point of view it seems reasonable to relate the information in a message to the probability of receiving it. For example, if I am to tell you whether or not the sun rose each morning then on the mornings that I tell you it did you get little information compared to any morning I tell you it didn't. The relative likelihood of a message is related to the amount of information it contains.

This was the key insight that got Shannon started on information theory. He reasoned like this:

If the probability of message A is p then we can assume that the information contained in the message is some function of p – call it I(p) say.

What properties does I(p) have to have? Suppose you are sent message A – now you have I(p)’s worth of information. If you are then sent another message, B with probability q, you have I(p)+I(q) of information.

But suppose A and B are sent at the same time. The probability of getting both messages at the same time is p times q. For example, when flipping a coin, the probability of getting a head is 0.5, whereas the probability of getting two heads is 0.5 times 0.5, i.e. 0.25. As it really doesn’t matter if message A is received first and then message B or if they arrive together – the information they provide should be the same. So we are forced to the conclusion that:


If you know some simple math you will recognize the basic property of logarithms. That is:


What all this means is that if you accept that the information carried by a symbol or a message is a function of its probability then that function has to be a log or something like a log. The reason is that when things happen together their probabilities multiply, but the information they carry has to add. 

What Shannon did next was to decide that the information in a message A which occurs with probability p should be:


where the logarithm is taken to the base 2 instead of the more usual base 10 or base e.

Last Updated ( Monday, 27 July 2020 )