Programmer's Guide To Theory - Information Theory
Written by Mike James   
Monday, 27 July 2020
Article Index
Programmer's Guide To Theory - Information Theory

A Two-Bit Tree

Now consider the task of representing the outcome of the coin-tossing in a program. Most programmers would say let the first coin be represented by a single bit, 0 for tails 1 for head. Then the two-coin situation can be represented by two bits 00, 01, 10, and 11. The three-coin situation can be represented by three bits 000, 001, 010, and so on. You can see that this looks a lot like the way information increases with number of coins. If you have n coins you need n bits to represent a final outcome and you have n units of information when you discover the final outcome.

The connection should by now be fairly obvious.

The unit of information is the bit and if a symbol occurs with probability p then it carries –log2(p) bits of information and needs a minimum of –log2(p) bits of information to represent or code it. You can of course use more bits to code it if you want to, but –log2(p) is the very smallest number you can use without throwing away information.

Notice that this conclusion is very deep. There is a connection between the probability of a symbol occurring and the amount of information it contains and this determines the smallest number of bits needed to represent it. We go from probabilities to bits - which is not an obvious connection before you make it.

The Alphabet Game

At this point you should be thinking that this is all very reasonable, but how does it work in more complicated situations? It is time for a very small simple example to show that it all does make good sense. Consider a standard alphabet of 26 letters. Let’s for the moment assume that all of the letters are equally likely. Then the probability of getting any letter is 1/26 and the information contained in any letter is –log2(1/26) = 4.7 bits.

What this means is that you need 5 bits to encode this standard alphabet and a single letter carries just less than 5 bits of information.

Notice that in theory we can find a code that only uses 4.7 bits per letter on average and this is the sort of question that information theorists spend their lives working out. For most of us a storage scheme that uses 5 bits is good enough.

As a sanity check, you can also see that five bits really are enough to code 26 letters or other symbols. If you run though the list of possibilities – 00000, 00001, 00010, and so on you will find that there are 32 possible five-bit codes and so, as predicted, we have more than enough bits for the job. The guessing game analogy still works. If I think of a letter at random then how many questions do you have to ask to discover which letter it is? The simple minded answer is 26 at most - “is it an A?”, “is it a B?” and so on. But by taking a slightly more clever approach the same job can be done in five questions at most. Can you work out what they are?


Information theory has lots of interesting applications in computing and one worth knowing about is “coding theory”.

A good few years ago at the start of the personal computer revolution a self-appointed guru announced that he had invented a “data compression machine” – in hardware. These were days well before zipping files was common and the idea of data compression was strange, exotic and very, very desirable – after all the largest disks at the time were only 180Kbytes! Compression is fine but this particular guru claimed that he could get an 80% compression ratio and all the data could be fed through a second time to get a compounded compression rate of 80% of the 80%.

Such ideas are in the same category as perpetual motion machines and faster than light travel but without information theory you can’t see why this particular claim is nonsense. Compression works by finding the code that represents the data most efficiently. If the data D contains I(D) bits of information and it is currently represented by more bits than I(D) you can compress it by finding the optimal code. That’s what programs that zip files attempt to do – they scan the data and work out a code custom made for it. The details of the code are stored along with the data in a “dictionary” but even with this overhead the savings made are enough to provide large space savings in most cases.

Now why can’t you repeat the procedure to get twice the savings?

It should be obvious now with information theory. You can’t do it again because once you have coded the information using an optimal code –you can’t improve on perfection! So the smallest number of bits needed to represent any data is the measure of the information it contains.

It really is that easy.

But how do you find the optimal code?

Find out in our introduction to Coding Theory in Chapter 12.

Channels, Rates and Noise

Before information theory we had little idea how much information a communication channel could carry. Ideas such as sending TV data over medium wave radio transmitters, or down a standard phone line sounded reasonable until you work out the amount of information involved and the capacity of the channel. It was this problem that really motivated Shannon to invent information theory – after all he was employed by a telephone company. He wanted to characterize the information capacity of a communication channel.

Such questions are also the concern of information theory and here we have to consider noise as well as information. The key theorem in this part of information theory, the channel capacity theorem, says that if you have an analog communication channel C with bandwidth B, i.e. it can carry frequencies within the bandwidth, measured in Hz and subject to noise, then the capacity of the channel in bits per second is:

C = B log2(1+S/N)

where S/N is the signal to noise ratio, i.e. the ratio of the signal power to the noise power. For example, if the bandwidth is 4 kHz and the signal to noise ratio is 100, which is typical of an analog phone connection then:

C = 4000 log2(1+100) = 26632.8 ≈ 26 kbits/s

If this is really the specification of the phone line, then this is the maximum rate at which you can send data. If you try to send data faster than this then errors will occur to reduce the achieved rate to 26 kbits/s. You can see that as you increase the noise the channel capacity drops. Intuitively what this means is that as the noise gets louder you have to repeat things to ensure that the signal gets through. Alternatively you could shout louder and increase the signal to noise ratio. Finally you could increase the bandwidth. This is the reason you need to use high frequency radio to send a lot of data.

If you are transmitting using a 1 GHz signal it is easy to get a 1 MHz bandwidth. If you are transmitting at 10 MHz then 1 MHz is a sizable chunk of the available spectrum.

More Information - Theory

From here information theory develops, becoming even more interesting and leading on to all sorts of ideas - how can you achieve the channel capacity with suitable codes, error correction, the Nyquist rate, reconstructing signals from samples and so on.

Information theory also provides a model that enables you to analyze many different types of system. In psychology you can work out how much information is being processed by the brain. Information is also related to entropy which is a natural quantity in many physical systems and this makes the connection between physics and computer science.

You can find out more from any book on information theory and there is still a lot of theoretical work to be done.


  • Before information theory we had little idea how much information any signal or set of symbols contained.

  • The key observation that led Shannon to his formulation of information theory was that the amount of information was related to the degree of surprise at receiving the message.

  • After some consideration, it becomes clear that the amount of information in a message A is I(A)=-log2(p) where p is the probability of receiving the message.

  • The unit of information is the bit and another way to understand information is as the number of binary questions you have to ask to determine a particular outcome.

  • Data compression relies on using the most efficient code to represent the data.

  • Although information theory is applied to digital systems, it was originally invented to deal with analog systems such as telephone lines and radio channels.

  • The key theorem from classical information theory is the channel capacity theorem, which relates the bandwidth and signal to noise ratio to the information capacity of a channel.

Related Articles

The Bit Player - Shannon Bio Pic On Amazon

Claude Shannon - Information Theory And More

Google Doodle for Claude Shannon's 100th Anniversary.

A Programmers Guide To Theory

Now available as a paperback and ebook from Amazon.



  1. What Is Computer Science?
    Part I What Is Computable?
  2. What Is Computation?
  3. The Halting Problem 
  4. Finite State Machines
  5. Practical Grammar
  6. Numbers, Infinity and Computation
    Extract 1: Numbers 
    Extract 2: Aleph Zero The First Transfinite
  7. Kolmogorov Complexity and Randomness
    Extract 1:Kolmogorov Complexity 
  8. Algorithm of Choice
  9. Gödel’s Incompleteness Theorem
  10. Lambda Calculus
    Part II Bits, Codes and Logic
  11. Information Theory 
  12. Coding Theory – Splitting the Bit
  13. Error Correction 
  14. Boolean Logic
    Part III Computational Complexity
  15. How Hard Can It Be?
  16. Extract 1: Where Do The Big Os Come From 
  17. Recursion
    Extract 1: Why Recursion
  18. NP Versus P Algorithms
    Extract 1: NP & Co-NP 
  19. Extract 2: NP Complete ***NEW!



To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on, Twitter, Facebook or Linkedin.


Mozilla Showcases Badges For Firefox Add-Ons

Having selected participants for the pilot of its Promoted Add-ons program, Mozilla has started to review their extensions. As a result Verified badges have begun to appear on Firefox Extensions.  [ ... ]

And So It Begins - SCOTUS Hears Google v Oracle

Today, October 7th, 2020 is the day that the Supreme Court hears oral arguments in Google v Oracle, summed up as a case concerning software code and copyright law. Will we look back on it as [ ... ]

More News





or email your comment to:

Last Updated ( Monday, 27 July 2020 )