Ads by Lake Quincy Media

Ads by Lake Quincy Media
 
Banner

Ads by Lake Quincy Media
Coding theory
Article Index
Coding theory
Make it equal
Huffman code

 

In Information theory we learned that if a message or symbol occurs with probability p then the information contained in that symbol or message is -log2(p) bits.

For example consider the amount of information contained in a single letter from an alphabet. Assuming there are 26 letters in an alphabet and assuming they are all used equally often the probability of receiving any one particular letter is 1/26 which gives -log2(1/26)=4.7 bits per letter. It is also obvious from other elementary considerations that five bits are more than enough to represent 26 symbols. In fact they can represent 32 symbols and the extra is just because we can’t quite find a way to use 4.7 bits – or can we.

Can we split the bit?

Average information

Before we go on to consider “splitting the bit” we need to look more carefully at the calculation of the number of bits of information in one symbol from an alphabet of 26.

Clearly the general case is that a symbol from an alphabet of n symbols has information equal to –log2(n) bits assuming that every symbol has equal probability.

Of course what is wrong with this analysis is that the letters of the alphabet are not all equally likely to occur. For example, if you don’t receive a letter “e” you would be surprised – count the number in this sentence. However your surprise at seeing the letter “z” should be higher – again count the number in some other sentence than this one!

Empirical studies can provide us with the actual rates that letters occur – just count them in a book for example. If you add the “Space” character to the set of symbols you will discover that is by far the most probable character with a probability of 0.185 of occurring, followed by “e” at 0.100, “t” at 0.080 and so on.. down to “z” which has a probability of only 0.0005. It makes you wonder why we bother with “z” at all!

But, given that “z” is so unlikely its information content is very high –log2(0.0005)=10.96 bits. In other words “z” contains nearly 11bits of information compared to just over 3 bits for “e”.

We can define an average information that we expect to get from a single character taken from the alphabet. If the ith symbol occurs with probability pi then the information it contains is –log2(pi) bits and the average information contained in one symbol, averaged over all symbols is:

Average information=
–p1log2(p1) – p2log2(p2) –p3log2(p3)
… –pnlog2(pn)

You can see that this is reasonable because each symbol occurs with probability pi and each time it occurs it provides –log2(pi) bits of information. Notice that while “z” carries 11 bits of information the fact that it doesn’t occur very often means that its contribution to the average information is only about 0.0005x11 or 0.0055 bits per symbol.

Applying this formula to the letters of the alphabet and their probabilities gives an average information of 4.08 bits per symbol which should be compared to 4.76 bits per symbol for an alphabet of 27 equally likely characters. Notice that the average information in 27 equally likely characters is also 4.76 bits – to see why try working it out.

There is also a nice theorem which says that the average information per symbol is largest if all of the symbols are equally likely. This means that if any of the symbols of an alphabet are used more or less often than others the average information per symbol is lower and this observation is the key to data compression techniques.

<ASIN:0521852293>

<ASIN:0252725484>



 
     

I Programmer Poll

Java v .NET
 
I Programmer
Copyright © 2010 i-programmer.info. All Rights Reserved.
Joomla! is Free Software released under the GNU/GPL License.