The optimal code for any set of symbols can be constructed by assigning shorter codes to symbols that are more probable and longer codes to less commonly occurring symbols.

The way that this is done is very similar to the binary division used for Shannon-Fanno coding but instead of trying to create groups with equal probability we are trying to put unlikely symbols at the bottom of the “tree”.

The way that this works is that we sort the symbols into order of increasing probability and select the two most unlikely symbols and assign these to a 0/1 split in the code. The new group consisting of the pair of symbols is now treated as a single symbol with a probability equal to the sum of the probabilities and the process is repeated.

This is called Huffman coding, after its inventor, and it is the optimal code that we have been looking for.

For example suppose we have five symbols A,B,C,D and E with probabilities 0.1, 0.15, 0.2, 0.25 and 0.3 respectively. i.e.

A

B

C

D

E

0.1

0.15

0.2

0.25

0.3

The first stage groups A and B together because these are the least often occurring symbols. The probability of A or B occurring is 0.25 and now we repeat the process treating A/B as a single symbol.

The first stage of coding

Now the symbols with the smallest probability are C and the A/B pair which gives another split and a combined A/B/C symbol with a probability of 0.45. Notice we could have chosen C and D as the least likely given a different but just as good code.

The second stage

The two symbols that are least likely now are D and E with a combined probability of .55. This also completes the coding because there are now only two groups of symbols and we might as well combine these to produce the finished tree.

The final step

This coding tree gives the most efficient representation of the five letters possible.

To find the code for a symbol you simply move down the tree reading off the zeros and ones as you go until you arrive at the symbol.

To decode a set of bits that has just arrived you start at the top of the tree and take each branch in turn according to whether the bit is a zero or a one until you run out of bits and arrive at the symbol.

Notice that the length of the code used for each symbol varies depending on how deep in the tree the symbol is.

The theoretical average information in a symbol in this example is 2.3 bits - this is what you get if you work out the average information formula given earlier.

If you try to code B you will find that it corresponds to 111 i.e. three bits and it corresponds to moving down the far right hand branch of the tree.

If you code D you will find it corresponds to 00 i.e. the far left hand branch on the tree.

In fact each remaining letter is either coded as a two or three bit code and guess what? If the symbols occur with their specified probabilities the average length of code used is 2.3 bits.

So we have indeed split the bit!

The code we are using averaged 2.3 bits to send a symbol.

Notice that there are some problems with variable length codes in that it is more difficult to store them because you need to indicate how many bits are in each group of bits. The most common way of overcoming this is to use code words that have a unique sequence of initial bits. This wastes some code words but it still generally produces a good degree of data compression.

ZIP it!

If you have some data stored say on disk then it is unlikely to be stored using an efficient code. After all the efficient code would depend on the probabilities that each symbol occurred with and this is not something taken into account in simple standard codings.

What this means is that almost any file can be stored in less space if you switch to an optimal code.

So now you probably think that data compression programs build Huffman codes for the data on a disk?

They don’t because there are other considerations than achieving the best possible data compression such as speed of coding and decoding.

However, what they do is based on the principle of the Huffman code.

They scan through data and look for patterns of bits that occur often. When they find one say “01010101” they record it in a table and assign it a short code 11 say. Now when ever the code 11 occurs in the coded data this means 01010101 i.e. 8 bits are now represented by 2. As the data is scanned and repeating patterns found the table or “dictionary” is built up and sections of the data are replaced by shorter codes.

This is how the data compression is achieved but when the file is stored back on disk its dictionary has to be stored along with it. In practice data is generally so repetitive that the coded file plus its dictionary is much smaller than the original.

There are even schemes called "data deduping" that build a system wide dictionary and apply it to everything in a storage system. If every document starts in the same way with a standard heading and ends with a legal statement then this produces huge compression ratios.

What next

Coding theory has a lot to contribute to computing and data compression is just a tiny part of the story. The next application we look at is error detecting and error correcting codes.

Computational grammar is a subject that is sometimes viewed as a form of torture by computer science students, but understanding something about it really does help ....

One of the most important lossless forms of compression is the LZW dictionary based method. It turns up in lots of compression utilities - ZIP, Compress, Deflate and in GIF and PNG format files. It is [ ... ]