|Written by Harry Fairhead|
|Thursday, 14 February 2019|
Page 3 of 3
The two symbols that are least likely now are D and E with a combined probability of .55. This also completes the coding because there are now only two groups of symbols and we might as well combine these to produce the finished tree.
The final step
This coding tree gives the most efficient representation of the five letters possible.
To find the code for a symbol you simply move down the tree reading off the zeros and ones as you go until you arrive at the symbol.
To decode a set of bits that has just arrived you start at the top of the tree and take each branch in turn according to whether the bit is a zero or a one until you run out of bits and arrive at the symbol.
Notice that the length of the code used for each symbol varies depending on how deep in the tree the symbol is.
The theoretical average information in a symbol in this example is 2.3 bits - this is what you get if you work out the average information formula given earlier.
If you try to code B you will find that it corresponds to 111 i.e. three bits and it corresponds to moving down the far right hand branch of the tree.
If you code D you will find it corresponds to 00 i.e. the far left hand branch on the tree.
In fact each remaining letter is either coded as a two or three bit code and guess what? If the symbols occur with their specified probabilities the average length of code used is 2.3 bits.
So we have indeed split the bit!
The code we are using averaged 2.3 bits to send a symbol.
Notice that there are some problems with variable length codes in that it is more difficult to store them because you need to indicate how many bits are in each group of bits. The most common way of overcoming this is to use code words that have a unique sequence of initial bits. This wastes some code words but it still generally produces a good degree of data compression.
If you have some data stored say on disk then it is unlikely to be stored using an efficient code. After all the efficient code would depend on the probabilities that each symbol occurred with and this is not something taken into account in simple standard codings.
What this means is that almost any file can be stored in less space if you switch to an optimal code.
So now you probably think that data compression programs build Huffman codes for the data on a disk?
They don’t because there are other considerations than achieving the best possible data compression such as speed of coding and decoding.
However, what they do is based on the principle of the Huffman code.
They scan through data and look for patterns of bits that occur often. When they find one say “01010101” they record it in a table and assign it a short code 11 say. Now when ever the code 11 occurs in the coded data this means 01010101 i.e. 8 bits are now represented by 2. As the data is scanned and repeating patterns found the table or “dictionary” is built up and sections of the data are replaced by shorter codes.
This is how the data compression is achieved but when the file is stored back on disk its dictionary has to be stored along with it. In practice data is generally so repetitive that the coded file plus its dictionary is much smaller than the original.
There are even schemes called "data deduping" that build a system wide dictionary and apply it to everything in a storage system. If every document starts in the same way with a standard heading and ends with a legal statement then this produces huge compression ratios.
Coding theory has a lot to contribute to computing and data compression is just a tiny part of the story. The next application we look at is error detecting and error correcting codes.
A Programmers Guide To Theory
|Last Updated ( Thursday, 14 February 2019 )|