|Data Compression The Dictionary Way|
|Written by Alex Armstrong|
|Monday, 15 July 2013|
Page 1 of 2
One of the most important lossless forms of compression is the LZW dictionary based method. It turns up in lots of compression utilities - ZIP, Compress, Deflate and in GIF and PNG format files. It is also an important idea in programming and you really do need to know something about how it works - if only to avoid reinventing it from scratch.
If you already know something about coding theory you might have heard of Huffman coding and statistical modeling of data etc.
You might also think that the data compression methods that we actually use, e.g. when zipping a file, are sophisticated variation on these methods.
While it is true that these traditional methods are capable of excellent compression performance they are very slow to implement and not really suitable for real time use.
First it is important to distinguish between lossless and lossy compression. In lossy compression you can throw away data and the reconstruction, i.e. the decompression doesn't have to produce an identical copy of the original. In this case it is enough that the decompressed version looks or sounds as good. Good lossy compression is as much about human perception as it is computer science. Lossless data compression doesn't throw away any data - it simply finds the most efficient coding for the data by eliminating redundancies.
As already mentioned the theoretical solution to lossless compression is the Huffman code which finds the most efficient coding and stores the data in the smallest number of bits. Good in theory but not so good in practice.
The breakthrough came in 1977 with a theoretical paper by Jacob Ziv and Abraham Lempel.
This paper started the whole subject of dictionary compression but, although it was a very theoretical sounding paper, it is amazing that no-one thought of the technique before. It is even possible that someone might have implemented the idea in a practical form before the great breakthrough - it is so simple.
So what is dictionary coding?
Consider the text of this article as a subject for compression. Instead of giving you each word spelled out letter by letter I could give you the page number and line number of the word in a standard dictionary. You can easily imagine, and a back of the envelope calculation will quickly convince you, that you reduce the amount of data needed to store the text using dictionary references.
This is the basis of dictionary compression.
The problem is what dictionary should you use?
If you use a standard dictionary you can refine the contents of the dictionary to reduce the storage needed to a minimum but there is a hidden cost. To make sense of the data you need the dictionary and the storage needed for it has to be taken into account. You may be able to store this article in only a few KBytes using dictionary coding but you would be less impressed if I then told you that you needed several Megabytes to store the dictionary needed to decode it.
The Sliding Window
Clearly dictionary coding is tantalising but how to construct the dictionary without incurring large storage overheads is the real problem.
This is where "sliding window" dictionary compression comes in.
Why not use the document as its own dictionary?
Childishly simple isn't it? But what exactly does this mean and how can you implement this in detail?
The algorithm described is essentially that introduced in the 1977 paper and is usually known as LZ77 but there are many slight variations on the basic idea.
Start off with a buffer that acts as a window onto the file being compresses.
Divide the buffer into two portions;
The chunk to the left is used as the dictionary and the chunk to the right is used as a look ahead buffer that holds the section of the file that we are trying to compress. It is assumed that hte dictionary portion has already been processed by the compression algorithm.
What happens next is that you try to find a match for the string in the look-ahead buffer in the dictionary section. Once you find a match the string in the look-ahead is coded by generating a three-element token:
So for example, the token 32,14,d means that the phrase in the look-ahead buffer matches the dictionary at position 32 and the match continues for 14 characters. The first character in the look-ahead after the match fails is "d".
This token is output as the next token in the compressed file which is simply a stream of such tokens generated in compressing the entire file.
You should be able to see that this token can be decoded back to the phrase that it represents simply by getting the 14 characters starting at position 32 from the dictionary and adding a letter "d" onto the end.
Thus we have succeeded in reducing a phrase of 15 characters to a three element token and this must take less space to store.
Notice that as the dictionary is just the first part of the file read into the window there is no need to store a separate dictionary as part of the compressed file. If you imagine for a moment that we are part way though decompressing the file then we already have the portion of the file to use as a dictionary to decode the next token. This avoids the question of how we get the whole thing started with a dictionary portion but more of this after we have finished examining the basic algorithm.
You don't just want to accept the first match that you find in the dictionary.
Well it doesn't matter what sort of match you get, the token still consists of three elements so obviously you do better to code as many characters per token as possible.
Thus assuming that the window has a portion of the file which has already been compressed - the dictionary and a portion we are trying to compress - the look ahead buffer then this makes the first part of the algorithm:
1) Find the longest match between the phrase in the look-ahead buffer and the dictionary and code this as a three-element token P,N,C where P is the position of the match, N is the length and C is the character following the matching phrase.
Now what do you do?
You have successfully coded N+1 characters and output a token to the compressed file you are building. This means that you have finished with N+1 characters in the look ahead buffer.
This is where the "Sliding" part of the sliding windows algorithm comes in. Having coded N+1 characters why not just slide the window N+1 characters to the right to read in N+1 more characters:
The N+1 characters that were in the look-ahead are now in the dictionary. The look-ahead buffer now contains nothing but characters that are yet to be coded so we can repeat the step of finding the best match and issuing a token.
But wait, what about the N+1 characters that fell off the left-hand end of the dictionary buffer when the window moved?
The answer is that if the window slides over the entire file so that data enters at the right and is coded as it moves from the look-ahead to the dictionary then the data that is lost out of the left-hand end of the window will always have been coded at some time in the past. In this way the dictionary is updated with new characters but all of these have already been compressed and output as a token.
So the second step in the method is:
2) Shift the window N+1 character to the right and repeat step 1
That's all there is too it.
The final compressed file is just a stream of tokens and you don't need to store a dictionary within the compressed file because the dictionary is just part of the uncompresed file which is built up as you uncompress it.
That is the uncompressed file is its own dictionary.
|Last Updated ( Monday, 15 July 2013 )|