Page 2 of 2
Getting Started And Decompression
There are of course plenty of practical details that I have glossed over in the description but these are the sort of thing that a programmer can easily notice and solve.
The main question that must be worrying you concerns how the whole coding process gets going. When the window first slides over a data file there is nothing in the dictionary and data slides in until it fills the look-ahead buffer. At this point the contents of the look-ahead have to be coded before they can pass on to the dictionary - but the dictionary is empty!
The simple solution is to use a token of the form 0,0,C to code any character C that doesn't match anything in the dictionary. Notice that using a 0,0,C token actually takes more space than storing the symbol C - so we don't want to use many of these.
You should be able to see that the first portion of any file will be codes as a stream of 0,0,C tokens which become less common as the dictionary fills and matches for longer strings are found.
The 0,0,C tokens are also the solution to how the dictionary is initially constructed during decompression.
Decompression also starts out with an empty dictionary. The first token must be a 0,0,C type and this is used to place a character in the look-ahead buffer, which would now be better named an expansion buffer as each token is read from the compressed file and expanded into the the look-ahead.
Notice that the expansion is done so the the phrase is inserted at the boundary between the dictionary and the look-ahead. The expansion works by taking the N characters starting at P from the dictionary and then adding the character in the token to the end. Thus after each token is expanded the look-ahead contains N+1 new characters. These are then shifted into the dictionary ready to help with the decoding of the next token.
You can see that the 0,0,C tokens boot-strap the whole sliding window dictionary method into life. There are many alternative ways of doing this however.
Why Change The Dictionary?
As already mentioned this form of sliding window compression is generally called LZ77 after its inventors. It is remarkably good at file compression because the dictionary changes to suit the section of the file currently being processed. This is good because most files follow a principle of local similarity.
In other words, if a symbol or phrase has just occurred it is very likely to occur again somewhere near the first occurrence. For example consider an alternative method of simply taking the first part of the file and using it as a dictionary. The first part of any text is usually not representive of the rest and it would generally not provide a good compression ratio. The fact that the dictionary is dynamic is what makes the sliding window scheme so good.
Of course the efficiency of the whole scheme depends on the size chosen for the window.
So that's all there is to sliding window dictionary compression.
And this is the method that all of the super quick and powerful data compression methods use - PKZIP, Bitlocker and even the QIC-122 tape compression standard.
Of course they all throw in their own added extras to make the process more efficient but sliding window compression is the foundation of them all. It is also the basis of other compression algorithms such as some "de-dup" methods that claim to scan an entire file system and remove redundant files. It is also important in computer science as a measure of the amount of information in a string.
Even though the principle is simple, tuning for good efficiency is a difficult job. Don't be tempted to implement your own sliding window compression algorithm - there are too many good libraries just waiting to be used.