|
Page 1 of 4
The trouble with sorting is that it is such a nice neat subject that it is all too tempting to get carried away with the theoretical niceties. There is nothing more fascinating than trying to figure out just exactly how quick sort, heap sort etc. work but all of these methods assume that you have fast direct access to each item of data.
This is only true if you can read the entire data set into memory. In short most of the sorting methods that you come across are in memory methods - what would have been called in core methods in the old days of magnetic core memory!
There are sorting methods that are suitable for large data sets stored on disk (or tape!) but they just aren't talked about very much. This is odd because they are far from boring! The basic ideas are simple and fascinating but the degree of complexity that results as you push towards increasing efficiency is also slightly frightening.
It comes as something of a surprise that the task of sorting has yet another layer of complexity.
The Merge
The key to all of the sorting methods described below is the natural merge.
Suppose you have two fully sorted files then how can you combine them into a single fully sorted file? The best way to see how this can be done is to try it. If the two files are:
Start of file -> File 1 File 2 1 4 7 8 2 3 5 9
then as you can only see the first item of each file it is obvious that you should take the 9 from file 2 to copy to the merge file giving:
File 1 File 2 Merge file 1 4 7 8 2 3 5 9
Next you should copy the 8 from file 1 to give:
File 1 File 2 Merge file 1 4 7 2 3 5 8 9
Then the 7 from file 1, the 5 from file 2 and so on copying the largest of the top two elements of the pair of files. The result, is 1 2 3 4 5 7 8 9 i.e. a fully sorted merge file.
This merge principle can be extended to any number of files. The algorithm for an N-way merge is:
REPEAT UNTIL(all input files are empty) examine the current item on each file and copy the largest to the merge file.
This suggests a very simple method for sorting more records than can be stored in memory. All you have to do is split the file that you want sorted ito memory sized chunks. Read each one in turn, sort it into order using something fast like Quicksort or heap sort and then use an N-way merge algorithm to create a single sorted file.
Simple but there are practical problems if the number of segments that you split the file up into is great then the practical limits of the operating system. If you can allocate a 250MByte buffer to sort a 1GByte file then merging four files seems reasonable but what about a 10GByte file with forty merge files or a 100GByte file with 400 merge files.
Even if you're not worried about the operating system having to cope with reading 400 files there is the question of the time it takes to read in one record from each file. With a random access device like a disk drive the head is likely to have to move to the current position of each file - i.e. 400 seeks per record written to the merge file.
Now at this point, if you have an inventive turn of mind you are probably trying to think up file organisations that make the merge process work with fewer files.
For example, suppose instead of 400 sorted files you have a single file but with the records not sorted but arranged so that the first 400 are the 400 that would have been read one from each of the original files. In this case you would only have to read a single file in a sequential fashion to find the largest of the 400. Before you get carried away I should point out that I don't know of an easy way to form such a file, i.e. one that involves fewer than 400 seeks per record in the first place, and even if I did what makes you think that there is enough memory to read in 400 records anyway!
There is a way to reduce the number of files to more manageable proportions but it isn't that simple.
<ASIN:0387001638>
<ASIN:0521547652>
<ASIN:0470121688>
<ASIN:0471963550>
<ASIN:0471241342>
|