|Magic of Merging|
|Written by Mike James|
|Tuesday, 01 October 2013|
Page 1 of 3
The merge sort is an under-appreciated algorithm - yet it is neat, clever and it still has its uses. With the rise of big data, parallel methods and online processing, you can even argue that it is growing in importance. Let's take a look at how it works and when you should use it.
The trouble with sorting is that it is such a nice neat subject that it is all too tempting to get carried away with the theoretical niceties and go for the fastest most elegant and perhaps the most recent. There is nothing more fascinating than trying to figure out just exactly how quick sort, heap sort etc. work but all of these methods assume that you have fast direct access to each item of data.
This is only true if you can read the entire data set into memory. In short most of the sorting methods that you come across are in memory methods - what would have been called in core methods in the old days of magnetic core memory!
Back in the early days of computing there wasn't much in the way of memory but there usually were a number of tape drives and these could, sometimes store the data on a single tape. There are sorting methods that are suitable for large data sets stored on disk or tape but they just aren't talked about very much. This is odd because they are far from boring! The basic ideas are simple and fascinating but the degree of complexity that results as you push towards increasing efficiency is also slightly frightening.
There is also the small matter of average versus worse case performance. For example, QuickSort may be fast but in the worse case it can take O(n2). Merge sort on the other hand is O(nlog2n) even in the worst case you can throw at it. This and the fact that it provides a stable sort i.e. it wont mess up any sorted order on another filed is the reason that the Java sort is based on it.
All in all it is well worth knowing about merge sort.
The key to all of the sorting methods described below is the natural merge.
Suppose you have two fully sorted files then how can you combine them into a single fully sorted file?
The best way to see how this can be done is to try it. If the two files are with the first item on the far right:
then as you can only see the first item of each file it is obvious that you should take the 9 from file 2 to copy to the merge file giving:
Next you should copy the 8 from file 1 to give:
Then the 7 from file 1, the 5 from file 2 and so on copying the largest of the top two elements of the pair of files. The result, is 1 2 3 4 5 7 8 9 i.e. a fully sorted merge file.
This merge principle can be extended to any number of files.
The algorithm for an N-way merge is:
This suggests a very simple method for sorting more records than can be stored in memory.
All you have to do is split the file that you want sorted into memory sized chunks. Read each one in turn, sort it into order using something fast like Quicksort or heap sort and then use an N-way merge algorithm to create a single sorted file.
Simple but there are practical problems if the number of segments that you split the file up into is great then the practical limits of the operating system might get in the way. If you can allocate a 250MByte buffer to sort a 1GByte file then merging four files seems reasonable but what about a 10GByte file with forty merge files or a 100GByte file with 400 merge files.
Even if you're not worried about the operating system having to cope with reading 400 files there is the question of the time it takes to read in one record from each file. With a random access device like a disk drive the head is likely to have to move to the current position of each file - i.e. 400 seeks per record written to the merge file.
Now at this point, if you have an inventive turn of mind you are probably trying to think up file organisations that make the merge process work with fewer files.
For example, suppose instead of 400 sorted files you have a single file but with the records not sorted but arranged so that the first 400 are the 400 that would have been read one from each of the original files. In this case you would only have to read a single file in a sequential fashion to find the largest of the 400. Before you get carried away I should point out that I don't know of an easy way to form such a file, i.e. one that involves fewer than 400 seeks per record in the first place, and even if I did what makes you think that there is enough memory to read in 400 records anyway!
There is a way to reduce the number of files to more manageable proportions but it isn't as simple as my simple suggestion.
Getting the runs
Clearly we can only use memory to form sorted chunks of a given size. If you want to use fewer files then each file is going to have to hold more than one sorted chunk so perhaps we should examine more complicated merging procedures.
The usual name for a sorted chunk of a file is a "run" and now we need a merging process that will cope with files that have more than one run. You can identify how many runs a file has by simply counting how many times it fails to be sorted. So for example 1,5,8,2,3 has two runs because 1,5,8 are in sort order but this fails at the 2 giving two runs of 1,5,8 and 2,3.
Consider merging the following pair of multi-run files:
At first everything looks the same, the 9 is the bigger than 3 so it is passed to the merge file, then the 6 is bigger than the 3 so it is passed to the merge file.
At this point the we come to the end of the fist run on file 2.
How do we know this?
Well the answer is that the next item on file 2 is bigger than the one that has just been passed to the merge file. So to detect the end of a run we also have to compare the next item in each file with the one before.
What should happen when we come to the end of a run?
The answer is can be a little tricky to see but if you go back to the original example you can see that a file containing 1 run when merged with a file containing 1 run results in a file containing 1 run. In other words you have to merge runs and not files.
This means that when you get to the end of a run on one of the files you should copy the remaining records in the run on the other file to the merge file and start afresh with a new run on both files. In this case after copying 9 and 6 to the merge file you have to copy 3 and then 2 giving a single run on the merge file of 2,3,6,9.
Next you repeat the natural merge on the second pair of runs and so on.
It really is a matter of merging runs rather than files.
To extend the method to more than two files is a bit more tricky but you should be able to see that you start off by comparing the current values from each file and copying the largest to the merge file i.e. just a standard merge operation.
The copy process uses up the current run on each file. As each file reaches the end of its current run it becomes inactive and is left out of the comparison process until only one file remains. The remainder of the final run is then copied to the merge file and the whole process starts off again with a fresh run on each file.
That is the algorithm, extended to n files, is:
At the end of the merge you have a single file with same number of runs on it that file with the most runs had.
For example, if you merge two files one with 4 runs and one with 6 runs the resulting file has 6 runs on it.
This sounds like no gain but before the merge the entire data set was spit into 4+6 i.e. 10 runs and after the merge it is in only 6 runs.
In other words, a merge reduces the total number of runs.
Reduction to a single run
Obviously to get a completely sorted file the total number of runs has to be reduced to 1 and this means splitting the merge file and repeating the merge process.
This is very clever! By working with runs rather than files you are implementing an N-way merge sort using a fixed number of files. The cost of this simplification is that you now have an additional step to partition to the number of files you can handle.
|Last Updated ( Wednesday, 02 October 2013 )|