Getting the runs
Clearly we can only use memory to form sorted chunks of a given size. If you want to use fewer files, then each file is going to have to hold more than one sorted chunk, so perhaps we should examine more complicated merging procedures.
The usual name for a sorted chunk of a file is a "run" and now we need a merging process that will cope with files that have more than one run. Consider merging the following pair of multi-run files:
File 1:  1,5,8  2,3
File 2:  4,7    6,9
At first everything looks the same: the 9 is bigger than the 3, so it is passed to the merge file; then the 6 is bigger than the 3, so it too is passed to the merge file.
At this point we come to the end of the first run on file 2.
How do we know this?
Well, the answer is that the next item on file 2 is bigger than the one that has just been passed to the merge file. So to detect the end of a run we also have to compare the next item in each file with the one before it.
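If you want to see this in code, here is a minimal Python sketch of the run-detection idea. The function name split_into_runs is just for illustration, and it assumes each run is written largest-first so that, as in the merge described here, an item bigger than the one before it signals a new run:

def split_into_runs(items):
    # Split a file (here just a list) into its runs.
    # Runs are assumed to be written largest-first, so a new run
    # starts wherever an item is bigger than the one before it.
    runs = []
    current = []
    for item in items:
        if current and item > current[-1]:
            runs.append(current)  # next item bigger - end of run
            current = []
        current.append(item)
    if current:
        runs.append(current)
    return runs

print(split_into_runs([9, 6, 7, 4]))  # [[9, 6], [7, 4]]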
What should happen when we come to the end of a run?
The answer can be a little tricky to see, but if you go back to the original example you can see that a file containing one run, when merged with a file containing one run, results in a file containing one run. In other words, you have to merge runs and not files.
This means that when you get to the end of a run on one of the files you should copy the remaining records in the run on the other file to the merge file, and then start afresh with a new run on both files. In this case, after copying 9 and 6 to the merge file, you have to copy 3 and then 2, giving a single run on the merge file of 2,3,6,9.
Next you repeat the natural merge on the second pair of runs and so on.
It really is a matter of merging runs rather than files.
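As a rough Python sketch of this, reusing split_into_runs from above: the merge here copies the largest item first, so assume each run is written largest-first and the runs appear in the order they are merged, making file 1's data 3,2,8,5,1 and file 2's 9,6,7,4. The function name merge_two_run_files is just for illustration.

def merge_two_run_files(file1, file2):
    # Merge two multi-run files one pair of runs at a time.
    runs1 = split_into_runs(file1)
    runs2 = split_into_runs(file2)
    merged = []
    for i in range(max(len(runs1), len(runs2))):
        a = runs1[i] if i < len(runs1) else []
        b = runs2[i] if i < len(runs2) else []
        x = y = 0
        while x < len(a) and y < len(b):
            if a[x] >= b[y]:
                merged.append(a[x])
                x += 1
            else:
                merged.append(b[y])
                y += 1
        merged.extend(a[x:])  # copy the remainder of whichever
        merged.extend(b[y:])  # run still has items left
    return merged

print(merge_two_run_files([3, 2, 8, 5, 1], [9, 6, 7, 4]))
# [9, 6, 3, 2, 8, 7, 5, 4, 1] - the runs 2,3,6,9 and 1,4,5,7,8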
To extend the method to more than two files is a bit more tricky but you should be able to see that you start off by comparing the current values from each file and copying the largest to the merge file.
The copy process uses up the current run on each file. As each file reaches the end of its current run it becomes inactive and is left out of the comparison process until only one file remains. The remainder of the final run is then copied to the merge file and the whole process starts off again with a fresh run on each file.
That is, the algorithm extended to n files is:
REPEAT UNTIL (all files are empty)
  mark all files as active
  REPEAT UNTIL (only one file is active)
    compare the current items of all active files
    and copy the largest to the merge file
    IF (a file is at the end of a run) THEN
      mark it as inactive
  copy the remainder of the run on the only active file
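As a Python sketch of this pseudocode (the files are just lists, pos records how far each one has been read, and runs are again assumed to be written largest-first):

def merge_pass(files):
    # One merge pass over n input files: repeatedly merge the
    # current run of every file into one run on the merge file.
    merge_file = []
    pos = [0] * len(files)

    def at_run_end(i, start):
        f, p = files[i], pos[i]
        return p >= len(f) or (p > start and f[p] > f[p - 1])

    while any(p < len(f) for p, f in zip(pos, files)):
        start = list(pos)  # where the current run starts in each file
        active = [i for i in range(len(files)) if pos[i] < len(files[i])]
        while len(active) > 1:
            # copy the largest current item of the active files
            i = max(active, key=lambda j: files[j][pos[j]])
            merge_file.append(files[i][pos[i]])
            pos[i] += 1
            if at_run_end(i, start[i]):
                active.remove(i)  # mark the file as inactive
        # copy the remainder of the run on the only active file
        i = active[0]
        while not at_run_end(i, start[i]):
            merge_file.append(files[i][pos[i]])
            pos[i] += 1
    return merge_file

print(merge_pass([[3, 2, 8, 5, 1], [9, 6, 7, 4]]))
# [9, 6, 3, 2, 8, 7, 5, 4, 1] - two runs, as in the example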
At the end of the merge you have a single file with the same number of runs as the file that had the most runs.
For example, if you merge two files one with 4 runs and one with 6 runs the resulting file has 6 runs on it.
This sounds like no gain, but before the merge the entire data set was split into 4 + 6 = 10 runs and after the merge it is in only 6 runs.
In other words, a merge reduces the total number of runs.
Reduction to a single run
Obviously to get a completely sorted file the total number of runs has to be reduced to 1 and this means splitting the merge file and repeating the merge process.
copy data to merge file
REPEAT UNTIL (number of runs on merge file is 1)
  split merge file into n input files
  merge input files to give new merge file
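In Python, reusing split_into_runs and merge_pass from the sketches above, the loop might look like this; dealing complete runs out to the input files in turn is just one simple way of doing the split:

def split_and_merge_sort(data, n):
    # Keep splitting the merge file into n input files and
    # re-merging until only a single run is left.
    merge_file = list(data)
    while len(split_into_runs(merge_file)) > 1:
        files = [[] for _ in range(n)]
        for k, run in enumerate(split_into_runs(merge_file)):
            files[k % n].extend(run)  # deal runs out in turn
        merge_file = merge_pass(files)
    return merge_file

print(split_and_merge_sort([3, 1, 4, 1, 5, 9, 2, 6], 2))
# [9, 6, 5, 4, 3, 2, 1, 1] - one run, largest-first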
The splitting process is a bit wasteful because it moves data around the disk but doesn't do anything to improve the order of the file.
A better method would be to split the runs as they were being produced by the merge process. All this involves is using n input files and n output merge files.
Each time you come to the end of the current run on all of the input files you simply switch to a different output file.
As each set of runs on the input files is merged into a single run, each output file receives a run in turn, building up a roughly equal number of runs on each file ready for the next phase of the merge.
There are a few fine details to clear up but this is the essence of the balanced merge procedure and it works very well.
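Here is a Python sketch of one balanced pass, again reusing the earlier functions. In a real implementation each run would be written straight to its output file as it is produced; merging first and then dealing the runs out, as done here for brevity, has the same effect:

def balanced_pass(in_files, n_out):
    # One balanced-merge pass: merge the input files and send
    # successive output runs to the output files in rotation.
    out_files = [[] for _ in range(n_out)]
    for k, run in enumerate(split_into_runs(merge_pass(in_files))):
        out_files[k % n_out].extend(run)
    return out_files

def balanced_merge_sort(data, n):
    # The initial pass just deals the existing runs out to n files,
    # then balanced passes repeat until a single run remains.
    files = balanced_pass([list(data)], n)
    while sum(len(split_into_runs(f)) for f in files) > 1:
        files = balanced_pass(files, n)
    return max(files, key=len)

print(balanced_merge_sort([3, 1, 4, 1, 5, 9, 2, 6], 2))
# [9, 6, 5, 4, 3, 2, 1, 1]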
If you start off with a total of r runs equally distributed across n files, then the first merge reduces the number of runs to r/n. The next pass cuts this down to r/n^2 and after k passes there are r/n^k runs, so reducing r runs to a single run takes log_n(r) passes. In the worst case each initial run holds just one item, so r is simply the number of items N and the number of passes required to sort N items is log_n(N). As each pass requires N copy operations, the performance of this balanced merge is N log_n(N), which is very acceptable.
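To put rough numbers on this, here is a back-of-an-envelope calculation in Python, with N and n made up purely for illustration:

import math

N = 1_000_000  # items to sort (illustrative)
n = 4          # number of files merged per pass (illustrative)

passes = math.ceil(math.log(N, n))  # runs shrink by a factor of n per pass
copies = passes * N                 # every pass copies all N items
print(passes, copies)               # 10 passes, 10000000 copies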