Page 4 of 4
The final touch
If your head is spinning trying to flow the intricacies of polyphase merging the final touch to the performance is a simple and pleasing idea.
Obviously the longer the initial runs in the data are the fewer the merge operations needed to sort the file. It is possible to use merge operations to sort data without any pre-sorting and these pure merge sorts offer a surprisingly good performance.
If you merge data that is in random order the average run length is 2. If you make use of a sort routine using enough memory to hold N records then the run length can be increased to N with a subsequent reduction in the number of merges need to sort the whole file.
Is there any way of using the memory to increase the average run lengths beyond N?
The answer is yes. If you use an incremental sort procedure, i.e. one that can add and remove data items while maintain the sorted order, such as heap sort. In this case you can initially read in and sort N records into a heap and write out the largest record to start the run reading in a new record to replace it.
As the sort procedure is incremental the new record can be placed in its correct sorted position in the heap and the largest record written out again. In this way the run length can be extended beyond N and it only fails when the new record that is read in is larger than the first record that was written out to start the run. When this happens you have no choice but to write out all N records and start a new run by reading in N more records.
If you do the statistics it turns out that using this method the average length of the run is 2N and of course never less than N.
Niklaus Wirth describes a procedure that works in exactly this way that using 6 files and enough RAM to store only 100 records it is possible to sort a file with 165,680,100 initial accidental runs in only 20 passes!
Sorting is a strange subject.