Magic of Merging

Written by Mike James

Thursday, 20 August 2020


The final touch

If your head is spinning from trying to follow the intricacies of polyphase merging, the final touch to the performance is a simple and pleasing idea.

Obviously, the longer the initial runs in the data, the fewer the merge operations needed to sort the file. It is possible to use merge operations to sort data without any pre-sorting, and these pure merge sorts offer surprisingly good performance.

If you merge data that is in random order, the average run length is 2. If you first pre-sort the data using a sort routine with enough memory to hold N records, then the run length can be increased to N, with a corresponding reduction in the number of merges needed to sort the whole file.
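As a rough illustration of the two run lengths just mentioned, here is a minimal Python sketch. The function names `natural_runs` and `presorted_runs` are made up for this example, and everything is done with in-memory lists rather than files.

```python
import random

def natural_runs(data):
    """Split data into its maximal ascending runs, as found 'by accident'."""
    runs = []
    for x in data:
        if runs and runs[-1][-1] <= x:
            runs[-1].append(x)      # x extends the current run
        else:
            runs.append([x])        # x is smaller, so a new run starts
    return runs

def presorted_runs(data, n):
    """Sort successive blocks of n records; each sorted block is one run."""
    return [sorted(data[i:i + n]) for i in range(0, len(data), n)]

random.seed(0)
data = [random.random() for _ in range(10_000)]
print(len(data) / len(natural_runs(data)))         # close to 2 for random data
print(len(data) / len(presorted_runs(data, 100)))  # 100 records per run
```

Here the blocks are simply counted as separate runs. Two neighbouring blocks can occasionally join up into a longer run, so N is best read as the guaranteed length of a pre-sorted run.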

Is there any way of using the memory to increase the average run lengths beyond N?

The answer is yes.

What you need is an incremental sort procedure, i.e. one that can add and remove data items while maintaining the sorted order, such as the heap used in heap sort. In this case you can initially read in N records and build them into a heap, then write out the largest record to start the run and read in a new record to replace it.

As the sort procedure is incremental, the new record can be placed in its correct sorted position in the heap and the largest record written out again. In this way the run length can be extended beyond N. It only fails when the new record that is read in is larger than the record most recently written out, because that record can no longer be added to the run in order. When this happens you have no choice but to write out all N records and start a new run by reading in N more records.

If you do the statistics it turns out that using this method the average length of the run is 2N and of course never less than N.
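The idea can be sketched in Python using the standard `heapq` module. Note that this is the usual textbook variant, generally called replacement selection, and it differs from the description above in two ways: it builds ascending runs, so it writes out the smallest record rather than the largest, and a record that cannot join the current run is kept in the heap, tagged for the next run, instead of forcing the whole heap to be written out. The name `replacement_selection` and the run-number tagging are choices made for this sketch.

```python
import heapq
import random

_DONE = object()  # sentinel marking the end of the input

def replacement_selection(records, n):
    """Yield ascending runs using a heap that holds at most n records."""
    it = iter(records)
    # Heap entries are (run_number, key), so records tagged for the
    # next run always sort after everything left in the current run.
    heap = [(0, x) for _, x in zip(range(n), it)]
    heapq.heapify(heap)
    current, run = 0, []
    while heap:
        run_no, smallest = heapq.heappop(heap)
        if run_no != current:       # current run is exhausted
            yield run
            run, current = [], run_no
        run.append(smallest)        # "write out" the record
        nxt = next(it, _DONE)
        if nxt is not _DONE:
            # The new record joins the current run only if it is not
            # smaller than the record that has just been written out.
            tag = current if nxt >= smallest else current + 1
            heapq.heappush(heap, (tag, nxt))
    if run:
        yield run

random.seed(0)
data = [random.random() for _ in range(100_000)]
runs = list(replacement_selection(data, 100))
print(sum(map(len, runs)) / len(runs))  # mean run length, to compare with 2N = 200
```

For example, with N = 3 the input 5, 1, 4, 2, 8, 3, 9, 7, 6, 0 produces a first run of six records, 1 2 4 5 8 9, which is twice the size of the heap.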

Niklaus Wirth describes a procedure that works in exactly this way. Using 6 files and enough RAM to store only 100 records, it is possible to sort a file with 165,680,100 initial accidental runs in only 20 passes!
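One way to see where a number of this size can come from is to compute the "perfect" run distributions for polyphase merging. The sketch below assumes the standard recurrence for these distributions and assumes that 5 of the 6 files act as inputs at each phase, which is not spelled out in the text. The function name `polyphase_distribution` is made up for this example.

```python
def polyphase_distribution(inputs, level):
    """Runs per input file in a perfect polyphase distribution.

    Level 0 is a single run on one file. Each further level is the
    distribution that one merge phase reduces to the previous level.
    """
    a = [1] + [0] * (inputs - 1)
    for _ in range(level):
        first = a[0]
        a = [first + nxt for nxt in a[1:]] + [first]
    return a

runs = sum(polyphase_distribution(5, 20))
print(runs)        # 1656801
print(runs * 100)  # 165680100
```

Under these assumptions, 20 phases with 5 input files handle 1,656,801 runs, and multiplying by the 100 records of memory gives exactly the quoted 165,680,100. Reading the factor of 100 as the gain from the in-memory heap is an inference from the arithmetic rather than something stated here.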

Sorting is a strange subject.

The Future

OK, I admit that merge sorting's best days were when computers kept you warm, had lots of flashing lights and cool tape drives.

This is what a computer should look like, and now you know what the tape drives are doing for most of the time.

The point is, however, that merge sorting isn't a relic. Java, for one, uses it as a standard way of sorting collections, and many programmers are puzzled as to why this is. Why isn't QuickSort in use?

As mentioned in the introduction, the answer is that QuickSort has a worst-case running time of O(n²), whereas merge sort offers O(n log₂ n) even in the worst case.

Merge sort is also a method of sorting that uses only sequential access, and there are lots of situations where sequential access is all you have. If you have data coming in over a live link, then sorting it using merge-sort-like methods may be cost effective: you build up runs by sending the data to a set of files and then merge the files. Similarly, with today's huge datasets, i.e. big data, you might not be able to store the whole thing in memory, and using some form of merge sort again becomes a possibility.
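A minimal sketch of this "runs to files, then merge" approach in Python might look like the following. It assumes integer records stored one per line in temporary files, and it uses a single k-way merge via the standard `heapq.merge`, which needs every run file open at once, rather than the balanced or polyphase schemes. The name `external_sort` and the parameter `n` are illustrative.

```python
import heapq
import tempfile

def external_sort(numbers, n):
    """Yield the integers in sorted order, holding at most n in memory
    while the initial runs are being built."""
    run_files, block = [], []

    def flush():
        # Sort the block in memory and write it out as one run.
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{x}\n" for x in sorted(block))
        f.seek(0)                   # rewind, ready for sequential reading
        run_files.append(f)
        block.clear()

    for x in numbers:
        block.append(x)
        if len(block) == n:
            flush()
    if block:
        flush()

    try:
        # Each run is read strictly front to back, one record at a time.
        yield from heapq.merge(*(map(int, f) for f in run_files))
    finally:
        for f in run_files:
            f.close()

print(list(external_sort([5, 3, 8, 1, 9, 2, 7], 3)))  # [1, 2, 3, 5, 7, 8, 9]
```

Because `numbers` can be any iterable, the same function would work on records arriving one at a time from a stream.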

So even when sequential access isn't an issue, simple merge sorting has advantages.

One completely new application is in parallel processing. Hadoop makes it relatively easy to split up a calculation among many computers and get back a single result. The splitting up is called a map operation, in which each computer gets to compute its part of the result. The final answer is obtained by way of a reduce operation, in which the results from the individual machines are merged into a single answer. A merge sort can be implemented in parallel by allowing each machine to sort a portion of the data small enough to fit into memory, and then the final result is obtained by merging. Of course, you might not have enough machines to get the job done in one merge, so balanced and polyphase merges become useful algorithms once more.
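The shape of such a parallel merge sort can be sketched on a single machine. This is not Hadoop code: worker threads from Python's standard `concurrent.futures` module stand in for separate machines, so it illustrates the structure rather than any real speed-up. The name `parallel_merge_sort` and the `workers` parameter are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def parallel_merge_sort(data, workers=4):
    """Sort chunks independently (the 'map'), then merge them (the 'reduce')."""
    chunk = max(1, -(-len(data) // workers))    # ceiling division
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_parts = list(pool.map(sorted, parts))   # each worker sorts one part
    return list(heapq.merge(*sorted_parts))            # single k-way merge

print(parallel_merge_sort([9, 4, 7, 1, 8, 2, 6, 3, 5], workers=3))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Here the final merge takes all the sorted parts at once. If there were more parts than could be merged in a single step, that last line is where a multi-pass merge would take over.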

There are even good reasons to use modified merge sorts to make best use of any caches that are available. In this case the size of each run is arranged to just fit into the cache.

The days of the merge sort are far from over.


Mike James is the author of Programmer's Python: Everything is an Object published by I/O Press as part of the I Programmer Library. With the subtitle "Something Completely Different" this is for those who want to understand the deeper logic in the approach that Python 3 takes to classes and objects. His latest book is The Programmer’s Guide To Theory: Great ideas explained.
