External Sort

Sorting CS 102 File Structures & File Organizations Sorting – arranging the items in a list in ascending or descending order by a key value. Applicable for all file organizations, not just sequential Why sort ? to make a report, to merge files in queries, to merge files in master file maintenance, to make searches easier, to prioritize, etc. Chapter 05 External Sorting Algorithms Internal vs External Sorts Internal Sort – sorting items entirely in main memory ICS 2, ICS 3, CS 101 External Sort – sorting files in secondary storage using main memory CS 102 Why external sort ?

Some files may be too large to fit in main memory Some Terminologies A Pass – an iteration that goes through the items (or records) of a list (or file) once to include reading it from file, processing it in main memory and writing it to file. A Run – a grouping of some items of a list. Usually a run starts as a block of records but eventually increases in size. Size of a Run – the number of items in a run. Usually no less than the blocking factor. A Merge – combining lists into one The Algorithms

External Sort Algorithms 2-way Sort Merge Balanced 2-way Sort Merge Balanced k-way Sort Merge Polyphase Sort Merge Overview : 2-Way Sort Merge A simple 2-way Sort Merge repeatedly merges 2 smaller sorted components of a file into a sorted bigger component of the file. The algorithm Phase 1 : The Sort Phase Phase 2 : The Merge Phase The Sort Phase Phase 1 : The Sort Phase Divide the records of a file into several runs, internal sort the records in a run, and distribute the runs “evenly” to two external files file_1 and file_2

The Merge Phase Phase 2 : The Merge Phase For each pair of runs, one from file_1 and another from file_2, merge the pair resulting in a longer run. Store the new resulting run in a third external file file_3 Redistribute the runs evenly in file_3 to file_1 and file_2 Repeat Phase 2 until all records are in one long run. Tips for Efficiency As much sorting in main memory must be performed using internal sort because file accesses are slower than main memory accesses.

The size of a run must be as large as available space in main memory, limited by other data that must also be in main memory. Each file must be on a separate device (such as tapes or disks) to allow easy access during the merge phase. The original file and file_3 may be assigned to the same device. The output will be in file_3. Algorithm Simulation (1) File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80 File Size = 15 records Size of Run = 3 initially (Usually a large number ? locking factor) Number of Runs = 5 initially Sort Phase : Pass 1: Group records into size of run 50 110 95 – 10 100 36 – 153 40 120 – 60 70 130 – 22 140 80 3 records are in main memory at a time Do an internal sort and distribute File 1: 50 95 110 – 40 120 153 – 22 80 140 File 2: 10 36 100 – 60 70 130 File 3: empty Algorithm Simulation (2) File 1: 50 95 110 – 40 120 153 – 22 80 140 File 2: 10 36 100 – 60 70 130 File 3: empty Merge Phase : Pass 2: Merge : 2 sets of 3 records are in main memory at a time File 1: empty File 2: empty File 3: 10 36 50 95 100 110 – 0 60 70 120 130 153 – 22 80 140 Pass 3: Distribute: 3 records are in main memory at a time File 1: 10 36 50 95 100 110 – 22 80 140 File 2: 40 60 70 120 130 153 File 3: empty Algorithm Simulation (3) File 1: 10 36 50 95 100 110 – 22 80 140 File 2: 40 60 70 120 130 153 File 3: empty Pass 4: Merge : 2 sets of 3 records are in main memory at a time File 1: empty File 2: empty File 3: 10 36 40 50 60 70 95 100 110 120 130 153 – 22 80 140 Pass 5: Distribute: 3 records are in main memory at a time File 1: 10 36 40 50 60 70 95 100 110 120 130 153 File 2: 22 80 140 File 3: empty

Algorithm Simulation (4) File 1: 10 36 40 50 60 70 95 100 110 120 130 153 File 2: 22 80 140 File 3: empty Pass 6: 2 sets of 3 records are in main memory at a time File 1: empty File 2: empty File 3: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153 Number of passes = 1 Sort Pass + 5 Merge Passes Total Number of passes = 6 Algorithm Analysis (1) Amount of space occupied Requires 3 files Ease in the implementation of the algorithm Straightforward Awkward in having separate merge and redistribution steps in merge phase The redistribution increases the number of passes Algorithm Analysis (2)

Speed of the algorithm Let NR be the number of runs initially. If NR = 1, Total Passes = 2 (one sort phase and one trivial merge phase) Suppose NR > 1. Total Passes = ? log2 NR ? * 2 Algorithm Analysis (3) Assume NR is a power of 2. That is NR = 2 j Each iteration is composed of a distribution step and a merge step. It divides the number of runs by 2 Until there is only 1 run. The number of divisions to go from 2 j to 1 is j. So the number of iterations is j = log2 NR If NR is not a power of 2, the number of iterations is ? log2 NR ? Algorithm Analysis (4) 1 2 3 4 5 6 7 8 Merge 1

Balanced 2-Way Sort Merge Overview : The Balanced 2-Way Sort Merge improves the 2-Way Sort merge. It combines the merge and distribution steps. It requires 4 files. The algorithm Phase 1 : The Sort Phase Phase 2 : The Merge Phase Merge 2 Merge 3 Each iteration requires 2 passes : to distribute and to merge So Total Passes = ? log2 NR ? * 2 When NR=5, algorithm requires 6 passes The Sort Phase Phase 1 : The Sort Phase Divide the records of a file into several runs, internal sort the records in a run, and distribute the runs “evenly” to two external files file_1 and file_2

The Merge Phase Phase 2 : The Merge Phase For each pair of runs, one from file_1 and another from file_2, merge the pairs resulting in longer runs. alternately store the resulting runs in external files file_3 and file_4 Repeat Phase 2 until all records are in one long run. Alternate the roles of file_1 and file_2 with file_3 and file_4 depending on which files need to be merged and which would hold the redistributed resultant longer runs. End of Algorithm Copy or assign the file that contains the one long run to the desired output file.

It is called “balanced” because in each iteration, the number of input files used is equal to the number of output files used. Algorithm Simulation (1) File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80 File Size = 15 records Size of Run = 3 initially Number of Runs = 5 initially Sort Phase : Pass 1 50 110 95 – 10 100 36 – 153 40 120 – 60 70 130 – 22 140 80 3 records are in main memory at a time do an internal sort and distribute File 1: 50 95 110 – 40 120 153 – 22 80 140 File 2: 10 36 100 – 60 70 130 File 3: empty File 4 : empty Algorithm Simulation (2)

Merge Phase : 2 sets of 3 records are in main memory at a time merge and distribute Pass 2 File 1, File 2: empty File 3: 10 36 50 95 100 110 – 22 80 140 File 4: 40 60 70 120 130 153 Pass 3 File 1: 10 36 40 50 60 70 95 100 110 120 130 153 File 2: 22 80 140 File 3, File 4: empty Pass 4 File 1, File 2, File 4: empty File 3: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153 Algorithm Analysis (1) Amount of space occupied Requires 4 files (instead of 3) Ease in the implementation of the algorithm Straightforward implementation Requires an additional file Easy to combine merge and redistribution steps in merge phase Reduces the number of passes

Algorithm Analysis (2) Speed of the algorithm Let NR be the number of runs initially If NR = 1, Total Passes = 1 (Sort Phase only) Suppose NR > 1. Total Passes = ? log2 NR ? + 1 Algorithm Analysis (3) Assume NR is a power of 2. That is NR = 2 j The sort phase takes 1 pass: sorts each run, but does not reduce the number of runs. Each execution of merge phase is composed of a merge step and a distribution step. It divides the number of runs by 2 Until there is only 1 run. The number of divisions to go from 2 j to 1 is j.

So the number of merges is j = log2 NR And the total number of passes is j + 1 = (log2 NR) + 1, including the one for sort phase If NR is not a power of 2, the number of passes is ? log2 NR? + 1 When NR=5, requires 4 passes instead of 6 Balanced k-Way Sort Merge Overview : The Balanced k-Way Sort Merge is an improvement of Balanced 2-way sort merge. Instead of merging 2 files at a time, merge k files at a time, k ? 2 Requires 2k files Each pass results in fewer number of runs compared to each pass of balanced 2-way sort merge. The Sort Phase

Phase 1 : The Sort Phase Divide the records of a file into several runs, internal sort the records in a run, and distribute the runs alternately to k external files The Merge Phase Phase 2 : The Merge Phase For each k-tuple of runs, one from each file with the distribution of runs, merge the k-tuples resulting in longer runs. alternately distribute the resulting runs in the other k external files Repeat Phase 2 until all records are in one long run. Alternate the roles of the first k files and the second k files depending on which files need to be merged and which would hold the redistributed resultant longer runs.

Algorithm Simulation (1) File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80 File Size = 15 records; Size of Run = 3 initially Number of Runs = 5 initially k=3 Sort Phase : Pass 1 50 110 95 – 10 100 36 – 153 40 120 – 60 70 130 – 22 140 80 3 records are in main memory at a time do an internal sort and distribute File 1: 50 95 110 – 60 70 130 File 2: 10 36 100 – 22 80 140 File 3: 40 120 153 File 4, File 5, File 6 : empty Copy or assign the file that contains the one long run to the desired output file. Algorithm Simulation (2)

File 1: 50 95 110 – 60 70 130 File 2: 10 36 100 – 22 80 140 File 3: 40 120 153 File 4, File 5, File 6 : empty Merge Phase : 3 sets of 3 records are in main memory at a time Pass 2 File 1, File 2, File 3: empty File 4 : 10 36 40 50 95 100 110 120 153 File 5 : 22 60 70 80 130 140 File 6 : empty Pass 3 File 1: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153 File 2, File 3, File 4, File 5, File 6: empty Algorithm Analysis (1) Amount of space occupied k input files (File_1, File_2, …, File_k) k output files (File_k+1, File_k+2, …, File 2k) 2k = total no. f files needed Ease in the implementation of the algorithm More complicated than a (balanced) 2-way sort merge Requires more files Merging becomes more complicated Algorithm Analysis (2) Speed of the algorithm Let NR be the number of runs initially If NR = 1, Total Passes = 1 (Sort Phase only) Suppose NR > 1. Total Passes = ? logk NR ? + 1 Algorithm Analysis (3) Assume NR is a power of k. That is NR = kj The sort phase takes 1 pass: sorts each run, but does not reduce the number of runs. Each execution of merge phase is composed of a merge step and a distribution step.

It divides the number of runs by 2 Until there is only 1 run. The number of divisions to go from k j to 1 is j. So the number of merges is j = logk NR Algorithm Analysis (4) 1 2 3 4 5 6 7 8 9 Merge 1 Exercise 1 Fill in the following table with NR = 100 NR = 100 2-Way Sort Merge 3 14 Balanced 2-Way Balanced 3-Way Balanced 4-Way Merge 2 No. of Files Used And the total number of passes is j + 1 = (logk NR) + 1, including the one for sort phase If NR is not a power of 2, the number of passes is ? logk NR? + 1 When NR=5 and k=3, requires 3 passes instead of 4 Balanced 2-way) 4 8 6 6 8 5 Total No. of Passes Question: What conclusion/s can you draw based on the above table. More Exercises Exercise 2: Using Balanced 3-way Sort Merge algorithm, sort the given master file with the following records. Assume that the size of the run is 3. Determine the total number of passes. File : 28 17 79 38 5 70 24 91 37 3 19 63 15 44 8 Exercise 3: Using Balanced 3-way Sort Merge algorithm, sort the given master file with the following records. Assume that the size of the run is 4. Determine the total number of passes.

File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80 Exercise 4: Using Balanced 4-way Sort Merge algorithm, sort the given master file with the following records. Assume that the size of the run is 3. Determine the total number of passes. File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80 Challenges What if each sorted run from the sort phase is distributed to a separate file and all such files are merged into one output file. What are the implications ? What factors make this approach possible? impossible?

There are main memory and number of file devices limitations How do you implement a k-way merge efficiently if k > 2 ? If k is large, use priority queues CS101 (or an advanced CS101 course) The realistic sort/merge situation is somewhere between the basic balanced two-way sort merge, and the idealistic balanced k-way sort/merge, which uses k input files for k runs and merges to one output file. Polyphase Sort Merge Overview : Balanced 2-way sort/merge closer analysis : Suppose the number of runs in a pass does not divide evenly among the 2 files.

The last pair of runs to be merged is trivial (i. e. a copy) as one of the files is already empty. Possible improvement : reduce this trivial merge Another possible improvement : reduce the number of files required For a balanced k-way sort/merge, uneven distribution cause merging of less than k runs. Polyphase Improvements In the merge phase, do not distribute the merges into several files – just send them to a (k+1)st file. When an input file becomes empty, discontinue the previous merge phase.

Instead, merge the (k-1) possibly non-empty file(s) with the (k+1)st file into the empty file. Perform this repeatedly each time a file becomes empty, until there is only one nonempty file containing one long sorted run. The name “Polyphase” is attributed to the many phases of the merging process to sort the records. The Sort Phase Phase 1 : The Sort Phase Divide the records of a file into several runs, internal sort the records in a run, and distribute the runs to k external files using a distribution special to Polyphase Sort Merge. The Merge Phase

Phase 2 : The Merge Phase while not all records are in one long run repeat merge the next k-tuple of runs, one from each of the k input files, into the output (k+1)st file until an input file becomes empty reassign the empty file as the next output file and the other k files as input files. Copy or assign the file that contains the one long run to the desired output file. CJD CJD Algorithm Simulation (1) File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80 File Size = 15 records Size of Run = 3 initially Number of Runs = 5 initially Algorithm Simulation (2)

Merge Phase : Pass 2 File 1, File 2: empty File 3: 10 36 50 95 100 110 – 22 80 140 (trivial merge : a copy) File 4: 40 60 70 120 130 153 Pass 3 File 1: 10 36 40 50 60 70 95 100 110 120 130 153 File 2: 22 80 140 (another trivial merge : a copy) File 3, File 4: empty Pass 4 File 1, File 2: empty File 3: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153 File 4: empty CJD CJD Recall : With a balanced 2-way sort merge : Sort Phase : Pass 1 50 110 95 – 10 100 36 – 153 40 120 – 60 70 130 – 22 140 80 File 1: 50 95 110 – 40 120 153 – 22 80 140 File 2: 10 36 100 – 60 70 130 File 3, File 4: empty

Algorithm Simulation (3) With Polyphase sort merge : Sort Phase : 50 110 95 – 10 100 36 – 153 40 120 – 60 70 130 – 22 140 80 (Assume 1 block = 3 records) (after 5 blocks passed) File 1: 50 95 110 – 40 120 153 – 22 80 140 File 2: 10 36 100 – 60 70 130 File 3: empty Algorithm Simulation (4) Merge Phase : (after 2 run merges, 4 more blocks passed = 9 total blocks passed) File 1: 22 80 140 File 2: empty File 3: 10 36 50 95 100 110 – 40 60 70 120 130 153 (after 1 run merge, 3 more blocks passed = 12 total blocks passed)

File 1: empty File 2: 10 22 36 50 80 95 100 110 140 File 3: 40 60 70 120 130 153 (after 1 run merge, 5 more blocks passed = 17 total blocks passed) File 1: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153 File 2: empty File 3: empty CJD CJD Algorithm Simulation Summary 17 total blocks / 5 blocks per file pass = 3. 4 passes with Polyphase 2-way Sort Merge compared to 4 passes with balanced 2-way sort merge Summary Number of Runs in … File 1 Sort Phase Merge Phase 1 Merge Phase 2 Merge Phase 3 3 1 0 1 File 2 2 0 1 0 File 3 0 2 1 0 CJD

Another Run Distribution Consider Polyphase 3-way sort merge with NR = 17 and distribute the initial runs 7-6-4 into files 1 to 3. Summary Sort Phase Merge Phase 1 Merge Phase 2 Merge Phase 3 Merge Phase 4 Level 4 3 2 1 0 Number of Number of Runs in … merges File 1 File 2 File 3 File 4 0 7 6 4 0 4 3 2 0 4 2 1 0 2 2 1 1 0 1 1 0 1 0 1 0 Observe that by distributing the initial runs 17=7-6-4, at most only one file becomes empty after each merge, except the last. This is because 17 has a perfect 3rd order Fibonacci distribution of 7-6-4. CJD

Fibonacci Sequence 2nd order Fibonacci sequence F ( 2) 0 Polyphase Fibonacci Distrib (ex1) In first example, NR = 5 For k=2, using 3 files = 2 input files + 1 output file The recommended distribution of NR=5 over two input files is to use 2 nd order Fibonacci distribution. Summary Sort Phase Merge Phase 1 Merge Phase 2 Merge Phase 3 Level (n) 3 2 1 0 = 0, F1( 2 ) = 1, ) Fn( 2 ) = Fn(? 21) + Fn(? 22 , {Fi ( 2 ) } = {0,1,1,2,3,5,8,13,21,… } 3rd order Fibonacci sequence ) 3) F0( 3) = 0, F1(3 ) = 0, F2( 3) = 1, Fn(3 ) = Fn(? 31) + Fn(? 2 + Fn(? 3 , {Fi (3 ) } = {0,0,1,1,2,4,7,13,24, 44… } Number of merges 0 2 1 1 Number of Runs in … File 1 3 1 0 1 File 2 2 0 1 0 File 3 0 2 1 0 kth order Fibonacci sequence Fi ( k ) = 0, ? i < k ? 1, k k Fk(? 1) = 1, Fn( k ) = Fn(? 1) + Fn(? k2) + L + Fn(? kk) The ith largest file on the nth level (n>0) initially contains the following number of runs : k k k Fn(+ k) ? 2 + Fn(+ k) ? 3 + L + Fn(+ i)? 2 CJD the number of runs on ith largest file on the nth level is : Fn(+kk) ? 2 + Fn(+kk) ? 3 + L + Fn(+ki)? 2 = Fn( 2) + L + Fn(+2i)? which means largest is Fn+Fn-1 and 2 nd largest is Fn. CJD Polyphase Fibonacci Distrib (ex2) In second example, NR=17 and k=3, the number of runs on the ith largest file on the nth level (n>0) initially contains Imperfect NR If NR is not perfect, add dummy runs to make it perfect. This is done during the Sort Phase. Where? Some say distribute them either at the end or beginning of each file F level (n) 5 4 3 2 1 0 (k ) n+k ? 2 +F (k ) n+ k ? 3 +L+ F (k ) n+i ? 2 Largest File (i=1) 13=7+4+2 7=4+2+1 4=2+1+1 2=1+1+0 1=1+0+0 1 nd Largest File (i=2) 11=7+4 6=4+2 3=2+1 2=1+1 1=1+0 0 3rd largest File (i=3) 7=7 4=4 2=2 1=1 1=1 0 Runs (perfect run sizes for k=3) 31=13+11+7 17=7+6+4 9=4+3+2 5=2+2+1 3=1+1+1 1 Exercise : build the table for k=4 from level 1 to level 5 CJD CJD Perfect Fibonacci Numbers Let n = level number an, bn, cn = decreasing order of sizes of non-empty files at level n for k=3, 3) c n = Fn(+1 3 bn = Fn(+1) + Fn(3) 3 3 3) an = Fn(+1) + Fn(3) + Fn(? 1) = Fn(+2 Perfect Polyphase Distributions level (n) n n+1 0 1 2 3 4 5 n+1 CJD

Largest 2nd 3rd 4th File File Largest Largest (Empty (i=1) File (i=2) File (i=3) File) an bn cn dn an+bn 1 1 2 4 7 13 an+bn an+cn 0 1 2 3 6 11 an+cn an 0 1 1 2 4 7 an+dn 0 0 0 0 0 0 0 0 Runs (perfect run sizes for k=3) tn tn+2an 1 3 5 9 17 31 tn+2an CJD t n = an + bn + c n 3) c n +1 = Fn(+ 2 = a n + 0 3) 3 bn +1 = Fn(+2 + Fn(+1) = an + cn 3) 3 an +1 = Fn(+ 2 + Fn(+1) + Fn( 3) = an + bn t n +1 = an +1 + bn +1 + c n +1 = 3a n + bn + cn = (an + bn + c n ) + 2an = t n + 2a n Algorithm Analysis (1) Amount of space occupied + 1 files for polyphase k-way sort merge vs 2k files for balanced k-way sort merge Algorithm Analysis (2) Speed of the algorithm According to theoretical computations by Knuth, Note : logk x = ln x / ln k Number of Passes for NR = 100 Polyphase Balanced k-way k-way 7. 90 8 5. 66 6 4. 88 5 4. 54 4 4. 01 4 Ease in the implementation of the algorithm Polyphase adds more complication to the balanced kway algorithm. The sort phase must distribute the runs according to the Fibonacci perfect distribution, adding dummy runs when necessary.

Each merge phase iteration may not run its full course due to some files becoming empty, thereby making it a little more difficult to trace the algorithm. CJD k 2 3 4 5 8 Polyphase k-way 1. 50 1. 02 0. 86 0. 80 0. 73 ln NR ln NR ln NR ln NR ln NR + + + + + 0. 99 0. 96 0. 92 0. 86 0. 65 Balanced k-way ? log2 NR? + 1 ? log3 NR? + 1 ? log4 NR? + 1 ? log5 NR? + 1 ? log8 NR? + 1 CJD Algorithm Analysis (3) Speed of the algorithm Runs best when NR has a perfect kth order Fibonacci distribution For small k, Polyphase Sort Merge performs better than Balanced k-Way Sort Merge Why ?

Because trivial copies are minimized. For large k, Balanced k-Way Sort Merge beats Polyphase Sort Merge. Why ? Because with more files, the merged runs are distributed to more files. Summary of Analyses Comparison : Algorithm Space Time (# of (# of Passes) Files) 3 2 * ? log2 NR? ? 4 1 + ? log2 NR? ? 2k 1 + ? logk NR? ? 3 4 5 1. 50 ln NR + 0. 99 1. 02 ln NR + 0. 96 0. 86 ln NR + 0. 92 # of Passes (NR=100) 14 8 (6 if k=3; 5 if k=4) 7. 90 5. 66 4. 88 2-way Balanced 2-way Balanced k-way Polyphase 2-way Polyphase 3-way Polyphase 4-way CJD CJD Impact of Devices

Device Impact on External Sorts The sort time is of course highly influenced by the secondary storage device being used. Tapes require to be rewound between passes. On disk, all files may reside on the same disk but has more overhead because of seek time and latency time as the head(s) switch from file to file. If possible, store the files on separate disks. This allows I/O to overlap and run in parallel. If a disk is dedicated to a file, it will reduce seek time and latency time. Further complications arise in a multi-user environment. CJD End CJD