Xref: utzoo alt.folklore.computers:9107 comp.periphs:3399 Path: utzoo!utgpu!cs.utexas.edu!usc!samsung!think.com!hsdndev!spdcc!iecc!johnl From: johnl@iecc.cambridge.ma.us (John R. Levine) Newsgroups: alt.folklore.computers,comp.periphs Subject: Re: External Sorting Keywords: Sorting, Tapes Message-ID: <1991Jan28.181354.11095@iecc.cambridge.ma.us> Date: 28 Jan 91 18:13:54 GMT References: <1991Jan28.031017.19886@comp.vuw.ac.nz> Organization: I.E.C.C. Lines: 43 In article <1991Jan28.031017.19886@comp.vuw.ac.nz> robert@comp.vuw.ac.nz (Robert Biddle) writes: >Is External Sorting done much anymore? It sure is. The traditional rule of thumb is that 1/3 of all commercial computing is sorting, and I see no reason to think that is any less true now than it used to be. The Unix "sort" command which is respectable but not stellar in its performance gets plenty of use on most systems. External sorts seem all to be variations on the same basic plan, in which you put sorted chunks of the input into external storage, then merge those chunks together. These days, the external storage is invariably disks rather than tapes, but that doesn't to affect the algorithms much, since reading a disk sequentially is much faster than reading it randomly. IBM mainframes and their operating systems have lots of features that are specifically for speeding up sorts. For example, under MVS you can start read operations on several files, then process the records in whatever order the physical reads happen to finish. This is useful for merging files together. Also, recent 370 CPUs have microcoded instructions that accelerate the inner comparison loops. Mainframe tape drives have always had the ability to read tapes backwards, since an oscillating sort that merges alternate passes forward and backwards avoids time-consuming rewinds between passes. It's also worth noting that since sorting is a lengthy and repetitive task, there are all sorts of performance hacks commonly done in sort programs that are rare elsewhere. In all serious sort programs, the sort specification (i.e. what kind of keys, and where in the record) is compiled into a machine code comparison routine, one of the earliest and still most important uses of code modified at runtime. In many cases, the sort program is actually a driver with a library of routines tuned for different file sizes and media, and it links the comparison routine with library routines and user provided "exit routines" to produce a custom sort program for each sort job. These days I'm sure there are all sorts of hacks to improve paging performance, since sorting and merging have predictable access patterns that are often not LRU. Sorting is a very competitive business (the main competitors I'm aware of are Syncsort and IBM) and the details of their performance-enhancing tricks are jealously guarded trade secrets. -- John R. Levine, IECC, POB 349, Cambridge MA 02238, +1 617 864 9650 johnl@iecc.cambridge.ma.us, {ima|spdcc|world}!iecc!johnl " #(ps,#(rs))' " - L. P. Deutsch and C. N. Mooers