Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!apple!sun-barr!cs.utexas.edu!uunet!nuchat!moray!siswat!buck From: buck@siswat.UUCP (A. Lester Buck) Newsgroups: comp.lang.c Subject: Re: sort Summary: use replacement selection for initial runs in external sorts Message-ID: <425@siswat.UUCP> Date: 8 Jul 89 05:20:39 GMT References: <166@stsim.UUCP> <4700040@m.cs.uiuc.edu> <770@ocsmd.ocs.com> <10488@smoke.BRL.MIL> Organization: Photon Graphics, Houston Lines: 83 In article <10488@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn) writes: > In article <422@siswat.UUCP> buck@siswat.UUCP (A. Lester Buck) writes: > >But this [merge sorting with replacement selection] is definitely > >_not_ what Unix sort does, using qsort(3) for all sorts. > > As of SVR2, the UNIX "sort" utility definitely did perform n-way merge > sorting with replacement selection. It used a binary tree for each of > the internal runs being merged, inserting a fresh record from an > external merge input file after each merged record was output. A > quicksort algorithm (not qsort() from the C library, however) was used > only to produce the initial internal runs. > > When did this change? I was referring to replacement selection as the method to generate the initial runs, not as the method for the n-way merge. But these initial runs are not "internal," as the original poster's problem called for an external (tape or disk) sort. In replacement selection, a tournament tree is maintained of losers in the sort comparison, so a record can either be output in the current run or saved in the tree. When the tree is filled with losers, then a new run is started. These runs then go into the obvious n-way merge algorithm to be merged at the highest order possible, limited only by the number of open files. Replacement selection generates, on the average, runs that are twice as long as a simple in-core sort, thus allowing a lower merge order and possibly saving one or more input/output passes through all the data. On partially sorted data, the runs can be much longer. Knuth Vol. 3 gives a cute pictorial representation of the effective doubling of the run length using a snowplow traveling in a circle during a snow storm. Before I wrote my own custom tape sort with replacement selection, I was using the SVR2 sort. I was sorting ~160 MB tapes. It clearly was doing a simple in-core sort, because all the intermediate runs were exactly the same length (~1.2 MB in my case, for the initial set of runs). [I have also glanced at the source since then, and verified that it was doing a quicksort type sort in core (though not qsort(3), as Doug reminds me). I do not remember seeing any support for a replacement selection tree, just a simple n-way comparison with no memory.] I also had to make a binary patch to sort to have it use a huge scratch filesystem, rather than /tmp or whatever by default. Only later did I learn of the undocumented switch to add a temporary space directory. In addition, different manuals described different SVR2 sort options for the amount of memory to use, none of which seemed to do anything on my version. (The program could malloc() lots of virtual memory, but we would only want to use the available physical RAM.) Replacement selection can also be used in the n-way merge of initial runs, but the savings are marginal and usually not worth the trouble. For the initial runs, the order of merge P is effectively the number of records that will fit in RAM (real RAM, not virtual memory), while for the n-way merge, the order cannot exceed the number of free file descriptors, which is about 15 in Unix. Knuth suggests the breakeven in maintaining the comparison tree is order P ~ 8. For less than this, it is faster to do P-1 comparisons for each record. The version of SVR2 sort I employed did not appear to use replacement selection for any part of the sort/merge (and only merged at order 10 or 11, if I remember), and all the intermediate merge files were always the same fixed length (except for the last partial file). From "File Structures: A Conceptual Toolkit" by Folk and Zoellick (p. 6): A few months ago one of us was interviewing an applicant for a position that involved working on a large information retrieval problem. Over lunch we told the applicant about the difficulties we had finding a sort package suitable for sorting the hundreds of millions of records involved in the index structures for the system we were working on. The applicant asked, "I suppose you use Quicksort for that?" His question can serve as a nice illustration of the difference between a typical _data structure_ course and the study of _file structures_. [...] Unfortunately, he was also showing us that he had not had the opportunity to take a file structure course. A file consisting of hundreds of millions of records is much too big to load into RAM for sorting. The sorting must involve use of secondary storage. That means that we are no longer in a _random access_ environment; the cost of getting some pieces of information is now enormously more expensive than the cost of getting other information. [...] A later chapter gives careful consideration to the limitiations of secondary storage devices in the construction of a sorting algorithm that is very different than Quicksort. -- A. Lester Buck ...!texbell!moray!siswat!buck