Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!tut.cis.ohio-state.edu!gem.mps.ohio-state.edu!ginosko!uunet!nuchat!moray!siswat!buck From: buck@siswat.UUCP (A. Lester Buck) Newsgroups: comp.unix.wizards Subject: Re: Algorithm needed: reading/writing a large file Message-ID: <427@siswat.UUCP> Date: 13 Jul 89 15:41:33 GMT References: <205@larry.sal.wisc.edu> <8137@bsu-cs.bsu.edu> <207@larry.sal.wisc.edu> Organization: Photon Graphics, Houston Lines: 59 In article <207@larry.sal.wisc.edu>, jwp@larry.sal.wisc.edu (Jeffrey W Percival) writes: > In article <8137@bsu-cs.bsu.edu> dhesi@bsu-cs.bsu.edu (Rahul Dhesi) writes: > >In article <205@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu (Jeffrey W > >Percival) writes: > >[how do I sort a large file that won't fit in memory?] > >There are many variations on the merge sort. Here is a simple one: > > Please be careful with your paraphrasing. I did not ask how to sort a > large file that won't fit in memory. I said I already had a fast and > efficient sort and that the sorting was already known. My question was > about optimizing the process of rearranging a disk file according to a > *given* mapping. > -- > Jeff Percival (jwp@larry.sal.wisc.edu) I was one who emailed about basic external sorting. I don't understand what you are saying above. If your problem is to sort a large file that is on external storage, then the classical methods are what you want. If you somehow have a problem where you are handed the already sorted list of pointers to records and you want to go from that stage, that is _maybe_ a different problem - I don't know because that sounds extrememly contrived. It is pointed out in many places, including Knuth Vol. 3, that it gains you next to *nothing* to get a sorted table of pointers to records when what you almost always want is the records themselves sorted. "Fast and efficient sorts" that don't order the records themselves are not of much advantage. You are just delaying the real pain of moving the records. From Knuth Vol. 3, p. 374: The external rearrangement problem which remains after [key sorting] seems trivial, at first glance; but in fact it is rather difficult, and no really good algorithms (significantly better than sorting) have yet been found. We could obviously do the rearrangement in N steps, moving one record at a time; for large enough N this is better than the N(log N) of a sorting method. But N is never that large, and N seeks are unthinkable! [...] But it seems wasteful to do a key sort only to follow it by a full sort! [Goes on to discuss a nontrivial lower bound on the number of seeks required to rearrange records on a disk file.] The only real gain of sorting the keys and not the records themselves is when you _don't_ have to move the data, as in an indexed file application. You still have to physically sort the records once if you require reasonable sequential access. I remember reading once that an estimated 40% of all mainframe cycles in the world were spent on sorting. I am sure we would all be interested in hearing about any special case you might have where your keysort is completely free, but I doubt whether you will be able to do much better than the "classical" sort/merge with replacement selection. If you do end up with a good method, you have a highly publishable (and patentable!) piece of work. -- A. Lester Buck ...!texbell!moray!siswat!buck