Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!sun-barr!cs.utexas.edu!pp!milano!bigtex!mybest!moray!siswat!buck From: buck@siswat.UUCP (A. Lester Buck) Newsgroups: comp.lang.c Subject: Re: sort Summary: use replacement selection Message-ID: <422@siswat.UUCP> Date: 2 Jul 89 15:00:52 GMT References: <166@stsim.UUCP> <4700040@m.cs.uiuc.edu> <770@ocsmd.ocs.com> Organization: Photon Graphics, Houston Lines: 42 In article <770@ocsmd.ocs.com>, glenn@ocsmd.ocs.com (glenn ford) writes: > Let me explain to the world what I am currently doing. At our work we > do alot of B-tree builds which require a sorted text file as input > to sort. Thus the need for something fast, since the text file > to sort can be up to 250 megabytes with about 12-15 million records > to sort. I currently run on a 6310 (VAX) that is about 5 mips rating > with 780 at 1 mip. Now the data is totally random and is spread > across the alphabet pretty well. Only requirements is that it be > CASE sensitive (yes, it makes things REAL slow then!) and that > you have a prefix offset. Now I can sort about 1.5 megabytes per > minute. I use ONLY quicksort in memory and if I don't have enough > memory I either a)split the file into multiple smaller files, then > sort and append to output file or b) I call sortmerge (VAX sort) > which takes care of those problems. I usually use case a unless I > have allready split the file. I only split the original file, and > I split it into 28 seperate files. > Now, is there anything that can beat quicksort? Sure. Once you can't fit everything into memory, you have a sort/merge problem. I had the same problem of sorting magnetic tapes full of fixed length records, up to about 160-180 MBytes. I only have one tape drive, so I, of course, have to simulate other tape drives on disk. If you do the in-core sort with quicksort, you generate fixed length runs to merge. Instead, you should use a replacement selection sorting algorithm for the in-core sort, which, on the average, generates runs twice as long. (Much longer if the data is nearly sorted, which mine sometimes is.) This saves a lot of tape handling in the merge phase. (Or a lot of seeking on disk.) I would hope that this is what the VAX sortmerge is doing. But this is definitely _not_ what Unix sort does, using qsort(3) for all sorts. Just remember, if even one merge pass can be eliminated by longer runs, then that means the entire file (~200 MBytes) needs to be input one less time. I had to write my own routines tuned for my application, but the algorithms are described very clearly in several places. I used Baase, "Computer Algorithms", and Knuth Vol. 3. A general discussion in Folk and Zoellick, "File Structures: A Conceptual Toolkit" covers various sorting problems, in-core and on disk. -- A. Lester Buck ...!texbell!moray!siswat!buck