Path: utzoo!attcan!uunet!algor2!jeffrey
From: jeffrey@algor2.UUCP (Jeffrey Kegler)
Newsgroups: comp.unix.wizards
Subject: Re: Algorithm needed: reading/writing a large file
Summary: Suggested algorithm
Message-ID: <470@algor2.UUCP>
Date: 9 Jul 89 06:05:41 GMT
References: <205@larry.sal.wisc.edu> <8137@bsu-cs.bsu.edu> <207@larry.sal.wisc.edu>
Reply-To: jeffrey@algor2.UUCP (Jeffrey Kegler)
Organization: Algorists, Inc., Reston VA
Lines: 54

In article <207@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu.UUCP (Jeffrey W Percival) writes:

=> Please be careful with your paraphrasing.

I certainly promise to try.

=>  My question was about optimizing the process of rearranging a disk file
=> according to a *given* mapping.

Jeffrey P. (no relation) had implemented a suggestion made in the AT&T
Bell Laboratories Technical Journal, by J. P. Linderman, p. 258.  He
extracted the keys and record locations of an unsorted file (call it
U), sorted them, and them constructed the sorted file (call it S),
only to find the random seeks involved in the last phase horrifyingly
slow.

=> One helpful person suggested reading sequentially and writing randomly,
=> rather than vice-versa,

That would have been my first thought.

=> and I tried that but it didn't help.  I guess
=> the benefit gained from using the input stream buffering was canceled
=> out by the effective loss of the output stream buffering.

Oh well.

As a second try, allocate enough memory for N full length records, and
two arrays to be sorted together of N keys and N record locations.  Go
through the sorted keys and find the key and locations in U of the
first N records in the *sorted* file.  Sort them by record location in
U, the unsorted file, and read them in, in order by location in U,
writing them in memory in sorted order by key in the array of full
length records.  Then write those records out.  Repeat until all
records are written.  This will involve 1 sequential pass to write
file S, and M/N sequential passes to read file U.

A further improvement is to calculate how many sequential reads cost
the same as a random seek.  Call that ratio R.  Whenever performing
the algorithm above would require more than R sequential reads (this
is easily determined from the difference in the record locations),
perform a seek.

My guess at R for UNIX is around 2.5 times number of records per block.
Obviously the larger N the better this will work.  Note your original
try is this algorithm in the special case where N is 1.  If we could
run this algorithm in terms of physical disk block instead of logical
file location, this algorithm could really hum.

Further optimizations suggest themselves, but enough already.-- 

Jeffrey Kegler, President, Algorists,
jeffrey@algor2.UU.NET or uunet!algor2!jeffrey
1762 Wainwright DR, Reston VA 22090