Path: utzoo!attcan!uunet!algor2!jeffrey From: jeffrey@algor2.UUCP (Jeffrey Kegler) Newsgroups: comp.unix.wizards Subject: Re: Algorithm needed: reading/writing a large file Summary: Suggested algorithm Message-ID: <470@algor2.UUCP> Date: 9 Jul 89 06:05:41 GMT References: <205@larry.sal.wisc.edu> <8137@bsu-cs.bsu.edu> <207@larry.sal.wisc.edu> Reply-To: jeffrey@algor2.UUCP (Jeffrey Kegler) Organization: Algorists, Inc., Reston VA Lines: 54 In article <207@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu.UUCP (Jeffrey W Percival) writes: => Please be careful with your paraphrasing. I certainly promise to try. => My question was about optimizing the process of rearranging a disk file => according to a *given* mapping. Jeffrey P. (no relation) had implemented a suggestion made in the AT&T Bell Laboratories Technical Journal, by J. P. Linderman, p. 258. He extracted the keys and record locations of an unsorted file (call it U), sorted them, and them constructed the sorted file (call it S), only to find the random seeks involved in the last phase horrifyingly slow. => One helpful person suggested reading sequentially and writing randomly, => rather than vice-versa, That would have been my first thought. => and I tried that but it didn't help. I guess => the benefit gained from using the input stream buffering was canceled => out by the effective loss of the output stream buffering. Oh well. As a second try, allocate enough memory for N full length records, and two arrays to be sorted together of N keys and N record locations. Go through the sorted keys and find the key and locations in U of the first N records in the *sorted* file. Sort them by record location in U, the unsorted file, and read them in, in order by location in U, writing them in memory in sorted order by key in the array of full length records. Then write those records out. Repeat until all records are written. This will involve 1 sequential pass to write file S, and M/N sequential passes to read file U. A further improvement is to calculate how many sequential reads cost the same as a random seek. Call that ratio R. Whenever performing the algorithm above would require more than R sequential reads (this is easily determined from the difference in the record locations), perform a seek. My guess at R for UNIX is around 2.5 times number of records per block. Obviously the larger N the better this will work. Note your original try is this algorithm in the special case where N is 1. If we could run this algorithm in terms of physical disk block instead of logical file location, this algorithm could really hum. Further optimizations suggest themselves, but enough already.-- Jeffrey Kegler, President, Algorists, jeffrey@algor2.UU.NET or uunet!algor2!jeffrey 1762 Wainwright DR, Reston VA 22090