Path: utzoo!attcan!uunet!lll-winken!ames!mailrus!ukma!uflorida!novavax!twwells!bill From: bill@twwells.uucp (T. William Wells) Newsgroups: comp.mail.uucp Subject: Re: Making /usr/lib/uucp/paths a _binary_ database Keywords: pathalias UUCP smail Message-ID: <307@twwells.uucp> Date: 8 Jan 89 08:13:31 GMT References: <486@fallst.UUCP> <490@gonzo.UUCP> Reply-To: bill@twwells.UUCP (T. William Wells) Organization: None, Ft. Lauderdale Lines: 63 Summary: Expires: Sender: Followup-To: Distribution: In article <490@gonzo.UUCP> daveb@gonzo.UUCP (Dave Brower) writes: : In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes: : >I'm not very proficient at C, but it seems that since 'paths' is : >essentially a two-field set of records it ought not be terribly : >hard to make 'pathalias' generate a binary database file. : : The first question is "what you would gain?" The answer is: I designed a compression method just for the paths database. While I haven't actually implemented it, my back-of-the-envelope estimate of the compressed size of the database is 175K, as compared to the 873K it currently takes (on my system). : A binary format that was significantly smaller would also be much more : cpu and disk i/o intensive to access. Only if one isn't very clever about it. Here: There are two databases. The first database is an index of addresses/location in the second database. The second database contains entries of: link to predecessor/element of route. You use it as follows: look up the address in the first database. This gives you a location in the second database. Decode the element of the route; that becomes the last element of the route to that address. Follow the predecessor link and decode the route element; that becomes the second-to-last element of the route. Repeat this till the predecessor link points to itself. You also do one other trick: you have two kinds of predecssor links, one that points within the current block and a longer one that points to other blocks. Of course, you arrange things so that most predecessor pointers point within the current block. Oh yes, the names are Huffmann coded before being put into the database. (Each name separately.) Disk accesses? The first database is about 80K; binary searching it with a .5K block size would require about 6 disk accesses. (Actually, there are better ways; consider using the Huffmann code as a hash. Another way uses the first bits of the Huffmann code to index a table of pointers into the first database, each bin of the first database contains only the rest of the code to represent the address. Either should work well.) The second database gets one read per element of a path; however, several of the reads come from the same block. Assuming an average path length of 9 and 3 levels of the tree per block, this is an additional 3 accesses. The total is 9 accesses. (If you use a hash or something similar, this is likely to be about 5.) For the existing database, the number of accesses is 11. CPU time? The original address needs to be Huffmann coded; each element of the path needs to be decoded. The links need to be bit-extracted from the database. These are all trivial, in that about 3 path decompressions + link extractions need to be done per disk access. --- Bill { uunet!proxftl | novavax } !twwells!bill