Path: utzoo!attcan!uunet!lll-winken!ames!mailrus!ukma!uflorida!novavax!twwells!bill
From: bill@twwells.uucp (T. William Wells)
Newsgroups: comp.mail.uucp
Subject: Re: Making /usr/lib/uucp/paths a _binary_ database
Keywords: pathalias UUCP smail
Message-ID: <307@twwells.uucp>
Date: 8 Jan 89 08:13:31 GMT
References: <486@fallst.UUCP> <490@gonzo.UUCP>
Reply-To: bill@twwells.UUCP (T. William Wells)
Organization: None, Ft. Lauderdale
Lines: 63
Summary:
Expires:
Sender:
Followup-To:
Distribution:

In article <490@gonzo.UUCP> daveb@gonzo.UUCP (Dave Brower) writes:
: In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes:
: >I'm not very proficient at C, but it seems that since 'paths' is
: >essentially a two-field set of records it ought not be terribly
: >hard to make 'pathalias' generate a binary database file.
:
: The first question is "what you would gain?"

The answer is:

I designed a compression method just for the paths database. While I
haven't actually implemented it, my back-of-the-envelope estimate of
the compressed size of the database is 175K, as compared to the 873K
it currently takes (on my system).

: A binary format that was significantly smaller would also be much more
: cpu and disk i/o intensive to access.

Only if one isn't very clever about it. Here:

There are two databases. The first database is an index of
addresses/location in the second database. The second database
contains entries of: link to predecessor/element of route.

You use it as follows: look up the address in the first database.
This gives you a location in the second database. Decode the element
of the route; that becomes the last element of the route to that
address. Follow the predecessor link and decode the route element;
that becomes the second-to-last element of the route. Repeat this
till the predecessor link points to itself.

You also do one other trick: you have two kinds of predecssor links,
one that points within the current block and a longer one that points
to other blocks. Of course, you arrange things so that most
predecessor pointers point within the current block.

Oh yes, the names are Huffmann coded before being put into the
database. (Each name separately.)

Disk accesses? The first database is about 80K; binary searching it
with a .5K block size would require about 6 disk accesses.
(Actually, there are better ways; consider using the Huffmann code as
a hash.  Another way uses the first bits of the Huffmann code to index
a table of pointers into the first database, each bin of the first
database contains only the rest of the code to represent the address.
Either should work well.) The second database gets one read per
element of a path; however, several of the reads come from the same
block. Assuming an average path length of 9 and 3 levels of the tree
per block, this is an additional 3 accesses. The total is 9 accesses.
(If you use a hash or something similar, this is likely to be about
5.)

For the existing database, the number of accesses is 11.

CPU time? The original address needs to be Huffmann coded; each
element of the path needs to be decoded. The links need to be
bit-extracted from the database. These are all trivial, in that about
3 path decompressions + link extractions need to be done per disk
access.

---
Bill
{ uunet!proxftl | novavax } !twwells!bill