Path: utzoo!utgpu!watserv1!watmath!att!att!linac!pacific.mps.ohio-state.edu!zaphod.mps.ohio-state.edu!wuarchive!udel!haven!mimsy!chris
From: chris@mimsy.umd.edu (Chris Torek)
Newsgroups: comp.unix.programmer
Subject: Re: ndbm won't cut it, what now?
Message-ID: <27794@mimsy.umd.edu>
Date: 18 Nov 90 17:53:53 GMT
References: <1990Nov11.111118.21856@cs.umn.edu> <2300@sixhub.UUCP> <1990Nov16.145542.22326@iecc.cambridge.ma.us>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 192

>In article <RANG.90Nov15214357@nexus.cs.wisc.edu> rang@cs.wisc.edu
>(Anton Rang) writes:
>>... I thought the original poster meant 'multiple keys' in the
>>sense of one record having several access paths to it.

(This was the impression I got as well.)

In article <1990Nov16.145542.22326@iecc.cambridge.ma.us>
johnl@iecc.cambridge.ma.us (John R. Levine) writes:
>You can fake it by making one key the primary key and storing the actual data
>in its record.  The secondary keys point to the primary key.  To keep sizes
>down you might want to make the primary key something small and boring like a
>three or four byte serial number. 
>
>Logically each key type is in a separate file, but you can combine them into
>one big dbm file using the prefix byte trick.  It's gross, but no grosser
>than what real data bases do every day.

You also need to store the `next serial number', so you have to reserve a
key for this as well.  This is indeed rather gross.

A few years ago I thought about this problem for a while, and came up with
a different idea based on the same hashing technique.  To explain it I need
to describe the idea behind dbm.

The way dbm works is to take each `key' string and turn it into a
32-bit number, something like a CRC-32 or a checksum.  This number is
not necessarily unique (two different strings will produce the same
32-bit value) but collisions should be rare, if the string-to-number
algorithm is chosen carefully.  In any case, the same string always
produces the same number.

Now to store a `datum', we turn its `key' into our 32 bit number.  This
number is taken as the block number in a file of blocks (typically 4096
bytes each).  To avoid starting out with 2^32 blocks, we ignore `some'
of the bits in this number.  Initially we ignore *all* the bits,
storing all key/datum pairs in block 0; as the database grows, we
ignore fewer bits.  We can leave a binary decision tree behind as a
record of when we stopped ignoring each bit.  This `tree' is simply a
string of bits, 0 where we are still ignoring, and 1 where we are
not.  Given a key hash `h', we start by saying `ignore all the bits',
then we ignore fewer bits as long as bit (h&mask)+mask is set:

	/*
	 * NB: the following is `off the top of my head'.
	 * It may contain errors.
	 */
	hash = calchash(key);
	for (hmask = 0; bit(map, (hash & hmask) + hmask) != 0;
	     hmask = (hmask << 1) | 1)
		/* void */;

Now if the key is anywhere in the database, it is in the block whose
number is (hash & hmask).  The datum for that key is stored immediately
after the key, so all we have to do is read that block and march
through every even-numbered entry, comparing against the key.  If there
is a match, return the corresponding odd-numbered entry.

Under this implementation, storing key=value pairs such as

	mimsy=128.8.128.8 mimsy.umd.edu
	mimsy.umd.edu=128.8.128.8 mimsy.umd.edu
	mimsy.cs.umd.edu=128.8.128.8 mimsy.umd.edu

(a name to number-and-official-name mapping as might be found in a host
number database) requires storing `128.8.128.8 mimsy.umd.edu' three
times---51 bytes, at 17 bytes each, counting the host number as 4 bytes.
Using John Levine's technique we might instead say

	Nmimsy=17
	Nmimsy.umd.edu=17
	Nmimsy.cs.umd.edu=17
	#17=128.8.128.8 mimsy.umd.edu

so that we can store `128.8.128.8 mimsy.umd.edu' only once, saving 34
bytes (and costing 1+4 per key plus 5 for the intermediate key; we wind
up with a net savings of 14 bytes---minus 2 more hidden inside the dbm
implementation, used when storing that intermediate key).  But this
means that to look up a name we have to say:

	*buf = 'N';
	strcpy(buf + 1, name);
	key.dptr = buf;
	key.dsize = strlen(buf);
	datum = fetch(key);
	if (datum.dptr != NULL) {
		*buf = '#';
		bcopy(datum.dptr, buf + 1, sizeof(int));
		key.dsize = 1 + sizeof(int);
		datum = fetch(key);
		if (datum.dptr == NULL)
			panic("lost target");
		printf("officially, %s = %.*s\n",
		    name, datum.dsize, datum.dptr);
	} else
		printf("%s does not exist\n", name);

There *is* another way.

Instead of storing keys and data in pairs, we can store each
individually.  In other words, we can dissociate them---put each in its
own block.  All we have to do is somehow tie one to the other
(otherwise the database is no good!).  The 32-bit hash value is almost
good enough, but not quite: two different strings can produce the same
hash.  But what if we augment it?

Instead of viewing the database as a series of `key/datum' pairs,
suppose we view it as a collection of `items'.  Each item has three
attributes:

 A. A number that is unique to this item within its block.  (Blocks
    are typically 4096 bytes or so, so there cannot possibly be more
    than 4096 or so items; a 16-bit number suffices.)  I called this
    an `index' for want of a better name; this particular index is
    the item's `in index'.  (`Key' might be a better name in some
    other context :-) .)  This exists so that the item can be a datum.

 B. A `link' record consisting of a 32-bit value and another index.
    This is only used if the item is considered a `key'.  The index
    here is an `out index'; it is zero if the item is not acting as
    a key.  The 32-bit value is an `out hash'.

 C. A `link count' telling how many times this item is a key or datum
    (it can only be a key once).

Now to store an item, whether as key or datum, we compute its hash and
find its block as above.  Then, if the item is a key (has a nonzero
`out index') and we are replacing its current content (link target),
we `unlink' its current target, store the new datum, and `link' this
key to the item produced here.  If the item is not a key and is being
made one, it simply acquires an `out index'.  If the item is a datum,
we just increment its link count.

To find a key's datum, we `follow the link': the out hash is the hash
value for the key's datum.  Just as with a key, this (plus the decision
bit tree) tells us exactly which block will contain the key's datum
(which in this case is certain to exist).  Since the item index values
are unique to each block, once we have the correct block all we need to
do is find the item with the same `in index' number that the key has in
its `out index'.

Thus, instead of

	Nmimsy=#17
	Nmimsy.umd.edu=#17
	Nmimsy.cs.umd.edu=#17
	#17=128.8.128.8 mimsy.umd.edu

we might have a database that looks like this:

	<block 1>:
	   <in=1, out=2, outhash=3ab197c4, links=1, "mimsy">
	<block 3>:
	   <in=3, out=2, outhash=3ab197c4, links=1, "mimsy.umd.edu">
	   <in=4, out=2, outhash=3ab197c4, links=1, "mimsy.cs.umd.edu">
	<block 4>:
	   <in=2, out=0, outhash=0, links=3, "128.8.128.8 mimsy.umd.edu">

More interestingly, if we create a new database and stick

	hello=hello

into it, we get:

	<block 0>:
	   <in=1, out=1, outhash=195e094c, links=2, "hello">

This key points to itself!

There is no problem with circular references, because a key does not
point to another `key', but rather to an `item'.  This item has two
references (itself as a key, and itself as a datum), not infinitely
many.  Storing hello=world causes the count to go to 1 (when the
as-datum link vanishes); removing `hello' as a key then makes the count
go to 0.

The main problem with this scheme is that, as with John Levine's, it
takes two database accesses to locate an item (though this time the
work is hidden from the programmer).  There is also a problem with the
index values.  As each block fills and splits, a new block is created
with some index values `used up', and it can become difficult to decide
on the next `available in index'.  I got around this by recording
minimum and maximum `in use' index numbers in each block header, so
that typically simply taking the next number suffices.  When the min
and max values reach 1 and 65535 (actually USHRT_MAX), however, it is
necessary to scan the block for a free number.  Also, this technique
imposes a fair amount of overhead (12 bytes per item for a size offset,
link count, two index values, and the out hash, versus 2 bytes per item
in dbm/ndbm for the size offset).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris