Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!stanford.edu!agate!dog.ee.lbl.gov!elf.ee.lbl.gov!torek
From: torek@elf.ee.lbl.gov (Chris Torek)
Newsgroups: comp.lang.c
Subject: Re: hash function for mac needed
Message-ID: <14396@dog.ee.lbl.gov>
Date: 18 Jun 91 07:37:44 GMT
References: <14207.285B768A@stjhmc.fidonet.org>
Reply-To: torek@elf.ee.lbl.gov (Chris Torek)
Organization: Lawrence Berkeley Laboratory, Berkeley
Lines: 67
X-Local-Date: Tue, 18 Jun 91 00:37:44 PDT

In article <14207.285B768A@stjhmc.fidonet.org>
Dave.Harris@f14.n15.z1.fidonet.org (Dave Harris) writes:
>... The simple thing to do is to add all the characters up then mod by
>the size of your hash, then use that number to index into an array. ...
>You don't necessarly need to add all 255 characters together.  

Indeed: as it turns out, while many people have spent much time looking
for `good' hashing functions, fewer seem to have noted that many
hashing systems spend more time computing hashes than searching.  I
find that this is usually the case, and that a fast hash function is
better than a slower-but-better-distribution function.  One good hash
function, useful in many applications, is this:

	hash = 0;
	for (each character)
		hash = (hash * 33) + (the character);
	hash_index = hash & ((some power of two) - 1);

In C, this might be written as

	struct hashent *
	lookup(table, string)
		register struct hashtab *table;
		char *string;
	{
		register unsigned hash = 0;
		register char *p, c;
		register struct hashent *hp;

		for (p = string; (c = *p) != '\0'; p++)
			hash = (hash << 5) + hash + c;
		p = string;
		for (hp = table->ht_entries[hash & table->ht_mask];
		     hp != NULL; hp = hp->h_next)
			if (hp->h_hash == hash && strcmp(hp->h_key, p) == 0)
				return (hp);
		return (NULL);
	}

>To anybody else.  If I have n data elements, with a hash of size x, what is 
>considered a good number of searches based on n and x (i.e. ln(x/n)).

If you have a perfectly uniform distribution, each chain or hash
sequence will have n/x elements; assuming the probability of any
element lookup is identical, this is the best that can be done.  I do
not know what others consider `good', but a factor of 1.3 or so over
that seems reasonable to me (I cannot quite think why).

Typically neither assumption holds true.  There are too many ways to
take advantage of this to characterize here.  One thing that can be
done, however, is to use the above hash table scheme with expanding
tables.  As the number of entries increases, the hash table size can be
doubled, and table->ht_mask adds one bit:

	table->ht_mask = (table->ht_mask << 1) | 1;

Since the hash values of each <key,datum> pair are maintained, it is
easy to move the entries from their old hash slots to their new ones.
Doubling occurs O(log2 n) times if the doubling criterion is total
number of table entries; other doubling criteria, such as maximum chain
length, are possible but rarely worthwhile---one ends up spending more
time optimizing hashes than searching, just as with more complicated
hash functions.  Trading space for time this way seems to work well,
and you can select the amount of space by selecting the doubling factor.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov