Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!cis.ohio-state.edu!ucbvax!pasteur!dog.ee.lbl.gov!elf.ee.lbl.gov!torek
From: torek@elf.ee.lbl.gov (Chris Torek)
Newsgroups: comp.lang.c
Subject: Re: hash function
Message-ID: <14583@dog.ee.lbl.gov>
Date: 21 Jun 91 22:37:59 GMT
References: <14491.28619071@stjhmc.fidonet.org>
Reply-To: torek@elf.ee.lbl.gov (Chris Torek)
Organization: Lawrence Berkeley Laboratory, Berkeley
Lines: 43
X-Local-Date: Fri, 21 Jun 91 15:37:59 PDT

In an article whose referent was deleted by bogus fidonet software
(and whose tabs were changed to spaces), I provided this example
hashing function:
>>	hash = 0;
>>	for (each character)
>>		hash = (hash * 33) + (the character);
>>	hash_index = hash & ((some power of two) - 1);

In article <14491.28619071@stjhmc.fidonet.org>
Dave.Harris@f14.n15.z1.fidonet.org (Dave Harris) writes:
>Is the *33 use to keep short sequences has key more random or does it
>really help in the distribution of large string too?

I am not sure what you are asking here.  There are basically three
questions you can ask about a hash function:

 a) Is it fast enough?  (complicated functions often take longer than
    the searches they replace)

 b) Does it produce a good distribution?  (the hash function
    h(input)=0 is very fast but produces horrible distributions)

 c) If (b) is true, why does it work?

The first two can really only be answered in some specific context.
In all the contexts I have tried, this one answers `yes' to both
(which is why I recommend it).

The third question is the hardest.  What *does* that 33 do?  I have no
idea.  All I know is that I have experimented with `easy' functions
(functions for which `fast enough' should generally be true) and this
one also produces a good distribution.  I doubt I still have the mail,
but when Margo Seltzer was writing the database code to replace ndbm,
she tried a bunch of different functions on a bunch of different data.
This one usually worked well---it did not always give the best
distributions, but it never performed poorly; some functions that gave
better distributions on some data gave much worse on other.

Maybe there are some theory people out there who can explain it.
Probably some staring at Knuth vol. 3 would help.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov