Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!cis.ohio-state.edu!ucbvax!pasteur!dog.ee.lbl.gov!elf.ee.lbl.gov!torek From: torek@elf.ee.lbl.gov (Chris Torek) Newsgroups: comp.lang.c Subject: Re: hash function Message-ID: <14583@dog.ee.lbl.gov> Date: 21 Jun 91 22:37:59 GMT References: <14491.28619071@stjhmc.fidonet.org> Reply-To: torek@elf.ee.lbl.gov (Chris Torek) Organization: Lawrence Berkeley Laboratory, Berkeley Lines: 43 X-Local-Date: Fri, 21 Jun 91 15:37:59 PDT In an article whose referent was deleted by bogus fidonet software (and whose tabs were changed to spaces), I provided this example hashing function: >> hash = 0; >> for (each character) >> hash = (hash * 33) + (the character); >> hash_index = hash & ((some power of two) - 1); In article <14491.28619071@stjhmc.fidonet.org> Dave.Harris@f14.n15.z1.fidonet.org (Dave Harris) writes: >Is the *33 use to keep short sequences has key more random or does it >really help in the distribution of large string too? I am not sure what you are asking here. There are basically three questions you can ask about a hash function: a) Is it fast enough? (complicated functions often take longer than the searches they replace) b) Does it produce a good distribution? (the hash function h(input)=0 is very fast but produces horrible distributions) c) If (b) is true, why does it work? The first two can really only be answered in some specific context. In all the contexts I have tried, this one answers `yes' to both (which is why I recommend it). The third question is the hardest. What *does* that 33 do? I have no idea. All I know is that I have experimented with `easy' functions (functions for which `fast enough' should generally be true) and this one also produces a good distribution. I doubt I still have the mail, but when Margo Seltzer was writing the database code to replace ndbm, she tried a bunch of different functions on a bunch of different data. This one usually worked well---it did not always give the best distributions, but it never performed poorly; some functions that gave better distributions on some data gave much worse on other. Maybe there are some theory people out there who can explain it. Probably some staring at Knuth vol. 3 would help. -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov