Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!spool.mu.edu!agate!dog.ee.lbl.gov!elf.ee.lbl.gov!torek From: torek@elf.ee.lbl.gov (Chris Torek) Newsgroups: comp.lang.c Subject: Re: hash function Message-ID: <14607@dog.ee.lbl.gov> Date: 24 Jun 91 09:15:58 GMT Article-I.D.: dog.14607 References: <14491.28619071@stjhmc.fidonet.org> <14583@dog.ee.lbl.gov> <15027@ulysses.att.com> Reply-To: torek@elf.ee.lbl.gov (Chris Torek) Organization: Lawrence Berkeley Laboratory, Berkeley Lines: 70 X-Local-Date: Mon, 24 Jun 91 02:15:58 PDT In an article whose referent is long gone, I described this hash function: for (each character) hash = (hash * 33) + (the character); hash_index = hash & ((some power of two) - 1); and noted: >>Maybe there are some theory people out there who can explain [why 33 >>works well]. In article <15027@ulysses.att.com> kpv@ulysses.att.com (Phong Vo[drew]) writes: >A good property for a hash function is if you keep feeding it >a constant character, it performs as if it's a random number generator. >The hash function f(h,c) = 33*h + c is derived from a class of random >number generators called linear congruential. At the start of Knuth Vol 2, >there is a good discussion of these generators. ... and indeed, if you turn to Knuth vol. 2, chapter 3, you will find all sorts of interesting stuff, such as a proof that the low-order n bits of the usual rand() repeat with period $2^n$. (rand() uses a modulus of either $2^15$ or $2^31$.) >... Assuming that c is some fixed odd value, Theorem A guarantees that >33 is good because (33-1)%4 == 0. ... Here good means that if you keep >feeding c, you'll cycle thru the entire set of values [0,2^x). This is approaching what I was getting at, but is not quite there. Basically it says that 33 is `good' because it is `not bad'. An LCRNG is defined as h = ( a h + c ) mod m (where h is each a value of the n+1 n n variable `hash' in the for loop) (in quote `>' above, $m = 2^x$). With a `bad' multiplier for $a$---and what is `bad' depends on $c$ and $m$---the generator will not produce all $m$ possible values; with a `good' one it will. Intuitively, then, the `bad' one will miss some of the hash buckets entirely; the `good' one will not. But *why* is it true that `A good property is [that] it performs as if it's a [good] random number generator'? This is now `obvious' (from the above result) *if* we feed in constant values of $c$, i.e., if all the strings are made of the same characters, but are different lengths. But we rarely do that---the strings we hash are full of weird junk. How do LCRNGs get in there? Why do the varying values of $c$ not mess things up? (Of course, they do, but not enough to matter. Why not?) >At this point, we depart from theory. We typically hash byte strings ... Actually, I typically hash ASCII text. 33 is the `easiest' `good' multiplier (as defined above) that is >= 26. I have no idea whether this is significant, or merely coincidental. >... This suggests that some value larger than 2^8 (but not too large) >may do better as a multiplicator. ... > hash = 0; > for(each byte c) > hash = (hash<<8) + (hash<<2) + hash + c; I will see if I can get this one run on the same data as the `33 hash'. This is more work than hash = (hash << 5) + hash + c; and distribution improvements may be outweighed by additional hash time (or may not: who knows?). -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov