Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!stanford.edu!agate!dog.ee.lbl.gov!elf.ee.lbl.gov!torek From: torek@elf.ee.lbl.gov (Chris Torek) Newsgroups: comp.lang.c Subject: Re: hash function for mac needed Message-ID: <14396@dog.ee.lbl.gov> Date: 18 Jun 91 07:37:44 GMT References: <14207.285B768A@stjhmc.fidonet.org> Reply-To: torek@elf.ee.lbl.gov (Chris Torek) Organization: Lawrence Berkeley Laboratory, Berkeley Lines: 67 X-Local-Date: Tue, 18 Jun 91 00:37:44 PDT In article <14207.285B768A@stjhmc.fidonet.org> Dave.Harris@f14.n15.z1.fidonet.org (Dave Harris) writes: >... The simple thing to do is to add all the characters up then mod by >the size of your hash, then use that number to index into an array. ... >You don't necessarly need to add all 255 characters together. Indeed: as it turns out, while many people have spent much time looking for `good' hashing functions, fewer seem to have noted that many hashing systems spend more time computing hashes than searching. I find that this is usually the case, and that a fast hash function is better than a slower-but-better-distribution function. One good hash function, useful in many applications, is this: hash = 0; for (each character) hash = (hash * 33) + (the character); hash_index = hash & ((some power of two) - 1); In C, this might be written as struct hashent * lookup(table, string) register struct hashtab *table; char *string; { register unsigned hash = 0; register char *p, c; register struct hashent *hp; for (p = string; (c = *p) != '\0'; p++) hash = (hash << 5) + hash + c; p = string; for (hp = table->ht_entries[hash & table->ht_mask]; hp != NULL; hp = hp->h_next) if (hp->h_hash == hash && strcmp(hp->h_key, p) == 0) return (hp); return (NULL); } >To anybody else. If I have n data elements, with a hash of size x, what is >considered a good number of searches based on n and x (i.e. ln(x/n)). If you have a perfectly uniform distribution, each chain or hash sequence will have n/x elements; assuming the probability of any element lookup is identical, this is the best that can be done. I do not know what others consider `good', but a factor of 1.3 or so over that seems reasonable to me (I cannot quite think why). Typically neither assumption holds true. There are too many ways to take advantage of this to characterize here. One thing that can be done, however, is to use the above hash table scheme with expanding tables. As the number of entries increases, the hash table size can be doubled, and table->ht_mask adds one bit: table->ht_mask = (table->ht_mask << 1) | 1; Since the hash values of each pair are maintained, it is easy to move the entries from their old hash slots to their new ones. Doubling occurs O(log2 n) times if the doubling criterion is total number of table entries; other doubling criteria, such as maximum chain length, are possible but rarely worthwhile---one ends up spending more time optimizing hashes than searching, just as with more complicated hash functions. Trading space for time this way seems to work well, and you can select the amount of space by selecting the doubling factor. -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov