Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!munnari.oz.au!metro!dmssyd.syd.dms.CSIRO.AU!ditsydh.syd.dit.CSIRO.AU!evans From: evans@syd.dit.CSIRO.AU (Bruce.Evans) Newsgroups: comp.lang.c Subject: Re: hash function Message-ID: <1991Jun26.112724.20539@syd.dit.CSIRO.AU> Date: 26 Jun 91 11:27:24 GMT References: <14491.28619071@stjhmc.fidonet.org> <14583@dog.ee.lbl.gov> Organization: CSIRO Division of Info Tech, Sydney, Australia Lines: 71 In article <14583@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes: >>> hash = 0; >>> for (each character) >>> hash = (hash * 33) + (the character); >>> hash_index = hash & ((some power of two) - 1); > >In article <14491.28619071@stjhmc.fidonet.org> >Dave.Harris@f14.n15.z1.fidonet.org (Dave Harris) writes: >>Is the *33 use to keep short sequences has key more random or does it >>really help in the distribution of large string too? > >The third question is the hardest. What *does* that 33 do? I have no >idea. All I know is that I have experimented with `easy' functions I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1. The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to avoid throwing away bits. Rotating the bits instead of shifting left would probably be a better way to avoid this. The following hash function worked well for me in an assembler. It implements the ideas that hashing all the chars in the string is unnecessary (and maybe even bad if too many bits get shifted out) and that the middle char is more random than leading or trailing chars. In the program this was extracted from, the string length was already known. Avoiding looking at all the chars in the string is not such a good idea when strlen has to do it anyway. Also, this probably does best with a lot of short strings. --- extern unsigned strlen(); /* #include */ #define SPTSIZE 1024 extern struct sym_s *spt[SPTSIZE]; #define hconv(ch) ((unsigned char) (ch) - 0x41) /* better form for hashing */ struct sym_s **gethashptr(str) register char *str; { register unsigned hashval; register unsigned length; /* Hash function is a weighted xor of 1 to 4 chars in the string. * This seems to work better than looking at all the chars. * It is important that the function be fast. * The multiplication by MULTIPLIER should compile as a shift. */ #define MULTIPLIER (SPTSIZE / (1 << USEFUL_BITS_IN_ASCII)) #define USEFUL_BITS_IN_ASCII 6 length = strlen(str); if (length <= 3) { if (length <= 2) hashval = hconv(str[-1]) * MULTIPLIER; else hashval = hconv(str[-2]) * MULTIPLIER, hashval ^= hconv(str[-1]); } else hashval = hconv(str[-(length / 2)]) * MULTIPLIER, hashval ^= hconv(str[-2]) << 2, hashval ^= hconv(str[-1]); return spt + (hashval ^ (hconv(str[0]) << 1)) % SPTSIZE; } --- -- Bruce Evans evans@syd.dit.csiro.au