Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!munnari.oz.au!metro!dmssyd.syd.dms.CSIRO.AU!ditsydh.syd.dit.CSIRO.AU!evans
From: evans@syd.dit.CSIRO.AU (Bruce.Evans)
Newsgroups: comp.lang.c
Subject: Re: hash function
Message-ID: <1991Jun26.112724.20539@syd.dit.CSIRO.AU>
Date: 26 Jun 91 11:27:24 GMT
References: <14491.28619071@stjhmc.fidonet.org> <14583@dog.ee.lbl.gov>
Organization: CSIRO Division of Info Tech, Sydney, Australia
Lines: 71

In article <14583@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>>>	hash = 0;
>>>	for (each character)
>>>		hash = (hash * 33) + (the character);
>>>	hash_index = hash & ((some power of two) - 1);
>
>In article <14491.28619071@stjhmc.fidonet.org>
>Dave.Harris@f14.n15.z1.fidonet.org (Dave Harris) writes:
>>Is the *33 use to keep short sequences has key more random or does it
>>really help in the distribution of large string too?
>
>The third question is the hardest.  What *does* that 33 do?  I have no
>idea.  All I know is that I have experimented with `easy' functions

I don't think there are any theoretical mysteries behind the 33. 33 = 32 + 1.
The 32 is to avoid mixing bits with previous hash bits. 256 would avoid all
mixing (with 8-bit chars) but it would throw away too many bits. The 1 is to
avoid throwing away bits. Rotating the bits instead of shifting left would
probably be a better way to avoid this.

The following hash function worked well for me in an assembler. It implements
the ideas that hashing all the chars in the string is unnecessary (and maybe
even bad if too many bits get shifted out) and that the middle char is more
random than leading or trailing chars.

In the program this was extracted from, the string length was already known.
Avoiding looking at all the chars in the string is not such a good idea when
strlen has to do it anyway. Also, this probably does best with a lot of
short strings.
---
extern unsigned strlen();	/* #include <string.h> */

#define SPTSIZE	1024

extern struct sym_s *spt[SPTSIZE];

#define hconv(ch) ((unsigned char) (ch) - 0x41)	/* better form for hashing */

struct sym_s **gethashptr(str)
register char *str;
{
    register unsigned hashval;
    register unsigned length;

    /* Hash function is a weighted xor of 1 to 4 chars in the string.
     * This seems to work better than looking at all the chars.
     * It is important that the function be fast.
     * The multiplication by MULTIPLIER should compile as a shift.
     */

#define MULTIPLIER (SPTSIZE / (1 << USEFUL_BITS_IN_ASCII))
#define USEFUL_BITS_IN_ASCII 6

    length = strlen(str);
    if (length <= 3)
    {
	if (length <= 2)
	    hashval  = hconv(str[-1]) * MULTIPLIER;
	else
	    hashval  = hconv(str[-2]) * MULTIPLIER,
	    hashval ^= hconv(str[-1]);
    }
    else
	hashval  = hconv(str[-(length / 2)]) * MULTIPLIER,
	hashval ^= hconv(str[-2]) << 2,
	hashval ^= hconv(str[-1]);
    return spt + (hashval ^ (hconv(str[0]) << 1)) % SPTSIZE;
}
---
-- 
Bruce Evans		evans@syd.dit.csiro.au