Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!decwrl!megatest!djones
From: djones@megatest.UUCP (Dave Jones)
Newsgroups: comp.lang.c
Subject: Hash functions in C
Message-ID: <14442@goofy.megatest.UUCP>
Date: 7 Nov 90 01:09:01 GMT
Organization: Megatest Corporation, San Jose, Ca
Lines: 36

I did a little bit of experimentation with the hash-functions which
have been posted here. Nothing scientific, mind you, but very interesting.

I wrote a program which takes a set of strings and stores each
in a hash-table. (Lookups are equivalent to stores, for our purposes,
because both operations hash a key and then look for it using rehashing.)
I made two versions, each based on my standard library hash-package, one
using the simple little add-and-shift function that I posted here, the other
using the CRC-16 algorithm. I then ran each on various data, including
source files and the UNIX spelling-checker dictionary "/usr/dict/words".
The result was that the simpler hash function consistently gave much better
store/lookup times that the CRC version did. Not even close.

Interestingly, the simpler hash-function was not only much quicker to
calculate, but it also gave far fewer collisions. Using the /usr/dict/words
data, each hash-function was called 24473 times -- once for each
word in the file. The shift-and-add version called the string
comparison routine 39655 times, which represents a collision rate of
about 0.6 collisions per lookup, which is right around the theoretical
rate for this particular algorithm. But the CRC-16 function called the
string-comparison routine no fewer than 270,662 times, for a collision-
rate of about 10.0 collisions (and re-collisions) per lookup!

I think I can rationalize that. The CRC-16 function does indeed
spread the set of all strings uniformly over the interval 0 .. 2**16 - 1.
But for n < 16 it does not uniformly spread short strings over the interval
0 .. 2**n - 1 in the bottom n bits. Also consider that in ASCII (w/o parity)
we are really only dealing with bit-streams in which every eighth bit
is a zero -- a very regular pattern for a 16-bit checksum. That may
have something to do with it, I dunno. A little experimenting suggests
that it may be possible to devise better polynomials than the CRC-16.
I got one that seems to work pretty good. But I don't think it can
distribute the keys well enough to compensate for the extra time it takes
to calculate a check-sum. (Besides that, any hash-function  which does not
at least take into account the range of the printable characters probably
will distribute typical text-strings less than optimally, I suspect.)