Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!know!sdd.hp.com!elroy.jpl.nasa.gov!decwrl!megatest!djones
From: djones@megatest.UUCP (Dave Jones)
Newsgroups: comp.lang.c
Subject: Re: Hash functions in C
Message-ID: <14461@goofy.megatest.UUCP>
Date: 8 Nov 90 03:47:41 GMT
References: <14442@goofy.megatest.UUCP>
Organization: Megatest Corporation, San Jose, Ca
Lines: 17

I just now tested some hash-functions for an application which is
probably more typical, namely a compiler. The simplest function is
still a big winner on all counts.

I ran tests on a standard Pascal header-file containing 6290 identifiers,
including many duplicates (such as "procedure"). I got 370 collisions
using my hash-function, for a rate of only 0.06 collisions per lookup!
That was with a real big table. Lots of empty slots. When I reduced the
table size so that only about half the slots were empty, I got a little
under one collision per lookup, which is theoretical (1/2 + 1/2*1/2 + ...).
Under these circs, anything around one should generally be considered pretty
good. Surprisingly the CRC-16 hash-function was not too bad in this test,
giving a collision-rate of 1.24. Why it did so very poorly on the
dictionary, I don't know. And of course the relative slowness of
calculating it is also important, so I don't think it has anything at
all going for it as far as hashing char-strings is concerned. I'm wondering
why it was even suggested.