Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!decwrl!megatest!djones From: djones@megatest.UUCP (Dave Jones) Newsgroups: comp.lang.c Subject: Hash functions in C Message-ID: <14442@goofy.megatest.UUCP> Date: 7 Nov 90 01:09:01 GMT Organization: Megatest Corporation, San Jose, Ca Lines: 36 I did a little bit of experimentation with the hash-functions which have been posted here. Nothing scientific, mind you, but very interesting. I wrote a program which takes a set of strings and stores each in a hash-table. (Lookups are equivalent to stores, for our purposes, because both operations hash a key and then look for it using rehashing.) I made two versions, each based on my standard library hash-package, one using the simple little add-and-shift function that I posted here, the other using the CRC-16 algorithm. I then ran each on various data, including source files and the UNIX spelling-checker dictionary "/usr/dict/words". The result was that the simpler hash function consistently gave much better store/lookup times that the CRC version did. Not even close. Interestingly, the simpler hash-function was not only much quicker to calculate, but it also gave far fewer collisions. Using the /usr/dict/words data, each hash-function was called 24473 times -- once for each word in the file. The shift-and-add version called the string comparison routine 39655 times, which represents a collision rate of about 0.6 collisions per lookup, which is right around the theoretical rate for this particular algorithm. But the CRC-16 function called the string-comparison routine no fewer than 270,662 times, for a collision- rate of about 10.0 collisions (and re-collisions) per lookup! I think I can rationalize that. The CRC-16 function does indeed spread the set of all strings uniformly over the interval 0 .. 2**16 - 1. But for n < 16 it does not uniformly spread short strings over the interval 0 .. 2**n - 1 in the bottom n bits. Also consider that in ASCII (w/o parity) we are really only dealing with bit-streams in which every eighth bit is a zero -- a very regular pattern for a 16-bit checksum. That may have something to do with it, I dunno. A little experimenting suggests that it may be possible to devise better polynomials than the CRC-16. I got one that seems to work pretty good. But I don't think it can distribute the keys well enough to compensate for the extra time it takes to calculate a check-sum. (Besides that, any hash-function which does not at least take into account the range of the printable characters probably will distribute typical text-strings less than optimally, I suspect.)