Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!know!sdd.hp.com!elroy.jpl.nasa.gov!decwrl!megatest!djones From: djones@megatest.UUCP (Dave Jones) Newsgroups: comp.lang.c Subject: Re: Hash functions in C Message-ID: <14461@goofy.megatest.UUCP> Date: 8 Nov 90 03:47:41 GMT References: <14442@goofy.megatest.UUCP> Organization: Megatest Corporation, San Jose, Ca Lines: 17 I just now tested some hash-functions for an application which is probably more typical, namely a compiler. The simplest function is still a big winner on all counts. I ran tests on a standard Pascal header-file containing 6290 identifiers, including many duplicates (such as "procedure"). I got 370 collisions using my hash-function, for a rate of only 0.06 collisions per lookup! That was with a real big table. Lots of empty slots. When I reduced the table size so that only about half the slots were empty, I got a little under one collision per lookup, which is theoretical (1/2 + 1/2*1/2 + ...). Under these circs, anything around one should generally be considered pretty good. Surprisingly the CRC-16 hash-function was not too bad in this test, giving a collision-rate of 1.24. Why it did so very poorly on the dictionary, I don't know. And of course the relative slowness of calculating it is also important, so I don't think it has anything at all going for it as far as hashing char-strings is concerned. I'm wondering why it was even suggested.