Path: utzoo!censor!geac!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!elroy.jpl.nasa.gov!usc!wuarchive!zaphod.mps.ohio-state.edu!think.com!mintaka!spdcc!iecc!compilers-sender From: rfg@ncd.com (Ron Guilmette) Newsgroups: comp.compilers Subject: Re: Hash specifics Keywords: design Message-ID: <2977@lupine.NCD.COM> Date: 13 Dec 90 05:55:44 GMT References: <9012111913.AA07310@cs-sun-fsa.cpsc.ucalgary.ca> Sender: compilers-sender@iecc.cambridge.ma.us Reply-To: rfg@ncd.com (Ron Guilmette) Organization: Network Computing Devices, Inc., Mt. View, CA Lines: 41 Approved: compilers@iecc.cambridge.ma.us In article <9012111913.AA07310@cs-sun-fsa.cpsc.ucalgary.ca> degroot@cpsc.ucalgary.ca (Adriaan Degroot) writes: >How does one build a hash function with minimal table size and minimal >collisions for a known set of hash strings? As others have already noted, Doug Schmidt wrote a cute program called gperf (in C++) which is useful for finding perfect hash functions. Among the possible command line arguments for Doug's program are a set of options which allow you to specify (for instance) the particular character positions from the input (search key) strings which you want to participate in the hashing. So (for instance) you can tell the program only to hash characters 1 thru 3 and the last 2 characters of each string. The unfortunate part (and this is something that I ragged on Doug about at length) is that the program is not setup to automatically try various possibilities for the `hash bytes subset' option(s). In other words, you may tell the program to go and look for a perfect hash function for a certain input (key) set and a certain table size and a certain `hash bytes subset' and it may simply come back and tell you that it ain't possible. Also, the program only uses one particular (simple & fast) hash function. I can't reacll exactly what it does, but that doesn't matter for the point I'm making. The point is this. Let's say that the hash function tries simply XOR'ing the bytes together, and then masking off the low bits to get the hash key when done. Well, gperf will never tell you that you might have been able to get a faster hash function (or even one that actually works given your stated table size constraints) if (for example) the hash function had simply *ADDED* the byte values together rather that XOR'ing. In other words, gperf explores only a small area within the (multidimensional?) space of all possible (or all useful) hashing functions. That's not to say that it isn't a nice an/or useful tool. It is both. -- // Ron Guilmette - C++ Entomologist // Internet: rfg@ncd.com uucp: ...uunet!lupine!rfg -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.