Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!sdd.hp.com!think.com!mintaka!spdcc!iecc!compilers-sender From: lupine!rfg@uunet.UU.NET (Ron Guilmette) Newsgroups: comp.compilers Subject: Re: Hash specifics Keywords: design Message-ID: <2979@lupine.NCD.COM> Date: 13 Dec 90 06:25:10 GMT References: <9012112133.AA00452@rbbb.Eng.Sun.COM> Sender: compilers-sender@iecc.cambridge.ma.us Reply-To: lupine!rfg@uunet.UU.NET (Ron Guilmette) Organization: Network Computing Devices, Inc., Mt. View, CA Lines: 69 Approved: compilers@iecc.cambridge.ma.us In article <9012112133.AA00452@rbbb.Eng.Sun.COM> chased@Eng.Sun.COM (David Chase) writes: +CACM 33:6 (June 1990) "Fast Hashing of Variable Length Strings" +by Peter K. Pearson. + +He discusses use of hash functions of the form: + + h = 0; + n = length(string); + + for i = 1 through n do + h = Table [ h XOR string [i]]; + enddo; + + return h; Some time back, I wrote a program which used the following hash function to try (for a given input set of input strings) to find an "efficient" hash function for the given set (and some particular table size). I defined "efficiency" somewhat arbitrarily as the number members of the input (key) set divided by the number of "probes" (or "key comparisons") which would be needed to find each and every member of the input set in the hash table exactly once. (Note that I was assuming that the hash table in question would be the kind used in compilers & assemblers; i.e. one which would be fully "precomputed" ahead of time and used only for lookups, and NEVER for ADDS or DELETES.) static unsigned hash (const char *s, unsigned prime) { unsigned ret_val = 0; for (; *s; s++) ret_val = (*s ^ ret_val) * prime; return ret_val & hash_mask; } This hash function takes a second parameter `prime' which is used to help produce `well mangled' output keys from the input keys (i.e. strings). Knuth documents that fact that using prime numbers in hash functions is known to tend to make these functions more "efficient" (i.e. it tends to make the output keys more evenly distributed over the range). (Note also that this function assumes that `hash_mask' is some constant defined earlier on. Typically, it has a value like 2**N-1, where N is some integer. The `primary' hash table size is 2**N.) The program I wrote tried various values for `prime' up to some fixed limit (i.e. 8k). I ran the program for various sets of input strings (both large and small) and for various table sizes. The program reported the "efficiency" rating of the primes tested. Curiously, the number 47 always rated in the top 10 regardless of the set of input keys used or the primary table size used. I have no explanation for this. Moral of the story: If you need a reasonably efficient hashing function for some arbitrary set of strings (e.g. identifiers or keywords), and if you need it fast, use the function shown above and substitute `47' for `prime'. P.S. The above code is not copyrighted, but its `look & feel' may be. :-) -- // Ron Guilmette - C++ Entomologist // Internet: rfg@ncd.com uucp: ...uunet!lupine!rfg -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.