Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!sdd.hp.com!think.com!mintaka!spdcc!iecc!compilers-sender
From: lupine!rfg@uunet.UU.NET (Ron Guilmette)
Newsgroups: comp.compilers
Subject: Re: Hash specifics
Keywords: design
Message-ID: <2979@lupine.NCD.COM>
Date: 13 Dec 90 06:25:10 GMT
References: <9012112133.AA00452@rbbb.Eng.Sun.COM>
Sender: compilers-sender@iecc.cambridge.ma.us
Reply-To: lupine!rfg@uunet.UU.NET (Ron Guilmette)
Organization: Network Computing Devices, Inc., Mt. View, CA
Lines: 69
Approved: compilers@iecc.cambridge.ma.us

In article <9012112133.AA00452@rbbb.Eng.Sun.COM> chased@Eng.Sun.COM (David Chase) writes:
+CACM 33:6 (June 1990) "Fast Hashing of Variable Length Strings"
+by Peter K. Pearson.
+
+He discusses use of hash functions of the form:
+
+  h = 0;
+  n = length(string);
+
+  for i = 1 through n do
+    h = Table [ h XOR string [i]];
+  enddo;
+
+  return h;

Some time back, I wrote a program which used the following hash function
to try (for a given input set of input strings) to find an "efficient"
hash function for the given set (and some particular table size).

I defined "efficiency" somewhat arbitrarily as the number members of the
input (key) set divided by the number of "probes" (or "key comparisons")
which would be needed to find each and every member of the input set in
the hash table exactly once.  (Note that I was assuming that the hash
table in question would be the kind used in compilers & assemblers; i.e.
one which would be fully "precomputed" ahead of time and used only for
lookups, and NEVER for ADDS or DELETES.)

static unsigned hash (const char *s, unsigned prime)
{
  unsigned ret_val = 0;

  for (; *s; s++)
    ret_val = (*s ^ ret_val) * prime;
  return ret_val & hash_mask;
}

This hash function takes a second parameter `prime' which is used to help
produce `well mangled' output keys from the input keys (i.e. strings).
Knuth documents that fact that using prime numbers in hash functions is
known to tend to make these functions more "efficient" (i.e. it tends to
make the output keys more evenly distributed over the range).

(Note also that this function assumes that `hash_mask' is some constant
defined earlier on.  Typically, it has a value like 2**N-1, where N is
some integer.  The `primary' hash table size is 2**N.)

The program I wrote tried various values for `prime' up to some fixed
limit (i.e. 8k).  I ran the program for various sets of input strings
(both large and small) and for various table sizes.

The program reported the "efficiency" rating of the primes tested.
Curiously, the number 47 always rated in the top 10 regardless of the set
of input keys used or the primary table size used.  I have no explanation
for this.

Moral of the story: If you need a reasonably efficient hashing function
for some arbitrary set of strings (e.g. identifiers or keywords), and if
you need it fast, use the function shown above and substitute `47' for
`prime'.

P.S.  The above code is not copyrighted, but its `look & feel' may be. :-)

-- 

// Ron Guilmette  -  C++ Entomologist
// Internet: rfg@ncd.com      uucp: ...uunet!lupine!rfg
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.