Path: utzoo!attcan!uunet!cs.utexas.edu!uwm.edu!rpi!zaphod.mps.ohio-state.edu!sunybcs!uhura.cc.rochester.edu!rochester!dietz
From: dietz@cs.rochester.edu (Paul Dietz)
Newsgroups: comp.theory
Subject: Re: hashing function for strings
Keywords: universal hash functions
Message-ID: <1990Feb21.154342.5944@cs.rochester.edu>
Date: 21 Feb 90 15:43:42 GMT
References: <1990Feb15.211210.22950@max.sunysb.edu>
Reply-To: dietz@cs.rochester.edu (Paul Dietz)
Organization: University of Rochester Computer Science Department
Lines: 56

In article <1990Feb15.211210.22950@max.sunysb.edu> rosalia@max.sunysb.edu (Mark Galassi) writes:

>I would like to know which is a good hash function for strings.
>I was going to compute the index into the hash table with
>	index = hashval(s) % HASH_LEN;
>where s is a string, and HASH_LEN is a prime number and is also
>the size of the hash table.
>
>I have seen two schemes:
>
>1. hashval(s) is the sum of the ascii values of each char in the string.
>
>2. hashval(s) is the sum of the ascii values*256^i where i is the
>   position of the char in the string.
>
>Which of these is better, or is there any other?  Knuth seems to not
>go into detail on hashval() for strings.

If you want to be rigorous, you should talk about good *classes*
of hash functions, since, for any particular hash function, there
are inputs on which is behaves terribly.

What you might want are universal classes of hash functions (strictly
speaking, universal-2 classes).  A class of hash functions U is
universal if, for any two distinct keys x and y, |U|/HASH_LEN
functions in U hash x and y to the same value.  So, if you pick a
random element of U, the chance x and y will collide is 1/HASH_LEN.
See, for example, Cormen, Leiserson and Rivest (Introduction to
Algorithms, MIT Press, Spring 1990) or Brassard and Bratley
(Algorithmic: Theory and Practice, Prentice Hall, 1988).

Here's a universal class of hash functions for strings, if HASH_LEN is
a prime number.  Let a[0], a[1], ... be random numbers from the set
{0,...,HASH_LEN-1}.  Then, the hash function is

	h(s) = (a[0]*s[0] + a[1]*s[1] + ... ) mod HASH_LEN

This does one multiplication per character, but you could save on
that by packing characters into longer words (but the resulting
numbers must still be < HASH_LEN) using these words instead of
the s[i] in the definition.

Here's another possibility, if HASH_LEN is a power of two.  For each
ascii character c and each position in the string i, pick a random
number a[i][c] from {0,...,HASH_LEN-1}.  Then, a universal class
of hash functions is defined by

	h(s) = a[0][s[0]] XOR a[1][s[1]] XOR ...

This is particularly easy to compute, involving no multiplication
(save any that might be required for the array accesses) or modulus
operations, but it does require one additional memory reference per
character.

	Paul F. Dietz
	dietz@cs.rochester.edu