Path: utzoo!attcan!uunet!cs.utexas.edu!uwm.edu!rpi!zaphod.mps.ohio-state.edu!sunybcs!uhura.cc.rochester.edu!rochester!dietz From: dietz@cs.rochester.edu (Paul Dietz) Newsgroups: comp.theory Subject: Re: hashing function for strings Keywords: universal hash functions Message-ID: <1990Feb21.154342.5944@cs.rochester.edu> Date: 21 Feb 90 15:43:42 GMT References: <1990Feb15.211210.22950@max.sunysb.edu> Reply-To: dietz@cs.rochester.edu (Paul Dietz) Organization: University of Rochester Computer Science Department Lines: 56 In article <1990Feb15.211210.22950@max.sunysb.edu> rosalia@max.sunysb.edu (Mark Galassi) writes: >I would like to know which is a good hash function for strings. >I was going to compute the index into the hash table with > index = hashval(s) % HASH_LEN; >where s is a string, and HASH_LEN is a prime number and is also >the size of the hash table. > >I have seen two schemes: > >1. hashval(s) is the sum of the ascii values of each char in the string. > >2. hashval(s) is the sum of the ascii values*256^i where i is the > position of the char in the string. > >Which of these is better, or is there any other? Knuth seems to not >go into detail on hashval() for strings. If you want to be rigorous, you should talk about good *classes* of hash functions, since, for any particular hash function, there are inputs on which is behaves terribly. What you might want are universal classes of hash functions (strictly speaking, universal-2 classes). A class of hash functions U is universal if, for any two distinct keys x and y, |U|/HASH_LEN functions in U hash x and y to the same value. So, if you pick a random element of U, the chance x and y will collide is 1/HASH_LEN. See, for example, Cormen, Leiserson and Rivest (Introduction to Algorithms, MIT Press, Spring 1990) or Brassard and Bratley (Algorithmic: Theory and Practice, Prentice Hall, 1988). Here's a universal class of hash functions for strings, if HASH_LEN is a prime number. Let a[0], a[1], ... be random numbers from the set {0,...,HASH_LEN-1}. Then, the hash function is h(s) = (a[0]*s[0] + a[1]*s[1] + ... ) mod HASH_LEN This does one multiplication per character, but you could save on that by packing characters into longer words (but the resulting numbers must still be < HASH_LEN) using these words instead of the s[i] in the definition. Here's another possibility, if HASH_LEN is a power of two. For each ascii character c and each position in the string i, pick a random number a[i][c] from {0,...,HASH_LEN-1}. Then, a universal class of hash functions is defined by h(s) = a[0][s[0]] XOR a[1][s[1]] XOR ... This is particularly easy to compute, involving no multiplication (save any that might be required for the array accesses) or modulus operations, but it does require one additional memory reference per character. Paul F. Dietz dietz@cs.rochester.edu