Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!ima!johnl
From: johnl@ima.UUCP
Newsgroups: mod.compilers
Subject: Re: Looking for Minimal Perfect Hash Functions
Message-ID: <474@ima.UUCP>
Date: Tue, 3-Feb-87 12:23:33 EST
Article-I.D.: ima.474
Posted: Tue Feb  3 12:23:33 1987
Date-Received: Thu, 5-Feb-87 07:05:01 EST
Sender: johnl@ima.UUCP
Lines: 72
Approved: compilers@ima.UUCP

In response to the article suggesting that binary search might be more
efficient than a minimal perfect hashing function for keyword search:

You may be right about the superiority of binary search over the use of
hashing in a keyword table -- it depends crucially on several details
which most people tend to overlook when they describe their favorite
keyword-determination algorithm.  For instance, when you do your "up to
10 string compares", do you call a function each time?  If so, it almost
certainly loses to a good hashing scheme.  (Actually, 10 compares seems
too high, since the size of the average keyword table is probably
about 50 items, but even 6 compares might be slower than hashing.)

As for hashing schemes, the efficiency issues include: time to calculate
the hash function; number of keywords that collide ("perfectness");
expected ratio of non-keywords to keywords; density of keywords in the
table ("minimalness").  Although it is obviously an advantage to have a
perfect hashing function, because there need be no mechanism to resolve
collisions, it actually WASTES time to have a minimal sized table,
because every non-keyword will then collide with a keyword when hashed,
requiring a string compare to verify that it is not a keyword.  Unless
you are cramped for space, or most of your identifiers are expected to
be keywords, the best density for your hash table is less than 1/5 full,
so that 4 out of 5 non-keyword identifiers will hash to an empty location,
which can be instantly recognized as such without a string compare.

Given that you don't want a MINIMAL hash function, the construction
of a PERFECT, very fast one is not hard at all.  This is how I did it
for two different languages:  With a list of keywords in front of
you, find two or three trivally computable attributes of an identifier that
distinguish all the keywords from one another.  These attributes can almost
always be chosen from the following: first, second or third letters;
identifier length; last or second last letters.  This should take you
about half an hour at most.  Now, let's say you decide to use first letter,
third letter and identifier length.  The next decision is to choose a
table size large enough to make most identifiers hash to empty positions,
say 257 (a prime number is best).  The only remaining task is to write a
program that finds values for the variables (i, j, k) in the function below,
such that there are no collisions:

	len = strlen(id);
	hash = ( i * id[0] + j * id[len>=3 ? 2 : len] + k * len ) % 257;

The program can start with, say, (i,j,k) = (1,10,40) and iterate through
increasing values in any order you choose.  Make sure, however, that i,j
and k are large enough to allow at least 257 different values to be pro-
duced for the mod (%) operation.  You will find that, with less than 50
keywords and a density of less than 1/5, you don't have to try many
combinations of i,j and k to hit a perfect hash.  My program ran for a
couple of minutes in both cases.  The above hash function takes three
multiply's, two adds, a mod, and a check on the length of the id.  Unlike
the binary search method, no additional string compare is necessary for 4
out of 5 non-keyword identifiers, although every keyword requires a string
compare.  (I assume that you would already have in hand the length, "len",
of the identifier, if you care at all about efficiency, so I do not include
the time to calculate strlen as part of the hash.  In the typical lexer that
reads identifiers into a token buffer, it can be computed in one instruction
by subtracting the token buffer address from the pointer to the end of the
token in the buffer.)

My program to find a perfect hash function as described above is all of 93
lines of C, including reading the list of keywords from a file into an array,
and printing out the values of i, j, k and the hashed values for each
keyword.  If anyone thinks it worthwhile, I'd be happy to mail it to them.

Michael Condict		{ihnp4|vax135}!m10ux!mnc
AT&T Bell Labs		(201)582-5911    MH 3B-416
Murray Hill, NJ
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request