Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!watmath!clyde!ima!johnl From: johnl@ima.UUCP Newsgroups: mod.compilers Subject: Re: Looking for Minimal Perfect Hash Functions Message-ID: <474@ima.UUCP> Date: Tue, 3-Feb-87 12:23:33 EST Article-I.D.: ima.474 Posted: Tue Feb 3 12:23:33 1987 Date-Received: Thu, 5-Feb-87 07:05:01 EST Sender: johnl@ima.UUCP Lines: 72 Approved: compilers@ima.UUCP In response to the article suggesting that binary search might be more efficient than a minimal perfect hashing function for keyword search: You may be right about the superiority of binary search over the use of hashing in a keyword table -- it depends crucially on several details which most people tend to overlook when they describe their favorite keyword-determination algorithm. For instance, when you do your "up to 10 string compares", do you call a function each time? If so, it almost certainly loses to a good hashing scheme. (Actually, 10 compares seems too high, since the size of the average keyword table is probably about 50 items, but even 6 compares might be slower than hashing.) As for hashing schemes, the efficiency issues include: time to calculate the hash function; number of keywords that collide ("perfectness"); expected ratio of non-keywords to keywords; density of keywords in the table ("minimalness"). Although it is obviously an advantage to have a perfect hashing function, because there need be no mechanism to resolve collisions, it actually WASTES time to have a minimal sized table, because every non-keyword will then collide with a keyword when hashed, requiring a string compare to verify that it is not a keyword. Unless you are cramped for space, or most of your identifiers are expected to be keywords, the best density for your hash table is less than 1/5 full, so that 4 out of 5 non-keyword identifiers will hash to an empty location, which can be instantly recognized as such without a string compare. Given that you don't want a MINIMAL hash function, the construction of a PERFECT, very fast one is not hard at all. This is how I did it for two different languages: With a list of keywords in front of you, find two or three trivally computable attributes of an identifier that distinguish all the keywords from one another. These attributes can almost always be chosen from the following: first, second or third letters; identifier length; last or second last letters. This should take you about half an hour at most. Now, let's say you decide to use first letter, third letter and identifier length. The next decision is to choose a table size large enough to make most identifiers hash to empty positions, say 257 (a prime number is best). The only remaining task is to write a program that finds values for the variables (i, j, k) in the function below, such that there are no collisions: len = strlen(id); hash = ( i * id[0] + j * id[len>=3 ? 2 : len] + k * len ) % 257; The program can start with, say, (i,j,k) = (1,10,40) and iterate through increasing values in any order you choose. Make sure, however, that i,j and k are large enough to allow at least 257 different values to be pro- duced for the mod (%) operation. You will find that, with less than 50 keywords and a density of less than 1/5, you don't have to try many combinations of i,j and k to hit a perfect hash. My program ran for a couple of minutes in both cases. The above hash function takes three multiply's, two adds, a mod, and a check on the length of the id. Unlike the binary search method, no additional string compare is necessary for 4 out of 5 non-keyword identifiers, although every keyword requires a string compare. (I assume that you would already have in hand the length, "len", of the identifier, if you care at all about efficiency, so I do not include the time to calculate strlen as part of the hash. In the typical lexer that reads identifiers into a token buffer, it can be computed in one instruction by subtracting the token buffer address from the pointer to the end of the token in the buffer.) My program to find a perfect hash function as described above is all of 93 lines of C, including reading the list of keywords from a file into an array, and printing out the values of i, j, k and the hashed values for each keyword. If anyone thinks it worthwhile, I'd be happy to mail it to them. Michael Condict {ihnp4|vax135}!m10ux!mnc AT&T Bell Labs (201)582-5911 MH 3B-416 Murray Hill, NJ -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request