Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!think!mit-eddie!genrad!decvax!ima!johnl
From: johnl@ima.UUCP (Compilers mailing list)
Newsgroups: mod.compilers
Subject: Query about token mapping
Message-ID: <204@ima.UUCP>
Date: Fri, 5-Sep-86 10:50:41 EDT
Article-I.D.: ima.204
Posted: Fri Sep  5 10:50:41 1986
Date-Received: Fri, 5-Sep-86 21:46:50 EDT
Reply-To: decvax!astrovax!sutin (Brian Sutin)
Lines: 73
Approved: <compilers@ima.UUCP>

I would like to know if anyone has heard of the following algorithm, or
knows how to write it, or has any ideas about how it could be written.

Say you have N numbers, {X}, which you know ahead of time, and you want
to assign them the integers {1-N} in any order, but in a unique fashion.
Numbers not in {X} can be assigned anything, although it would be nice if
they were assigned to, say 0, or N+1.  This is nice for parse tables, or for
case statements where the {X} cover a large spread and the compiler uses
an if-else tree.  What I want is a program which writes the above program.
The subroutine:

alg(Y)
    for( i = {1-N} )
	if( Y = X[i] ) return i


is of order N operations, probably 3 or 4 N.  Better is a binary tree:

alg(Y)
    if( Y = X[N/2] ) return N/2
    else
    if( Y < X[N/2] )
	if( Y = X[N/4] ) return N/4
	else
	if( Y < X[N/4] )
	    ...
    else
    if( Y > X[N/2] )
	if( Y = X[3N/4] ) return 3N/4
	...

which is of order 2logN operations, but longer.  Also, there is the lookup
table of size max{X}.  Yuk.

None of these are optimal.  Now we get to the meat of the problem.  I bet
that there is an algorithm which will find a set of binary operations,
such as binary and, or, not, xor, etc, which, when handed a Y, will
return {1-N}.  This would be a black box with about logN operations, and
would be optimal for both speed and space.  For example:

X = { 0, 3, 5 }.  then

alg(Y)
    return X & 3

will map ( 0 -> 0 ), ( 3 -> 3 ), ( 5 -> 1 ).  2 was skipped, but oh well.

Anyway, any ideas?


Brian Sutin
Department of Astrophysical Sciences/Princeton University
{ princeton, ... }!astrovax!sutin

[This problem is known as perfect hashing.  The traditional application is
taking a token from a program you're compiling and seeing if it's one of the
reserved words.  More formally, a perfect hash function for a set of tokens
maps exactly one of the tokens into each of a set of hash buckets.  Computing
perfect hash functions is hard -- it takes minutes of computer time on
something like a PDP-11.  Once you have the hash function, using it is O(1)
because you take a token, hash it, compare it to the token that belongs in the
bucket hashed to, and either it's that token or it's not in your set.

There has been a fair amount written on both the form of perfect hash functions
and methods of computing them, typically by fitting some sort of polynomial.
Do any of you readers know of good references?  I can't find any right now.
-John]
-- 
Send compilers mail to ima!compilers or, in a pinch to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima

Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request