Path: utzoo!attcan!uunet!lll-winken!ames!amdahl!pyramid!prls!philabs!micomvax!ray
From: ray@micomvax.UUCP (Ray Dunn)
Newsgroups: comp.lang.c
Subject: Re: HASHING, revisited - A summary.
Message-ID: <1651@micomvax.UUCP>
Date: 25 Jan 89 16:13:32 GMT
References: <1761NU013809@NDSUVM1>
Reply-To: ray@micomvax.UUCP (Ray Dunn)
Organization: Philips Electronics Ltd. (TDS - Montreal) St. Laurent QC, Canada
Lines: 49

As Greg's posting may be taken as "definitive" and squirrelled away by readers
for future reference, it may be best to correct a couple of small points:

In article <1761NU013809@NDSUVM1> NU013809@NDSUVM1.BITNET (Greg Wettstein) writes:
>A common question when implementing a HASH based lookup table is how
>does one translate a supposedly random number into a finite set of
>'buckets' which are typically the indexes into the lookup table.  The
>technique used here is to divide the HASH value by a prime number and
>take the modulus of the division to be the bucket number or array index.

In fact the "most common" method, is to employ a table whose length is a
power of 2.  The "index" is then generated by simply ANDing the hash value
by the table size-1.

>The overall goal in the process of hashing and bucket index generation
>is to develop a process which disperses the input keys as evenly as
>possible among all the output buckets.

This is a very common mistake made when designing hashing algorithms.
In fact the goal should be to minimize the *total* time spent in the
searching algorithm over a statistically significant number of calls.  This
involves taking the *frequency of occurrence* of symbols or classes of
symbols into account when designing the mechanism.

If the tokens "( ) ; ," each occur significantly more frequently than the
tokens "while for switch do",  then it would be reasonable to design the
algorithm so that the former items were located faster than the latter, i.e.
the length of the search chains would be inversely proportional to the
frequency of occurrence of the tokens on the chains.

This can be achieved in a variety of ways, including ad hoc tests for
specific high frequency items prior to applying a generalized routine.

In the case of a fixed data-base, (for example the reserved words in a
compiler), it does not take much effort to analyze a filesystem or three for
token frequencies and then structure and *order* the dictionary accordingly
(i.e. order the linked lists with the most frequent tokens first).

It is also important to remember that a naive algorithm may well prove just
as good overall than a *time consuming* sophisticated one, particularly if
the search chains are short.  Equate the time taken by the algorithm to
extra "dead" length on the search chains.

This was all debated in this newsgroup as recently as August of last year.
-- 
Ray Dunn.                      |   UUCP: ..!philabs!micomvax!ray
Philips Electronics Ltd.       |   TEL : (514) 744-8200   Ext: 2347
600 Dr Frederik Philips Blvd   |   FAX : (514) 744-6455
St Laurent. Quebec.  H4M 2S9   |   TLX : 05-824090