Path: utzoo!attcan!uunet!lll-winken!ames!amdahl!pyramid!prls!philabs!micomvax!ray From: ray@micomvax.UUCP (Ray Dunn) Newsgroups: comp.lang.c Subject: Re: HASHING, revisited - A summary. Message-ID: <1651@micomvax.UUCP> Date: 25 Jan 89 16:13:32 GMT References: <1761NU013809@NDSUVM1> Reply-To: ray@micomvax.UUCP (Ray Dunn) Organization: Philips Electronics Ltd. (TDS - Montreal) St. Laurent QC, Canada Lines: 49 As Greg's posting may be taken as "definitive" and squirrelled away by readers for future reference, it may be best to correct a couple of small points: In article <1761NU013809@NDSUVM1> NU013809@NDSUVM1.BITNET (Greg Wettstein) writes: >A common question when implementing a HASH based lookup table is how >does one translate a supposedly random number into a finite set of >'buckets' which are typically the indexes into the lookup table. The >technique used here is to divide the HASH value by a prime number and >take the modulus of the division to be the bucket number or array index. In fact the "most common" method, is to employ a table whose length is a power of 2. The "index" is then generated by simply ANDing the hash value by the table size-1. >The overall goal in the process of hashing and bucket index generation >is to develop a process which disperses the input keys as evenly as >possible among all the output buckets. This is a very common mistake made when designing hashing algorithms. In fact the goal should be to minimize the *total* time spent in the searching algorithm over a statistically significant number of calls. This involves taking the *frequency of occurrence* of symbols or classes of symbols into account when designing the mechanism. If the tokens "( ) ; ," each occur significantly more frequently than the tokens "while for switch do", then it would be reasonable to design the algorithm so that the former items were located faster than the latter, i.e. the length of the search chains would be inversely proportional to the frequency of occurrence of the tokens on the chains. This can be achieved in a variety of ways, including ad hoc tests for specific high frequency items prior to applying a generalized routine. In the case of a fixed data-base, (for example the reserved words in a compiler), it does not take much effort to analyze a filesystem or three for token frequencies and then structure and *order* the dictionary accordingly (i.e. order the linked lists with the most frequent tokens first). It is also important to remember that a naive algorithm may well prove just as good overall than a *time consuming* sophisticated one, particularly if the search chains are short. Equate the time taken by the algorithm to extra "dead" length on the search chains. This was all debated in this newsgroup as recently as August of last year. -- Ray Dunn. | UUCP: ..!philabs!micomvax!ray Philips Electronics Ltd. | TEL : (514) 744-8200 Ext: 2347 600 Dr Frederik Philips Blvd | FAX : (514) 744-6455 St Laurent. Quebec. H4M 2S9 | TLX : 05-824090