Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!wuarchive!udel!rochester!kodak!ispd-newsserver!ism.isc.com!bud.sos.ivy.isc.com!willcr
From: willcr@bud.sos.ivy.isc.com (Will Crowder)
Newsgroups: comp.lang.c
Subject: Re: Efficient STRing CoMPares?
Message-ID: <1991Mar21.175913.20196@ism.isc.com>
Date: 21 Mar 91 17:59:13 GMT
References: <15486@smoke.brl.mil> <1991Mar19.034802.24340@cbnewsk.att.com> <1991Mar20.020957.9180@athena.mit.edu> <1991Mar20.213615.11223@noose.ecn.purdue.edu>
Sender: usenet@ism.isc.com (Ism Usenet News)
Reply-To: willcr@ivy.isc.com
Followup-To: comp.lang.c
Organization: Interactive Systems Corp.
Lines: 57

In article <1991Mar20.213615.11223@noose.ecn.purdue.edu>, psmith@iies.ecn.purdue.edu (Paul F Smith) writes:
|> >In article barton@cbnewsk.att.com (jim.m.barton) writes:
|> >
|> >	struct chunk
|> >		{
|> >		char *c_ptr;
|> >		unsigned int c_hash;
|> >		};
|> >
|> ...
|> >
|> >You can also write great little symbol table implementations
|> >using hashing, and the hash values don't even have to take up
|>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|> >extra space (they're implicit in the bucket or slot indices).
|>  ^^^^^^^^^^^!!!
|> >
|> >                                            Steve Summit
|> >                                            scs@adam.mit.edu
|> 
|> OK, I won't ask about a string hash function :-), but could you 
|> please elaborate on how your symbol table works. I just implemented a
|> hash table that keeps the hash value around (like your struct above), 
|> which I then use in scanning the list at the given bucket to minimize
|> strcmp()s.  But, I can't see how you can get rid of it and still reduce 
|> the number of strcmp()s. Please describe, Thanks!

(Since we're talking about one of my favorite data structures, hashing
with open chaining, I'll take a shot at this one.)

Hashing with open chaining is a method of saving information such that it
can be looked up quickly.  The basic idea is to deal with hash collisions
by maintaining a list (usually a linked-list) of data with identical hash
values.  Then each hash "bucket" points to the list of data having the
same hash value.  To look up a key, they key is hashed and the chain having
that hash value is scanned for the data.  Since all data in a given chain
have the same hash value, there is no need to save the hash value anywhere.

One optimization might be to further hash the data in each chain, obviously 
with a different hashing algorithm, and then rehash the key and scan
the chain for data having the same new hash value.  In this case,
you would have to save the new hash value along with the data.  You could
also have chains on top of chains, etc., etc., but a simple single-level
algorithm with one set of buckets is usually efficient enough.

|> ------------------------------------------------------------------
|> Paul F. Smith - ADPC - Purdue University     psmith@ecn.purdue.edu
|> <someday I'll think of something profound to put here, maybe>

Hope this helps,

Will

--------------------------------------------------------------------------------
Will Crowder, MTS            | "I belong to no organized politcal party.  
(willcr@ivy.isc.com)         |  I'm a democrat."  
INTERACTIVE Systems Corp.    |		-- Will Rogers