Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!snorkelwacker.mit.edu!bloom-picayune.mit.edu!news
From: scs@adam.mit.edu (Steve Summit)
Newsgroups: comp.lang.c
Subject: Re: Efficient STRing CoMPares?
Summary: hashing
Message-ID: <1991Mar20.020957.9180@athena.mit.edu>
Date: 20 Mar 91 02:09:57 GMT
References: <1991Mar15.142821@mars.mpr.ca> <15486@smoke.brl.mil> <1991Mar19.034802.24340@cbnewsk.att.com>
Sender: news@athena.mit.edu (News system)
Reply-To: scs@adam.mit.edu
Organization: Thermal Technologies, Cambridge, MA
Lines: 49

In article <1991Mar19.034802.24340@cbnewsk.att.com> barton@cbnewsk.att.com (jim.m.barton) writes:
>A more fruitful approach might be to examine your data more closely to
>see if some of the strcmp()'s could be avoided. For example, if many
>of the string's being compared are actually part a restricted set of
>values rather than arbitrary strings, you might be able to represent them
>internally as enumerated values, (where integer comparisons could be done)...

Indeed.  When it's important to do so, reducing the number of
explicit string comparisons is not hard, even when the strings
are neither restricted nor known in advance.  Hashing is a very
powerful and surprisingly easy technique to use.

I was just writing a diff-like program the other day.  By
representing strings not with individual char *'s, but rather
with instances of this structure:

	struct chunk
		{
		char *c_ptr;
		unsigned int c_hash;
		};

, and by keeping the c_hash field filled in with a hash value
derived from the string pointed to by c_ptr, I could compare
strings for equality with this macro:

	#define Chunkeq(cp1, cp2) ((cp1)->c_hash == (cp2)->c_hash && \
				strcmp((cp1)->c_ptr, (cp2)->c_ptr) == 0)

When you have a lot of strings to compare, the time spent
computing hash values, and the extra space to store them,
can be well worth it.

You can also write great little symbol table implementations
using hashing, and the hash values don't even have to take up
extra space (they're implicit in the bucket or slot indices).

(These techniques are closely related to the "string pool"
modules mentioned by Doug and Chris.  In fact, a string pool
would almost certainly be written using hashing for string
insertion.)

See any algorithms book for much more on hashing; it's got
nothing particularly to do with C.  (Please don't ask "what's
the best string hash function?", which is an FAQ, though not on
the list yet.)

                                            Steve Summit
                                            scs@adam.mit.edu