Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!snorkelwacker.mit.edu!bloom-picayune.mit.edu!news From: scs@adam.mit.edu (Steve Summit) Newsgroups: comp.lang.c Subject: Re: Efficient STRing CoMPares? Summary: hashing Message-ID: <1991Mar20.020957.9180@athena.mit.edu> Date: 20 Mar 91 02:09:57 GMT References: <1991Mar15.142821@mars.mpr.ca> <15486@smoke.brl.mil> <1991Mar19.034802.24340@cbnewsk.att.com> Sender: news@athena.mit.edu (News system) Reply-To: scs@adam.mit.edu Organization: Thermal Technologies, Cambridge, MA Lines: 49 In article <1991Mar19.034802.24340@cbnewsk.att.com> barton@cbnewsk.att.com (jim.m.barton) writes: >A more fruitful approach might be to examine your data more closely to >see if some of the strcmp()'s could be avoided. For example, if many >of the string's being compared are actually part a restricted set of >values rather than arbitrary strings, you might be able to represent them >internally as enumerated values, (where integer comparisons could be done)... Indeed. When it's important to do so, reducing the number of explicit string comparisons is not hard, even when the strings are neither restricted nor known in advance. Hashing is a very powerful and surprisingly easy technique to use. I was just writing a diff-like program the other day. By representing strings not with individual char *'s, but rather with instances of this structure: struct chunk { char *c_ptr; unsigned int c_hash; }; , and by keeping the c_hash field filled in with a hash value derived from the string pointed to by c_ptr, I could compare strings for equality with this macro: #define Chunkeq(cp1, cp2) ((cp1)->c_hash == (cp2)->c_hash && \ strcmp((cp1)->c_ptr, (cp2)->c_ptr) == 0) When you have a lot of strings to compare, the time spent computing hash values, and the extra space to store them, can be well worth it. You can also write great little symbol table implementations using hashing, and the hash values don't even have to take up extra space (they're implicit in the bucket or slot indices). (These techniques are closely related to the "string pool" modules mentioned by Doug and Chris. In fact, a string pool would almost certainly be written using hashing for string insertion.) See any algorithms book for much more on hashing; it's got nothing particularly to do with C. (Please don't ask "what's the best string hash function?", which is an FAQ, though not on the list yet.) Steve Summit scs@adam.mit.edu