Path: utzoo!utgpu!cs.utexas.edu!wuarchive!decwrl!ucbvax!ucsd!orion.oac.uci.edu!cedman
From: cedman@lynx.ps.uci.edu (Carl Edman)
Newsgroups: alt.sources
Subject: Re: Fast strcmp() wanted.
Message-ID: <CEDMAN.90Oct4171710@lynx.ps.uci.edu>
Date: 5 Oct 90 00:17:10 GMT
References: <1646@cherry.edc.UUCP> <OTTO.90Sep27145643@tukki.jyu.fi>
	<CEDMAN.90Sep27075013@lynx.ps.uci.edu>
	<1990Sep27.151543.8025@ccs.carleton.ca>
	<CEDMAN.90Sep29091115@lynx.ps.uci.edu> <6003@hplabsz.HPL.HP.COM>
	<1145@exodus.Eng.Sun.COM>
Organization: non serviam
Lines: 61
Nntp-Posting-Host: lynx.ps.uci.edu
In-reply-to: falk@peregrine.Sun.COM's message of 4 Oct 90 21:03:47 GMT

In article <1145@exodus.Eng.Sun.COM> falk@peregrine.Sun.COM (Ed Falk) writes:
   In article <6003@hplabsz.HPL.HP.COM> sartin@hplabs.hp.com (Rob Sartin) writes:
   >In article <CEDMAN.90Sep29091115@lynx.ps.uci.edu> cedman@lynx.ps.uci.edu (Carl Edman) writes:
   >>string structure (or better class, long live C++ ! :-) which calculates
   >>a 32-bit CRC for each string the first time and stores it somewhere.
   >>Then only 1 (inlined) longword-compare will do the stringcomparisons
   >>for you.
   >
   >Afraid not.  It'll give you an estimate of whether the strings match
   >(correctly identifying those that don't).  You will need to then
   >actually compare the strings if they are the same.  This method will
   >also be unable to reproduce strcmp's behavior (strcmp returns a signed
   >result indicated the <, =, > by being negative, zero, positive), it will
   >only return a boolean (match, no match).

   Also, these two strings

	   "ab\0x"
	   "ab\0y"

   (where x and y are any garbage that happens to be in memory after the
   terminating '\0') will be evaluated as unequal.

   There are workarounds for both problems, of course, but I think there
   won't be much of a performance improvement after you've done all it
   requires to get it right.

Please , take note ! I did NOT suggest that this method is always the
best, but I gave 3 or 4 different methods, and some hints, about when
I would use which one.
And I still hold that the CRC-method could be by far the most
efficient under some conditions:

	- each word is checked often. Then a small overhead like f.e.
	cleaning up the garbage behind the end of the string for the
	one time calculation wouldn't matter

	- strings are long, so the overhead of 32-bit per string is
	justified

	- you only need to know if 2 strings are identical

	- most compares between unequal strings

You can even imagine systems where it could be possible to remove
the actual string from memory, and simply assume that if the 32-bit
CRC match, the strings match. Such systems would have to be tolerant
about an occasional mismatch.

If memory serves correctly the above approach is used in some
implementations for diff. (only to give one practical, real world
example)

	Carl Edman


Theorectial Physicist,N.:A physicist whose   | Send mail
existence is postulated, to make the numbers |  to
balance but who is never actually observed   | cedman@golem.ps.uci.edu
in the laboratory.                           | edmanc@uciph0.ps.uci.edu