Newsgroups: comp.lang.c Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!sarah!bingnews!kym From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) Subject: Re: Efficient STRing CoMPares? Message-ID: <1991Mar20.213133.16160@bingvaxu.cc.binghamton.edu> Organization: State University of New York at Binghamton References: <15486@smoke.brl.mil> <1991Mar19.034802.24340@cbnewsk.att.com> <1991Mar20.174452.4246@zoo.toronto.edu> Date: Wed, 20 Mar 1991 21:31:33 GMT In article <1991Mar20.174452.4246@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: >In article <1991Mar19.034802.24340@cbnewsk.att.com> barton@cbnewsk.att.com (jim.m.barton) writes: >>... The above macro, StrEq(), MIGHT be faster if most >>of the strings being compared tend to differ in their first character... >>IMHO, I doubt that StrEq() will achieve any significant optimization for >>MOST cases... >Sorry, there is considerable experimental evidence that it does. The fact >is, most string comparisons fail on the very first character, and even a An easy method for verifying that two strings typically differ in their first character run the following (dumb) sorting program. For the first 10k words out of /usr/dict/words scrambled up in essentially random order, it says 89% of the comparisons failed on their first character. This is not much of a test, granted, and there _will_ be some cases (e.g. sorting nearly-sorted lists of words) where comparisons _wont_ be decided by the first characters. However, even in these latter cases the overhead of an extra `compare' versus the overhead of the strcmp loop will be small -- please feel free to experimentally verify this. (As a back of the envelope calculation we can say that if it costs S instructions for the strcmp code and C instruction(s) for the compare then on average we will perform C + .11 S instructions. We can then say ``how much would C have to be to make it worth using only S?'' We find break even at C = .89 S. Therefore, provided the compare sequence (including jumps and register loads) on your machine is less than 89% of a (typical) call/inline of strcmp then USE THE COMPARE). As a slight side note there are certain contexts where string comparisons can be made MUCH more efficient than a character-by-character compare. The well-known `string search' algorithms that essentially compare _rare_ characters within each string (e.g. if one string contains a `z' then first search for `z' in the other string -- if found then match the rest) & then adjust pointers by an appropriate amount (rather than just incrementing them) depending on the outcome of each character comparison is one example. I remember reading an article in Software Practice and Experience that compared such a statistical method with string-matching INSTRUCTIONS (I _think_ it was on an IBM 370 so that dates me doesnt it?) and the algorithm was found to be SUBSTANTIALLY faster (like an order-of-magnitude, unless I'm mistaken) than the hardware string compare. -kym /* Dumb sort program that figgas out how many strings (mis)match on first character -- it says 11% match for /usr/dict/words. */ #include #include #define MAXWORD 10000 char *word[MAXWORD]; int nw=0; int nmatch1=0; int ncmp=0; main(){ init(); scramble(); sort(); printf("fraction match first char=%g\n",(double)nmatch1/ncmp); } init() { int i; char buf[100]; for(i=0;i