Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!att!dptg!ulysses!andante!alice!ark From: ark@alice.att.com (Andrew Koenig) Newsgroups: comp.lang.c Subject: Re: Comparing strings... (long but maybe worth it) Message-ID: <11485@alice.att.com> Date: 14 Oct 90 01:21:17 GMT References: <2205.271700c2@cc.nu.oz.au> Organization: AT&T Bell Laboratories, Liberty Corner NJ Lines: 217 In article <2205.271700c2@cc.nu.oz.au>, v8902058@cc.nu.oz.au writes: > I'm sorry if this is a trival task, but I would like a function that > compares two strings. I would like to be able to find if a string is equal to, > less than, or greater than another string. Usually I don't answer questions that look like homework assignments. I can't resist this one, though, seeing that someone else has already posted an answer and also that it gives a nice opportunity to explore some programming techniques that some people may find unfamiliar. The first thing to do is understand the problem thoroughly. As stated, we do not know what a string is, or what it means for one string to be less than another. First let's define a string as a null-terminated character array represented by a pointer to the initial element of the array. Note that this definition precludes treating a null pointer as a string; the program we write will therefore fail if either of its arguments is a null pointer or if either of the strings is not null-terminated. Next, we define an ordering relation on strings. Two strings must be equal, of course, if all their characters are equal. What if they aren't? Then there are two cases: (a) one string is a prefix of the other. In this case we will define the shorter string as smaller. (b) neither string is a prefix of the other. In this case we can scan the strings left to right until we find a character in one string that is not equal to the corresponding character in the other; comparing these two characters will determine which string is smaller. This definition corresponds to the usual notion of lexical ordering. For example, the null string is the smallest string of all, "a" is less than "aa", which in turn is less than "ab", and so on. This definition has an important property: if you pick a character and append that same character to the BEGINNING of each of two strings, you will not affect how the strings compare. So, for example, because "a" is less than "aa", "xa" is less than "xaa". This fact is easy to prove. If the strings started out equal, they will still be equal after the transformation. If not, you can go through both cases (a) and (b) above and easily convince yourself that whichever one applies, adding the extra character will not change the result. That suggests the following algorithm: (1) If both strings are null, they're equal. (2) Otherwise, if one string is null and the other one isn't, the null string is shorter. (3) Otherwise, both strings are non-null; if their initial characters are unequal, the result of that comparison is the result of the algorithm. (4) Otherwise, we chop the initial character off both strings and recursively apply the algorithm on the remainders. This is easy to translate into a C program. We assume the constants LESS, EQUAL, and GREATER are defined elsewhere. We return LESS if the first argument is less than the second, GREATER if the first argument is greater, and EQUAL if they're equal: int compare (char *p, char *q) { if (*p == '\0' && *q == '\0') return EQUAL; if (*p == '\0' && *q != '\0') return LESS; if (*p != '\0' && *q == '\0') return GREATER; if (*p < *q) return LESS; if (*p > *q) return GREATER; return compare(p+1, q+1); } We now have a program that accurately reflects the definition, and that may be good enough! However, let's assume that we want the program to run faster. A dangerous assumption, but it's a wretched, miserable, rainy night out and I want to have a little fun. First, let's look at the first two comparisons. We're comparing p to '\0' twice. That's silly; we already know the answer from the first time. Let's factor those two comparisons: if (*p == '\0') { if (*q == '\0') return EQUAL; return LESS; } /* ... */ Next we look at the third comparison and realize we know half the answer to that two: we can only get there if *p != '\0'. The first three comparisons now look like this: if (*p == '\0') { if (*q == '\0') return EQUAL; return LESS; } if (*q == '\0') return GREATER; The next trick is to eliminate the recursive function call. Because the function call is the last thing in our function, we can simply replace it by a set of assignments to establish the right initial conditions followed by a goto up to the top of the function! Some C compilers will do it for us, but it's more fun this way: int compare (char *p, char *q) { top: if (*p == '\0') { if (*q == '\0') return EQUAL; return LESS; } if (*q == '\0') return GREATER; if (*p < *q) return LESS; if (*p > *q) return GREATER; p++; q++; goto top; } Nobody likes goto statements these days, so let's replace this one by a `while' loop. It's easy to do this by moving the body of the first `if' outside the loop and making *p == '\0' the subject of the loop. Of course, we have to remember to invert the sense of this test. While we're at it, we can swap the order of the to comparisons of *p and *q to merge the two tests that might result in returning GREATER: int compare (char *p, char *q) { while (*p != '\0') { if (*q == '\0' || *p > *q) return GREATER; if (*p < *q) return LESS; p++; q++; } if (*q == '\0') return EQUAL; return LESS; } We can take advantage of the fact that if we use an integral value by itself in the comparison, that implicitly compares it to zero (which is identical to '\0'). That shortens the program slightly. We will also make p and q into register variables, and finally note that the comparison `*p > *q' can be merged with the increment operations on p and q (because if the `return' is executed, the values of p and q no longer matter). This makes the program significantly faster on some machine architectures and compilers: int compare (register char *p, register char *q) { while (*p) { if (!*q || *p > *q) return GREATER; if (*p++ < *q++) return LESS; } if (!*q) return EQUAL; return LESS; } Finally, we can merge the two `return' statements at the end into a single ?: operator: int compare (register char *p, register char *q) { while (*p) { if (!*q || *p > *q) return GREATER; if (*p++ < *q++) return LESS; } return *q? LESS: EQUAL; } By this time, the program should be thoroughly incomprehensible, even though it was obtained by a series of straightforward steps from the original problem definition! As an exercise, you might try proving that the program works in its current form, or finding out where I goofed if it doesn't work. In `Structured Programming with Goto Statements,' Don Knuth said he hoped for a system that he would tentatively name UTOPIA84 (because he thought someone might be able to get it done by 1984 -- this article was published in 1974) that would allow people to perform these kinds of program transformations mechanically while having the system check that correctness was being maintained. He didn't want the system to do these things automatically -- he felt (and I agree) that it is only worth doing this kind of optimization after you've measured a working program to see that it is actually worthwhile. I imagine that further optimizations are possible. Anyone else want to give it a try? -- --Andrew Koenig ark@europa.att.com