Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!caen!spool.mu.edu!munnari.oz.au!goanna!ok
From: ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe)
Newsgroups: comp.lang.c
Subject: Re: strcmp
Message-ID: <6447@goanna.cs.rmit.oz.au>
Date: 21 Jun 91 12:04:36 GMT
Article-I.D.: goanna.6447
References: <2695@m1.cs.man.ac.uk> <1991Jun18.074029.12226@panix.uucp> <677424380@romeo.cs.duke.edu>
Organization: Comp Sci, RMIT, Melbourne, Australia
Lines: 80

In article <677424380@romeo.cs.duke.edu>, drh@duke.cs.duke.edu (D. Richard Hipp) writes:
> I have, on various occasions, implemented my own string comparison
> routines which attempt to address the above deficiencies in strcmp.
> (One such implementation, strpbcmp -- string compare in PhoneBook order,
> is attached.)

The routine posted does NOT compare strings in Phone Book order.
Here are the rules from a Phone Book:

    Names are divided into two parts for sorting.  The first part, or
    the first word, determines the place to find the name.  The second
    part, all the initials or remaining words (including locality and
    telephone number) determine the order within that group.

    Business names which begin with "the" are generally sorted under
    the next word.

    Punctuation and special characters within a name will generally
    not alter their alphabetical position and should be ignored.

    When initials precede a name, they will be treated as the first
    name, regardless of punctuation.

    If the name contains a number, the numeric character will be
    sorted as though it were a word (i.e. 1 = one).  In some cases,
    names which commence with numerals will be found under the name
    as it is pronounced.

    A prefix is included as part of the first word even if it is
    separated from the second part of the name by a hyphen.

(This one is _really_ fun.  You have to know that "Le Blanc" has a
prefix "Le" while "Le Tseung" probably hasn't, so that the latter name
precedes the first.)
    Names which contain a hyphen are treated as two words and are
    sorted according to the first name.  This does not apply to
    hyphenated names which begin with a prefix.

    "Mc" is treated as though spelt "Mac".  Names such as "Mace" and
    "Mack" are sorted with those names which commence with "Mc" and
    "Mac".

    "Mt" is treated as though spelt "Mount".  Names such as "Mount"
    appear first, followed by names which have "Mt" or "Mount" as the
    first part of their name.

    Names beginning with "St" are treated as though beginning with
    "Saint" (same rules as Mt/Mount).

This isn't really adequate; McDonald may also be spelled M'Donald, and
"St" is sometimes abbreviated to S, so "S. Adam Parish School" should be
sorted with "Saint-Adam", but isn't.

It is worth noting that 'phone book order is not the same as dictionary
order.  There really wasn't any one order that C could have used.

> I therefore request option from the net on what others think is the
> one right, true, and proper way to compare strings.

There isn't any.

You might like to imitate the approach in ANSI C.  There are two
functions which give you access to the local collating method
(see setlocale() / LC_COLLATE).  There is a function strxfrm():
	strxfrm(dest, source, /*length? I forget*/)
produces in dest a ``normalised'' copy of source, and returns the
length of this copy.  Comparing two normalised copies using strcmp()
then does the right thing.
	strcoll(s1, s2)
has the same effect as normalising s1 and s2 separately, then
comparing them with strcmp.

What you want to do is to provide any number of normalising functions
that take your fancy, and use strcmp() to compare normalised results.
If you do it this way, then you can also use your comparison method
with an external sort: when you write the file to be sorted, put the
normalised version first, then a mark, then the real data.  Sort
(letting the external sort use the same rule as strcmp), then strip off
the normalised prefixes.
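
A minimal sketch of the normalise-then-strcmp() scheme, assuming a
deliberately crude normaliser (keep letters and digits, fold to lower
case, drop everything else); it is nowhere near real phone book order.
The names struct entry, normalise_key, compare_keys, sort_entries and
compare_via_strxfrm are made up for illustration.  Only strxfrm(),
strcoll(), strcmp() and qsort() come from the standard library; the
third argument of strxfrm() is the size of the destination buffer.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: the key is computed once, before sorting. */
struct entry {
    const char *name;   /* the real data */
    char key[64];       /* normalised copy, used only for comparison */
};

/* Hypothetical normaliser: keep letters and digits, fold to lower case,
   drop everything else.  Silently truncates if dst is too small. */
void normalise_key(char *dst, size_t dst_size, const char *src)
{
    size_t n = 0;

    assert(dst_size > 0);
    for (; *src != '\0' && n + 1 < dst_size; src++) {
        unsigned char c = (unsigned char)*src;
        if (isalnum(c))
            dst[n++] = (char)tolower(c);
    }
    dst[n] = '\0';
}

/* qsort() callback: nothing but strcmp() on the precomputed keys. */
int compare_keys(const void *a, const void *b)
{
    const struct entry *ea = a;
    const struct entry *eb = b;

    return strcmp(ea->key, eb->key);
}

/* Normalise every entry once, then sort using the cheap comparison. */
void sort_entries(struct entry *v, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        normalise_key(v[i].key, sizeof v[i].key, v[i].name);
    qsort(v, count, sizeof v[0], compare_keys);
}

/* The same idea using the library's own normaliser, strxfrm(). */
int compare_via_strxfrm(const char *s1, const char *s2)
{
    char k1[256], k2[256];

    /* strxfrm() returns the length it needs; if that does not fit,
       the buffer contents are unusable, so fall back to strcoll(). */
    if (strxfrm(k1, s1, sizeof k1) >= sizeof k1 ||
        strxfrm(k2, s2, sizeof k2) >= sizeof k2)
        return strcoll(s1, s2);
    return strcmp(k1, k2);
}
```

Calling sort_entries() on an array holding "O'Keefe", "Mc-Donald" and
"hipp" should leave them in the order hipp, Mc-Donald, O'Keefe, since
the keys are okeefe, mcdonald and hipp.  A real phone book normaliser
would slot in where normalise_key is, without touching the sort.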

Note: when you are sorting, you want the very fastest comparison you
can get.  Sorting a bunch of names by normalising them, then sorting
the normalised versions using strcmp(), is going to be a *LOT* faster
than sorting using your strpbcmp or anything like it.
-- 
I agree with Jim Giles about many of the deficiencies of present UNIX.