Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!uwvax!uwmacc!hobbes!root
From: root@hobbes.UUCP
Newsgroups: comp.std.internat
Subject: Character representation
Message-ID: <176@hobbes.UUCP>
Date: Sat, 15-Aug-87 02:14:30 EDT
Article-I.D.: hobbes.176
Posted: Sat Aug 15 02:14:30 1987
Date-Received: Sun, 16-Aug-87 08:39:51 EDT
References: <2171@enea.UUCP>
Reply-To: root@hobbes.UUCP (John Plocher)
Followup-To: comp.std.internat
Organization: U of Wisconsin - Madison Spanish Department
Lines: 57

+---- Erland Sommarskog writes the following in article <2171@enea.UUCP> ----
| In some languages the combination may constitute a new letter
| ("a" with ring and dots, "o" with dots in Swedish), in other you
| can apply accents and other signs without affecting the sorting.
| (E.g. French, Italian)
| The conclusion is that a more sophisticated approach must be taken.
| However, I must admit that I do not have any bright proposals right
| now, yet think of it!--
+----

I will be posting (as soon as I finish this note) a routine which we
use, called stracmp(), to the newsgroup comp.sources.misc.  It compares
two strings of 8-bit characters while taking into account the correct
collating sequence and precedence (if any) of accented letters.  The
routine is designed as a drop-in replacement for the common strcmp().

The code has been used on IBM-PCs, which support a limited set of
accented characters in their character display ROMs, and also with the
ISO-Latin-1 alphabet (this requires more sophisticated display
drivers).  It would be very simple to add the tables for Latin-2
through n if desired.  The code is not dependent on any particular
hardware, but it does assume the C compiler handles "unsigned char".

I am including some of the comments I made in the header to the code:

Description:
    stracmp() implements a string compare which correctly handles
    accented (non-English) characters which have been encoded as
    8-bit characters.
    It uses character lookup tables for doing string compares when
    accented characters are present and/or a non-ASCII collating
    sequence is desired.

Theory:
    The correct way of sorting (or comparing) strings which contain
    accented characters is to first compare the strings with all
    accents stripped.  If the two strings are the same, then and only
    then are the accents used.  This second comparison involves only
    the accents.  You can think of this as comparing the two strings
    with all the letters stripped.

    Also, there are times when the "normal" ASCII collating sequence
    is not appropriate for lexical ordering.  (ie. A B C D ...>

Examples:
                         ,  :
    Comparing Junta and Junta (the second word has diacritical marks
    over the two vowels)
        first we compare("Junta", "Junta") which shows them EQUAL
        then we must compare("     ", " '  :")
                              ,  :
    Thus, Junta comes before Junta in the lexical ordering of the
    two words.

               ,         ,
    Comparing Junta and Junto (both words have accented 'u's)
        first we compare("Junta", "Junto"); since they are different
        we do not need to do anything more with the accents:
      ,                    ,
    "Junta" is less than "Junto".

--
John Plocher  uwvax!geowhiz!uwspan!plocher  plocher%uwspan.UUCP@uwvax.CS.WISC.EDU
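
For readers who want to see the two-pass rule from the Theory section
in code, here is a minimal sketch.  It is NOT the stracmp() source
being posted to comp.sources.misc.  The function name two_pass_cmp(),
the table names, the handful of ISO Latin-1 codes covered and the
relative order of the accent classes are all assumptions made for
illustration only; a real table would cover the whole character set
and the collating sequence wanted.

```c
#include <stddef.h>

/* Accent classes.  Their relative order is an assumption of this sketch. */
enum { ACC_NONE = 0, ACC_ACUTE, ACC_GRAVE, ACC_CIRCUMFLEX, ACC_DIAERESIS };

/* Base letter for a few ISO Latin-1 codes; 0 means "the character itself". */
static const unsigned char base_tab[256] = {
    [0xE0] = 'a', [0xE1] = 'a', [0xE2] = 'a', [0xE4] = 'a',
    [0xF2] = 'o', [0xF3] = 'o', [0xF4] = 'o', [0xF6] = 'o',
    [0xF9] = 'u', [0xFA] = 'u', [0xFB] = 'u', [0xFC] = 'u'
};

/* Accent class for the same codes; everything else is ACC_NONE (0). */
static const unsigned char accent_tab[256] = {
    [0xE0] = ACC_GRAVE,      [0xE1] = ACC_ACUTE,
    [0xE2] = ACC_CIRCUMFLEX, [0xE4] = ACC_DIAERESIS,
    [0xF2] = ACC_GRAVE,      [0xF3] = ACC_ACUTE,
    [0xF4] = ACC_CIRCUMFLEX, [0xF6] = ACC_DIAERESIS,
    [0xF9] = ACC_GRAVE,      [0xFA] = ACC_ACUTE,
    [0xFB] = ACC_CIRCUMFLEX, [0xFC] = ACC_DIAERESIS
};

/* Strip the accent from one character using the lookup table. */
static unsigned char base_of(unsigned char c)
{
    return base_tab[c] ? base_tab[c] : c;
}

/* Hypothetical two-pass compare; returns <0, 0 or >0 like strcmp(). */
int two_pass_cmp(const char *s1, const char *s2)
{
    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;
    size_t i;

    /* Pass 1: compare the strings with all accents stripped. */
    for (i = 0; a[i] != '\0' || b[i] != '\0'; i++) {
        int d = (int)base_of(a[i]) - (int)base_of(b[i]);
        if (d != 0)
            return d;
    }

    /* Pass 2: the letters match (so the lengths do too);
       now compare only the accents, position by position. */
    for (i = 0; a[i] != '\0'; i++) {
        int d = (int)accent_tab[a[i]] - (int)accent_tab[b[i]];
        if (d != 0)
            return d;
    }
    return 0;
}
```

With these tables, two_pass_cmp("Junta", "J\xFAnt\xE4") is negative
(the letters match, so the accents decide), and
two_pass_cmp("J\xFAnta", "J\xFAnto") is negative on the first pass
alone.  A plain strcmp() on unsigned bytes would instead put
"J\xFAnta" after "Junto", because 0xFA is greater than 'u'.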