Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!uwvax!uwmacc!hobbes!root
From: root@hobbes.UUCP
Newsgroups: comp.std.internat
Subject: Character representation
Message-ID: <176@hobbes.UUCP>
Date: Sat, 15-Aug-87 02:14:30 EDT
Article-I.D.: hobbes.176
Posted: Sat Aug 15 02:14:30 1987
Date-Received: Sun, 16-Aug-87 08:39:51 EDT
References: <2171@enea.UUCP>
Reply-To: root@hobbes.UUCP (John Plocher)
Followup-To: comp.std.internat
Organization: U of Wisconsin - Madison  Spanish Department
Lines: 57

+---- Erland Sommarskog writes the following in article <2171@enea.UUCP> ----
| In some languages the combination may constitute a new letter
| ("a" with ring and dots, "o" with dots in Swedish), in other you
| can apply accents and other signs without affecting the sorting.
| (E.g. French, Italian)
|   The conclusion is that a more sophisticated approach must be taken.
| However, I must admit that I do not have any bright proposals right
| now, yet think of it!-- 
+----

I will be posting (as soon as I finish this note) a routine we use, called
stracmp(), to the newsgroup comp.sources.misc.  It compares two strings of
8-bit characters while taking into account the correct collating sequence
and precedence (if any) of accented letters.  The routine is designed as a
drop-in replacement for the common strcmp().

The code has been used on IBM PCs, which support a limited set of accented
characters in their character display ROMs, and also with the ISO Latin-1
alphabet (which requires more sophisticated display drivers).  It would be
very simple to add tables for Latin-2 through Latin-n if desired.  The code
is not dependent on any particular hardware, but it does assume that the
C compiler supports the "unsigned char" type.

I am including some of the comments I made in the header to the code:

Description:
	stracmp() implements a string compare which correctly handles
	accented (non-English) characters which have been encoded using
	8-bit characters.  It uses character lookup tables for doing
	string compares when accented characters are present and/or a
	non-ASCII collating sequence is desired.
Theory:
	  The correct way of sorting (or comparing) strings which contain
	accented characters is to first compare the strings with all accents
	stripped. If the two strings are the same, then and only then are the
	accents used.  This second comparison involves only the accents.
	You can think of this as comparing the two strings with all the letters
	stripped.
	  Also, there are times when the "normal" ASCII collating sequence is
	not appropriate for lexical ordering (e.g. when the desired order is
	A <AE> B C <CEDILLA> D ...).
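
To make the collating-sequence point concrete, here is a minimal sketch of
a rank table.  It is not taken from stracmp(); the names coll_rank,
init_rank() and rank_cmp() are hypothetical, the character codes assume
ISO Latin-1 (0xC6 = AE, 0xC7 = C with cedilla), and only the six letters
from the ordering above are filled in.

```c
/* Hypothetical rank table: coll_rank[c] is the position of character c
   in the desired collating sequence (0 = not ranked in this sketch). */
unsigned char coll_rank[256];

/* Fill the table for the order  A <AE> B C <C-cedilla> D.
   The codes 0xC6 and 0xC7 assume ISO Latin-1. */
void init_rank(void)
{
    static const unsigned char order[] = { 'A', 0xC6, 'B', 'C', 0xC7, 'D' };
    unsigned i;

    for (i = 0; i < 256; i++)
        coll_rank[i] = 0;
    for (i = 0; i < sizeof order; i++)
        coll_rank[order[i]] = (unsigned char)(i + 1);
}

/* Compare two characters by collating rank rather than by character code.
   The arguments are unsigned char so that codes above 127 index the table
   correctly instead of becoming negative subscripts. */
int rank_cmp(unsigned char a, unsigned char b)
{
    return (int)coll_rank[a] - (int)coll_rank[b];
}
```

After init_rank(), rank_cmp(0xC6, 'B') is negative, so <AE> sorts between
A and B even though its character code is larger than that of 'B'.
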
Examples:
			     ,  :
	Comparing Junta and Junta	(the second word has diacritical
					 marks over the two vowels)
	    first we compare("Junta", "Junta")	which shows them EQUAL
	then we must compare("     ", " '  :")
				  ,  :
	Thus, Junta comes before Junta in the lexical ordering of the two words.
		   ,          ,
	Comparing Junta  and Junto	(both words have accented 'u's)
	    first we compare("Junta", "Junto"); since they are
	different, we do not need to do anything more with the accents:
	  ,                    ,
	"Junta" is less than "Junto".
 
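The two-pass comparison in the examples above can be sketched as follows.
This is an illustration only, not the stracmp() source: the function names
(base_of, accent_of, accent_cmp) are hypothetical, the codes assume ISO
Latin-1, only the accented letters used in the examples are handled, and
the relative order of the accent classes is an assumption.

```c
#include <stddef.h>

/* Accent classes for the second pass (their order is an assumption). */
enum { ACC_NONE = 0, ACC_ACUTE = 1, ACC_DIAERESIS = 2 };

/* Strip the accent: map an ISO Latin-1 code to its base letter. */
unsigned char base_of(unsigned char c)
{
    switch (c) {
    case 0xE1: case 0xE4: return 'a';   /* a-acute, a-diaeresis */
    case 0xFA: case 0xFC: return 'u';   /* u-acute, u-diaeresis */
    default:              return c;
    }
}

/* Strip the letter: keep only the accent class of a character. */
unsigned char accent_of(unsigned char c)
{
    switch (c) {
    case 0xE1: case 0xFA: return ACC_ACUTE;
    case 0xE4: case 0xFC: return ACC_DIAERESIS;
    default:              return ACC_NONE;
    }
}

/* Returns <0, 0 or >0 in the manner of strcmp(). */
int accent_cmp(const char *s, const char *t)
{
    const unsigned char *a = (const unsigned char *)s;
    const unsigned char *b = (const unsigned char *)t;
    size_t i;

    /* Pass 1: compare the strings with all accents stripped. */
    for (i = 0; a[i] != '\0' && base_of(a[i]) == base_of(b[i]); i++)
        ;
    if (base_of(a[i]) != base_of(b[i]))
        return (int)base_of(a[i]) - (int)base_of(b[i]);

    /* Pass 2: the base letters are identical (so the lengths match);
       now, and only now, compare the accents position by position. */
    for (i = 0; a[i] != '\0'; i++)
        if (accent_of(a[i]) != accent_of(b[i]))
            return (int)accent_of(a[i]) - (int)accent_of(b[i]);
    return 0;
}
```

With these tables, accent_cmp("Junta", "J\xFAnt\xE4") is negative (decided
in pass 2), while accent_cmp("J\xFAnta", "J\xFAnto") is negative after
pass 1 alone, matching the two examples.
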
-- 
John Plocher uwvax!geowhiz!uwspan!plocher  plocher%uwspan.UUCP@uwvax.CS.WISC.EDU