Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!linus!philabs!cmcl2!seismo!ut-sally!pyramid!decwrl!sun!guy
From: guy@sun.uucp (Guy Harris)
Newsgroups: net.internat
Subject: Re: Int'l character sets (Re: Are the funny letters really needed?)
Message-ID: <3268@sun.uucp>
Date: Fri, 21-Feb-86 15:26:32 EST
Article-I.D.: sun.3268
Posted: Fri Feb 21 15:26:32 1986
Date-Received: Mon, 24-Feb-86 06:03:33 EST
References: <172@bu-cs.UUCP> <1176@enea.UUCP>
Distribution: net
Organization: Sun Microsystems, Inc.
Lines: 167

> I don't think the "umlauts" should be moved elsewhere, at least not in
> ASCII. Move the braces and brackets somewhere else.

ASCII isn't likely to change; it's stable. Now ISO Latin Alphabet No. 1,
that's another story. They've already moved the "umlauts", so it's too
late. Dan Sahlins at The Royal Institute in Stockholm (ttds!dan) posted a
part of the ISO Draft standard. The lower 128 code points (to use IBMese
instead of English) are the same as they are in ASCII; the upper 128 code
points include the various alphabetic characters used in non-English
languages and some special characters (cent sign, pound sign, accents).
"Capital letter A with ring above" (as in "\oAngstrom") is in position
12/05, which in ANSI and ISO's somewhat opaque notation indicates the
character with code 12*16 + 5, i.e. hexadecimal C5.

Someone in an earlier article made reference to "the Danish Standard ISO
646 character set", and to other Scandinavian countries using their
national versions of ISO 646. Are those ISO 646 versions the current 7-bit
alphabets, with {, |, }, etc. replaced by additional letters?

> It just so happens that the ASCII I am used to represents my alphabet.
> (Almost: W isn't really part of it and oA is placed two steps wrong.)
> What would you think of a collating sequence like:
> A B C $ / # D E F ) = # and so forth.

The second sentence voids the first. What would an English speaker think
of a collating sequence like:

    "A B C F D E G ..."
(I presume that's what you mean by "oA is placed two steps wrong.")
"Almost" isn't good enough.

> I think one of basic ideas behind this conference is to make it possible
> not only for English-speaking people to not to have write special sort
> programs, but be able rely on standard program (like grep) or standard
> functions in programming languages. (like < >, etc for string comparisons
> in Pascal.)

GOOD LUCK. I don't think you stand a snowball's chance in hell of having
standard string comparison functions in programming languages doing the
right thing. From a posting by Lambert Meertens at the Centre for
Mathematics and Computer Science in Amsterdam (mcvax!lambert):

> 3. It is not really clear whether ij should be considered one letter or
> two letters representing one vowel (just like no-one would dream of calling
> aa a ligature). At school, Dutch kids are taught an alphabet ending in ...
> x, ij, y, z. Also, if a word starting with ij is capitalized, the result
> is always IJ (so ijspret, the joy of ice skating, becomes IJspret). Some
> Dutch typewriters have a separate ij key. If I use such a typewriter, I
> won't touch that key because the result is esthetically less satisfying
> than that of i+j.
>
> 5. Really conclusive would be the sorting convention. ...
> This, however, is anarchy. Most dictionaries sort ij like the two letters
> i+j, so ignorant < ijspret < illusoir. Most encyclopedias use the school
> alphabet, so Xenophobia < IJspret < Yggdrasil. The PTT sort on ij = y, so
> Wijchen < Wymbritseradeel < Wijngaarden. They have a very good reason for
> this: before standardization settled on ij, many Dutch family names had
> already fixed themselves on y; only different branches could have different
> spellings. So we have families De Bruyn next to families De Bruijn.
> Usually, you don't know which of the two is used officially; it is not even
> unheard of that a bearer of such a name doesn't know it themself unless
> they look it up in their passport or driver's licence.

And a subsequent posting:

> As a kind reader points out to me:
>
> + I think you are mistaken when you say that "rr" is sorted as a single
> + letter in Spanish. Although "ch" and "ll" do sort as single letters,
> + "rr" does not (even though it is considered to be a separate letter).
> + Perhaps this is because no Spanish words start with it.

From "The International Utilities Package" in "Inside Macintosh":

    Note: ... String comparison in Pascal yields very different
    results (from the "international string comparison" routines in
    Macintosh - gh), since it simply follows the ordering of the
    characters' ASCII codes.

These routines, from a quick reading of that section of "Inside
Macintosh", change their behavior depending on the setting of a global
flag indicating which language, etc. is in use.

So comparison of character strings depending on the national sorting rules
is a lot more complicated than comparison of character strings on a
byte-by-byte basis. As such, I think the position of characters within the
character set isn't really all that relevant. Sorting English-language
text may run faster, since ASCII happens to be set up with the letters in
the right order, but remember that "dictionary order" treats upper-case
and lower-case letters the same, so even there a straight byte-by-byte
comparison isn't always what you want.

> This of course also includes how things are represented on the screen and
> the keyboard.

Yes, screens will have to display national characters, and keyboards will
have to have keys for them. I don't mind that, although you'll probably
have to stuff {, |, }, etc. onto keyboards which currently don't have
them.

> So you're right, compilers will need to be rewritten.
> Not only to fit the different keyboards, but also the HUMAN BEEINGS
> behind them.

If the compiler accepts ISO Latin Alphabet No. 1, it won't have a problem.
{, |, } are all in that alphabet. The only reason a compiler would have to
be rewritten would be to support the 7-bit character sets, and the only
reason to do that would be if ISO Latin Alphabet No. 1, and keyboards
which allowed you to type in all the characters of that character set you
need (i.e., all of the lower 128 ASCII code points and all of the upper
128 code points you need in the languages you use), didn't become common.
If we end up stuck with 7-bit character sets and keyboards which have oA,
etc. instead of {, |, }, etc. rather than keyboards which have them in
addition, we'll be stuck with modifying compilers. Unfortunately, if that
happens, BNF will have to be rewritten as well, since it uses "|"....

> And of course it's a very big tail wagging a small dog. The tail is the
> vast majority of the people in the world who don't have English as their
> native language and the dog is those who do.

No, there are many dogs, and the Chinese one is not only bigger than the
Swedish one, it's bigger than the English-speaking one. Chinese won't even
fit into an 8-bit character set, and lord only knows *how* you sort
Chinese strings! If you warn people against Anglophone ethnocentrism,
beware of Western ethnocentrism....

On the subject of non-Western language support:

Note that AT&T is offering a version of System V which has been "turned
Japanese". It supports several two-byte and three-byte character sets; it
mentions JIS C6226 Kanji and JIS C6220 Kana. (According to Issue 2 of the
System V Interface Definition, Volume 2's section on Future Directions,
all the international character sets used by UNIX will be in conformance
with ISO standard 2022-1982. It also indicates that ISO Latin Alphabet
No. 1 is DIS 8859/1; I presume DIS is Draft International Standard.)
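
(An aside on the 7-bit case discussed above. A minimal sketch of why a
compiler, or at least its users, would suffer if keyboards and terminals
carried national letters instead of the brackets and braces. The table is
an assumption: it is the Swedish 7-bit substitution as I understand it,
not something quoted from a standard, though it is consistent with the
remark that "oA is placed two steps wrong". The names SWEDISH_7BIT and
as_seen_on_swedish_terminal are made up for the illustration.)

```python
# Hypothetical illustration: how C source stored as ASCII codes would be
# displayed on a terminal that uses a Swedish 7-bit national character set.
# ASSUMPTION: these six substitutions are the commonly cited Swedish variant;
# they are not taken from the text of any standard quoted in this article.
SWEDISH_7BIT = str.maketrans({
    "[": "Ä", "\\": "Ö", "]": "Å",
    "{": "ä", "|": "ö", "}": "å",
})


def as_seen_on_swedish_terminal(ascii_text: str) -> str:
    """Return the text with the national glyphs shown for the six codes."""
    return ascii_text.translate(SWEDISH_7BIT)


c_source = 'main() { if (a[0] || b) puts("hi"); }'
print(as_seen_on_swedish_terminal(c_source))
# prints: main() ä if (aÄ0Å öö b) puts("hi"); å
```

The codes in the file are unchanged; only the glyphs differ, which is why
an 8-bit set that keeps {, |, } and adds the letters avoids the problem.
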
The brochure AT&T handed out at UniForum indicates:

    addition of Japanese terminal and input attributes to "terminfo"

    addition of methods for entering Japanese characters, including a
    kana-to-kanji translation mechanism; they indicate two methods for
    entering Japanese characters, an "in-line kana to kanji module",
    whatever that means, and "jvi", which presumably stands for
    "Japanese vi"

    "Utility programs for preparation and maintenance of ESC and
    dictionary.

        o Extended characters font creation program
        o Extended character font load program
        o Dictionary maintenance program"

    (with no indication of what this all means, unfortunately).

    C language changes to support the use of Japanese characters in
    literals and comments - presumably, this just means the scanner has
    been changed to handle 8-bit characters and not get tripped up by
    character sequences, so this compiler presumably will be the
    standard C compiler in future UNIX releases and will work in any
    national environment.

    Changes to some commands to permit the processing of data written
    in Japanese (this, like the C compiler change, is listed as
    "International" rather than "Japanese", so presumably most of it
    will be part of future UNIX releases and will apply to all national
    environments). The changes include support of 8-bit character sets.

-- 
Guy Harris
{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
guy@sun.arpa (yes, really)