Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!linus!philabs!cmcl2!seismo!ut-sally!pyramid!decwrl!sun!guy
From: guy@sun.uucp (Guy Harris)
Newsgroups: net.internat
Subject: Re: Int'l character sets (Re: Are the funny letters really needed?)
Message-ID: <3268@sun.uucp>
Date: Fri, 21-Feb-86 15:26:32 EST
Article-I.D.: sun.3268
Posted: Fri Feb 21 15:26:32 1986
Date-Received: Mon, 24-Feb-86 06:03:33 EST
References: <172@bu-cs.UUCP> <1176@enea.UUCP>
Distribution: net
Organization: Sun Microsystems, Inc.
Lines: 167

> I don't think the "umlauts" should be moved elsewhere, at least not in
> ASCII.  Move the braces and brackets somewhere else.

ASCII isn't likely to change; it's stable.  Now ISO Latin Alphabet No. 1,
that's another story.  They've already moved the "umlauts", so it's too
late.  Dan Sahlins at The Royal Institute in Stockholm (ttds!dan) posted a
part of the ISO Draft standard.  The lower 128 code points (to use IBMese
instead of English) are the same as they are in ASCII; the upper 128 code
points include the various alphabetic characters used in non-English
languages and some special characters (cent sign, pound sign, accents).
"Capital letter A with ring above" (as in "\oAngstrom") is in position
12/05, which in ANSI and ISO's somewhat opaque notation indicates the
character with code 12*16 + 5, i.e. hexadecimal C5.
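
To make that arithmetic concrete, here is a small sketch in Python.  The
function name "code_point" is my own invention for illustration; it isn't
taken from the draft standard or anything else.

```python
def code_point(column, row):
    """Convert 'column/row' notation (e.g. 12/05) to a character code."""
    # Each column holds 16 rows, so both halves must fit in 4 bits.
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in 0..15")
    return column * 16 + row


print(hex(code_point(12, 5)))   # prints 0xc5
```

The same rule covers the lower half of the table: 7/11 comes out as
hexadecimal 7B, which is "{" in ASCII.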

Someone in an earlier article made reference to "the Danish Standard ISO 646
character set", and to other Scandinavian countries using their national
versions of ISO 646.  Is ISO 646 the standard behind the current 7-bit
alphabets with {, |, }, etc. replaced with additional letters?

> It just so happens that the ASCII I am used to represents my alphabet.
> (Almost: W isn't really part of it and oA is placed two steps wrong.)
> What would you think of a collating sequence like: 
>    A B C $ / # D E F ) = #  and so forth.

The second sentence voids the first.  What would an English speaker think of
a collating sequence like "A B C F D E G ..."?  (I presume that's what you
mean by "oA is placed two steps wrong".)  "Almost" isn't good enough.

> I think one of basic ideas behind this conference is to make it possible
> not only for English-speaking people to not to have write special sort 
> programs, but be able rely on standard program (like grep) or standard 
> functions in programming languages. (like < >, etc for string comparisons
> in Pascal.)

GOOD LUCK.  I don't think you stand a snowball's chance in hell of having
standard string comparison functions in programming languages doing the
right thing.  From a posting by Lambert Meertens at the Centre for
Mathematics and Computer Science in Amsterdam (mcvax!lambert):

> 3.  It is not really clear whether ij should be considered one letter or
> two letters representing one vowel (just like no-one would dream of calling
> aa a ligature).  At school, Dutch kids are taught an alphabet ending in ...
> x, ij, y, z.  Also, if a word starting with ij is capitalized, the result
> is always IJ (so ijspret, the joy of ice skating, becomes IJspret).  Some
> Dutch typewriters have a separate ij key.  If I use such a typewriter, I
> won't touch that key because the result is esthetically less satisfying
> than that of i+j.
>
> 5.  Really conclusive would be the sorting convention.  ...
> This, however, is anarchy.  Most dictionaries sort ij like the two letters
> i+j, so ignorant < ijspret < illusoir.  Most encyclopedias use the school
> alphabet, so Xenophobia < IJspret < Yggdrasil.  The PTT sort on ij = y, so
> Wijchen < Wymbritseradeel < Wijngaarden.  They have a very good reason for
> this: before standardization settled on ij, many Dutch family names had
> already fixed themselves on y; only different branches could have different
> spellings.  So we have families De Bruyn next to families De Bruijn.
> Usually, you don't know which of the two is used officially; it is not even
> unheard of that a bearer of such a name doesn't know it themself unless
> they look it up in their passport or driver's licence.
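
The three conventions he describes can each be written as a different sort
key.  What follows is only a sketch in Python under my own assumptions: the
function names are made up, the words are assumed to contain nothing but the
letters a-z, and every "i" followed by "j" is treated as the digraph, which
real Dutch text won't always justify.

```python
# School alphabet as described in the quoted posting: ... x, ij, y, z
SCHOOL_ALPHABET = list("abcdefghijklmnopqrstuvwx") + ["ij", "y", "z"]


def dictionary_key(word):
    """Most dictionaries: ij sorts as the two letters i + j."""
    return word.lower()


def encyclopedia_key(word):
    """Most encyclopedias: ij is one letter between x and y."""
    w = word.lower()
    ranks = []
    i = 0
    while i < len(w):
        if w.startswith("ij", i):
            letter = "ij"
            i += 2
        else:
            letter = w[i]
            i += 1
        # Raises ValueError for anything outside a-z (see assumptions above).
        ranks.append(SCHOOL_ALPHABET.index(letter))
    return ranks


def ptt_key(word):
    """The PTT: ij sorts as if it were y."""
    return word.lower().replace("ij", "y")


print(sorted(["illusoir", "ijspret", "ignorant"], key=dictionary_key))
print(sorted(["Yggdrasil", "IJspret", "Xenophobia"], key=encyclopedia_key))
print(sorted(["Wijngaarden", "Wymbritseradeel", "Wijchen"], key=ptt_key))
```

Each call reproduces the corresponding ordering from the quoted text.  The
point is that the same characters need three different keys, and none of them
falls out of where "ij" happens to sit in the character set.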

And a subsequent posting:

> As a kind reader points out to me:
>
> + I think you are mistaken when you say that "rr" is sorted as a single
> + letter in Spanish.  Although "ch" and "ll" do sort as single letters,
> + "rr" does not (even though it is considered to be a separate letter).
> + Perhaps this is because no Spanish words start with it.

From "The International Utilities Package" in "Inside Macintosh":

	Note: ... String comparison in Pascal yields very different
	results (from the "international string comparison" routines
	in Macintosh - gh), since it simply follows the ordering of
	the characters' ASCII codes.

These routines, from a quick reading of that section of "Inside Macintosh",
change their behavior depending on the setting of a global flag indicating
which language, etc. is in use.

So comparison of character strings depending on the national sorting rules
is a lot more complicated than comparison of character strings on a
byte-by-byte basis.  As such, I think the position of characters within the
character set isn't really all that relevant.  Sorting English-language text
may run faster, since ASCII happens to be set up with the letters in the
right order, but remember that "dictionary order" treats upper-case and
lower-case letters the same, so even there a straight byte-by-byte
comparison isn't always what you want.
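
A quick illustration of that last point, as a sketch in Python.  The word
list is made up, and folding case with "str.lower" is only a rough stand-in
for dictionary order in English; it knows nothing about any national rules.
Python compares strings by character code, which for plain ASCII text gives
the same result as a byte-by-byte comparison.

```python
words = ["apple", "Zebra", "Banana", "cherry"]

# Straight comparison of character codes: every upper-case letter
# sorts ahead of every lower-case letter in ASCII.
by_code = sorted(words)

# Rough "dictionary order": fold case before comparing.
by_dictionary = sorted(words, key=str.lower)

print(by_code)        # ['Banana', 'Zebra', 'apple', 'cherry']
print(by_dictionary)  # ['apple', 'Banana', 'cherry', 'Zebra']
```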

> This of course also includes how things are represented on the screen and
> the keyboard.

Yes, screens will have to display national characters, and keyboards will
have to have keys for them.  I don't mind that, although you'll probably
have to stuff {, |, }, etc. onto keyboards which currently don't have them.

> So you're right, compilers will need to be rewritten. Not only to fit the
> different keyboards, but also the HUMAN BEEINGS behind them.

If the compiler accepts ISO Latin Alphabet No. 1, it won't have a problem.
{, |, } are all in that alphabet.  The only reason a compiler would have to
be rewritten would be to support the 7-bit character sets, and the only
reason to do that would be if ISO Latin Alphabet No. 1 didn't become common,
along with keyboards which let you type all the characters of that set you
need (i.e., all of the lower 128 ASCII code points and whichever of the
upper 128 code points the languages you use require).  If we end
up stuck with 7-bit character sets and keyboards which have oA, etc. instead
of {, |, }, etc. rather than keyboards which have them in addition, we'll be
stuck with modifying compilers.  Unfortunately, if that happens, BNF will
have to be rewritten as well, since it uses "|"....

> And of course it's a very big tail wagging a small dog. The tail is the
> vast majority of the people in the world who don't have English as their
> native language and the dog is those who do.

No, there are many dogs, and the Chinese one is not only bigger than the
Swedish one, it's bigger than the English-speaking one.  Chinese won't even
fit into an 8-bit character set, and lord only knows *how* you sort Chinese
strings!  If you warn people against Anglophone ethnocentrism, beware of
Western ethnocentrism....

On the subject of non-Western language support:

Note that AT&T is offering a version of System V which has been "turned
Japanese".  It supports several two-byte and three-byte character sets; it
mentions JIS C6226 Kanji and JIS C6220 Kana.  (According to Issue 2 of the
System V Interface Definition, Volume 2's section on Future Directions, all
the international character sets used by UNIX will be in conformance with
ISO standard 2022-1982.  It also indicates that ISO Latin Alphabet No. 1 is
DIS 8859/1; I presume DIS is Draft International Standard.)  The brochure
AT&T handed out at UniForum indicates:

	addition of Japanese terminal and input attributes to "terminfo"

	addition of methods for entering Japanese characters, including
	a kana-to-kanji translation mechanism; they indicate two methods
	for entering Japanese characters, an "in-line kana to kanji module",
	whatever that means, and "jvi", which presumably stands for
	"Japanese vi"

	"Utility programs for preparation and maintenance of ESC and
	dictionary.

		o Extended characters font creation program

		o Extended character font load program

		o Dictionary maintenance program"
	(with no indication of what this all means, unfortunately).

	C language changes to support the use of Japanese characters
	in literals and comments - presumably, this just means the
	scanner has been changed to handle 8-bit characters and not
	get tripped up by character sequences, so this compiler presumably
	will be the standard C compiler in future UNIX releases and
	will work in any national environment.

	Changes to some commands to permit the processing of data written
	in Japanese (this, like the C compiler change, is listed as
	"International" rather than "Japanese", so presumably most of
	it will be part of future UNIX releases and will apply to all
	national environments).  The changes include support of 8-bit
	character sets.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.arpa	(yes, really)