Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!pilchuck!dataio!uw-entropy!quick!srg
From: srg@quick.COM (Spencer Garrett)
Newsgroups: comp.std.internat
Subject: Re: International Collating Sequence
Message-ID: <129@quick.COM>
Date: Wed, 30-Sep-87 17:06:51 EDT
Article-I.D.: quick.129
Posted: Wed Sep 30 17:06:51 1987
Date-Received: Mon, 5-Oct-87 08:27:14 EDT
References: <2706@sol.ARPA>
Organization: Quicksilver Engineering, Seattle
Lines: 55

In article <2706@sol.ARPA>, crowl@cs.rochester.edu (Lawrence Crowl) writes:
> I submit that we need not only an international character code, but an
> international collating sequence as well.  Such a sequence should be very
> simple.  There should be no "double letter" rules or unnatural separation
> of accented letters from base letters.  I see no reason not to embed the
> collating sequence within the numeric codes for the characters.

Absolutely.

> Note that many letter forms in Latin, Greek, and Cyrillic are the same.  It
> is possible to merge these three alphabets into a single alphabet.  This will
> involve some re-ordering of the letters from at least two of the original
> alphabets, but not a great deal.  I do not know whether this is a good idea or
> not, I just thought I would mention it.  Of course, we still have Arabic,
> Hebrew, Kanji, Kana, etc. to incorporate.

Technically very difficult and probably politically impossible.

> Perhaps a better approach is to start from scratch with a new character
> standard.  One designed from the start to accommodate international needs.
> I am willing to translate my files to a new character set.  Are you?

I think this has the seeds of a good idea, and I would be willing to shift
to a new character set to accomplish it.  I'd like to suggest that it's
important for the alphabetic portion of the code to fit within 8 bits,
though, or the storage cost associated with shifting to the new code
will be prohibitive.  This wouldn't have to include katakana or hiragana
and couldn't possibly include kanji.  The JIS presently uses two 7-bit
codes per symbol and reaches them through a "shift-out" sequence from
a more-or-less standard ASCII.  There are way too many kanji to fit into
8 bits, and the notion of "collating sequence" doesn't really apply to
them.  (Actually, a clever encoding might make this a new "feature".)
Katakana and hiragana couldn't coexist with anything else in 8 bits
and they're presently encoded in 14 (really 16) bits, so retaining a 2-byte
encoding wouldn't cause any pain.  If we used an "escape to k-h" followed
by a byte to encode the character itself, then these characters would at
least collate together when mixed with this new international alphabet,
and would collate correctly with each other, all without changing the
semantics of strcmp().  (Perhaps there should be a separate escape to
each, but you get the idea.)  Perhaps the escape to kanji would be followed
by two 8-bit bytes?  If the escape codes, at least, were standardized, then
terminals that weren't set up to handle kanji could at least know how to
skip them and perhaps display an "unknown symbol" code in their place.
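
To make the escape idea concrete, here is a rough sketch in Python.  Every
name and byte value in it is my own assumption for illustration (ESC_KANA,
ESC_KANJI, encode(), render()); none of it comes from JIS or any other
standard.  Python compares bytes objects byte by byte as unsigned values,
which is the same ordering strcmp() gives, so it stands in for strcmp()
here.  Payload bytes are kept nonzero so the strings would also survive as
C strings.

```python
# Hypothetical byte values, chosen only so the escapes sort after
# every one-byte alphabetic code.  Not taken from any real standard.
ESC_KANA = 0xFE    # assumed "escape to k-h": one payload byte follows
ESC_KANJI = 0xFF   # assumed "escape to kanji": two payload bytes follow

# Number of payload bytes that follow each escape byte.
PAYLOAD = {ESC_KANA: 1, ESC_KANJI: 2}


def encode(symbols):
    """Encode (kind, code) pairs; kind is 'alpha', 'kana' or 'kanji'."""
    out = bytearray()
    for kind, code in symbols:
        if kind == "alpha":
            # One byte per letter, below both escape values, never zero.
            if not 1 <= code < ESC_KANA:
                raise ValueError("alphabetic code out of range")
            out.append(code)
        elif kind == "kana":
            if not 1 <= code <= 0xFF:
                raise ValueError("kana code out of range")
            out += bytes([ESC_KANA, code])
        elif kind == "kanji":
            hi, lo = code >> 8, code & 0xFF
            if not (1 <= hi <= 0xFF and 1 <= lo <= 0xFF):
                raise ValueError("kanji code out of range")
            out += bytes([ESC_KANJI, hi, lo])
        else:
            raise ValueError("unknown kind: %r" % (kind,))
    return bytes(out)


def render(data, known=()):
    """Return one display cell per symbol.

    An escape byte not listed in `known` is skipped along with its
    payload and shown as '?', the "unknown symbol" placeholder.
    Everything else is shown as hex, since this sketch has no fonts.
    """
    cells = []
    i = 0
    while i < len(data):
        lead = data[i]
        width = PAYLOAD.get(lead, 0)
        if width and lead not in known:
            cells.append("?")
        else:
            cells.append(data[i:i + 1 + width].hex())
        i += 1 + width
    return cells


if __name__ == "__main__":
    words = [
        encode([("kana", 2)]),
        encode([("alpha", 0x42)]),
        encode([("kana", 1)]),
        encode([("alpha", 0x41)]),
    ]
    # Plain bytewise sort: alphabetic words first, then the kana words
    # grouped together and ordered by their payload byte.
    print(sorted(words))
    mixed = encode([("alpha", 0x41), ("kanji", 0x3021), ("alpha", 0x42)])
    print(render(mixed))
```

The point of the design is that the sort needs no knowledge of the escapes
at all, while a terminal only needs the table of payload lengths to step
over symbols it cannot draw.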

The final (:->) problem is how to mix left-to-right and right-to-left
"horizontal" writing with eastern "vertical" writing.  Mixing the first
two is tricky, but already being done.  I have no idea how to add
"vertical" to the list.

Hmmm.  It just occurred to me that rewriting all the western languages in
a new alphabet and then trying to retain the existing Japanese script is
a bit inconsistent.  It's not too hard to phoneticize Japanese (they've
done it three times already, once using the Roman alphabet), so maybe they
should just join us in using this mythical new alphabet.  I don't know
if this is possible for Chinese and its relatives, however.  I suspect
it is not.