Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!pilchuck!dataio!uw-entropy!quick!srg
From: srg@quick.COM (Spencer Garrett)
Newsgroups: comp.std.internat
Subject: Re: International Collating Sequence
Message-ID: <129@quick.COM>
Date: Wed, 30-Sep-87 17:06:51 EDT
Article-I.D.: quick.129
Posted: Wed Sep 30 17:06:51 1987
Date-Received: Mon, 5-Oct-87 08:27:14 EDT
References: <2706@sol.ARPA>
Organization: Quicksilver Engineering, Seattle
Lines: 55

In article <2706@sol.ARPA>, crowl@cs.rochester.edu (Lawrence Crowl) writes:
> I submit that we need not only an international character code, but an
> international collating sequence as well.  Such a sequence should be very
> simple.  There should be no "double letter" rules or unnatural separation
> of accented letters from base letters.  I see no reason not to embed the
> collating sequence within the numeric codes for the characters.

Absolutely.

> Note that many letter forms in Latin, Greek, and Cyrillic are the same.  It
> is possible to merge these three alphabets into a single alphabet.  This will
> involve some re-ordering of the letters from at least two of the original
> alphabets, but not a great deal.  I do not know whether this is a good idea or
> not, I just thought I would mention it.  Of course, we still have Arabic,
> Hebrew, Kanji, Kana, etc. to incorporate.

Technically very difficult and probably politically impossible.

> Perhaps a better approach is to start from scratch with a new character
> standard.  One designed from the start to accommodate international needs.
> I am willing to translate my files to a new character set.  Are you?

I think this has seeds of a good idea, and I would be willing to shift
to a new character set to accomplish it.  I'd like to suggest that it's
important for the alphabetic portion of the code to fit within 8 bits,
though, or the storage cost associated with shifting to the new code
will be prohibitive.
This wouldn't have to include katakana or hiragana and couldn't
possibly include kanji.  The JIS presently uses two 7-bit codes per
symbol and reaches them through a "shift-out" sequence from a
more-or-less standard ASCII.  There are way too many kanji to fit into
8 bits, and the notion of "collating sequence" doesn't really apply to
them.  (Actually, a clever encoding might make this a new "feature".)

Katakana and hiragana couldn't coexist with anything else in 8 bits,
and they're presently encoded in 14 (really 16) bits, so retaining a
2-byte encoding wouldn't cause any pain.  If we used an "escape to k-h"
followed by a byte to encode the character itself, then these characters
would at least collate together when mixed with this new international
alphabet, and would collate correctly with each other, all without
changing the semantics of strcmp().  (Perhaps there should be a separate
escape to each, but you get the idea.)  Perhaps the escape to kanji
would be followed by two 8-bit bytes?  If the escape codes, at least,
were standardized, then terminals which weren't set up to handle kanji
could at least know how to skip them and perhaps display an "unknown
symbol" code in their place.

The final (:->) problem is how to mix l->r and r->l "horizontal" writing
with eastern "vertical" writing.  Mixing the first two is tricky, but
already being done.  I have no idea how to add "vertical" to the list.

Hmmm.  It just occurred to me that rewriting all the western languages
in a new alphabet and then trying to retain the existing Japanese script
is a bit inconsistent.  It's not too hard to phoneticize Japanese
(they've done it 3 times already, once using the roman alphabet), so
maybe they should just join us in using this mythical new alphabet.
I don't know if this is possible for Chinese and its relatives, however.
I suspect it is not.
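
To make the escape-byte idea above a little more concrete, here is a
rough sketch in Python.  Everything specific in it is my own assumption
for illustration: the escape values 0xFE and 0xFF, the names encode()
and show_without_kanji(), and the rule that no byte may be zero (so a
NUL-terminated C string still works).  Comparing Python bytes objects
orders them byte by byte as unsigned values, which stands in here for
what strcmp() does.

```python
# Hypothetical escape values; no standard assigns these.
ESC_KANA = 0xFE    # followed by 1 payload byte
ESC_KANJI = 0xFF   # followed by 2 payload bytes


def encode(symbols):
    """Encode (kind, code) pairs; kind is 'alpha', 'kana' or 'kanji'."""
    out = bytearray()
    for kind, code in symbols:
        if kind == "alpha" and 1 <= code < ESC_KANA:
            # One byte per letter; the value itself is the collating rank.
            out.append(code)
        elif kind == "kana" and 1 <= code <= 0xFF:
            out += bytes([ESC_KANA, code])
        elif kind == "kanji" and 0 <= code <= 0xFFFF:
            hi, lo = divmod(code, 256)
            if hi == 0 or lo == 0:
                raise ValueError("kanji payload bytes must be nonzero")
            out += bytes([ESC_KANJI, hi, lo])
        else:
            raise ValueError("bad symbol: %r" % ((kind, code),))
    return bytes(out)


def show_without_kanji(data, unknown=b"?"):
    """What a terminal with no kanji font might do: skip and substitute."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC_KANJI:
            out += unknown           # drop the escape and both payload bytes
            i += 3
        elif data[i] == ESC_KANA:
            out += data[i:i + 2]     # pass the kana pair through untouched
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


words = [
    encode([("kana", 2)]),
    encode([("alpha", 0x42)]),
    encode([("kana", 1)]),
    encode([("alpha", 0x41)]),
]
ordered = sorted(words)   # plain bytewise comparison, no special rules
```

With these assumed values, the two alphabetic words sort first and the
two kana words sort after them as a group, in payload order, because
every kana symbol starts with the same high escape byte.  A terminal
that only knows the escape lengths can still step over a kanji it
cannot draw.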