Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ncar!ico!ism780c!ism780b!greger
From: greger@ism780b (Greger Leijonhufvud)
Newsgroups: comp.std.internat
Subject: Re: 7-bit ASCII vs. 8-bit ASCII
Message-ID: <26644@ism780c.isc.com>
Date: 25 Apr 89 04:06:59 GMT
References: <2568@ndsuvax.UUCP> <5153@hubcap.clemson.edu> <1468@auspex.auspex.com> <Apr.19.10.41.28.1989.7554@paul.rutgers.edu>
Sender: news@ism780c.isc.com
Reply-To: greger@ism780b.UUCP (Greger Leijonhufvud)
Distribution: usa
Organization: Interactive Systems Corp., Santa Monica CA
Lines: 50

In article <Apr.19.10.41.28.1989.7554@paul.rutgers.edu> halldors@paul.rutgers.edu (Magnus M Halldorsson) writes:
>The ISO 8859 character sets specify sets for specific languages. Now
>what if one wants to use a combination of those? Is there any standard
>for storing, representing, and switching between various (ISO)
>character sets? What if one wants to allow for Japanese or Chinese as
>well?
>
>Magnus

There are several standardized (and several not yet blessed) techniques for
"mixing codesets". The /usr/group Subcommittee on Internationalization
has been studying several techniques for a while, and may even propose
something to POSIX (or whoever the appropriate forum is).

The AT&T "EUC" (Extended UNIX Codes) method is the only one so far
implemented within UNIX for "internal use". This was done in Japan, 
because the Japanese language typically is written with 3 different 
script systems (Kanji, Katakana and Hiragana). 
The EUC scheme is based on the ISO 2022 single-shift coding:

	7-bit ASCII is always present as code set 0.
	All other code sets must have the high-order bit set
	in all bytes.
	Code set 1 is distinguished by the high-order bit alone,
	with no prefix.
	Code set 2 has the high-order bit set, and each character
	is preceded by the ISO 2022 SS2 (0x8e) character.
	Code set 3 has the high-order bit set, and each character
	is preceded by the ISO 2022 SS3 (0x8f) character.
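
To make the rules above concrete, here is a rough sketch in Python of
how a program might split an EUC byte string into per-code-set
characters. The name split_euc and the WIDTHS table are made up for
illustration. The bytes-per-character widths are an assumption (the
values shown are the ones commonly used for Japanese), since the
widths are chosen per locale and are not fixed by the scheme itself.

```python
SS2 = 0x8E  # ISO 2022 single shift 2: prefixes each code set 2 character
SS3 = 0x8F  # ISO 2022 single shift 3: prefixes each code set 3 character

# ASSUMPTION: bytes per character, not counting the SS2/SS3 prefix.
# These are typical Japanese values; other locales pick other widths.
WIDTHS = {1: 2, 2: 1, 3: 2}


def split_euc(data, widths=WIDTHS):
    """Split EUC-encoded bytes into a list of (code_set, char_bytes)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # High-order bit clear: 7-bit ASCII, code set 0.
            out.append((0, data[i:i + 1]))
            i += 1
            continue
        if b == SS2:
            code_set, i = 2, i + 1   # skip the single-shift prefix
        elif b == SS3:
            code_set, i = 3, i + 1
        else:
            code_set = 1             # high bit set, no prefix
        width = widths[code_set]
        char = data[i:i + width]
        # Every byte of a non-ASCII character must have the high bit set.
        if len(char) != width or any(c < 0x80 for c in char):
            raise ValueError("malformed EUC sequence at offset %d" % i)
        out.append((code_set, char))
        i += width
    return out


print(split_euc(b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"))
```

Because ASCII bytes never appear inside a multi-byte character, code
that only looks for ASCII characters such as "/" or newline can scan
the string a byte at a time without decoding it.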

This scheme supports (in theory) 4 different code sets. For 8859
compatible code sets, of course, it only supports 3 (as ASCII is
part of each code set), and it does not support code sets that do
not conform to ISO 2022 (such as the IBM Extended ASCII used on
PCs, or the Shift-JIS code set).
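
A small illustration of why Shift-JIS does not fit, using the "euc_jp"
and "shift_jis" codecs that ship with Python. The helper high_bits is
made up for this example. It encodes one Katakana character both ways
and checks which bytes have the high-order bit set.

```python
def high_bits(encoded):
    """Return, for each byte, whether its high-order bit is set."""
    return [b >= 0x80 for b in encoded]


text = "\u30bd"                   # KATAKANA LETTER SO
euc = text.encode("euc_jp")       # EUC form: every byte has the high bit set
sjis = text.encode("shift_jis")   # second byte is 0x5C, the ASCII backslash

print(euc.hex(), high_bits(euc))
print(sjis.hex(), high_bits(sjis))
```

The Shift-JIS trail byte falls in the 7-bit range, so it breaks the
rule that every byte of a non-ASCII character has the high-order bit
set.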

A more generalized scheme is the "Compound String" method, also endorsed
by ISO. It may very well become the X Window System encoding scheme for
interchange or internal representation.

There are also other encoding schemes, by Sun, Xerox and other
companies.

Unfortunately, there is no single agreed standard as yet. From V.4
on, however, you should be able to mix Icelandic with Bulgarian, and
get your Greek quotations OK, too.

Greger Leijonhufvud
Interactive Systems Corp.
Sunny Santa Monica, Ca.
uunet!ism780c!greger