Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ncar!ico!ism780c!ism780b!greger
From: greger@ism780b (Greger Leijonhufvud)
Newsgroups: comp.std.internat
Subject: Re: 7-bit ASCII vs. 8-bit ASCII
Message-ID: <26644@ism780c.isc.com>
Date: 25 Apr 89 04:06:59 GMT
References: <2568@ndsuvax.UUCP> <5153@hubcap.clemson.edu> <1468@auspex.auspex.com>
Sender: news@ism780c.isc.com
Reply-To: greger@ism780b.UUCP (Greger Leijonhufvud)
Distribution: usa
Organization: Interactive Systems Corp., Santa Monica CA
Lines: 50

In article halldors@paul.rutgers.edu (Magnus M Halldorsson) writes:
>The ISO 8859 character sets specify sets for specific languages. Now
>what if one wants to use a combination of those? Is there any standard
>for storing, representing, and switching between various (ISO)
>character sets? What if one wants to allow for Japanese or Chinese as
>well?
>
>Magnus

There are several standardized (and several not yet blessed) techniques
for "mixing codesets". The /usr/group Subcommittee on
Internationalization has been studying several techniques for a while,
and may even propose something to POSIX (or whoever the appropriate
forum is).

The AT&T "EUC" (Extended UNIX Codes) method is the only one so far
implemented within UNIX for "internal use". This was done in Japan,
because the Japanese language typically is written with 3 different
script systems (Kanji, Katakana and Hiragana).

The EUC scheme is based on the ISO 2022 single-shift coding:

    Code set 0 is 7-bit ASCII, and is always present. All other code
    sets must have the high-order bit set in all bytes.

    Code set 1 is distinguished by the high-order bit alone; it has no
    prefix byte.

    Code set 2 has the high-order bit set, and each character is
    preceded by the ISO 2022 SS2 (8e) character.

    Code set 3 has the high-order bit set, and each character is
    preceded by the ISO 2022 SS3 (8f) character.

This scheme supports (in theory) 4 different code sets.
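
The four rules above can be sketched as a small decoder that splits a byte stream into (code set, character) pairs. This is only an illustration, not the AT&T implementation: the function name split_euc and the per-set character widths are assumptions (the widths shown follow the usual Japanese layout), and real implementations restrict the legal byte ranges further than the plain "high-order bit set" test used here.

```python
SS2 = 0x8E  # ISO 2022 single shift 2, introduces one code set 2 character
SS3 = 0x8F  # ISO 2022 single shift 3, introduces one code set 3 character

# Assumed widths in bytes per character, not counting the SS2/SS3 prefix.
# The post does not give widths; these follow the common Japanese layout.
DEFAULT_WIDTHS = {1: 2, 2: 1, 3: 2}


def split_euc(data, widths=DEFAULT_WIDTHS):
    """Split an EUC byte string into (code_set, character_bytes) pairs.

    Hypothetical helper for illustration only.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # High-order bit clear: code set 0 (7-bit ASCII), one byte.
            code_set, start, n = 0, i, 1
        elif b == SS2:
            # SS2 prefix: the following bytes belong to code set 2.
            code_set, start, n = 2, i + 1, widths[2]
        elif b == SS3:
            # SS3 prefix: the following bytes belong to code set 3.
            code_set, start, n = 3, i + 1, widths[3]
        else:
            # High-order bit set and no prefix: code set 1.
            code_set, start, n = 1, i, widths[1]
        chunk = data[start:start + n]
        # Every byte of a non-ASCII character must have the high bit set.
        if len(chunk) != n or (code_set != 0 and any(c < 0x80 for c in chunk)):
            raise ValueError("malformed EUC sequence at offset %d" % i)
        out.append((code_set, chunk))
        i = start + n
    return out


sample = b"A" + bytes([0xA4, 0xA2, SS2, 0xB1, SS3, 0xB0, 0xA1]) + b"z"
for code_set, char in split_euc(sample):
    print(code_set, char.hex())
```

Passing widths={1: 1, 2: 1, 3: 1} would model the single-byte case, where each of the three non-ASCII slots holds the upper half of a different 8-bit code set.
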
For 8859-compatible code sets, of course, the EUC scheme supports only
3 (as ASCII is part of each code set), and it does not support code
sets that do not conform to ISO 2022 (such as the IBM Extended ASCII
used on PCs, or the Shift-JIS code set).

A more generalized scheme is the "Compound String" method, also
endorsed by ISO. It may very well be the X Window System encoding
scheme for interchange or internal representation.

There are also other encoding schemes, by Sun, Xerox and other
companies.

There is, however, no standard as yet. Unfortunately.

But, from V.4, you should be able to mix Icelandic with Bulgarian, and
get your Greek quotations OK, too.

Greger Leijonhufvud
Interactive Systems Corp.
Sunny Santa Monica, Ca.
uunet!ism780c!greger