Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!iuvax!rutgers!cmcl2!phri!marob!cowan
From: cowan@marob.MASA.COM (John Cowan)
Newsgroups: comp.std.internat
Subject: Xerox Character Code Standard (was 7-bit ASCII vs. 8-bit ASCII)
Message-ID: <622@marob.MASA.COM>
Date: 23 Apr 89 00:42:56 GMT
References: <2568@ndsuvax.UUCP> <5153@hubcap.clemson.edu> <1468@auspex.auspex.com>
Reply-To: cowan@marob.masa.com (John Cowan)
Distribution: usa
Organization: ESCC  New York City
Lines: 46

In article <1468@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>Well, a number of companies are starting to pick up some level of
>support for the ISO 8859 character sets - "8-bit-clean" software, ISO
>8859/n fonts (for n == 1, at least, and maybe for other values of n),
>support for the pANSI C internationalization stuff and/or the X/Open
>internationalization stuff, etc.
>
>Unfortunately, one form of inertia is represented by ISO 646 character
>set terminals; I think there are enough of them around now so that
>they'll have to be dealt with in the short term (e.g., with code to
>translate various national 646 character sets to 8859 character sets on
>input and to do the reverse translation as best as can be done on
>output). 
>

Xerox has devised a very slick 16-bit character standard for use with their
various Interlisp and Office Automation workstations, and with Interpress
printers.  Unfortunately (typical Xerox) it hasn't migrated to the rest of
the world yet.  The 16-bit space is divided into 255 "character sets"
containing 255 "character codes" each.  Not every set is in use, and not
every code is in use in every set, but 65025 characters is pretty generous.
Set #0 is ISO Latin #1, so 7-bit ASCII and ISO Latin #1 text are upward
compatible just by adding 8 bits of zeros at the high order end.  Other ISO
character sets are
also used; however, redundancies are stripped -- thus "A" is only character
code 65 in character set 0, and does not appear in any other character set.
There are character sets for Cyrillic, Greek, Hebrew, Arabic, Korean hangul,
Japanese katakana, Japanese hiragana, Chinese bopomofo, and so on.  There is
also a large block of character sets reserved to represent Japanese kanji
-- this part is bit-for-bit compatible with the 16-bit kanji standard
of JIS.  There are several character sets for oddball symbols, one for
line drawing graphics, and the character sets 224-254 are reserved for 
"rendering" characters, like the fi and fl ligatures and the special initial,
medial, and final forms of Arabic letters, as well as "Old Style" digits.
Note that the character set code does >not< represent font-and-face information.
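
To make the arithmetic concrete, here is a minimal sketch in Python.  It is
not taken from the Xerox standard; the names pack and unpack are made up for
illustration, and it assumes that sets and codes are both numbered 0 through
254, which is what "255 of each" suggests.

```python
# Hypothetical helpers, not part of any Xerox software.
MAX_INDEX = 254  # assumption: sets and codes both run 0..254


def pack(char_set, char_code):
    """Combine a character set and a character code into one 16-bit value."""
    if not (0 <= char_set <= MAX_INDEX and 0 <= char_code <= MAX_INDEX):
        raise ValueError("set and code must each be in 0..254")
    # The set occupies the high-order 8 bits, the code the low-order 8 bits.
    return (char_set << 8) | char_code


def unpack(value):
    """Split a 16-bit value back into (character set, character code)."""
    return value >> 8, value & 0xFF


# "A" is code 65 in set 0, so its 16-bit value is just 65 with a zero
# high-order byte.
assert pack(0, ord("A")) == 65
```

Because set 0 contributes nothing but zero bits, any ASCII value passes
through pack unchanged, which is the upward compatibility described above.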

To prevent the tremendous wastage of space which would occur when representing
running text in "full" 16-bit form (which is defined by the standard to be a
big-endian form, with character set preceding character code) a special
compression format is defined.  Compressed strings represent only character
codes, and the character set defaults to zero.  The sequence of 255 followed
by a byte means "change to the character set numbered by the byte".
Therefore, regular ASCII strings are automatically in compressed Xerox format,
since their characters are already in set 0!  A double 255 serves as an
escape from the character set altogether, possibly to a whole different
16-bit character universe (!); the current universe is numbered 1, which
leaves room for future expansion.
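
The compression rule is simple enough to sketch as a decoder.  What follows
is a rough Python illustration, not the standard's own algorithm: the
function names are invented, it assumes a set change stays in effect until
the next one, and since nothing above says what bytes follow a double 255,
it simply refuses to go further when it meets one.

```python
# Hypothetical decoder for the compressed format described above.
ESCAPE = 255


def decode_compressed(data):
    """Expand compressed bytes into (character set, character code) pairs."""
    pairs = []
    current_set = 0  # the character set defaults to zero
    i = 0
    while i < len(data):
        byte = data[i]
        if byte != ESCAPE:
            # An ordinary byte is a character code in the current set.
            pairs.append((current_set, byte))
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("string ends in the middle of a 255 sequence")
        selector = data[i + 1]
        if selector == ESCAPE:
            # Double 255: the escape out of the character set altogether.
            # Its continuation is not described here, so stop.
            raise NotImplementedError("double-255 escape is not handled")
        # Assumption: the new set applies until the next 255 sequence.
        current_set = selector
        i += 2
    return pairs


def to_full_form(pairs):
    """Write pairs in "full" 16-bit form: set byte first, then code byte."""
    out = bytearray()
    for char_set, char_code in pairs:
        out += bytes((char_set, char_code))
    return bytes(out)


# A plain ASCII string is already a valid compressed string in set 0.
assert decode_compressed(b"Hi") == [(0, 72), (0, 105)]
```

For example, the bytes 65, 255, 38, 1, 2, 255, 0, 66 decode to code 65 in
set 0, codes 1 and 2 in set 38, and code 66 back in set 0 (38 is an
arbitrary set number picked for the example).  That is eight bytes for four
characters, which is no saving here, but a long run in one set costs only
two extra bytes in total.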