Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!iuvax!rutgers!cmcl2!phri!marob!cowan
From: cowan@marob.MASA.COM (John Cowan)
Newsgroups: comp.std.internat
Subject: Xerox Character Code Standard (was 7-bit ASCII vs. 8-bit ASCII)
Message-ID: <622@marob.MASA.COM>
Date: 23 Apr 89 00:42:56 GMT
References: <2568@ndsuvax.UUCP> <5153@hubcap.clemson.edu> <1468@auspex.auspex.com>
Reply-To: cowan@marob.masa.com (John Cowan)
Distribution: usa
Organization: ESCC New York City
Lines: 46

In article <1468@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>Well, a number of companies are starting to pick up some level of
>support for the ISO 8859 character sets - "8-bit-clean" software, ISO
>8859/n fonts (for n == 1, at least, and maybe for other values of n),
>support for the pANSI C internationalization stuff and/or the X/Open
>internationalization stuff, etc.
>
>Unfortunately, one form of inertia is represented by ISO 646 character
>set terminals; I think there are enough of them around now so that
>they'll have to be dealt with in the short term (e.g., with code to
>translate various national 646 character sets to 8859 character sets on
>input and to do the reverse translation as best as can be done on
>output).
>

Xerox has devised a very slick 16-bit character standard for use with
their various Interlisp and Office Automation workstations, and with
Interpress printers.  Unfortunately (typical Xerox) it hasn't migrated
to the rest of the world yet.

The 16-bit space is divided into 255 "character sets" containing 255
"character codes" each.  Not every set is in use, and not every code is
in use in every set, but 65025 characters is pretty generous.  Set #0 is
ISO Latin #1, so 7-bit ASCII and ISO are upward compatible just by
adding 8 bits of zeros at the high order end.  Other ISO character sets
are also used; however, redundancies are stripped -- thus "A" is only
character code 65 in character set 0, and does not appear in any other
character set.

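To make the set/code split concrete, here is a rough sketch in Python of
packing a (character set, character code) pair into one 16-bit value, with
the set in the high-order byte.  The function names are my own invention,
not anything from the Xerox documents, and I am assuming the 255 sets and
255 codes are numbered 0 through 254.

```python
MAX_INDEX = 254  # assumption: sets and codes are numbered 0..254


def to_xerox16(charset, code):
    """Pack a (character set, character code) pair into a 16-bit value.

    Hypothetical helper for illustration: the set goes in the high-order
    byte and the code in the low-order byte.
    """
    if not (0 <= charset <= MAX_INDEX and 0 <= code <= MAX_INDEX):
        raise ValueError("character set and code must be in 0..254")
    return (charset << 8) | code


def from_xerox16(value):
    """Split a 16-bit value back into (character set, character code)."""
    return value >> 8, value & 0xFF


def widen_ascii(text):
    """Widen 7-bit ASCII bytes to 16 bits by adding a zero high byte."""
    return [to_xerox16(0, byte) for byte in text]
```

With this layout "A" comes out as the 16-bit value 65, i.e. code 65 in
set 0, which is the upward compatibility described above.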
There are character sets for Cyrillic, Greek, Hebrew, Arabic, Korean
hangul, Japanese katakana, Japanese hiragana, Chinese bopomofo, etc.
There is also a large block of character sets reserved to represent
Japanese kanji -- this part is bit-for-bit compatible with the 16-bit
kanji standard of JIS.

There are several character sets for oddball symbols, one for line
drawing graphics, and the character sets 224-254 are reserved for
"rendering" characters, like the fi and fl ligatures and the special
initial, medial, and final forms of Arabic letters, as well as "Old
Style" digits.  Note that the character set code does >not< represent
font-and-face information.

To prevent the tremendous wastage of space which would occur when
representing running text in "full" 16-bit form (which is defined by the
standard to be a big-endian form, with character set preceding character
code) a special compression format is defined.  Compressed strings
represent only character codes, and the character set defaults to zero.
The sequence of 255 followed by a byte means "change to the character
set numbered by the byte".  Therefore, regular ASCII strings are
automatically in compressed Xerox format, since their characters are
already in set 0!  A double 255 serves as an escape from the character
set altogether, possibly to a whole different 16-bit character universe
(!); the current universe is numbered 1, for future expansion.
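For the curious, the compression scheme can be sketched in a few lines of
Python.  This is only my reading of the rules as described above, not code
from Xerox: the names decode_compressed and encode_compressed are made up,
and the double-255 escape is deliberately left unhandled because I have not
described what follows it.

```python
ESCAPE = 255  # byte that introduces a character set change


def decode_compressed(data):
    """Expand a compressed string into (character set, code) pairs."""
    charset = 0  # the character set defaults to zero
    pairs = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte != ESCAPE:
            pairs.append((charset, byte))
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("escape byte 255 at end of string")
        following = data[i + 1]
        if following == ESCAPE:
            # Double 255 leaves the character set mechanism entirely;
            # this sketch does not attempt to interpret it.
            raise NotImplementedError("double-255 escape not handled")
        charset = following  # "change to the character set numbered by the byte"
        i += 2
    return pairs


def encode_compressed(pairs):
    """Compress (character set, code) pairs, emitting set changes as needed."""
    out = bytearray()
    charset = 0
    for cs, code in pairs:
        if not (0 <= cs <= 254 and 0 <= code <= 254):
            raise ValueError("character set and code must be in 0..254")
        if cs != charset:
            out += bytes([ESCAPE, cs])
            charset = cs
        out.append(code)
    return bytes(out)
```

A plain ASCII string such as b"Hi" decodes to [(0, 72), (0, 105)] with no
escapes at all.  A string that wanders into another set (38 below is an
arbitrary set number picked for illustration) costs two extra bytes per
switch: bytes([65, 255, 38, 97, 98, 255, 0, 66]) decodes to
[(0, 65), (38, 97), (38, 98), (0, 66)].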