Path: utzoo!attcan!uunet!auspex!guy
From: guy@auspex.UUCP (Guy Harris)
Newsgroups: comp.windows.x
Subject: Re: 8 bits per char
Message-ID: <1073@auspex.UUCP>
Date: 26 Feb 89 01:50:10 GMT
References: <8902211720.AA14715@internal.apple.com> <722@acorn.co.uk>
Reply-To: guy@auspex.UUCP (Guy Harris)
Organization: Auspex Systems, Santa Clara
Lines: 48

>Even if it transmits an 8 bit character (say from the upper half of the
>multinational character set), in UN*X the tty will normally clobber the
>eighth bit anyway.

Only if your UNIX's tty driver is old and crufty.  More modern ones can
be told not to strip the 8th bit on either input or output while still
running in "cooked" or "cbreak" mode (i.e., you don't have to go into
"raw" mode to get an 8-bit data path).
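
As a rough sketch of what "telling" the driver might look like, assuming
a POSIX-style termios interface (reached here through Python's termios
module) rather than any particular vendor's driver: the helper names
eight_bit_clean and make_eight_bit_clean are made up for illustration.
The point is that ICANON is never touched, so the line stays "cooked".

```python
import termios


def eight_bit_clean(attrs):
    """Return a copy of a termios attribute list with 8th-bit stripping off.

    attrs has the layout returned by termios.tcgetattr():
    [iflag, oflag, cflag, lflag, ispeed, ospeed, cc].
    The local flags (including ICANON) are left alone, so a cooked-mode
    line stays in cooked mode.
    """
    new = list(attrs)
    new[0] &= ~termios.ISTRIP                    # input: do not strip the 8th bit
    new[2] &= ~(termios.CSIZE | termios.PARENB)  # no parity bit eating bit 8
    new[2] |= termios.CS8                        # 8 data bits in both directions
    return new


def make_eight_bit_clean(fd):
    """Apply the change to an open tty or pty file descriptor (hypothetical helper)."""
    termios.tcsetattr(fd, termios.TCSANOW,
                      eight_bit_clean(termios.tcgetattr(fd)))
```

Something like make_eight_bit_clean(0) on the pty's slave side would
then be the rough equivalent of "stty cs8 -istrip -parenb".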

>As an experiment I switched my xterm pty into raw mode and typed the
><pound sterling> key on my keyboard (this generates a keycode with a
>suggested keysym of XK_sterling).

(Said keysym being, as I remember, the ISO Latin #1 code for "pound
sterling".)

>I regretted it - it would seem that some 8 bit character did, in fact,
>get through, because my csh promptly died (csh uses the ``spare'' bit
>in input characters while parsing the line, if it receives a byte with
>the top bit set it screws up :-(.

Eventually, more modern C shells will handle 8-bit characters as well. 
Some may already do so.

The bottom line is: *don't* use the inadequacies of some current UNIX
implementations as an excuse not to support 8-bit character sets in X11
terminal emulators; said inadequacies will not stick around forever.

>This area is a total mess - but its not X's fault - and a real solution
>would be a major change to most of the computer worlds preconceived
>ideas.  After all, what use is an extra bit if you want to transmit
>Chinese or Japanese characters?

If you want to transmit them using "EUC" code sets, the extra bit is
*quite* useful.  In the Japanese EUC set, bytes with the 8th bit not set
represent ASCII characters; bytes with the 8th bit set represent
characters from (what used to be called) the JIS 6226 and JIS 6220 sets,
or various "private" character sets.  6226 is a 14-bit character set, as
I remember, with two 7-bit bytes per character; the EUC version just
encodes those by turning the 8th bit on in both of those bytes.
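
A minimal sketch of that encoding step, with hypothetical helper names
(jis_to_euc, is_ascii_byte) and an example code point that is my own
choice, not taken from any standard's text.  It covers only the
two-byte 6226 case and ignores the single-shift prefixes used for the
other sets.

```python
def jis_to_euc(b1, b2):
    """Turn a two-byte 7-bit JIS 6226 code into its EUC form.

    Each byte must be a 7-bit graphic character (0x21..0x7E); the EUC
    version is the same pair with the 8th bit turned on in both bytes.
    """
    for b in (b1, b2):
        if not 0x21 <= b <= 0x7E:
            raise ValueError("not a 7-bit graphic byte: %#x" % b)
    return bytes([b1 | 0x80, b2 | 0x80])


def is_ascii_byte(b):
    """In an EUC stream, a byte with the 8th bit clear is plain ASCII."""
    return b & 0x80 == 0


# Hiragana "a" is 0x24 0x22 in the 7-bit code, so 0xA4 0xA2 in EUC.
euc = jis_to_euc(0x24, 0x22)
```

A receiver can therefore tell ASCII from Japanese text one byte at a
time, which is exactly what the extra bit buys you.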

>So how can you say what is the ``valid part'' of an arbitrary character
>stream?  Surely that is a matter for the two programs at either end, or
>for international standards (and the only really accepted standard -
>ASCII - says that there are only 7 bits in a character).

This will not be true forever; the ISO 8859 character sets are becoming
more widely accepted, for example.