Path: utzoo!attcan!uunet!auspex!guy
From: guy@auspex.UUCP (Guy Harris)
Newsgroups: comp.windows.x
Subject: Re: 8 bits per char
Message-ID: <1073@auspex.UUCP>
Date: 26 Feb 89 01:50:10 GMT
References: <8902211720.AA14715@internal.apple.com> <722@acorn.co.uk>
Reply-To: guy@auspex.UUCP (Guy Harris)
Organization: Auspex Systems, Santa Clara
Lines: 48

>Even if it transmits an 8 bit character (say from the upper half of the
>multinational character set), in UN*X the tty will normally clobber the
>eighth bit anyway.

Only if your UNIX's tty driver is old and crufty. More modern ones can
be told not to strip the 8th bit on either input or output while still
running in "cooked" or "cbreak" mode (i.e., you don't have to go into
"raw" mode to get an 8-bit data path).

>As an experiment I switched my xterm pty into raw mode and typed the
> key on my keyboard (this generates a keycode with a
>suggested keysym of XK_sterling).

(Said keysym being, as I remember, the ISO Latin #1 code for "pound
sterling".)

>I regretted it - it would seem that some 8 bit character did, in fact,
>get through, because my csh promptly died (csh uses the ``spare'' bit
>in input characters while parsing the line, if it receives a byte with
>the top bit set it screws up :-(.

Eventually, more modern C shells will handle 8-bit characters as well.
Some may already do so.

The bottom line is: *don't* use the inadequacies of some current UNIX
implementations as an excuse not to support 8-bit character sets in X11
terminal emulators; said inadequacies will not stick around forever.

>This area is a total mess - but its not X's fault - and a real solution
>would be a major change to most of the computer worlds preconceived
>ideas. After all, what use is an extra bit if you want to transmit
>Chinese or Japanese characters?

If you want to transmit them using "EUC" code sets, the extra bit is
*quite* useful.
In the Japanese EUC set, bytes with the 8th bit not set
represent ASCII characters; bytes with the 8th bit set represent
characters from (what used to be called) the JIS 6226 and JIS 6220 sets,
or various "private" character sets. 6226 is a 14-bit character set, as
I remember, with two 7-bit bytes per character; the EUC version just
encodes those by turning the 8th bit on in both of those bytes.

>So how can you say what is the ``valid part'' of an arbitrary character
>stream? Surely that is a matter for the two programs at either end, or
>for international standards (and the only really accepted standard -
>ASCII - says that there are only 7 bits in a character).

This will not be true forever; the ISO 8859 character sets are becoming
more widely accepted, for example.
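
To make the tty point above concrete, here is a minimal sketch of asking
for an 8-bit data path without leaving "cooked" mode. It assumes a
POSIX-style termios interface, which the discussion above does not name;
older drivers expose the same idea through different ioctls. The function
names make_8bit_clean and set_fd_8bit_clean are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <termios.h>

/*
 * Adjust a termios structure so the 8th bit survives, without touching
 * ICANON ("cooked" mode stays however the caller had it).
 * Assumption: POSIX termios; this is a sketch, not any driver's code.
 */
void make_8bit_clean(struct termios *t)
{
    assert(t != NULL);
    t->c_iflag &= ~(tcflag_t)ISTRIP;  /* do not strip input to 7 bits */
    t->c_cflag &= ~(tcflag_t)CSIZE;   /* clear the character-size field */
    t->c_cflag |= CS8;                /* 8 bits per character */
}

/*
 * Apply the change to an open terminal descriptor.
 * Returns 0 on success, -1 if the settings could not be read or written.
 */
int set_fd_8bit_clean(int fd)
{
    struct termios t;

    if (tcgetattr(fd, &t) == -1)
        return -1;
    make_8bit_clean(&t);
    return tcsetattr(fd, TCSANOW, &t);
}
```

Calling set_fd_8bit_clean(0) on a terminal's standard input would be the
typical use; line editing keeps working because the local-mode flags are
left alone.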
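
The EUC trick described above (turn the 8th bit on in both 7-bit bytes)
can be sketched in a few lines. This covers only the two-byte 6226 case;
the 6220 and "private" sets are left out. The helper names are
hypothetical, and the 0x21..0x7E range check is an assumption about which
7-bit byte values are valid, not something stated above.

```c
#include <stddef.h>

/* A byte with the 8th bit clear is plain ASCII in an EUC stream. */
int euc_is_ascii(unsigned char b)
{
    return (b & 0x80) == 0;
}

/*
 * Encode one two-byte character (two 7-bit bytes) by setting the 8th
 * bit in both bytes. Returns 0 on success, -1 if either byte is
 * outside the assumed printable range 0x21..0x7E.
 */
int euc_encode_pair(unsigned char hi7, unsigned char lo7,
                    unsigned char out[2])
{
    if (hi7 < 0x21 || hi7 > 0x7E || lo7 < 0x21 || lo7 > 0x7E)
        return -1;
    out[0] = (unsigned char)(hi7 | 0x80);
    out[1] = (unsigned char)(lo7 | 0x80);
    return 0;
}

/*
 * Reverse the encoding: clear the 8th bit in both bytes.
 * Returns 0 on success, -1 if either byte lacks the 8th bit.
 */
int euc_decode_pair(const unsigned char in[2],
                    unsigned char *hi7, unsigned char *lo7)
{
    if (euc_is_ascii(in[0]) || euc_is_ascii(in[1]))
        return -1;
    *hi7 = (unsigned char)(in[0] & 0x7F);
    *lo7 = (unsigned char)(in[1] & 0x7F);
    return 0;
}
```

For example, the 7-bit pair 0x30 0x21 comes out as 0xB0 0xA1, and any
byte below 0x80 in the stream can be passed through as ASCII untouched.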