Xref: utzoo comp.emacs:4630 comp.lang.c:14058 comp.sys.ibm.pc:21182
Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!bloom-beacon!apple!bionet!agate!ucbvax!decwrl!labrea!sri-unix!quintus!ok
From: ok@quintus.uucp (Richard A. O'Keefe)
Newsgroups: comp.emacs,comp.lang.c,comp.sys.ibm.pc
Subject: Re: Programming and international character sets.
Message-ID: <670@quintus.UUCP>
Date: 13 Nov 88 07:17:36 GMT
References: <532@krafla.rhi.hi.is> <8804@smoke.BRL.MIL> <207@jhereg.Jhereg.MN.ORG> <621@quintus.UUCP> <774@wsccs.UUCP>
Sender: news@quintus.UUCP
Reply-To: ok@quintus.UUCP (Richard A. O'Keefe)
Organization: Quintus Computer Systems, Inc.
Lines: 54

In article <774@wsccs.UUCP> terry@wsccs.UUCP (Every system needs one) writes:
>Second, vi in the US strips the 8th bit out, and is therefore not
>usable for programming international (8-bit) characters using either model.

AT&T announced clearly in the SVID that they were going to stop doing
that kind of thing, _and_they_have_.

>Problems with 16 bit characters:
>
>O   The Xerox model is 16-bit and only valid for bitmapped displays,
>    like Mac, and we all know how slowly that scrolls.
>

The Xerox model (XSIS 058404) has nothing to do with bitmapped displays.

>O   All of the current software would break without extensive rewrite

It's going to break _anyway_.  If you do one-character-equals-one-byte
operations on Kanji, the results just aren't going to make sense.
With a 16-bit model (actually, the Xerox model already has provision
for 24-bit characters, though the implementation I was familiar with
didn't provide them yet), the damage is small.  In fact, when XNS
support was added to InterLisp, most programs didn't even need to be
recompiled, and those that needed other changes mostly _could_ have
been written to be independent of character set using facilities
already in the language.

>O   The internal overhead in a non-message passing operating system
>    (most of them) is so high that it's ridiculous.
>O   Think of pipes and all file I/O going half as fast.
>O   Think of your hard disks shrinking to half their size... source
>    files, after all, are text.

These are essentially the same point, and are equally mistaken.
There is no reason why a _single_ character and a _sequence_ of
characters need to use the same coding.  There are three representations
used for character sequences in Interlisp-D:

    thin strings (vectors of 8-bit characters from "character set 0"),
    fat strings (vectors of 16-bit characters), and
    files (sequences of characters drawn from the same 256-character
        block are stored as sequences of 8-bit codes, with "font
        change" codes inserted as needed).

Since a file is presumed to start in character set 0, files of 8-bit
characters DIDN'T CHANGE AT ALL.  If you want to position randomly in a
sequence, you have to know what the "font" is at that point;
alternatively, a font change code could be inserted at the start of
every block.  It is only when a program picks up a single character and
looks at it on its own that it materialises as 16 bits.

[This coding wins if you tend to mix languages with small character
 sets, e.g. if you have whole sentences in English, Russian, Hebrew,
 Greek, &c, because then you can stay in the same "font" for at least
 a word at a time.  It does not pay off for Kanji, but with a certain
 amount of cunning you can make it no worse than the ISO 2022 method.]

Now you can only achieve code-set independence as easily as that in a
high-level language, and font-compressed files really require all the
utilities in the system to be internationalised at once, so the ANSI
committee didn't really have the option of adopting a solution like this.
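
The "font change" file coding described above can be sketched roughly
as follows.  This is only an illustration, not the Interlisp-D or XSIS
058404 format: the escape byte value 0xFF, the names SWITCH, encode and
decode, and the rule that no character may have 0xFF as its low byte
are all assumptions made for the example.

```python
SWITCH = 0xFF  # hypothetical "font change" escape byte (an assumption)


def encode(chars):
    """Pack 16-bit character codes into bytes, one byte per character,
    emitting SWITCH + <character set> whenever the high byte changes."""
    out = bytearray()
    current_set = 0  # a file is presumed to start in character set 0
    for code in chars:
        if not 0 <= code <= 0xFFFF:
            raise ValueError("not a 16-bit character code: %r" % (code,))
        charset, low = code >> 8, code & 0xFF
        if low == SWITCH:
            # Assumption of this sketch: 0xFF is reserved for the escape.
            raise ValueError("low byte 0xFF is reserved in this sketch")
        if charset != current_set:
            out += bytes([SWITCH, charset])
            current_set = charset
        out.append(low)
    return bytes(out)


def decode(data):
    """Inverse of encode: each character materialises as 16 bits only
    when it is taken out of the byte stream."""
    chars = []
    current_set = 0
    stream = iter(data)
    for byte in stream:
        if byte == SWITCH:
            try:
                current_set = next(stream)  # byte after SWITCH names the set
            except StopIteration:
                raise ValueError("truncated font change code")
        else:
            chars.append((current_set << 8) | byte)
    return chars


def is_thin(chars):
    """True if every character lies in character set 0, i.e. the
    sequence would fit in a thin (8-bit) string."""
    return all(code < 0x100 for code in chars)
```

Under these assumptions, text that stays in character set 0 (and avoids
the byte 0xFF) encodes to exactly the same bytes it had before, which is
the sense in which existing 8-bit files need not change.  A run in
another character set costs two extra bytes per switch, so the coding is
cheap when switches are rare and expensive when they happen on nearly
every character.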