Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!rochester!PT!andrew.cmu.edu!bas+
From: bas+@andrew.cmu.edu (Bruce Sherwood)
Newsgroups: comp.std.internat
Subject: A full solution
Message-ID:
Date: Wed, 26-Aug-87 23:32:11 EDT
Article-I.D.: andrew.EVAuUvy00jaUg8g0aL
Posted: Wed Aug 26 23:32:11 1987
Date-Received: Sat, 29-Aug-87 09:14:35 EDT
Organization: Carnegie Mellon University
Lines: 100
In-Reply-To: <2276@zeus.TEK.COM>

I'm distressed by the nature of the new ISO Latin scheme (ISO 8859-1). There already appeared some time ago ISO 6937 which covers nearly ALL languages which use Roman-letter alphabets (with the exception of Vietnamese), whereas the new ISO 8859 covers only some languages. ISO 8859-1 seems a very major step backwards.

The processing of non-English text in computer systems has been plagued by one half-solution after another. Just when things were looking up (with ISO 6937), along comes a new and different standard which is much more limited in scope.

ISO 6937, like ISO 8859, uses 8-bit codes to provide an additional 96 characters. About 30 of these are special characters not formable from diacritics (e.g., Icelandic thorn, or undotted i). There is a full set of diacritics, which precede the letter they modify. You can think of them as non-spacing characters (so that the following letter prints on top of the diacritic). A better way to think of them however is as "alert" codes, specifying that it and the following code form a 16-bit specification for a character. The actual dot pattern may be formed by superposition, or it may be stored in a separate "rendering" set (to make a better-looking character than could be produced by superimposing a letter and a separate diacritic). The rest of the 96 extra characters are punctuation (such as inverted exclamation and question for Spanish), some math symbols, etc.
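To make the "alert code" view concrete, here is a minimal sketch in Python of a decoder that folds each diacritic and the letter after it into one 16-bit unit. The function name decode_units is hypothetical, and the diacritic range 0xC1-0xCF (with tilde at 0xC4) is assumed from the ISO 6937 code table, so check it against the standard before relying on it.

```python
# Non-spacing diacritics: assumed to occupy 0xC1..0xCF in ISO 6937.
DIACRITIC_RANGE = range(0xC1, 0xD0)


def decode_units(data):
    """Split an ISO 6937 byte string into one integer per character.

    A diacritic byte acts as an "alert": it and the byte that follows
    are combined into a single 16-bit value (diacritic in the high byte,
    letter in the low byte). Every other byte stands alone.
    """
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in DIACRITIC_RANGE:
            if i + 1 >= len(data):
                raise ValueError("diacritic with no following letter")
            units.append((b << 8) | data[i + 1])
            i += 2
        else:
            units.append(b)
            i += 1
    return units


# "Espana" with a tilde code (0xC4, assumed) in front of the n.
sample = bytes([0x45, 0x73, 0x70, 0x61, 0xC4, 0x6E, 0x61])
units = decode_units(sample)
```

Here units has six entries for seven input bytes: the tilde and the n come out as the single value 0xC46E, which a display routine could use either to superimpose two glyphs or to look up a separately drawn one.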
In fact, the first 32 characters of ISO 8859 are nearly identical to the first 32 8-bit characters of ISO 6937.

There is something exceedingly strange about ISO 8859-1. Appendix A lists countries rather than languages for which the standard is valid. This is awfully peculiar. For example, Spain is in the list. But Catalan is a very important language in Spain, and in fact it is the language of the technologically most developed part of the country (the region containing Barcelona). And it appears that ISO 8859-1 does not handle Catalan (dotted L)! And I note that the ligatured ij of Dutch is missing. And the "apostrophe-n" of Afrikaans. And neither 8859-1 nor 8859-2 can handle Esperanto (a language which I use a lot). The ISO 6937 scheme handles all of these languages.

Here is a quote from a discussion of ISO 8859 (Tim Lasko, lasko@video.dec.com, DEC, writing in comp.std.internat):

"We (the U.S., ASC X3L2) realized a bit too late that certain characters needed to properly represent the Welsh language (w and y with circumflex) weren't conveniently available in any of the ISO 8859 sets, and tried to change Part 4 to include them. However, there was neither room nor consensus within the ISO committee to include these, so these too do not exist in any of the ISO 8859 code tables. (Arguably, the BSI should have been looking out for the requirements of Welsh, but for a number of reasons that I choose not to go into here, they did not.)"

This case of Welsh is another sad example of ISO 8859 catering to countries rather than to languages... And even in the face of the excellent work of ISO 6937, which contains a listing of the diacritic needs for 41 languages, including Welsh, which is listed as needing w and y with circumflex. I can't understand why the people working on 8859 didn't check their work against the comprehensive list given in 6937.
The 41 languages covered by 6937 are Afrikaans, Albanian, Basque, Breton, Catalan, Croat, Czech, Danish, Dutch, English, Esperanto, Estonian, Faroese, Finnish, French, Frisian, Galician, German, Greenlandic, Hungarian, Icelandic, Irish, Italian, Lapp, Latvian, Lithuanian, Maltese, Norwegian, Occitan, Polish, Portuguese, Rhaeto-Romanic, Romanian, Scots Gaelic, Slovak, Slovene, Sorbian, Spanish, Swedish, Turkish, and Welsh.

It seems most unfortunate in this day of laser printers and fancy displays and sophisticated window managers to implement yet another half solution, one which is only sort of valid for some region of the globe, and even there is valid only for "national" rather than regional languages.

The extensive multi-lingual Xerox scheme contains 6937 as one of the basic sets. The AT&T Videotex scheme is based on 6937. The basic coding scheme in PostScript is a subset of 6937 (it contains all of the 6937 diacritics, and some of the 6937 special characters such as AE, in the same slots as 6937, but it leaves many slots unused). It may be that suddenly 6937 is out of favor because it "didn't fully catch on," but it seems tragic to back off from a full solution.

Perhaps you would be interested in what we plan to do in Base Environment 2 (BE2) of the Andrew system under development at the Information Technology Center at Carnegie Mellon. Much of the design is due to Tomas Centerlind of Sweden, who worked here this summer.

Since we don't do Unix operating-system development here, we feel that for now we have to stay with a 7-bit external representation (on disk, in mail, etc.). In the text datastream AE will be represented by \.DigraphAE{}, and the Spanish n-tilde will be represented by \.Tilde{n}. In memory the AE in a BE2 document will be the ISO 6937 8-bit code for AE.
The n-tilde will be represented in the document by the code 255, indicating that one must look in the accompanying environment tree (used also for representing styles such as italic) for a 32-bit character code. This "longchar" has the form 8/0, 8/0, 8/tilde, 8/n. The upper bytes are for expansion and indicate what character sets the lower two bytes refer to, and the lower bytes are ISO 6937 for the diacritic and letter. The reason for putting the tilde-n out of line is to simplify various aspects of BE2 text manipulation, and to make multi-byte characters nevertheless be accessed by the programmer as single entities.

While editing, you can choose a system- or user-defined keyboard, with associated key bindings. You can have the keyboard displayed at the bottom of the editing window and type with the mouse if you want. Much of the keyboard redefinition machinery has been built, but there are pieces of BE2 which have not yet been tweaked to make it all work.

Bruce Sherwood
Center for Design of Educational Computing
and Information Technology Center
Carnegie Mellon University
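For anyone who wants to see the longchar layout in concrete terms, here is a minimal sketch in Python of the 8/0, 8/0, 8/tilde, 8/n packing described above. It is an illustration only, not BE2 code: the names LONGCHAR_MARKER, pack_longchar and unpack_longchar are hypothetical, the tilde value 0xC4 is assumed from the ISO 6937 code table, and the environment-tree lookup itself is left out.

```python
# In-text placeholder meaning "look up a longchar in the environment tree".
LONGCHAR_MARKER = 255

# ISO 6937 non-spacing tilde; value assumed from the code table.
TILDE = 0xC4


def pack_longchar(diacritic, letter, set_hi=0, set_lo=0):
    """Pack four 8-bit fields into one 32-bit longchar.

    The two upper bytes select character sets (0, 0 here, reserved for
    expansion); the two lower bytes are the ISO 6937 diacritic and letter.
    """
    for value in (set_hi, set_lo, diacritic, letter):
        if not 0 <= value <= 0xFF:
            raise ValueError("each field must fit in 8 bits")
    return (set_hi << 24) | (set_lo << 16) | (diacritic << 8) | letter


def unpack_longchar(code):
    """Return (set_hi, set_lo, diacritic, letter) for a 32-bit longchar."""
    return (
        (code >> 24) & 0xFF,
        (code >> 16) & 0xFF,
        (code >> 8) & 0xFF,
        code & 0xFF,
    )


n_tilde = pack_longchar(TILDE, ord("n"))
fields = unpack_longchar(n_tilde)
```

Under these assumptions n_tilde is 0x0000C46E, and the text buffer itself would hold only the single byte 255 at that position, so code that walks the buffer still sees one entity per character.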