Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!rochester!PT!andrew.cmu.edu!bas+
From: bas+@andrew.cmu.edu (Bruce Sherwood)
Newsgroups: comp.std.internat
Subject: A full solution
Message-ID: <EVAuUvy00jaUg8g0aL@andrew.cmu.edu>
Date: Wed, 26-Aug-87 23:32:11 EDT
Article-I.D.: andrew.EVAuUvy00jaUg8g0aL
Posted: Wed Aug 26 23:32:11 1987
Date-Received: Sat, 29-Aug-87 09:14:35 EDT
Organization: Carnegie Mellon University
Lines: 100
In-Reply-To: <2276@zeus.TEK.COM>


I'm distressed by the nature of the new ISO Latin scheme (ISO 8859-1).  ISO
6937, which appeared some time ago, already covers nearly ALL languages
which use Roman-letter alphabets (with the exception of Vietnamese), whereas
the new ISO 8859 covers only some languages.  ISO 8859-1 seems a very major
step backwards.  The processing of non-English text in computer systems has
been plagued by one half-solution after another.  Just when things were
looking up (with ISO 6937), along comes a new and different standard which is
much more limited in scope.

ISO 6937, like ISO 8859, uses 8-bit codes to provide an additional 96
characters.  About 30 of these are special characters not formable from
diacritics (e.g., Icelandic thorn, or undotted i).  There is a full set of
diacritics, which precede the letter they modify.  You can think of them as
non-spacing characters (so that the following letter prints on top of the
diacritic).  A better way to think of them, however, is as "alert" codes,
each specifying that it and the following code form a 16-bit specification for a
character.  The actual dot pattern may be formed by superposition, or it may
be stored in a separate "rendering" set (to make a better-looking character
than could be produced by superimposing a letter and a separate diacritic).
The rest of the 96 extra characters are punctuation (such as inverted
exclamation and question for Spanish), some math symbols, etc.  In fact, the
first 32 characters of ISO 8859 are nearly identical to the first 32 8-bit
characters of ISO 6937.
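
To make the "alert code" idea concrete, here is a minimal sketch in Python of
decoding a byte stream in which a diacritic code precedes the letter it
modifies.  The code values in the table (0xC1 grave through 0xC8 diaeresis) are
my recollection of the ISO 6937 layout and should be checked against the
standard; the function name decode_6937 and the mapping onto Unicode combining
marks are purely illustrative assumptions, and only a handful of diacritics
are covered.

```python
import unicodedata

# Assumed ISO 6937 non-spacing diacritic codes mapped to Unicode combining
# marks. The byte values are quoted from memory; verify against the standard.
DIACRITICS = {
    0xC1: "\u0300",  # grave
    0xC2: "\u0301",  # acute
    0xC3: "\u0302",  # circumflex
    0xC4: "\u0303",  # tilde
    0xC8: "\u0308",  # diaeresis
}


def decode_6937(data: bytes) -> str:
    """Decode 7-bit text plus the few prefix diacritics listed above."""
    out = []
    i = 0
    while i < len(data):
        code = data[i]
        if code in DIACRITICS:
            # The diacritic and the following letter form one 16-bit unit.
            if i + 1 >= len(data) or data[i + 1] >= 0x80:
                raise ValueError("diacritic not followed by a 7-bit letter")
            base = chr(data[i + 1])
            # Unicode puts the mark AFTER the base letter; NFC then picks a
            # single precomposed character when one exists.
            out.append(unicodedata.normalize("NFC", base + DIACRITICS[code]))
            i += 2
        elif code < 0x80:
            out.append(chr(code))
            i += 1
        else:
            raise ValueError(f"code 0x{code:02X} not covered by this sketch")
    return "".join(out)
```

For example, decode_6937(b"Espa\xc4nol") yields "Español", and the pair
0xC3 followed by "w" yields the Welsh w with circumflex.  A real decoder would
also need the 30 or so special characters and the remaining diacritics.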

There is something exceedingly strange about ISO 8859-1.  Appendix A lists
countries rather than languages for which the standard is valid.  This is
awfully peculiar.  For example, Spain is in the list.  But Catalan is a very
important language in Spain, and in fact it is the language of the
technologically most developed part of the country (the region containing
Barcelona).  And it appears that ISO 8859-1 does not handle Catalan (dotted
L)!  And I note that the ligatured ij of Dutch is missing.  And the
"apostrophe-n" of Afrikaans.  And neither 8859-1 nor 8859-2 can handle
Esperanto (a language which I use a lot).  The ISO 6937 scheme handles all of
these languages.

Here is a quote from a discussion of ISO 8859 (Tim Lasko,
lasko@video.dec.com, DEC, writing in comp.std.internat):  "We (the U.S., ASC
X3L2) realized a bit too late that certain characters needed to properly
represent the Welsh language (w and y with circumflex) weren't conveniently
available in any of the ISO 8859 sets, and tried to change Part 4 to include
them.  However, there was neither room nor consensus within the ISO committee
to include these, so these too do not exist in any of the ISO 8859 code
tables.  (Arguably, the BSI should have been looking out for the requirements
of Welsh, but for a number of reasons that I choose not to go into here, they
did not.)"

This case of Welsh is another sad example of ISO 8859 catering to countries
rather than to languages...  And even in the face of the excellent work of
ISO 6937, which contains a listing of the diacritic needs for 41 languages,
including Welsh, which is listed as needing w and y with circumflex.  I can't
understand why the people working on 8859 didn't check their work against the
comprehensive list given in 6937.  The 41 languages covered by 6937 are
Afrikaans, Albanian, Basque, Breton, Catalan, Croat, Czech, Danish, Dutch,
English, Esperanto, Estonian, Faroese, Finnish, French, Frisian, Galician,
German, Greenlandic, Hungarian, Icelandic, Irish, Italian, Lapp, Latvian,
Lithuanian, Maltese, Norwegian, Occitan, Polish, Portuguese, Rhaeto-Romanic,
Romanian, Scots Gaelic, Slovak, Slovene, Sorbian, Spanish, Swedish, Turkish,
and Welsh.

It seems most unfortunate in this day of laser printers and fancy displays
and sophisticated window managers to implement yet another half solution, one
which is only sort of valid for some region of the globe, and even there is
valid only for "national" rather than regional languages.

The extensive multi-lingual Xerox scheme contains 6937 as one of the basic
sets.  The AT&T Videotex scheme is based on 6937.  The basic coding scheme in
PostScript is a subset of 6937 (it contains all of the 6937 diacritics, and
some of the 6937 special characters such as AE, in the same slots as 6937,
but it leaves many slots unused).  It may be that suddenly 6937 is out of
favor because it "didn't fully catch on," but it seems tragic to back off
from a full solution.

Perhaps you would be interested in what we plan to do in Base Environment 2
(BE2) of the Andrew system under development at the Information Technology
Center at Carnegie Mellon.  Much of the design is due to Tomas Centerlind of
Sweden, who worked here this summer.  Since we don't do Unix operating-system
development here, we feel that for now we have to stay with a 7-bit external
representation (on disk, in mail, etc.).  In the text datastream AE will be
represented by \.DigraphAE{}, and the Spanish n-tilde will be represented by
\.Tilde{n}.  In memory the AE in a BE2 document will be the ISO 6937 8-bit
code for AE.  The n-tilde will be represented in the document by the code
255, indicating that one must look in the accompanying environment tree (used
also for representing styles such as italic) for a 32-bit character code.
This "longchar" has the form 8/0, 8/0, 8/tilde, 8/n.  The upper bytes are for
expansion and indicate what character sets the lower two bytes refer to, and
the lower bytes are ISO 6937 for the diacritic and letter.  The reason for
putting the tilde-n out of line is to simplify various aspects of BE2 text
manipulation, and to let the programmer nevertheless access multi-byte
characters as single entities.
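
The 8/0, 8/0, 8/tilde, 8/n layout can be sketched as follows.  This is only an
illustration in Python, not BE2 code: the names pack_longchar and
unpack_longchar are hypothetical, the value 0xC4 for the tilde is my
recollection of the ISO 6937 table, and I assume the first byte listed is the
most significant one.

```python
LONGCHAR_MARKER = 255  # in-document code meaning "look in the environment tree"
TILDE_6937 = 0xC4      # assumed ISO 6937 code for the non-spacing tilde


def pack_longchar(diacritic: int, letter: int,
                  set_hi: int = 0, set_lo: int = 0) -> int:
    """Pack four 8-bit fields into one 32-bit longchar value.

    set_hi and set_lo are the two expansion bytes naming the character sets;
    both are 0 in the example from the text.
    """
    for field in (set_hi, set_lo, diacritic, letter):
        if not 0 <= field <= 0xFF:
            raise ValueError("each field must fit in 8 bits")
    return (set_hi << 24) | (set_lo << 16) | (diacritic << 8) | letter


def unpack_longchar(value: int) -> tuple:
    """Split a 32-bit longchar into (set_hi, set_lo, diacritic, letter)."""
    return ((value >> 24) & 0xFF, (value >> 16) & 0xFF,
            (value >> 8) & 0xFF, value & 0xFF)


# The Spanish n-tilde from the text: 8/0, 8/0, 8/tilde, 8/n.
n_tilde = pack_longchar(TILDE_6937, ord("n"))
```

Under these assumptions n_tilde is 0x0000C46E, and unpack_longchar(n_tilde)
gives back (0, 0, 0xC4, 0x6E).  The document text itself would hold only the
single code 255 at that position, so every character still occupies one slot.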

While editing, you can choose a system- or user-defined keyboard, with
associated key bindings.  You can have the keyboard displayed at the bottom
of the editing window and type with the mouse if you want.  Much of the
keyboard redefinition machinery has been built, but there are pieces of BE2
which have not yet been tweaked to make it all work.

Bruce Sherwood
Center for Design of Educational Computing
     and Information Technology Center
Carnegie Mellon University