Path: utzoo!attcan!uunet!fernwood!apple!uokmax!munnari.oz.au!samsung!zaphod.mps.ohio-state.edu!uwm.edu!bionet!agate!ucbvax!edinburgh.ac.uk!K.P.Donnelly
From: K.P.Donnelly@edinburgh.ac.uk
Newsgroups: comp.protocols.iso
Subject: Re: Character sets: ISO 6937 vs ISO 8859
Message-ID: <05.Nov.90.19:39:02.gmt.320277@EMAS-A>
Date: 5 Nov 90 19:39:02 GMT
References: <1411.657811134@UK.AC.UCL.CS>
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The Internet
Lines: 54

ISO 6937 and ISO 8859 are both extensions of ASCII (or ISO 646) to
8 bits. Both of them avoid not only the 32 control characters of
ISO 646 (columns 0 and 1) but also their 8-bit equivalents (columns 8
and 9), so as to avoid possible transmission or other difficulties.

The most important difference is that ISO 6937 has "floating
diacritics" - characters of "zero width" representing accents, so that
an accented character is represented by two bytes, one for the
unaccented character and one for the accent. This means that it can
accommodate many more accented characters within 8 bits than ISO 8859
can. In fact it copes with almost all languages using a Latin-based
alphabet. However, it also means that most existing software, such as
text editors, will not cope with ISO 6937, whereas most software needs
little or no modification to work with ISO 8859. This is probably why
ISO 6937, although it came earlier than ISO 8859, has never really been
adopted, whereas ISO 8859 is becoming very widely used.

ISO 8859, because of the limited number of characters which it gets
into 8 bits, has to be split into several parts. ISO 8859-1 covers
nearly all Western European languages, which includes a lot of
languages with economic clout. ISO 8859-2 covers Eastern European
languages with a Latin-based alphabet such as Czech and Polish.
ISO 8859-3 and 8859-4 mop up some of the gaps. ISO 8859-5 is for
languages like Russian with a Cyrillic alphabet, 8859-6 is for Arabic,
8859-7 is for Greek and 8859-8 is for Hebrew.
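
To make the contrast concrete, here is a rough Python sketch, offered
as an illustration only and not taken from either standard. The names
encode_floating and decode_in_parts are made up for this example. The
accent byte values, and the choice to put the accent byte before the
letter, are my understanding of ISO 6937 and should be checked against
the standard itself. The ISO 8859 side relies on the codecs that ship
with Python.

```python
import unicodedata

# ASSUMPTION: a toy subset of floating accents, not the full ISO 6937
# repertoire. The byte values and the accent-before-letter order are my
# understanding of ISO 6937 and should be verified against the standard.
FLOATING_ACCENTS = {
    "\u0300": 0xC1,  # combining grave
    "\u0301": 0xC2,  # combining acute
    "\u0302": 0xC3,  # combining circumflex
    "\u0303": 0xC4,  # combining tilde
    "\u0308": 0xC8,  # combining diaeresis
    "\u0327": 0xCB,  # combining cedilla
}


def encode_floating(text):
    """Hypothetical encoder: an accented letter becomes two bytes,
    one for the zero-width accent and one for the unaccented letter."""
    out = bytearray()
    for ch in text:
        pieces = unicodedata.normalize("NFD", ch)
        if len(pieces) == 2 and pieces[1] in FLOATING_ACCENTS:
            out.append(FLOATING_ACCENTS[pieces[1]])  # the accent byte
            out += pieces[0].encode("ascii")         # the unaccented letter
        else:
            # This sketch handles nothing beyond plain ASCII here and
            # raises UnicodeEncodeError for anything else.
            out += ch.encode("ascii")
    return bytes(out)


def decode_in_parts(byte_value,
                    parts=("iso8859_1", "iso8859_2", "iso8859_5",
                           "iso8859_7", "iso8859_8")):
    """Decode a single byte under several ISO 8859 parts."""
    return {part: bytes([byte_value]).decode(part) for part in parts}


word = "caf\u00e9"                  # "cafe" with an acute accent on the e
fixed = word.encode("iso8859_1")    # one byte per character
floating = encode_floating(word)    # two bytes for the accented letter
print(len(fixed), fixed.hex())
print(len(floating), floating.hex())

# The same byte value means a different letter in each part.
for part, ch in decode_in_parts(0xE0).items():
    print(part, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The accented word takes 4 bytes in ISO 8859-1 and 5 in the floating
scheme, which is exactly what trips up software that assumes one byte
per character. The loop at the end shows why ISO 8859 needs separate
parts: byte E0 is a different letter in each one.
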
ISO 8859-9 is a late addition; it adds to ISO 8859-1 the characters
needed for Turkish, at the expense of Icelandic, which has far fewer
speakers than Turkish but which got included in ISO 8859-1 because the
Icelanders got into 8-bit computing at an early stage and also because
some of the characters are used in Old English. I don't know whether
ISO 6937 has any additional parts for languages such as Russian or
Arabic with a non-Latin alphabet.

ISO 6937 is a development from Teletex. ISO 8859-1 is a development of
the DEC multinational character set. Various manufacturers extended
ASCII to 8 bits in various ways (e.g. IBM-PC character set; HP Roman 8
character set used on LaserJet II laser printers), but the DEC
multinational character set has a far more logical layout of characters
than the others. ISO 8859-1 is used on DEC VT320 terminals, and in
terminal emulations such as MS-Kermit 3.0.

The reason that X.400(1988) refers to ISO 6937 whereas X-Windows makes
use of ISO 8859 may be the association between CCITT and Teletex and
the association between DEC and the development of X-Windows, or it may
just be that X.400(1988) was developed earlier on.

It is now regarded as wasteful to reserve as many as 64 character
positions for control characters, and proposals have been made to
extend ISO 8859-1 to cover more languages. Alternatively, it is
possible that ISO 6937 might make something of a comeback within the
context of structured documents. Or both ideas might be leapfrogged by
two-byte or multi-byte character sets, with file compression for
storage.

I am no expert and some of the above information may be wrong. If so,
I would be glad of corrections.

Kevin Donnelly