Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!ub!uhura.cc.rochester.edu!rochester!kodak!uupsi!sunic!ugle.unit.no!nuug!ifi!enag
From: enag@ifi.uio.no (Erik Naggum)
Newsgroups: comp.std.internat
Subject: Unicode vs ISO DIS 10646 (was universality of Latin-1)
Message-ID:
Date: 25 Apr 91 21:48:07 GMT
References: <1991Apr10.172756.4991@murdoch.acc.Virginia.EDU> <1991Apr12.001902.9260@timessqr.gc.cuny.edu> <1991Apr12.123302.17817@murdoch.acc.Virginia.EDU> <1991Apr24.181121.6212@parc.xerox.com>
Sender: enag@ifi.uio.no (Erik Naggum)
Organization: Naggum Software, Oslo, Norway
Lines: 96
In-Reply-To: daniels@parc.xerox.com's message of 24 Apr 91 18:11:21 GMT

Gentlemen,

I've become somewhat tired of reading comments of the "my character set standard has more characters than your character set standard" kind. The problem is not one of already included characters in any given character set standard in its draft stage, but how easily new characters can be added when needed, and how you address them.

To paraphrase a saying from programming environments: There's always one more character.

Unicode has the charming quality that each script is separated in the code table by a generous amount of unassigned character positions. There is also what the Unicode Consortium believes to be a generous amount of spare code points for other scripts.

ISO DIS 10646 does not have this charming quality to the same extent, being much harder packed, but it has entire rows available for new scripts, depending on their size. If you don't like any of these, you can grab a private use row. There are entire planes available for special scripts with lots of characters (ideographic scripts). Private use planes also exist.

The ability to subsume an industry standard such as Unicode into ISO DIS 10646 is eminently present. Indeed, ISO DIS 10646 can subsume anything.
When or if we meet life in outer space, they'd probably appreciate one of the 190 remaining groups, too.

Unicode has the charming quality that you can address any of the 65 536 possible characters with a constant sixteen bits. (I'm deliberately glossing over the "what's a character, anyway" issue.)

ISO DIS 10646 has mechanisms to address any of the 1 330 863 361 possible characters, but each with a varying number of bits, if you don't use the four-octet canonical form.

Unicode is stateless in terms of what any given 16-bit binary value means. (Again, glossing over issues such as floating diacritics.)

ISO DIS 10646 has numerous states due to the compaction methods, Single Graphic Character Introducer, and High Octet Preset mechanisms.

Unicode works with a unit 16 bits wide. ISO DIS 10646 works with several units 8 bits wide.

Unicode is subject to endianism. ISO DIS 10646 is octet stream based, and is not subject to endianism.

These are technical differences which will have a much larger impact on the acceptance of each of these proposed standards than any number of included or excluded characters from each.

Each of these also has an important aspect that requires attention, and in each case previous attempts at the same approach have not generally fared well:

Unicode employs floating diacritics for scripts which do not separate the diacritic and the character to which it applies. This was tried out with ISO 6937/2, a standard which is used mainly for reference purposes and in some specific applications for which it was created.

ISO DIS 10646 employs code shifting in various ways, analogous to ISO 2022, ISO 4873 (num?) and others. This has generally posed problems for programmers who would like a one-to-one relationship between character and bit-string.

Unicode caters to programmers in its fixed width, and to typographic and bibliographic needs with floating diacritics, but these two issues tend to be contradictory on several levels.

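To make the addressing and byte-order points above concrete, here is a minimal Python sketch. It is an illustration only and not part of either draft: the constant and function names are made up for the example, and the figure of 191 usable values per octet is an inference from the fact that 191**4 equals the 1 330 863 361 quoted above, not something taken from the DIS text.

```python
# Repertoire sizes quoted in the post.
UNICODE_CODE_POINTS = 2 ** 16  # one fixed 16-bit unit per character

# Assumption: 191 usable values in each of four octets (inferred, see lead-in).
VALUES_PER_OCTET = 191
DIS_10646_CODE_POINTS = VALUES_PER_OCTET ** 4


def encode_16bit(value: int, byteorder: str) -> bytes:
    """Serialize one 16-bit unit as two octets in the given byte order."""
    return value.to_bytes(2, byteorder)


# The same 16-bit value yields two different octet streams.
big = encode_16bit(0x0041, "big")
little = encode_16bit(0x0041, "little")

# Reading big-endian octets with the wrong byte order gives another value.
misread = int.from_bytes(big, "little")

print(UNICODE_CODE_POINTS, DIS_10646_CODE_POINTS)
print(big.hex(), little.hex(), hex(misread))
```

A receiver that assumes the wrong byte order reads 0x0041 as 0x4100, which is why a 16-bit unit needs an agreed order while a format defined as an octet stream does not.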
ISO DIS 10646 caters to national and international standards, and their procedures, which will ensure that formal agreement on a good standard will be easier and that revisions will be few and far (in time) between. (This "good" may not map to your "good", and I'm not going to fight over that.)

These issues are relevant in the questions on agreement and acceptance by industry and systems developers, and become especially delicate when we consider government requirements. Governments tend to choose International Standards over industry standards (partly because appearing to give particular vendors an advantage places them in an uncomfortable light), and the European Community politicians are getting more and more power over what is and is not going to be part of Europe as we have yet to know it.

I'd like to see some discussion on these topics, instead of the useless quibbling over which character set does or does not have "FOOTWEAR CAPITAL LETTER SWOOSH WITH AIR BELOW" or any other favorite "required" character.

--
[Erik Naggum] Naggum Software, Oslo, Norway