Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!ub!uhura.cc.rochester.edu!rochester!kodak!uupsi!sunic!ugle.unit.no!nuug!ifi!enag
From: enag@ifi.uio.no (Erik Naggum)
Newsgroups: comp.std.internat
Subject: Unicode vs ISO DIS 10646 (was universality of Latin-1)
Message-ID: <ENAG.91Apr25234756@maud.ifi.uio.no>
Date: 25 Apr 91 21:48:07 GMT
References: <1991Apr10.172756.4991@murdoch.acc.Virginia.EDU>
	<1991Apr12.001902.9260@timessqr.gc.cuny.edu>
	<1991Apr12.123302.17817@murdoch.acc.Virginia.EDU>
	<1991Apr24.181121.6212@parc.xerox.com>
Sender: enag@ifi.uio.no (Erik Naggum)
Organization: Naggum Software, Oslo, Norway
Lines: 96
In-Reply-To: daniels@parc.xerox.com's message of 24 Apr 91 18:11:21 GMT

Gentlemen,

I've become somewhat tired of reading comments of the "my character
set standard has more characters than your character set standard"
kind.

The problem is not which characters a given character set standard
already includes in its draft stage, but how easily new characters
can be added when needed, and how you address them.  To paraphrase a
saying from programming environments:

	There's always one more character.

Unicode has the charming quality that each script is separated in the
code table by a generous amount of unassigned character positions.
There is also what the Unicode Consortium believes to be a generous
amount of spare code points for other scripts.

ISO DIS 10646 does not have this charming quality to the same extent,
being much more tightly packed, but it has entire rows available for
new scripts, depending on their size.  If you don't like any of these,
you can grab a private use row.  There are entire planes available for
special scripts with lots of characters (ideographic scripts).
Private use planes also exist.  The ability to subsume an industry
standard such as Unicode into ISO DIS 10646 is eminently present.
Indeed, ISO DIS 10646 can subsume anything.  When or if we meet life
in outer space, they'd probably appreciate one of the 190 remaining
groups, too.

Unicode has the charming quality that you can address any of the
65 536 possible characters with a constant sixteen bits.  (I'm
deliberately glossing over the "what's a character, anyway" issue.)

ISO DIS 10646 has mechanisms to address any of the 1 330 863 361
possible characters, but each with a varying number of bits, if you
don't use the four-octet canonical form.
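
The two figures above can be checked with a little arithmetic.  The
sketch below assumes that each of the four octets is limited to the
191 values outside the control-code ranges (0x20-0x7E and 0xA0-0xFF).
That restriction is not stated above; it is an assumption inferred
from the number itself, which happens to equal 191 to the fourth
power, and it also matches the count of 191 groups implied earlier.

```python
# Sizes of the two code spaces quoted in the text.

# Unicode: one fixed 16-bit unit per character.
UNICODE_BITS = 16
unicode_positions = 2 ** UNICODE_BITS

# ISO DIS 10646, four-octet canonical form.
# Assumption (not stated in the post): each octet may only take the
# values 0x20-0x7E and 0xA0-0xFF, avoiding the control-code ranges.
values_per_octet = len(range(0x20, 0x7F)) + len(range(0xA0, 0x100))
iso_positions = values_per_octet ** 4

print(unicode_positions)   # 65536
print(values_per_octet)    # 191
print(iso_positions)       # 1330863361
```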

Unicode is stateless in terms of what any given 16-bit binary value
means.  (Again, glossing over issues such as floating diacritics.)

ISO DIS 10646 has numerous states due to the compaction methods,
Single Graphic Character Introducer, and High Octet Preset mechanisms.

Unicode works with a unit 16 bits wide.

ISO DIS 10646 works with several units 8 bits wide.

Unicode is subject to endianism.

ISO DIS 10646 is octet stream based, and is not subject to endianism.
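
The endianism point can be illustrated with a short sketch.  The code
value and the four octets used here are hypothetical and chosen only
for illustration; they are not taken from either draft.

```python
import struct

# Hypothetical 16-bit code value, for illustration only.
code = 0x0041

# The same 16-bit unit has two possible byte serializations.
big = struct.pack(">H", code)      # most significant byte first
little = struct.pack("<H", code)   # least significant byte first

print(big.hex())     # 0041
print(little.hex())  # 4100

# A reader that assumes the wrong byte order sees a different value.
misread = struct.unpack("<H", big)[0]
print(hex(misread))  # 0x4100

# Hypothetical four-octet sequence.  Each unit is a single octet, so
# the byte-order flag makes no difference to the serialized form.
octets = (0x20, 0x20, 0x20, 0x41)
same = struct.pack(">4B", *octets) == struct.pack("<4B", *octets)
print(same)          # True
```

The order of the octets within a multi-octet sequence still has to be
defined, but it is fixed by the standard rather than by the machine
that wrote the data.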

These are technical differences which will have a much larger impact
on the acceptance of each of these proposed standards than any number
of included or excluded characters from each.

Each of these also has an important aspect that requires attention,
and in each case previous attempts at the same approach have not
generally fared well:

Unicode employs floating diacritics for scripts which do not separate
the diacritic and the character to which it applies.  This was tried
out with ISO 6937/2, a standard which is used mainly for reference
purposes and in some specific applications for which it was created.

ISO DIS 10646 employs code shifting in various ways, analogous to ISO
2022, ISO 4873 (num?) and others.  This has generally posed problems
for programmers who would like a one-to-one relationship between
character and bit-string.

Unicode caters to programmers with its fixed width, and to
typographic and bibliographic needs with floating diacritics, but
these two goals tend to be contradictory on several levels.
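
One level of that tension can be sketched as follows.  The example
uses Python's unicodedata module and code points from present-day
Unicode, which postdates the draft discussed here, so it should be
read as an illustration of the principle rather than of the 1991
code table.

```python
import unicodedata

# The same visible letter written two ways.
composed = "\u00e9"      # single code point: e with acute accent
decomposed = "e\u0301"   # base letter followed by a floating diacritic

# Every unit has a fixed width, but the number of units per visible
# character is no longer constant.
print(len(composed))     # 1
print(len(decomposed))   # 2

# A unit-by-unit comparison treats the two spellings as different.
print(composed == decomposed)  # False

# They only compare equal after normalizing to one convention.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

A fixed-width unit therefore does not by itself give the one-to-one
relationship between character and bit-string that programmers tend
to want.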

ISO DIS 10646 caters to national and international standards, and
their procedures, which will ensure that formal agreement on a good
standard will be easier and that revisions will be few and far (in
time) between.  (This "good" may not map to your "good", and I'm not
going to fight over that.)

These issues are relevant to the questions of agreement and acceptance
by industry and systems developers, and become especially delicate
when we consider government requirements.  Governments tend to choose
International Standards over industry standards (partly because
appearing to give particular vendors an advantage places them in an
uncomfortable light), and the European Community politicians are
getting more and more power over what is and is not going to be part
of Europe as we have yet to know it.

I'd like to see some discussion on these topics, instead of the
useless quibbling over which character set does or does not have
"FOOTWEAR CAPITAL LETTER SWOOSH WITH AIR BELOW" or any other favorite
"required" character.

--
[Erik Naggum]					     <enag@ifi.uio.no>
Naggum Software, Oslo, Norway			   <erik@naggum.uu.no>