Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!wuarchive!zaphod.mps.ohio-state.edu!pacific.mps.ohio-state.edu!linac!att!pacbell.com!ucsd!ucbvax!bloom-beacon!eru!hagbard!sunic!sics.se!ifi!enag
From: enag@ifi.uio.no (Erik Naggum)
Newsgroups: comp.std.internat
Subject: Re: Unicode vs ISO DIS 10646 (was universality of Latin-1)
Message-ID: <ENAG.91May3200814@maud.ifi.uio.no>
Date: 3 May 91 18:08:22 GMT
References: <10003@plains.NoDak.edu>
Sender: enag@ifi.uio.no (Erik Naggum)
Organization: Naggum Software, Oslo, Norway
Lines: 124
In-Reply-To: kkim@plains.NoDak.edu's message of 26 Apr 91 20:19:21 GMT

In article <10003@plains.NoDak.edu> kkim@plains.NoDak.edu (kyongsok kim) writes:

   (Erik Naggum) writes:

   :Unicode is subject to endianism.

   could anyone please explain what "endianism" is?

Sorry for using an unwarrantedly technical term outside its original
domain.  Computers whose smallest addressable unit of information is
the octet (byte) need some ordering scheme for the octets to make up
units consisting of more than one octet, such as a 16-bit quantity, or
a 32-bit quantity.  There are basically two ways to do this, with
variations over the theme, called "big-endian" and "little-endian".

Big-endian octet order means that the "big end" (most significant
octet) comes first, and conversely for little-endian octet order.  By
way of example, consider the octet order for the 16-bit quantity
U+0040 (the commercial at-sign in Unicode).  Big-endian hardware
would represent this as

	+----+----+
	| 00 | 40 |
	+----+----+

(reading memory from low addresses at left to high addresses at
right), while little-endian hardware would represent the same
numeric quantity or Unicode character as

	+----+----+
	| 40 | 00 |
	+----+----+

What I mean by "endianism", then, is the whole issue of the
portability of binary-coded information when units larger than an
octet are moved around one octet at a time.  E.g., if a little-endian
machine writes U+0040 to a file, a big-endian machine will read it as
whatever U+4000 is in Unicode, and exactly the same happens the other
way around.  It should be clear that interoperability will suffer
significantly under this scheme, and if one octet order is chosen,
machines that have made the other choice will take a severe
performance penalty.
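
The octet diagrams above can be reproduced with a few lines of
Python.  This is only an illustrative sketch: the names AT_SIGN and
swap16 are made up for the example, and the struct module stands in
for whatever a given machine does natively.

```python
import struct

AT_SIGN = 0x0040  # U+0040, the commercial at-sign

# Serialize the same 16-bit quantity in both octet orders.
big = struct.pack(">H", AT_SIGN)     # most significant octet first
little = struct.pack("<H", AT_SIGN)  # least significant octet first

# A reader assuming big-endian order misreads little-endian octets.
(misread,) = struct.unpack(">H", little)


def swap16(value: int) -> int:
    """Exchange the two octets of a 16-bit quantity."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)


print(big.hex(), little.hex(), hex(misread), hex(swap16(misread)))
```

The swap in swap16 is the extra work a machine must do on every
character when the file's octet order is not its own.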

   :Unicode employs floating diacritics for scripts which do not separate
   :the diacritic and the character to which it applies.

   most people in favor of iso 10646 attack floating diacritics.  how
   do floating diacritics and non-spacing characters (which i believe
   iso 10646 adopts) differ?  from end-users' point of view, these two
   seem one and the same.  am i missing something?

Consider the Norwegian and French words for a small restaurant,
spelled "cafe'" (where the ' serves as a floating acute accent for
rendering purposes in the absence of an international character set
standard in which we wouldn't need it :-).  In Norwegian, the acute
accent over e is optional; it's an ornament to indicate stress,
toneme, etc.  It's not orthographically required.  In French, an e
with acute is a different orthographic unit from plain, unadorned e.

This means that in Norwegian, we can make do with a floating acute
accent, since the function of the acute accent is to modify the
character with which it is combined.  In French, however, they cannot
make do with a floating acute accent because the acute accent does not
have a function by itself.  Rather, the unit is "e with acute".

Then there's the Norwegian character "a with ring above", in which the
ring above has exactly the same nature as the acute accent in French.
If Norwegian were supposed to be written with "a*" (* substituting for
a non-existent non-spacing floating "ring above"), it would complicate
things for us to the point where we would have to vote a strong NO to
a standard forcing us to do this.  (Note that we can't vote against
Unicode, we can only "fail to adopt it".)

Of course, French and Norwegian are sufficiently important languages
that we've had all our characters represented in ISO 8859-1 (with the
possible exception of the French political faux pas with respect to
the "oe" ligature).  Some minority languages are less well off, to put
it mildly.  I've heard that East European languages employing a
heavily diacriticized Cyrillic script are suffering from the lack of
characters for their needs, and that their users think floating
diacritics are the answer to their problem.

So, to summarize, a diacritic mark may or may not be an integral part
of a character depending on orthographic conventions in the language
in question.  To treat a diacritic as floating when it is an integral
part of a character would be wrong, as would insisting on having all
possible combinations of a truly floating diacritic and the characters
with which it may be combined coded separately.

Now, ISO DIS 10646 is of the "insist on all combinations" persuasion,
but has non-spacing characters for languages in which the "separate
unit of information" is eminently the case (e.g. Hebrew).  I've come
to learn that this is overly restrictive in many, many cases.

Unicode allows a large number of floating diacritical marks in
languages on which I don't have a shred of competence to comment,
but several people have expressed the opinion that they're not really
floating for several languages.

Without a firm ruling, in the standard or in national standards, on
the nature of the diacritical marks under the orthographic conventions
employed, there is an annoying ambiguity between "cafe'" and "caf*" (*
now substituting for e with acute accent).  Is the * really an e plus
an ', or is it a separate character, or is it vice versa?  As noted
above, the answer is different for French and Norwegian, although the
word is exactly the same!
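
The ambiguity can be made concrete.  The sketch below assumes
present-day Unicode as exposed by Python's unicodedata module, which
codes both an "e with acute" (U+00E9) and a floating acute accent
(U+0301) and supplies normalization rules to reconcile them.  That
is not necessarily the state of either draft discussed here, and the
variable names are invented for the example.

```python
import unicodedata

precomposed = "caf\u00e9"  # e with acute as one coded character
floating = "cafe\u0301"    # plain e followed by a floating acute accent

# Without a rule relating the two, they are simply different strings.
same_without_rule = precomposed == floating

# With a rule (here, the normalization forms), they can be reconciled.
nfc_equal = unicodedata.normalize("NFC", floating) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == floating

print(len(precomposed), len(floating), same_without_rule, nfc_equal, nfd_equal)
```

Note that such a rule is purely mechanical; it cannot say whether the
accent is an ornament (Norwegian) or part of the letter (French).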

The other problem with floating diacritics is that the number of
characters is not naturally bounded, a thought at which ISO
understandably shudders.  Unicode talks about bounding the displayable
number of characters (with diacritical marks) through extra-standard
means, while ISO wants to do it with intra-standard means.  For instance,
a commercial at-sign with acute accent and cedilla below doesn't make
much sense.  What should a Unicode display device do with that
sequence of characters?
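
As a rough illustration of that last example, and again assuming
present-day Unicode via Python's unicodedata module rather than
either draft under discussion, the sequence is perfectly legal to
write down, and nothing in the character data collapses it into a
single coded character:

```python
import unicodedata

# Commercial at-sign, floating acute accent, floating cedilla.
sequence = "@\u0301\u0327"

names = [unicodedata.name(ch) for ch in sequence]

# NFC replaces a base plus mark with a single coded character only
# where one is defined; none is defined for an accented at-sign.
composed = unicodedata.normalize("NFC", sequence)
still_three = len(composed) == 3

print(names, still_three)
```

What to draw for it is left entirely to the display device.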

I am deeply indebted to Professor David Birnbaum for explaining this
to me in much detail, and I'm of course responsible for any mistakes.

Hope this has helped.

--
[Erik Naggum]           Professional Programmer        <enag@ifi.uio.no>
Naggum Software             Electronic Text          <erik@naggum.uu.no>
0118 OSLO, NORWAY       Computer Communications            +47-2-836-863