Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!casbah.acns.nwu.edu!hpa
From: hpa@casbah.acns.nwu.edu (H. Peter Anvin)
Newsgroups: comp.std.internat
Subject: Re: Unicode vs ISO DIS 10646 (was universality of Latin-1)
Message-ID: <1991May4.055401.15031@casbah.acns.nwu.edu>
Date: 4 May 91 05:54:01 GMT
References: <10003@plains.NoDak.edu> <ENAG.91May3200814@maud.ifi.uio.no>
Organization: Northwestern University
Lines: 58

In article <ENAG.91May3200814@maud.ifi.uio.no> enag@ifi.uio.no (Erik Naggum) writes:
>Unicode allows a large number of floating diacritical marks in
>languages which I don't have a shred of competence to make comments,
>but several people have expressed the opinion that they're not really
>floating for several languages.

Yes, UNICODE does not care which language we are dealing with; note too
that one may have to combine characters from several sections of the
UNICODE in order to form a complete script.  The question then becomes: so
what?  If we insist on floating diacritics for the languages that merely
permit them and fixed ones for the languages that require them, someday
someone will type "e'" with a fixed diacritic while writing Norwegian, or a
floating one in French, just to have something break.  As I understand
it, UNICODE only has non-floating diacritics for historical (compatibility)
reasons.  For example, "e'" is U+00E9 only for compatibility with Latin-1,
while the explicit coding is U+0065 U+0301.  I take it that at U+00E9 there
will just be an alias entry referring to U+0065 U+0301.
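
The equivalence can be sketched as follows.  This is only an illustration:
Python's unicodedata module and its NFC/NFD normalization forms are
assumptions of the sketch, not something the UNICODE draft under discussion
prescribes, and compose()/decompose() are hypothetical helper names.

```python
import unicodedata

PRECOMPOSED = "\u00e9"        # e-acute as one Latin-1 compatible code point
DECOMPOSED = "\u0065\u0301"   # "e" followed by a floating (combining) acute


def compose(text):
    """Fold base + floating marks into precomposed characters where any exist."""
    return unicodedata.normalize("NFC", text)


def decompose(text):
    """Expand precomposed characters into base + floating marks."""
    return unicodedata.normalize("NFD", text)


# The two codings differ as code point sequences...
assert PRECOMPOSED != DECOMPOSED
# ...but each converts into the other, so neither needs to "break".
assert compose(DECOMPOSED) == PRECOMPOSED
assert decompose(PRECOMPOSED) == DECOMPOSED
```

Software that compares text after running both sides through one of these
functions treats the fixed and the floating spelling as the same character.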

>The other problem with floating diacritics is that the number of
>characters is not naturally bounded, a thought at which ISO
>understandable shudders.  Unicode talks about bounding the displayable
>number of characters (with diacritical marks) through extra-standard
>means, while ISO wants do it with intra-standard means.  For instance,
>a commercial at-sign with acute accent and cedilla below doesn't make
>much sense.  What should a Unicode display device do with that
>sequence of characters?

In my opinion, it should take the @ sign and superimpose an acute accent
and tack a cedilla at the bottom.  A high-quality output device will
probably have a set of pre-finished combinations, but that doesn't prevent
it from using plain old superposition (or fancied-up superposition) as a
default solution.  After all, the combination tells it what it should look
like, right? 
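
A rough sketch of that fallback policy, under stated assumptions: Python's
unicodedata module stands in for the device's character tables, "has a
pre-finished combination" is approximated by "NFC composes the group into
one code point", and clusters() and render_plan() are hypothetical names.

```python
import unicodedata


def clusters(text):
    """Split text into [base, mark, mark, ...] groups.

    A character with a non-zero combining class is treated as a floating
    mark that attaches to the preceding base character.
    """
    groups = []
    for ch in text:
        if unicodedata.combining(ch) and groups:
            groups[-1].append(ch)
        else:
            groups.append([ch])
    return groups


def render_plan(group):
    """Decide how to draw one group.

    Returns ("glyph", char) when a single pre-finished character exists,
    otherwise ("overlay", base, marks): draw the base, then superimpose
    each mark on it in turn.
    """
    composed = unicodedata.normalize("NFC", "".join(group))
    if len(composed) == 1:
        return ("glyph", composed)
    return ("overlay", group[0], group[1:])


AT_ACUTE_CEDILLA = "@\u0301\u0327"   # commercial at + acute + cedilla

for group in clusters("e\u0301" + AT_ACUTE_CEDILLA):
    print(render_plan(group))
```

The e-acute group comes back as a single glyph, while the at-sign group
falls through to plain superposition of both marks.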

Endianism is a tricky question, but in most cases there is precedent.  For
telecommunication, both CCITT and Internet standards advocate bigendianism
(Motorola style).  Check out what the sequence of bits is out of a
V.24/RS-232 port.  Bigendian.  Thus that is probably the preferred style
for interchange.  File formats for word processors etc. usually contain
numeric fields for which the question has already had to be resolved,
mostly in favor of the style dominant on the machine the format was
introduced on.
[P.S. As a programmer, I prefer littleendian (Intel) style; while a
bigendian hex dump is easier to read, littleendianism avoids many of the
problems with different variable sizes.   D.S.]
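
Both points can be sketched in a few lines.  Python's struct module and
int.from_bytes are assumptions of the illustration, and read_little() and
read_big() are hypothetical helpers that mimic reading a narrower variable
from the address of a wider one.

```python
import struct

CODE = 0x0301  # the floating acute as a 16-bit value

big = struct.pack(">H", CODE)      # b"\x03\x01": reads like the hex number
little = struct.pack("<H", CODE)   # b"\x01\x03": low-order octet first


def read_little(buf, size):
    """Read the first `size` octets of buf as a littleendian unsigned value."""
    return int.from_bytes(buf[:size], "little")


def read_big(buf, size):
    """Read the first `size` octets of buf as a bigendian unsigned value."""
    return int.from_bytes(buf[:size], "big")


# A 32-bit field holding a small value, stored both ways.
field_le = struct.pack("<I", 0x41)
field_be = struct.pack(">I", 0x41)

# Littleendian: the same starting address gives 0x41 at every width.
assert [read_little(field_le, n) for n in (1, 2, 4)] == [0x41, 0x41, 0x41]
# Bigendian: narrower reads from the same address see only the zero octets.
assert [read_big(field_be, n) for n in (1, 2, 4)] == [0, 0, 0x41]
```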

I also think there should be a recommended mangling scheme for converting
Unitext to ASCII text spectrum (NOT octet spectrum) for purposes like
Internet mail, which is not very likely to change any time soon.  I have
given the question some thought but I am not going to say anything until I
have figured out a "safe" way that could also distinguish between Unitext
and ASCII text.

                                     /Peter
-- 
IDENTITY:   Anvin, H. Peter           STATUS:    Student
INTERNET:   hpa@casbah.acns.nwu.edu   FIDONET:   1:115/989.4
HAM RADIO:  N9ITP, SM4TKN             RBBSNET:   8:970/101.4
EDITOR OF:  The Stillwaters BBS List  TEACHING:  Swedish