Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!casbah.acns.nwu.edu!hpa
From: hpa@casbah.acns.nwu.edu (H. Peter Anvin)
Newsgroups: comp.std.internat
Subject: Re: Unicode vs ISO DIS 10646 (was universality of Latin-1)
Message-ID: <1991May4.055401.15031@casbah.acns.nwu.edu>
Date: 4 May 91 05:54:01 GMT
References: <10003@plains.NoDak.edu>
Organization: Northwestern University
Lines: 58

In article enag@ifi.uio.no (Erik Naggum) writes:
>Unicode allows a large number of floating diacritical marks in
>languages on which I don't have a shred of competence to comment,
>but several people have expressed the opinion that they're not really
>floating for several languages.

Yes, UNICODE does not care which language we are dealing with; note too
that one may have to combine characters from several sections of UNICODE
in order to form a complete script.

The question then becomes: so what?  If we insist on diacritics that
float for the languages where that is possible and are fixed for the
languages that require it, someday someone will type "e'" with a fixed
diacritic while writing Norwegian, or with a floating one in French,
only to have something break for them.

As I understand it, UNICODE only has non-floating diacritics for
historical (compatibility) reasons.  For example, "e'" is U+00E9 only
for compatibility with Latin-1, while the explicit coding is U+0065
U+0301.  I take it that at U+00E9 there will just be an alias entry
referring to U+0065 U+0301.

>The other problem with floating diacritics is that the number of
>characters is not naturally bounded, a thought at which ISO
>understandably shudders.  Unicode talks about bounding the displayable
>number of characters (with diacritical marks) through extra-standard
>means, while ISO wants to do it with intra-standard means.  For instance,
>a commercial at-sign with acute accent and cedilla below doesn't make
>much sense.
>What should a Unicode display device do with that
>sequence of characters?

In my opinion, it should take the @ sign, superimpose an acute accent
and tack a cedilla onto the bottom.  A high-quality output device will
probably have a set of pre-finished combinations, but that doesn't
prevent it from using plain old superposition (or fancied-up
superposition) as a default solution.  After all, the combination tells
it what it should look like, right?

Endianism is a tricky question, but in most cases there is precedent.
For telecommunication, both CCITT and Internet standards advocate
bigendianism (Motorola style).  Check what the sequence of bits is out
of a V.24/RS-232 port: bigendian.  Thus that is probably the preferred
style for interchange.  For word processors etc. there are usually
numeric fields for which this has already had to be resolved, mostly in
favor of the style dominant on the machine the program was introduced
on.

[P.S. As a programmer, I prefer littleendian (Intel) style; while a
bigendian hex dump is easier to read, littleendianism avoids many of the
problems with different variable sizes.  D.S.]

I also think there should be a recommended mangling scheme for
converting Unitext to the ASCII text spectrum (NOT the octet spectrum)
for purposes like Internet mail, which is not very likely to change any
time soon.  I have given the question some thought, but I am not going
to say anything until I have figured out a "safe" way that could also
distinguish between Unitext and ASCII text.

/Peter
--
IDENTITY:   Anvin, H. Peter
STATUS:     Student
INTERNET:   hpa@casbah.acns.nwu.edu
FIDONET:    1:115/989.4
HAM RADIO:  N9ITP, SM4TKN
RBBSNET:    8:970/101.4
EDITOR OF:  The Stillwaters BBS List
TEACHING:   Swedish
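
For readers who want to experiment with the two codings of "e'" and
with the byte-order question raised above, here is a minimal sketch.
It is an illustration only and rests on assumptions that are not part
of the post: it uses present-day Python, whose unicodedata module and
"utf-16-be"/"utf-16-le" codecs postdate this discussion, and it takes
the precomposed Latin-1 letter to be U+00E9.  The helper names
(code_points, to_composed, to_decomposed, serialize_16bit) are
hypothetical.

```python
import unicodedata

# Explicit coding from the post: base letter followed by a floating acute.
decomposed = "\u0065\u0301"

# Precomposed Latin-1 compatibility character (assumed here to be U+00E9).
precomposed = "\u00e9"

# The quoted "nonsense" example: at-sign with acute accent and cedilla.
odd = "@\u0301\u0327"


def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]


def to_composed(text):
    """Fold base + floating marks into precomposed characters where one exists."""
    return unicodedata.normalize("NFC", text)


def to_decomposed(text):
    """Expand precomposed characters into base + floating marks."""
    return unicodedata.normalize("NFD", text)


def serialize_16bit(text, byte_order):
    """Write each 16-bit code unit in the requested byte order ("big" or "little")."""
    codec = {"big": "utf-16-be", "little": "utf-16-le"}[byte_order]
    return text.encode(codec)


if __name__ == "__main__":
    print(code_points(decomposed), "->", code_points(to_composed(decomposed)))
    print(code_points(precomposed), "->", code_points(to_decomposed(precomposed)))
    # No precomposed form exists for the at-sign, so the marks stay separate.
    print(code_points(to_composed(odd)))
    print(serialize_16bit(decomposed, "big").hex())
    print(serialize_16bit(decomposed, "little").hex())
```

The at-sign sequence survives composition as three separate code
points, which is the case where a display device has to fall back on
superposition.  The two serializations differ only in the order of the
bytes within each 16-bit unit, which is exactly what an interchange
convention has to pin down.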