Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!wuarchive!zaphod.mps.ohio-state.edu!pacific.mps.ohio-state.edu!linac!att!pacbell.com!ucsd!ucbvax!bloom-beacon!eru!hagbard!sunic!sics.se!ifi!enag
From: enag@ifi.uio.no (Erik Naggum)
Newsgroups: comp.std.internat
Subject: Re: Unicode vs ISO DIS 10646 (was universality of Latin-1)
Message-ID:
Date: 3 May 91 18:08:22 GMT
References: <10003@plains.NoDak.edu>
Sender: enag@ifi.uio.no (Erik Naggum)
Organization: Naggum Software, Oslo, Norway
Lines: 124
In-Reply-To: kkim@plains.NoDak.edu's message of 26 Apr 91 20:19:21 GMT

In article <10003@plains.NoDak.edu> kkim@plains.NoDak.edu (kyongsok kim) writes:

   (Erik Naggum) writes:

   :Unicode is subject to endianism.

   could anyone please explain what "endianism" is?

Sorry for using an unwarrantedly technical term outside its original
domain.

Computers whose smallest addressable unit of information is the octet
(byte) need some ordering scheme for the octets that make up units
consisting of more than one octet, such as a 16-bit quantity or a
32-bit quantity.  There are basically two ways to do this, with
variations on the theme, called "big-endian" and "little-endian".
Big-endian octet order means that the "big end" (most significant
octet) comes first, and conversely for little-endian octet order.

By way of example, consider the octet order for the 16-bit quantity
U+0040 (the commercial at-sign in Unicode).  Big-endian hardware would
represent this as

	+----+----+
	| 00 | 40 |
	+----+----+

(reading memory from low addresses at left to high addresses at
right), while little-endian hardware would represent the same numeric
quantity or Unicode character as

	+----+----+
	| 40 | 00 |
	+----+----+

What I mean by "endianism", then, is the whole issue around the
portability of binary coded information when units larger than an
octet are moved around one octet at a time.  E.g.
if a little-endian machine writes a U+0040 to a file, it will be read
as whatever U+4000 is in Unicode on a big-endian machine, and exactly
the same the other way around.  It should be clear that
interoperability will suffer significantly under this scheme, and if
one octet order is chosen, machines that have made the other choice
will take a severe performance penalty.

   :Unicode employs floating diacritics for scripts which do not separate
   :the diacritic and the character to which it applies.

   most people in favor of iso 10646 attack floating diacritics.  how do
   floating diacritics and non-spacing characters (which i believe iso
   10646 adopts) differ?  from end-users' point of view, these two seem
   one and the same.  am i missing something?

Consider the Norwegian and French words for a small restaurant,
spelled "cafe'" (where the ' serves as a floating acute accent for
rendering purposes in the absence of an international character set
standard in which we wouldn't need it :-).  In Norwegian, the acute
accent over e is optional; it's an ornament to indicate stress,
toneme, etc.  It's not orthographically required.  In French, an e
with acute is a different orthographic unit than plain, unadorned e.

This means that in Norwegian, we can make do with a floating acute
accent, since the function of the acute accent is to modify the
character with which it is combined.  In French, however, they cannot
make do with a floating acute accent because the acute accent does not
have a function by itself.  Rather, the unit is "e with acute".

Then there's the Norwegian character "a with ring above", in which the
ring above has exactly the same nature as the acute accent in French.
If Norwegian were supposed to be written with "a*" (* substituting for
a non-existent non-spacing floating "ring above"), it would complicate
things for us to the point where we would have to vote a strong NO to
a standard forcing us to do this.  (Note that we can't vote against
Unicode, we can only "fail to adopt it".)
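Both points made so far, octet order and coded versus floating
diacritics, can be checked in a few lines.  This is only a minimal
sketch in present-day Python against present-day Unicode, both of
which postdate this discussion; the variable names are mine, and the
NFC normalization step is a mechanism from later versions of Unicode,
not something either draft standard discussed here provides.

```python
import unicodedata

# Octet order: the same 16-bit value U+0040 serialized both ways.
AT_SIGN = 0x0040
big = AT_SIGN.to_bytes(2, "big")        # octets 00 40
little = AT_SIGN.to_bytes(2, "little")  # octets 40 00

# A reader assuming the opposite octet order recovers a different value.
misread = int.from_bytes(little, "big")  # 0x4000 instead of 0x0040

# Diacritics: one coded character versus base letter plus floating mark.
composed = "caf\u00e9"   # "e with acute" coded as a single character
floating = "cafe\u0301"  # plain e followed by a combining acute accent

# The two spellings are different code sequences until they are normalized.
same_codes = composed == floating
same_after_nfc = unicodedata.normalize("NFC", floating) == composed

print(big.hex(), little.hex(), hex(misread))
print(len(composed), len(floating), same_codes, same_after_nfc)
```

The first printed line shows the two octet orders and the misread
value; the second shows that the two spellings of the word differ in
length and compare unequal as coded, which is the ambiguity a standard
has to rule on one way or the other.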
Of course, French and Norwegian are sufficiently important languages
that we've had all our characters represented in ISO 8859-1 (with the
possible exception of the French political faux pas with respect to
the "oe" ligature).  Some minority languages are less well off, to put
it mildly.  I've heard that East European languages employing a
heavily diacriticized Cyrillic script are suffering from the lack of
characters for their needs, and think that floating diacritics is the
answer to their problem.

So, to summarize, a diacritic mark may or may not be an integral part
of a character depending on orthographic conventions in the language
in question.  To treat a diacritic as floating when it is an integral
part of a character would be wrong, as would insisting on having all
possible combinations of a truly floating diacritic and the characters
with which it may be combined coded separately.

Now, ISO DIS 10646 is of the "insist on all combinations" persuasion,
but has non-spacing characters for languages in which the "separate
unit of information" is eminently the case (e.g. Hebrew).  I've come
to learn that this is overly restrictive in many, many cases.  Unicode
allows a large number of floating diacritical marks in languages on
which I don't have a shred of competence to comment, but several
people have expressed the opinion that they're not really floating for
several languages.

Without a firm ruling in the standard or national standards on the
nature of the diacritical marks from orthographical conventions
employed, there is an annoying ambiguity between "cafe'" and "caf*"
(* now substituting for e with acute accent).  Is the * really an e
plus an ', or is it a separate character, or is it vice versa?  As
noted above, the answer is different for French and Norwegian,
although the word is exactly the same!

The other problem with floating diacritics is that the number of
characters is not naturally bounded, a thought at which ISO
understandably shudders.
Unicode talks about bounding the displayable number of characters
(with diacritical marks) through extra-standard means, while ISO wants
to do it with intra-standard means.  For instance, a commercial
at-sign with acute accent and cedilla below doesn't make much sense.
What should a Unicode display device do with that sequence of
characters?

I am deeply indebted to Professor David Birnbaum for explaining this
to me in much detail, and I'm of course responsible for any mistakes.

Hope this has helped.

--
[Erik Naggum]		Professional Programmer
Naggum Software		Electronic Text
0118 OSLO, NORWAY	Computer Communications		+47-2-836-863