Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!asuvax!ncar!mephisto!mcnc!rti!mozart!bts
From: bts@unx.sas.com (Brian T. Schellenberger)
Newsgroups: comp.text
Subject: Re: SGML question
Keywords: SGML, ambiguity
Message-ID: <1990Sep10.170717.7993@unx.sas.com>
Date: 10 Sep 90 17:07:17 GMT
References: <582@helios.prosys.se> <583@helios.prosys.se> <141873@sun.Eng.Sun.COM>
Organization: SAS Institute Inc.
Lines: 47

In article <141873@sun.Eng.Sun.COM> tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
|I believe the ultimate answer is Unicode, a 16-bit code set that
|includes all known languages of the world in a single, interchangeable
|code set. Developed by Joe Becker and others at Xerox, Apple, and
|elsewhere, Unicode represents a tremendous leap forward. The main
|reason 16 bits is sufficient is that Chinese, Japanese and Korean
|pictographs have been combined so as to be complete and correctly
|ordered, though not necessarily contiguous.
|
|The main drawback to Unicode is that files will be twice as big. But
|being able to exchange data without shifting and conversion is a huge
|advantage. Space has even been left in the Unicode address space for
|ancient writing systems such as hieroglyphics and cuneiform.

This is by no means necessary. Even the "large" versions of Kanji and
such-like only have 6,000 or so characters. Allowing for overlap, I
would be immensely surprised if you couldn't take care of all the
ideographic living languages (mostly Chinese, Japanese, and Korean) in
15,000 characters, tops.

Non-ideographic languages tend to have no more than about 40 characters
in no more than two variations (e.g., capital and lowercase for Roman;
Katakana and Hiragana for Japanese), making for fewer than 100
characters each. So we can accommodate more than 150 non-overlapping
languages in another 15,000 slots. This leaves 2,000 slots for
"special" symbols and punctuation, while still coming in at no more
than 32,000 characters.
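The tally above can be checked with a few lines of Python. This is only a sketch: the figures are the rough upper bounds quoted in this post, not measured counts, and the variable names are purely illustrative.

```python
# Rough upper bounds taken from the estimate in the post (not measured data).
IDEOGRAPHS = 15_000          # shared Chinese/Japanese/Korean ideographs
LANGUAGES = 150              # non-ideographic languages
CHARS_PER_LANGUAGE = 100     # generous per-language allowance
SPECIAL = 2_000              # "special" symbols and punctuation

total = IDEOGRAPHS + LANGUAGES * CHARS_PER_LANGUAGE + SPECIAL
capacity = 2 ** 15           # number of distinct 15-bit values

print(total, capacity, capacity - total)   # 32000 32768 768
```

Under these assumed figures the whole repertoire fits in 15 bits with 768 values to spare.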
Thus, it can easily be fit into 15 bits. That way, we can have the 127
or so most common characters encoded in 7 bits. Then if the eighth bit
is set, we know the byte starts a two-byte sequence carrying a 15-bit
character.

The obvious starting place is to make the 7-bit code be the current
ASCII code. I suspect that this is close to the set of most commonly
used characters world-wide, but it need not be the choice.

If such a scheme is used, most files will be only a little bit longer
than they are now, and (assuming ASCII is used as the base) computer
code will increase not one whit. Neither will English text. Since
English is the lingua franca of the world these days, this will include
a lot of international text, and not just English.

Finally, such a scheme is not all that difficult to work with. It is
certainly easier than shared 7- and 8-bit codes that include "shift"
characters.
-- 
-- Brian, the Man from Babble-on.               bts@unx.sas.com
-- (Brian Schellenberger)
"And when the votes were cast, the winner was . . .
Mister James K. Polk, Napoleon of the stump."   -- THEY MIGHT BE GIANTS.
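
A minimal sketch of the mixed 7-bit/15-bit scheme described above, in Python. The post does not specify a byte layout, so everything concrete here is an assumption made for illustration: characters are modelled as integers 0..32767, values below 128 are treated as ASCII, the two-byte form is big-endian, and the function names `encode` and `decode` are hypothetical.

```python
def encode(codes):
    """Encode an iterable of 15-bit character codes into bytes."""
    out = bytearray()
    for code in codes:
        if not 0 <= code < 0x8000:
            raise ValueError(f"code {code} does not fit in 15 bits")
        if code < 0x80:
            # Common (ASCII) character: one byte, high bit clear.
            out.append(code)
        else:
            # Wide character: high bit set on the first byte,
            # followed by the remaining 15 bits, big-endian (assumed layout).
            out.append(0x80 | (code >> 8))
            out.append(code & 0xFF)
    return bytes(out)


def decode(data):
    """Decode bytes produced by encode() back into a list of codes."""
    codes = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            codes.append(first)
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated two-byte character")
        code = ((first & 0x7F) << 8) | data[i + 1]
        if code < 0x80:
            # Assumption: reject two-byte spellings of one-byte characters.
            raise ValueError("two-byte form used for a 7-bit character")
        codes.append(code)
        i += 2
    return codes


ascii_text = list(b"plain ASCII text")
mixed = [ord("H"), ord("i"), 0x1234]

assert encode(ascii_text) == b"plain ASCII text"   # ASCII is unchanged
assert encode(mixed) == b"Hi\x92\x34"              # only the wide code grows
assert decode(encode(mixed)) == mixed
```

Note that no shift state is carried between characters: the high bit of the first byte alone says whether one or two bytes follow, which is the property that makes this simpler to handle than shift-character codes.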