Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!asuvax!ncar!mephisto!mcnc!rti!mozart!bts
From: bts@unx.sas.com (Brian T. Schellenberger)
Newsgroups: comp.text
Subject: Re: SGML question
Keywords: SGML, ambiguity
Message-ID: <1990Sep10.170717.7993@unx.sas.com>
Date: 10 Sep 90 17:07:17 GMT
References: <582@helios.prosys.se> <583@helios.prosys.se> <141873@sun.Eng.Sun.COM>
Organization: SAS Institute Inc.
Lines: 47

In article <141873@sun.Eng.Sun.COM> tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
|I believe the ultimate answer is Unicode, a 16-bit code set that
|includes all known languages of the world in a single, interchangeable
|code set.  Developed by Joe Becker and others at Xerox, Apple, and
|elsewhere, Unicode represents a tremendous leap forward.  The main
|reason 16 bits is sufficient is that Chinese, Japanese and Korean
|pictographs have been combined so as to be complete and correctly
|ordered, though not necessarily contiguous.
|
|The main drawback to Unicode is that files will be twice as big.  But
|being able to exchange data without shifting and conversion is a huge
|advantage.  Space has even been left in the Unicode address space for
|ancient writing systems such as hieroglyphics and cuneiform.

This is by no means necessary.  Even the "large" versions of Kanji and
such-like have only 6000 or so characters.  Allowing for overlap, I would
be immensely surprised if you couldn't take care of all the living
ideographic languages (mostly Chinese, Japanese, and Korean) in 15,000
characters, tops.  Non-ideographic languages tend to have no more than
about 40 characters in no more than two variations (e.g., capital and
lowercase for Roman; Katakana and Hiragana for Japanese), making for fewer
than 100 characters each, so we can accommodate more than 150
non-overlapping languages in another 15,000 slots.  This leaves 2,000
slots for "special" symbols and punctuation, while still coming in at
less than 32,000 characters.  Thus, it can easily be fit into 15 bits.
That way, we can have the 127 or so most common characters encoded in
7 bits.  Then if the eighth bit is set, we know the byte starts a 15-bit
character (two bytes in all).
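
To make the idea concrete, here is a rough sketch of such an encoding in
Python.  The details are my own assumptions for illustration, not part of
any standard: codes 0..127 are the one-byte set, codes from 128 up are
numbered into the 15-bit extended space, and the names encode and decode
are made up.

```python
SHORT_LIMIT = 0x80      # codes 0..127 fit in 7 bits (eighth bit clear)
EXT_LIMIT = 1 << 15     # 32,768 slots addressable with 15 bits

# Budget from the text: ideographs + 150 small alphabets + specials.
assert 15_000 + 150 * 100 + 2_000 <= EXT_LIMIT


def encode(codes):
    """Encode a sequence of integer character codes into bytes.

    Assumed numbering (hypothetical): 0..127 is the one-byte set,
    128 and up index the 15-bit extended set.
    """
    out = bytearray()
    for code in codes:
        if 0 <= code < SHORT_LIMIT:
            out.append(code)                # one byte, eighth bit clear
        else:
            ext = code - SHORT_LIMIT
            if not 0 <= ext < EXT_LIMIT:
                raise ValueError(f"code {code} does not fit in 15 bits")
            out.append(0x80 | (ext >> 8))   # eighth bit set + high 7 bits
            out.append(ext & 0xFF)          # low 8 bits
    return bytes(out)


def decode(data):
    """Decode bytes produced by encode() back into a list of codes."""
    codes = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:                    # eighth bit clear: 7-bit char
            codes.append(first)
            i += 1
        else:                               # eighth bit set: 15-bit char
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            ext = ((first & 0x7F) << 8) | data[i + 1]
            codes.append(SHORT_LIMIT + ext)
            i += 2
    return codes
```

A reader only ever has to look at the eighth bit of the current byte to
know whether one or two bytes make up the character; no state carries
over from earlier in the file.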

The obvious starting place is to make the 7-bit code the current ASCII
code.  I suspect that this is close to the set of most commonly used
characters world-wide, but it need not be the choice.  If such a scheme is
used, most files will be only a little bit longer than they are now, and
(assuming ASCII is used as the base) computer code will grow not one
whit.  Neither will English text.  Since English is the lingua franca of
the world these days, this covers a lot of international text, not just
text from English-speaking countries.
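
As a rough illustration of the size argument, the following Python sketch
compares the two approaches.  It assumes ASCII is the one-byte set and
that every other character takes exactly two bytes; the function names
are made up for the example.

```python
def mixed_width_size(text):
    """Bytes needed if ASCII takes one byte and anything else takes two."""
    return sum(1 if ord(ch) < 128 else 2 for ch in text)


def fixed_16bit_size(text):
    """Bytes needed if every character takes a full 16 bits."""
    return 2 * len(text)


source_line = "return 0;"
print(mixed_width_size(source_line), fixed_16bit_size(source_line))

mostly_ascii = "na\u00efve"     # one non-ASCII letter among five
print(mixed_width_size(mostly_ascii), fixed_16bit_size(mostly_ascii))
```

Pure ASCII input stays the same size as today, while the fixed 16-bit
form always doubles it; text with only the occasional non-ASCII
character grows by one byte per such character.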

Finally, such a scheme is not all that difficult to work with.  It is
certainly easier than shared 7- and 8-bit codes that include "shift"
characters.


-- 
-- Brian, the Man from Babble-on.		bts@unx.sas.com
-- (Brian Schellenberger)
"And when the votes were cast, the winner was . . .
 Mister James K. Polk, Napoleon of the stump."        -- THEY MIGHT BE GIANTS.