Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!ll-xn!husc6!necntc!frog!john
From: john@frog.UUCP (John Woods, Software)
Newsgroups: comp.lang.c,comp.std.internat
Subject: Re: What is a byte
Message-ID: <1564@frog.UUCP>
Date: Tue, 18-Aug-87 14:46:00 EDT
Article-I.D.: frog.1564
Posted: Tue Aug 18 14:46:00 1987
Date-Received: Thu, 20-Aug-87 06:04:33 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP> <8409@utzoo.UUCP>
Organization: Superfrog Heaven [ CRDS, Framingham MA ]
Lines: 51
Keywords: 32 bit bytes!  You ain't seen nothin', yet.
Summary: Some real research (GASP!)
Xref: mnetor comp.lang.c:3712 comp.std.internat:118

In article <8409@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
(and many others as well)
>>[English has] over 1,000,000 words.  Chinese is probably about the same.
> 
Many people (including Henry) have pointed out that (A) English is larger
than most languages (having borrowed "one of everything" from everyone), and
(B) Chinese ideographs are not one-per-word, but one-per-concept (hence most
words are two or more ideographs).  So, I went back to the source where I
first read about this topic:  "Multilingual Word Processing", Joseph D. Becker,
Scientific American, July 1984.

In this article, he doesn't give an actual count of Chinese ideographs (just
the statement "tens of thousands").  However, in the "flexible encoding" that
he and other Xerox denizens developed (using alphabet shift-codes), encoding
Chinese means sending the "shift superalphabet (for 16 bit characters)" code,
then the 8-bit "superalphabet number", and then a sequence of 16-bit
characters.  "The main superalphabet, designated 00000000, is all one needs
except for very rare Chinese characters."  A little later, the article implies
that about 7000 ideographs are "commonly seen" in Chinese publishing.

So, there we have it:  not as bad as I thought, but still indicating that
8 bits is woefully inadequate.
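
The arithmetic behind that is short enough to check by machine.  A minimal
sketch (the name bits_needed is mine, and it assumes a plain fixed-width code
with one code per ideograph):

```python
def bits_needed(n_symbols):
    """Smallest number of bits b such that 2**b >= n_symbols."""
    if n_symbols < 1:
        raise ValueError("need at least one symbol")
    return (n_symbols - 1).bit_length()


# About 7000 ideographs are "commonly seen" in Chinese publishing.
print("bits for 7000 ideographs:", bits_needed(7000))
print("codes available in  8 bits:", 2 ** 8)
print("codes available in 16 bits:", 2 ** 16)
```

So the commonly seen ideographs alone need 13 bits, while 16 bits leaves room
for "tens of thousands".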

Also, I seem to have slipped up in my understanding of Kanji:  Kanji is the
set of Chinese ideographs borrowed by the Japanese, of which "about 3500"
are in common use (and the number is declining).  The phonetic letters (which
can spell words in entirety, and are used to indicate grammatical endings for
Kanji roots) are collectively called "kana", and come in two sets:  "hiragana"
and "katakana" (it is probably more complicated than that, but that is about
all the article gives).  There used to be Kanji "typewriters" with several
hundred keys, which scarcely anyone used; now, computerised systems exist in
which one can type phonetic hiragana symbols (or, for those who prefer, the
Romaji phonetics), and press a "lookup key" to have the computer turn the
just-typed word into proper Kanji.
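
The article summary does not say how the lookup works internally.  As a toy
illustration only, assuming a simple dictionary from hiragana readings to
candidate Kanji spellings (the table READINGS and the function lookup are
hypothetical, and a real system would need a far larger dictionary plus some
way for the typist to choose among homophones):

```python
# Toy dictionary: hiragana reading -> candidate Kanji spellings.
# The entries are ordinary Japanese words; the structure is an assumption.
READINGS = {
    "にほん": ["日本"],
    "かんじ": ["漢字", "感じ", "幹事"],   # homophones: several candidates
}


def lookup(reading):
    """Return candidate spellings for a typed reading.

    Falls back to the kana as typed when the reading is not in the table.
    """
    return READINGS.get(reading, [reading])


print(lookup("にほん"))
print(lookup("かんじ"))
```

The second lookup returns more than one candidate, which is why a single
"lookup key" press cannot always settle the matter by itself.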

The bibliography of that Scientific American article says the following
publications may be helpful:

_Writing Systems of the World_, Akira Nakanishi.  Charles E. Tuttle, 1980.
"A Historical Study of Typewriters and Typing Methods:  From the Position
of Planning Japanese Parallels", Hisao Yamada in _Journal of Information
Processing_, Vol. 2, No. 4, pp 175-202; February, 1980.

Can we all now consider the statement "7 bits is enough" most sincerely dead?

--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw@eddie.mit.edu

"The Unicorn is a pony that has been tragically
disfigured by radiation burns." -- Dr. Science