Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!seismo!sundc!hadron!cos!howard
From: howard@COS.COM (Howard C. Berkowitz)
Newsgroups: comp.lang.c,comp.std.internat
Subject: Re: What is a byte
Message-ID: <393@cos.COM>
Date: Tue, 11-Aug-87 08:54:23 EDT
Article-I.D.: cos.393
Posted: Tue Aug 11 08:54:23 1987
Date-Received: Thu, 13-Aug-87 01:38:05 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP> <2034@xanth.UUCP>
Organization: Corporation for Open Systems, McLean, VA
Lines: 45
Keywords: 32 bit bytes!  You ain't seen nothin', yet.
Summary: Worst case approximately 100K ideographs
Xref: mnetor comp.lang.c:3581 comp.std.internat:89

In article <2034@xanth.UUCP>, kent@xanth.UUCP (Kent Paul Dolan) writes:
> While we're developing nightmares about the number of bits the Japanese
> need in a char, remember for text processing that for 1 billion of the
> earth's residents, the smallest unit of text processing is the ideograph,
> and that even 21 bits is probably barely sufficient to represent the number
> of written words in Chinese.  Anyone for 32 bit characters?  I sure don't
> want 24 bit ones!  ;-)


I worked at the Library of Congress in the late '70s, where I was
responsible for the hardware and systems software aspects of
experimental terminals for the 140 or so fonts (700 or so
languages and dialects) in which the Library has materials.

Chinese, of course, was the nightmare.  Several authorities
said we should assume about 50K distinct ideographs, but the
language scholars in the Orientalia Division said 100K was
a more accurate number.  When the outside experts challenged
this, saying that the additional 50K appear only in esoteric
documents used by very specialized scholars, Orientalia responded
with "Who do you think uses the Orientalia collection at the
Library of Congress?"
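
To put the two estimates in terms of code width: a flat code that
gives every ideograph its own value needs enough bits to count
them all.  The fragment below is only a sketch of that arithmetic;
the name bits_needed is invented here for illustration.  About 50K
ideographs fit in 16 bits (2^16 = 65536), while about 100K do not
and require 17 (2^17 = 131072).

```c
#include <assert.h>

/* Smallest number of bits b such that 2^b >= count, i.e. enough
   distinct code values to give each of `count` symbols its own
   code.  Illustrative helper, not taken from any library. */
unsigned bits_needed(unsigned long count)
{
    unsigned bits = 0;
    unsigned long capacity = 1;     /* always equal to 2^bits */

    assert(count > 0);
    while (capacity < count) {
        capacity <<= 1;             /* one more bit doubles capacity */
        bits++;
    }
    return bits;
}
```

Calling bits_needed(50000UL) and bits_needed(100000UL) gives the
two widths quoted above, so the Orientalia figure is the one that
pushes a flat code past a 16-bit unit.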

It developed, however, that the Chinese ideograph problem could
be simplified.  While there is a very large number of distinct
ideographs, they are built up from a much smaller set (<100)
of radicals superimposed on one another.  Chinese dictionaries
use radicals as a means of lexical ordering.

While I am out of touch with current research, the thinking at
the time was that Chinese (and full Japanese Kanji) could be
handled either by a mixture of codes for common ideographs and
escapes to strings of radicals (to be superimposed), or purely
by radical strings.
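
A minimal sketch of the mixed scheme, to make the idea concrete.
Everything specific in it is an assumption made for illustration
and not the design we worked with: the 16-bit code unit, the
escape value ESC_RADICALS, the length count that follows the
escape, the limit MAX_RADICALS, and the function names.  A common
ideograph occupies one code unit; a rare one is sent as the
escape, a count, and that many radical numbers.

```c
#include <assert.h>
#include <stddef.h>

#define ESC_RADICALS 0xFFFFu    /* hypothetical escape value */
#define MAX_RADICALS 8          /* hypothetical limit per ideograph */

typedef unsigned short code_unit;   /* at least 16 bits wide */

/* Write a common ideograph as its direct code.
   Returns the number of code units written. */
size_t put_common(code_unit *out, code_unit code)
{
    assert(code != ESC_RADICALS);   /* escape value is reserved */
    out[0] = code;
    return 1;
}

/* Write a rare ideograph as: escape, count, radical numbers.
   Returns the number of code units written. */
size_t put_rare(code_unit *out, const code_unit *radicals, size_t n)
{
    size_t i;

    assert(n >= 1 && n <= MAX_RADICALS);
    out[0] = ESC_RADICALS;
    out[1] = (code_unit)n;
    for (i = 0; i < n; i++)
        out[2 + i] = radicals[i];
    return 2 + n;
}

/* Count the ideographs in a stream of `len` code units.
   Returns -1 if an escape sequence is truncated or malformed. */
long count_ideographs(const code_unit *in, size_t len)
{
    size_t pos = 0;
    long count = 0;

    while (pos < len) {
        if (in[pos] == ESC_RADICALS) {
            size_t n;

            if (pos + 1 >= len)
                return -1;          /* escape with no count */
            n = in[pos + 1];
            if (n < 1 || n > MAX_RADICALS || pos + 2 + n > len)
                return -1;          /* bad count or truncated string */
            pos += 2 + n;           /* skip the whole radical string */
        } else {
            pos += 1;               /* a directly coded ideograph */
        }
        count++;
    }
    return count;
}
```

The point of the sketch is that the number of code units no longer
equals the number of ideographs: a reader has to parse the stream,
as count_ideographs does, to find where one ideograph ends and the
next begins.  The purely radical-string alternative is the same
decoder with every ideograph taking the escape path.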

When discussing the Oriental language problem, do distinguish
the linguistic problem of ideograph uniqueness from the graphic
problem of ideograph display.  This differentiation is similar
to the difference between a code and a cipher.

-- 
-- howard(Howard C. Berkowitz) @cos.com
 {seismo!sundc, hadron, hqda-ai}!cos!howard
(703) 883-2812 [ofc] (703) 998-5017 [home]
DISCLAIMER:  I explicitly identify COS official positions.