Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!seismo!sundc!hadron!cos!howard
From: howard@COS.COM (Howard C. Berkowitz)
Newsgroups: comp.lang.c,comp.std.internat
Subject: Re: What is a byte
Message-ID: <393@cos.COM>
Date: Tue, 11-Aug-87 08:54:23 EDT
Article-I.D.: cos.393
Posted: Tue Aug 11 08:54:23 1987
Date-Received: Thu, 13-Aug-87 01:38:05 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP> <2034@xanth.UUCP>
Organization: Corporation for Open Systems, McLean, VA
Lines: 45
Keywords: 32 bit bytes! You ain't seen nothin', yet.
Summary: Worst case approximately 100K ideographs
Xref: mnetor comp.lang.c:3581 comp.std.internat:89

In article <2034@xanth.UUCP>, kent@xanth.UUCP (Kent Paul Dolan) writes:
> While we're developing nightmares about the number of bits the Japanese
> need in a char, remember for text processing that for 1 billion of the
> earth's residents, the smallest unit of text processing is the ideograph,
> and that even 21 bits is probably barely sufficient to represent the number
> of written words in Chinese. Anyone for 32 bit characters? I sure don't
> want 24 bit ones! ;-)

I worked at the Library of Congress in the late 70's, and was
responsible for the hardware and systems software aspects of
experimental terminals for the 140 or so fonts (700 or so languages
and dialects) in which the Library has materials.

Chinese, of course, was the nightmare. Several authorities said we
should assume about 50K distinct ideographs, but the language scholars
in the Orientalia Division said 100K was a more correct number. When
the outside experts challenged this, saying that the additional 50K
appear in only esoteric documents used by very specialized scholars,
Orientalia responded with "who do you think use the Orientalia
collection at the Library of Congress?"

It developed, however, that the Chinese ideograph problem could be
simplified.
While there are a very large number of distinct ideographs,
these ideographs are composed of a much smaller (<100) number of
superimposed radicals. Chinese dictionaries use radicals as a means
of lexical ordering.

While I am out of touch with current research, it was felt at the
time that Chinese (and full Japanese Kanji) could be approached by
using a mixture of codes for common ideographs and escapes to strings
of radicals (to be superimposed), or purely by radical strings.

When discussing the Oriental language problem, do distinguish the
linguistic problem of ideograph uniqueness from the graphic problem
of ideograph display. This differentiation is similar to the
difference between a code and a cipher.
-- 
-- howard(Howard C. Berkowitz) @cos.com
 {seismo!sundc, hadron, hqda-ai}!cos!howard
(703) 883-2812 [ofc] (703) 998-5017 [home]
DISCLAIMER: I explicitly identify COS official positions.
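
Illustration of the mixed scheme described above (direct codes for
common ideographs, plus an escape to a string of radicals to be
superimposed). This is only a rough sketch under stated assumptions,
not the coding the Library actually evaluated: the 16-bit unit size,
the escape value ESC_RADICALS, the count-prefixed layout, the cap
MAX_RADICALS, and the names Ideograph and decode_ideograph are all
hypothetical. Radical indices are assumed to fit in one byte, which is
consistent with the estimate of fewer than 100 radicals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All constants here are hypothetical, chosen only for illustration. */
#define ESC_RADICALS 0xFFFFu  /* assumed escape marker: a radical string follows */
#define MAX_RADICALS 8        /* assumed cap on radicals per ideograph */

typedef struct {
    int      is_composed;             /* 0: direct code, 1: radical string */
    uint16_t code;                    /* meaningful when is_composed == 0 */
    uint8_t  nradicals;               /* meaningful when is_composed == 1 */
    uint8_t  radicals[MAX_RADICALS];  /* radical indices to superimpose */
} Ideograph;

/*
 * Decode one ideograph from in[0..len).
 * Returns the number of 16-bit units consumed, or 0 if the input is
 * empty, truncated, or malformed. On a 0 return, *out is unspecified.
 */
size_t decode_ideograph(const uint16_t *in, size_t len, Ideograph *out)
{
    assert(in != NULL && out != NULL);
    if (len == 0)
        return 0;

    /* Common case: a single unit is the ideograph's own code. */
    if (in[0] != ESC_RADICALS) {
        out->is_composed = 0;
        out->code = in[0];
        out->nradicals = 0;
        return 1;
    }

    /* Escape form: ESC_RADICALS, count, then `count` radical indices. */
    if (len < 2)
        return 0;
    size_t count = in[1];
    if (count == 0 || count > MAX_RADICALS || len < 2 + count)
        return 0;

    out->is_composed = 1;
    out->code = 0;
    out->nradicals = (uint8_t)count;
    for (size_t i = 0; i < count; i++) {
        if (in[2 + i] > 0xFF)  /* radical index must fit in one byte */
            return 0;
        out->radicals[i] = (uint8_t)in[2 + i];
    }
    return 2 + count;
}
```

With this layout a common ideograph costs one unit, while a rare one
costs two units plus one per radical but needs no code point of its
own. Note that the decoder only establishes which ideograph is meant;
how the radicals are positioned and drawn is the separate display
problem.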