Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!ll-xn!husc6!necntc!frog!john
From: john@frog.UUCP (John Woods, Software)
Newsgroups: comp.lang.c,comp.std.internat
Subject: Re: What is a byte
Message-ID: <1564@frog.UUCP>
Date: Tue, 18-Aug-87 14:46:00 EDT
Article-I.D.: frog.1564
Posted: Tue Aug 18 14:46:00 1987
Date-Received: Thu, 20-Aug-87 06:04:33 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP> <8409@utzoo.UUCP>
Organization: Superfrog Heaven [ CRDS, Framingham MA ]
Lines: 51
Keywords: 32 bit bytes! You ain't seen nothin', yet.
Summary: Some real research (GASP!)
Xref: mnetor comp.lang.c:3712 comp.std.internat:118

In article <8409@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
(and many others as well)
>>[English has] over 1,000,000 words. Chinese is probably about the same.
>
Many people (including Henry) have pointed out that (A) English is larger
than most languages (having borrowed "one of everything" from everyone),
and (B) Chinese ideographs are not one-per-word, but one-per-concept
(hence most words are two or more ideographs).

So, I went back to the source I first read about this topic in:
"Multilingual Word Processing", Joseph D. Becker, Scientific American
July 1984. In this article, he doesn't give an actual count of Chinese
ideographs (just the statement "tens of thousands"), but in the "flexible
encoding" he and other Xerox denizens developed (using alphabet
shift-codes), to encode Chinese you send the "shift superalphabet (for 16
bit characters)", the 8-bit "superalphabet number", and then 16-bit
character sequences. "The main superalphabet, designated 00000000, is all
one needs except for very rare Chinese characters." A little later in the
article is the implication that about 7000 ideographs are "commonly seen"
in Chinese publishing. So, there we have it: not as bad as I thought, but
still indicating that 8 bits is woefully inadequate.
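
The arithmetic behind that conclusion can be sketched in a few lines of C.
This is only an illustration: the function name bits_needed is an
assumption made here, not something from Becker's article, and the only
figure taken from the article is the "about 7000" commonly seen ideographs.

```c
/*
 * Illustrative sketch: bits_needed() is a hypothetical helper, not part
 * of any encoding described in the article.
 *
 * Returns the smallest number of bits b such that 2^b >= count, i.e. the
 * narrowest fixed-width code that can give every one of `count` symbols
 * its own value. Returns 0 for count == 0 or count == 1.
 */
unsigned bits_needed(unsigned long count)
{
    unsigned bits = 0;
    unsigned long largest_code;

    if (count == 0)
        return 0;

    /* Codes run from 0 to count - 1; count the bits in the largest one. */
    largest_code = count - 1;
    while (largest_code != 0) {
        largest_code >>= 1;
        bits++;
    }
    return bits;
}
```

Calling bits_needed(7000) gives 13, since 4096 < 7000 <= 8192. An 8-bit
byte offers only 256 codes, so even the commonly seen ideographs need a
wider unit, and they fit comfortably in the 16-bit characters the article
describes.
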

Also, I seem to have slipped up in my understanding of Kanji: Kanji is
the set of Chinese ideographs borrowed by the Japanese, of which "about
3500" are in common use (and the number is declining). The phonetic
letters (which can spell words in entirety, and are used to indicate
grammatical endings for Kanji roots) are collectively called "kana", and
come in two sets: "hiragana" and "katakana" (it is probably more
complicated than that, but that is about all the article gives). There
used to be Kanji "typewriters" which scarcely anyone used (using several
hundred keys); now, computerised systems exist in which one can type
phonetic hiragana symbols (or, for those who prefer, the Romaji
phonetics), and press a "lookup key" to have the computer turn the
just-typed word into proper Kanji.

The Bibliography in that Scientific American says the following
publications may be helpful:

	_Writing Systems of the World_, Akira Nakanishi.
	Charles E. Tuttle, 1980.

	"A Historical Study of Typewriters and Typing Methods: From the
	Position of Planning Japanese Parallels", Hisao Yamada in
	_Journal of Information Processing_, Vol. 2, No. 4, pp 175-202;
	February, 1980.

Can we all now consider the statement "7 bits is enough" most sincerely
dead?

--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw@eddie.mit.edu

"The Unicorn is a pony that has been tragically disfigured by radiation
burns."
		-- Dr. Science