Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!rutgers!ames!think!bloom-beacon!bu-cs!m2c!frog!john
From: john@frog.UUCP (John Woods, Software)
Newsgroups: comp.lang.c,comp.std.internat
Subject: Re: What is a byte
Message-ID: <1549@frog.UUCP>
Date: Tue, 11-Aug-87 18:38:00 EDT
Article-I.D.: frog.1549
Posted: Tue Aug 11 18:38:00 1987
Date-Received: Fri, 14-Aug-87 02:14:36 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP> <34@piring.cwi.nl>
Organization: Superfrog Heaven [ CRDS, Framingham MA ]
Lines: 34
Keywords: 32 bit bytes! You ain't seen nothin', yet.
Xref: mnetor comp.lang.c:3594 comp.std.internat:91

In article <34@piring.cwi.nl>, lambert@cwi.nl (Lambert Meertens) writes:
>In article <2034@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
>)While we're developing nightmares about the number of bits the Japanese
>)need in a char, remember for text processing that for 1 billion of the
>)earth's residents, the smallest unit of text processing is the ideograph,
>)and that even 21 bits is probably barely sufficient to represent the number
>)of written words in Chinese.
>
>Are you suggesting that there are more than 2**20 = 1048576 different
>written words in Chinese?  At typically 60 entries on a page, their
>dictionaries must have then some 17500 pages or more.  I think that 16 bits
>are enough to accommodate all Chinese characters, and certainly ample for
>the about 5000 that are in actual use.
>

In the English dictionary that the documentation department here uses,
there are 320,000 words.  I am told that the Oxford English Dictionary has
nearly 1,000,000 words, and that the total English language has just
over 1,000,000 words.  Chinese is probably about the same.

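For anyone who wants to check the bit counts being thrown around, here is a
minimal sketch in C.  The function names bits_needed and pages_needed are
made up for illustration and appear nowhere in the thread; the sketch only
assumes that every word or character gets one fixed-width code.

```c
#include <assert.h>

/*
 * Smallest number of bits b such that 2^b distinct codes are enough
 * to give each of n symbols its own fixed-width code.
 * (Hypothetical helper, written for this illustration.)
 */
unsigned bits_needed(unsigned long n)
{
    unsigned bits = 0;

    assert(n >= 1);
    n--;                    /* highest code value actually needed */
    while (n != 0) {        /* count the bits in that value */
        n >>= 1;
        bits++;
    }
    return bits;
}

/*
 * Pages needed to list n entries at per_page entries per page,
 * rounded up.  (Hypothetical helper as well.)
 */
unsigned long pages_needed(unsigned long n, unsigned long per_page)
{
    assert(per_page >= 1);
    return (n + per_page - 1) / per_page;
}
```

With these, bits_needed(5000) is 13, bits_needed(320000) is 19, and
bits_needed(1000000) is 20, while pages_needed(1048576, 60) is 17477, in
line with the "some 17500 pages" figure quoted above.
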
I can see asking the Chinese to adopt some limited alphabet scheme (such as
Romaji used by the Japanese (if I remember correctly, a 3-Roman-character
spelling for each syllable of Kanji), or perhaps Roman phonetic spelling),
but telling them that some microscopic fraction of their language has to be
selected for interaction with computers is just flatly bogus.

(a side note to provoke more chuckles than thought:  are ideographs the
CISCs of language?  Perhaps that makes Morse code the RISC...)

--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA

"The Unicorn is a pony that has been tragically disfigured by radiation burns."
		-- Dr. Science