Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!rutgers!ames!ucbcad!ucbvax!dewey.soe.berkeley.edu!oster
From: oster@dewey.soe.berkeley.edu (David Phillip Oster)
Newsgroups: comp.lang.c,comp.std.internat
Subject: Re: What is a byte
Message-ID: <20131@ucbvax.BERKELEY.EDU>
Date: Sat, 15-Aug-87 00:49:19 EDT
Article-I.D.: ucbvax.20131
Posted: Sat Aug 15 00:49:19 1987
Date-Received: Sun, 16-Aug-87 07:18:25 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP> <34@piring.cwi.nl> <1549@frog.UUCP> <8409@utzoo.UUCP>
Sender: usenet@ucbvax.BERKELEY.EDU
Reply-To: oster@dewey.soe.berkeley.edu.UUCP (David Phillip Oster)
Organization: School of Education, UC-Berkeley
Lines: 34
Keywords: 32 bit bytes! You ain't seen nothin', yet.
Xref: mnetor comp.lang.c:3657 comp.std.internat:100

In article <8409@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Remember that the OED includes an awful lot of words that are obsolete or
>terminally obscure by anyone's standards. It is not a dictionary of current
>English.

That's part of the point. Would you support an encoding scheme that
prevented me from using English documents, even those containing obsolete
or obscure words, on my computer? Well, if we are going to standardize on
an encoding for Chinese, it should be able to cover ALL of Chinese.

There is no reason why we couldn't use a Huffman encoding scheme: the 14
most common ideograms fit in a 4-bit nybble, the 15th pattern is a filler,
and the 16th pattern means that the next byte encodes the 254 next most
common ideograms, the 255th bit pattern meaning that the next 16-bit word
holds the 65534 next most common, and so on.

That way, the average length of a run of Chinese text is likely to be
about 10 bits per ideogram, and any single ideogram would have a canonical
64-bit representation: its bit pattern in the left of the 64 bits,
including any nybble-shift, byte-shift, or word-shift bit patterns, and
padded out with filler nybbles.

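To make the scheme above concrete, here is a minimal sketch in Python. It
is an escape-prefixed code keyed on frequency rank (0 = most common
ideogram) rather than a true Huffman tree built from measured frequencies.
Several details are assumptions of mine, not part of the proposal as
stated: the function names, the choice of 0xFF as the byte-level escape
with 0xFE left reserved, the reserved top two word values, and stopping
after the 16-bit level where the post says "and so on".

```python
FILLER = 0xE        # 15th nybble pattern: padding only
ESCAPE = 0xF        # 16th nybble pattern: a byte-level code follows
BYTE_ESCAPE = 0xFF  # assumed byte pattern meaning "a 16-bit word follows"


def encode_rank(rank):
    """Encode a frequency rank (0 = most common) as a list of nybbles."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    if rank < 14:                       # 14 most common: one nybble
        return [rank]
    rank -= 14
    if rank < 254:                      # next 254: escape nybble + one byte
        return [ESCAPE, rank >> 4, rank & 0xF]
    rank -= 254
    if rank < 65534:                    # next 65534: both escapes + 16-bit word
        return [ESCAPE, 0xF, 0xF] + [(rank >> s) & 0xF for s in (12, 8, 4, 0)]
    raise ValueError("rank is beyond the three levels sketched here")


def decode(nybbles):
    """Decode a nybble stream back into ranks, skipping filler nybbles."""
    ranks = []
    it = iter(nybbles)
    for n in it:
        if n == FILLER:
            continue
        if n != ESCAPE:
            ranks.append(n)
            continue
        byte = (next(it) << 4) | next(it)
        if byte < 254:
            ranks.append(14 + byte)
        elif byte == BYTE_ESCAPE:
            word = 0
            for _ in range(4):
                word = (word << 4) | next(it)
            if word >= 65534:
                raise ValueError("reserved 16-bit pattern")
            ranks.append(14 + 254 + word)
        else:
            raise ValueError("reserved byte pattern 0xFE")
    return ranks


def canonical64(rank):
    """Left-align the code in 64 bits and pad with filler nybbles."""
    nybbles = encode_rank(rank)
    nybbles = nybbles + [FILLER] * (16 - len(nybbles))
    value = 0
    for n in nybbles:
        value = (value << 4) | n
    return value
```

Under these assumptions an ideogram costs 4, 12, or 28 bits depending on
which level its rank falls in (4 * len(encode_rank(rank))), and every code
is a whole number of nybbles, so padding to 64 bits with filler nybbles
always works.
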
Now, all we have to do is pick an ideogram frequency standard.

Say, this idea would also work for English. Assuming that the average
English word takes 6*8 = 48 bits (an average length of 5 letters plus a
terminating space, times 8-bit ASCII), you could cut the disk space
required for computer storage by a factor of close to 5 by using this
encoding scheme, provided words averaged about 10 bits each as estimated
above for ideograms. Too bad that you'd need to keep a mammoth word list
in main memory to unpack it speedily. Might be a nice way to increase the
effective bandwidth of all those modems pushing UseNet around, though.

--- David Phillip Oster            --My Good News: "I'm a perfectionist."
Arpa: oster@dewey.soe.berkeley.edu --My Bad News: "I don't charge by the hour."
Uucp: {seismo,decvax,...}!ucbvax!oster%dewey.soe.berkeley.edu