Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!rutgers!ames!ucbcad!ucbvax!dewey.soe.berkeley.edu!oster
From: oster@dewey.soe.berkeley.edu (David Phillip Oster)
Newsgroups: comp.lang.c,comp.std.internat
Subject: Re: What is a byte
Message-ID: <20131@ucbvax.BERKELEY.EDU>
Date: Sat, 15-Aug-87 00:49:19 EDT
Article-I.D.: ucbvax.20131
Posted: Sat Aug 15 00:49:19 1987
Date-Received: Sun, 16-Aug-87 07:18:25 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP> <34@piring.cwi.nl> <1549@frog.UUCP> <8409@utzoo.UUCP>
Sender: usenet@ucbvax.BERKELEY.EDU
Reply-To: oster@dewey.soe.berkeley.edu.UUCP (David Phillip Oster)
Organization: School of Education, UC-Berkeley
Lines: 34
Keywords: 32 bit bytes!  You ain't seen nothin', yet.
Xref: mnetor comp.lang.c:3657 comp.std.internat:100

In article <8409@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Remember that the OED includes an awful lot of words that are obsolete or
>terminally obscure by anyone's standards.  It is not a dictionary of current
>English.

That's part of the point. Would you support an encoding scheme that
prevented me from using English documents, even those containing
obsolete or obscure words, on my computer? Well, if we are going to
standardize on an encoding for Chinese, it should be able to cover ALL
of Chinese.  There is no reason why we couldn't use a Huffman-style
encoding scheme: the 14 most common ideograms fit in a 4-bit nybble, the
15th pattern is a filler, and the 16th pattern means that the next byte
encodes the 254 next most common ideograms, with the bit pattern 255
meaning that the next 16-bit word holds the 65534 next most common, and
so on.
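
The tiers above can be sketched in a few lines of Python. This is only
one reading of the scheme, not a worked-out standard: it treats the
stream as a flat sequence of nybbles, takes "255" to mean the byte value
255, and leaves byte value 254 unassigned. The names encode_rank and
decode_stream are made up for this illustration, and ranks count from 0
for the most common ideogram.

```python
FILLER = 0xE   # the 15th nybble pattern: padding, carries no ideogram
ESCAPE = 0xF   # the 16th nybble pattern: "look in the next byte"

NYBBLE_TIER = 14      # ranks 0..13 fit directly in one nybble
BYTE_TIER = 254       # the next 254 ranks live in the byte after an escape
WORD_TIER = 65534     # the next 65534 ranks live in the word after byte 255


def encode_rank(rank):
    """Return the code for a frequency rank as a list of nybbles."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    if rank < NYBBLE_TIER:
        return [rank]
    rank -= NYBBLE_TIER
    if rank < BYTE_TIER:
        return [ESCAPE, rank >> 4, rank & 0xF]
    rank -= BYTE_TIER
    if rank < WORD_TIER:
        # Byte 255 (nybbles F, F) escapes to the 16-bit tier.
        return [ESCAPE, 0xF, 0xF] + [(rank >> s) & 0xF for s in (12, 8, 4, 0)]
    raise ValueError("rank is beyond the three tiers sketched here")


def decode_stream(nybbles):
    """Turn a flat list of nybbles back into a list of frequency ranks."""
    ranks = []
    i = 0
    while i < len(nybbles):
        n = nybbles[i]
        i += 1
        if n == FILLER:
            continue                      # padding, skip it
        if n != ESCAPE:
            ranks.append(n)               # one of the 14 most common
            continue
        byte = (nybbles[i] << 4) | nybbles[i + 1]
        i += 2
        if byte < BYTE_TIER:
            ranks.append(NYBBLE_TIER + byte)
        elif byte == 0xFF:
            word = 0
            for part in nybbles[i:i + 4]:
                word = (word << 4) | part
            i += 4
            if word >= WORD_TIER:
                raise ValueError("deeper tiers are not sketched here")
            ranks.append(NYBBLE_TIER + BYTE_TIER + word)
        else:
            # Assumption: byte pattern 254 is left unassigned.
            raise ValueError("byte pattern 254 is unassigned in this sketch")
    return ranks
```

Under these assumptions the 15th most common ideogram (rank 14) comes
out as the three nybbles F, 0, 0, and a stream survives a round trip
through encode_rank and decode_stream with filler nybbles dropped.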

That way, the average cost of a run of Chinese text is likely to be
about 10 bits per ideogram, and any single ideogram would have a
canonical 64-bit representation: its bit pattern left-justified in
the 64 bits, including any nybble-shift, byte-shift, or word-shift bit
patterns, and padded out with filler nybbles.
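
A rough sketch of both ideas, in Python. The helper names (canonical64,
code_bits, average_bits) are invented here, and the tier sizes assume
that an escaped byte costs 4+8 bits and an escaped word 4+8+16 bits. No
real frequency table is assumed: average_bits takes whatever counts you
hand it, most common ideogram first, so the "about 10 bits" figure
still needs actual data behind it.

```python
FILLER = 0xE  # filler nybble used for padding


def canonical64(nybbles):
    """Left-justify a code in 64 bits, padding with filler nybbles."""
    if len(nybbles) > 16:
        raise ValueError("code does not fit in 64 bits")
    padded = list(nybbles) + [FILLER] * (16 - len(nybbles))
    value = 0
    for n in padded:
        value = (value << 4) | n
    return value


def code_bits(rank):
    """Bits used by the ideogram of a given frequency rank (0 = most common)."""
    if rank < 14:
        return 4                 # one nybble
    if rank < 14 + 254:
        return 4 + 8             # escape nybble plus a byte
    if rank < 14 + 254 + 65534:
        return 4 + 8 + 16        # escape nybble, escape byte, 16-bit word
    raise ValueError("rank is beyond the three tiers sketched here")


def average_bits(counts):
    """Mean bits per ideogram for counts listed most common first."""
    total = sum(counts)
    return sum(c * code_bits(r) for r, c in enumerate(counts)) / total
```

Note that one more tier (a 32-bit word after an escape word) would cost
4+8+16+32 = 60 bits, which still fits in 64 with one filler nybble to
spare.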

 
Now, all we have to do is pick an ideogram frequency standard.  Say,
this idea would also work for English. Assuming that the average
English word takes 6*8 bits (an average length of 5 letters plus a
terminating space, times 8-bit ASCII), you could cut the disk space
required for computer storage by a factor of close to 5 by using this
encoding scheme. Too bad that you'd need a mammoth word list in main
memory to unpack it speedily. Might be a nice way to increase the
effective bandwidth of all those modems pushing UseNet around though.
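
The arithmetic, spelled out as a small Python sketch. The 10 bits per
coded word is an assumption carried over from the ideogram estimate
(it is what "a factor of close to 5" implies), not a measured figure,
and compression_factor is just a name made up for the illustration.

```python
ASCII_BITS = 8  # bits per character in 8-bit ASCII


def compression_factor(avg_letters=5, coded_bits_per_word=10):
    """Ratio of plain ASCII cost to coded cost for an average word."""
    # Each word costs its letters plus one terminating space.
    ascii_bits_per_word = (avg_letters + 1) * ASCII_BITS
    return ascii_bits_per_word / coded_bits_per_word
```

With the defaults, the plain cost is 48 bits per word and the ratio is
4.8, i.e. close to 5.
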
--- David Phillip Oster            --My Good News: "I'm a perfectionist."
Arpa: oster@dewey.soe.berkeley.edu --My Bad News: "I don't charge by the hour."
Uucp: {seismo,decvax,...}!ucbvax!oster%dewey.soe.berkeley.edu