Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!columbia!amsterdam!dupuy
From: dupuy@amsterdam.columbia.edu (Alexander Dupuy)
Newsgroups: comp.lang.c,comp.std.internat
Subject: Re: What is a byte
Message-ID: <4906@columbia.edu>
Date: Wed, 31-Dec-69 18:59:59 EDT
Article-I.D.: columbia.4906
Posted: Wed Dec 31 18:59:59 1969
Date-Received: Sun, 16-Aug-87 13:09:54 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP> <34@piring.cwi.nl> <1549@frog.UUCP> <8409@utzoo.UUCP> <20131@ucbvax.BERKELEY.EDU>
Sender: nobody@columbia.edu
Reply-To: dupuy@amsterdam.columbia.edu (Alexander Dupuy)
Followup-To: comp.std.internat
Organization: Columbia University Computer Science Dept.
Lines: 37
Summary: Disk storage is *not* what it's all about
Xref: mnetor comp.lang.c:3673 comp.std.internat:105

In article <20131@ucbvax.BERKELEY.EDU> oster@dewey.soe.berkeley.edu.UUCP
(David Phillip Oster) writes:
> There is no reason why we couldn't use a huffman encoding
>scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th
>pattern is a filler, and the 16th pattern means that the next byte
>encodes the 254 next most common ideograms, the 255 bit pattern
>meaning that the next 16-bit word had the 65534 next most common, and
>so on.
>
>That way, the average length of a run of chinese text is
>likely to be about 10 bits per ideogram, and any single ideogram would
>have canonical 64 bit representation: its bit pattern in the left of
>the 64 bits, including any nybble-shift, byte-shift, or word-shift bit
>patterns and padded out with filler nybbles.

This underscores the central tradeoff in a code for Chinese or
Chinese/Japanese - compact representation to save disk space versus
consistent (same character size) representation for processing. But there
is really no reason we have to trade these off against each other.
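
As a rough illustration of the size side of this tradeoff, the C sketch
below counts bits for the quoted nybble/byte/word scheme against a fixed
32-bit character. The tier boundaries are only one reading of the quoted
description, ranks are assumed to be 0-based frequency ranks, and the names
code_bits, packed_bits and fixed_bits are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed tier sizes, read off the quoted description. */
#define NYBBLE_RANKS 14u     /* nybble values 0..13; 14 = filler, 15 = shift */
#define BYTE_RANKS   254u    /* byte values 0..253; 255 = shift to a word    */
#define WORD_RANKS   65534u  /* the next 65534 ideograms, in a 16-bit word   */

/* Bits needed to store the ideogram with frequency rank `rank`
 * (0 = most common), counting any shift patterns that precede it.
 * Returns 0 if the rank lies beyond the word tier ("and so on"). */
unsigned code_bits(uint32_t rank)
{
    if (rank < NYBBLE_RANKS)
        return 4;                /* one nybble */
    rank -= NYBBLE_RANKS;
    if (rank < BYTE_RANKS)
        return 4 + 8;            /* shift nybble + byte */
    rank -= BYTE_RANKS;
    if (rank < WORD_RANKS)
        return 4 + 8 + 16;       /* shift nybble + shift byte + word */
    return 0;
}

/* Total bits for a run of text stored in the variable-length code. */
unsigned long packed_bits(const uint32_t *ranks, size_t n)
{
    unsigned long total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += code_bits(ranks[i]);
    return total;
}

/* Total bits for the same run stored as fixed 32-bit characters. */
unsigned long fixed_bits(size_t n)
{
    return (unsigned long)n * 32ul;
}
```

For a run with ranks 0, 20 and 300 this gives 4 + 12 + 28 = 44 bits packed
against 96 bits fixed. The packed form is smaller, but finding the Nth
character in it means decoding everything before it, which is the
processing cost the fixed-size form avoids.
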
We can just define a consistent representation for processing (24 or 32
bits will suffice - I don't think we need 64) and use a compression
algorithm (Lempel-Ziv, Huffman, whatever, as long as it's standard, and not
too expensive to decode/encode) when we aren't manipulating individual
characters. Some languages even have rudimentary forms of support for this
(packed array of char vs. array of char in Pascal).

It's clear that operating system support has to be much better than it is
now for there to be any hope of writing programs which are portable between
Latin-only, Chinese/Japanese-only, and Chinese/Japanese/Latin environments.
I don't see the programming language constructs as being the major problem.

@alex
---
arpanet: dupuy@columbia.edu
uucp: ...!seismo!columbia, and i