Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!columbia!amsterdam!dupuy
From: dupuy@amsterdam.columbia.edu (Alexander Dupuy)
Newsgroups: comp.lang.c,comp.std.internat
Subject: Re: What is a byte
Message-ID: <4906@columbia.edu>
Date: Wed, 31-Dec-69 18:59:59 EDT
Article-I.D.: columbia.4906
Posted: Wed Dec 31 18:59:59 1969
Date-Received: Sun, 16-Aug-87 13:09:54 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP> <34@piring.cwi.nl> <1549@frog.UUCP> <8409@utzoo.UUCP> <20131@ucbvax.BERKELEY.EDU>
Sender: nobody@columbia.edu
Reply-To: dupuy@amsterdam.columbia.edu (Alexander Dupuy)
Followup-To: comp.std.internat
Organization: Columbia University Computer Science Dept.
Lines: 37
Summary: Disk storage is *not* what it's all about
Xref: mnetor comp.lang.c:3673 comp.std.internat:105

In article <20131@ucbvax.BERKELEY.EDU> oster@dewey.soe.berkeley.edu.UUCP
 (David Phillip Oster) writes:

>  There is no reason why we couldn't use a huffman encoding
>scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th
>pattern is a filler, and the 16th pattern means that the next byte
>encodes the 254 next most common ideograms, the 255 bit pattern
>meaning that the next 16-bit word had the 65534 next most common, and
>so on.  
>
>That way, the average length of a run of chinese text is
>likely to be about 10 bits per ideogram, and any single ideogram would
>have canonical 64 bit representation: its bit pattern in the left of
>the 64 bits, including any nybble-shift, byte-shift, or word-shift bit
>patterns and padded out with filler nybbles.

This underscores the central tradeoff in a code for Chinese or Chinese/Japanese
- compact representation to save disk space versus consistent (same character
size) representation for processing.

But there is really no reason we have to trade these off against each other.
We can just define a consistent representation for processing (24 or 32 bits
will suffice - I don't think we need 64) and use a compression algorithm
(Lempel-Ziv, Huffman, whatever, as long as it's standard, and not too expensive
to decode/encode) when we aren't manipulating individual characters.  Some
languages even have rudimentary forms of support for this (packed array of char
vs. array of char in Pascal).
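
To make the split concrete, here is a rough sketch in C of the packed vs.
unpacked idea. It is only an illustration, not a real compressor: the names
wide_char, pack24 and unpack24 are made up for this example, and the sketch
assumes every character code fits in 24 bits. Processing code works on an
array of 32-bit cells, one character per cell; the packed form stores three
bytes per character and is used only when nobody is indexing individual
characters. A Lempel-Ziv or Huffman pass could take the place of the
packing step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical processing type: one character per 32-bit cell. */
typedef uint32_t wide_char;

/*
 * Pack n fixed-width characters into 3 bytes each, most significant
 * byte first.  The caller must supply an output buffer of at least
 * 3*n bytes.  Returns the number of bytes written.
 * Assumption: every code fits in 24 bits.
 */
size_t pack24(const wide_char *in, size_t n, unsigned char *out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        assert(in[i] <= 0xFFFFFFu);
        out[3 * i]     = (unsigned char)((in[i] >> 16) & 0xFF);
        out[3 * i + 1] = (unsigned char)((in[i] >> 8) & 0xFF);
        out[3 * i + 2] = (unsigned char)(in[i] & 0xFF);
    }
    return 3 * n;
}

/*
 * Expand a packed buffer back into fixed-width cells.  Trailing bytes
 * that do not make up a whole 3-byte group are ignored.  The caller
 * must supply room for nbytes/3 characters.  Returns the number of
 * characters produced.
 */
size_t unpack24(const unsigned char *in, size_t nbytes, wide_char *out)
{
    size_t n = nbytes / 3;
    size_t i;

    for (i = 0; i < n; i++) {
        out[i] = ((wide_char)in[3 * i] << 16)
               | ((wide_char)in[3 * i + 1] << 8)
               | (wide_char)in[3 * i + 2];
    }
    return n;
}
```

A program would call unpack24 when it reads text in, do all its indexing and
comparison on the wide_char array, and call pack24 only when writing the
text back out.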

It's clear that operating system support has to be much better than it is now
for there to be any hope of writing programs which are portable between
Latin-only, Chinese/Japanese-only, and Chinese/Japanese/Latin environments.
I don't see the programming language constructs as being the major problem.

@alex
---
arpanet: dupuy@columbia.edu
uucp:	...!seismo!columbia, and i