Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.PCS 1/10/84; site mtgzz.UUCP
Path: utzoo!linus!philabs!prls!amdimage!amdcad!decwrl!decvax!tektronix!uw-beaver!cornell!vax135!ariel!mtunf!mtunh!mtuxo!mtgzz!dmt
From: dmt@mtgzz.UUCP (d.m.tutelman)
Newsgroups: net.math,net.micro
Subject: Re: Data compression and information theory
Message-ID: <1010@mtgzz.UUCP>
Date: Fri, 2-Aug-85 14:01:43 EDT
Article-I.D.: mtgzz.1010
Posted: Fri Aug  2 14:01:43 1985
Date-Received: Tue, 6-Aug-85 05:40:10 EDT
References: <417@lasspvax.UUCP>
Organization: AT&T Information Systems Labs, Middletown NJ
Lines: 36
Xref: linus net.math:1793 net.micro:10075

> I have read somewhere that the information content of English text (such
> as one might read in the New York Times) is roughly one bit per character.
> I have also worked out (it is fairly simple) that a single-character 
> Huffman encoding of English text requires about five bits per character.
> One assumes that larger savings could be realized by using digraph or
> trigraph Huffman encoding, but such schemes rapidly become unweildy for
> a computer implementation (unless I am missing something).

The one-bit-per-character assertion comes from an old classic paper.
(Don't have a reference handy, but I believe it's by Claude Shannon
himself, published in BSTJ in the 1950s or even '40s.)
What he did was have human beings "guess" the next letter in
meaningful (i.e. - not nonsense) English sentences. "Space" was included
as a character.  He then computed the probability distribution
of the number of guesses required to correctly guess the next
letter.  The entropy of that distribution was a smidgen over one bit
(i.e.- people usually guessed right the first try).  

How do you use this result in a "real" encoder?  First, you build a
"guesser" that knows as much about the language and human experience
as the humans in Shannon's experiment. (Obviously that's the 
HARD PART.)  Feed it the string and count the number of guesses
it takes.  Huffman-code this number, based on the performance
distribution of the "guesser".  If the guesser is as good as Shannon's
subjects were, it'll be about one bit per letter.
The decoder is the inverse operation, using the identical algorithm
in the "guesser".  (If there is ANY difference in the number of
guesses, the reconstruction will quickly go out of sync.)

Implementation is still a far-off AI research project.

			Dave Tutelman
			Physical - AT&T Information Systems
				   Holmdel, NJ 07733
			Logical  - ...ihnp4!mtuxo!mtgzz!dmt
			Audible  - (201)-834-2895