Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Posting-Version: version B 2.10.PCS 1/10/84; site mtgzz.UUCP Path: utzoo!linus!philabs!prls!amdimage!amdcad!decwrl!decvax!tektronix!uw-beaver!cornell!vax135!ariel!mtunf!mtunh!mtuxo!mtgzz!dmt From: dmt@mtgzz.UUCP (d.m.tutelman) Newsgroups: net.math,net.micro Subject: Re: Data compression and information theory Message-ID: <1010@mtgzz.UUCP> Date: Fri, 2-Aug-85 14:01:43 EDT Article-I.D.: mtgzz.1010 Posted: Fri Aug 2 14:01:43 1985 Date-Received: Tue, 6-Aug-85 05:40:10 EDT References: <417@lasspvax.UUCP> Organization: AT&T Information Systems Labs, Middletown NJ Lines: 36 Xref: linus net.math:1793 net.micro:10075 > I have read somewhere that the information content of English text (such > as one might read in the New York Times) is roughly one bit per character. > I have also worked out (it is fairly simple) that a single-character > Huffman encoding of English text requires about five bits per character. > One assumes that larger savings could be realized by using digraph or > trigraph Huffman encoding, but such schemes rapidly become unweildy for > a computer implementation (unless I am missing something). The one-bit-per-character assertion comes from an old classic paper. (Don't have a reference handy, but I believe it's by Claude Shannon himself, published in BSTJ in the 1950s or even '40s.) What he did was have human beings "guess" the next letter in meaningful (i.e. - not nonsense) English sentences. "Space" was included as a character. He then computed the probability distribution of the number of guesses required to correctly guess the next letter. The entropy of that distribution was a smidgen over one bit (i.e.- people usually guessed right the first try). How do you use this result in a "real" encoder? First, you build a "guesser" that knows as much about the language and human experience as the humans in Shannon's experiment. (Obviously that's the HARD PART.) Feed it the string and count the number of guesses it takes. Huffman-code this number, based on the performance distribution of the "guesser". If the guesser is as good as Shannon's subjects were, it'll be about one bit per letter. The decoder is the inverse operation, using the identical algorithm in the "guesser". (If there is ANY difference in the number of guesses, the reconstruction will quickly go out of sync.) Implementation is still a far-off AI research project. Dave Tutelman Physical - AT&T Information Systems Holmdel, NJ 07733 Logical - ...ihnp4!mtuxo!mtgzz!dmt Audible - (201)-834-2895