Newsgroups: comp.compression
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!sarah!bingnews!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Subject: Re: theoretical compression factor
Message-ID: <1991Mar29.031127.9128@bingvaxu.cc.binghamton.edu>
Summary: maybe
Organization: State University of New York at Binghamton
References: <1991Mar26.220519.1242@unislc.uucp> <1991Mar27.120829.26094@bingvaxu.cc.binghamton.edu> <20135@alice.att.com>
Date: Fri, 29 Mar 1991 03:11:27 GMT

In article <20135@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:
>Please, folks, you're confusing several issues here.

Who/what/where -- _ME_ -- confused??!! :-)

>First, the question of "how much can I compress X" must
>be qualified.  If you have a historyless compression,
>you can compress up to the entropy ( sum of -p log p) bound,
>where p is the probability of each token.
>
>If you have history in your compression algorthm, you may (or
>may not) be able to do better, depending on the randomness
>of your token distribution (even given a fixed distribution of
>tokens).

Ok, I remember something of this ilk ``p log p'' stuff from Shannon
and his buddies, but I'm not too sure whether you (or he, or whoever) is 
saying historyless compression can not, no way, ever be LESS than this. 
But it _seems_ like you are saying this is your understanding, doesn't it?

If so, I have a question and another question.

Consider the scheme I outlined earlier -- of comparing an input bitstream with 
a random bitstream and just storing the differences in delta form (e.g. 
fixed-length or Huffman coded). This doesn't _appear_ to use history -- the 
next bit output by an uncompressing program doesn't use the previous bits of 
the uncompressed string to obtain the next bit. The compressing program
doesn't build up any dictionary. Or does it? (That's Q1).

There _is_ however, a quibble about a few hidden bits -- e.g. the length
of the random generator parameters and the compress/uncompress program
code. 

The program, however, is finite. If we make the input string large
enough the program length and any fiddling little parameters can be
swamped by the length of the string itself.

Now consider the following data obtained from an experimental
compression program based on the idea above:

====================
h.dat
nb=32400		<--- number of bits read
n1=0.341605		<--- fraction of 1's
-(p log p + q log q)=0.642094	<---	limit 1
-(p lg p + q lg q)=0.926346	<---	maybe this is log_2 instead of log_e?
2pq=0.449822		<--- my limit (sure looks low, doesn't it?)
minlen=20625		<--- after some trials this is the MINIMUM LENGTH 
				output string
compression=0.636574	<--- less than - sum p ln p

n.dat
nb=8080
n1=0.336015
-(p log p + q log q)=0.638357
-(p lg p + q lg q)=0.920954
2pq=0.446218
minlen=5028
compression=0.622277	<--- less than - sum p ln p

n2.dat
nb=16160
n1=0.334653
-(p log p + q log q)=0.637425
-(p lg p + q lg q)=0.91961
2pq=0.445321
minlen=10042
compression=0.621411	<--- less than - sum p ln p

n4.dat
nb=32320
n1=0.334653
-(p log p + q log q)=0.637425
-(p lg p + q lg q)=0.91961
2pq=0.445321
minlen=20088
compression=0.621535	<--- less than - sum p ln p
====================

Here are 4 data files compressed to LESS THAN the limit quoted above.
Certainly not _substantially less_, but less nevertheless. 8-)

The files aren't big. But they do vary an order of magnitude in
size or so. I therefore presume we _could_ arrange to have the data file
size -> inf. In that case, even including the size of the
compress/decompress program, we have case(s) where the ``p log p'' limit
seems to be beaten.

What gives? Perhaps there are additional O(1) factors involved in the 
``p log p'' limit? (That's Q2).

-kym

BTW, the data files are hex and decimal digits with about 50% ascii 0's.
The balance of 0 vs 1 *bits* is given in each case above.