Newsgroups: comp.compression
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!sarah!bingnews!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Subject: Re: theoretical compression factor
Message-ID: <1991Mar29.142549.15283@bingvaxu.cc.binghamton.edu>
Organization: State University of New York at Binghamton
References: <1991Mar27.120829.26094@bingvaxu.cc.binghamton.edu> <20135@alice.att.com> <1991Mar29.031127.9128@bingvaxu.cc.binghamton.edu>
Date: Fri, 29 Mar 1991 14:25:49 GMT

In article <20135@alice.att.com> jj@alice.UUCP (jj, like it or not) writes:
>First, the question of "how much can I compress X" must
>be qualified.  If you have a historyless compression,
>you can compress up to the entropy ( sum of -p log p) bound,
>where p is the probability of each token.

Here is some more data and another argument for your amusement.

[If you're well up on information theory you may as well hit `n' here].

The p log p ``bound'' is actually a disguised expectation value;
it is an _average_. It denotes the average amount of information
per symbol in a string. For example, if the string consists of
only the symbols 0 and 1 we write -(p lg p +(1-p) lg (1-p)) for
the average amount of information (in bits) per 0/1 symbol.  [lg(x) is log 
base 2 -- the size of the alphabet in this case].  If this average is 0.5, 
say, then we might expect that we could compress the string to 1/2 its 
size in some fashion without losing any ``information'' since each 0/1 
symbol in the string would only be representing 0.5 of a bit on average.

(This measure does not take into account the _sequence_ of 0's
and 1's in the string; we're actually treating the string as a
multiset.)
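
A minimal Python sketch of the two-symbol formula above. The function name
binary_entropy is only an illustrative label, and 0 lg 0 is taken to be 0
by the usual convention.

```python
import math

def binary_entropy(p):
    """Average information in bits per 0/1 symbol when Pr(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # convention: 0 lg 0 = 0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# A string that is about 11% ones carries roughly half a bit per symbol.
print(binary_entropy(0.5))   # 1.0
print(binary_entropy(0.11))  # roughly 0.5
```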

Since we're talking about an _average_ it is quite possible to have
particular cases where the ``compression factor'' is less. Averages,
after all, are statistical variates and have distributions of their own.

So what is the distribution of the ``p lg p'' compression factor? Rather hard
to write in closed form as far as I know, so let's look at a table. The
following shows, for each value `x', the likelihood that the ``compression
factor'' X is less than `x'.

x	Pr(X<x)
0.1	0.0251
0.2	0.0624
0.3	0.1072
0.4	0.1578
0.5	0.2199
0.6	0.2964
0.7	0.3813
0.8	0.4853
0.9	0.6331

E(X)	0.720872
var(X)	0.0730559

For example, a file can be compressed to 50% or less about 22% of the time 
(presuming the strings being used come uniformly from all possible strings). 
We also see that the ``expected'' value of the ``p lg p'' factor is 72%. We
therefore might expect, on average, only to be able to compress a file to 72%
of its original size, all things being equal.
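
The figures in the table look consistent with treating p itself as uniformly
distributed on [0,1]. That is an assumption read off the numbers, not
something stated above. Under it, Pr(X<x) is twice the value of p below 1/2
at which the entropy reaches x, because the entropy is symmetric about
p = 1/2. A sketch under that assumption (all names hypothetical):

```python
import math

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def cdf(x, iterations=60):
    """Pr(X < x) for X = binary_entropy(p), ASSUMING p uniform on [0, 1]."""
    lo, hi = 0.0, 0.5          # entropy rises monotonically on [0, 1/2]
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if binary_entropy(mid) < x:
            lo = mid
        else:
            hi = mid
    return 2 * lo              # symmetric about p = 1/2

def mean(steps=100_000):
    """E(X) under the same assumption, by the midpoint rule."""
    return sum(binary_entropy((i + 0.5) / steps) for i in range(steps)) / steps

for k in range(1, 10):
    print(f"{k / 10:.1f}\t{cdf(k / 10):.4f}")
print(f"E(X)\t{mean():.4f}")
```

The output should land close to the table above. Under the uniform-p
assumption E(X) works out to 1/(2 ln 2), about 0.7213.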

Looks bad? We can do MUCH, MUCH better. 

Order statistics is a relatively small branch of statistics that deals
with things sorted or arranged in order. It allows us to answer such
questions as ``if I pick up N candy bars, what is the largest one
I might expect to get'' (given that candy bars are not all exactly the
same size)? 

In our case we would like to know ``if I compressed my file with 
N different techniques, what might be the BEST I could expect (i.e. the 
_smallest_ result)''?

The tables at the end show how taking the minimum of several
values of ``p lg p'' can quickly reduce the expected value for
a _set_ of compression techniques.

Whereas a file might be compressed to 1/2 its size only 22% of the time using 
a SINGLE technique, using TWO such methods will halve the file 
39% of the time. Using 9 techniques will do so 89% of the time, etc.
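
The arithmetic behind these figures, assuming the N results are statistically
independent and identically distributed: the minimum falls below x unless
every one of the N results is at least x, so
Pr(min < x) = 1 - (1 - Pr(X<x))^N. A sketch with hypothetical function names:

```python
import math

def prob_min_below(single, n):
    """Pr(min of n independent results < x), given single = Pr(X < x)."""
    return 1.0 - (1.0 - single) ** n

def techniques_needed(single, target):
    """Smallest n with prob_min_below(single, n) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - single))

half = 0.2199  # Pr(X < 0.5) from the single-technique table
for n in (2, 9, 10):
    print(n, round(prob_min_below(half, n), 4))
print(techniques_needed(half, 0.99))
```

The last line asks how many independent techniques would be needed to halve
the file at least 99% of the time under the same assumptions.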

This is a GOOD THING if you can afford the computer time, and various ideas
present themselves. You might run several techniques in parallel and only
output the one that seems to be performing the best (a little code value in
the output might be nice to indicate when you switch methods :-).
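
A rough sketch of the "try several, keep the best, tag the output" idea.
The three compressors from Python's standard library are stand-ins chosen
for illustration, not the techniques discussed in this thread, and this
version picks one method for the whole input rather than switching partway
through. The tag values and function names are hypothetical.

```python
import bz2
import lzma
import zlib

# Tag byte -> (compress, decompress). Tag 0 stores the data unchanged,
# so the output is never more than one byte longer than the input.
METHODS = {
    0: (lambda d: d, lambda d: d),
    1: (zlib.compress, zlib.decompress),
    2: (bz2.compress, bz2.decompress),
    3: (lzma.compress, lzma.decompress),
}

def compress_best(data):
    """Try every method and keep the smallest result, prefixed by its tag."""
    tag, packed = min(
        ((t, c(data)) for t, (c, _) in METHODS.items()),
        key=lambda pair: len(pair[1]),
    )
    return bytes([tag]) + packed

def decompress_best(blob):
    """Read the tag byte and undo the matching method."""
    _, decompress = METHODS[blob[0]]
    return decompress(blob[1:])

sample = b"abracadabra " * 200
blob = compress_best(sample)
assert decompress_best(blob) == sample
print(blob[0], len(sample), len(blob))
```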

Another idea is to simply HASH your file by XOR'ing it with 10 different
pseudo-random generators and compress each with the SAME technique
and take the smallest result. More than about 90% of the time you should
get a 50% compression.
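
A sketch for trying this out empirically. Python's built-in generator
(a Mersenne Twister) and zlib stand in for whatever generator and compressor
you prefer, the seed has to be kept with the output so the mask can be
regenerated, and all function names are hypothetical.

```python
import random
import zlib

def mask(seed, n):
    """n pseudo-random bytes from Python's generator, reproducible by seed."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))

def xor(data, key):
    return bytes(a ^ b for a, b in zip(data, key))

def compress_masked(data, seeds=range(10)):
    """XOR with each seeded stream, compress, keep (seed, smallest output)."""
    return min(
        ((s, zlib.compress(xor(data, mask(s, len(data))))) for s in seeds),
        key=lambda pair: len(pair[1]),
    )

def decompress_masked(seed, packed):
    """Decompress, then XOR with the same seeded stream to undo the mask."""
    masked = zlib.decompress(packed)
    return xor(masked, mask(seed, len(masked)))

sample = b"the quick brown fox " * 50
seed, packed = compress_masked(sample)
assert decompress_masked(seed, packed) == sample
print(seed, len(sample), len(packed))
```

Comparing len(packed) with len(zlib.compress(sample)) is an easy way to see
how the idea fares on a given file.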

My little technique is similar to this last -- keep compressing the string 
using different parts of the SAME random-number sequence. To do this
you require a generator that has no correlations. (One of the usual
multiplicative congruential techniques will probably not suffice).

If you now include _HISTORY_ information (i.e. knowledge about the
sequence of symbols in the string) you should be able to do even better
still, on average.

However -- a reality check. There are SOME cases, and they may not be
rare enough, where the string has the property whereby it can NOT
be compressed no matter WHAT the technique. (I rather like the neat
argument posted by Doug :-).

For those interested, my little tables follow.

Cheers all,

-kym
===

BTW, a question. How much computation do I need to perform to
guarantee my previous ``2pq'' result >99% of the time?


Probabilities that the compression factor -(p lg p + (1-p) lg (1-p))
is less than a given value `x' when selecting the minimum result produced
by several statistically independent techniques.

x	Pr(min{X1,X2}<x)	Two techniques
0.1	0.04957
0.2	0.120906
0.3	0.202908
0.4	0.290699
0.5	0.391444		39% of the time compression < 1/2
0.6	0.504947
0.7	0.61721
0.8	0.735084
0.9	0.865384

x	Pr(min{X1,...X3}<x)	Three techniques
0.1	0.0734258
0.2	0.175762
0.3	0.288356
0.4	0.402627
0.5	0.525265		53% of the time compression < 1/2
0.6	0.651681
0.7	0.763168
0.8	0.863648
0.9	0.95061

x	Pr(min{X1,...X4}<x)	Four techniques
0.1	0.0966828
0.2	0.227194
0.3	0.364645
0.4	0.496892
0.5	0.62966			63%
0.6	0.754923
0.7	0.853472
0.8	0.929819
0.9	0.981879

x	Pr(min{X1,...X5}<x)	Five techniques
0.1	0.119356
0.2	0.275417
0.3	0.432755
0.4	0.576283
0.5	0.711097		71%
0.6	0.827564
0.7	0.909343
0.8	0.963878
0.9	0.993351

x	Pr(min{X1,...X6}<x)	Six techniques
0.1	0.14146
0.2	0.320631
0.3	0.493563
0.4	0.643145
0.5	0.774627		77%
0.6	0.878674
0.7	0.943911
0.8	0.981408
0.9	0.997561

x	Pr(min{X1,...X7}<x)	Seven techniques
0.1	0.16301
0.2	0.363024
0.3	0.547853
0.4	0.699457
0.5	0.824187		82%
0.6	0.914635
0.7	0.965297
0.8	0.990431
0.9	0.999105

x	Pr(min{X1,...X8}<x)	Eight techniques
0.1	0.184018
0.2	0.402771
0.3	0.596324
0.4	0.746883
0.5	0.862848		86%
0.6	0.939937
0.7	0.97853
0.8	0.995075
0.9	0.999672

x	Pr(min{X1,...X9}<x)	Nine techniques
0.1	0.204499
0.2	0.440038
0.3	0.639598
0.4	0.786825
0.5	0.893008		89%
0.6	0.95774
0.7	0.986716
0.8	0.997465
0.9	0.99988