Xref: utzoo comp.compression:65 alt.comp.compression:180
Path: utzoo!utgpu!cs.utexas.edu!usc!rpi!sarah!bingnews!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Newsgroups: comp.compression,alt.comp.compression
Subject: Re: theoretical compression factor
Message-ID: <1991Mar26.111900.8812@bingvaxu.cc.binghamton.edu>
Date: 26 Mar 91 11:19:00 GMT
References: <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu> <15098:Mar2512:53:1291@kramden.acf.nyu.edu> <18263:Mar2518:10:4091@kramden.acf.nyu.edu>
Organization: State University of New York at Binghamton
Lines: 135

My last attempt to post this article with either bumped or dumped in
Dan's IN tray -- I don't know which. Sorry if you've seen it before.
===

Thanks, Dan, for your input. 

In article <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> There is no way that a compressor can turn every string with as many 0's
> as 1's into a string of half the length.

In article <18263:Mar2518:10:4091@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> To forestall argument: There are (2n)!/(n!)^2 bit strings of length 2n
> with n bits set. This exceeds (2n/e)^(2n) / (n/e)^(2n)(2\pi n)exp(1/6n),
> i.e., 2^(2n) / (2\pi n)exp(1/6n). So for large n there are at least
> 2n - 3 - log n bits of information in such strings, and there is no way
> they can all be compressed to length n...

While I have to agree with Dan's argument above a question that would be 
relevant to this group is WHAT FRACTION of strings can be compressed to 
strings of, say, <= 1/2 their length? Using the (possibly buggy) argument 
below, I find this to be ~ 1/sqrt(N). I would be interested in any refinement 
or other pointers.

\begin{argument}

One way we might approximate an answer is to consider `random' number
generators (sorry, this is an old axe :-).

A generator of bits, say, takes a number of parameters and can output, 
if properly managed, a sequence of 2^n bits. In certain circumstances we can 
guarantee that different n-bit parameter values will generate different bit 
sequences.

Let's say we have a random bit generator of N bits long (in some kind of
Turing/Wang formalism). It can generate of 2^N bits. By changing any of the 
bits in the N-bit program the sequence produced will be different. 
(I am talking about the limit of N here where _most_ of the N bits of 
generator are in fact parameters and not much is actual code where
bit changes are typically `uninteresting' :-).

We therefore arrive at:

N-bit `program'			=>	2^N bits
2^N possible N-bit programs	=>	(2^N)^2 bits

Therefore the number of 2N-bit strings that can be generated (by this 
means) by N-bit programs is (2^N)^2/(2N) = 2^(2N-1)/N.

Obviously there are a total of 2^(2N) 2N-bit strings so our random bit 
programs can generate the fraction:

	2^(2N-1)/N
	----------
	2^(2N)

=	1 / 2N

of them.

Now, the number of 2N-bit strings with 1/2 1's= (2N)!/(N!)^2.  Using the 
simple approx n! = sqrt(2 n pi) (n/e)^n (actually it's larger by a factor of 
about O(1+1/12n)) we find this last to be about:

	sqrt(2 2N pi) (2 N/e)^(2N)
~	--------------------------
	2 N pi (N/e)^(2N)

	2^(2N)
=	------
	sqrt(N pi)

So the fraction of 2N-bit strings that have 1/2 1's is therefore

	2^(2N)
~	--------- / 2^(2N)
	sqrt(N pi)

=	1 / sqrt(N pi)

What proportion of these might our N-bit programs be capable of generating?
All things being equal we find:

	1/2N
~	----------
	1/sqrt(N pi)

=	1/2 sqrt(pi/N)

which obviously ->0 (and is only good for large N anyway).

\end{argument}

If N were in the 10k's of bits (typical of files in my directory, anyway) then 
we would expect a 50% compression of `balanced' files using random number 
generators to occur in ~ 1% of cases -- not too often! 

As a kind of VERY ROUGH experiment, I ran all my current files through the Unix 
compress utility and selected those that had close to 1/2 1's. We find the 
following:

Fraction 1's:	Reduction factor:

0.506969	38.07% 
0.47619		-33.33%
0.492063	-55.55%
0.491462	33.09%
0.502947	36.47%
0.503021	35.94%
0.478355	26.01%
0.492063	-55.55%
0.49483		30.57%
0.47619		-55.55%
0.510204	32.50%
0.47619		-55.55%
0.525429	31.15%
0.505414	22.98%
0.487941	12.12%
0.514909	48.48%

Out of about 100 files these above contained between .48 and .52 (approx) 
1's vs 0's.  There were 16 attempts to compress (using the `standard' 
Lempel-Ziv adaptive coding).  There were 5 failures.  The average reduction 
where successful (geometric mean) was 30.04% (i.e. files compressed to about 
70% of original length).

Only 1 -- the last -- came _close_ to 50% reduction. Maybe a statistical
accident? But then again this utility doesn't use the method I was
talking about...

My remaining argument, not having a fully-working version of the 
compression algorithm I was discussing in my original post to run on text 
files, must run along the lines of ``text files ain't random'' and thefore 
the ~ 1/sqrt(N) argument above does not apply.

-kym