Xref: utzoo comp.compression:65 alt.comp.compression:180 Path: utzoo!utgpu!cs.utexas.edu!usc!rpi!sarah!bingnews!kym From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) Newsgroups: comp.compression,alt.comp.compression Subject: Re: theoretical compression factor Message-ID: <1991Mar26.111900.8812@bingvaxu.cc.binghamton.edu> Date: 26 Mar 91 11:19:00 GMT References: <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu> <15098:Mar2512:53:1291@kramden.acf.nyu.edu> <18263:Mar2518:10:4091@kramden.acf.nyu.edu> Organization: State University of New York at Binghamton Lines: 135 My last attempt to post this article with either bumped or dumped in Dan's IN tray -- I don't know which. Sorry if you've seen it before. === Thanks, Dan, for your input. In article <15098:Mar2512:53:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: > There is no way that a compressor can turn every string with as many 0's > as 1's into a string of half the length. In article <18263:Mar2518:10:4091@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: > To forestall argument: There are (2n)!/(n!)^2 bit strings of length 2n > with n bits set. This exceeds (2n/e)^(2n) / (n/e)^(2n)(2\pi n)exp(1/6n), > i.e., 2^(2n) / (2\pi n)exp(1/6n). So for large n there are at least > 2n - 3 - log n bits of information in such strings, and there is no way > they can all be compressed to length n... While I have to agree with Dan's argument above a question that would be relevant to this group is WHAT FRACTION of strings can be compressed to strings of, say, <= 1/2 their length? Using the (possibly buggy) argument below, I find this to be ~ 1/sqrt(N). I would be interested in any refinement or other pointers. \begin{argument} One way we might approximate an answer is to consider `random' number generators (sorry, this is an old axe :-). A generator of bits, say, takes a number of parameters and can output, if properly managed, a sequence of 2^n bits. In certain circumstances we can guarantee that different n-bit parameter values will generate different bit sequences. Let's say we have a random bit generator of N bits long (in some kind of Turing/Wang formalism). It can generate of 2^N bits. By changing any of the bits in the N-bit program the sequence produced will be different. (I am talking about the limit of N here where _most_ of the N bits of generator are in fact parameters and not much is actual code where bit changes are typically `uninteresting' :-). We therefore arrive at: N-bit `program' => 2^N bits 2^N possible N-bit programs => (2^N)^2 bits Therefore the number of 2N-bit strings that can be generated (by this means) by N-bit programs is (2^N)^2/(2N) = 2^(2N-1)/N. Obviously there are a total of 2^(2N) 2N-bit strings so our random bit programs can generate the fraction: 2^(2N-1)/N ---------- 2^(2N) = 1 / 2N of them. Now, the number of 2N-bit strings with 1/2 1's= (2N)!/(N!)^2. Using the simple approx n! = sqrt(2 n pi) (n/e)^n (actually it's larger by a factor of about O(1+1/12n)) we find this last to be about: sqrt(2 2N pi) (2 N/e)^(2N) ~ -------------------------- 2 N pi (N/e)^(2N) 2^(2N) = ------ sqrt(N pi) So the fraction of 2N-bit strings that have 1/2 1's is therefore 2^(2N) ~ --------- / 2^(2N) sqrt(N pi) = 1 / sqrt(N pi) What proportion of these might our N-bit programs be capable of generating? All things being equal we find: 1/2N ~ ---------- 1/sqrt(N pi) = 1/2 sqrt(pi/N) which obviously ->0 (and is only good for large N anyway). \end{argument} If N were in the 10k's of bits (typical of files in my directory, anyway) then we would expect a 50% compression of `balanced' files using random number generators to occur in ~ 1% of cases -- not too often! As a kind of VERY ROUGH experiment, I ran all my current files through the Unix compress utility and selected those that had close to 1/2 1's. We find the following: Fraction 1's: Reduction factor: 0.506969 38.07% 0.47619 -33.33% 0.492063 -55.55% 0.491462 33.09% 0.502947 36.47% 0.503021 35.94% 0.478355 26.01% 0.492063 -55.55% 0.49483 30.57% 0.47619 -55.55% 0.510204 32.50% 0.47619 -55.55% 0.525429 31.15% 0.505414 22.98% 0.487941 12.12% 0.514909 48.48% Out of about 100 files these above contained between .48 and .52 (approx) 1's vs 0's. There were 16 attempts to compress (using the `standard' Lempel-Ziv adaptive coding). There were 5 failures. The average reduction where successful (geometric mean) was 30.04% (i.e. files compressed to about 70% of original length). Only 1 -- the last -- came _close_ to 50% reduction. Maybe a statistical accident? But then again this utility doesn't use the method I was talking about... My remaining argument, not having a fully-working version of the compression algorithm I was discussing in my original post to run on text files, must run along the lines of ``text files ain't random'' and thefore the ~ 1/sqrt(N) argument above does not apply. -kym