Xref: utzoo comp.compression:26 alt.comp.compression:163
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!rpi!sarah!bingnews!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Newsgroups: comp.compression,alt.comp.compression
Subject: theoretical compression factor
Summary: what is it?
Keywords: random
Message-ID: <1991Mar25.031214.25696@bingvaxu.cc.binghamton.edu>
Date: 25 Mar 91 03:12:14 GMT
Organization: State University of New York at Binghamton
Lines: 93

I'm investigating a random number technique to compress text.

It runs along the following lines.

Some strings have `low complexity' in the sense of Martin-Lo{f.
(The program that can type them out is much shorter than the
actual string).

For such strings it is obviously better to store or transmit the _program_ 
than the string. Some method needs to be found to enumerate the possible 
programs useful for this purpose.

I am thinking of using strings generated by pseudo-random means
(I understand in similar vein to LPC coding). The message is first analyzed 
to determine some basic statistics (e.g. frequencies of groups of bits of 
lengths 1 to N) and this is used to select a generator. The generator will 
generate strings with groups in the appropriate proportions -- this sequence 
is compared with the original message and only the `differences' saved. The 
idea is the generator will be able to create a string that is very similar to 
the original and thereby reduce the number of differences that need to be
stored/sent.

Beside a great deal of freedom with generating the random strings,
another field of experimentation seems to be the method of _comparison_.
Some kinds of `differences' will be more sensitive than others to
certain types of noise. For example, if the `random string' is:

	010110111011101111011111

and the original message was:

	000010110111011110111110

there are 9 differences using `phase 0' but none using `phase 3'.

What kind of compression factors might be expected? Let's take a simple case 
of bitstrings.

Suppose we have a non-random string with p 1's and (1-p) 0's.
If we generate a random string of bits with this same proportion of 1's 
and 0's in how many places will they differ? If we call the message A
and the random string B we have:

Prob(A=1 & B=0 | A=0 & B=1) =
	Prob(A=1)Prob(B=0) + Prob(A=0)Prob(B=1)

Independence is presumed (which is not absolutely true since we analyze the 
message string to _set up_ the generator). We then find

Prob(difference) = 2p(1-p)

In a message of N bits we would then find 2p(1-p)N differences on average. 
We could adjust a phase parameter to further minimize this number of
differences, but the O(N) dependency would remain (I _think_).

To send all the `differences' we just encode the sequence of i_j where
A_i_j != B_{i_j+k}. This can be done using `delta modulation' (i.e. the
differences of the i_{j+1} and i_j are encoded). The average size
of each difference will be O(log N/N) bits each.

We therefore have a total compressed string of

~	2p(1-p)N log N/N = 2p(1-p) log N

(ignoring the parameters required to reconstitute the message; these would 
also need to be sent but are of O(log N) size anyway) and the 
compression ratio:

	|compressed string|
	---------------
	|original string|

~	2p(1-p) log N / N

I presume the constant of proportionality will be fairly large since with the 
following strings I get only 80-90% compression factors (using the best of 
100 random bit generators chosen at random :-):

abcdefghijklmnopqrstuvwxyz<NEWLINE>
189 bits
max compression=0.873016

this is a test message of a very short string<NEWLINE>
322 bits
max compression=0.813665

These small examples are not very impressive, but the complexity
of the method as N->inf is interesting (log N/N).

Any comments/suggestions? (Please be nice if I've made another of my
classic tremendous goofs :-).

-kym