Xref: utzoo comp.compression:26 alt.comp.compression:163 Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!rpi!sarah!bingnews!kym From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) Newsgroups: comp.compression,alt.comp.compression Subject: theoretical compression factor Summary: what is it? Keywords: random Message-ID: <1991Mar25.031214.25696@bingvaxu.cc.binghamton.edu> Date: 25 Mar 91 03:12:14 GMT Organization: State University of New York at Binghamton Lines: 93 I'm investigating a random number technique to compress text. It runs along the following lines. Some strings have `low complexity' in the sense of Martin-Lo{f. (The program that can type them out is much shorter than the actual string). For such strings it is obviously better to store or transmit the _program_ than the string. Some method needs to be found to enumerate the possible programs useful for this purpose. I am thinking of using strings generated by pseudo-random means (I understand in similar vein to LPC coding). The message is first analyzed to determine some basic statistics (e.g. frequencies of groups of bits of lengths 1 to N) and this is used to select a generator. The generator will generate strings with groups in the appropriate proportions -- this sequence is compared with the original message and only the `differences' saved. The idea is the generator will be able to create a string that is very similar to the original and thereby reduce the number of differences that need to be stored/sent. Beside a great deal of freedom with generating the random strings, another field of experimentation seems to be the method of _comparison_. Some kinds of `differences' will be more sensitive than others to certain types of noise. For example, if the `random string' is: 010110111011101111011111 and the original message was: 000010110111011110111110 there are 9 differences using `phase 0' but none using `phase 3'. What kind of compression factors might be expected? Let's take a simple case of bitstrings. Suppose we have a non-random string with p 1's and (1-p) 0's. If we generate a random string of bits with this same proportion of 1's and 0's in how many places will they differ? If we call the message A and the random string B we have: Prob(A=1 & B=0 | A=0 & B=1) = Prob(A=1)Prob(B=0) + Prob(A=0)Prob(B=1) Independence is presumed (which is not absolutely true since we analyze the message string to _set up_ the generator). We then find Prob(difference) = 2p(1-p) In a message of N bits we would then find 2p(1-p)N differences on average. We could adjust a phase parameter to further minimize this number of differences, but the O(N) dependency would remain (I _think_). To send all the `differences' we just encode the sequence of i_j where A_i_j != B_{i_j+k}. This can be done using `delta modulation' (i.e. the differences of the i_{j+1} and i_j are encoded). The average size of each difference will be O(log N/N) bits each. We therefore have a total compressed string of ~ 2p(1-p)N log N/N = 2p(1-p) log N (ignoring the parameters required to reconstitute the message; these would also need to be sent but are of O(log N) size anyway) and the compression ratio: |compressed string| --------------- |original string| ~ 2p(1-p) log N / N I presume the constant of proportionality will be fairly large since with the following strings I get only 80-90% compression factors (using the best of 100 random bit generators chosen at random :-): abcdefghijklmnopqrstuvwxyz 189 bits max compression=0.873016 this is a test message of a very short string 322 bits max compression=0.813665 These small examples are not very impressive, but the complexity of the method as N->inf is interesting (log N/N). Any comments/suggestions? (Please be nice if I've made another of my classic tremendous goofs :-). -kym