Newsgroups: comp.compression Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!sarah!bingnews!kym From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) Subject: Re: theoretical compression factor Message-ID: <1991Mar29.142549.15283@bingvaxu.cc.binghamton.edu> Organization: State University of New York at Binghamton References: <1991Mar27.120829.26094@bingvaxu.cc.binghamton.edu> <20135@alice.att.com> <1991Mar29.031127.9128@bingvaxu.cc.binghamton.edu> Date: Fri, 29 Mar 1991 14:25:49 GMT In article <20135@alice.att.com> jj@alice.UUCP (jj, like it or not) writes: >First, the question of "how much can I compress X" must >be qualified. If you have a historyless compression, >you can compress up to the entropy ( sum of -p log p) bound, >where p is the probability of each token. Here is some more data and another argument for your amusement. [If you're well up on info thy you may as well `n' here]. The p log p ``bound'' is actually a disguised expectation value; it is an _average_. It denotes the average amount of information per symbol in a string. For example, if the string consists of only the symbols 0 and 1 we write -(p lg p +(1-p) lg (1-p)) for the average amount of information (in bits) per 0/1 symbol. [lg(x) is log base 2 -- the size of the alphabet in this case]. If this average is 0.5, say, then we might expect that we could compress the string to 1/2 its size in some fashion without losing any ``information''since each 0/1 symbol in the string would only be representing 0.5 of a bit on average. (This measure does not take into account the _sequence_ of 0's and 1's in the string; we're actually treating the string as a multiset.) Since we're talking about an _average_ it is quite possible to have particular cases where the ``compression factor'' is less. Averages, after all, are statistical variates and have distributions of their own. So what is the distribution of ``p lg p'' compression factor? Rather hard to write in closed form as far as I know, so let's look at a table. The following shows for each ``compression factor'' the likelyhood that its value is less than `x'. x Pr(X99% of the time? Probabilities that expected compression factor -(p lg p + (1-p) lg (1-p)) is less than a given value `x' by selecting the minimum result produced by several statistically independent techniques. x Pr(min{X1,...X2}