Newsgroups: comp.compression Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!sarah!bingnews!kym From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) Subject: Re: theoretical compression factor Message-ID: <1991Mar29.031127.9128@bingvaxu.cc.binghamton.edu> Summary: maybe Organization: State University of New York at Binghamton References: <1991Mar26.220519.1242@unislc.uucp> <1991Mar27.120829.26094@bingvaxu.cc.binghamton.edu> <20135@alice.att.com> Date: Fri, 29 Mar 1991 03:11:27 GMT In article <20135@alice.att.com> jj@alice.UUCP (jj, like it or not) writes: >Please, folks, you're confusing several issues here. Who/what/where -- _ME_ -- confused??!! :-) >First, the question of "how much can I compress X" must >be qualified. If you have a historyless compression, >you can compress up to the entropy ( sum of -p log p) bound, >where p is the probability of each token. > >If you have history in your compression algorthm, you may (or >may not) be able to do better, depending on the randomness >of your token distribution (even given a fixed distribution of >tokens). Ok, I remember something of this ilk ``p log p'' stuff from Shannon and his buddies, but I'm not too sure whether you (or he, or whoever) is saying historyless compression can not, no way, ever be LESS than this. But it _seems_ like you are saying this is your understanding, doesn't it? If so, I have a question and another question. Consider the scheme I outlined earlier -- of comparing an input bitstream with a random bitstream and just storing the differences in delta form (e.g. fixed-length or Huffman coded). This doesn't _appear_ to use history -- the next bit output by an uncompressing program doesn't use the previous bits of the uncompressed string to obtain the next bit. The compressing program doesn't build up any dictionary. Or does it? (That's Q1). There _is_ however, a quibble about a few hidden bits -- e.g. the length of the random generator parameters and the compress/uncompress program code. The program, however, is finite. If we make the input string large enough the program length and any fiddling little parameters can be swamped by the length of the string itself. Now consider the following data obtained from an experimental compression program based on the idea above: ==================== h.dat nb=32400 <--- number of bits read n1=0.341605 <--- fraction of 1's -(p log p + q log q)=0.642094 <--- limit 1 -(p lg p + q lg q)=0.926346 <--- maybe this is log_2 instead of log_e? 2pq=0.449822 <--- my limit (sure looks low, doesn't it?) minlen=20625 <--- after some trials this is the MINIMUM LENGTH output string compression=0.636574 <--- less than - sum p ln p n.dat nb=8080 n1=0.336015 -(p log p + q log q)=0.638357 -(p lg p + q lg q)=0.920954 2pq=0.446218 minlen=5028 compression=0.622277 <--- less than - sum p ln p n2.dat nb=16160 n1=0.334653 -(p log p + q log q)=0.637425 -(p lg p + q lg q)=0.91961 2pq=0.445321 minlen=10042 compression=0.621411 <--- less than - sum p ln p n4.dat nb=32320 n1=0.334653 -(p log p + q log q)=0.637425 -(p lg p + q lg q)=0.91961 2pq=0.445321 minlen=20088 compression=0.621535 <--- less than - sum p ln p ==================== Here are 4 data files compressed to LESS THAN the limit quoted above. Certainly not _substantially less_, but less nevertheless. 8-) The files aren't big. But they do vary an order of magnitude in size or so. I therefore presume we _could_ arrange to have the data file size -> inf. In that case, even including the size of the compress/decompress program, we have case(s) where the ``p log p'' limit seems to be beaten. What gives? Perhaps there are additional O(1) factors involved in the ``p log p'' limit? (That's Q2). -kym BTW, the data files are hex and decimal digits with about 50% ascii 0's. The balance of 0 vs 1 *bits* is given in each case above.