Xref: utzoo comp.compression:99 alt.comp.compression:185 Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!rpi!sarah!bingnews!kym From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) Newsgroups: comp.compression,alt.comp.compression Subject: Re: theoretical compression factor Message-ID: <1991Mar27.120829.26094@bingvaxu.cc.binghamton.edu> Date: 27 Mar 91 12:08:29 GMT References: <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu> <1991Mar26.220519.1242@unislc.uucp> Organization: State University of New York at Binghamton Lines: 226 In article <1991Mar26.220519.1242@unislc.uucp> ttobler@unislc.uucp (Trent Tobler) writes: >From article <1991Mar25.054838.15588@bingvaxu.cc.binghamton.edu>, by kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell): >> would therefore be >> 2p(1-p) >> where p is the proportion of 0's (or 1's -- it doesn't matter). >Glaring error #1: > if maximum compression of a binary string is 2p(1-p), what is maximum >compression of the binary data... > 01010101010101010101010101010101 [further explanation and some references] Thanks for your comments Trent. I must be doing _something_ right -- 1/2 the respondents say ``too low'' and the other half say ``too high''. :-) Perhaps I was a little obscure with the subject line or something, but I have received a lot of email regarding examples similar to the above. What I'll do here is (a) show you some _measurements_, and then (b) present a very informal argument. (I will attempt to show that EVEN THE 01 EXAMPLE ABOVE, can not generally be represented in much fewer than a given number of bits, despite claims that it comes `for free'). PART 1 -- some numbers. ====================== After scrounging around all over my disks I came up with about 100 files with an equal (i.e. balance between 0.45 and 0.55) number of 0's and 1's. I then ran `compress' over them and determined the average factor of sizes after/before. It worked out about 67% over all attempts; about 56% percent in those cases where it was actually successful. A log file is appended, for those that may like to analyse it (probably there's plenty it can tell you about my disks, also plenty it might tell you about how `random' average text files of various programming languages may behave). PART 2 -- an argument. ===================== Let's take the 01 example above. (I have also received much email of similar ilk). It is claimed that this sequence can be represented in a small number of bits. But this can not _necessarily_ be done. Let me ask the question -- how _long_ is this sequence of 01's? If, let's say, the sequence of 01's is 2^1000000 bits long we must encode the _length_ as part of the compressed code. Representing this length does not come for free. We might for example compute the length as a binary number and then use some compression technique to compress _that_ into a compressed format. Eventually this kind of `iterated compression' comes to a fixpoint. Let's just say, without proof, that to encode the 01 sequence repeated N times takes O(log N) bits. I understand this point is covered in the Shannon paper that Trent indicated. This is obviously much less than 1/2 as N increases, but it _is_ only a single example. With the technique above we can _only_ represent 01 sequences. To represent other simple cycles will require more bits to `name' the particular cycle (possibly in a `deeper' compressed format). As Dan Bernstein pointed out in a previous article, and is discussed briefly in Knuth v2, there are a very large number :-) of strings that CAN NOT be represented by ANY program shorter than themselves. I think there are sufficient of these degenerate examples to bring any average figure well up past the (only APPARENT, from the 01 example above) log N/N limit. CONCLUSION ========== My little formula was not intended to EQUAL the compression in every case, for this is obviously not possible (it can not be a function of just p for one thing). Nor was it intended as an AVERAGE over EVERY case. It was intended as a ballpark figure after making certain assumptions about the specific underlying compression technique. (I am rather pleased, however, with its predictive ability in the p=1/2 case as illustrated for LZ compression, below). Cheers, -kym === BTW, some people around SUNY-B claim I do such measurements FIRST, then come up with a theory and subsequently produce them at an opportune moment -- this is usualy not the case, however. I almost NEVER have a theory :-). =================LOG FILE==================== Filename after size/before size using Sun LZ `compress' :mkshortnames 0.524213 :techinstall 0.751825 ApolloGraphicsMakefile 0.346165 CONTENTS 0.629024 Comp.lang.prolog 0.567824 DPram2.1 1 DPram2.1i 1 DProm.1 1 DProm.1i 1 INDEX 0.771619 MKNEW 0.646409 MKOLD 0.644444 Makefile.vax 0.399232 PORT.LOG 0.63328 READ_ME 0.633803 RR-INRIA-1320.dvi 0.549019 ReadMe 0.621053 abc10201.zip 1 apollo.c 0.610215 cal_findresist.1 1 cal_makefile.1 1 change.l 0.674312 chconfig.1 0.690476 chconfig.1i 0.690476 cman 0.911392 comp.sources.index 0.527516 deutsch.net 0.210469 displays.proto 0.836957 dpram2.1 1 dpram2.1i 1 dprom.1 1 dprom.1i 1 drc.1 1 drc.1i 1 esim.1 0.511555 espresso.1 0.571458 esptype.h 0.357095 ex.rc 1 exrc 1 ext.h 0.403077 extio.h 1 extra+.c 1 filehead.h 0.603837 gemini.1 0.516526 gen_control.1 0.550769 gen_control.1i 0.550769 geometry.3 0.495135 grApollo1.old 0.44325 grApollo4.c 0.522812 grApolloInt.h 0.622093 grOmegaInt.h 0.694301 grSunInt.h.old 0.595443 gram 0.627575 if.how 1 install 1 laygen.1 1 les.txt 0.541202 m.bat 1 m2cmp20.zip 1 m2doc20.zip 1 m2exa20.zip 1 m2lib20.zip 1 m2utl20.zip 1 magic.1 0.417558 magicutils.3 0.609205 makewhatis.csh 0.717697 mkdisttape 0.599204 mod2txt.arc 1 nc.1 0.589175 ndep.down.out 0.909836 ndep.up.out 1 nenh.down.out 0.396806 nenh.up.out 0.422425 nenhp.down.out 0.4109 netgen.1 1 netlist.1 0.598494 netlist.1i 0.598494 nload.up.out 1 nsuper.down.out 0.900826 nsuper.up.out 1 ohmics.1 0.532989 om-esubs.c 0.453863 om-subs.h 0.497336 out_struct.h 0.641509 panda.h 0.482293 paslib.i 0.428077 paslib2.i 0.42933 paslib3.i 0.427255 paslib4.i 0.428455 pattern.c 0.588988 plowDebugInt.h 0.691729 presim.1 0.636028 presim.1i 0.636028 procs.h 0.349669 prolog.lib 0.540272 rofix 1 route1.net 0.742857 route2.net 0.742857 rsleeper 0.893617 rsleeper.csh 0.893617 runstats.3 0.602226 s.Makefile 0.425905 scanner.h 0.714286 select.h 0.607029 showcdrcs.1 1 showcdrcs.1i 1 showsdrcs.1 1 showsdrcs.1i 1 signals.h 0.754472 simsort.1 1 spice_cor.1 0.544245 statlib.index 0.619519 tags 0.433903 test4.m 0.402351 test6.m 0.407745 tlogic.c 0.520635 ttable.c 0.164343 tut7b.net 0.625731 tut7d.net 0.625731 util.c 0.620143 valtbs.1 1 vaxu_unsw_gcc.bench 0.949367 win.c 0.584328 win.c- 0.572215 win.unx 0.584328 wireUndo.c 0.48959 geo avg of all=0.665099 geo avg of `successful' cases= 0.557706 ====================END END END====================