Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!ptimtc!nntp-server.caltech.edu!madler From: madler@nntp-server.caltech.edu (Mark Adler) Newsgroups: comp.compression Subject: Re: Modeling vs encoding (Re: Lempel-Ziv v/s huffman encoding) Message-ID: <1991Apr1.191751.4211@nntp-server.caltech.edu> Date: 1 Apr 91 19:17:51 GMT References: <1991Mar31.000653.4598@zorch.SF-Bay.ORG> <12546@pt.cs.cmu.edu> <1991Apr01.053616.3665@looking.on.ca> Organization: California Institute of Technology, Pasadena Lines: 52 In article <1991Apr01.053616.3665@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes: >In article <12546@pt.cs.cmu.edu> tgl@g.gp.cs.cmu.edu (Tom Lane) writes: >>A semiadaptive modeler prescans the file to be compressed, computes >>statistics *for that file*, and then encodes the file accordingly. This ... >>fly. This option is pretty much out of favor these days. > >Depends what you mean by "out of favour." One of the most commonly >used compressors in the world, PKZIP, uses exactly this method, and does >well by it. It is, in fact, the superior method for medium sized files. Actually PKZIP doesn't really do that (by the way, we're talking about the "implode" method, not "shrink", which is an LZW variant). What implode does is look at the first 3K of the file (or all of it if < 3K), counts how many bytes are in an odd-looking set of printable characters (CR, LF, space, some digits, some upper case letter, some lower case letters, a few punctuation--the letters look almost like the Scrabble letters of four or less points), and if the count is less than 5/9 of the total, then the file is "binary", otherwise it is "text". Then, implode uses a prebuilt set of trees depending on if it is binary or text. These trees were made from a statistical analysis of a very large set of sample files of each type. Actually, the text set of trees has two possibilities--if the file is less than 5632 bytes, there is no literal tree, and a 4K dictionary is used, otherwise there is a literal tree, and an 8K dictionary is used. Also, it doesn't do any of this, and uses shrink (LZW) instead if the file is 320 bytes or less. (Some of the above is from discussions with PKWare (PK and others), and some is from experimentation with PKZIP. I have not, nor would I wish to disassemble PKZIP.) So, the "statistical analysis" of the file really only outputs less than two bits about the file, upon which the tree selection is made. I believe that the implication about "methods out of favor" refers to making a complete first pass over the file to build a set of trees customized to that file. PKZIP doesn't do that. The funny thing is, PKZIP *sends* the trees with each compressed file, and PKUNZIP uses them (PKUNZIP does not have those trees pre-stored). Perhaps it was PK's intention to do make customized trees for each file, but found that it was too slow. By the way, the free zip being worked on *does* build trees for each file, and right now it's too slow. But a new string matching algorithm is being implemented, and we expect it to speed up considerably. Stay tuned ... Mark Adler madler@pooh.caltech.edu