Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!ptimtc!nntp-server.caltech.edu!madler
From: madler@nntp-server.caltech.edu (Mark Adler)
Newsgroups: comp.compression
Subject: Re: Modeling vs encoding (Re: Lempel-Ziv v/s huffman encoding)
Message-ID: <1991Apr1.191751.4211@nntp-server.caltech.edu>
Date: 1 Apr 91 19:17:51 GMT
References: <1991Mar31.000653.4598@zorch.SF-Bay.ORG> <12546@pt.cs.cmu.edu> <1991Apr01.053616.3665@looking.on.ca>
Organization: California Institute of Technology, Pasadena
Lines: 52

In article <1991Apr01.053616.3665@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
>In article <12546@pt.cs.cmu.edu> tgl@g.gp.cs.cmu.edu (Tom Lane) writes:
>>A semiadaptive modeler prescans the file to be compressed, computes
>>statistics *for that file*, and then encodes the file accordingly.  This
...
>>fly.  This option is pretty much out of favor these days.
>
>Depends what you mean by "out of favour."   One of the most commonly
>used compressors in the world, PKZIP, uses exactly this method, and does
>well by it.   It is, in fact, the superior method for medium sized files.

Actually PKZIP doesn't really do that (by the way, we're talking about
the "implode" method, not "shrink", which is an LZW variant).  What
implode does is look at the first 3K of the file (or all of it if < 3K),
counts how many bytes are in an odd-looking set of printable characters
(CR, LF, space, some digits, some upper case letter, some lower case
letters, a few punctuation--the letters look almost like the Scrabble
letters of four or less points), and if the count is less than 5/9 of
the total, then the file is "binary", otherwise it is "text".

Then, implode uses a prebuilt set of trees depending on if it is binary
or text.  These trees were made from a statistical analysis of a very
large set of sample files of each type.  Actually, the text set of
trees has two possibilities--if the file is less than 5632 bytes, there
is no literal tree, and a 4K dictionary is used, otherwise there is a
literal tree, and an 8K dictionary is used.

Also, it doesn't do any of this, and uses shrink (LZW) instead if the
file is 320 bytes or less.

(Some of the above is from discussions with PKWare (PK and others), and
some is from experimentation with PKZIP.  I have not, nor would I wish
to disassemble PKZIP.)

So, the "statistical analysis" of the file really only outputs less than
two bits about the file, upon which the tree selection is made.  I believe
that the implication about "methods out of favor" refers to making a
complete first pass over the file to build a set of trees customized to
that file.  PKZIP doesn't do that.

The funny thing is, PKZIP *sends* the trees with each compressed file,
and PKUNZIP uses them (PKUNZIP does not have those trees pre-stored).
Perhaps it was PK's intention to do make customized trees for each file,
but found that it was too slow.

By the way, the free zip being worked on *does* build trees for each
file, and right now it's too slow.  But a new string matching algorithm
is being implemented, and we expect it to speed up considerably.  Stay
tuned ...

Mark Adler
madler@pooh.caltech.edu