Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!crdgw1!uunet!looking!brad
From: brad@looking.on.ca (Brad Templeton)
Newsgroups: comp.compression
Subject: Re: Lempel-Ziv v/s huffman encoding
Message-ID: <1991Mar30.191331.12325@looking.on.ca>
Date: 30 Mar 91 19:13:31 GMT
References: <KENW.91Mar29153846@skyler.arc.ab.ca>
Distribution: comp
Organization: Looking Glass Software Ltd.
Lines: 36

The original L-Z algorithm was huffman encoded.

Welch, working at Sperry, modified the algorithm and *removed* the huffman
encoding.  The result was poorer compression.  But the result was also
fixed-sized tokens, and that meant you could implement the LZW algorithm in
hardware, and quickly.   This meant it was suitable for compressing as
you wrote out to disk.  Decompress was similarly quick.

Zip, for example, uses the original L-Z algorithm (sometimes called LZ1)
with variable length prefix codes.  (Huffman codes are the best possible
prefix codes, Zip uses Shannon-Fano codes which are almost as good but
slightly easier to calculate)

LZW is based on the 2nd L-Z algorithm.  It was designed for speed, not
the very best compression.  It is still a lot better than any first order
probabilistic algorithm like basic huffman coding.   But as you will note,
ZIP does a fair bit better than compress.

ZIP is far from perfect as well.  I have a compressor nearing beta test
that does quite a bit better that ZIP.

Note that while huffman makes the best prefix codes, prefix codes are
not the best encoding of uneven probability data streams.   If something
occurs 49% of the time, you will give it a 2 bit prefix code, when it
really only needs just slightly more than one bit.  BIG waste.  Arithmetic
coding (described in this month's Doctor Dobb's, and many other sources)
is able to use fractional bits, sort of, and it truly is optimal.  But
it takes more CPU to do, and a lot of storage if you want to store static
tables in the file.  It's almost always done in adaptive form.

The above example is the worst possible case.  Typical case is that arithmetic
coding gives you 1/2 bit per token improvement.   If your tokens are 10
bits long on average, as they might be in LZ, that's a 5% reduction in the
size of the file.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473