Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!crdgw1!uunet!looking!brad From: brad@looking.on.ca (Brad Templeton) Newsgroups: comp.compression Subject: Re: Lempel-Ziv v/s huffman encoding Message-ID: <1991Mar30.191331.12325@looking.on.ca> Date: 30 Mar 91 19:13:31 GMT References: Distribution: comp Organization: Looking Glass Software Ltd. Lines: 36 The original L-Z algorithm was huffman encoded. Welch, working at Sperry, modified the algorithm and *removed* the huffman encoding. The result was poorer compression. But the result was also fixed-sized tokens, and that meant you could implement the LZW algorithm in hardware, and quickly. This meant it was suitable for compressing as you wrote out to disk. Decompress was similarly quick. Zip, for example, uses the original L-Z algorithm (sometimes called LZ1) with variable length prefix codes. (Huffman codes are the best possible prefix codes, Zip uses Shannon-Fano codes which are almost as good but slightly easier to calculate) LZW is based on the 2nd L-Z algorithm. It was designed for speed, not the very best compression. It is still a lot better than any first order probabilistic algorithm like basic huffman coding. But as you will note, ZIP does a fair bit better than compress. ZIP is far from perfect as well. I have a compressor nearing beta test that does quite a bit better that ZIP. Note that while huffman makes the best prefix codes, prefix codes are not the best encoding of uneven probability data streams. If something occurs 49% of the time, you will give it a 2 bit prefix code, when it really only needs just slightly more than one bit. BIG waste. Arithmetic coding (described in this month's Doctor Dobb's, and many other sources) is able to use fractional bits, sort of, and it truly is optimal. But it takes more CPU to do, and a lot of storage if you want to store static tables in the file. It's almost always done in adaptive form. The above example is the worst possible case. Typical case is that arithmetic coding gives you 1/2 bit per token improvement. If your tokens are 10 bits long on average, as they might be in LZ, that's a 5% reduction in the size of the file. -- Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473