Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!daver!zorch!xanthian From: xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) Newsgroups: comp.compression Subject: Re: Is arithmetic compression a crock? Summary: Depends on what you're goals may be Message-ID: <1991Jun3.102855.14243@zorch.SF-Bay.ORG> Date: 3 Jun 91 10:28:55 GMT References: <1991May30.035552.14435@alembic.acs.com> Organization: SF-Bay Public-Access Unix Lines: 95 csu@alembic.acs.com (Dave Mack) writes: > About a month ago, Kent Dolan posted a sample arithmetic coding > compression package here. I finally got around to messing with > it. For reference: [Omitted] > So I modified the makefile to use gcc with -g and -O, then ran it > on a large text file (jargon2.8.2, actually.) The results were > pretty horrible: > # time encode test.A > 80.7 real 79.8 user 0.3 sys > # time decode untest > 101.1 real 98.7 user 0.5 sys > # cmp test untest > # ls -l test test.A > -rw-r--r-- 1 csu 890267 May 29 23:19 test > -rw-r--r-- 1 csu 522053 May 29 23:28 test.A > This was done on a *very* lightly loaded Sun-4/110. For comparison, > I ran an LZW compress over the same input: > # time compress test > 17.1 real 16.3 user 0.4 sys > # ls -l test.Z > -rw-r--r-- 1 csu 392609 May 29 23:19 test.Z > # time uncompress test.Z > 10.8 real 9.2 user 0.5 sys > Is this a really bad implementation, a really bad algorithm, or > is arithmetic compression basically useless for text compression? Your results are fairly consistent with mine, but don't despair. The algorithm _is_ about six times as slow as compress, and not that hot for normal text. However, knowing several things will make you willing to investigate further. 1) The code as written is an adaptive model, like adaptive Huffman coding. For text, a fixed model (or best, a two pass approach) would possibly do better. 2) The code as written was explicitly written for clarity of tutorial use in a CACM article, not for speed. In particular, it does such horrid things as make a subroutine call per _bit_ of compressed file to write or read the bits. A little cleanup of this kind of bogosity would do a _lot_ to speed up its six to one lossage to "compress". 3) Arithmetic compression is an improvement over Huffman coding, by not insisting on encoding characters on whole bit boundaries, and not requiring explicit stop bits, and being capable of encoding multiple characters per bit in extreme cases, so it is a possible win as the second phase coding in LHARC-like two phase compression schemes. 4) Arithmetic coding shines, not for text, where about 1/3rd of the possible symbols are fairly well represented, but in _severely_ skewed data sets, like monochrome line drawings, where almost all of of the data is one symbol from the symbol set. To see arithmetic codeing shine, make up a test data set consisting of 99% nulls, and a random replacement of a randomly chosen other 1% with any of the other 255 byte values. Compress that data set with compress and with arithmetic compression, and compare the results for the two, and be prepared to be impressed for megabyte sized data sets. The obvious applications are noisy image data, where the noise breaks up the strings that make compress effective, but still leave the bulk of the pixels as a background value. 5) If you absolutely _must_ get the smallest possible archive file, use lharc. As a second choice, feed the output of compress through the arithmetic encoder. Though what compress produces looks like pure noise, I've found encode will compress it about 4% further. It costs lots of time, of course, but if that is not your main criterion, then it is quite handy. 6) Much better, state sensitive arithmetic coding algorithms exist than this simple one character deep tutorial code. See Doug Jones' response for more on this. Kent, the man from xanth.