Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!daver!zorch!xanthian
From: xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan)
Newsgroups: comp.compression
Subject: Re: Is arithmetic compression a crock?
Summary: Depends on what you're goals may be
Message-ID: <1991Jun3.102855.14243@zorch.SF-Bay.ORG>
Date: 3 Jun 91 10:28:55 GMT
References: <1991May30.035552.14435@alembic.acs.com>
Organization: SF-Bay Public-Access Unix
Lines: 95


 csu@alembic.acs.com (Dave Mack) writes:

> About a month ago, Kent Dolan posted a sample arithmetic coding
> compression package here. I finally got around to messing with 
> it. For reference:

[Omitted]

> So I modified the makefile to use gcc with -g and -O, then ran it
> on a large text file (jargon2.8.2, actually.) The results were
> pretty horrible:

>   # time encode <test >test.A
>   80.7 real        79.8 user         0.3 sys  
>   # time decode <test.A >untest
>   101.1 real        98.7 user         0.5 sys  
>   # cmp test untest
>   # ls -l test test.A
>   -rw-r--r--  1 csu        890267 May 29 23:19 test
>   -rw-r--r--  1 csu        522053 May 29 23:28 test.A

> This was done on a *very* lightly loaded Sun-4/110. For comparison,
> I ran an LZW compress over the same input:

>   # time compress test
>   17.1 real        16.3 user         0.4 sys  
>   # ls -l test.Z
>   -rw-r--r--  1 csu        392609 May 29 23:19 test.Z
>   # time uncompress test.Z
>   10.8 real         9.2 user         0.5 sys  

> Is this a really bad implementation, a really bad algorithm, or
> is arithmetic compression basically useless for text compression?

Your results are fairly consistent with mine, but
don't despair. The algorithm _is_ about six times as
slow as compress, and not that hot for normal text.
However, knowing several things will make you
willing to investigate further.

1) The code as written is an adaptive model, like
adaptive Huffman coding. For text, a fixed model (or
best, a two pass approach) would possibly do better.

2) The code as written was explicitly written for
clarity of tutorial use in a CACM article, not for
speed. In particular, it does such horrid things as
make a subroutine call per _bit_ of compressed file
to write or read the bits. A little cleanup of this
kind of bogosity would do a _lot_ to speed up its
six to one lossage to "compress".

3) Arithmetic compression is an improvement over
Huffman coding, by not insisting on encoding
characters on whole bit boundaries, and not
requiring explicit stop bits, and being capable of
encoding multiple characters per bit in extreme
cases, so it is a possible win as the second phase
coding in LHARC-like two phase compression schemes.

4) Arithmetic coding shines, not for text, where
about 1/3rd of the possible symbols are fairly well
represented, but in _severely_ skewed data sets,
like monochrome line drawings, where almost all of
of the data is one symbol from the symbol set. To
see arithmetic codeing shine, make up a test data
set consisting of 99% nulls, and a random
replacement of a randomly chosen other 1% with any
of the other 255 byte values. Compress that data set
with compress and with arithmetic compression, and
compare the results for the two, and be prepared to
be impressed for megabyte sized data sets.

The obvious applications are noisy image data, where
the noise breaks up the strings that make compress
effective, but still leave the bulk of the pixels as
a background value.

5) If you absolutely _must_ get the smallest
possible archive file, use lharc. As a second
choice, feed the output of compress through the
arithmetic encoder. Though what compress produces
looks like pure noise, I've found encode will
compress it about 4% further. It costs lots of time,
of course, but if that is not your main criterion,
then it is quite handy.

6) Much better, state sensitive arithmetic coding
algorithms exist than this simple one character
deep tutorial code.  See Doug Jones' response for
more on this.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>