Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!ns-mx!pyrite.cs.uiowa.edu
From: jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879)
Newsgroups: comp.compression
Subject: Re: Is arithmetic compression a crock?
Message-ID: <6285@ns-mx.uiowa.edu>
Date: 31 May 91 14:12:43 GMT
References: <1991May30.035552.14435@alembic.acs.com>
Sender: news@ns-mx.uiowa.edu
Lines: 37

From article <1991May30.035552.14435@alembic.acs.com>,
by csu@alembic.acs.com (Dave Mack):
> 
> Is this a really bad implementation, a really bad algorithm, or
> is arithmetic compression basically useless for text compression?

Arithmetic coding has never been advertised as a fast method, and the
speed you see is typical of the results others have obtained.

As for compression performance, the code in the Witten Neal and Cleary
paper only uses a single-state source model.  To compress as well as
LZW compresses, you need multiple states in your source model.

Witten Neal and Cleary used a move-to-front cumulative frequency table,
occupying 257 16 bit words.  If you want a 64 state source model, you
will need 64 of these tables (for a total cost of 16448 words or
32896 bytes.  In my CACM paper on Splay Tree Based compression (vol 31
no 8), I showed that a 64 state source model (where the next state is
determined by the previous character mod 64) works pretty well for
splay trees, and it should work even better for arithmetic codes.

The 64 state source model I suggested uses much less memory than the
hash tables used by LZW, and it should compress significantly better.
My splay-tree based code was quite good with this 64 state source model,
frequently beating LZW, and it is a simple prefix code where each source
character codes to at least one bit.

In a multiple-state source model, many characters will be so common in
some states that they will code to less than a bit when arithmetic
codes are used.

Finally, note that a multiple state source model does not slow things
down significantly.  You pay about one array index operation per character
compressed to find the base of the sub-array to be used in compressing
that character.
					Doug Jones
					jones@cs.uiowa.edu