Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!magnus.acs.ohio-state.edu!csn!ub!uhura.cc.rochester.edu!rochester!pt.cs.cmu.edu!g.gp.cs.cmu.edu!tgl
From: tgl@g.gp.cs.cmu.edu (Tom Lane)
Newsgroups: comp.compression
Subject: Modeling vs encoding (Re: Lempel-Ziv v/s huffman encoding)
Message-ID: <12546@pt.cs.cmu.edu>
Date: 1 Apr 91 02:19:07 GMT
References: <KENW.91Mar29153846@skyler.arc.ab.ca> <46466@ut-emx.uucp> <1991Mar31.000653.4598@zorch.SF-Bay.ORG>
Organization: Carnegie-Mellon University, CS/RI
Lines: 133

In article <1991Mar31.000653.4598@zorch.SF-Bay.ORG>, xanthian@zorch.SF-Bay.ORG
(Kent Paul Dolan) writes:
> [a short summary of Huffman and arithmetic encoding]

I second Kent's recommended references for these encoding methods.
Another good introductory article is "Data Compression" in ACM Computing
Surveys v19n3 (Sept 87).  This covers both of these encoding methods
and many more.

There's an important common feature of Huffman and arithmetic coding that
Kent didn't explain very clearly, so I will take a shot at it.  Both of
these methods (and many others) are based on the same fundamental idea:
	1. Obtain some statistical information about the relative
	   frequencies of the different symbols that may appear in the
	   source message.
	2. Encode the symbols in such a way that the common symbols are
	   represented in fewer bits than average, while uncommon symbols
	   need more bits than average.
(Compare this with standard encodings like ASCII, wherein every symbol is
represented with the same number of bits.)  Note that a "symbol" may be a
single 0/1 bit, a character, a word, or whatever unit is convenient.  Since
you can't get something for nothing, you must use longer-than-average
encodings for some symbols in order to get shorter-than-average encodings
for others.  *If* your statistics are accurate, you can still end up with
fewer total bits over a whole file.  If they aren't, you come out behind.

(This approach is completely different from Lempel-Ziv and related methods,
which rely on encoding whole strings of symbols rather than individual
symbols.  Most of those methods don't gather statistics per se, they just
assume that a string used before may be used again.)

Strictly speaking, the terms "Huffman" and "arithmetic" only tell you how
step 2 is done; to fully describe a compression method, you must also say
how step 1 is done.  In hi-faluting academic discussions, step 1 is called
modeling and step 2 is called coding; so be careful to say "Huffman coding",
not "Huffman compression", when you are talking to an expert.

There are a bunch of different ways of doing the modeling.  The simplest
distinction is between fixed, semiadaptive, and fully adaptive models.

With a fixed model, you gather your statistics but once, then encode
whatever is thrown at you.  Typically you gather statistics on a big sample
of English text (or whatever you can find that seems representative) and
then preset both compressor and decompressor with a corresponding Huffman
(or other coding method) encoding table.  Compression is fast and simple,
and it works well as long as the data to be encoded matches the fixed
statistics reasonably closely.  If you are given a file in, say, Swahili,
you can get unboundedly bad compression.

A semiadaptive modeler prescans the file to be compressed, computes
statistics *for that file*, and then encodes the file accordingly.  This
avoids the Swahili problem.  With this method you must store/transmit the
encoding table along with the data, so it loses for small files (the
overhead of the table is too great).  Also, in many applications having to
scan the file twice is bad news; you can't compress a stream of data on the
fly.  This option is pretty much out of favor these days.

A fully adaptive model starts from innocuous assumptions (say, all symbols
equally probable).  It then gathers statistics *as it scans the file*.  The
encoder updates its model after encoding each symbol; to stay in step, the
decoder makes an identical update to its model after decoding each symbol.
This arrangement can adapt to any file and has no overhead for transmitting
the model.  Characteristically, the first part of a file is encoded only
moderately well, but compression efficiency improves as you get further
along (because the statistics get more accurate).  A really sophisticated
adaptive model uses statistics from only the recently seen part of the file,
so that it can adapt to local changes in probabilities in different parts of
the file; neither fixed nor semiadaptive models can do this at all.  The
main drawback of the adaptive approach is that you need an encoding method
that can work directly from the statistical counts; you can't afford an
expensive statistics-to-encoding transformation.  Even with well-tuned
methods, these types of models tend to need more computation per symbol than
fixed or semiadaptive models.

Still with me?  The other important classification of models is "order",
which is how much history or state is used.

A zeroth-order model just remembers the (independent) probability of each
symbol, and codes accordingly.  For N symbols the model can be described
by N probabilities or counts.

A first-order model keeps track of the probability of each symbol *given the
previous symbol*.  For example, while a 0-order model would always code "u"
the same, a first-order model would discover that "u" is extremely probable
after "q"; therefore, after coding a "q" it would use a very short encoding
for "u", shorter than it would use for a "u" after some other letter.  This
sort of model can be thought of as having N states (one for each symbol); in
each state a different set of symbol probabilities is maintained, and a
different set of symbol encodings is used.  You need to store N*N
statistical counts to maintain a first-order model.

Similarly, second-order models code each symbol using knowledge of the
two preceding symbols, third-order models three preceding symbols, etc.
With each order step, the amount of state needed to represent the model's
statistics grows by a factor of N.  So high-order models are unusable for
semiadaptive modeling (you would never want to transmit that big a table),
but they can be a win for adaptive modeling (where the model size only
affects the amount of memory needed during compression/decompression).

It is also possible to develop multi-state models in which the state is
determined by something other than the immmediately preceding symbol(s).
For instance, in the JPEG image compression standard, the statistical model
uses separate counts and encodings for DC (zero-frequency) and AC
(non-zero-frequency) values, and it further separates sign bits from
magnitude bits in the values.  So the JPEG model separates states according
to the "class" of symbol rather than according to the immediately preceding
symbol.


The important feature of arithmetic coding is that the predicted frequencies
can be reflected exactly in the total length of the output string, whereas
Huffman coding must round off the frequencies because it encodes each symbol
with an integral number of bits.  For example, if a symbol has predicted
frequency 0.4, Huffman coding must represent it with two bits, while
arithmetic coding can actually achieve the one-and-a-fraction bits that are
justified in an information-theoretic sense.  This leads to the conclusion
that arithmetic coding is, on the average, about one-half bit per symbol
better than Huffman (given identical frequency statistics).  In order to
exploit this advantage, you need a model that can give adequately precise
statistical predictions; so more complex models are justified with
arithmetic coding than with Huffman coding.  For example, after a "q"
Huffman coding still needs at least one bit to represent "u", so there's no
point in knowing that "u" has better than 50% probability in this case.
But arithmetic coding can easily put out one-onehundredth of a bit if you
tell it the true probability of that "u" is 99%.


For further discussion of the notion of modeling for compression, see
Bell, Witten & Cleary in ACM Computing Surveys v21n4 (Dec 89).

-- 
			tom lane
Internet: tgl@cs.cmu.edu	BITNET: tgl%cs.cmu.edu@cmuccvma