Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!magnus.acs.ohio-state.edu!csn!ub!uhura.cc.rochester.edu!rochester!pt.cs.cmu.edu!g.gp.cs.cmu.edu!tgl From: tgl@g.gp.cs.cmu.edu (Tom Lane) Newsgroups: comp.compression Subject: Modeling vs encoding (Re: Lempel-Ziv v/s huffman encoding) Message-ID: <12546@pt.cs.cmu.edu> Date: 1 Apr 91 02:19:07 GMT References: <46466@ut-emx.uucp> <1991Mar31.000653.4598@zorch.SF-Bay.ORG> Organization: Carnegie-Mellon University, CS/RI Lines: 133 In article <1991Mar31.000653.4598@zorch.SF-Bay.ORG>, xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes: > [a short summary of Huffman and arithmetic encoding] I second Kent's recommended references for these encoding methods. Another good introductory article is "Data Compression" in ACM Computing Surveys v19n3 (Sept 87). This covers both of these encoding methods and many more. There's an important common feature of Huffman and arithmetic coding that Kent didn't explain very clearly, so I will take a shot at it. Both of these methods (and many others) are based on the same fundamental idea: 1. Obtain some statistical information about the relative frequencies of the different symbols that may appear in the source message. 2. Encode the symbols in such a way that the common symbols are represented in fewer bits than average, while uncommon symbols need more bits than average. (Compare this with standard encodings like ASCII, wherein every symbol is represented with the same number of bits.) Note that a "symbol" may be a single 0/1 bit, a character, a word, or whatever unit is convenient. Since you can't get something for nothing, you must use longer-than-average encodings for some symbols in order to get shorter-than-average encodings for others. *If* your statistics are accurate, you can still end up with fewer total bits over a whole file. If they aren't, you come out behind. (This approach is completely different from Lempel-Ziv and related methods, which rely on encoding whole strings of symbols rather than individual symbols. Most of those methods don't gather statistics per se, they just assume that a string used before may be used again.) Strictly speaking, the terms "Huffman" and "arithmetic" only tell you how step 2 is done; to fully describe a compression method, you must also say how step 1 is done. In hi-faluting academic discussions, step 1 is called modeling and step 2 is called coding; so be careful to say "Huffman coding", not "Huffman compression", when you are talking to an expert. There are a bunch of different ways of doing the modeling. The simplest distinction is between fixed, semiadaptive, and fully adaptive models. With a fixed model, you gather your statistics but once, then encode whatever is thrown at you. Typically you gather statistics on a big sample of English text (or whatever you can find that seems representative) and then preset both compressor and decompressor with a corresponding Huffman (or other coding method) encoding table. Compression is fast and simple, and it works well as long as the data to be encoded matches the fixed statistics reasonably closely. If you are given a file in, say, Swahili, you can get unboundedly bad compression. A semiadaptive modeler prescans the file to be compressed, computes statistics *for that file*, and then encodes the file accordingly. This avoids the Swahili problem. With this method you must store/transmit the encoding table along with the data, so it loses for small files (the overhead of the table is too great). Also, in many applications having to scan the file twice is bad news; you can't compress a stream of data on the fly. This option is pretty much out of favor these days. A fully adaptive model starts from innocuous assumptions (say, all symbols equally probable). It then gathers statistics *as it scans the file*. The encoder updates its model after encoding each symbol; to stay in step, the decoder makes an identical update to its model after decoding each symbol. This arrangement can adapt to any file and has no overhead for transmitting the model. Characteristically, the first part of a file is encoded only moderately well, but compression efficiency improves as you get further along (because the statistics get more accurate). A really sophisticated adaptive model uses statistics from only the recently seen part of the file, so that it can adapt to local changes in probabilities in different parts of the file; neither fixed nor semiadaptive models can do this at all. The main drawback of the adaptive approach is that you need an encoding method that can work directly from the statistical counts; you can't afford an expensive statistics-to-encoding transformation. Even with well-tuned methods, these types of models tend to need more computation per symbol than fixed or semiadaptive models. Still with me? The other important classification of models is "order", which is how much history or state is used. A zeroth-order model just remembers the (independent) probability of each symbol, and codes accordingly. For N symbols the model can be described by N probabilities or counts. A first-order model keeps track of the probability of each symbol *given the previous symbol*. For example, while a 0-order model would always code "u" the same, a first-order model would discover that "u" is extremely probable after "q"; therefore, after coding a "q" it would use a very short encoding for "u", shorter than it would use for a "u" after some other letter. This sort of model can be thought of as having N states (one for each symbol); in each state a different set of symbol probabilities is maintained, and a different set of symbol encodings is used. You need to store N*N statistical counts to maintain a first-order model. Similarly, second-order models code each symbol using knowledge of the two preceding symbols, third-order models three preceding symbols, etc. With each order step, the amount of state needed to represent the model's statistics grows by a factor of N. So high-order models are unusable for semiadaptive modeling (you would never want to transmit that big a table), but they can be a win for adaptive modeling (where the model size only affects the amount of memory needed during compression/decompression). It is also possible to develop multi-state models in which the state is determined by something other than the immmediately preceding symbol(s). For instance, in the JPEG image compression standard, the statistical model uses separate counts and encodings for DC (zero-frequency) and AC (non-zero-frequency) values, and it further separates sign bits from magnitude bits in the values. So the JPEG model separates states according to the "class" of symbol rather than according to the immediately preceding symbol. The important feature of arithmetic coding is that the predicted frequencies can be reflected exactly in the total length of the output string, whereas Huffman coding must round off the frequencies because it encodes each symbol with an integral number of bits. For example, if a symbol has predicted frequency 0.4, Huffman coding must represent it with two bits, while arithmetic coding can actually achieve the one-and-a-fraction bits that are justified in an information-theoretic sense. This leads to the conclusion that arithmetic coding is, on the average, about one-half bit per symbol better than Huffman (given identical frequency statistics). In order to exploit this advantage, you need a model that can give adequately precise statistical predictions; so more complex models are justified with arithmetic coding than with Huffman coding. For example, after a "q" Huffman coding still needs at least one bit to represent "u", so there's no point in knowing that "u" has better than 50% probability in this case. But arithmetic coding can easily put out one-onehundredth of a bit if you tell it the true probability of that "u" is 99%. For further discussion of the notion of modeling for compression, see Bell, Witten & Cleary in ACM Computing Surveys v21n4 (Dec 89). -- tom lane Internet: tgl@cs.cmu.edu BITNET: tgl%cs.cmu.edu@cmuccvma