Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!zaphod.mps.ohio-state.edu!mips!prls!pyramid!octopus!vsi1!zorch!xanthian From: xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) Newsgroups: alt.sources.d Subject: Re: Compressing alt.sex.pictures files Message-ID: <1990Aug30.100256.14356@zorch.SF-Bay.ORG> Date: 30 Aug 90 10:02:56 GMT References: <1990Aug29.000054.28444@murdoch.acc.Virginia.EDU> <1990Aug29.154719.17301@zorch.SF-Bay.ORG> <1990Aug29.211755.15899@murdoch.acc.Virginia.EDU> Organization: SF Bay Public-Access Unix Lines: 92 gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) writes: > xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes: >>Note that it is sufficient that the scheme found compress _realistic_image_ >>data well; it doesn't have to compress random input at all well, >*Realistic* astronomical images contain a lot of gaussian noise. How >general do you want your algorithm to be? Well, as long as the "noise" is bit-sparse per pixel, the compression algorithms, such as Huffman encoding and arithmetic compression, that specialize in coding frequent pixels in short bit strings will probably still beat the compression algorithms, such as LZ, that look for repeated strings of pixels, or run length encoding, that looks for long strings of identical pixels, both of which are massively sabotaged by noise. >As for other comments: I found in the "good old days" that ARC tended >to huffman-encode images instead of LZW, because it was smaller. That, from my remarks above, is the expected result. Byte compression oriented compressors _should_ do even better with the XOR step between, but testing is needed. >The "best" picture format on the ST is/was one called Tiny, which you >can ftp a description of from atari.archive.umich.edu in >~/atari/graphics/picfmts.doc. Not an ftp site here; can you give a five line desription of Tiny's algorithm? >However, that format would suck shit on an astronomical image. Adaptive >huffman encoding might be your best bet. In a comparision between arithmetic compression and Huffman compression, each has different inefficiencies that prevent "perfect" information theoretic minimal encodings: Huffman pays: 1) a one bit per encoded unit penalty to act as a "stop bit", to divide the encoding of one unit from the encoding of the next; 2) (for the adaptive method) a penalty in that the start-up unit frequency estimate is incorrect and all units are initially considered equally likely; 3) some penalty from never being able to allow the expected frequency of a unit to go to zero, so that an encoding for it must exist and thus expand the encoding space at the expense of the encodings of existant units which must avoid aliasing to that encoding; 4) a fractional bit penalty per encoded unit (byte or pixel) from encoding an experienced frequency in an integer number of bits, when a fractional number would more closely approximate the true, observed frequency. Arithmetic compression pays no "bit per unit" penalty, and encodes the observed frequency in fractions of a bit, 1) and 4) above, but it pays other penalties: 2) and 3) for about the same reasons as Huffman, plus: 5) table compression penalty: to stay in 32 bit arithmetic (I'm talking here about the arithmetic compression algorithm with which I'm intimately familiar, others may differ), the total observed frequency must fit in 14 bits, so the frequencies must be divided by two and floored at one every 2**13 bytes; 6) an explicit EOF unit 257th encoding (in the case of encoding 8 bit bytes) must be carried along and never used for the whole input data compression, causing a minimal 1/(2**13) loss in encoding efficiency, just so that it can be used at the end to close off the file; and perhaps others I don't recognize. The result of these various penalties is that over a large input file with a reasonably skewed unit distribution histogram, the arithmetic compression method is a winner over Huffman encoding by saving the stop bit and by encoding units in fractions of a bit, and about breaks even otherwise, since penalties 5) and 6) are quite small. Of course, given a file consisting of monotonous repetitions of strings of all the possible units in order, both of these methods, as well as run length encoding, will suck rocks, and LZ will be a run-away winner. For realistic image data, however, byte encodings should beat string encodings, and arithmetic encodings beat adaptive Huffman encodings. The proof is in the testing, of course. Kent, the man from xanth.