Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!zaphod.mps.ohio-state.edu!mips!prls!pyramid!octopus!vsi1!zorch!xanthian
From: xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan)
Newsgroups: alt.sources.d
Subject: Re: Compressing alt.sex.pictures files
Message-ID: <1990Aug30.100256.14356@zorch.SF-Bay.ORG>
Date: 30 Aug 90 10:02:56 GMT
References: <1990Aug29.000054.28444@murdoch.acc.Virginia.EDU> <1990Aug29.154719.17301@zorch.SF-Bay.ORG> <1990Aug29.211755.15899@murdoch.acc.Virginia.EDU>
Organization: SF Bay Public-Access Unix
Lines: 92

gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) writes:
> xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:

>>Note that it is sufficient that the scheme found compress _realistic_image_
>>data well; it doesn't have to compress random input at all well,

>*Realistic* astronomical images contain a lot of gaussian noise. How
>general do you want your algorithm to be?

Well, as long as the "noise" is bit-sparse per pixel, the compression
algorithms, such as Huffman encoding and arithmetic compression, that
specialize in coding frequent pixels in short bit strings will probably
still beat the compression algorithms, such as LZ, that look for repeated
strings of pixels, or run length encoding, that looks for long strings of
identical pixels, both of which are massively sabotaged by noise.

>As for other comments: I found in the "good old days" that ARC tended
>to huffman-encode images instead of LZW, because it was smaller.

That, from my remarks above, is the expected result.  Byte compression
oriented compressors _should_ do even better with the XOR step between,
but testing is needed.

>The "best" picture format on the ST is/was one called Tiny, which you
>can ftp a description of from atari.archive.umich.edu in
>~/atari/graphics/picfmts.doc.

Not an ftp site here; can you give a five line desription of Tiny's
algorithm?

>However, that format would suck shit on an astronomical image. Adaptive
>huffman encoding might be your best bet.

In a comparision between arithmetic compression and Huffman compression,
each has different inefficiencies that prevent "perfect" information
theoretic minimal encodings:

Huffman pays:

 1) a one bit per encoded unit penalty to act as a "stop bit", to
    divide the encoding of one unit from the encoding of the next;

 2) (for the adaptive method) a penalty in that the start-up unit
    frequency estimate is incorrect and all units are initially
    considered equally likely;

 3) some penalty from never being able to allow the expected
    frequency of a unit to go to zero, so that an encoding for it
    must exist and thus expand the encoding space at the expense
    of the encodings of existant units which must avoid aliasing
    to that encoding;

 4) a fractional bit penalty per encoded unit (byte or pixel) from
    encoding an experienced frequency in an integer number of bits,
    when a fractional number would more closely approximate the
    true, observed frequency.

Arithmetic compression pays no "bit per unit" penalty, and encodes
the observed frequency in fractions of a bit, 1) and 4) above, but
it pays other penalties: 2) and 3) for about the same reasons as
Huffman, plus:

 5) table compression penalty: to stay in 32 bit arithmetic (I'm
    talking here about the arithmetic compression algorithm with
    which I'm intimately familiar, others may differ), the total
    observed frequency must fit in 14 bits, so the frequencies
    must be divided by two and floored at one every 2**13 bytes;

 6) an explicit EOF unit 257th encoding (in the case of encoding
    8 bit bytes) must be carried along and never used for the whole
    input data compression, causing a minimal 1/(2**13) loss in
    encoding efficiency, just so that it can be used at the end to
    close off the file;

and perhaps others I don't recognize.

The result of these various penalties is that over a large input
file with a reasonably skewed unit distribution histogram, the
arithmetic compression method is a winner over Huffman encoding
by saving the stop bit and by encoding units in fractions of a bit,
and about breaks even otherwise, since penalties 5) and 6) are
quite small.

Of course, given a file consisting of monotonous repetitions of
strings of all the possible units in order, both of these methods,
as well as run length encoding, will suck rocks, and LZ will be
a run-away winner.  For realistic image data, however, byte encodings
should beat string encodings, and arithmetic encodings beat adaptive
Huffman encodings.  The proof is in the testing, of course.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>