Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!elroy.jpl.nasa.gov!ames!vsi1!zorch!xanthian
From: xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan)
Newsgroups: comp.compression
Subject: Re: Atronomical data compression
Keywords: Spectra, Keck
Message-ID: <1991Mar26.033222.6693@zorch.SF-Bay.ORG>
Date: 26 Mar 91 03:32:22 GMT
References: <1991Mar23.013557.28151@nntp-server.caltech.edu>
Organization: SF-Bay Public-Access Unix
Lines: 96

sns@deimos.caltech.edu (Sam Southard Jr.) writes:

> I have a topic/question that should be suitable for this newsgroup. I
> am a software engineer working on the software for the Keck Telescope
> (the new 10m in Hawaii). One of the things I am doing is taking the
> data from the CCD to a VMEbus controller crate (Sun 1E running
> VxWorks) and from there to a Sun 4/470 over the ethernet.

> Each image can be up to 4096x4096 pixels (a 2x2 mosaic of 2048x2048
> CCDs), each pixel being 16 bits. Obviously, if this kind of data is
> going over the Ethernet, we want to compress it as much as possible.

> The compression I can do is restricted in a few ways, including:

> 1) It can not take much CPU, since some real-time control of the
> instrument is being done by the same VxWorks box. Much is a term that
> is yet to be defined.

> 2) Simple schemes like counting the number of times a pixel occurs in
> a row and comressing that to a single occurance and a count are out,
> because there is a random element to the data (cosmic rays, etc.),
> which means that it is unlikely that there will be many cases where
> this method would compress the data.

> I have thought of a few methods which I can try, but I am interested
> in looking at as many ways as possible. I have lots of actual data to
> try out various compression methods, and a reasonable amount of CPU to
> burn in choosing the algorithm (or two or three) which will actualyl
> be used.

> Does anyone have any suggestiongs?

Sure, else what's the point of responding.  ;-)

Data which is primarily of one (background) value, but contains enough
speckley noise to defeat run length or repeated string style (LZW, etc)
encodings, such as your CCD data, or other scanned real world images of
sparse data (engineering line drawings, e.g.) are ideal candidates for
arithmetic encoding.  Arithmetic encoding gains its efficiency by being
able to effectively encode a symbol set which is severely skewed to a
few of the possible values, (preferably one, where it is capable of
encoding the prevelant symbol in a small fraction of one bit), but may
in a large dataset contain at least one of _each_ symbol.  It is quite
immune to speckley noise, because it encodes symbol (here half-pixel) by
symbol rather than run by run, so a speckle just bumps the count for a
nearly empty bin by one for a while having little other effect on the
compression, rather than spoiling a run every time a speckle appears.

The down side is that (at least the implementation I know) is quite CPU
intensive. I understand there are much more effective methods, but I've
been quite impressed with the effectiveness of the simple routine in the
June, 1987 CACM on data of this type.  It might suit your needs for speed,
or be effective enough in compression you would consider dedicating a box
just to the compression task just to make use of this slower technique.

I have the code from the CACM article, with the modification of ANSI C
prototypes, which I've compiled under a C compiler with prototypes, and
under BSD Unix, but not both at once. I'd be glad to pass the code as is
to a few requests, but even happier to pass it on to someone who would
make sure it compiles under a modern Unix C compiler with the given
makefile, and then post it here or put it up for FTP access or both.

It has the typical BSD long file names, so it would be a little extra
work to unpack and use on a system with 14 character file name limits; a
bit of work by the recipient on the shar file with an editor, and then
on the #include statements and makefile when the stuff is unpacked, will
take care of that.

I also have a clever modification which replaces the "slide the highest
frequency bin to to the front of a linearly searched list for efficient
access" frequency count table of the CACM implementation with a heap
based table, which works but is not quite in such clean shape, if anyone
is interested. This is faster for a less skewed set of symbol
frequencies such as you might get by using it to replace the Huffman
coding part of lharc, or compressing common text, where the order (log
(symbol set size)) access to a heap bin beats the order(symbol set size)
access of the linearly searched table to find a bin, but loses in the
current case, where the expected table search would be very close to
order(1) in the highly skewed frequencies case.

Both sets of routines still have the general aspect of the CACM code,
which was written for tutorial rather than speed purposes (it does a
subroutine call per input or output bit, e.g.), so there is lots of room
to speed either routine up to make it less grim than my current
experience.

Again, I understand arithmetic coding compression has moved on long past
this code, but this is still a good place to start, because it has a
great set of documentation in the CACM article to assist understanding.

I haven't touched this stuff since June, from mental health problems, so
I'd be just as happy to pass it on to someone interested in pursuing the
matter further.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>