Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!elroy.jpl.nasa.gov!ames!vsi1!zorch!xanthian From: xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) Newsgroups: comp.compression Subject: Re: Atronomical data compression Keywords: Spectra, Keck Message-ID: <1991Mar26.033222.6693@zorch.SF-Bay.ORG> Date: 26 Mar 91 03:32:22 GMT References: <1991Mar23.013557.28151@nntp-server.caltech.edu> Organization: SF-Bay Public-Access Unix Lines: 96 sns@deimos.caltech.edu (Sam Southard Jr.) writes: > I have a topic/question that should be suitable for this newsgroup. I > am a software engineer working on the software for the Keck Telescope > (the new 10m in Hawaii). One of the things I am doing is taking the > data from the CCD to a VMEbus controller crate (Sun 1E running > VxWorks) and from there to a Sun 4/470 over the ethernet. > Each image can be up to 4096x4096 pixels (a 2x2 mosaic of 2048x2048 > CCDs), each pixel being 16 bits. Obviously, if this kind of data is > going over the Ethernet, we want to compress it as much as possible. > The compression I can do is restricted in a few ways, including: > 1) It can not take much CPU, since some real-time control of the > instrument is being done by the same VxWorks box. Much is a term that > is yet to be defined. > 2) Simple schemes like counting the number of times a pixel occurs in > a row and comressing that to a single occurance and a count are out, > because there is a random element to the data (cosmic rays, etc.), > which means that it is unlikely that there will be many cases where > this method would compress the data. > I have thought of a few methods which I can try, but I am interested > in looking at as many ways as possible. I have lots of actual data to > try out various compression methods, and a reasonable amount of CPU to > burn in choosing the algorithm (or two or three) which will actualyl > be used. > Does anyone have any suggestiongs? Sure, else what's the point of responding. ;-) Data which is primarily of one (background) value, but contains enough speckley noise to defeat run length or repeated string style (LZW, etc) encodings, such as your CCD data, or other scanned real world images of sparse data (engineering line drawings, e.g.) are ideal candidates for arithmetic encoding. Arithmetic encoding gains its efficiency by being able to effectively encode a symbol set which is severely skewed to a few of the possible values, (preferably one, where it is capable of encoding the prevelant symbol in a small fraction of one bit), but may in a large dataset contain at least one of _each_ symbol. It is quite immune to speckley noise, because it encodes symbol (here half-pixel) by symbol rather than run by run, so a speckle just bumps the count for a nearly empty bin by one for a while having little other effect on the compression, rather than spoiling a run every time a speckle appears. The down side is that (at least the implementation I know) is quite CPU intensive. I understand there are much more effective methods, but I've been quite impressed with the effectiveness of the simple routine in the June, 1987 CACM on data of this type. It might suit your needs for speed, or be effective enough in compression you would consider dedicating a box just to the compression task just to make use of this slower technique. I have the code from the CACM article, with the modification of ANSI C prototypes, which I've compiled under a C compiler with prototypes, and under BSD Unix, but not both at once. I'd be glad to pass the code as is to a few requests, but even happier to pass it on to someone who would make sure it compiles under a modern Unix C compiler with the given makefile, and then post it here or put it up for FTP access or both. It has the typical BSD long file names, so it would be a little extra work to unpack and use on a system with 14 character file name limits; a bit of work by the recipient on the shar file with an editor, and then on the #include statements and makefile when the stuff is unpacked, will take care of that. I also have a clever modification which replaces the "slide the highest frequency bin to to the front of a linearly searched list for efficient access" frequency count table of the CACM implementation with a heap based table, which works but is not quite in such clean shape, if anyone is interested. This is faster for a less skewed set of symbol frequencies such as you might get by using it to replace the Huffman coding part of lharc, or compressing common text, where the order (log (symbol set size)) access to a heap bin beats the order(symbol set size) access of the linearly searched table to find a bin, but loses in the current case, where the expected table search would be very close to order(1) in the highly skewed frequencies case. Both sets of routines still have the general aspect of the CACM code, which was written for tutorial rather than speed purposes (it does a subroutine call per input or output bit, e.g.), so there is lots of room to speed either routine up to make it less grim than my current experience. Again, I understand arithmetic coding compression has moved on long past this code, but this is still a good place to start, because it has a great set of documentation in the CACM article to assist understanding. I haven't touched this stuff since June, from mental health problems, so I'd be just as happy to pass it on to someone interested in pursuing the matter further. Kent, the man from xanth.