Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!dptcdc!tmsoft!masnet!canremote!jack.bzoza From: jack.bzoza@canremote.uucp (JACK BZOZA) Newsgroups: comp.sys.ibm.pc Subject: Re: Squeeze compression utili Message-ID: <89080106504918@masnet.uucp> Date: 31 Jul 89 05:30:00 GMT Organization: Canada Remote Systems Limited, Mississauga, ON, Canada Lines: 89 Cont'd from last message: SQ is a little more complicated than what I have outlined since it must operate in the real world of hardware, but this is a fairly complete description of the algorithm. USQ The USQ program is very straightforward. It reads in the encoding information written out by SQ and builds the identical binary tree that SQ used to encode the file. USQ then reads the input file as if it were a string of bits. Starting at the root of the tree, it traverses one branch of the tree with each input bit. If it has reached a leaf, it has a character and that character is written to the output file. USQ then starts at the root again with the next bit from the input file. What does it all mean? Now that we understand the algorithm and a little about how the SQ and USQ programs work, we can use that knowledge to help us run our systems a little more efficiently. (So all of this theory is worth something after all!). 1. Files must be above a threshold size, or else the output file will be longer than the input file because of the decoding information put at the beginning of the compressed data. We don't know the exact size of the threshold because the encoding binary tree information depends on the distribution of the characters in a file. At least we know to check the size of the encoded file after we run SQ to make sure our file didn't grow. 2. Some files will not compress well if they have a uniform distribution of byte values, for example, .COM or .EXE files. This is because of the way SQ builds the tree. Remember that bytes with the same frequency of occurrence are at the same depth (usually) in the tree. So if all of the bytes have the same depth, the output strings are all the same length. 3. SQ reads the input file twice. If you can, use RAM disk at least for the input file and for both files if you have the room. The next best case is to use two floppy drives, one for input and one for output. This will cause a lot of disk starts and stops but not much head movement. Worst case is to use one floppy drive for both input and output. This will cause a lot of head movement as the programs alternate between the input and output files. Other Compression Techniques RUN-LENGTH ENCODING Run-length encoding is a technique whereby sequences of identical bytes are replaced by the repeated byte and a byte count. As you might guess, this method is effective only on very specialized files. One good candidate is a screen display buffer. A screen is made up mostly of "spaces". A completely blank line could be reduced from 80 bytes of spaces to one space followed by a value of 80. To go from 80 bytes down to two bytes is a savings of almost 98%. You might guess that for text files or binary files, this technique does not work well at all. ADAPTIVE COMPRESSION This technique replaces strings of characters of code. For example, the string "ABRACADABRA" would be replaced by a code. Typical algorithms use a 12 bit code. The algorithm is unique in that it only requires a single pass through the input file as the encoding is taking place. The current incarnation of this procedure is called the LZW method (after co-inventors; A. Lempel, J. Ziv and T. Welch). This algorithm claims a savings of 66% on machine language files and up to 83% on COBOL files. Other Reading If you are interested in reading more about data compression techniques, you may be interested in these articles: H.K. Reghbati, "An Overview of Data Compression Techniques," Computer Magazine, Vol. 14, No. 4, April 1981, pp. 71-76. T.A. Welch, "A Technique for High-Performance Data Compression", Computer Magazine, Vol 17, No. 6, June 1984, pp. 8-19. J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, Vol. It-23, No. 3, May 1977, pp. 337-343. END OF MESSAGE --- * QDeLuxe 1.10 #168 Be hatting with you....