Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!dptcdc!tmsoft!masnet!canremote!jack.bzoza
From: jack.bzoza@canremote.uucp (JACK BZOZA)
Newsgroups: comp.sys.ibm.pc
Subject: Re: Squeeze compression utili
Message-ID: <89080106504765@masnet.uucp>
Date: 31 Jul 89 05:30:00 GMT
Organization: Canada Remote Systems Limited, Mississauga, ON, Canada
Lines: 98

je>A friend of mine has recently come across some compressed
je>files that he says has been packed using a utility called
je>SQUEEZE. Apparently, it leaves an extension with a 'Q' in it.
je>I am not CERTAIN that isn't just the .hqx from a binhex
je>compression, but he seems to be quite certain about squeeze.

Here is a summary of SQueeze and USQueeze which I found on a local BBS
and am posting for all interested parties.

Data Compression Techniques Using SQ and USQ

Mark DiVecchio San Diego IBM PC Users Group

In any computer system, efficient management of the file storage space is
important.  The two programs SQ and USQ reduce the size of data files by using
the Huffman Data Compression Algorithm.

A file is composed of a sequence of bytes.  A byte is composed of 8 bits.  That
byte can have a decimal value between 0 and 255.  A typical text file, like a C
language program, is composed of the ASCII character set (decimal 0 to 127).
This character set only requires seven of the eight bits in a byte.  Just think
-- if there were a way to use that extra bit for another character, you could
reduce the size of the file by 12.5%.  For those of you who use upper case
characters only, you use only about 100 of the ASCII codes from 0 to 255.  You
could save 60% of your storage if the bits not needed within each byte could be
used for other characters.  What if you could encode the most frequently used
characters in the file with only one bit?  For example, if you could store the
letter "e" (the most commonly used letter in the English language) using only
one bit :  "1", you could save 87.5% of the space that the "e" would normally
take if it were stored as the ASCII "0110 0101" binary.

The answer to these questions is the SQ and USQ programs.

The SQ program uses the Huffman coding technique to search for the frequency of
use of each of the 256 possible byte patterns, and it then assigns a
translation for each character to a bit string.  All of these bit strings are
placed end to end and written onto a disk file.  The encoding information is
also put on the file since the USQ program needs to know the character
distribution of the original file.

The USQ program reads in the encoding information and then reads in the encoded
file.  It is a simple matter to scan the encoded file and produce an output
file which is identical to the file that SQ started with.

         Huffman Coding Technique

This is by far the most popular encoding technique in use today.  The Huffman
encoding replaces fixed bit characters with variable length bit strings.  The
length of the bit string is roughly inversely proportional to the frequency of
occurrence of the character.  For those of you inclined to such symbolism:

Length of bit  ~= log  (character
     string          2  probability)

The implementation of the algorithm which we will discuss encodes fixed bit
strings of length 8.

This algorithm requires two passes through the input file.  The first pass
builds a table of 256 entries showing the frequency of each occurrence of each
of the 256 possible values for a byte of information.

Once the counting is complete, the algorithm must decide which bit strings to
associate with each of the 256 characters that were found in the file.  Note
that if a particular byte value was never used, no string association is
needed.

The second pass through the input file converts each byte into its encoded
string.  Remember that when the output file is created, the information used
for encoding must also be written on the file for use by the decoding program.

The decoding program reads the encoding information from the file and then
starts reading the bit strings.  As soon as enough bits are read to interpret a
character, that character is written onto the final output file.  See the next
two sections on how SQ and USQ actually implement this.

Even though this article primarily has addresses ASCII input files, there is
nothing which restricts this algorithm to ASCII.  It will work on binary files
(.COM or .EXE) as well.  But since the length of the encoded bit string is
approximately equal to the inverse of the frequency of occurrence of each 8 bit
byte, a binary file may not compress very much.  This is because a binary file
most likely has a uniform distribution over the 256 values in a byte.  A
machine language program is not like the English language where the letter "e"
is used far more than other letters.  If the distribution is uniform, the
encoded bit strings will all be the same length and the encoded file could be
longer than the original (because of the encoding information on the front of
the file).  All of this has to be qualified, because machine language programs
tend to use a lot of "MOV" instructions and have a lot of bytes of zeros so
that encoding .COM and .EXE files does save some disk space.

                SQ

The SQ program is an example of the Huffman algorithm.

Cont'd Next Message

 
---
 * QDeLuxe 1.10 #168