Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!dptcdc!tmsoft!masnet!canremote!jack.bzoza From: jack.bzoza@canremote.uucp (JACK BZOZA) Newsgroups: comp.sys.ibm.pc Subject: Re: Squeeze compression utili Message-ID: <89080106504765@masnet.uucp> Date: 31 Jul 89 05:30:00 GMT Organization: Canada Remote Systems Limited, Mississauga, ON, Canada Lines: 98 je>A friend of mine has recently come across some compressed je>files that he says has been packed using a utility called je>SQUEEZE. Apparently, it leaves an extension with a 'Q' in it. je>I am not CERTAIN that isn't just the .hqx from a binhex je>compression, but he seems to be quite certain about squeeze. Here is a summary of SQueeze and USQueeze which I found on a local BBS and am posting for all interested parties. Data Compression Techniques Using SQ and USQ Mark DiVecchio San Diego IBM PC Users Group In any computer system, efficient management of the file storage space is important. The two programs SQ and USQ reduce the size of data files by using the Huffman Data Compression Algorithm. A file is composed of a sequence of bytes. A byte is composed of 8 bits. That byte can have a decimal value between 0 and 255. A typical text file, like a C language program, is composed of the ASCII character set (decimal 0 to 127). This character set only requires seven of the eight bits in a byte. Just think -- if there were a way to use that extra bit for another character, you could reduce the size of the file by 12.5%. For those of you who use upper case characters only, you use only about 100 of the ASCII codes from 0 to 255. You could save 60% of your storage if the bits not needed within each byte could be used for other characters. What if you could encode the most frequently used characters in the file with only one bit? For example, if you could store the letter "e" (the most commonly used letter in the English language) using only one bit : "1", you could save 87.5% of the space that the "e" would normally take if it were stored as the ASCII "0110 0101" binary. The answer to these questions is the SQ and USQ programs. The SQ program uses the Huffman coding technique to search for the frequency of use of each of the 256 possible byte patterns, and it then assigns a translation for each character to a bit string. All of these bit strings are placed end to end and written onto a disk file. The encoding information is also put on the file since the USQ program needs to know the character distribution of the original file. The USQ program reads in the encoding information and then reads in the encoded file. It is a simple matter to scan the encoded file and produce an output file which is identical to the file that SQ started with. Huffman Coding Technique This is by far the most popular encoding technique in use today. The Huffman encoding replaces fixed bit characters with variable length bit strings. The length of the bit string is roughly inversely proportional to the frequency of occurrence of the character. For those of you inclined to such symbolism: Length of bit ~= log (character string 2 probability) The implementation of the algorithm which we will discuss encodes fixed bit strings of length 8. This algorithm requires two passes through the input file. The first pass builds a table of 256 entries showing the frequency of each occurrence of each of the 256 possible values for a byte of information. Once the counting is complete, the algorithm must decide which bit strings to associate with each of the 256 characters that were found in the file. Note that if a particular byte value was never used, no string association is needed. The second pass through the input file converts each byte into its encoded string. Remember that when the output file is created, the information used for encoding must also be written on the file for use by the decoding program. The decoding program reads the encoding information from the file and then starts reading the bit strings. As soon as enough bits are read to interpret a character, that character is written onto the final output file. See the next two sections on how SQ and USQ actually implement this. Even though this article primarily has addresses ASCII input files, there is nothing which restricts this algorithm to ASCII. It will work on binary files (.COM or .EXE) as well. But since the length of the encoded bit string is approximately equal to the inverse of the frequency of occurrence of each 8 bit byte, a binary file may not compress very much. This is because a binary file most likely has a uniform distribution over the 256 values in a byte. A machine language program is not like the English language where the letter "e" is used far more than other letters. If the distribution is uniform, the encoded bit strings will all be the same length and the encoded file could be longer than the original (because of the encoding information on the front of the file). All of this has to be qualified, because machine language programs tend to use a lot of "MOV" instructions and have a lot of bytes of zeros so that encoding .COM and .EXE files does save some disk space. SQ The SQ program is an example of the Huffman algorithm. Cont'd Next Message --- * QDeLuxe 1.10 #168