Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!dptcdc!tmsoft!masnet!canremote!jack.bzoza From: jack.bzoza@canremote.uucp (JACK BZOZA) Newsgroups: comp.sys.ibm.pc Subject: Re: Squeeze compression utili Message-ID: <89080106504836@masnet.uucp> Date: 31 Jul 89 05:30:00 GMT Organization: Canada Remote Systems Limited, Mississauga, ON, Canada Lines: 95 Cont'd from last message: The first thing that SQ does is read through the input file and create a distribution array for the 256 possible characters. This array contains counts of the number of occurrences of each of the 256 characters. The program counts these values in a 16 bit number. It makes sure that, if you are encoding a big file, counts do not exceed a 16 bit value. This is highly unlikely, but it must be accounted for. At the same time, SQ removes strings of identical characters and replaces them with the ASCII character DLE followed by a character count of 2-255. SQ replaces the ASCII DLE with the pair of characters: DLE DLE. This is not related to the Huffman algorithm but just serves to compress the file a little more. Once SQ has scanned the input file, it creates a binary tree structure containing this frequency occurrence information. The most frequently occurring characters have the shortest path from the root to the node, and the least frequently occurring characters have the longest path. For example, if your file were: ABRACADABRA (a very simple and magical example) The table of frequency of occurrences would be: Letter # of Occurrences ------ --------------- A 5 B 2 C 1 D 1 R 2 all the rest 0 Since the letter "A" occurs most often, it should have the shortest encoded bit string. The letters "C" and "D" should have the longest. The other characters which don't appear in the input file don't need to be considered. SQ would create a binary tree to represent this information. The tree might look something like this (for purposes of discussion only): root <---Computer trees are / \ always upside down! 1 0 <-- This is called a / / \ node. A 1 0 / / \ <--This is called B 1 0 branch. / / \ C 1 0 / \ D R <-This is a leaf From this our encoded bit strings which are kept in a translation table would be: Table Entry Character Binary ----------- --------- ------ 1 A 1 2 B 01 3 C 001 4 D 0001 5 R 0000 The output file would be: A B R A C A D A B R A ---------------------------------- 1 01 0000 1 001 1 0001 1 01 0000 1 (binary) A1 31 A1 (hex) We have reduced the size of your file from ten bytes to three bytes for a 70% savings. For this simple example, things aren't that well off since we must put the binary tree encoding information onto the file as well. So the file size grew a lot. But consider a file with the word ABRACADABRA repeated 100,000 times. Now the encoding information is going to be a very very small percentage of the output file and the file will shrink tremendously. SQ opens the output file and writes out the binary tree information. Then SQ rewinds the input file and rereads it from the beginning. As it reads each character, it looks into the translation table and outputs the corresponding bit string. Cont'd next message --- * QDeLuxe 1.10 #168