Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!dptcdc!tmsoft!masnet!canremote!jack.bzoza
From: jack.bzoza@canremote.uucp (JACK BZOZA)
Newsgroups: comp.sys.ibm.pc
Subject: Re: Squeeze compression utili
Message-ID: <89080106504836@masnet.uucp>
Date: 31 Jul 89 05:30:00 GMT
Organization: Canada Remote Systems Limited, Mississauga, ON, Canada
Lines: 95

Cont'd from last message:

The first thing that SQ does is read through the input file and create a
distribution array for the 256 possible characters.  This array contains counts
of the number of occurrences of each of the 256 characters.  The program counts
these values in a 16 bit number.  It makes sure that, if you are encoding a big
file, counts do not exceed a 16 bit value.  This is highly unlikely, but it
must be accounted for.

At the same time, SQ removes strings of identical characters and replaces them
with the ASCII character DLE followed by a character count of 2-255.  SQ
replaces the ASCII DLE with the pair of characters:  DLE DLE.  This is not
related to the Huffman algorithm but just serves to compress the file a little
more.

Once SQ has scanned the input file, it creates a binary tree structure
containing this frequency occurrence information.  The most frequently
occurring characters have the shortest path from the root to the node, and the
least frequently occurring characters have the longest path.  For example, if
your file were:

ABRACADABRA (a very simple and magical example)

The table of frequency of occurrences would be:

 Letter         # of Occurrences
 ------         ---------------
    A                 5
    B                 2
    C                 1
    D                 1
    R                 2
    all the rest      0

Since the letter "A" occurs most often, it should have the shortest encoded bit
string.  The letters "C" and "D" should have the longest.  The other characters
which don't appear in the input file don't need to be considered.

SQ would create a binary tree to represent this information.  The tree might
look something like this (for purposes of discussion only):

      root    <---Computer trees are
     /    \      always upside down!
   1       0   <-- This is called a
 /       /   \        node.
A      1       0
     /       /   \  <--This is called
   B       1       0    branch.
         /       /   \
       C       1       0
             /           \
           D               R  <-This
                                is a
                                leaf


From this our encoded bit strings which are kept in a translation table would
be:

 Table Entry  Character  Binary
 -----------  ---------  ------
      1           A      1
      2           B      01
      3           C      001
      4           D      0001
      5           R      0000


The output file would be:

   A B  R    A C   A D    A B  R    A
   ----------------------------------
   1 01 0000 1 001 1 0001 1 01 0000 1
   (binary)

   A1 31 A1
   (hex)

We have reduced the size of your file from ten bytes to three bytes for a 70%
savings.  For this simple example, things aren't that well off since we must
put the binary tree encoding information onto the file as well.  So the file
size grew a lot.  But consider a file with the word ABRACADABRA repeated
100,000 times.  Now the encoding information is going to be a very very small
percentage of the output file and the file will shrink tremendously.

SQ opens the output file and writes out the binary tree information.  Then SQ
rewinds the input file and rereads it from the beginning.  As it reads each
character, it looks into the translation table and outputs the corresponding
bit string.

Cont'd next message

 
---
 * QDeLuxe 1.10 #168