Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!cs.utexas.edu!uunet!mcsun!ukc!axion!tharr!gtoal
From: gtoal@tharr.UUCP (Graham Toal)
Newsgroups: alt.sources.d
Subject: Compression algorithm
Keywords: compression, data storage, data transmission, trees, graphs
Message-ID: <986@tharr.UUCP>
Date: 15 Sep 90 22:48:18 GMT
Reply-To: gtoal%ed.ac.uk@nsfnet-relay.ac.uk (Graham Toal)
Followup-To: alt.sources.d
Distribution: alt
Organization: Public access to Usenet in the UK
Lines: 244

These are my thoughts on a new compression algorithm to replace Lempel-
Zev-Welch.*  They are very preliminary, and I haven't coded any of it. I throw
it out for comment, and if anyone wants to modify it and/or code it up,
please feel free to do so.  It's too big a hack for me to start on just now;
I've got quite a big in-tray!

(* if it becomes necessary because of the software patent issue)


Take the input data laid out horizontally, and build a tree out of it with
pairwise bottom-up assembly, eg:


FIG 1:
                      /\
                     /  \
                    /    \
                   /      \
                  /        \
                 /          \
                /            \
               /              \
              /                \
             /                  \
            /                    \
           /                      \
          /\                      /\
         /  \                    /  \
        /    \                  /    \
       /      \                /      \
      /        \              /        \
     /          \            /          \
    /\          /\          /\          /\
   /  \        /  \        /  \        /  \
  /    \      /    \      /    \      /    \
 /\    /\    /\    /\    /\    /\    /\    /\
a  b  c  d  a  b  c  g  h  i  b  c  c  g  c  g


Now we convert this into a DAG (Directed Acyclic Graph). In fact we actually
do the node merging/dag creation on the fly as we build the table bottom up.
If you use a hash table to detect identical nodes, this will surprisingly be
a linear time algorithm at the expense of space propotional to the tree
width; I used something similar in a spelling-checker project to compact word
dictionaries, so I know it works.

The Directed Graph looks like this:


FIG 2:
                       0
                      /\
                     /  \
                    /    \
                   /      \
                  /        \
                 /          \
                /            \
               /              \
              /                \
             /                  \
            /                    \
           1                      2
          /\                      /\
         /  \                    /  \
        /    \                  /    \
       /      \                /      \
      /        \              /        \
     3          4            5          6
    /\          /\          /\          /\
   /  \        /  \        /  \        /  \
  7    8     =7    10     11   12    =10  =10
 /\    /\          /\    /\    /\          
a  b  c  d  a  b  c  g  h  i  b  c  c  g  c  g


Now we output the tree from left to right, walking the leaves.

a  b  c  d  [7]  c  g  h  i  b  c  [10] [10]

Every time a node number is output instead of a leaf character, enough of the
tree will have been built so that the node number points to a bit of the tree
which has already been built.

(Node numbers could be output using escape codes, but it would probably be
better to use something like huffman on an alphabet size which is large
enough to encompass the tree.  Note that the most frequent codes, near the
bottom of the tree, have large numbers.  A neat encoding scheme should
undo this effect.)

In real life, data will seldom come in powers of two; however this is an easy
problem to avoid; just assume an infinite sequence of some impossible code
following the last character; eg, if our last example were a little shorter:


 { FIG 3:
 {
 {                        0
 {                       /\
 {                      /  \
 {                     /    \
 {                    /      \
 {                   /        \
 {                  /          \
 {                 /            \
 {                /              \
 {               /                \
 {              /                  \
 {             /                    \
 {            1                      2
 {           /\                      /\
 {          /  \                    /  \
 {         /    \                  /    \
 {        /      \                /      \
 {       /        \              /        \
 {      3          4            5          6
 {     /\          /\          /\          /\
 {    /  \        /  \        /  \        /  \
 {   7    8     =7    10     11   12     /    \
 {  /\    /\          /\    /\    /\    /\    /\ 
 { a  b  c  d  a  b  c  g  h  i  b  c  c -1  -1 -1


Of course, you wouldn't output all the -1's -- the first one would be enough
to imply termination of the compressed input file when reading it back.

The next problem is that a tree like this is not sensible for large files;
clearly we want some sort of windowing mechanism. I suggest we pick some
maximum power of two (here we're using 16 for the sake of demonstration,
although a real program might use 16384)

Now we start building up our tree as before, but when we've filled it
(to the stage we had done in FIG 2), we emit the codes for the left
half, and shift the right tree branches to the left;


FIG 4:

Output: a b c d [7] c g

                         0
                        / \
                       /   .
                      /     .
                     /       .
                    /
                   /
                  /
                 /
                /
               /
              /
             2
            /\
           /  \
          /    \
         /      \
        /        \
       5          6
      /\          /\
     /  \        /  \
    11   12    =10  =10
   /\    /\              
  h  i  b  c  c  g  c  g


The next N/2 chars are read in and added to the DAG, sharing nodes as before.
Nodes in the left-hand half which pointed to a node which has now been lost
must be renumbered (or more precisely, a new master node should be created for
each shared node):


FIG 5:

                         0
                        / \
                       /   .
                      /     .
                     /       .
                    /
                   /
                  /
                 /
                /
               /
              /
             1
            /\
           /  \
          /    \
         /      \
        /        \
       3          4
      /\          /\
     /  \        /  \
    7    8      9    =9
   /\    /\    /\                
  h  i  b  c  c  g  c  g


This process of double-buffering now repeats until the first -1 node is output.


Note that large runs of repeated data are encoded in NlogN tokens.  Whether this
is enough, or whether we should pipe the input through a simple run-length
compacter before being fed to this algorithm can be determined by experiment.


FIG 6:
                       0
                      /\
                     /  \
                    /    \
                   /      \
                  /        \
                 /          \
                /            \
               /              \
              /                \
             /                  \
            /                    \
           1                      =1
          /\
         /  \
        /    \
       /      \
      /        \
     3          =3
    /\           
   /  \
  7    =7
 /\  
a  a  a  a  a  a  a  a  a  a  a  a  a  a  a  a


Outputs: a a [7] [3] [1]


Happy hacking.

Graham Toal <gtoal@ed.ac.uk>

(Don't mail me comments; post instead please (to alt.sources.d))

7] [3] [1]


Happy hacking.

Graham Toal <gtoal@ed.ac.uk>

(Don't mail me comments; post instead please (to alt.sources.d))