Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!cs.utexas.edu!uunet!mcsun!ukc!axion!tharr!gtoal From: gtoal@tharr.UUCP (Graham Toal) Newsgroups: alt.sources.d Subject: Compression algorithm Keywords: compression, data storage, data transmission, trees, graphs Message-ID: <986@tharr.UUCP> Date: 15 Sep 90 22:48:18 GMT Reply-To: gtoal%ed.ac.uk@nsfnet-relay.ac.uk (Graham Toal) Followup-To: alt.sources.d Distribution: alt Organization: Public access to Usenet in the UK Lines: 244 These are my thoughts on a new compression algorithm to replace Lempel- Zev-Welch.* They are very preliminary, and I haven't coded any of it. I throw it out for comment, and if anyone wants to modify it and/or code it up, please feel free to do so. It's too big a hack for me to start on just now; I've got quite a big in-tray! (* if it becomes necessary because of the software patent issue) Take the input data laid out horizontally, and build a tree out of it with pairwise bottom-up assembly, eg: FIG 1: /\ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ /\ /\ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ /\ /\ /\ /\ / \ / \ / \ / \ / \ / \ / \ / \ /\ /\ /\ /\ /\ /\ /\ /\ a b c d a b c g h i b c c g c g Now we convert this into a DAG (Directed Acyclic Graph). In fact we actually do the node merging/dag creation on the fly as we build the table bottom up. If you use a hash table to detect identical nodes, this will surprisingly be a linear time algorithm at the expense of space propotional to the tree width; I used something similar in a spelling-checker project to compact word dictionaries, so I know it works. The Directed Graph looks like this: FIG 2: 0 /\ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ 1 2 /\ /\ / \ / \ / \ / \ / \ / \ / \ / \ 3 4 5 6 /\ /\ /\ /\ / \ / \ / \ / \ 7 8 =7 10 11 12 =10 =10 /\ /\ /\ /\ /\ a b c d a b c g h i b c c g c g Now we output the tree from left to right, walking the leaves. a b c d [7] c g h i b c [10] [10] Every time a node number is output instead of a leaf character, enough of the tree will have been built so that the node number points to a bit of the tree which has already been built. (Node numbers could be output using escape codes, but it would probably be better to use something like huffman on an alphabet size which is large enough to encompass the tree. Note that the most frequent codes, near the bottom of the tree, have large numbers. A neat encoding scheme should undo this effect.) In real life, data will seldom come in powers of two; however this is an easy problem to avoid; just assume an infinite sequence of some impossible code following the last character; eg, if our last example were a little shorter: { FIG 3: { { 0 { /\ { / \ { / \ { / \ { / \ { / \ { / \ { / \ { / \ { / \ { / \ { 1 2 { /\ /\ { / \ / \ { / \ / \ { / \ / \ { / \ / \ { 3 4 5 6 { /\ /\ /\ /\ { / \ / \ / \ / \ { 7 8 =7 10 11 12 / \ { /\ /\ /\ /\ /\ /\ /\ { a b c d a b c g h i b c c -1 -1 -1 Of course, you wouldn't output all the -1's -- the first one would be enough to imply termination of the compressed input file when reading it back. The next problem is that a tree like this is not sensible for large files; clearly we want some sort of windowing mechanism. I suggest we pick some maximum power of two (here we're using 16 for the sake of demonstration, although a real program might use 16384) Now we start building up our tree as before, but when we've filled it (to the stage we had done in FIG 2), we emit the codes for the left half, and shift the right tree branches to the left; FIG 4: Output: a b c d [7] c g 0 / \ / . / . / . / / / / / / / 2 /\ / \ / \ / \ / \ 5 6 /\ /\ / \ / \ 11 12 =10 =10 /\ /\ h i b c c g c g The next N/2 chars are read in and added to the DAG, sharing nodes as before. Nodes in the left-hand half which pointed to a node which has now been lost must be renumbered (or more precisely, a new master node should be created for each shared node): FIG 5: 0 / \ / . / . / . / / / / / / / 1 /\ / \ / \ / \ / \ 3 4 /\ /\ / \ / \ 7 8 9 =9 /\ /\ /\ h i b c c g c g This process of double-buffering now repeats until the first -1 node is output. Note that large runs of repeated data are encoded in NlogN tokens. Whether this is enough, or whether we should pipe the input through a simple run-length compacter before being fed to this algorithm can be determined by experiment. FIG 6: 0 /\ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ 1 =1 /\ / \ / \ / \ / \ 3 =3 /\ / \ 7 =7 /\ a a a a a a a a a a a a a a a a Outputs: a a [7] [3] [1] Happy hacking. Graham Toal (Don't mail me comments; post instead please (to alt.sources.d)) 7] [3] [1] Happy hacking. Graham Toal (Don't mail me comments; post instead please (to alt.sources.d))