Path: utzoo!attcan!utgpu!cs.utexas.edu!uunet!lll-winken!xanth!ames!uhccux!virtue!aukuni.ac.nz!pgut1
From: pgut1@iti.org (PeterClaus          Gutmann        )
Newsgroups: alt.sources.d
Subject: Re: Compression algorithm
Keywords: compression, data storage, data transmission, trees, graphs
Message-ID: <pgut1.654503506@aucs7.cs.aukuni.ac.nz>
Date: 28 Sep 90 06:31:46 GMT
Sender: news@ccu1.aukuni.ac.nz (News Owner)
Distribution: alt
Organization: University of Auckland, New Zealand.
Lines: 58

In <986@tharr.UUCP> gtoal@tharr.UUCP (Graham Toal) writes:

>These are my thoughts on a new compression algorithm to replace Lempel-
>Zev-Welch.*  They are very preliminary, and I haven't coded any of it. I throw
>it out for comment, and if anyone wants to modify it and/or code it up,
>please feel free to do so.
>(* if it becomes necessary because of the software patent issue)

[Long description of compression algorithm]

.................

>(Node numbers could be output using escape codes, but it would probably be
>better to use something like huffman on an alphabet size which is large
>enough to encompass the tree.  Note that the most frequent codes, near the
>bottom of the tree, have large numbers.  A neat encoding scheme should
>undo this effect.)


>The next problem is that a tree like this is not sensible for large files;
>clearly we want some sort of windowing mechanism. I suggest we pick some
>maximum power of two (here we're using 16 for the sake of demonstration,
>although a real program might use 16384)

.................

>Happy hacking.

>Graham Toal <gtoal@ed.ac.uk>


Algorithms to do this type of compression already exist.  Lempel and Ziv did
two compression schemes, generally referred to as LZ77 and LZ78 (after the
years in which they were published in the IEEE Transactions on Information
Theory).  LZW is based on the LZ78 scheme, the algorithm you propose is like
their LZ77 scheme.  This (LZ77) algorithm is used by PKZIP and LHARC as the
"front-end" for their compression.  LHARC then uses adaptive Huffman coding
to compress the output of the LZ compressor, PKZIP uses Shannon-Fano coding.
The LZ77-based compressors generally work by using an 8K window with various
data structures to handle the maximum-length string matching.  I've written a
compressor which seems to achieve much better results with a 16K window, at the
expense of being a bit slower.  There is also a very fast compression
algorithm invented by Ed Fiala and Dan Greene (published in Communications of
the ACM, Vol.32, No.4 (April 1989), p490) which is a sort of hybrid between
LZ77 and LZ78, which uses a very fancy data structure to build a DAG (much like
you are proposing).  The problem with this algorithm is that, like LZW, it is
also patented.  The LZ77-compressors used in PKZIP and LHARC are based on the
LZSS compression scheme, which is described fairly well by Tim Bell in
the IEEE Trans. on Communications, Vol.34, No.12 (Dec.1986), p.1176.....
waffle burble waffle (this is one of my pet topics :-)

Peter.

--
[In order of reliability] Peter_Gutmann@kcbbs.gen.nz or peter@nacjack.gen.nz
			or pgut1@compsc.aukuni.ac.nz
Disclaimer:  The contents of this message are irrelevant.  It serves merely
	     as a covert channel for a data leak to The Commies.