Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!cs.utexas.edu!uunet!viusys!uxui!unislc!ttobler
From: ttobler@unislc.uucp (Trent Tobler)
Newsgroups: comp.compression
Subject: Re: Is there better lossless compression than LZW?
Message-ID: <1991Jun2.085929.12871@unislc.uucp>
Date: 2 Jun 91 08:59:29 GMT
References: <1991May29.155853.13716@looking.on.ca>
Organization: unisys
Lines: 83

From article <1991May29.155853.13716@looking.on.ca>, by brad@looking.on.ca (Brad Templeton):
> You suggest that a string look ahead with clever heuristics to add
> and delete from the dictionary would give good compression.
> 
> How?  If it's a look ahead method, the decompressor can't look ahead and
> duplicate the compresor's model.
> 
> This means you must insert tokens into the stream, saying "here's
> a good dictionary entry."    And then you need the reference to the
> entry.  The two together take as much or more room than the backpointer
> of LZ compression.
> 
> Or do you have some idea that I am missing?

This type of algorithem would adapt faster than any LZ algorithem.  It would
also only have those entries in it's dictionary that it needed, and because
it uses arithmetic coding, any LZ algorithem that contained more than
x times the number of entries at any time compared to this will lose when
it comes to compression ratios (the x come from the number of bits for the
prefix necessary to determine if the entry should be removed/added/etc - this
is what I think you had a problem with).  Since I have not actually coded this,
I can not give exact figures, but here is a description (albiet cryptic) of one.

The algorithm that I have imagined and examined is sort of a language.  It
has a beginning dictionary of BD (begin-definition), ED (end-definition),
ITK (insert-token-keep), ITR (insert-token-remove), BQ (quote), EQ (end-quote),
and QC (quote-character).  This is a state machine (Actually closer to a PDA),
with each state having some probability of occuring (based on passed
selections).  Below, I have outlined the possible actions that each state can
do.

  Command:
     BD  -> [Definition] appended to stream.
     ITK -> [Token] appended to stream.
     ITR -> [Token] appended to stream, then removed from dictionary.
     BQ -> [Quote] appended to stream.
     QC -> single character appended to stream.

  Definition:
     BD -> [Definition] appended to new token.
     ED -> return new token to calling state.
     ITK -> [Token] appended to new token.
     ITR -> [Token] appended to new token, then removed from dictionary.
     BQ -> [Quote] appended to new token.
     QC -> single character appended to new token.

  Token:
     <ident> -> return contents of token <ident>

  Quote:
     <ascii> -> add ascii character to quote buffer.
     EQ -> return quote buffer.


It might help if I gave an example. 


  Uncompressed text:
     "mississippi"

  Compressed text:                                      dictionary:
     [QC](2.32)  'm'(8)                                   -
     [BD](2.32)                                           -
          [BD](2.58) [QC](2.58) 'i'(8) [ED](2.58)        'i'
          [BD](2.58) [QC](2.58) 's'(8) [ED](2.58)        'i','s'
          [ITD](2.58) 's'(1)                             'i'
     [ED](2.58)                                          'i','iss'
     [ITD](2.32) 'iss'(1)                                'i'
     [ITK](2.32) 'i'(0)                                  'i'
     [BD](2.32) [QC](2.58) 'p'(8) [ED](2.58)             'i','p'
     [ITD](2.32) 'p'(1)                                  'i'
     [ITD](2.32) 'i'(0)                                   -

*the number in the parenteses is the number of bits to encode the symbol.

So, if you are using an 8 bit machine, "mississippi" requires 8*11 = 88 bits,
encoded, 77.04 bits.

> -- 
> Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

--
   Trent Tobler - ttobler@csulx.weber.edu