Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!dali.cs.montana.edu!caen!sdd.hp.com!wuarchive!uunet!viusys!uxui!unislc!ttobler
From: ttobler@unislc.uucp (Trent Tobler)
Newsgroups: comp.compression
Subject: Re: Is there better lossless compression than LZW?
Message-ID: <1991Jun4.191208.11186@unislc.uucp>
Date: 4 Jun 91 19:12:08 GMT
References: <1991Jun02.213116.3069@looking.on.ca>
Organization: unisys
Lines: 81

From article <1991Jun02.213116.3069@looking.on.ca>, by brad@looking.on.ca (Brad Templeton):
> I investigated such ideas and I think there are a few problems.
> 
> First of all, to avoid confusion of terms, LZ compressions refers to two
> algorithms.   One is a general algorithm where you have backpointers into
> the previous text.  The whole of the previous text (or the last N bytes)
> is your dictionary, so there is not a question of "adapting faster" in
> such an algorithm.   THe laster LZ algorithm, and its descendants, such

Such an algorithm indeed does adapt, but similar to mine, requires
additional bits in order to determine if the next token is a symbol, or
if the next token is a back-pointer.  Depending on how fast you want it
to adapt, the number of additional bits for a back-pointer can be 0 (ie
no adaption, but high compression), or take 0 to be the number of bits to
say 'this is a symbol' and get infinite adaption, but no compression.
And also, you have to decide on how you are going to represent the
'back-pointers'.  Are they relative to the current position, the beginning
of the file, somewhere in between?  Also, are they fixed length, variable
length, etc.  And how do you say "take 50 characters from a position 100
bytes back"

> as LZW (a variant designed for speed) M-W and All-Prefix coding do have
> "how fast they adapt" as an important factor.

So does the 'back-pointer'.  I guess the nice thing about it is you
can decide how fast it will adapt..

> Your algorithm is similar to LZ1, but instead of backpointers you use
> tags.   Tag a string when it goes out, and then refer to it when you need to.

No, I only tag a string that the process KNOWS will be used later.

> Problem is that tag+referral can be larger than just a backpointer referral.
> How long are the end-token and start-token codes?   Just as important, since
> they exist in the same token-space as regular entries in the alphabet, how
> much longer do they make all the other codes.

all other codes will be 2.32 bits longer (ie a single character will be
10.32 bits long); a string of characters will be 2.32 + (x+1)*8.01 bits
long (ie a string of 10 characters will be 82.42 bits long); tokens (dictionary
entry substitution) will be 2.32 + ln(T/N)/ln(2) - N is the number of times the
token has been used, T is sum of the number of times all current tokens have
been used (so if there are 10 tokens, N(4) = 10, and T = 50, so the number
of bits for token 4 would be 4.64 bits;  To define a new entry requires
4.90 additional bits above what it takes to encode what that entry is (using
the previous encodings.)  All of this assumes that I do only frequency
counting on the dictionary entries, and do none on any of the other tokens.

> (ie. if you use 1 bit for a tag code, you add 1 bit to all the regular
> tokens in the alphabet)
> 
> It's tricky.  I'm not saying it's impossible, but I would like to see more
> evidence that it can be done.
> 
> Remember as well that you have to deal with the fact that a new entry can
> often consists of the tail of one reference and the head of another reference.

But I take in to consideration this fact.

> ie. if you have two dictionary entries "the boy " and "who fell" how do you
> tag the string "boy who" if it first appears in the transmission of the
> two entries together?   How do you do this without using space?

well, if "the boy who fell" appeared in the text, "the boy", and "who fell"
would not have been added to the dictionary unless the appeared later.  And
if "boy who" also appeared later, "boy ", and "who" would have also been
added.  I did not give any indication of how I was storing the strings, only
that they must have the properties that I can add any entry, and delete
any entry.  The string storage could be reference pointers, actual text, etc,
and so the space used would be dependant on that.  (There is one method
that guarantees no more than N+ln(N)*N memory units for data of size N)

Also, because I did not even attempt any heuristic algorithems in
my description of an example, there will be problems like this.  That is
why I stated that this combined with heuristics would be much better.

> -- 
> Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

--
   Trent Tobler - ttobler@csulx.weber.edu