Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!cs.utexas.edu!uunet!viusys!uxui!unislc!ttobler From: ttobler@unislc.uucp (Trent Tobler) Newsgroups: comp.compression Subject: Re: Is there better lossless compression than LZW? Message-ID: <1991Jun2.085929.12871@unislc.uucp> Date: 2 Jun 91 08:59:29 GMT References: <1991May29.155853.13716@looking.on.ca> Organization: unisys Lines: 83 From article <1991May29.155853.13716@looking.on.ca>, by brad@looking.on.ca (Brad Templeton): > You suggest that a string look ahead with clever heuristics to add > and delete from the dictionary would give good compression. > > How? If it's a look ahead method, the decompressor can't look ahead and > duplicate the compresor's model. > > This means you must insert tokens into the stream, saying "here's > a good dictionary entry." And then you need the reference to the > entry. The two together take as much or more room than the backpointer > of LZ compression. > > Or do you have some idea that I am missing? This type of algorithem would adapt faster than any LZ algorithem. It would also only have those entries in it's dictionary that it needed, and because it uses arithmetic coding, any LZ algorithem that contained more than x times the number of entries at any time compared to this will lose when it comes to compression ratios (the x come from the number of bits for the prefix necessary to determine if the entry should be removed/added/etc - this is what I think you had a problem with). Since I have not actually coded this, I can not give exact figures, but here is a description (albiet cryptic) of one. The algorithm that I have imagined and examined is sort of a language. It has a beginning dictionary of BD (begin-definition), ED (end-definition), ITK (insert-token-keep), ITR (insert-token-remove), BQ (quote), EQ (end-quote), and QC (quote-character). This is a state machine (Actually closer to a PDA), with each state having some probability of occuring (based on passed selections). Below, I have outlined the possible actions that each state can do. Command: BD -> [Definition] appended to stream. ITK -> [Token] appended to stream. ITR -> [Token] appended to stream, then removed from dictionary. BQ -> [Quote] appended to stream. QC -> single character appended to stream. Definition: BD -> [Definition] appended to new token. ED -> return new token to calling state. ITK -> [Token] appended to new token. ITR -> [Token] appended to new token, then removed from dictionary. BQ -> [Quote] appended to new token. QC -> single character appended to new token. Token: -> return contents of token Quote: -> add ascii character to quote buffer. EQ -> return quote buffer. It might help if I gave an example. Uncompressed text: "mississippi" Compressed text: dictionary: [QC](2.32) 'm'(8) - [BD](2.32) - [BD](2.58) [QC](2.58) 'i'(8) [ED](2.58) 'i' [BD](2.58) [QC](2.58) 's'(8) [ED](2.58) 'i','s' [ITD](2.58) 's'(1) 'i' [ED](2.58) 'i','iss' [ITD](2.32) 'iss'(1) 'i' [ITK](2.32) 'i'(0) 'i' [BD](2.32) [QC](2.58) 'p'(8) [ED](2.58) 'i','p' [ITD](2.32) 'p'(1) 'i' [ITD](2.32) 'i'(0) - *the number in the parenteses is the number of bits to encode the symbol. So, if you are using an 8 bit machine, "mississippi" requires 8*11 = 88 bits, encoded, 77.04 bits. > -- > Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473 -- Trent Tobler - ttobler@csulx.weber.edu