Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!mintaka!bloom-beacon!eru!hagbard!sunic!chalmers.se!mathrt0.math.chalmers.se!d5kwedb
From: d5kwedb@dtek.chalmers.se (Kristian Wedberg)
Newsgroups: comp.compression
Subject: Greedy weirdness...
Message-ID: <1991Jun10.230518.29430@mathrt0.math.chalmers.se>
Date: 10 Jun 91 23:05:18 GMT
Sender: news@mathrt0.math.chalmers.se (Evald Nyhetsson)
Organization: Chalmers University of Technology, Gothenburg, Sweden
Lines: 18

Not to be outdone, I've invented my very own dictionary based compression
algorithm. It uses the greedy algorithm for parsing the input, which
according to the gurus should set back compression by some 5 or 10 percent
due to sub-optimal choice of strings. What I don't understand is why
there is a optimal maximum for the stringlengths! Put differently,
for a given source text, my model produces the lowest entropy with
a maximum stringlength of about a dozen characters, but it increases
with longer strings.

Now, assuming the longer strings are coded with their appropriate 
probabilities (as I believe they are>-), it seems strange that longer strings
could worsen compression. Intuitively, the presence of longer strings should,
ON AVERAGE, improve results. Why? A longer string could be many times
less probable and still be coded in fewer bits compared to two strings
of half the size. Me thinks, that is. Unfortunately, my experiments doesn't bear
this out. So, what am I doing wrong? Ideas, flames, pointers et al accepted.

	k ki kit kitt kitte eom