Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!mintaka!bloom-beacon!eru!hagbard!sunic!chalmers.se!mathrt0.math.chalmers.se!d5kwedb From: d5kwedb@dtek.chalmers.se (Kristian Wedberg) Newsgroups: comp.compression Subject: Greedy weirdness... Message-ID: <1991Jun10.230518.29430@mathrt0.math.chalmers.se> Date: 10 Jun 91 23:05:18 GMT Sender: news@mathrt0.math.chalmers.se (Evald Nyhetsson) Organization: Chalmers University of Technology, Gothenburg, Sweden Lines: 18 Not to be outdone, I've invented my very own dictionary based compression algorithm. It uses the greedy algorithm for parsing the input, which according to the gurus should set back compression by some 5 or 10 percent due to sub-optimal choice of strings. What I don't understand is why there is a optimal maximum for the stringlengths! Put differently, for a given source text, my model produces the lowest entropy with a maximum stringlength of about a dozen characters, but it increases with longer strings. Now, assuming the longer strings are coded with their appropriate probabilities (as I believe they are>-), it seems strange that longer strings could worsen compression. Intuitively, the presence of longer strings should, ON AVERAGE, improve results. Why? A longer string could be many times less probable and still be coded in fewer bits compared to two strings of half the size. Me thinks, that is. Unfortunately, my experiments doesn't bear this out. So, what am I doing wrong? Ideas, flames, pointers et al accepted. k ki kit kitt kitte eom