Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!elroy.jpl.nasa.gov!decwrl!frnkmth!bill
From: bill@franklin.com (bill)
Newsgroups: comp.compression
Subject: Re: fast string compr.?
Message-ID: <27May91.032155.8833@franklin.com>
Date: 27 May 91 03:21:55 GMT
References: <LACA.91May20141838@vipunen.hut.fi> <1991May25.235703.159943@locus.com>
Organization: Franklin Electronic Publishers
Lines: 29

In article <1991May25.235703.159943@locus.com> cjkuo@locus.com
	(Chengi Jimmy Kuo) writes:

(Are you the Jimmy Kuo who used to work at Proximity?)

: Compress the dictionary but having a table for the first 2 or 3 letters
: (26x26, or 26x26x26, but actually much less since there are no qbX, etc.).
: For the words, once you're passed the first 3 letters, start compressing by
: using "repeat deltas".  A word is composed of N letters that repeat from the
: previous word, followed by the delta.  You can compress the delta portion
: using whatever method you want since it's straight alphabet letters (but I
: choose not to).  Compression may not be the best, but data access is pretty
: fast since decompression is trivial (also, strcmp is partially completed
: due to this encoding).

The problem is that you have to decompress, on average, half a
bin, and if you have a lot of words, or lots of inflections, this
will be awfully slow. The thing that made this kind of compression
work for us (at Proximity) was that not only did we do the above,
but we assigned special codes that meant "include a table of
endings here". That and Huffman compression, with multiple models.

This was all very ad hoc, but got some very good compression. If
you can find it, the old Word Challenge game (which the Jimmy Kuo
of Proximity did the first version of and I did the version you
are likely to find), had a database compressed this way. We got
80K+ words and the program itself onto one old Apple II disk, the
140K one. If my memory serves me, we got just slightly over one
byte per word.