Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!elroy.jpl.nasa.gov!turnkey!orchard.la.locus.com!fafnir.la.locus.com!cjkuo
From: cjkuo@locus.com (Chengi Jimmy Kuo)
Newsgroups: comp.compression
Subject: Re: fast string compr.?
Message-ID: <1991May27.085944.172509@locus.com>
Date: 27 May 91 08:59:44 GMT
References: <LACA.91May20141838@vipunen.hut.fi> <1991May25.235703.159943@locus.com> <27May91.032155.8833@franklin.com>
Organization: Locus Computing Corporation, Los Angeles, California
Lines: 55

bill@franklin.com (bill) writes:

>In article <1991May25.235703.159943@locus.com> cjkuo@locus.com
>	(Chengi Jimmy Kuo) writes:

>(Are you the Jimmy Kuo who used to work at Proximity?)

Yes.  Hi Bill.  Wow, it's been 7 1/2 years!  (Sorry everybody for the
reminiscing.)  Say hi to everyone for me, will ya?

>: Compress the dictionary but having a table for the first 2 or 3 letters
>: (26x26, or 26x26x26, but actually much less since there are no qbX, etc.).
>: For the words, once you're passed the first 3 letters, start compressing by
>: using "repeat deltas".  A word is composed of N letters that repeat from the
>: previous word, followed by the delta.  You can compress the delta portion
>: using whatever method you want since it's straight alphabet letters (but I
>: choose not to).  Compression may not be the best, but data access is pretty
>: fast since decompression is trivial (also, strcmp is partially completed
>: due to this encoding).

>The problem is that you have to decompress, on average, half a
>bin, and if you have a lot of words, or lots of inflections, this
>will be awfully slow. The thing that made this kind of compression
>work for us (at Proximity) was that not only did we do the above,
>but we assigned special codes that meant "include a table of
>endings here". That and Huffman compression, with multiple models.

You're talking about the full compression for the full compression.
The algorithm I described above is just a subset of that.  It's
not the same one that was used for Word Challenge (below).  The one I
described is the one I used for a simple spell-check demo.  I used it for
the incore cache words.  Decompression was real fast because you could
skip all the words with a repeat larger than what you're looking for, and
other neat things happen when you start playing with this.

And all this, I understand is way out of date as far as Franklin/Proximity
is concerned now that you guys have encoded phonetics and thesauraus and
all that good stuff as well.

>This was all very ad hoc, but got some very good compression. If
>you can find it, the old Word Challenge game (which the Jimmy Kuo
>of Proximity did the first version of and I did the version you
>are likely to find), had a database compressed this way. We got
>80K+ words and the program itself onto one old Apple II disk, the
>140K one. If my memory serves me, we got just slightly over one
>byte per word.

Ah, Word Challenge, the game that never caught on.  The thing that
was so wonderful about that was Clean Disk.  Punch a hole in the 
diskette and it still works!  Technology before its time.

Jimmy
-- 
cjkuo@locus.com
"The correct answer to an either/or question is both!"