Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!elroy.jpl.nasa.gov!turnkey!orchard.la.locus.com!fafnir.la.locus.com!cjkuo From: cjkuo@locus.com (Chengi Jimmy Kuo) Newsgroups: comp.compression Subject: Re: fast string compr.? Message-ID: <1991May27.085944.172509@locus.com> Date: 27 May 91 08:59:44 GMT References: <1991May25.235703.159943@locus.com> <27May91.032155.8833@franklin.com> Organization: Locus Computing Corporation, Los Angeles, California Lines: 55 bill@franklin.com (bill) writes: >In article <1991May25.235703.159943@locus.com> cjkuo@locus.com > (Chengi Jimmy Kuo) writes: >(Are you the Jimmy Kuo who used to work at Proximity?) Yes. Hi Bill. Wow, it's been 7 1/2 years! (Sorry everybody for the reminiscing.) Say hi to everyone for me, will ya? >: Compress the dictionary but having a table for the first 2 or 3 letters >: (26x26, or 26x26x26, but actually much less since there are no qbX, etc.). >: For the words, once you're passed the first 3 letters, start compressing by >: using "repeat deltas". A word is composed of N letters that repeat from the >: previous word, followed by the delta. You can compress the delta portion >: using whatever method you want since it's straight alphabet letters (but I >: choose not to). Compression may not be the best, but data access is pretty >: fast since decompression is trivial (also, strcmp is partially completed >: due to this encoding). >The problem is that you have to decompress, on average, half a >bin, and if you have a lot of words, or lots of inflections, this >will be awfully slow. The thing that made this kind of compression >work for us (at Proximity) was that not only did we do the above, >but we assigned special codes that meant "include a table of >endings here". That and Huffman compression, with multiple models. You're talking about the full compression for the full compression. The algorithm I described above is just a subset of that. It's not the same one that was used for Word Challenge (below). The one I described is the one I used for a simple spell-check demo. I used it for the incore cache words. Decompression was real fast because you could skip all the words with a repeat larger than what you're looking for, and other neat things happen when you start playing with this. And all this, I understand is way out of date as far as Franklin/Proximity is concerned now that you guys have encoded phonetics and thesauraus and all that good stuff as well. >This was all very ad hoc, but got some very good compression. If >you can find it, the old Word Challenge game (which the Jimmy Kuo >of Proximity did the first version of and I did the version you >are likely to find), had a database compressed this way. We got >80K+ words and the program itself onto one old Apple II disk, the >140K one. If my memory serves me, we got just slightly over one >byte per word. Ah, Word Challenge, the game that never caught on. The thing that was so wonderful about that was Clean Disk. Punch a hole in the diskette and it still works! Technology before its time. Jimmy -- cjkuo@locus.com "The correct answer to an either/or question is both!"