Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!elroy.jpl.nasa.gov!decwrl!frnkmth!bill From: bill@franklin.com (bill) Newsgroups: comp.compression Subject: Re: fast string compr.? Message-ID: <28May91.065529.39761@franklin.com> Date: 28 May 91 06:55:29 GMT References: <1991May25.235703.159943@locus.com> <27May91.032155.8833@franklin.com> <1991May27.085944.172509@locus.com> Organization: Franklin Electronic Publishers Lines: 46 In article <1991May27.085944.172509@locus.com> cjkuo@locus.com (Chengi Jimmy Kuo) writes: : bill@franklin.com (bill) writes: : >(Are you the Jimmy Kuo who used to work at Proximity?) : : Yes. Hi Bill. Wow, it's been 7 1/2 years! (Sorry everybody for the : reminiscing.) Say hi to everyone for me, will ya? I've passed them your e-mail address. There's still a few around whom you might know. : >: Compress the dictionary but having a table for the first 2 or 3 letters : >: (26x26, or 26x26x26, but actually much less since there are no qbX, etc.). : >: For the words, once you're passed the first 3 letters, start compressing by : >: using "repeat deltas". A word is composed of N letters that repeat from the : >: previous word, followed by the delta. You can compress the delta portion : >: using whatever method you want since it's straight alphabet letters (but I : >: choose not to). Compression may not be the best, but data access is pretty : >: fast since decompression is trivial (also, strcmp is partially completed : >: due to this encoding). : : >The problem is that you have to decompress, on average, half a : >bin, and if you have a lot of words, or lots of inflections, this : >will be awfully slow. The thing that made this kind of compression : >work for us (at Proximity) was that not only did we do the above, : >but we assigned special codes that meant "include a table of : >endings here". That and Huffman compression, with multiple models. : : You're talking about the full compression for the full compression. : The algorithm I described above is just a subset of that. It's : not the same one that was used for Word Challenge (below). The one I : described is the one I used for a simple spell-check demo. I used it for : the incore cache words. Decompression was real fast because you could : skip all the words with a repeat larger than what you're looking for, and : other neat things happen when you start playing with this. Then we're saying the same thing, actually. If the list is small enough, this will be very fast. If you have a lot of words, it can get slow. And if space is of real concern and you do additional compression, you'll pay for it in time for the decompression itself and because a lot of the speed-up tricks you allude to won't work if you've done more compression. I guess the person who asked the original question is going to have to weigh the pros and cons for himself; I wonder if any of this is even relevant to his intent?