Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!elroy.jpl.nasa.gov!decwrl!frnkmth!bill From: bill@franklin.com (bill) Newsgroups: comp.compression Subject: Re: fast string compr.? Message-ID: <27May91.032155.8833@franklin.com> Date: 27 May 91 03:21:55 GMT References: <1991May25.235703.159943@locus.com> Organization: Franklin Electronic Publishers Lines: 29 In article <1991May25.235703.159943@locus.com> cjkuo@locus.com (Chengi Jimmy Kuo) writes: (Are you the Jimmy Kuo who used to work at Proximity?) : Compress the dictionary but having a table for the first 2 or 3 letters : (26x26, or 26x26x26, but actually much less since there are no qbX, etc.). : For the words, once you're passed the first 3 letters, start compressing by : using "repeat deltas". A word is composed of N letters that repeat from the : previous word, followed by the delta. You can compress the delta portion : using whatever method you want since it's straight alphabet letters (but I : choose not to). Compression may not be the best, but data access is pretty : fast since decompression is trivial (also, strcmp is partially completed : due to this encoding). The problem is that you have to decompress, on average, half a bin, and if you have a lot of words, or lots of inflections, this will be awfully slow. The thing that made this kind of compression work for us (at Proximity) was that not only did we do the above, but we assigned special codes that meant "include a table of endings here". That and Huffman compression, with multiple models. This was all very ad hoc, but got some very good compression. If you can find it, the old Word Challenge game (which the Jimmy Kuo of Proximity did the first version of and I did the version you are likely to find), had a database compressed this way. We got 80K+ words and the program itself onto one old Apple II disk, the 140K one. If my memory serves me, we got just slightly over one byte per word.