Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!decwrl!frnkmth!bill From: bill@franklin.com (bill) Newsgroups: comp.compression Subject: Re: fast string compr.? Message-ID: <24May91.206106.345@franklin.com> Date: 24 May 91 20:51:06 GMT References: <1991May23.235530.11443@looking.on.ca> Distribution: comp Organization: Franklin Electronic Publishers Lines: 44 In article <1991May23.235530.11443@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes: : Data compression is not normally used for spell checkers. Hokum. The data that goes into, for example, one of our Italian spelling checkers is around 13M of text. It is compressed into around 64K, reversibly. I think that can be called data compression. : Instead, most spell checkers use hashing. No, Brad, they do not. As one of the developers of the most widely used OEM spelling checker, and having had to take a good look at our competitors at various times, I can say this with authority. Furthermore, hash techniques were generally discarded over nine years ago because they accept way too many spelling errors. And, no, you cannot make them acceptable by hard coding the common spelling mistakes they fail on. No matter how low the probability of false acceptance is made, it is still high enough to make the customers wince. *Please*, people, do not use this algorithm for a spelling checker. It has its uses, but this is not one of them. There was an article in a recent DDJ, if one is interested in implementation details of the hash algorithm that Brad described. I don't recall the exact title, but it had the word "existence" in it. (That's an example of brain damage if ever I saw one; this technique is useful for testing for the *nonexistence* of something, or for screening for the *possible* existence of something.) BTW, the *correct* answer to the question of the "normal" way of storing the lexicon is this: store the lexicon as a trie. This is easily the best method for heavily inflected languages like Finnish, and is used by most of the big players. The other method of doing it is to store the dictionary in a sorted order possibly with an in-memory index and/or a small cache of frequently used words to avoid most disk lookups. This, too, typically involves a rudimentary kind of data compression (dictionary style), in that groups of inflected words are often stored as a root and inflection information.