Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!think.com!spool.mu.edu!uunet!looking!brad From: brad@looking.on.ca (Brad Templeton) Newsgroups: comp.compression Subject: Re: fast string compr.? Message-ID: <1991May23.235530.11443@looking.on.ca> Date: 23 May 91 23:55:30 GMT Article-I.D.: looking.1991May23.235530.11443 References: Distribution: comp Organization: Looking Glass Software Ltd. Lines: 23 Data compression is not normally used for spell checkers. Instead, most spell checkers use hashing. Create an array of around B bits (around 600,000 is good and easy) and for each word in your dictionary, apply N hashing functions that map from the word to a number from 1 to B. Set the bit of that number. To see if a word is in the dictionary, apply the N functions to the word and see if all the bits are set, if so, the word is valid, else it is definitely not in. Note that this will say that some bad words are correct, it never says that a valid word is not found. The probablilty of saying a bad word is correct can be reduced until it is meaningless. If you have a 30,000 word dictionary and 5 functions then then ~1/4 of the words are set. The probability of a bad word giving yes is .25 ^ 5 or .09%. There is a formula that works out the optimal values of N for B and the dictionary size. If you are really worried, take common misspellings and special case them. -- Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473