Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!crdgw1!uunet!mcsun!ukc!edcastle!gtoal From: gtoal@castle.ed.ac.uk (G Toal) Newsgroups: comp.compression Subject: Re: fast string compr.? Message-ID: <10553@castle.ed.ac.uk> Date: 26 May 91 00:17:36 GMT References: <1991May23.235530.11443@looking.on.ca> Distribution: comp Organization: Edinburgh University Lines: 25 In article <1991May23.235530.11443@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes: >Data compression is not normally used for spell checkers. Actually it is. >Instead, most spell checkers use hashing. Create an array of around >B bits (around 600,000 is good and easy) and for each word in your dictionary, >apply N hashing functions that map from the word to a number from 1 to B. >Set the bit of that number. > >To see if a word is in the dictionary, apply the N functions to the word >and see if all the bits are set, if so, the word is valid, else it is >definitely not in. 'fraid Bloom filters went out of fashion years ago -- they're particularly bad at typo correction. Although the data compression seems good (you get a reasonably good probability of detecting errors by using about 20 bits per word), it is lossy, and a Dag data structure comes in at about the same -- and is just as fast if not faster -- and lossless. Also very good for error correction. If the original poster wants some old C code, I posted a set of spelling utilities to alt.sources about two years back. Look for spellutils/* or dawgutils/* - or mail me for a copy. Regards Graham