Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!crdgw1!uunet!mcsun!ukc!edcastle!gtoal
From: gtoal@castle.ed.ac.uk (G Toal)
Newsgroups: comp.compression
Subject: Re: fast string compr.?
Message-ID: <10553@castle.ed.ac.uk>
Date: 26 May 91 00:17:36 GMT
References: <LACA.91May20141838@vipunen.hut.fi> <1991May23.235530.11443@looking.on.ca>
Distribution: comp
Organization: Edinburgh University
Lines: 25

In article <1991May23.235530.11443@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
>Data compression is not normally used for spell checkers.
Actually it is.

>Instead, most spell checkers use hashing.   Create an array of around
>B bits (around 600,000 is good and easy) and for each word in your dictionary,
>apply N hashing functions that map from the word to a number from 1 to B.
>Set the bit of that number. 
>
>To see if a word is in the dictionary, apply the N functions to the word
>and see if all the bits are set, if so, the word is valid, else it is
>definitely not in.
'fraid Bloom filters went out of fashion years ago -- they're particularly
bad at typo correction.  Although the data compression seems good (you get
a reasonably good probability of detecting errors by using about 20 bits
per word), it is lossy, and a Dag data structure comes in at about the
same -- and is just as fast if not faster -- and lossless.  Also very good
for error correction.

If the original poster wants some old C code, I posted a set of spelling
utilities to alt.sources about two years back.  Look for spellutils/* or
dawgutils/* - or mail me for a copy.

Regards
   Graham