Path: utzoo!utstat!jarvis.csri.toronto.edu!rutgers!apple!usc!cs.utexas.edu!uunet!aplcen!haven!uflorida!novavax!twwells!bill From: bill@twwells.com (T. William Wells) Newsgroups: news.software.b Subject: Re: Not Checksums, how about string search hash codes? Message-ID: <1989Nov20.073448.18953@twwells.com> Date: 20 Nov 89 07:34:48 GMT References: <49602@looking.on.ca> Organization: None, Ft. Lauderdale, FL Lines: 27 In article <49602@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes: : As I understand it (it's been a while) there are some nifty algorithms : that can be used to generate hash numbers than can make string searches : in regions of text very fast. : : In particular, as I understand it, using such hash numbers, you can tell : right away 95% of the time if a given string is NOT in a body of text. : ie. if the function returns false, you can be sure the string is not found. : If it's true, it might be found, and you have to search further -- different : hash functions, and eventually a full text search to verify things. : : Anybody know more about this and care to post it? This is easy enough. Just take any hashing function and use it to set bits in a bit table for each word in the article. If a hashed word creates a bit that is not in the table, the word is not in the article. As an extension, you can use multiple hash codes and OR their respective bit maps. You then test each hash function in turn. Unfortunately, I don't recall references or any more details, though I could find out more at work. --- Bill { uunet | novavax | ankh | sunvice } !twwells!bill bill@twwells.com