Path: utzoo!utstat!jarvis.csri.toronto.edu!mailrus!iuvax!watmath!looking!brad From: brad@looking.on.ca (Brad Templeton) Newsgroups: news.software.b Subject: Modifying news storage for fast searches Message-ID: <51195@looking.on.ca> Date: 22 Nov 89 06:58:02 GMT Organization: Looking Glass Software Limited, Waterloo ON Lines: 44 Class: query Well, if news.software.b is any indication, a fast search system for News might not be too bad. The average article in news.software.b this week had only 140 different words in it. In such cases, usually at least half those words are 'common' words that would not be used in a search. To be conservative, we could say that a high estimate for the average number of important words in an article would be under 100. Better software about common words could drop this even further. Using a hashing algorithm, if those 100 words were hashed into an array of 650 bits (can be expressed in 100 printable chars) we get an error rate of 15% or less, meaning that if we ask if word X is in the document, and it says yes, there is a less than 15% chance it is wrong. (If it says no, we can be sure X is not in the document) 15% ain't so hot, but if you are doing a 2 word search (ie. "Henry Spencer") then the probability drops even more. If you got 2 yes answers on those words, the chance of having neither is around 2.25%, although the chance of having only one is just as bad. Somehow I feel there are better algorithms, though, but my request for info hasn't drawn anything. I will have to search further. Another idea is to store articles in a special compressed form that lists the dictionary first (ie. the list of words) followed by the text expressed and indices into the word list. Turns out for this group that the average word list would be 870 bytes long. Not too bad. This doesn't cost any disk space though, as this method actually reduces the file size -- although I don't know how much, if any, it hurts compress. You want to store the 870 byte index in an index file outside the article, too. For typical news flow, that's a cost of 2.6 meg/day, which is a bit high, although you could compress it a lot, as well. But the advantage would be a complete accurate search facility. You could ask it to show you all the articles that mention Henry Spencer and get an answer almost right away. Any further ideas are welcome. A decoder for this is simple. -- Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473