Path: utzoo!utstat!jarvis.csri.toronto.edu!mailrus!iuvax!watmath!looking!brad
From: brad@looking.on.ca (Brad Templeton)
Newsgroups: news.software.b
Subject: Modifying news storage for fast searches
Message-ID: <51195@looking.on.ca>
Date: 22 Nov 89 06:58:02 GMT
Organization: Looking Glass Software Limited, Waterloo ON
Lines: 44
Class: query

Well, if news.software.b is any indication, a fast search system for
News might not be too bad.

The average article in news.software.b this week had only 140 different
words in it.  In such cases, usually at least half those words are
'common' words that would not be used in a search.  To be conservative,
we could say that a high estimate for the average number of important
words in an article would be under 100.   Better software about common
words could drop this even further.

Using a hashing algorithm, if those 100 words were hashed into an array of
650 bits (can be expressed in 100 printable chars) we get an error rate of
15% or less, meaning that if we ask if word X is in the document, and it
says yes, there is a less than 15% chance it is wrong.  (If it says no, we
can be sure X is not in the document)

15% ain't so hot, but if you are doing a 2 word search (ie. "Henry Spencer")
then the probability drops even more.  If you got 2 yes answers on those
words, the chance of having neither is around 2.25%, although the chance
of having only one is just as bad.

Somehow I feel there are better algorithms, though, but my request for info
hasn't drawn anything.  I will have to search further.

Another idea is to store articles in a special compressed form that lists
the dictionary first (ie. the list of words) followed by the text expressed
and indices into the word list.

Turns out for this group that the average word list would be 870 bytes long.
Not too bad.  This doesn't cost any disk space though, as this method actually
reduces the file size -- although I don't know how much, if any, it hurts
compress.   You want to store the 870 byte index in an index file outside
the article, too.   For typical news flow, that's a cost of 2.6 meg/day,
which is a bit high, although you could compress it a lot, as well.

But the advantage would be a complete accurate search facility.  You could
ask it to show you all the articles that mention Henry Spencer and get an
answer almost right away.

Any further ideas are welcome.

A decoder for this is simple.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473