Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site ritcv.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!allegra!mit-eddie!think!harvard!seismo!rochester!ritcv!mjl
From: mjl@ritcv.UUCP (Mike Lutz)
Newsgroups: net.news
Subject: Re: keyword-based news
Message-ID: <8936@ritcv.UUCP>
Date: Sat, 5-Oct-85 10:16:25 EDT
Article-I.D.: ritcv.8936
Posted: Sat Oct  5 10:16:25 1985
Date-Received: Tue, 8-Oct-85 03:13:56 EDT
References: <820@vortex.UUCP>
Reply-To: mjl@ritcv.UUCP (Michael Lutz)
Organization: Rochester Institute of Technology, Rochester, NY
Lines: 24

In article <820@vortex.UUCP> lauren@vortex.UUCP (Lauren Weinstein) writes:
>For quite a few years, I've been using a very elaborate keyword-based
>system for searching a large newswire story database ...
>
>One thing I learned long ago thanks to this system--it is almost
>IMPOSSIBLE to avoid major missed matches AND extra matches.

Lauren, as usual, is right on the money.  This problem is known to the
Information Retrieval folks as precision vs. recall.  Precision is the
fraction of retrieved items that are relevant; recall is the fraction
of relevant items that are retrieved.  As Lauren has noted, you generally
have to trade one off against the other.  And this ignores entirely the
subjective nature of 'relevancy.'
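
To make the two definitions concrete, here is a minimal sketch in Python.
The function name, the article IDs, and the two sample queries are all
made up for illustration; it also assumes somebody has already decided
which articles count as relevant, which is exactly the subjective part.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the keyword search returned
    relevant:  IDs a reader would actually have wanted
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant items that were actually retrieved

    # Guard against empty sets so the ratios stay defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical article IDs, for illustration only.
relevant = {"a1", "a2", "a3", "a4"}

# A narrow keyword list: everything returned is wanted, but half is missed.
narrow = {"a1", "a2"}

# A broad keyword list: nothing is missed, but half the results are noise.
broad = {"a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4"}

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

The narrow query corresponds to Lauren's "missed matches" and the broad
one to his "extra matches": widening the keyword list raises recall at
the cost of precision, and narrowing it does the reverse.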

There's been a lot of work in this area (see, for example, the evolving
SMART system from Salton's group at Cornell).  However, the incremental
CPU time needed to make even small gains in both precision and recall
can be staggering.  When combined with the volume of database updates
represented by a day's worth of news, I don't see how the use of
keywords at the transmission level is practical.
-- 
Mike Lutz	Rochester Institute of Technology, Rochester NY
UUCP:		{allegra,seismo}!rochester!ritcv!mjl
CSNET:		mjl%rit@csnet-relay.ARPA