Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site ritcv.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!allegra!mit-eddie!think!harvard!seismo!rochester!ritcv!mjl
From: mjl@ritcv.UUCP (Mike Lutz)
Newsgroups: net.news
Subject: Re: keyword-based news
Message-ID: <8936@ritcv.UUCP>
Date: Sat, 5-Oct-85 10:16:25 EDT
Article-I.D.: ritcv.8936
Posted: Sat Oct 5 10:16:25 1985
Date-Received: Tue, 8-Oct-85 03:13:56 EDT
References: <820@vortex.UUCP>
Reply-To: mjl@ritcv.UUCP (Michael Lutz)
Organization: Rochester Institute of Technology, Rochester, NY
Lines: 24

In article <820@vortex.UUCP> lauren@vortex.UUCP (Lauren Weinstein) writes:
>For quite a few years, I've been using a very elaborate keyword-based
>system for searching a large newswire story database ...
>
>One thing I learned long ago thanks to this system--it is almost
>IMPOSSIBLE to avoid major missed matches AND extra matches.

Lauren, as usual, is right on the money. This problem is known to the
Information Retrieval folks as precision vs. recall. Precision is the
fraction of retrieved items that are relevant; recall is the fraction
of relevant items that are retrieved. As Lauren has noted, you
generally have to trade one off against the other. And this ignores
entirely the subjective nature of 'relevancy.'

There's been a lot of work in this area (see, for example, the evolving
SMART system from Salton's group at Cornell). However, the incremental
CPU time needed to make even small gains in both precision and recall
can be staggering. When combined with the volume of database updates
represented by a day's worth of news, I don't see how the use of
keywords at the transmission level is practical.
-- 
Mike Lutz	Rochester Institute of Technology, Rochester NY
UUCP:		{allegra,seismo}!rochester!ritcv!mjl
CSNET:		mjl%rit@csnet-relay.ARPA
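
For anyone who wants the two measures from the post pinned down, here is a minimal sketch in Python. It is only an illustration of the definitions given above (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant). The function name `precision_recall` and the article IDs are made up for the example and do not come from SMART or any news software, and returning 0.0 for an empty set is an arbitrary convention chosen here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the keyword match returned.
    relevant:  IDs a reader would actually have wanted.
    Both measures fall back to 0.0 when their denominator is empty
    (an assumption made for this sketch, not a standard).
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant items that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical article IDs, purely for illustration.
relevant = {"a1", "a2", "a3", "a4"}

# A narrow keyword set: everything returned is wanted, but half is missed.
narrow = {"a1", "a2"}

# A broad keyword set: nothing is missed, but half the matches are noise.
broad = {"a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4"}

print(precision_recall(narrow, relevant))
print(precision_recall(broad, relevant))
```

The narrow query corresponds to Lauren's "missed matches" and the broad one to the "extra matches": tightening the keywords raises precision at the cost of recall, and loosening them does the reverse.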