Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: Notesfiles $Revision: 1.7.0.8 $; site ccvaxa
Path: utzoo!watmath!clyde!burl!ulysses!mhuxr!mhuxt!houxm!ihnp4!inuxc!pur-ee!uiucdcs!ccvaxa!preece
From: preece@ccvaxa.UUCP
Newsgroups: net.news
Subject: Re: keyword-based news
Message-ID: <1300016@ccvaxa>
Date: Thu, 10-Oct-85 11:49:00 EDT
Article-I.D.: ccvaxa.1300016
Posted: Thu Oct 10 11:49:00 1985
Date-Received: Sun, 13-Oct-85 04:04:28 EDT
References: <820@vortex.UUCP>
Lines: 70
Nf-ID: #R:vortex.UUCP:-82000:ccvaxa:1300016:000:3797
Nf-From: ccvaxa.UUCP!preece Oct 10 10:49:00 1985

> /* Written 9:16 am Oct 5, 1985 by mjl@ritcv.UUCP in ccvaxa:net.news */
> However, the incremental CPU time needed to make even small gains in
> both precision and recall can be staggering. When combined with the
> volume of database updates represented by a day's worth of news, I
> don't see how the use of keywords at the transmission level is
> practical.
----------
Well, you can visualize the problem more easily if you recognize that
the present newsgroup system is isomorphic to a keyword system, with
the name of the newsgroup being the keyword. We get lots of false hits
because people cross-post without thinking about it, and we get lots of
missed matches because people post things in the wrong groups. It's not
clear to me that changing to an explicitly keyword-based system is
going to have much effect on the human failings that cause the
problems.

Everything in life is a tradeoff. Using keywords allows the author to
specify more closely (by selecting more keywords) what the posting is
about, but also makes the posting appear to be relevant to more topics:
the difference is whether you consider the keywords to be ORed together
or ANDed together.
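The ORed-versus-ANDed distinction can be made concrete with a small sketch. It assumes each article simply carries a set of keywords; the article names and keyword sets below are invented for illustration and do not come from any real news software.

```python
# Hypothetical articles, each tagged with a set of author-chosen keywords.
ARTICLES = {
    "a1": {"unix", "news"},
    "a2": {"news", "keywords", "retrieval"},
    "a3": {"unix"},
}


def match_any(query, articles):
    """OR semantics: an article matches if it shares at least one keyword."""
    return sorted(name for name, kws in articles.items() if kws & query)


def match_all(query, articles):
    """AND semantics: an article matches only if it carries every query keyword."""
    return sorted(name for name, kws in articles.items() if query <= kws)


query = {"unix", "news"}
print(match_any(query, ARTICLES))  # ['a1', 'a2', 'a3']
print(match_all(query, ARTICLES))  # ['a1']
```

Under OR semantics every extra keyword an author attaches makes the article show up for more readers (more false hits); under AND semantics the extra keywords only help readers who ask for all of them.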
The best results come from using small numbers of keywords assigned
from a large, carefully thought-out vocabulary with hierarchical
relationships among index terms, but that kind of vocabulary is hard to
learn, easy to misuse, and not well attuned to changes in usage over
time (ask a librarian about de-superimposition).

There are also search systems based on associations between articles or
between articles and queries. These offer a lot of promise, especially
when (as on Usenet) the full text of the items is available. The user
says "Article X is what I'm really interested in" and the system finds
all the documents that are "like" article X, or the user provides a
natural language description of the topic of interest and the system
finds all the articles that are similar to that description.
Similarity is usually based on shared vocabulary, weighted so that some
words count more than others, or on other common factors. Citation
patterns are very strong tools for similarity, too. Articles that share
many citations or that are often cited in the same place are very
likely to be similar in content (don't bother sending counter-examples;
all of this is biased by the law of large numbers).

It's not clear that these mechanisms, developed for searching a
retrospective collection of documents, are applicable to running a
newsletter, which is a better model of Usenet. The user might have
specific discussions (tied together by citation) that were assumed to
be of continuing interest (any new items citing any of a set of
existing items would be displayed), but the basic mechanism for viewing
new material would have to depend on a set of profiles of interest,
which is probably too specific.
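As a rough illustration of the "find articles like article X" idea described above, here is a minimal sketch of vocabulary-based similarity. The weighting scheme (words that appear in fewer articles count more, compared by cosine similarity) is one common choice and an assumption on my part, not something the discussion specifies; the sample texts and names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical article texts, keyed by made-up identifiers.
DOCS = {
    "x": "inverted index for keyword search",
    "y": "keyword search with an inverted index",
    "z": "recipe for lentil soup",
}


def tokenize(text):
    return text.lower().split()


def rarity_weights(docs):
    """Weight each word by rarity: words found in fewer articles count more."""
    n = len(docs)
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))
    return {word: math.log(n / count) + 1.0 for word, count in doc_freq.items()}


def vector(text, weights):
    """Turn a text into a sparse word -> weighted count mapping."""
    counts = Counter(tokenize(text))
    return {word: c * weights.get(word, 0.0) for word, c in counts.items()}


def cosine(a, b):
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(target, docs):
    """Rank every other article by similarity to the target, best first."""
    weights = rarity_weights(docs)
    target_vec = vector(docs[target], weights)
    scores = {
        name: cosine(target_vec, vector(text, weights))
        for name, text in docs.items()
        if name != target
    }
    return sorted(scores, key=scores.get, reverse=True)


print(most_similar("x", DOCS))  # ['y', 'z']
```

Note that this needs the full text of every article on hand, which is the retrospective-collection setting; it says nothing yet about how to profile a reader's general interests.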
The reader might very well want to read generally in the area of
feminism, with no more specific topic, and that kind of connection is
hard to make by association unless the author has specifically tagged
the item with a descriptor that can be placed in a hierarchical subject
space, which brings us right back to the original problem -- that
keyword might as well be the name of a newsgroup.

Oh, one more problem with keyword-based approaches: speed. The big
database systems depend on inverted indexes: given a word, you can get
a list of all the items containing that word. Maintaining that kind of
index for a database whose contents change daily would be very
expensive; doing without an inverted index would slow the user
interface to unusability. The present system, of course, has an
inverted index: the list of newsgroups.

-- 
scott preece
gould/csd - urbana
ihnp4!uiucdcs!ccvaxa!preece
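For readers unfamiliar with the term, a minimal sketch of an inverted index follows. The class and method names are hypothetical, and a real system would keep this on disk rather than in memory; the point is only to show why daily arrivals and expirations are costly, since each one touches a posting list for every distinct word in the article.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps each word to the set of article ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of article ids
        self.words_of = {}                # article id -> set of its words

    def add(self, article_id, text):
        """Index a new article: one posting-list update per distinct word."""
        words = set(text.lower().split())
        self.words_of[article_id] = words
        for word in words:
            self.postings[word].add(article_id)

    def remove(self, article_id):
        """Expire an article: again one update per distinct word."""
        for word in self.words_of.pop(article_id, set()):
            self.postings[word].discard(article_id)
            if not self.postings[word]:
                del self.postings[word]

    def lookup(self, word):
        """Given a word, list all the articles containing it."""
        return sorted(self.postings.get(word.lower(), set()))


index = InvertedIndex()
index.add("art1", "keyword based news")
index.add("art2", "news transmission cost")
print(index.lookup("news"))     # ['art1', 'art2']
index.remove("art1")
print(index.lookup("news"))     # ['art2']
print(index.lookup("keyword"))  # []
```

Keying the same structure on newsgroup names instead of words gives one short posting list per group, which is the sense in which the list of newsgroups already serves as an inverted index.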