Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: Notesfiles $Revision: 1.7.0.8 $; site ccvaxa
Path: utzoo!watmath!clyde!burl!ulysses!mhuxr!mhuxt!houxm!ihnp4!inuxc!pur-ee!uiucdcs!ccvaxa!preece
From: preece@ccvaxa.UUCP
Newsgroups: net.news
Subject: Re: keyword-based news
Message-ID: <1300016@ccvaxa>
Date: Thu, 10-Oct-85 11:49:00 EDT
Article-I.D.: ccvaxa.1300016
Posted: Thu Oct 10 11:49:00 1985
Date-Received: Sun, 13-Oct-85 04:04:28 EDT
References: <820@vortex.UUCP>
Lines: 70
Nf-ID: #R:vortex.UUCP:-82000:ccvaxa:1300016:000:3797
Nf-From: ccvaxa.UUCP!preece    Oct 10 10:49:00 1985


> /* Written  9:16 am  Oct  5, 1985 by mjl@ritcv.UUCP in ccvaxa:net.news
> */ However, the incremental CPU time needed to make even small gains in
> both precision and recall can be staggering.  When combined with the
> volume of database updates represented by a day's worth of news, I
> don't see how the use of keywords at the transmission level is
> practical.
----------
Well, you can visualize the problem more easily if you recognize
that the present newsgroup system is isomorphic to a keyword system,
with the name of the newsgroup being the keyword.  We get lots of
false hits because people cross-post without thinking about it
and we get lots of missed matches because people post things in the
wrong groups.  It's not clear to me that changing to an explicitly
keyword-based system is going to have much effect on the human
failings that cause the problems.

Everything in life is a tradeoff.  Using keywords allows the author
to specify more closely (by selecting more keywords) what the posting is
about, but also makes the posting appear to be relevant to more topics:
the difference is whether you consider the keywords to be ORed together
or ANDed together.  The best results come from using small numbers
of keywords assigned from a large, carefully thought out vocabulary
with hierarchical relationships among index terms, but that kind of
vocabulary is hard to learn, easy to misuse, and not well attuned to
changes in usage over time (ask a librarian about de-superimposition).
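
To make the ORed/ANDed distinction concrete, here is a minimal sketch
in Python.  The article names and keyword sets are made up for
illustration, and match_any/match_all are hypothetical helpers, not
part of any news software.

```python
# Hypothetical articles and their keyword sets (illustration only).
articles = {
    "a1": {"unix", "uucp", "mail"},
    "a2": {"unix", "news"},
    "a3": {"news", "keywords", "uucp"},
}

def match_any(query, articles):
    """ORed: an article matches if it shares at least one keyword."""
    return sorted(name for name, kws in articles.items() if kws & query)

def match_all(query, articles):
    """ANDed: an article matches only if it carries every query keyword."""
    return sorted(name for name, kws in articles.items() if query <= kws)

query = {"unix", "uucp"}
print(match_any(query, articles))  # ['a1', 'a2', 'a3']
print(match_all(query, articles))  # ['a1']
```

With ORed keywords every extra keyword widens the net; with ANDed
keywords every extra keyword narrows it.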

There are also search systems based on associations between articles
or between articles and queries.  These offer a lot of promise,
especially when (as on Usenet) the full text of the items is available.
The user says "Article X is what I'm really interested in" and the
system finds all the documents that are "like" article X, or the
user provides a natural language description of the topic of interest
and the system finds all the articles that are similar to that
description.  Similarity is usually based on similar vocabulary,
weighted so that some words count more than others, or other common
factors.  Citation patterns are very strong tools for similarity, too.
Articles that share many citations or that are often cited in the
same place are very likely to be similar in content (don't bother
sending counter-examples; all of this holds only on average, by the
law of large numbers).
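
A rough sketch of both kinds of similarity, in Python.  The weighting
here (rarer words count more, compared by cosine) and the overlap
measure for shared citations are common choices that I am assuming for
illustration; real retrieval systems differ in the details.  The
sample texts and all function names are hypothetical.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Weight each word by rarity: words in fewer documents count more."""
    n = len(docs)
    df = Counter(word for words in docs.values() for word in set(words))
    return {word: math.log(n / count) + 1.0 for word, count in df.items()}

def vector(words, weights):
    """Turn a list of words into a weighted term vector (a dict)."""
    counts = Counter(words)
    return {w: c * weights.get(w, 0.0) for w, c in counts.items()}

def cosine(u, v):
    """Cosine of the angle between two term vectors; 0.0 if either is empty."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def shared_citations(refs_a, refs_b):
    """Fraction of the combined citations that both articles share."""
    union = refs_a | refs_b
    return len(refs_a & refs_b) / len(union) if union else 0.0

# Made-up texts standing in for full articles.
docs = {
    "x": "keyword index for news articles".split(),
    "y": "inverted index for keyword search".split(),
    "z": "recipe for sourdough bread".split(),
}
weights = idf_weights(docs)
vecs = {name: vector(words, weights) for name, words in docs.items()}

# "Find articles like x": rank the others by similarity to x.
ranked = sorted((n for n in vecs if n != "x"),
                key=lambda n: cosine(vecs["x"], vecs[n]), reverse=True)
print(ranked)  # ['y', 'z']
```

A natural language query can be handled the same way: treat the
description as one more document and rank the articles against it.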

It's not clear that these mechanisms, developed for searching a
retrospective collection of documents, are applicable to running
a newsletter, which is a better model of Usenet.  The user might
mark specific discussions (tied together by citation) as being of
continuing interest (any new item citing any of a set of existing
items would be displayed), but the basic mechanism for viewing new
material would have to depend on a set of interest profiles, and a
profile is probably too specific a filter for general reading.  The
reader might very well want to read generally in the area of
feminism, with no more specific topic, and that kind of connection
is hard to make by association unless the author has specifically
tagged the item with a descriptor that can be placed in a
hierarchical subject space, which brings us right back to the
original problem -- that keyword might as well be the name of
a newsgroup.
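
The citation-following part, at least, is easy to sketch.  In the
Python below, select_new is a hypothetical helper, and I am assuming
each incoming article arrives as a message ID plus the list of IDs it
cites.  The first two IDs come from this article's own headers; the
"example" IDs are invented.

```python
def select_new(followed, incoming):
    """Return the IDs of incoming articles that cite a followed item.

    followed: set of message IDs the reader cares about; updated in
              place so that replies to replies are also picked up.
    incoming: list of (message_id, references) pairs in arrival order.
    """
    shown = []
    for message_id, references in incoming:
        if followed & set(references):
            shown.append(message_id)
            followed.add(message_id)
    return shown

followed = {"<820@vortex.UUCP>"}
incoming = [
    ("<1300016@ccvaxa>", ["<820@vortex.UUCP>"]),
    ("<1@example>", ["<1300016@ccvaxa>"]),   # invented follow-up
    ("<2@example>", []),                     # invented, unrelated
]
print(select_new(followed, incoming))  # ['<1300016@ccvaxa>', '<1@example>']
```

Note what it cannot do: an article that opens a new discussion cites
nothing, so this never finds it.  That is the job left to profiles.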

Oh, one more problem with keyword-based approaches: speed.  The
big database systems depend on inverted indexes: given a word, you
can get a list of all the items containing that word.  Maintaining
that kind of index for a database whose contents change daily would
be very expensive; doing without an inverted index would slow
the user interface to unusability.  The present system, of course,
has an inverted index: the list of newsgroups.
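
For anyone who hasn't met one, a toy inverted index in Python.  The
class and its method names are hypothetical, and a real system would
keep this on disk rather than in memory.  The point to notice is that
every arriving or expiring article touches one entry per distinct word
it contains, which is where the maintenance cost comes from.

```python
from collections import defaultdict

class InvertedIndex:
    """Toy in-memory index: word -> IDs of the articles containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of article IDs
        self.words_of = {}                # article ID -> its words, for expiry

    def add(self, article_id, text):
        """Index an article, replacing any earlier copy with the same ID."""
        self.expire(article_id)
        words = set(text.lower().split())
        self.words_of[article_id] = words
        for word in words:
            self.postings[word].add(article_id)

    def expire(self, article_id):
        """Remove an article; one posting list is touched per word."""
        for word in self.words_of.pop(article_id, ()):
            self.postings[word].discard(article_id)
            if not self.postings[word]:
                del self.postings[word]

    def lookup(self, word):
        """Given a word, list the articles containing it."""
        return sorted(self.postings.get(word.lower(), ()))

index = InvertedIndex()
index.add("a1", "keyword based news")
index.add("a2", "news transmission cost")
print(index.lookup("news"))     # ['a1', 'a2']
index.expire("a1")
print(index.lookup("news"))     # ['a2']
print(index.lookup("keyword"))  # []
```

The newsgroup list is the degenerate case of the same structure: one
"word" per article (or a few, with cross-posting), so the update cost
per article is trivial.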

-- 
scott preece
gould/csd - urbana
ihnp4!uiucdcs!ccvaxa!preece