Xref: utzoo comp.sources.d:6978 news.admin:14590
Path: utzoo!utgpu!watserv1!watmath!att!linac!uwm.edu!caen!ox.com!msen.com!emv
From: emv@msen.com (Ed Vielmetti)
Newsgroups: comp.sources.d,news.admin
Subject: Re: UK Copyright libraries and Usenet
Message-ID: <EMV.91May23174209@bronte.aa.ox.com>
Date: 23 May 91 21:42:14 GMT
References: <3amo22w164w@mantis.co.uk> <10508@skye.cs.ed.ac.uk>
	<4592.283113da@iccgcc.decnet.ab.com>
	<1991May16.050935.29882@newshost.anu.edu.au>
	<EMV.91May16024531@poe.aa.ox.com> <4660.283b7d00@iccgcc.decnet.ab.com>
Sender: usenet@ox.com (Usenet News Administrator)
Followup-To: comp.sources.d,news.admin
Organization: MSEN, Inc. Ann Arbor MI
Lines: 83
In-Reply-To: herrickd@iccgcc.decnet.ab.com's message of 23 May 91 13:27:44 GMT

In article <4660.283b7d00@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes:

   In article <EMV.91May16024531@poe.aa.ox.com>, emv@ox.com (Ed Vielmetti) writes:
   [probably I should wait for the archive administration group, but Ed
    raised the subject now - I'm trying to keep it in news.admin]

Fine, news.admin it is.  I suppose I should x-post it over to
dnet.archiv or aus.archives or any of the other ``regional''
unmoderated groups.

   Ignoring the question of why you do it, I've been watching comp.archives
   and wondering HOW you do it.  You can't possibly read 500 newsgroups
   looking for postings that identify repositories.  Can you?  

Sure, why not?  It's a very simple first-pass filter: take in every
single article, grep it for some key words, and if the key words are in
there, set it aside for human processing.  Out of 10000 articles
in a day you can narrow things down to ca. 100-150, of which ca. 10-15
will be interesting once you've read them.  You could apply the same
filtering technique in about an hour's work with grep and a sys file
entry, and look for anything you can express in a shell script.  It's
quite reasonable to scan all of the various newsgroups if you have the
CPU handy.
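
A minimal sketch of that first pass, written in Python instead of grep
and a sys file entry.  The key word list and the function name are
illustrative assumptions, not the actual comp.archives setup.

```python
import re

# Hypothetical key words; the real list is not given in this article.
KEY_WORDS = ["anonymous ftp", "available for ftp", "archive site"]

# One case-insensitive pattern that matches any of the key words.
KEY_WORD_RE = re.compile(
    "|".join(re.escape(word) for word in KEY_WORDS), re.IGNORECASE
)


def first_pass(articles):
    """Return the articles that mention at least one key word."""
    kept = []
    for text in articles:
        # Collapse line breaks so a phrase split across lines still matches.
        flattened = " ".join(text.split())
        if KEY_WORD_RE.search(flattened):
            kept.append(text)
    return kept
```

Anything this returns still goes to a human; the filter only shrinks
the pile.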

Determining the set of key words is slightly tricky.  For
comp.archives, the ideal phrase to look for is
  this whizzy new package is available for anonymous ftp from 
      host.domain.org:/pub/whizzy/package-1.0.tar.Z
but there are a multitude of variations on that theme.  The AI project
is going to disambiguate between that ideal phrase and "does anyone
know where i can ftp some gif files from?".  Fortunately, with a big
screen and a fast cpu a human can chunk through those in 15-30 seconds
each, so automating that step is not so important.
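
As a rough illustration of how far plain pattern matching gets, the
Python sketch below pulls host:/path pairs out of an article and
treats a few request phrases as a sign that it is not an announcement.
The request cues and function names are assumptions made up for this
example; the many variations mentioned above would slip past it.

```python
import re

# host.domain:/path form, as in the ideal phrase above.
FTP_LOCATION_RE = re.compile(
    r"\b([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+):(/\S+)"
)

# Hypothetical cues that an article is a request, not an announcement.
REQUEST_RE = re.compile(
    r"does anyone know|where (?:can|could) i\b|looking for", re.IGNORECASE
)


def ftp_locations(text):
    """Return the (host, path) pairs found in the text."""
    return FTP_LOCATION_RE.findall(text)


def looks_like_announcement(text):
    """True if the text names a host:/path and has no request cue."""
    has_location = FTP_LOCATION_RE.search(text) is not None
    is_request = REQUEST_RE.search(text) is not None
    return has_location and not is_request
```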

With a big (5000+) collection of articles which have already passed
the comp.archives eyeball test, I should be able to feed potential
matches into a filter that looks for whether there are any other
similar articles in the database.  That will categorize things into
three sets:
 - noise (requests, mostly, or the bitftp discussion)
 - things which I've seen before (new releases or reviews or updates)
 - things which I've never seen before but which might be interesting
Then the above-mentioned human can go through a smaller set looking
for the interesting 10-15 articles.
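
One way that three-way split could be sketched in Python is shown
below.  It assumes a second list of previously rejected articles next
to the accepted collection, uses simple word overlap as the similarity
measure, and picks an arbitrary threshold; all three choices are
assumptions for illustration, not a description of a working system.

```python
def words(text):
    """Lower-cased set of whitespace-separated words in an article."""
    return set(text.lower().split())


def similarity(a, b):
    """Jaccard overlap between two word sets, from 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def categorize(candidate, accepted, rejected, threshold=0.5):
    """Sort a candidate article into 'noise', 'seen' or 'new'.

    accepted and rejected are lists of past article texts; threshold
    is an arbitrary cut-off chosen for this example.
    """
    cand = words(candidate)
    best_seen = max(
        (similarity(cand, words(t)) for t in accepted), default=0.0
    )
    best_noise = max(
        (similarity(cand, words(t)) for t in rejected), default=0.0
    )
    # Nothing in either collection is close: never seen before.
    if max(best_seen, best_noise) < threshold:
        return "new"
    return "seen" if best_seen >= best_noise else "noise"
```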

   You recently
   doubled the effort of making the posting by adding that verification
   of the accessibility of the material.

No extra thought-power here, just a little more time.  All of the
packages thus far identified have their home sites identified in a
database; the verification step goes off and fetches the directory
information and sets it aside.  Some shell scripts make these easy to
type, but they're basically no-brainers.
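
The verification step described above is done with shell scripts; an
equivalent sketch using Python's standard ftplib module might look like
this.  The function name and the ftp_factory parameter (there only so
the code can be exercised without a network) are assumptions.

```python
import ftplib


def fetch_listing(host, directory, ftp_factory=ftplib.FTP):
    """Log in anonymously and return the directory listing as lines."""
    lines = []
    ftp = ftp_factory(host)      # passing a host connects immediately
    try:
        ftp.login()              # no arguments means anonymous login
        ftp.cwd(directory)
        ftp.retrlines("LIST", lines.append)
    finally:
        ftp.close()
    return lines
```

For example, fetch_listing("host.domain.org", "/pub/whizzy") would
return the listing to set aside next to the article, with the host and
path taken from the placeholder names used earlier.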

   So, is MSEN studying artificial intelligence?  Does a VP Research have
   time to be working on a dissertation about information retrieval that
   is analogous to snatching a drink from a fire hose?

No AI here, just honest work.  None of the ``information retrieval''
stuff that I've looked at so far has been very well indexed or
cataloged, so my first-pass conclusion is that existing research
efforts aren't very good.  Most of the hard-core information retrieval
engines (that cost real money) that would do the sorts of things I'm
doing fail in interesting ways on the usenet problem; it would seem
more likely to be worthwhile to build a functioning "expert system
that reads netnews and looks for articles that belong in
comp.archives" than to go off degree-hunting.  Especially if that
expert system can be trained to go after things in the charter of
other arbitrary newsgroups.

Any pointers to other functioning "read all the news and find the good
parts" systems are welcomed -- I've heard of something called NewsPeek
apparently running out at MIT, but nothing that I know of off hand
that runs off the relatively dirty usenet wire rather than the tidy
AP, UPI, or Dow Jones feeds.

-- 
Edward Vielmetti, moderator, comp.archives, emv@msen.com

"He who hesitates is last;" "The point man takes the hits;" "It's easier to
get forgiveness than permission;" "There's no harm in asking."  Pick your
aphorism and live by it. 		             -- Stephen Wolff, NSF