Xref: utzoo comp.sources.d:6978 news.admin:14590
Path: utzoo!utgpu!watserv1!watmath!att!linac!uwm.edu!caen!ox.com!msen.com!emv
From: emv@msen.com (Ed Vielmetti)
Newsgroups: comp.sources.d,news.admin
Subject: Re: UK Copyright libraries and Usenet
Message-ID:
Date: 23 May 91 21:42:14 GMT
References: <3amo22w164w@mantis.co.uk> <10508@skye.cs.ed.ac.uk> <4592.283113da@iccgcc.decnet.ab.com> <1991May16.050935.29882@newshost.anu.edu.au> <4660.283b7d00@iccgcc.decnet.ab.com>
Sender: usenet@ox.com (Usenet News Administrator)
Followup-To: comp.sources.d,news.admin
Organization: MSEN, Inc. Ann Arbor MI
Lines: 83
In-Reply-To: herrickd@iccgcc.decnet.ab.com's message of 23 May 91 13:27:44 GMT

In article <4660.283b7d00@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes:

   In article , emv@ox.com (Ed Vielmetti) writes:

   [probably I should wait for the archive administration group, but
   Ed raised the subject now - i'm trying to keep it in news.admin]

fine, news.admin it is. I suppose I should x-post it over to
dnet.archiv or aus.archives or any of the other ``regional''
unmoderated groups.

   Ignoring the question of why you do it, I've been watching
   comp.archives and wondering HOW you do it. You can't possibly
   read 500 newsgroups looking for postings that identify
   repositories. Can you?

Sure, why not? It's a very simple first-pass filter: take in every
single article, grep it for some key words, and if the key words are
in there, save it aside for human processing. Out of 10000 articles
in a day you can narrow things down to ca. 100-150, of which ca. 10-15
will be interesting once you've read them. You could apply the same
filtering technique in about an hour's work with grep and a sys file
entry, and look for anything you can express in a shell script. It's
quite reasonable to scan all of the various newsgroups if you have the
CPU handy.

Determining the set of key words is the slightly tricky part.
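
The first-pass filter described above can be sketched roughly as
follows. This is a minimal illustration in Python, not the grep and
sys file setup the post refers to; the key word list and the name
first_pass are assumptions made up for the example, since the post
does not give the real list.

```python
import re

# Hypothetical key words, chosen only for illustration; the actual
# list used for comp.archives is not given in the post.
KEY_WORDS = ["anonymous ftp", "available for ftp", "tar.Z", "/pub/"]

# One case-insensitive pattern that matches any of the key words.
PATTERN = re.compile(
    "|".join(re.escape(word) for word in KEY_WORDS),
    re.IGNORECASE,
)


def first_pass(articles):
    """Return the articles that mention at least one key word.

    Everything returned here still has to be read by a human; this
    step only shrinks the pile.
    """
    return [text for text in articles if PATTERN.search(text)]


if __name__ == "__main__":
    sample = [
        "this whizzy new package is available for anonymous ftp from "
        "host.domain.org:/pub/whizzy/package-1.0.tar.Z",
        "Meeting moved to Thursday, same room.",
    ]
    for hit in first_pass(sample):
        print(hit)
```

A broader key word such as plain "ftp" would catch more
announcements but would also let more requests through, which is the
trade-off behind picking the list.
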
The ideal phrase to look for for comp.archives is "this whizzy new
package is available for anonymous ftp from
host.domain.org:/pub/whizzy/package-1.0.tar.Z", but there are a
multitude of variations on that theme. The AI project is going to
disambiguate between that ideal phrase and "does anyone know where i
can ftp some gif files from?". Fortunately, with a big screen and a
fast cpu a human can chunk through those in 15-30 seconds each, so
it's not so important.

With a big (5000+) collection of articles which have already passed
the comp.archives eyeball test, I should be able to feed potential
matches into a filter that checks whether there are any other similar
articles in the database. That will categorize things into three
sets:

- noise (requests, mostly, or the bitftp discussion)
- things which I've seen before (new releases or reviews or updates)
- things which I've never seen before but which might be interesting

Then the above-mentioned human can go through a smaller set looking
for the interesting 10-15 articles.

   You recently doubled the effort of making the posting by adding
   that verification of the accessibility of the material.

No extra thought-power here, just a little more time. All of the
packages thus far identified have their home sites identified in a
database; the verification step goes off and fetches the directory
information and sets it aside. Some shell scripts make these easy to
type, but they're basically no-brainers.

   So, is MSEN studying artificial intelligence? Does a VP Research
   have time to be working on a dissertation about information
   retrieval that is analogous to snatching a drink from a fire hose?

No AI here, just honest work. None of the ``information retrieval''
stuff that I've looked at so far has been very well indexed or
cataloged, so my first-pass conclusion is that existing research
efforts aren't very good.
Most of the hard-core information retrieval engines (that cost real
money) that would do the sorts of things I'm doing fail in interesting
ways on the usenet problem; it would seem more likely to be worthwhile
to build a functioning "expert system that reads netnews and looks for
articles that belong in comp.archives" than to go off degree-hunting.
Especially if that expert system can be trained to go after things in
the charter of other arbitrary newsgroups.

Any pointers to other functioning "read all the news and find the good
parts" systems are welcomed -- I've heard of something called NewsPeek
apparently running out at MIT, but nothing that I know of offhand that
runs off the relatively dirty usenet wire rather than the tidy AP,
UPI, or Dow Jones feeds.

-- 
Edward Vielmetti, moderator, comp.archives, emv@msen.com

"He who hesitates is last;"
"The point man takes the hits;"
"It's easier to get forgiveness than permission;"
"There's no harm in asking."
Pick your aphorism and live by it. -- Stephen Wolff, NSF