Path: utzoo!attcan!uunet!cbmvax!rutgers!ucsd!brian
From: brian@ucsd.EDU (Brian Kantor)
Newsgroups: news.misc
Subject: keyword-based news
Message-ID: <1127@ucsd.EDU>
Date: 7 Sep 88 05:01:50 GMT
Organization: The Avant-Garde of the Now, Ltd.
Lines: 26

In some analysis attempts to characterize the dimensions of the problem
of a keyword-based news system, I looked at a week's worth of news,
omitting the sources groups.

In 6 days of articles, there were over 20,000 articles on file,
containing 100,000+ unique strings of which only 17,000 were to be
found in the 4.3BSD spelling dictionary of 25,000+ words. The average
article-id is 22 characters long. (All numbers are rounded.)

Thus the keyword index is going to be a rather large database, even
with nonsense and trivial words filtered out. A quick
back-of-the-envelope calculation says that we could fill up one or two
WORM platters a year holding the index and articles.

Yeah, I know, lots of ways to compress, filter out headers, signatures,
etc. etc. etc. None of those will do more than cut the problem in
half. For a while.

Maybe I can inspire some student....

And that's as far as I'm going to go with it for now. Just thought
you'd like to see some numbers.

	Brian Kantor	UCSD Office of Academic Computing
			Academic Network Operations Group
			UCSD B-028, La Jolla, CA 92093 USA
			brian@ucsd.edu	ucsd!brian	BRIAN@UCSD
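
For anyone who wants to redo the back-of-the-envelope calculation with their own numbers, here is a rough sketch in Python. Only two inputs come from the post: 20,000 articles in 6 days, and a 22-character article-id. Everything else is an assumption made up for illustration: the function name yearly_estimate, the idea that each index entry costs one article-id, the figure of 100 indexed keywords per article, and the 2,000-byte average article size. The post does not say how the original estimate was done, so this is not a reconstruction of it, and platter capacity is left out because the post gives none.

```python
def yearly_estimate(articles_in_sample=20_000,   # from the post
                    sample_days=6,               # from the post
                    article_id_bytes=22,         # from the post
                    keywords_per_article=100,    # ASSUMED, not in the post
                    avg_article_bytes=2_000):    # ASSUMED, not in the post
    """Rough yearly storage for a keyword index plus the articles themselves.

    Assumes (hypothetically) that every keyword occurrence is recorded as
    one posting holding a full article-id and nothing else.
    """
    # Scale the sample period up to a 365-day year.
    articles_per_year = articles_in_sample / sample_days * 365

    # One article-id per (keyword, article) posting.
    index_bytes = articles_per_year * keywords_per_article * article_id_bytes

    # Raw article text, uncompressed.
    article_bytes = articles_per_year * avg_article_bytes

    return {
        "articles_per_year": articles_per_year,
        "index_bytes": index_bytes,
        "article_bytes": article_bytes,
        "total_bytes": index_bytes + article_bytes,
    }


if __name__ == "__main__":
    est = yearly_estimate()
    for name, value in est.items():
        print(f"{name}: {value:,.0f}")
```

Changing keywords_per_article or avg_article_bytes shifts the totals linearly, so halving either one (by filtering headers and signatures, say) halves only its own term.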