Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!agate!eos!jaw
From: jaw@eos.UUCP (James A. Woods)
Newsgroups: news.software.b
Subject: NNGRAB -- a netnews subject search speedup for 'nn'
Keywords: USENET, information retrieval, subject, search, 'nn'
Message-ID: <4931@eos.UUCP>
Date: 30 Aug 89 06:48:09 GMT
Organization: NASA Ames Research Center, California
Lines: 151

# "Information is any difference that makes a difference." -- G. Bateson

Here is tonic to aid with fast USENET subject search.  A big assumption
is that you subscribe to 'nn' and its delightful article header database.

For a cheap keyword-based news selector, the operation

        nn -mxX -sword all

is useful, except that on many systems (say a VAX 11/785 or Sun 3-class
server with a fortnight of news), it may take upwards of a minute or two
of wall clock time on a moderately-loaded machine.  This can still be
faster than ransacking 80MB of /usr/spool/news with 'grep', but not nearly
fast enough for newsaholics.

The changes which follow reduce the time to a couple of seconds or less.
The catch involves keeping another file (about 1% of the news system size)
updated a few times a day, just like what 'fastfind' does.

First, define the shell command 'nngrab', where

        nngrab word

is simple shorthand for the 'nn' one-liner above:

----------------------------------------------------------------------------
#!/bin/sh
# nngrab -- quick news retrieval by subject search
trap "rm -f /tmp/nngrab$$" 0 1 2 15
# collect the 3-digit hex group IDs of subject lines matching the word
egrep -i "^....*$1" /usr/spool/news/.nn/subjects |\
        sed 's/^\(...\).*/\1/' | uniq > /tmp/nngrab$$
# translate the IDs to newsgroup names via the map file
v=`fgrep -f /tmp/nngrab$$ /usr/spool/news/.nn/map | sed 's/.* //'`
# nothing relevant: leave quietly
case $v in
"")     exit ;;
esac
nn -Q -mxX -s$1 $v
----------------------------------------------------------------------------

'nngrab' returns silently if there is no relevant news, and fires up
normal 'nn' otherwise.
It operates by mapping submatched subject lines containing pre-stored
three-digit hex code group IDs to real newsgroup names (along with rare
numeric false drops) for subsequent input to 'nn'.  [The hex output is a
suboptimal, but simple space-saving code.]  Naturally, you're running
fast e?grep (GNU-style) or this is all for naught.

One possible problem is that subject underspecification might tickle
a limit with 'fgrep' in degenerate cases on some systems ("wordlist too
large").  Now some folks might consider this good [you didn't really want
to call up all news articles containing the letter 'e', did you?]  But if
you don't consider this a feature, then either:

   (1) raze the limits in the Berkeley 'fgrep' source, or
   (2) bug Andrew Hume at AT&T for his implementation of the
       Commentz-Walter algorithm in Unix Edition Nine 'fgrep', or
   (3) write a (slothful) two-line 'awk' program utilizing associative
       arrays to map newsgroups completely, or
   (4) make C code for same.

Next, update the auxiliary 'map' and 'subjects' files via a system-wide
'cron' or 'at' command, using a modified 'nn' (here called from a script
dubbed 'nnspew') to do the hard part:

----------------------------------------------------------------------------
2 6,9,12,15,18,21 * * * root /bin/nice /bin/sh /usr/lib/news/nnspew
----------------------------------------------------------------------------

is what's on one NASA Ames Research Center machine, along with:

----------------------------------------------------------------------------
# nnspew -- generate subject line database and newsgroup map
trap "rm -f /tmp/nnmap$$ /tmp/nnsubj$$; exit" 0 1 2 15
export TERM
TERM=vt100 # arbitrary
MAP=/usr/spool/news/.nn/map
SUBJECTS=/usr/spool/news/.nn/subjects
# write root's master newsgroup file, one entry per active group
awk '{printf "+ 000000 %s\n", $1}' /usr/lib/news/active > /.nn/rc
# subjects arrive on stdout, the newsgroup map on stderr
/usr/lib/news/nnx -Q -mxX -sipsissima_verba all \
        2>/tmp/nnmap$$ | sort -u > /tmp/nnsubj$$
# $? is the status of 'sort', the last command in the pipeline
case $? in
0)      mv /tmp/nnmap$$ $MAP; mv /tmp/nnsubj$$ $SUBJECTS;;
esac
----------------------------------------------------------------------------

What we have here is a mutated form of 'nn' ('nnx') to spit out all news
article subject headers in all groups.  (The Latinism is presumably a
non-occurring subject phrase whose sole purpose is to suppress 'nnx' output
other than from the change below.)  Sorting with the "unique" flag saves
disk (~30-50%) by condensing followup verbiage.  Another 15% could be
trimmed by more complicated logic to merge cross-posted subjects.

Another time/space tradeoff would allow a half-size subject file by using
'compress'.  [Only recommended for severely unbalanced systems, e.g. an
adrenal CPU like an Amdahl or high-freq Mips R2000 feeding a very slow
filesystem or datapipe, say NFS<->optical disk.]

In 'nnspew', the awk line builds a master newsgroup file for 'root', the
executor of 'nnx'.  The lines containing "TERM" are a kludge, frankly,
there just to quash the interactive 'nn' milieu.  This hack could go away
with more C code mods to 'nnx', I suppose.
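
For the curious, option (3) above might be sketched roughly as follows.
This is only an illustration, not anything from the 'nn' distribution:
the function name 'nngroups' and its argument order are made up here, and
it assumes a POSIX 'awk' (one with -v and tolower) plus the file formats
emitted by 'nnx', i.e. "hex-id group" lines in the map and subject lines
that start with the three-digit hex ID.  It also folds the 'egrep' stage
into the same 'awk' pass, and treats the word as an 'awk' regular
expression, which is close to but not identical with 'egrep' syntax.

```shell
#!/bin/sh
# nngroups -- hypothetical helper, not part of the original posting.
# usage: nngroups word subjects-file map-file
# Prints each newsgroup name whose stored subjects match 'word'.
nngroups() {
    awk -v word="$1" '
        # First file (the map): "hex-id group-name"; remember name by ID.
        NR == FNR { name[$1] = $2; next }
        # Second file (subjects): three hex digits, then the subject text.
        # Lowercase both sides for a case-insensitive match.
        tolower(substr($0, 4)) ~ tolower(word) {
            id = substr($0, 1, 3)
            # Report each matching group once, in order of appearance.
            if ((id in name) && !(id in seen)) {
                seen[id] = 1
                print name[id]
            }
        }
    ' "$3" "$2"
}
```

With something like this in hand, the 'fgrep' line in 'nngrab' could
presumably be swapped for a call along the lines of

        v=`nngroups "$1" /usr/spool/news/.nn/subjects /usr/spool/news/.nn/map`

Since the lookup is by exact ID rather than by substring, there is no
wordlist limit to trip over, and the numeric false drops should go away
too, at the cost of 'awk' reading both files on every call.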
----------------------------------------------------------------------------
Finally, there is 'nnx' itself, which is just a recompiled 'nn' with
three changes to file src/nn/group.c:

(1) to suppress the groupname indicator, comment out the line

        printf("\r%s", cur->group_name); clrline();

(2) add two lines of declaration near the top of access_group()

        static int gcount = -1;
        static char gsave[100];

    (more precisely, after the line containing "static char subptext[80];")

(3) immediately after the spot where the subject is read:

        ah->subject = alloc_str((int)hdr.dh_subject_length);
        if (fread(ah->subject, sizeof(char), (int)hdr.dh_subject_length, data)
            != hdr.dh_subject_length) goto data_error;

    add:

        if (strcmp(gsave, gh->group_name)) {
                gcount++;
                strcpy (gsave, gh->group_name);
                fprintf (stderr, "%03x %s\n", gcount, gh->group_name);
        }
        printf ("%03x%*s\n", gcount, (int)hdr.dh_subject_length, ah->subject);
----------------------------------------------------------------------------

These changes really just feed a subject list to 'stdout' and
a newsgroup mapping to 'stderr'.

Wrapping up, all this pre-computation (a three-minute process, including
sorting, per 50 meg of news at one VAX MIP) is worth it if there are just
a few invocations of 'nngrab' per day.  In fact, the setup time of 'nnspew'
is completely amortized in fewer than two uses of 'nngrab', so crontab
entries can be fairly liberal.

I have found the script invaluable as shorthand to call up already-perused
as well as unsubscribed topics.  My only wish would be for standard 'nn'
to add the "Keywords:" fields to its database.

Ah yes, all in keeping with the silk-purse solution to a sow's ear!

James A. Woods (ames!jaw)
NASA Ames Research Center