Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!watmath!clyde!burl!ulysses!ucbvax!hplabs!glacier!reid From: reid@glacier.ARPA (Brian Reid) Newsgroups: net.news.adm,net.news.b Subject: great glacier news flood problem solved Message-ID: <5819@glacier.ARPA> Date: Fri, 28-Mar-86 19:47:17 EST Article-I.D.: glacier.5819 Posted: Fri Mar 28 19:47:17 1986 Date-Received: Sun, 30-Mar-86 02:43:28 EST Organization: Stanford University, Computer Systems Lab Lines: 154 Xref: watmath net.news.adm:570 net.news.b:1339 I've spent the last two days going over 2.10.3 netnews with a fine-toothed cmb, and I have finally found the problem that was causing Glacier to drown in tidal waves of news at certain times. [Summary of the problem: when for various reasons Glacier received an unusually large number of news messages all at once, with 2.10.3 news, it was going totally wild. On the most recent episode it took 2 days of Vax 750 time to process all of a batch of news that we got from one of our feeds.] The first problem, which several people in net-land drew my attention to, was the 1-hour timeout on uuxqt locks. When the load got heavy, it was taking more than 1 hour of wall-clock time to process a batch, and after 1 hour, uuxqt would break the lock and start over again on the same batch. To get around this, I used Rick Adams' suggestion for expire interlocks, and replaced /usr/bin/rnews with a shell script that copied incoming news to /usr/spool/news/rnews.$$ instead of processing it straightaway. This quickly drained the uuxqt queues. Historical note: we have a lot of computers here at Stanford, and we use the "rdist" program to update the software on many of them from a single central source machine. One night many months ago, the nightly rdist program ran while the news expire was happening, and as a result the version of /usr/bin/rnews that got rdisted to all of the machines was the fake version that copied the news instead of processing it. To prevent this kind of bug, I replaced /usr/bin/rnews with this shell script: ---------------------------------------------------------------------- #! /bin/sh # # This shell script replaces /usr/bin/rnews. It tests for the presence of a # lock file. If the lock file is there, then the news is hidden away in a # spool directory instead of being processed. The lock file is set by the # "newsoff" command and cleared by the "newson" command. NEWSLIB=/usr/lib/news NEWSSPOOL=/usr/spool/news if [ -f $NEWSLIB/rnews.lock ] then exec /bin/cat > $NEWSSPOOL/rnews.$$ else exec $NEWSLIB/rnews 2>>$NEWSLIB/errlog fi ---------------------------------------------------------------------- The file /usr/lib/news/rnews.lock is specially marked so that it does not rdist out from the central source machine. (You don't want to mark software so that it doesn't rdist, because then updates won't propagate, but it is ok to mark lock files) I then ran Rick Adams' code to pick up all of the saved rnews files and process them all at once: -------------------------------------- cd /usr/spool/news for i in rnews.* do /usr/lib/news/rnews <$i rm -f $i done -------------------------------------- This immediately failed with "argument list too long", because there were so many rnews.* files that the shell couldn't glob the entire list. So I replaced it with this code: ------------------------------------------------------ PATH=.:/usr/stanford/bin:/usr/ucb:/usr/bin:/bin: NEWSLIB=/usr/lib/news NEWSSPOOL=/usr/spool/news rm -f $NEWSLIB/rnews.lock cd $NEWSSPOOL ls -tr1 |\ grep \^rnews. |\ awk "{print \"$NEWSLIB/rnews < \",\$1,\" && rm -f \",\$1,\"\"}" |\ /bin/sh ------------------------------------------------------ This worked just fine, and had the added advantage that it processed news in the same order that it arrived, thereby maintaining a semblance of chronological order. Looks just fine, right? Yep, that's what I thought. So I started doing this, and I started monitoring the queues and processes, and I noticed that rnews was spending most of its time in a "sleep" state, and the queues were draining very slowly. It turns out that having multiple rnews processes running at once was causing lock delays, with one rnews process going into a "sleep" to wait for another. Therefore I decided to limit things to running one rnews at a time. I ran "newsoff", which created /usr/lib/news/rnews.lock, reasoning that the newly-arriving news, fresh out of uuxqt, would just go into the spool queue. Of course it wouldn't be picked up by this loop, because the "ls -tr1" will have taken a snapshot of the directory, but I could go back and get it later after the big flood was fixed. ( If this was a mystery movie, the sound track would play dramatic and suspicious music right now.) One of the ways that 2.10.3 news differs from 2.10.2 news is that you can feed batched news directly to rnews; if rnews sees the batch flag it will automatically fork unbatch. So when we installed 2.10.3 news, our news neighbors changed from feeding us "unbatch" commands to feeding us "rnews" commands. What happens is that /usr/bin/rnews sees the batch flag, and upon seeing it forks /usr/lib/news/unbatch. In turn, /usr/lib/news/unbatch picks the batch apart into articles, and forks rnews on each one of the individual articles. This morning I put a bunch of diagnostic code into rnews and unbatch, to measure where the time was going, and upon studying it, I suddenly realized that unbatch is forking /usr/bin/rnews instead of /usr/lib/news/rnews. Now in the distributed version of netnews this is not a bug, because the distributed version doesn't do this mucking around with expire interlocks, so I can't really call this a bug in the news software. What happened, of course, was that when unbatch forked /usr/bin/rnews on individual articles, they were summarily copied right back into /usr/spool/news, this time without the batching information, and with one file per article. The second time around, the unbatching would not take place, and so the article would be processed correctly. The end result of all of this was that every batch was being processed as follows: uucico forks uuxqt forks /bin/sh cunbatch forks /usr/lib/news/compress -d forks /bin/sh rnews forks /bin/cat which copies the batch to spool dir then newson for each batch performed /usr/lib/news/rnews forks /usr/lib/news/unbatch forks /bin/sh /usr/bin/rnews forks /bin/cat which copies the article to spool dir then newson for each article performed /usr/lib/news/rnews which, thankfully, didn't fork anything, and finished in a few seconds per article. By my count that is 7 execs per batch and an additional 3 execs per article. 2 of the 7 per-batch execs are shells, and 1 of the per-article execs is a shell. When I figured out what was happening I didn't know whether to laugh or cry. I made the following change to unbatch.c, which fixed everything. News is now being processed at least twice as fast as it was before. 111c95 < (void) sprintf(buf, "%s/rnews", LIBDIR); --- > (void) sprintf(buf, "%s/rnews", BINDIR); If you have adopted Rick Adams' rnews hack (of putting a fake rnews in /usr/bin to capture the news during expire), then I recommend that you put a link to the "real" rnews in your /usr/lib/news directory, and then make this change to unbatch. It will save you the cost of execing a shell for each unbatched article, and it will not interfere with the workings of the rnews hack. -- Brian Reid decwrl!glacier!reid Stanford reid@SU-Glacier.ARPA