Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!wuarchive!brutus.cs.uiuc.edu!coolidge
From: coolidge@brutus.cs.uiuc.edu (John Coolidge)
Newsgroups: news.software.b
Subject: Re: Lots of dups
Summary: Actually, that's pretty much what I do...
Message-ID: <1989Oct28.061250.6036@brutus.cs.uiuc.edu>
Date: 28 Oct 89 06:12:50 GMT
References: <1989Oct25.164024.14894@ctr.columbia.edu> <1989Oct25.205129.16397@brutus.cs.uiuc.edu> <1989Oct26.164042.4692@utzoo.uucp> <1989Oct27.012640.27706@brutus.cs.uiuc.edu> <1989Oct27.161302.5026@utzoo.uucp>
Sender: news@brutus.cs.uiuc.edu
Reply-To: coolidge@cs.uiuc.edu
Organization: U of Illinois, CS Dept., Systems Research Group
Lines: 94

henry@utzoo.uucp (Henry Spencer) writes:
>In article <1989Oct27.012640.27706@brutus.cs.uiuc.edu> coolidge@cs.uiuc.edu writes:
>>[...]my system (write each incoming article as a separate file, then
>>feed them all down a pipe into one relaynews) seems to have done
>>a great job...

>In general, the bigger the lump you can feed to relaynews, the better.
>The really-ambitious could "cat * | relaynews" (more or less) rather than
>using the "for" loop of the current "newsrun".

This is more-or-less what I do. The current work-in-process is a C
newsrund that does: foreach file in the incoming directory, send it
down a pipe into a waiting relaynews (started after it's determined
that there is at least one article to be received). I have an alpha
version that does exactly that (and nothing else: no locking, no
error logging, _nothing_).

>One thing you do lose,
>though, is error recovery -- it is no longer possible to localize problems
>to a single file. One can imagine a hybrid approach that would pour it
>all in, and then *if* there was an error message, back off and push the
>files in one at a time to see who was to blame (since feeding the same
>stuff through twice is generally harmless, and also fairly quick [rejection
>of duplicates is really fast]). At the time it didn't seem worth the
>trouble.

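The hybrid scheme quoted above can be sketched in sh. This is only an
illustration under stated assumptions, not code from C News or from
newsrund: the function name feed_hybrid and the RELAY variable are
made up for the example, "an error message" is taken to mean anything
written to standard error, the incoming directory is assumed to hold
at least one file, and mktemp is assumed to be available. The relay's
standard output is discarded here purely to keep the example's own
output clean.

```shell
#!/bin/sh
# Hypothetical sketch: bulk feed first, per-file retry only on trouble.
# RELAY stands in for however relaynews is invoked locally (assumption).
RELAY=${RELAY:-relaynews}

# feed_hybrid DIR
# Prints the names of files whose individual feed produced an error
# message; prints nothing if the bulk feed was clean.
feed_hybrid() {
    dir=$1
    errs=$(mktemp) || return 1

    # Fast path: pour everything down one pipe, collecting error messages.
    cat "$dir"/* | $RELAY >/dev/null 2>"$errs"
    if [ ! -s "$errs" ]; then
        rm -f "$errs"
        return 0
    fi

    # Slow path: push the files in one at a time to see who was to blame.
    # Articles accepted on the first pass come back as duplicates.
    for f in "$dir"/*; do
        $RELAY <"$f" >/dev/null 2>"$errs"
        [ -s "$errs" ] && echo "$f"
    done
    rm -f "$errs"
    return 0
}
```

Calling feed_hybrid on an incoming directory costs one relay run in
the common clean case, and one extra run per file only when the bulk
pass complained. The slow path relies on the point made in the quoted
text that feeding the same articles twice is generally harmless.
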
I'd already granted losing some error recovery. I'm not planning to be
as cautious about disk space as the C News authors were. I'm not
planning to catch errors as neatly. Actually I hadn't thought of this
roll-back scheme --- seems like a very good idea that I may try to
implement. It could be further optimized by adding some sort of
"article key" down the pipe into relaynews (perhaps #! key , with
being the batch file name) --- this makes things extremely simple.
It also increases the workload for relaynews, but perhaps not too much.

All of my changes are intended to feed the performance monster. If at
all possible, I want to feed both mouths of the performance beast ---
the propagation delay mouth and the system load mouth. If I can't feed
both, propagation delay gets a little extra weight on the balance, but
not much --- I can't bring our news machine to its knees (or anywhere
near).

If I lose some error recovery along the way, then I'm willing to pay
the price to get the performance. A lot of people aren't, and frankly
I'm not too sure that there exist all that many sites who really need
to optimize the sorts of things I'm optimizing. Most sites just want
news delivered at a reasonable pace. I suspect most sites get news at
most hourly and more commonly once or twice a day.

On the other hand, there are the NNTP high-speed feeds (like me!) who
want all the articles processed yesterday and everything done
double-quick. The occasional odd failed batch (and I get very, very
few of them) can be diverted into the slow lane for attention, but
most of the news should be in and out ASAP. Dropping a bad batch is
even often an option; I've got 9 other people all trying to give the
same articles to me...

Anyway, there are several classes of performance patches:

1) Those that improve performance without any cost (in portability,
   recovery, or anything else). Clearly these should go everywhere.
   Here lie things like Chris Myers' nntpxmit patches (to put the
   message-id in the batch file and have nntpxmit use it) and David
   Robinson's ihave patch (remove a useless file read in nntpd's
   ihave processing).

2) Those that improve performance at the cost of some portability,
   but no loss in recovery or anything else. Things like writing
   spacefor and newsrun in C come under this heading. Very much worth
   it if done as an optimization, and if a portable alternative is
   available to those sites who can't run the optimized version.

3) Those that improve performance at the cost of recovery, but not at
   the cost of portability. Changing the code in the stock newsrun to
   do cat * | relaynews is an example of this. Useful, but only if the
   people using it know what they're giving up.

4) Those that improve performance at the cost of both recovery and
   portability. Things like my newsrund-in-progress do this, since I'm
   both working in C and doing things like cat * | relaynews. Useful
   only if it'll run where you are and if you are willing to give up
   the reliability.

5) And, of course, things that _don't_ improve performance, even
   though someone thought they might. I've tried a number of
   experiments that just didn't yield the benefits I thought they
   would. Oh well, back to the drawing board...

I guess that's enough of the soapbox for now. There's a fair bit of
philosophical ground here. Geoff and Henry have provided a very high
performance, yet portable and reliable news system. Those of us who
still aren't satisfied can easily go in and hack things up to run even
faster --- but it's up to us to watch our steps and not break
something important.

--John

--------------------------------------------------------------------------
John L. Coolidge     Internet:coolidge@cs.uiuc.edu   UUCP:uiucdcs!coolidge
Of course I don't speak for the U of I (or anyone else except myself)
Copyright 1989 John L. Coolidge. Copying allowed if (and only if) attributed.
You may redistribute this article if and only if your recipients may as well.