Newsgroups: news.software.b
Path: utzoo!henry
From: henry@utzoo.uucp (Henry Spencer)
Subject: Re: Lots of dups
Message-ID: <1989Oct26.164042.4692@utzoo.uucp>
Organization: U of Toronto Zoology
References: <1989Oct25.164024.14894@ctr.columbia.edu> <1989Oct25.205129.16397@brutus.cs.uiuc.edu>
Date: Thu, 26 Oct 89 16:40:42 GMT

In article <1989Oct25.205129.16397@brutus.cs.uiuc.edu> coolidge@cs.uiuc.edu writes:
>>I run a Cnews machine with a few high-speed (NNTP) feeds. My problem is that
>>two of them have excessive (> 80%) duplication.
>
>The problem is that with REALLY fast feeds, even processing articles
>once a minute is not fast enough. The problem lies in nntpd accepting
>multiple copies because the queue hasn't been run yet...

The fundamental problem, however, goes even deeper. There are two
inherently conflicting desires:

1. Minimum processing latency, so that an article which has arrived is
   known as soon as possible, e.g. so that it need not be brought in
   again. This pushes you towards processing individual articles the
   instant they show up.

2. Minimum processing overhead, so that machine resources devoted to
   news are minimized. One major way of doing this -- one of C News's
   biggest performance wins over B News -- is to amortize setup and
   teardown overhead over multiple articles. This means delaying
   processing until several articles can be run as a batch.

There is NO WAY to satisfy both of these desires simultaneously. All you
can do is strike some sort of compromise, depending on your own
priorities. In particular, if you have very fast feeds and optimize for
minimum overhead, you will inevitably receive lots of articles more than
once, although C News will throw away the duplicates quite efficiently.
(I don't really see that there is cause for great alarm about
efficiently-discarded duplicates.) C News is generally slanted towards
minimum overhead, given our observation that B News was increasingly
eating our machines alive doing one article at a time.
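
To put rough numbers on the trade-off, here is a minimal sketch of a toy cost model. It is not C News code. It assumes a fixed setup/teardown cost per run, a fixed cost per article, and a queue that is run at regular intervals; the function names, parameters and figures are all hypothetical, chosen only for illustration.

```python
import math


def overhead_per_article(setup_cost, article_cost, batch_size):
    """Average cost per article when one setup/teardown is shared by a batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return setup_cost / batch_size + article_cost


def duplicates_accepted(offer_times, batch_interval):
    """Count redundant copies of ONE article accepted before it is known.

    Assumed model: the queue is run at times batch_interval,
    2*batch_interval, ... and the article becomes known at the first run
    strictly after its first arrival. Every other offer that arrives
    before that run is accepted again.
    """
    if batch_interval <= 0:
        raise ValueError("batch_interval must be positive")
    if not offer_times:
        return 0
    times = sorted(offer_times)
    first = times[0]
    # Time of the first queue run strictly after the first arrival.
    known_at = (math.floor(first / batch_interval) + 1) * batch_interval
    # Later offers that land inside the "not yet known" window.
    return sum(1 for t in times[1:] if t < known_at)
```

With a made-up setup cost of 2.0 and article cost of 0.1, the per-article overhead falls from about 2.1 at a batch size of 1 to about 0.2 at a batch size of 20. In the other direction, an article offered at 3, 10, 25 and 70 seconds is accepted twice more with a 60-second queue interval, and not at all with a 5-second one. Shrinking one number grows the other, which is the whole point.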

(Incidentally, the oft-seen suggestion of running relaynews as a daemon
doesn't really help very much. An article is not really received until
it, its history-file line, and the update to the relevant active-file
line(s), are flushed out to disk. Flushing data to disk is a big part of
the setup/teardown overhead. If you do it once per article, the overhead
goes way up. If you batch the disk flushing, you're back to having a
significant window in which the article has been received but this fact
is not universally and positively known yet. The relaynews daemon might
be a net improvement, but it is *not* an escape from the fundamental
dilemma.)
--
A bit of tolerance is worth a  |     Henry Spencer at U of Toronto Zoology
megabyte of flaming.           | uunet!attcan!utzoo!henry henry@zoo.toronto.edu