Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uwm.edu!lll-winken!uunet!sceard!mrm
From: mrm@sceard.Sceard.COM (M.R.Murphy)
Newsgroups: news.software.b
Subject: Re: Cnews expire problem... need help (LONG)
Message-ID: <1990Dec10.015743.28094@sceard.Sceard.COM>
Date: 10 Dec 90 01:57:43 GMT
References: <1990Dec7.130639.15803@bnr.ca> <660596702.10086@mindcraft.com> <1990Dec8.190114.15171@sceard.Sceard.COM> <660770337.20986@mindcraft.com>
Reply-To: mrm@Sceard.COM (M.R.Murphy)
Organization: Sceard Systems, Inc. San Marcos, CA 92069
Lines: 115

In article <660770337.20986@mindcraft.com> karish@mindcraft.com (Chuck Karish) writes:
>In article <1990Dec8.190114.15171@sceard.Sceard.COM> mrm@Sceard.COM
>(M.R.Murphy) writes:
>>
>>The following code cleans up a history file so that mkdbm is happy with it,
>>and also replaces the single awk line that sifts a history file and prints
>>only lines that are after a given time that I used in a modified expire scheme.
>>The checking for goodness in a history line could be made fancier, but this is
>>enough to make mkdbm happy. Makes for a pretty fast expire, too. Every so often,
>>writing a short specialized tool in C is appropriate, though I'd rather use
>>awk :-)
>
>This C program is needed only to avoid re-writing the whole history
>file during checking. On my machine, the mkdbm step takes much longer
>than the scan anyway and I have enough disk space for a second copy
>of history, so I use this one-liner in sed:
>
>sed -n 's/^<.* /p' # The white space in the pattern is a tab.
>
>>...
>>now=`getdate now`
>>ago=`awk "/^\/expired\// {print ($now-(86400*\$(3)))} {next}" explist`
>># replace the single-line awk with exphist
>>#awk "{split(\$2,dates,\"~\");if(dates[1]>$ago)print \$0}" history >history.n
>
>Doesn't this reproduce the functionality specified by the 'expired'
>line in the expire control file?
>--
>
>	Chuck Karish		karish@mindcraft.com
>	Mindcraft, Inc.
>	(415) 323-9000

The C program referenced in article
<1990Dec8.190114.15171@sceard.Sceard.COM> above does not just avoid
re-writing the whole history file during checking. It does reproduce the
functionality specified by the 'expired' line in the expire control file,
sort of, but the C News expire is not used at all in the simple scheme
for "expiration" that I posted a while back.

Expiration is maintaining the news database, that is, the articles that
are the ebb and flow of USENET as we know it, and the control of
reception of duplicate articles from other sites. The scheme is based on:

	1) don't accept an article from another site that has already been
	   received, that is, that already exists in the history file, and
	2) don't keep old articles lying about wasting space.

Another function of the standard C News expire, that is, archiving, I
think is better separated. It is more reasonable to set up a sys file
entry that sends articles from newsgroups to be archived to an archiver
when they are received from the feed. The archiver can then be quite
clever and selective about what it bothers to archive. The less that the
expiration process has to handle, the better.

To accomplish this scheme, I split C News expiration into two separate
parts, expire, which maintains the history file and handles 1) above, and
trasher which gets rid of old articles and handles 2) above. BTW, the
Expires: header is ignored by trasher on the basis that it is only the
business of a system's administrators how long an article should take up
space. I have since kissed off the script that was trasher and replaced
it with reap by dt@yenta.alb.nm.us (thanks, david).

Expiration of the history file is just the creation of a new history file
that omits lines of the previous history file that are older than some
particular time. It need have nothing to do with whether the articles
referenced by that line are still around.
I used the one-liner awk script

awk "{split(\$2,dates,\"~\");if(dates[1]>$ago)print \$0}" history >history.n

to do just that. I was happy enough with this part of the scheme until a
bad disk block corrupted the history file. Oops. Awk groused because a
record was too long for it to handle. Mkdbm groused because the line in
the history file was not up to its expectations for a valid line (simple
and incomplete though those expectations were). BTW, the corrupted part
of the history file had a less-than followed by some characters and a
tab, so it would have passed the sed test referenced by Chuck and still
would have given mkdbm a problem. Unless sed croaked on the line, too. :-)

To get around the problem of lines that mkdbm chokes on, I decided to
snag the code from mkdbm and twiddle it about a little so that it would
just read history lines on its standard input and write only lines on its
standard output that mkdbm would be happy with. As long as I was going to
do that much, I might as well have it do the check for old lines, too.
That way, exphist, the new C program, reads an old history file, deletes
bad lines or lines that are too old, and writes the output so that mkdbm
can make the new history files. The awk line above is then replaced by

exphist $ago <history >history.n

Then mkdbm, move the results around, and save the old stuff. Simple, no?

On our news machine, both the scan and the mkdbm are fast :-) Exphist and
mkdbm could have been combined, and would probably have been faster, but
these tools are more useful when separate. That's part of the UNIX(tm)
philosophy.

Reap is a separate process for getting rid of old articles. It is
completely independent of the process of maintaining history. Reap also
has the benefit that it is:

	1) short enough so that I can understand it,
	2) flexible,
	3) fast,
	4) and, written by someone else so I didn't have to do it.
	   (thanks again, david)

Again, the standard C News expire is not used at all.
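
For illustration only, the filtering that exphist does can be sketched in
a few lines of Python. This is not the C source from the earlier article.
The validity checks (a message-ID in angle brackets, a numeric arrival
time before the "~", and the MAX_LINE limit) are assumptions standing in
for whatever mkdbm actually expects, and the names keep_line and
filter_history are made up for the sketch.

```python
import re
import sys

MAX_LINE = 1024  # assumed sanity limit on line length, not taken from mkdbm


def keep_line(line, cutoff):
    """Return True if a history line looks well formed and is newer than cutoff.

    `line` is bytes so that a corrupted block cannot cause a decode error.
    """
    if len(line) > MAX_LINE:
        return False
    fields = line.rstrip(b"\n").split(b"\t")
    # Assumed layout: <message-id> TAB arrival~expiry [TAB group/number ...]
    if len(fields) < 2:
        return False
    msgid = fields[0]
    if not (msgid.startswith(b"<") and msgid.endswith(b">")):
        return False
    # Same test as the awk one-liner: the part of field 2 before the "~".
    arrival = fields[1].split(b"~")[0]
    if not re.fullmatch(rb"[0-9]+", arrival):
        return False
    return int(arrival) > cutoff


def filter_history(lines, cutoff):
    """Keep only the lines that pass keep_line, in their original order."""
    return [line for line in lines if keep_line(line, cutoff)]


if __name__ == "__main__" and len(sys.argv) == 2:
    # Act as a stdin-to-stdout filter; the cutoff time is the only argument.
    cutoff_time = int(sys.argv[1])
    for raw in sys.stdin.buffer:
        if keep_line(raw, cutoff_time):
            sys.stdout.buffer.write(raw)
```

Run as a filter, with the cutoff time as its argument, old history on
standard input and new history on standard output, it would sit in the
same place as the awk line. A real replacement would want to mirror
mkdbm's own checks rather than the guesses above.
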
What I am talking about here is an alternate method of maintaining the
news database: articles and history.

Yes, it is a Really Good Thing to lock so that no other News processing
goes on during the history expiration. It is not necessary to lock News
processing during reaping. Everything needs to be locked against itself
running at the same time. Don't you just love crons that can't keep
things straight?

I really like C News. Thanks to its authors.
-- 
Mike Murphy  mrm@Sceard.COM  ucsd!sceard!mrm  +1 619 598 5874