Path: utzoo!attcan!uunet!pyrdc!pyrnj!esquire!yost
From: yost@esquire.UUCP (David A. Yost)
Newsgroups: news.admin
Subject: Re: News delivery problems - old news again
Message-ID: <1355@esquire.UUCP>
Date: 14 Aug 89 19:35:10 GMT
References: <43675@bbn.COM> <651@vector.Dallas.TX.US> <505@logicon.arpa>
Reply-To: yost@esquire.UUCP (David A. Yost)
Organization: DP&W, New York, NY
Lines: 173

In article <43675@bbn.COM> denbeste@BBN.COM (Steven Den Beste) writes:
>Today we received a large number of news articles dated July 22, which we have
>received before. I located 50 or so of them and analyzed the paths by which
>they arrived here.

Today alone, we have 1,883 duplicate messages in comp alone.
We may have some other problem.

Enclosed is a shell script that will find duplicate messages
and list their pathnames. Its implementation required another
interesting, more generally useful utility which I've wished I
had for a long time and finally wrote.

Here is quickie documentation on both scripts, followed by
a shar of the two scripts.

--dave yost

-----------------
Usage: find newsdir ... -type f -print | newsdups [ -i file ] [ -o file ]

Takes news file pathnames on standard input and outputs
pathnames of later-numbered duplicate messages in each group.

If the -o file argument is used, newsdups also outputs a record of
the message IDs seen in this pass into the specified file for use
as the -i file argument to a future run of newsdups, at which time
newsdups will assume that those message IDs already exist.

-----------------
Usage: numdups awk-field-number-list file ...

For each line, prints a number n followed by a space followed
by the original text on the line. The number identifies the
line as the nth line containing the specified fields.
If multiple field numbers are specified, they must be separated
by spaces, and the awk-field-number-list argument must be quoted.
Fields are numbered as in awk (0 = whole line, 1 = field 1,
2 = field 2, etc.).
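
To make the numbering concrete, here is a minimal sketch of the same
counting done in a single awk program. It is an illustration and not the
numdups script itself: the function name numdups_sketch and the sample
records are hypothetical, it reads standard input only, and it assumes an
awk that supports -v (nawk or a POSIX awk).

```shell
# numdups_sketch: hypothetical stand-in for numdups, reading stdin only.
# $1 is a quoted, space-separated awk field list such as "1 3".
numdups_sketch() {
    awk -v fields="$1" '
    BEGIN { n = split(fields, f, " ") }
    {
        # Build the key from the selected fields (0 means the whole line).
        key = ""
        for (i = 1; i <= n; i++)
            key = key $(f[i]) " "
        # Prefix the line with how many times this key has been seen so far.
        printf "%d %s\n", ++count[key], $0
    }'
}

# Made-up records in the layout: group, article number, message ID.
printf '%s\n' \
    'comp/misc 0000101 <a@x>' \
    'comp/misc 0000102 <b@x>' \
    'comp/misc 0000107 <a@x>' |
numdups_sketch '1 3'
```

With fields 1 and 3 as the key, the first two records are numbered 1 and
the third, which repeats the group and message ID of the first, is
numbered 2.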
-----------------
#!/bin/sh
unlink=NO
case $1 in
-u)
    unlink=YES
    shift
esac
echo x newsdups
case $unlink in
YES)
    rm -f newsdups
esac
sed 's/^X//' >newsdups <<'*-*-END-of-newsdups-*-*'
X#!/bin/sh
X
X# See "Usage:" below
X#
X# Requires the nonstandard shell script 'numdups'
X#
X# 890814 D Yost, Davis Polk & Wardwell
X#
X# Why the tmp1 file instead of a pipe? Otherwise it doesn't work.
X
Xcase $# in
X0 | 2) ;;
X*)  echo 1>&2 "
XUsage: find newsdir ... -type f -print | newsdups [ -i file ] [ -o file ]
X
XTakes news file pathnames on standard input and outputs
Xpathnames of later-numbered duplicate messages in each group.
X
XIf the -o file argument is used, newsdups also outputs a record of
Xthe message IDs seen in this pass into the specified file for use
Xas the -i file argument to a future run of newsdups, at which time
Xnewsdups will assume that those message IDs already exist.
X"
X    exit 2
Xesac
X
Xtmp1=/tmp/newsdups1.$$
Xtmp2=/tmp/newsdups2.$$
Xtmp3=/tmp/newsdups3.$$
Xtrap "status=$? ; rm -f $tmp1 $tmp2 $tmp3 ; exit $status " \
X    0 1 2 3 4 5 6 7 8 10 12 13 15 24 25 29
X
Xifile=
Xofile=$tmp2
X
Xcase "$1" in
X-i) ifile=$2 ; shift ; shift ;;
X-o) ofile=$2 ; shift ; shift ;;
Xesac
X
Xxargs grep -i '^message-id:' > $tmp1
Xsed < $tmp1 's,/\([^/:]*\):[^:]*: , \1 ,' \
X| awk '{printf "%s %07d %s\n", $1, $2, $3}' \
X| sort > $ofile
X
Xcase "$ifile" in
X"") cat $ofile
X    ;;
X*)  numdups '1 3' $ifile \
X    | awk '$1 == 1 { print $2, $3, $4 }' > $tmp3
X    cat $tmp3 $ofile
X    ;;
Xesac \
X| numdups '1 3' \
X| awk '$1 != 1 { printf "%s/%d\n", $2, $3 }'
X
Xexit
*-*-END-of-newsdups-*-*
echo x numdups
case $unlink in
YES)
    rm -f numdups
esac
sed 's/^X//' >numdups <<'*-*-END-of-numdups-*-*'
X#!/bin/sh
X
Xcase $# in
X0)  echo 1>&2 "
XUsage: numdups awk-field-number-list file ...
X
XFor each line, prints a number n followed by a space followed
Xby the original text on the line. The number identifies the
Xline as the nth line containing the specified fields.
XIf multiple field numbers are specified, they must be separated
Xby spaces, and the awk-field-number-list argument must be quoted.
XFields are numbered as in awk (0 = whole line, 1 = field 1,
X2 = field 2, etc.).
X"
X    exit 2
Xesac
X
Xfield=$1 ; shift
X
Xtmp=/tmp/numdups.$$
Xtrap "status=$? ; rm -f $tmp ; exit $status " \
X    0 1 2 3 4 5 6 7 8 10 12 13 15 24 25 29
X
X(
Xecho -n "awk '{
X    tmpstr = sprintf "'"'
X
Xfor X in $field
Xdo
X    echo -n "%s "
Xdone
X
Xecho -n '"'
X
Xfor X in $field
Xdo
X    echo -n ', $'"$X"
Xdone
X
Xecho '
X    ++counts[tmpstr]
X    printf "%d %s\n", counts[tmpstr], $0
X}'"' $*" ) > $tmp
Xsh $tmp
*-*-END-of-numdups-*-*
exit
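
For anyone tracing what newsdups does between the grep and the final
list, here is a sketch of the same transformation as one pipeline, fed
with made-up grep output. The function name dups_sketch and the sample
paths are hypothetical, and the last awk stage does the counting inline
in place of the numdups call, so it covers only the case with no -i file.

```shell
# dups_sketch: hypothetical condensed view of the newsdups pipeline.
# Input on stdin: lines as "grep -i '^message-id:'" prints them for
# several files, that is "path/number:Message-ID: <id>".
dups_sketch() {
    # Split "group/number:Message-ID: <id>" into "group number <id>".
    sed 's,/\([^/:]*\):[^:]*: , \1 ,' |
    # Zero-pad the article number so a plain sort orders it numerically.
    awk '{ printf "%s %07d %s\n", $1, $2, $3 }' |
    sort |
    # Within each group, print every occurrence of an ID after the first.
    awk '{ if (++seen[$1 " " $3] > 1) printf "%s/%d\n", $1, $2 }'
}

# Made-up sample: article 107 repeats the message ID of article 101.
printf '%s\n' \
    'comp/misc/101:Message-ID: <a@x>' \
    'comp/misc/107:Message-ID: <a@x>' \
    'comp/misc/102:Message-ID: <b@x>' |
dups_sketch
```

For this sample the only pathname printed is comp/misc/107, the
later-numbered copy.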