Path: utzoo!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!pacific.mps.ohio-state.edu!linac,convex!news
From: tchrist@convex.COM (Tom Christiansen)
Newsgroups: comp.unix.questions
Subject: Re: Text Processing Question
Message-ID: <1991Mar18.051909.19578@convex.com>
Date: 18 Mar 91 05:19:09 GMT
References: <31134@usc> <1991Mar18.013647.7570@midway.uchicago.edu>
Sender: news@convex.com (news access account)
Reply-To: tchrist@convex.COM (Tom Christiansen)
Distribution: usa
Organization: CONVEX Software Development, Richardson, TX
Lines: 44
Nntp-Posting-Host: pixel.convex.com

From the keyboard of goer@ellis.uchicago.edu (Richard L. Goerwitz):
:In article <31134@usc> rkumar@buddha.usc.edu (C.P. Ravikumar) writes:
:
:>I was wondering if there is a utility to check
:>for repetition of words in a document....
:>
:>I have the feeling this can be done using "awk".
:
:The hard part, as always, is settling on a field separator -

Perhaps.  I always thought the hard part was catching pairs of words that
extend over line boundaries.  Here's a perl version that catches these,
although I admit it's probably overkill to suck up the whole file into
memory before munging it.  Works fine on my machine. :-)

Here's the output when run on my C compiler man page:

/usr/man/man1/cc.1:
   39 compiler. Certain extensions, notably the [* long long *] type,
   57 Forces language and library interpretation based on [* the the *] original
  770 Each library has a profiled version whose name is formed [* by
  771 by *] inserting \(lq_p\(rq before the \(lq.a\(rq.

The precise definition of what constitutes a repeated word (and what the
legit separators are) will vary according to taste.  I chose identifier-like
tokens (starting with a lowercase letter) separated by white space.  Speed
(and definitely memory) optimizations are certainly possible, but this does
the job well enough for me.  The program (not line noise :-) follows:

--tom

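#!/usr/bin/perl
# Report words repeated back to back, even across line boundaries.
undef $/; $* = 1; # process whole file
while ( $ARGV = shift ) {
    if (!open ARGV) { warn "$ARGV: $!\n"; next; }   # opens the file named by $ARGV
    $_ = <>;                                        # whole file, since $/ is undefined
    # wrap each run of a repeated word in \200 markers; skip files with none
    s/\b(\s?)(([a-z]\w*)(\s+\3)+\b)/$1\200$2\200/g || next;
    split(/\n/);                                    # lines land in @_
    $n = 0; @hits = ();
    # keep only the lines that carry a marker, prefixed by line number
    for (@_) { $n++; push(@hits, sprintf("%5d %s", $n, $_)) if /\200/; }
    $_ = join("\n",@hits);
    s/\200([^\200]+)\200/[* $1 *]/g;                # turn marker pairs into [* ... *]
    print "$ARGV:\n$_\n";
}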
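
The memory optimization mentioned above could look something like the
following: read a line at a time and remember only the last word of the
previous line, so a pair split over a line boundary is still caught.  This
is a rough sketch, not a drop-in replacement.  It is written for a current
Perl 5, where the $* variable and the implicit split into @_ used above are
no longer supported.  The name find_repeats and the [line, word] return
shape are made up for illustration, and it prints line numbers and words
rather than marking the text with [* ... *].

```perl
#!/usr/bin/perl
use strict;
use warnings;

# find_repeats is a hypothetical helper, not part of the original program.
# It reads from an open filehandle one line at a time and returns a list of
# [line_number, word] pairs.  A word is the same identifier-like token as
# above: [a-z]\w*, with repeats separated only by white space.
sub find_repeats {
    my ($fh) = @_;
    my @hits;
    my $tail = '';                    # last word of the previous non-blank line
    my $n    = 0;
    while (my $line = <$fh>) {
        $n++;
        next unless $line =~ /\S/;    # blank line: keep $tail, white space spans it
        # a repeat that straddles the line break
        if ($tail ne '' && $line =~ /^\s*\Q$tail\E\b/) {
            push @hits, [$n, $tail];
        }
        # repeats wholly inside this line
        while ($line =~ /\b([a-z]\w*)(?:\s+\1)+\b/g) {
            push @hits, [$n, $1];
        }
        # remember the trailing word, if the line ends in one
        $tail = $line =~ /\b([a-z]\w*)\s*$/ ? $1 : '';
    }
    return @hits;
}

# Scan each file named on the command line without slurping it.
for my $file (@ARGV) {
    my $fh;
    unless (open($fh, '<', $file)) { warn "$file: $!\n"; next; }
    my @hits = find_repeats($fh);
    close $fh;
    next unless @hits;
    print "$file:\n";
    printf "%5d %s\n", @$_ for @hits;
}
```

A pair split over two lines, like the by/by example above, is reported on
its second line.  Memory use is bounded by the longest line rather than by
the size of the file.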