Path: utzoo!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!pacific.mps.ohio-state.edu!linac,convex!news
From: tchrist@convex.COM (Tom Christiansen)
Newsgroups: comp.unix.questions
Subject: Re: Text Processing Question
Message-ID: <1991Mar18.051909.19578@convex.com>
Date: 18 Mar 91 05:19:09 GMT
References: <31134@usc> <1991Mar18.013647.7570@midway.uchicago.edu>
Sender: news@convex.com (news access account)
Reply-To: tchrist@convex.COM (Tom Christiansen)
Distribution: usa
Organization: CONVEX Software Development, Richardson, TX
Lines: 44
Nntp-Posting-Host: pixel.convex.com

From the keyboard of goer@ellis.uchicago.edu (Richard L. Goerwitz):
:In article <31134@usc> rkumar@buddha.usc.edu (C.P. Ravikumar) writes:
:
:>I was wondering if there is a utility to check
:>for repetition of words in a document....
:>
:>I have the feeling this can be done using "awk".
:
:The hard part, as always, is settling on a field separator -

Perhaps.  I always thought the hard part was catching pairs of
words that extend over line boundaries.  Here's a perl version
that catches these, although I admit it's probably overkill to
suck up the whole file into memory before munging it.  Works fine
on my machine. :-)

Here's the output when run on my C compiler man page:

/usr/man/man1/cc.1:
   39 compiler.  Certain extensions, notably the [* long long *] type,
   57 Forces language and library interpretation based on [* the the *] original
  770 Each library has a profiled version whose name is formed [* by
  771 by *] inserting \(lq_p\(rq before the \(lq.a\(rq.

The precise definition of what constitutes a repeated word (and
what legit separators are) will vary according to taste.  I chose
identifier-like tokens separated by white space.

Speed (and definitely memory) optimizations are certainly possible,
but this does the job well enough for me.

The program (not line noise :-) follows:

--tom

#!/usr/bin/perl

undef $/; $* = 1;               # process whole file

while ( $ARGV = shift ) {
    if (!open ARGV) { warn "$ARGV: $!\n"; next; }
    $_ = <>;
    # bracket each run of repeated words with \200 markers;
    # skip the file if there are none
    s/\b(\s?)(([a-z]\w*)(\s+\3)+\b)/$1\200$2\200/g || next;
    split(/\n/);                # with no target array, split fills @_
    $n = 0;
    @hits = ();
    for (@_) {                  # keep numbered lines that hold a marker
        $n++;
        push(@hits, sprintf("%5d %s", $n, $_)) if /\200/;
    }
    $_ = join("\n",@hits);
    s/\200([^\200]+)\200/[* $1 *]/g;    # marker pairs become [* ... *]
    print "$ARGV:\n$_\n";
}
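
For anyone without perl to hand, the same idea can be sketched in Python. This is a rough port under stated assumptions, not the program above: the names `REPEAT` and `mark_repeats` are made up for illustration, and it finds the affected lines from match offsets rather than from \200 markers, so a run that spans three lines would also report the middle one. It keeps the same token rule (a word starting with a lowercase letter, repeated after white space) and still reads the whole text into memory.

```python
import re

# Assumed token rule, mirroring the perl pattern: an identifier-like word
# that starts with a lowercase letter, followed by one or more
# whitespace-separated copies of itself. \s also matches newlines, so a
# run may cross line boundaries.
REPEAT = re.compile(r"\b([a-z]\w*)(?:\s+\1)+\b")


def mark_repeats(text):
    """Return numbered lines touched by a repeated word.

    Each run of repeats is wrapped in [* ... *]; line numbers are 1-based
    and formatted with %5d, as in the sample output.
    """
    hits = set()  # numbers of every line a match starts on, ends on, or crosses
    for m in REPEAT.finditer(text):
        first = text.count("\n", 0, m.start()) + 1
        last = text.count("\n", 0, m.end()) + 1
        hits.update(range(first, last + 1))

    # The replacement adds no newlines, so line numbers stay valid.
    marked = REPEAT.sub(lambda m: "[* " + m.group(0) + " *]", text)
    lines = marked.split("\n")
    return ["%5d %s" % (n, lines[n - 1]) for n in sorted(hits)]
```

Calling `mark_repeats(open(path).read())` for each file and printing the returned lines under the file name gives a report in the same shape as the one shown for cc.1. Like the perl pattern, it is case-sensitive, so "The the" goes unreported.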