Path: utzoo!yunexus!geac!torsqnt!news-server.csri.toronto.edu!mailrus!cs.utexas.edu!sdd.hp.com!elroy.jpl.nasa.gov!jpl-devvax!lwall
From: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall)
Newsgroups: comp.lang.perl
Subject: Re: How can I scan one file with a list of RE's from another efficently?.
Message-ID: <8225@jpl-devvax.JPL.NASA.GOV>
Date: 29 May 90 18:05:56 GMT
Article-I.D.: jpl-devv.8225
References: <694@hades.OZ>
Reply-To: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall)
Organization: Jet Propulsion Laboratory, Pasadena, CA
Lines: 97

In article <694@hades.OZ> greyham@hades.OZ (Greyham Stoney) writes:
: I've put together a perl script which does the job, but BOY is it
: slow!; I imagine because it has to compile that RE thousands of times.

That's the primary problem.  A secondary problem is the use of subscripts
to index into arrays.  Whenever you see subscripts in a Perl script, it's
a pretty strong indication that things aren't being done the Perl Way.
Iteration over an array should almost always be done with foreach.

: Can anyone of severe perl guru wizard status suggest a better way of doing
: it? [ doesn't have to use perl, I'm easy ]. I could just use fgrep -f, but the
: list of groups dropped is too long for it to handle, and they're RE's anyway.

RE's aside, a properly written Perl script will beat fgrep at its own game.
The trick is to use Perl's strengths rather than its weaknesses.

In the following, we write a little bit of code that gets eval'ed.  This
lets us compile each pattern just once--a major savings.  Additionally,
since we'll be matching against multiple patterns, we do a study on each
line, which provides additional savings.  The script below is identical to
yours, down to the #CHANGES line.

#!/usr/local/bin/perl

# provide a report (from checkgroups) as to what newsgroups we still get,
# and what ones we don't get.

# slurp in the 'dropped' file.
open(DROPPED, 'dropped');
@dropped = <DROPPED>;
close(DROPPED);
chop (@dropped);	# nuke the \n off the end of each line.

# print it, just for checking.
#print @dropped;
#print $#dropped;

# slurp in the 'checkgroups' file (it's a news article).
open(CHECKGROUPS,'checkgrps.msg');

# skip the header business.
while (<CHECKGROUPS>) {
    if (/^$/) { last; }
}

@checkgroups = <CHECKGROUPS>;
close (CHECKGROUPS);

# print it, just for checking.
#print @checkgroups;
#print $#checkgroups;

# go down each message in the checkgroups, and find whether we get it or not.

#CHANGES BEGIN HERE

$prog = <<'EOF';
foreach $_ (@checkgroups) {
    study;
    $go = 1;		# Assume we get it.
EOF

foreach $pat (@dropped) {
    $prog .= <<"EOF";
    next if /$pat/;
EOF
}

$prog .= <<'EOF';
    $go = 0;
}
continue {
    if ($go) {
	push(@yes, $_);
    }
    else {
	push(@no, $_);
    }
}
EOF

eval $prog; die $@ if $@;

print <
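
For anyone who wants to try the eval trick without the news files at hand,
here is a minimal self-contained sketch of the same idea.  The sample
patterns and group names are made up for illustration, and the array names
@matched and @unmatched are hypothetical; they are not part of the script
above.  It generates one loop whose body holds every pattern as a literal
regex, then evals that loop once, so each pattern is compiled a single time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for the 'dropped' and
# 'checkgroups' files read by the script above.
my @dropped     = ('^alt\.', '^talk\.politics');
my @checkgroups = ("comp.lang.perl\n", "alt.flame\n",
                   "talk.politics.misc\n", "rec.humor\n");

my (@matched, @unmatched);

# Build the source of the loop as a string.  Each pattern is pasted in
# as a literal /.../ match, so it is compiled once when the generated
# code is compiled, not once per input line.
my $prog = "foreach (\@checkgroups) {\n    study;\n";
foreach my $pat (@dropped) {
    $prog .= "    if (/$pat/) { push(\@matched, \$_); next; }\n";
}
$prog .= "    push(\@unmatched, \$_);\n}\n";

# Compile and run the generated loop; report any syntax error in it.
eval $prog;
die $@ if $@;

print "matched:\n", @matched;
print "unmatched:\n", @unmatched;
```

Note that a pattern containing a slash would end the generated /.../ early,
so real input would need such characters escaped before being pasted into
the program text.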