Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sun-barr!newstop!texsun!convex!convex.COM
From: tchrist@convex.COM (Tom Christiansen)
Newsgroups: comp.lang.perl
Subject: Re: fast grep?
Message-ID: <107211@convex.convex.com>
Date: 16 Oct 90 02:00:39 GMT
References: <1990Oct15.211219.8543@wrl.dec.com>
Sender: news@convex.com
Reply-To: tchrist@convex.COM (Tom Christiansen)
Organization: CONVEX Software Development, Richardson, TX
Lines: 62

In article <1990Oct15.211219.8543@wrl.dec.com> vixie@volition.pa.dec.com (Paul Vixie) writes:
>I've a need to grep for a simple pattern in a large file from within a perl
>script.  Simple means no metacharacters, large means many megabytes.  So
>far it looks like "netgrep" -- the B M thing that later turned into GNU grep,
>is the clear winner.

>Have I found something that perl can't do as well as C?  With this
>kind of variance, it's cheaper to fork/exec a bmgrep.

Well, the perl code you used was slightly sub-optimal.  If you slap
an eval around the whole thing, then $pat is interpolated before the
code is compiled, so the pattern becomes a constant string, which is
even better than the /o.  Somewhere I have an article in which Larry
explains that this makes it use BM-style search routines.

I've taken a 4-megabyte file, which is just 40 concatenated termcap
files, and I'm looking for the string /termcap/.  Here are some timings.
The two perl programs are:

    $pat = shift;
    while (<>) {
        print if /$pat/o;
    };

and

    $pat = shift;
    eval "while (<>) { print if /$pat/; }";

Now, in descending real-time order:  (It's too bad that novices
think fgrep stands for fast.)

fgrep
    14.649289 real    12.367839 user     1.743313 sys
    14.582311 real    12.360291 user     1.745801 sys
    14.707298 real    12.364471 user     1.760904 sys

egrep
    12.672381 real     7.283316 user     3.333458 sys
    12.693866 real     7.293646 user     3.369639 sys
    11.212437 real     7.246659 user     3.312308 sys

perl with pat/o
    11.755116 real    10.349145 user     0.414848 sys
    11.619828 real    10.345177 user     0.428762 sys
    11.650266 real    10.342087 user     0.420670 sys

grep
    10.311009 real     9.389851 user     0.493640 sys
    10.238396 real     9.389047 user     0.439769 sys
    10.372305 real     9.392753 user     0.447918 sys

perl with eval
     7.068890 real     5.988363 user     0.377409 sys
     8.725602 real     6.019300 user     0.439920 sys
     7.004554 real     5.983231 user     0.401990 sys

gnugrep
     2.050560 real     1.442311 user     0.494117 sys
     2.164280 real     1.445398 user     0.494487 sys
     2.082105 real     1.443624 user     0.488781 sys

I still can't get it sufficiently close to gnugrep to justify not
running a system() on gnugrep, but even so, there's nothing truly wrong
with that.  I don't think Larry would have made popens and `foo` and
system() so easy to get at if he didn't want you to use them.

Note that perl DOES beat everything except for gnugrep.

One of these days, I've got to see whether I can't vectorize any
of the regexp stuff in perl...

--tom
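
The eval trick in this post pastes $pat straight into the source text,
so a pattern containing a "/" or other punctuation would break the
generated code.  A minimal sketch of a guarded variant follows.  It
assumes the pattern should be matched literally (no metacharacters, as
in the original question) and a perl recent enough for "use strict" and
closures.  make_matcher is a hypothetical helper name, not something
from the post, and no timings were taken for this version.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a matcher whose pattern is a literal constant in the compiled
# code, in the spirit of the eval trick above.
# make_matcher is a hypothetical name chosen for this sketch.
sub make_matcher {
    my ($pat) = @_;

    # quotemeta backslashes every non-word character, so a "/" or "$"
    # in the pattern cannot end the regex early or alter the eval'd code.
    my $quoted = quotemeta $pat;

    # The pattern is interpolated here, before compilation; the escaped
    # \$_[0] stays a variable in the generated sub.
    my $matcher = eval "sub { \$_[0] =~ /$quoted/ }";
    die "could not compile matcher: $@" if $@;
    return $matcher;
}

# Used as a script: first argument is the pattern, the rest are files
# (or standard input if none are given).
if (@ARGV) {
    my $match = make_matcher(shift @ARGV);
    while (my $line = <>) {
        print $line if $match->($line);
    }
}
```

Calling a sub for every line adds overhead that the inlined loop in the
post does not have, so for raw speed one would more likely paste the
quotemeta'd pattern into the whole eval'd while loop instead.  Whether
the escaped pattern still gets the fast fixed-string search depends on
the perl version and was not measured here.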