Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!cs.utexas.edu!convex!usenet
From: tchrist@convex.COM (Tom Christiansen)
Newsgroups: comp.lang.perl
Subject: Re: Counting RE occurrences
Message-ID: <1991May13.225603.29819@convex.com>
Date: 13 May 91 22:56:03 GMT
References: <1991May13.184504.13844@demon.co.uk>
Sender: usenet@convex.com (news access account)
Reply-To: tchrist@convex.COM (Tom Christiansen)
Organization: CONVEX Software Development, Richardson, TX
Lines: 124
Nntp-Posting-Host: pixel.convex.com

From the keyboard of Paul Moore:
:This is one of those problems which I am convinced ought to have a simple
:(probably one-line) solution in perl, but I sure can't find it...
:
:I have a string, which contains a piece of text. I also have a regular
:expression. I want to count the number of times the RE appears in the
:string.
:
:As an example (this is the task which first made me want to do this), I
:have a file, which has been copied from an MS-DOS box to my (non-MS-DOS)
:machine. So the lines in the file are delimited by "\r\n", and not just
:"\n". I have slurped the file into a string, in order to do some processing,
:and I need to count the number of lines. So what I want to do is count the
:number of occurrences of the string "\r\n" in the string.
:
:    open(DOS,"Ms-dos-file");
:    undef $/;
:    $str = <DOS>;                       # Slurp
:    .... processing on $str ...
:    $lines = &count($str, "\r\n");      # Somehow...
:    .... more processing ...

I don't know what else you're doing, but I would think that slurping
is a pretty inefficient way.  I usually try to avoid it.  It sure
does make some things easier, though.

:The only way I can see, which works for a general RE, is
:
:    $count = ($str =~ s/RE/$&/g);
:
:but the idea of doing global substitution, and using $&, strikes me as
:a bit inefficient...

You could make it faster if you could throw out the $&, but that's
not good for a general routine.
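For concreteness, here is a minimal sketch of the s/RE/$&/g counting trick
wrapped up as a general routine.  It is written for Perl 5 (it uses "my"),
and count_matches is a made-up name for illustration, not anything from
Paul's code or the perl distribution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_matches is a hypothetical helper, not part of the thread.
# It counts non-overlapping matches of $re in $str using the
# s/RE/$&/g trick: s///g returns the number of substitutions made,
# or an empty string when nothing matched.
sub count_matches {
    my ($str, $re) = @_;            # $str is a copy; the caller's string is untouched
    my $n = ($str =~ s/$re/$&/g);   # replace each match with itself, keep the count
    return $n || 0;                 # normalise "no match" to 0
}

my $dos_text = "one\r\ntwo\r\nthree\r\n";
print count_matches($dos_text, "\r\n"), "\n";      # CRLF line endings
print count_matches($dos_text, '\bt\w+'), "\n";    # words starting with "t"
```

Every match gets copied back into the string, which is exactly the overhead
Paul is worried about; the routine stays general only because it pays that
cost.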

:Another example, which shows why a general RE is better than just a string,
:is if I am trying to write a wc clone. So we have
:
:    open(FILE, $ARGV[1]);
:    undef $/;
:    $str = <FILE>;
:    $chars = length($str);
:    # Don't worry about funny line terminators this time, and note
:    # that we can use the return value of tr/// for single character
:    # counts...
:    $lines = ($str =~ tr/\n//);
:
:It seems to me that a nice way of counting words would be to count the
:occurrences of the pattern /\b/, and divide by 2. With perl's blindingly
:efficient pattern matching, this may be a very fast method.
:
:Obviously, in most individual cases, there are alternative ways of doing
:what I want. However, counting REs strikes me as a very "perl-ish" sort
:of activity, and I would have expected it to be built in, somehow.
:Perhaps as the return value of m// (which specifically isn't the case).
:
:Comments, anyone?

Larry has posted musings about adding a /g switch to the m// operator,
or making a g// operator.  There are two things this could do:

    $count = ($str =~ /pat/g);

would get what you want.  Another possibility is to keep some state
around, as in

    while ($str =~ /pat/g) {
        $len += length $`;
        do munge($&);
    }

I'm not sure that these two uses are compatible.  To overload the two
uses would (to my mind) mean Larry would have to have it know whether
it's in a loop, which is even more context-sensitivity in a language
where folks are already shooting themselves in the foot with context
anyway.  The first use would be easier to implement, I think, and more
useful, at least in that I believe it would get used more.

For the 2nd use, we could use /i for an incremental match, but no,
that's taken.  How about /p?  No, folks'll expect that to print the
thing, as in sed.  Other ideas?

On wc, here's a wc clone I once wrote.  I don't slurp for speed's sake.

    #!/usr/bin/perl -n

    $lines++;
    $chars += length;
    $words += s/\S+//g;

    next unless eof;

    printf " %7d %7d %7d %s\n", $lines, $words, $chars, ($ARGV eq '-'?'':$ARGV);

    $tlines += $lines;
    $twords += $words;
    $tchars += $chars;

    $chars = $words = $lines = 0;

    printf " %7d %7d %7d %s\n", $tlines, $twords, $tchars, "total"
        if $files++ && eof();

It's a lot slower than the C version.  Probably the s///g is what's
slowing it down.  If that line could be changed to

    $words += /\S+/g;

and Larry were to implement this in any reasonably efficient manner,
it would probably run much faster.  But hey, at least it gets
'wc /vmunix' right. :-)

I don't use $. and close(ARGV) because it confuses the program.

Do you all see that code up there just YEARNING for

    ($tlines, $twords, $tchars) += ($lines, $words, $chars);

I know, I know... along that road lies APL and madness.

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
		"So much mail, so little time."
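
A postscript on the two /g uses, for anyone on a later perl: as far as I
know, the m//g that Perl 5 ended up with handles both, and picks between
them by list versus scalar context rather than by noticing a loop.  The
sketch below assumes Perl 5; the sample string and variable names are made
up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "the quick brown fox";

# Use 1: count the matches.  m//g in list context returns every match;
# assigning that list to () and reading the assignment as a scalar
# yields the number of elements.
my $count = () = $str =~ /\S+/g;

# Use 2: keep state between matches.  In scalar context each call to
# m//g resumes where the previous match ended.
my $len = 0;
while ($str =~ /\S+/g) {
    $len += length $&;    # running total of the matched words' lengths
}

print "$count words, $len non-blank characters\n";
```

Neither form modifies $str, so the word count in the wc clone would no
longer need s///g to eat the line.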