Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!cs.utexas.edu!convex!usenet
From: tchrist@convex.COM (Tom Christiansen)
Newsgroups: comp.lang.perl
Subject: Re: Counting RE occurrences
Message-ID: <1991May13.225603.29819@convex.com>
Date: 13 May 91 22:56:03 GMT
References: <1991May13.184504.13844@demon.co.uk>
Sender: usenet@convex.com (news access account)
Reply-To: tchrist@convex.COM (Tom Christiansen)
Organization: CONVEX Software Development, Richardson, TX
Lines: 124
Nntp-Posting-Host: pixel.convex.com

From the keyboard of Paul Moore <pmoore@cix.compulink.co.uk>:
:This is one of those problems which I am convinced ought to have a simple
:(probably one-line) solution in perl, but I sure can't find it...

:I have a string, which contains a piece of text. I also have a regular
:expression. I want to count the number of times the RE appears in the
:string. 

:As an example (this is the task which first made me want to do this), I
:have a file, which has been copied from an MS-DOS box to my (non-MS-DOS)
:machine. So the lines in the file are delimited by "\r\n", and not just
:"\n". I have slurped the file into a string, in order to do some processing,
:and I need to count the number of lines. So what I want to do is count the
:number of occurrences of the string "\r\n" in the string.

:        open(DOS,"Ms-dos-file");
:        undef $/;
:        $str = <DOS>;                      # Slurp
:        .... processing on $str ...
:        $lines = &count($str, "\r\n");     # Somehow...
:        .... more processing ...

    I don't know what else you're doing, but slurping the whole file
    strikes me as a pretty inefficient way to go about it.  I usually
    try to avoid it.  It sure does make some things easier, though.

:The only way I can see, which works for a general RE, is
:        $count = ($str =~ s/RE/$&/g);

:but the idea of doing global substitution, and using $&, strikes me as
:a bit inefficient...

    You could make it faster by throwing out the $& and writing the
    replacement out literally, but that only works for a fixed string,
    so it's no good for a general routine.
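
    Here is a minimal sketch of both flavors, offered as one way to do
    it and not the last word.  The name count comes from your example;
    count_crlf is a name made up for illustration.  Both work on a copy
    of the string, so the caller's data is left alone.

```perl
# General version: replace each match with itself and let s///g
# report how many substitutions it made.
sub count {
    local($str, $pat) = @_;          # copies, so the caller's string is safe
    ($str =~ s/$pat/$&/g) + 0;       # + 0 turns "no match" (false) into 0
}

# Fixed-string version: the replacement is spelled out, so $& is not needed.
sub count_crlf {
    local($str) = @_;
    ($str =~ s/\r\n/\r\n/g) + 0;
}
```

    With these, &count($str, "\r\n") and &count_crlf($str) should agree,
    and &count("hello world", '\b') should come out as 4.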


:Another example, which shows why a general RE is better than just a string,
:is if I am trying to write a wc clone. So we have

:        open(FILE, $ARGV[0]);
:        undef $/;
:        $str = <FILE>;
:        $chars = length($str);
:        # Don't worry about funny line terminators this time, and note
:        # that we can use the return value of tr/// for single character
:        # counts...
:        $lines = ($str =~ tr/\n//);

:It seems to me that a nice way of counting words would be to count the
:occurrences of the pattern /\b/, and divide by 2. With perl's blindingly
:efficient pattern matching, this may be a very fast method.

:Obviously, in most individual cases, there are alternative ways of doing
:what I want. However, counting REs strikes me as a very "perl-ish" sort
:of activity, and I would have expected it to be built in, somehow.
:Perhaps as the return value of m// (which specifically isn't the case).

:Comments, anyone?


Larry has posted musings about adding a /g switch to the m// operator, or
making a g// operator.  There are two things this could do.  The first,

       $count = ($str =~ /pat/g);

would return the number of matches, which is what you want.  The second
possibility is to keep some state around between matches, as in
    
	while ($str =~ /pat/g) {
	    $len += length $`;
	    do munge($&);
	}

I'm not sure that these two uses are compatible.  To support both,
Larry would (to my mind) have to make the operator know whether
it's in a loop, which adds even more context-sensitivity to a
language where folks are already shooting themselves in the foot
with context.

The first use would be easier to implement, I think, and more useful, at
least in that I believe it would get used more.

For the 2nd use, we could use /i for an incremental match, but
no, that's taken.  How about /p?  No, folks'll expect that to 
print the thing, as in sed.  Other ideas?
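
In the meantime, the second use can be faked without any new switch by
chewing through the string with $` and $'.  What follows is only a rough
sketch: match_offsets is a made-up name, and it returns the offset of
each match instead of calling a munge routine.  Beware that anchors such
as ^ or \b get applied to the chopped-off remainder, not the original
string, so this is not a faithful stand-in for a true incremental match.

```perl
# Return the starting offset of every non-empty match of $pat in the string.
sub match_offsets {
    local($rest, $pat) = @_;
    local($pos, @offsets) = (0);          # $pos: how much has been consumed
    while ($rest =~ /$pat/) {
        last if length($&) == 0;          # an empty match would loop forever
        push(@offsets, $pos + length($`));
        $pos += length($`) + length($&);
        $rest = $';                       # carry on after this match
    }
    @offsets;
}
```

Each pass copies the rest of the string, so this is hardly fast on a big
slurped file, which is part of why a built-in would be nice.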


On wc, here's a wc clone I once wrote.  I don't slurp for speed's sake.

    #!/usr/bin/perl -n
    $lines++;
    $chars += length;
    $words += s/\S+//g;
    next unless eof;
    printf " %7d %7d %7d %s\n", $lines, $words, $chars, ($ARGV eq '-'?'':$ARGV);
    $tlines += $lines; 
    $twords += $words; 
    $tchars += $chars; 
    $chars = $words = $lines = 0;
    printf " %7d %7d %7d %s\n", $tlines, $twords, $tchars, "total" 
	if $files++ && eof();

It's a lot slower than the C version.  Probably the s///g is what's 
slowing it down.  If that line could be changed to 
    
    $words += /\S+/g;

and Larry were to implement this in any reasonably efficient manner,
it would probably run much faster.
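
Until then, one alternative worth trying is to count the fields that
split(' ') produces, since that throws away leading whitespace and
splits on runs of whitespace, which gives the same count as the number
of \S+ runs.  I have not timed it, so whether it actually beats the
s///g is an assumption; count_words is a name made up for this sketch.

```perl
# Count whitespace-separated words in one line without modifying it.
sub count_words {
    local($line) = @_;
    local(@fields) = split(' ', $line);   # ' ' skips leading whitespace
    scalar(@fields);                      # number of fields == number of words
}
```

In the script above that would read $words += &count_words($_); in place
of the s///g line.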

But hey, at least it gets 'wc /vmunix' right. :-)

I don't use $. and close(ARGV) because it confuses the program.

Do you all see that code up there just YEARNING for 
    
    ($tlines, $twords, $tchars) += ($lines, $words, $chars);

I know, I know... along that road lies APL and madness.
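
For the record, the long way round looks something like this if you keep
the totals in an array.  It's only a sketch: @totals and add_counts are
names invented here, not anything in the script above.

```perl
@totals = (0, 0, 0);                 # running (lines, words, chars)

# Add a list of per-file counts into @totals, element by element.
sub add_counts {
    local(@counts) = @_;
    local($i);
    for ($i = 0; $i <= $#counts; $i++) {
        $totals[$i] += $counts[$i];
    }
}
```

Then &add_counts($lines, $words, $chars); stands in for the three
separate += lines.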

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
		"So much mail, so little time."