Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!uunet!mcsun!ukc!slxsys!ibmpcug!demon!news
From: pmoore@cix.compulink.co.uk (Paul Moore)
Newsgroups: comp.lang.perl
Subject: Counting RE occurrences
Message-ID: <1991May13.184504.13844@demon.co.uk>
Date: 13 May 91 18:45:04 GMT
Sender: news@demon.co.uk (C-News Owner)
Reply-To: Paul Moore <pmoore@cix.compulink.co.uk>
Organization: Gated to News by demon.co.uk
Lines: 72

This is one of those problems which I am convinced ought to have a simple
(probably one-line) solution in perl, but I sure can't find it...

I have a string, which contains a piece of text. I also have a regular
expression. I want to count the number of times the RE appears in the
string. I am aware that obnoxious REs, such as ones which match the empty
string, and ones which overlap themselves, can make even *defining* the
idea of "the number of times this RE appears in this string" difficult,
but for straightforward cases the intention is clear.
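
To make the ambiguity concrete, here is a small sketch of how the count
changes with the definition. It assumes a Perl 5 interpreter (for "my",
lookahead, and counting a list-context match), and the variable names are
purely illustrative:

```perl
use strict;
use warnings;

my $text = "aaaa";

# Non-overlapping: each match resumes where the previous one ended.
my $non_overlapping = () = $text =~ /aa/g;

# Overlapping: a zero-width lookahead succeeds at every position where
# "aa" starts, without consuming any characters.
my $overlapping = () = $text =~ /(?=aa)/g;

# A pattern that can match the empty string succeeds at every position,
# including the very end of the string.
my $empty = () = $text =~ /x*/g;

print "non-overlapping: $non_overlapping\n";
print "overlapping:     $overlapping\n";
print "empty-capable:   $empty\n";
```

The same four-character string gives three different answers, which is why
I only really care about the straightforward, non-overlapping case.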

As an example (this is the task which first made me want to do this), I
have a file, which has been copied from an MS-DOS box to my (non-MS-DOS)
machine. So the lines in the file are delimited by "\r\n", and not just
"\n". I have slurped the file into a string, in order to do some processing,
and I need to count the number of lines. So what I want to do is count the
number of occurrences of "\r\n" in that string.

I.e.,

        open(DOS, "Ms-dos-file") || die "Can't open Ms-dos-file: $!";
        undef $/;                          # No record separator: read it all
        $str = <DOS>;                      # Slurp
        # ... processing on $str ...
        $lines = &count($str, "\r\n");     # Somehow...
        # ... more processing ...

The only way I can see, which works for a general RE, is

        $count = ($str =~ s/RE/$&/g);

but the idea of doing global substitution, and using $&, strikes me as
a bit inefficient...
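
For comparison, one possible shape for such a routine is sketched below.
It assumes Perl 5 and a scalar-context m//g loop, and the name
count_matches is hypothetical (it stands in for the &count above); I am
not claiming it is the canonical answer:

```perl
use strict;
use warnings;

# Hypothetical helper: count non-overlapping matches of $pattern in $text.
# $pattern is interpolated as a regular expression, so metacharacters are
# live; pass it through quotemeta() first to count a literal string.
sub count_matches {
    my ($text, $pattern) = @_;
    my $count = 0;
    # In scalar context, each m//g match resumes where the last one
    # ended, so the loop body runs once per match and the string is
    # never rewritten.
    $count++ while $text =~ /$pattern/g;
    return $count;
}

my $str   = "first\r\nsecond\r\nthird\r\n";
my $lines = count_matches($str, "\r\n");
print "$lines\n";
```

This avoids both the substitution and $&, and the count is unaffected by
any parentheses in the pattern.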

Another example, which shows why a general RE is better than just a string,
is if I am trying to write a wc clone. So we have

        # $ARGV[0] is the first argument (the script name lives in $0)
        open(FILE, $ARGV[0]) || die "Can't open $ARGV[0]: $!";
        undef $/;
        $str = <FILE>;

        $chars = length($str);

        # Don't worry about funny line terminators this time, and note
        # that we can use the return value of tr/// for single character
        # counts...
        $lines = ($str =~ tr/\n//);

It seems to me that a nice way of counting words would be to count the
occurrences of the pattern /\b/ and divide by 2, since every word has
exactly two boundaries (one at its start and one at its end). With perl's
blindingly efficient pattern matching, this may be a very fast method.
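
A rough sketch of the three counts together, assuming Perl 5 and a sample
string in place of the slurped file (the variable names are illustrative,
and the split-based count is only there as a cross-check):

```perl
use strict;
use warnings;

my $str = "the quick brown fox\njumps over\n";

my $chars = length($str);
my $lines = ($str =~ tr/\n//);    # empty replacement list: count only

# Count word boundaries; each run of word characters contributes two.
my $boundaries = 0;
$boundaries++ while $str =~ /\b/g;
my $words_by_boundary = $boundaries / 2;

# Cross-check: split on runs of whitespace, the way wc sees words.
my @fields = split ' ', $str;
my $words_by_split = scalar @fields;

print "$lines $words_by_boundary $chars\n";
```

The two word counts agree here, but they need not in general: /\b/ works
on runs of \w characters, so "don't" would count as two words by
boundaries and as one by whitespace.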

Obviously, in most individual cases, there are alternative ways of doing
what I want. However, counting REs strikes me as a very "perl-ish" sort
of activity, and I would have expected it to be built in, somehow.
Perhaps as the return value of m// (which specifically isn't the case).

Comments, anyone?

Gustav.

PS Sorry if this has already appeared, but I don't think it made it out of
   my system...

E-Mail: pmoore%cix@ukc.ac.uk
    or: gustav@tharr.UUCP