Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!midway!msuinfo!netnews.upenn.edu!jes
From: jes@mbio.med.upenn.edu (Joe Smith)
Newsgroups: comp.lang.perl
Subject: global pattern matching question
Message-ID: <JES.91May3121528@mbio.med.upenn.edu>
Date: 3 May 91 16:15:28 GMT
Sender: news@netnews.upenn.edu
Distribution: comp
Organization: University of Pennsylvania, Philadelphia, PA
Lines: 60
Nntp-Posting-Host: mbio.med.upenn.edu


I'm experimenting with perl to find patterns in DNA sequences.  So
far, the experiment is partially successful.  Can anyone suggest
improvements?  The DNA is represented by a long scalar string
(5,000-10,000 characters is not uncommon), in which I want to find
instances of a pattern.  Here's a first draft:

#!/usr/local/bin/perl

# a test sequence
$seq =  'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
	'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
	'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC' ;

# now the search...
$n = ($seq =~ s/(CC[AT]+)/
	push(@sites, $1),			# record the actual match
	push(@positions, length $`),		# and its position (1)
	$1/gei					# don't change anything (2)
);

print "$n sites:\n";
for ($[..$#sites) {
	print "  $_: $positions[$_], '$sites[$_]'\n";
}
__END__

Here's what I get:

7 sites:
  0: 0, 'CCAA'
  1: 0, 'CCTT'
  2: 0, 'CCTTT'
  3: 0, 'CCA'
  4: 0, 'CCTT'
  5: 0, 'CCAA'
  6: 0, 'CCT'

Note that

  1) The 'length $`' doesn't seem to work while the search is going
     on: every recorded position comes out as 0.  Keeping track of
     the positions that matched is critical, and
     carving the sequence into substrings is likely to be messy and
     slow.  Did I miss something simple?  Would it be possible/useful
     to have perl update a variable with the offset of the beginning
     of the match?

  2) Having to replace the matched pattern with itself seems very
     inefficient (especially when processing a 10Kb string!).  Is
     there any way of doing a similar operation with m//, or perhaps
     tricking s/// into not doing any replacement?
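
For comparison, here is a rough sketch of one way both points might be
handled without s///.  It assumes a perl in which m//g, used in a scalar
context such as a while condition, steps through the string one match at
a time and sets $& and $` for each match.  I have not confirmed that on
every version, so treat it as an idea to try rather than a tested answer.

```perl
#!/usr/local/bin/perl
# Sketch only: assumes m//g iterates over successive matches when it is
# evaluated in a scalar context (here, the while condition).

# the same test sequence as above
$seq =  'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
	'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
	'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC' ;

# each pass through the loop is one match; nothing is substituted
while ($seq =~ /CC[AT]+/gi) {
	push(@sites, $&);		# the text that matched
	push(@positions, length($`));	# 0-based offset where it starts
}

$n = @sites;				# number of matches found
print "$n sites:\n";
foreach $i (0 .. $#sites) {
	print "  $i: $positions[$i], '$sites[$i]'\n";
}
```

Since nothing is written back into $seq, the replace-with-itself step
goes away.  On perls that provide pos(), the start offset could
presumably also be computed as pos($seq) - length($&), which avoids
referring to $` at all.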

Thanks for any suggestions,
<Joe

--
 Joe Smith
 University of Pennsylvania                    jes@mbio.med.upenn.edu
 Dept. of Biochemistry and Biophysics          (215) 898-8348
 Philadelphia, PA 19104-6059