Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!midway!msuinfo!netnews.upenn.edu!jes
From: jes@mbio.med.upenn.edu (Joe Smith)
Newsgroups: comp.lang.perl
Subject: global pattern matching question
Message-ID:
Date: 3 May 91 16:15:28 GMT
Sender: news@netnews.upenn.edu
Distribution: comp
Organization: University of Pennsylvania, Philadelphia, PA
Lines: 60
Nntp-Posting-Host: mbio.med.upenn.edu

I'm experimenting with perl to find patterns in DNA sequences. So far,
the experiment is partially successful. Can anyone suggest improvements?

The DNA is represented by a long scalar string (5000-10,000 characters
is not uncommon), in which I want to find instances of a pattern.
Here's a first draft:

#!/usr/local/bin/perl

# a test sequence
$seq = 'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
       'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
       'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC' ;

# now the search...
$n = ($seq =~ s/(CC[AT]+)/
        push(@sites, $1),               # record the actual match
        push(@positions, length $`),    # and its position        (1)
        $1/gei                          # don't change anything   (2)
     );

print "$n sites:\n";
for ($[..$#sites) {
    print "  $_: $positions[$_], '$sites[$_]'\n";
}
__END__

Here's what I get:

7 sites:
  0: 0, 'CCAA'
  1: 0, 'CCTT'
  2: 0, 'CCTTT'
  3: 0, 'CCA'
  4: 0, 'CCTT'
  5: 0, 'CCAA'
  6: 0, 'CCT'

Note that

1) The 'length $`' doesn't seem to work while the search is going on:
   every position is reported as 0. Keeping track of the positions that
   matched is critical, and carving the sequence into substrings is
   likely to be messy and slow. Did I miss something simple? Would it
   be possible/useful to have perl update a variable with the offset of
   the beginning of the match?

2) Having to replace the matched pattern with itself seems very
   inefficient (especially when processing a 10Kb string!). Is there
   any way of doing a similar operation with m//, or perhaps tricking
   s/// into not doing any replacement?

Thanks for any suggestions,
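
Points (1) and (2) can both be illustrated with a match-only loop. The sketch below is not from the post. It assumes a Perl 5 interpreter, where m//g in scalar context steps through the matches one at a time and the built-in pos() returns the offset just past the most recent match. It reuses the post's test sequence and variable names, and it may not run on the perl of the post's era.

```perl
#!/usr/bin/perl
# Sketch only: assumes Perl 5 (pos() and "my" are not part of the original post).
use strict;
use warnings;

# the test sequence from the post
my $seq = 'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
          'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
          'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC';

my (@sites, @positions);

# In scalar context, each evaluation of m//g resumes where the previous
# match ended, so the loop visits every non-overlapping match in turn.
# Nothing is substituted; $seq is left untouched.
while ($seq =~ /(CC[AT]+)/gi) {
    push @sites, $1;                          # the matched text
    # pos($seq) is the offset just past the match; subtracting the
    # match length gives the 0-based offset where the match starts.
    push @positions, pos($seq) - length($1);
}

my $n = @sites;
print "$n sites:\n";
for my $i (0 .. $#sites) {
    print "  $i: $positions[$i], '$sites[$i]'\n";
}
```

The positions recorded this way are 0-based offsets into $seq, so substr($seq, $positions[$i], length $sites[$i]) should return the corresponding site.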