Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!elroy.jpl.nasa.gov!jpl-devvax!lwall
From: lwall@jpl-devvax.jpl.nasa.gov (Larry Wall)
Newsgroups: comp.lang.perl
Subject: Re: global pattern matching question
Message-ID: <1991May4.010307.11792@jpl-devvax.jpl.nasa.gov>
Date: 4 May 91 01:03:07 GMT
References:
Reply-To: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall)
Distribution: comp
Organization: Jet Propulsion Laboratory, Pasadena, CA
Lines: 64

In article jes@mbio.med.upenn.edu (Joe Smith) writes:
:
: I'm experimenting with perl to find patterns in DNA sequences.  So
: far, the experiment is partially successful.  Can anyone suggest
: improvements?  The DNA is represented by a long scalar string
: (5000-10,000 characters is not uncommon), in which I want to find
: instances of a pattern.  Here's a first draft:
:
: #!/usr/local/bin/perl
:
: # a test sequence
: $seq = 'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
:        'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
:        'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC' ;
:
: # now the search...
: $n = ($seq =~ s/(CC[AT]+)/
:         push(@sites, $1),              # record the actual match
:         push(@positions, length $`),   # and its position (1)
:         $1/gei                         # don't change anything (2)
:       );
:
: print "$n sites:\n";
: for ($[..$#sites) {
:     print "  $_: $positions[$_], '$sites[$_]'\n";
: }
: __END__
:
: Here's what I get:
:
: 7 sites:
:   0: 0, 'CCAA'
:   1: 0, 'CCTT'
:   2: 0, 'CCTTT'
:   3: 0, 'CCA'
:   4: 0, 'CCTT'
:   5: 0, 'CCAA'
:   6: 0, 'CCT'
:
: Note that
:
: 1) The 'length $`' doesn't seem to work while the search is going
:    on.  Keeping track of the positions that matched is critical, and
:    carving the sequence into substrings is likely to be messy and
:    slow.  Did I miss something simple?  Would it be possible/useful
:    to have perl update a variable with the offset of the beginning
:    of the match?

In 4.003, $` is unfortunately broken within s/// because of a fix to
something else.  That'll be fixed in patch 4.
As a workaround you could use length($seq) - length($') - length($1).
Yeah, blech...

Alternately, you could use split.

: 2) Having to replace the matched pattern with itself seems very
:    inefficient (especially when processing a 10Kb string!).  Is
:    there any way of doing a similar operation with m//, or perhaps
:    tricking s/// into not doing any replacement?

An option to s/// not to do replacement would be interesting, though
there might be a better way--maybe an initial offset for m//, or
allowing patterns in index().

Larry
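
Spelled out against the script from the original article, the length()
workaround might look something like the following.  This is only a
sketch: it assumes $' and $1 still behave inside the replacement (only
$` was reported broken in 4.003), and $len is a name made up here so
that the length is taken once, before s/// starts on $seq.

```perl
#!/usr/local/bin/perl

# the test sequence from the original article
$seq = 'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
       'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
       'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC';

# $len is a made-up name: the total length, taken once up front
$len = length($seq);

# For each match: record the matched text, then record its starting
# offset as (total length) - (text after match) - (match itself),
# and finally hand the match back so the string is left unchanged.
$n = ($seq =~ s/(CC[AT]+)/
        push(@sites, $1),
        push(@positions, $len - length($') - length($1)),
        $1/gei);

print "$n sites:\n";
for (0..$#sites) {
    print "  $_: $positions[$_], '$sites[$_]'\n";
}
```

A cheap sanity check on the result is that, for every $i,
substr($seq, $positions[$i], length($sites[$i])) should come back
equal to $sites[$i].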