Path: utzoo!mnetor!tmsoft!torsqnt!jarvis.csri.toronto.edu!cs.utexas.edu!usc!elroy.jpl.nasa.gov!jpl-devvax!lwall From: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) Newsgroups: comp.lang.perl Subject: Re: perl slowness with regexp Message-ID: <7292@jpl-devvax.JPL.NASA.GOV> Date: 6 Mar 90 17:39:14 GMT References: Reply-To: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) Organization: Jet Propulsion Laboratory, Pasadena, CA Lines: 38 In article jbw@bucsf.bu.edu (Joe Wells) writes: : I ran into some interesting perl behavior recently. The following short : program will illustrate it: : : for (1 .. 20) : { : $_ = 'Y' . ('x' x $_); : print "_: $_\n"; : m/(.+)*Y/; : print "`: $`, 1: $1, &: $&, ': $'\n"; : } : : If you run this program, you will notice that each iteration of the loop : takes much longer than the previous iteration. Am I correct in guessing : that this is perl's way of telling me don't do that? Or is this kind of : regular expression optimizable? What you've just discovered is that with a backtracking mechanism it's easy to fall into combinatorial explosions. Saying /.+/ gives you one level of backtracking; saying /(.+)*/ backtracks for every state of the inner backtrack. As to whether it's optimizable or not /(.+)*/ is obviously equivalent to /.*/, but I don't think it's a worthwhile optimization to put in, because it will almost never arise in real life (your own problem has other stuff in with the \S+). There is, of course, the optimization of using a non-bactracking algorithm like egrep, but then you wouldn't be able to isolate $1, because the way the states are constructed in the DFA, in a particular state you could be just starting or just ending or in the middle of or nowhere near any number of potential $1's. Well, maybe not any number. It might just be possible to construct your DFA such that it kept an array of potential $1's, but I get a headache when I look at GNU egrep. And when GNU egrep itself sees backreferences it punts and goes to a backtracking algorithm. So yes, don't do that. There's almost always a much more efficient way, and probably a more readable one at that. Larry