Path: utzoo!utgpu!water!watmath!clyde!att!pacbell!ames!mailrus!iuvax!pur-ee!a.cs.uiuc.edu!m.cs.uiuc.edu!liberte From: liberte@m.cs.uiuc.edu Newsgroups: comp.editors Subject: Re: pattern matches Message-ID: <37200011@m.cs.uiuc.edu> Date: 21 Jul 88 23:02:00 GMT References: <19413@cornell.UUCP> Lines: 71 Nf-ID: #R:cornell.UUCP:19413:m.cs.uiuc.edu:37200011:000:3545 Nf-From: m.cs.uiuc.edu!liberte Jul 21 18:02:00 1988 Mike Haertel has written an e?grep for FSF. I mailed the comp.editors discussion on regexp negation to him. He asked me to post his response. Dan LaLiberte uiucdcs!liberte liberte@cs.uiuc.edu liberte%a.cs.uiuc.edu@uiucvmd.bitnet ------------------------------------- From: mike@wheaties.ai.mit.edu (Mike Haertel) Negation is a good idea if (1) we can decide exactly what it should mean and (2) how to implement it. Several weeks ago, just after I finished GNU e?grep, I gave some consideration to the problem. In your summary of the net traffic, there is some obvious dissension as to what precisely the semantics of negation would be. The only clear definition I could come up for a complement operator is as follows: if foo describes some regular set (of strings) R, then !(foo) describes the complement set ~R (with respect to the universe of all strings). This is well-defined, since the set of regular sets is closed under union, intersection, and complement. This definition is unlikely to do what people would (naively) expect in the simple cases, because of the typical semantics of "find the first and longest match, considering all possible substrings of the text." For example, in an editor, "!bar" would most likely match from the current position to the end of the buffer (unless the text from the current position to the end of the buffer were exactly "bar", in which case it would match "ba"). On the other hand, constructs such as "foo(!bar)baz" would be useful, but would usually require scanning to the end of the buffer (unless matching were restricted to one line at a time). Negation has been studied in the theoretical setting, along with intersection (ever wanted an "&" operator?). It can lead to bizarre consequences; in fact, the problem of building a complete DFA from a regexp including "|", "&", and "!" turns out to be one of the most intractable problems imaginable. "The Design and Analysis of Computer Algorithms," (by Aho, Hopcroft, and Ullman) contains a discussion of this in chapter 11, titled "Some Provably Intractable Problems." Nobody's suggestions about how to implement negation have been right. Right now I see an obvious way to do it, but (1) I might be wrong, and (2) it would require complete replacement of the routines I already have for building a DFA. Since negation doesn't actually expand the set of strings that can be described, I think other considerations (such as figuring out how to do Boyer-Moore style searches based on regexps and not just fixed strings) should be given higher priority. Right now I am working on diff and don't have any time for grep other than bug fixes. What all this adds up to is that I'm not going to do anything about this just now. After the summer is over and I'm back in school, or maybe when I go to graduate school, I hope to look further into the problem if somebody else hasn't already. GNU e?grep is available on prep.ai.mit.edu via anonymous ftp, in the directory /u/emacs/grep. jaw@ames.arc.nasa.gov as a version of it available in ~ftp/pub on that host that has some simple boyer-moore enhancements added. My original version tends to run slightly more than twice as fast as Unix egrep (the fastest of the Unix greps). With his enhancements, it can beat Vax microcode searching for fixed strings. You may wish to forward this to comp.editors as I do not have reliable access to news here. Also, I would appreciate it if you could forward copies of any future netnews traffic about this to me (mike@wheaties.ai.mit.edu).