Path: utzoo!utgpu!water!watmath!clyde!att!pacbell!ames!mailrus!iuvax!pur-ee!a.cs.uiuc.edu!m.cs.uiuc.edu!liberte
From: liberte@m.cs.uiuc.edu
Newsgroups: comp.editors
Subject: Re: pattern matches
Message-ID: <37200011@m.cs.uiuc.edu>
Date: 21 Jul 88 23:02:00 GMT
References: <19413@cornell.UUCP>
Lines: 71
Nf-ID: #R:cornell.UUCP:19413:m.cs.uiuc.edu:37200011:000:3545
Nf-From: m.cs.uiuc.edu!liberte    Jul 21 18:02:00 1988


Mike Haertel has written an e?grep for FSF.  I mailed the
comp.editors discussion on regexp negation to him.  He asked me to
post his response.


Dan LaLiberte
uiucdcs!liberte
liberte@cs.uiuc.edu
liberte%a.cs.uiuc.edu@uiucvmd.bitnet
-------------------------------------

From: mike@wheaties.ai.mit.edu (Mike Haertel)

Negation is a good idea if (1) we can decide exactly what it should
mean and (2) how to implement it.

Several weeks ago, just after I finished GNU e?grep, I gave some
consideration to the problem.  In your summary of the net traffic,
there is some obvious dissension as to what precisely the semantics of
negation would be.

The only clear definition I could come up for a complement operator is
as follows:  if foo describes some regular set (of strings) R, then
!(foo) describes the complement set ~R (with respect to the universe of
all strings).  This is well-defined, since the set of regular sets is
closed under union, intersection, and complement.

This definition is unlikely to do what people would (naively) expect in
the simple cases, because of the typical semantics of "find the first
and longest match, considering all possible substrings of the text."
For example, in an editor, "!bar" would most likely match from the
current position to the end of the buffer (unless the text from the
current position to the end of the buffer were exactly "bar", in which
case it would match "ba").  On the other hand, constructs such as
"foo(!bar)baz" would be useful, but would usually require scanning to
the end of the buffer (unless matching were restricted to one line at a
time).

Negation has been studied in the theoretical setting, along with
intersection (ever wanted an "&" operator?).  It can lead to bizarre
consequences; in fact, the problem of building a complete DFA from a
regexp including "|", "&", and "!" turns out to be one of the most
intractable problems imaginable.  "The Design and Analysis of Computer
Algorithms," (by Aho, Hopcroft, and Ullman) contains a discussion of
this in chapter 11, titled "Some Provably Intractable Problems."

Nobody's suggestions about how to implement negation have been right.
Right now I see an obvious way to do it, but (1) I might be wrong, and
(2) it would require complete replacement of the routines I already
have for building a DFA.  Since negation doesn't actually expand the
set of strings that can be described, I think other considerations
(such as figuring out how to do Boyer-Moore style searches based on
regexps and not just fixed strings) should be given higher priority.
Right now I am working on diff and don't have any time for grep other
than bug fixes.  What all this adds up to is that I'm not going to do
anything about this just now.  After the summer is over and I'm back in
school, or maybe when I go to graduate school, I hope to look further
into the problem if somebody else hasn't already.

GNU e?grep is available on prep.ai.mit.edu via anonymous ftp, in the
directory /u/emacs/grep.  jaw@ames.arc.nasa.gov as a version of it
available in ~ftp/pub on that host that has some simple boyer-moore
enhancements added.  My original version tends to run slightly more
than twice as fast as Unix egrep (the fastest of the Unix greps).  With
his enhancements, it can beat Vax microcode searching for fixed strings.

You may wish to forward this to comp.editors as I do not have reliable
access to news here.  Also, I would appreciate it if you could forward
copies of any future netnews traffic about this to me
(mike@wheaties.ai.mit.edu).