Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!mailrus!cornell!blandy From: blandy@awamore.cs.cornell.edu (Jim Blandy) Newsgroups: comp.editors Subject: Re: pattern matches Message-ID: <19413@cornell.UUCP> Date: 20 Jul 88 16:48:56 GMT References: <427@grand.UUCP> <37200009@m.cs.uiuc.edu> <18838@cornell.UUCP> <295@hobbit.sci.kun.nl> Sender: nobody@cornell.UUCP Reply-To: blandy@cs.cornell.edu (Jim Blandy) Organization: Cornell Univ. CS Dept, Ithaca NY Lines: 61 Thinking about pattern complements: In defining this, we should respect the automata theory behind regular expressions, since lots of the algorithms in use today make use of the nice properties of regular sets. A regular expression denotes a set of strings (a string matches the RE if it's in the set), so the complement of a RE is the RE denoting the complement of the original set. Definition: If R is a regular expression, (R)^ is a regular expression matching the complement of R; a string matches (R)^ iff that string does NOT match R. This doesn't work the way one might think it does - for example, the RE "d" denotes the set of strings { "d" }. The complement of this is NOT the set of one-character strings <> "d". It includes things like the null string. So the regexp "abc(d)^.*" (meaning "abc", not "d", any string) DOES match the string "abcd" : the subexpression "(d)^" matches the null string, and the ".*" matches the "d" at the end of "abcd" . However, this definition IS useful. I'm looking at a Technical Report describing the ANSI C grammar, and they've got a lex scanner here. The pattern they give to match comments is: "/*"([^*]*[*]+[^\*\/])*(^*]*[*]+\/) It's a real monstrosity, but necessary because a comment cannot include the string "*/" . Under the longest-match rule, the line /* a comment */ some C code /* another comment */ would scan as one long comment if they hadn't gone to the trouble of excluding it and used the more obvious pattern "/*" .* "*/" (spaces for clarity) Here's where negation comes in: if I've got it right, that monstrosity can be rewritten with the negation operator as "/*" (.* "*/" .*)^ "*/" which is a LOT easier to understand: the pattern .* "*/" .* matches any string containing "*/". So, its complement matches any string NOT containing "*/". So my comment pattern reads as "a comment introducer ("/*"), followed by a string not containing a comment end ("*/"), followed by a comment end." A big win for clarity. It just seemed to me that the issue needed a clear discussion about what the complement of a regular expression really is, and some examples of how it should work. I think it's a good idea. I think one could extend the Thompson construction (turning an RE into an NFA) to handle this, but it won't fit in without some modification... -- Jim Blandy - blandy@crnlcs.bitnet "insects were insects when man was just a burbling whatisit." - archy