Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uunet!snorkelwacker!spdcc!ima!esegue!compilers-sender
From: boehm@flora.rice.edu (Hans Boehm)
Newsgroups: comp.compilers
Subject: Re: right context in scanner generators (was Re: LEX and YACC)
Message-ID: <1989Nov12.223642.14219@esegue.segue.boston.ma.us>
Date: 12 Nov 89 22:36:42 GMT
References: <1989Nov11.161355.10081@esegue.segue.boston.ma.us> <1989Nov12.041025.12451@esegue.segue.boston.ma.us>
Sender: compilers-sender@esegue.segue.boston.ma.us
Reply-To: boehm@flora.rice.edu (Hans Boehm)
Organization: Rice University, Houston
Lines: 56
Approved: compilers@esegue.segue.boston.ma.us

The moderator concluded my previous message with the following questions:

>[How about stepping across the matched token, trying to match the second
>expression; when it matches all the way to the end you've found the right
>context.  It's never been clear to me how useful lex-style right context
>is.  Anyone have some real examples? -John]

Here are some partial answers:

I may be wrong, but it has been my impression that lex is capable of
building Fortran scanners, which is certainly no easy task.  For example,
isn't it the case that DO can be recognized by looking for a `d' followed
by an `o' in a right context in which a `,' precedes the first left
parenthesis?  My impression is that nobody has actually built a Fortran
scanner this way, but I assumed that was for performance reasons.

Trying to match the context expression separately is, at least in some
theoretical sense, not very efficient.  If more than one regular
expression may match, it may be necessary to look for several different
right contexts at several different starting positions.  This would
involve backtracking within the recognition of a single token.
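[As an illustration only: the right-context idea above can be sketched with
a regular-expression lookahead.  This is not lex, and the "comma before the
first left parenthesis" rule is the same simplification used in the message,
not the full Fortran DO rule; the function name is made up for the sketch.]

```python
import re

# Blanks are insignificant in Fortran, so "DO10I=1,10" begins a DO loop
# while "DO10I=1.10" is an assignment to the variable DO10I.  Simplified
# rule from the message: after "DO", treat it as a keyword only if a
# comma appears before the first left parenthesis in the rest of the
# statement.  The lookahead (?=...) plays the role of lex right context:
# it constrains the match without consuming the context characters.
DO_PATTERN = re.compile(r'DO(?=[^(]*,)', re.IGNORECASE)

def starts_do(stmt):
    """Return True if stmt begins a DO statement under the simplified rule."""
    s = stmt.replace(' ', '')          # Fortran ignores blanks
    return DO_PATTERN.match(s) is not None

print(starts_do('DO 10 I = 1, 10'))   # True  -> DO loop
print(starts_do('DO 10 I = 1.10'))    # False -> assignment to DO10I
```

[Note that, as the moderator observes below, the real rule needs a comma
not inside parentheses, which a pure regular expression cannot check.]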
The current lex algorithm has the property that it needs to simulate only
a single finite automaton, independent of the number of regular
expressions in the specification, and independent of whether or not they
include right context.  It does need to back up in the input at the end
of every token recognition to find the longest match, but this happens
only once per token.  It's a simple scan to find the last time the FA was
in a final state.  Right context is implemented by filtering out certain
final states.

I believe we can keep things down to a single forward and a single
backward scan by building an FA recognizing strings ending in the
reversal of one of the right context expressions, and running it during
the backup phase.  Are there simpler techniques that don't require
repeated backtracking?

Hans-J. Boehm
boehm@rice.edu

[You can't build a Fortran scanner in lex without some external hackery.
In particular, to identify a DO statement you need to look for a comma
not inside of parens to the right of the equal sign, and you can't do
paren matching with regular expressions.  I don't think you can do H
constants such as 3Hfoo in lex, either.  And in a statement like
"DO10E5I=1,10" you need contextual help to tell that the statement number
is 10 rather than 10E5 and the variable is E5I rather than I.  (Nearly
everywhere else that an integer constant is allowed, a real constant is,
too, or at least there's no ambiguity.)  When I wrote an F77 compiler
(INfort, for anyone who may have used it) I did what everyone else does,
which is to prescan each statement to take out the string constants and
tell whether it is an assignment statement or not.  Having done that,
scanning is easy, although it needs a lot of feedback from the parser.
For people who are interested, I am giving away a Fortran subset parser
that I wrote a while ago. -John]
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue.  Meta-mail to compilers-request@esegue.
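[The longest-match rule Boehm describes can be sketched as follows.  This
is a toy DFA encoding, not lex's actual tables; for simplicity the sketch
remembers the last accepting position during the forward scan rather than
buffering states and scanning backward, which is equivalent in effect.]

```python
# Simulate a DFA forward from `pos`, remembering the last position at
# which it was in a final state; when the automaton blocks, "back up"
# to that position.  This is the once-per-token longest-match step.

def longest_match(delta, start, finals, text, pos=0):
    """Return (token_end, final_state) for the longest match starting at
    pos, or None if no final state is ever reached."""
    state = start
    last = None                          # last (end_position, state) accepted
    for i in range(pos, len(text)):
        state = delta.get((state, text[i]))
        if state is None:                # automaton blocks: stop scanning
            break
        if state in finals:
            last = (i + 1, state)        # longest match seen so far
    return last

# Toy DFA for the regular expression a+b? :
delta = {(0, 'a'): 1, (1, 'a'): 1, (1, 'b'): 2}
print(longest_match(delta, 0, {1, 2}, 'aaabab'))  # -> (4, 2): matched 'aaab'
```

[Right context would then be layered on top by filtering out certain
final states, as the message says, and Boehm's proposal runs a reversed
right-context automaton during this same backup step.]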
Please send responses to the author of the message, not the poster.