Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uunet!snorkelwacker!spdcc!ima!esegue!compilers-sender
From: boehm@flora.rice.edu (Hans Boehm)
Newsgroups: comp.compilers
Subject: Re: right context in scanner generators (was Re: LEX and YACC)
Message-ID: <1989Nov12.223642.14219@esegue.segue.boston.ma.us>
Date: 12 Nov 89 22:36:42 GMT
References: <1989Nov11.161355.10081@esegue.segue.boston.ma.us> <1989Nov12.041025.12451@esegue.segue.boston.ma.us>
Sender: compilers-sender@esegue.segue.boston.ma.us
Reply-To: boehm@flora.rice.edu (Hans Boehm)
Organization: Rice University, Houston
Lines: 56
Approved: compilers@esegue.segue.boston.ma.us

The moderator concluded my previous message with the following questions:

>[How about stepping across the matched token, trying to match the second
>expression; when it matches all the way to the end you've found the right
>context.  It's never been clear to me how useful lex-style right context
>is.  Anyone have some real examples? -John]

Here are some partial answers:

I may be wrong, but it has been my impression that lex is capable of
building Fortran scanners, which is certainly no easy task.  For example,
isn't it the case that DO can be recognized by looking for a `d' followed
by an `o' in a right context in which a `,' precedes the first left
parenthesis?  My impression is that nobody has actually built a Fortran
scanner this way, but I assumed that was for performance reasons.

Trying to match the context expression separately is, at least in some
theoretical sense, not very efficient.  If more than one regular
expression may match, it may be necessary to look for several different
right contexts at several different starting positions.  This would
involve backtracking within the recognition of a single token.
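[As an illustration only: the right-context idea above can be sketched with
a regular-expression lookahead.  This is not lex, and the "comma before the
first left parenthesis" rule is the same simplification used in the message,
not the full Fortran DO rule; the function name is made up for the sketch.]

```python
import re

# Blanks are insignificant in Fortran, so "DO10I=1,10" begins a DO loop
# while "DO10I=1.10" is an assignment to the variable DO10I.  Simplified
# rule from the message: after "DO", treat it as a keyword only if a
# comma appears before the first left parenthesis in the rest of the
# statement.  The lookahead (?=...) plays the role of lex right context:
# it constrains the match without consuming the context characters.
DO_PATTERN = re.compile(r'DO(?=[^(]*,)', re.IGNORECASE)

def starts_do(stmt):
    """Return True if stmt begins a DO statement under the simplified rule."""
    s = stmt.replace(' ', '')          # Fortran ignores blanks
    return DO_PATTERN.match(s) is not None

print(starts_do('DO 10 I = 1, 10'))   # True  -> DO loop
print(starts_do('DO 10 I = 1.10'))    # False -> assignment to DO10I
```

[Note that, as the moderator observes below, the real rule needs a comma
not inside parentheses, which a pure regular expression cannot check.]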
The current lex algorithm has the property that it needs to simulate only
a single finite automaton, independent of the number of regular
expressions in the specification, and independent of whether or not they
include right context.  It does need to back up in the input at the end
of every token recognition to find the longest match, but this happens
only once per token.  It's a simple scan to find the last time the FA was
in a final state.  Right context is implemented by filtering out certain
final states.

I believe we can keep things down to a single forward and a single
backward scan by building an FA recognizing strings ending in the
reversal of one of the right context expressions, and running it during
the backup phase.  Are there simpler techniques that don't require
repeated backtracking?

Hans-J. Boehm
boehm@rice.edu

[You can't build a Fortran scanner in lex without some external hackery.
In particular, to identify a DO statement you need to look for a comma
not inside of parens to the right of the equal sign, and you can't do
paren matching with regular expressions.  I don't think you can do H
constants such as 3Hfoo in lex, either.  And in a statement like
"DO10E5I=1,10" you need contextual help to tell that the statement number
is 10 rather than 10E5 and the variable is E5I rather than I.  (Nearly
everywhere else that an integer constant is allowed, a real constant is,
too, or at least there's no ambiguity.)  When I wrote an F77 compiler
(INfort, for anyone who may have used it) I did what everyone else does,
which is to prescan each statement to take out the string constants and
tell whether it is an assignment statement or not.  Having done that,
scanning is easy, although it needs a lot of feedback from the parser.
For people who are interested, I am giving away a Fortran subset parser
that I wrote a while ago. -John]
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue.  Meta-mail to compilers-request@esegue.
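[The longest-match rule Boehm describes can be sketched as follows.  This
is a toy DFA encoding, not lex's actual tables; for simplicity the sketch
remembers the last accepting position during the forward scan rather than
buffering states and scanning backward, which is equivalent in effect.]

```python
# Simulate a DFA forward from `pos`, remembering the last position at
# which it was in a final state; when the automaton blocks, "back up"
# to that position.  This is the once-per-token longest-match step.

def longest_match(delta, start, finals, text, pos=0):
    """Return (token_end, final_state) for the longest match starting at
    pos, or None if no final state is ever reached."""
    state = start
    last = None                          # last (end_position, state) accepted
    for i in range(pos, len(text)):
        state = delta.get((state, text[i]))
        if state is None:                # automaton blocks: stop scanning
            break
        if state in finals:
            last = (i + 1, state)        # longest match seen so far
    return last

# Toy DFA for the regular expression a+b? :
delta = {(0, 'a'): 1, (1, 'a'): 1, (1, 'b'): 2}
print(longest_match(delta, 0, {1, 2}, 'aaabab'))  # -> (4, 2): matched 'aaab'
```

[Right context would then be layered on top by filtering out certain
final states, as the message says, and Boehm's proposal runs a reversed
right-context automaton during this same backup step.]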
Please send responses to the author of the message, not the poster.