Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!tut.cis.ohio-state.edu!ucbvax!decwrl!decvax!ima!esegue!compilers-sender From: djones@megatest.uucp (Dave Jones) Newsgroups: comp.compilers Subject: Re: Error reporting in LR parsers Message-ID: <1989Aug6.024931.10014@esegue.uucp> Date: 6 Aug 89 02:49:31 GMT References: <1989Aug4.032053.753@esegue.uucp> Sender: compilers-sender@esegue.uucp Reply-To: djones@megatest.uucp (Dave Jones) Organization: Megatest Corporation, San Jose, Ca Lines: 153 Approved: compilers@esegue.segue.bos.ma.us Mr. Worley has generously taken the time to post a long article about getting yacc to give up a list of token-categories which would be legal when a syntax error is found. The idea of grouping tokens into categories is a good one. But the technique in no way solves the "default reduction" problem, as he claims it does. The technique he describes suffers from the exact same problem that all the other attempts have done so far. I will demonstrate that even the small example he gives will not work, by inspecting the actual output from yacc. From article <1989Aug4.032053.753@esegue.uucp>, by worley@compass.com (Dale Worley): > ... [ concerning how to get yacc to generate lists of legal tokens when a syntax error is encounterd ... ] > > I've used similar techniques, and a few simple fixes can solve a lot > of these problems. The first is to define "token groups", which are > groups of tokens the user thinks of as being similar. For instance, > '+', '-', '*', etc. can be included in the group 'operator'; '(', > 'IDENTIFIER', '-', etc., can be included in the group 'expression'. > Now, when a syntax error is found, the error-reporter assembles a list > of acceptable tokens. Then it checks each token group, and if the > group is contained in the list of acceptable tokens, all those tokens > are removed from the list, and the name of the token group is added. > (In practice, only about three members of the group need to be checked > for membership, and if one token group overlaps another, you have to > be careful about the order in which the groups are tested.) The > result is that you can usually reduce the list of allowed tokens to > about three items, and produce error messages like: > > 34 i = j k; > ^ > Syntax error. > Found: IDENTIFIER 'k' > Expecting: OPERATOR ; END > > The trick of only checking a few members of the group can be used to > defeat the problem of losing information because of default > reductions. > The way to do this is to only test members whose validity > will not be eliminated by the default reductions. > In the above > example, 'j' may be reduced to 'expression' before the error is found, > but if the 'operator' group is represented only by '+', then the fact > that '*' cannot follow 'expression' will not stop the 'operator' group > from showing in the list of expected tokens. > In the above example, when the yacc-generated compier discovers the syntax error, the lookahead set under consideration would contain only ';' and END. It would not contain either '+' or '*'. Having done the default reduction, the compiler would have discarded the state or states which could shift '+' or '*'. That's the whole point. Nothing in the algorithm described above retains or recovers the lost information. To make it more concrete, let's actually try to parse the example fragment: BEGIN i = j k END I've written a grammar for us to consider. In the interest of brevity, and "doing it right", the grammar employs precedence-rules, but the result would be the same if we used separate "expr", "term", and "factor" productions. (If you are skeptical, try it.) %token BEGIN END ID %left '+' %left '*' %% prog : BEGIN list END list : stat | list ';' stat ; stat : /* empty */ | ID '=' expr ; expr : ID | expr '+' expr | expr '*' expr ; %% Say, "yacc -vd gram.y" From the resulting y.output file, we observe the following: state 10 stat : ID = expr_ (5) expr : expr_+ expr expr : expr_* expr + shift 12 * shift 13 . reduce 5 We see that when the parser, in state 10, does not see a '+' or a '*', it will make the default reduction (5): stat : ID = expr Doing so, it will discard three states from the top of the stack, including state 10, which is the one which "knows about" the '+' and the '*'. Next it will do another default reduction, producing a "list". Then it will be in state 3, as listed in the y.output file: state 3 prog : BEGIN list_END list : list_; stat END shift 6 ; shift 7 . error Only now will it discover that the next symbol, another ID, is not legal. Note that the only symbols which it now considers to be valid are ';' and END, although '+' and '*' would have been valid before the default reductions were made. How are we to infer that a class of symbols which we choose to call "operators" would also be valid? The information is gone. In this example, state 3 can only be entered after having done the default reductions which we saw. Thus the LALR(1) state 3 only stands for one LR(1) state, which additionally has the lookahead tokens '+' and '*'. But in general, a given LALR(1) state can stand in for several LR(1) states with different LR(1)-lookahead sets. I have demonstrated in previous postings how the entire LR(1)-lookahead set can be calculated at runtime, as the file is parsed. But any such calculation must take into account the input (or the states resulting from the input). There is not enough information in the LALR(1) state alone. [From djones@megatest.uucp (Dave Jones)] -- -- Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU { decvax | harvard | yale | bbn }!ima. Meta-mail to ima!compilers-request. Please send responses to the author of the message, not the poster.