Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!tut.cis.ohio-state.edu!ucbvax!decwrl!decvax!ima!esegue!compilers-sender
From: djones@megatest.uucp (Dave Jones)
Newsgroups: comp.compilers
Subject: Re: Error reporting in LR parsers
Message-ID: <1989Aug6.024931.10014@esegue.uucp>
Date: 6 Aug 89 02:49:31 GMT
References: <1989Aug4.032053.753@esegue.uucp>
Sender: compilers-sender@esegue.uucp
Reply-To: djones@megatest.uucp (Dave Jones)
Organization: Megatest Corporation, San Jose, Ca
Lines: 153
Approved: compilers@esegue.segue.bos.ma.us

Mr. Worley has generously taken the time to post a long article about getting
yacc to give up a list of token-categories which would be legal when a syntax
error is found. The idea of grouping tokens into categories is a good one.

But the technique in no way solves the "default reduction" problem, as
he claims it does. The technique he describes suffers from the exact same
problem that all the other attempts have done so far. I will demonstrate that
even the small example he gives will not work, by inspecting the actual
output from yacc.

From article <1989Aug4.032053.753@esegue.uucp>, by worley@compass.com (Dale Worley):
> 
... [ concerning how to get yacc to generate lists of legal tokens
      when a syntax error is encounterd ... ]
> 
> I've used similar techniques, and a few simple fixes can solve a lot
> of these problems. The first is to define "token groups", which are
> groups of tokens the user thinks of as being similar.  For instance,
> '+', '-', '*', etc. can be included in the group 'operator'; '(',
> 'IDENTIFIER', '-', etc., can be included in the group 'expression'.
> Now, when a syntax error is found, the error-reporter assembles a list
> of acceptable tokens.  Then it checks each token group, and if the
> group is contained in the list of acceptable tokens, all those tokens
> are removed from the list, and the name of the token group is added.
> (In practice, only about three members of the group need to be checked
> for membership, and if one token group overlaps another, you have to
> be careful about the order in which the groups are tested.)  The
> result is that you can usually reduce the list of allowed tokens to
> about three items, and produce error messages like:
> 
> 	    34	i = j k;
> 		      ^
>     Syntax error.
>     Found: IDENTIFIER 'k'
>     Expecting: OPERATOR ; END
> 
> The trick of only checking a few members of the group can be used to
> defeat the problem of losing information because of default
> reductions.
>  The way to do this is to only test members whose validity
> will not be eliminated by the default reductions.
>  In the above
> example, 'j' may be reduced to 'expression' before the error is found,
> but if the 'operator' group is represented only by '+', then the fact
> that '*' cannot follow 'expression' will not stop the 'operator' group
> from showing in the list of expected tokens.
> 

In the above example, when the yacc-generated compier discovers the
syntax error, the lookahead set under consideration would contain
only ';' and END.  It would not contain either '+' or '*'.

Having done the default reduction, the compiler would
have discarded the state or states which could shift '+' or '*'.
That's the whole point. Nothing in the algorithm described above retains
or recovers the lost information.

To make it more concrete, let's actually try to parse the example
fragment:

	BEGIN
          i = j k
        END

I've written a grammar for us to consider. In the interest of brevity,
and "doing it right", the grammar employs precedence-rules, but the result
would be the same if we used separate "expr", "term", and "factor"
productions. (If you are skeptical, try it.)

%token BEGIN END ID
%left '+'
%left '*'

%%

prog : BEGIN list END

list     : stat
         | list ';' stat
;

stat :  /* empty */
     |  ID '=' expr
;

expr : ID
     | expr '+' expr
     | expr '*' expr
;


%%


Say, "yacc -vd gram.y"

From the resulting y.output file, we observe the following:


   state 10
	stat :  ID = expr_    (5)
	expr :  expr_+ expr 
	expr :  expr_* expr 

	+  shift 12
	*  shift 13
	.  reduce 5


We see that when the parser, in state 10, does not see a '+' or a '*',
it will make the default reduction (5):

        stat : ID = expr

Doing so, it will discard three states from the top of the stack,
including state 10, which is the one which "knows about" the '+' and
the '*'.  Next it will do another default reduction, producing a "list".
Then it will be in state 3, as listed in the y.output file:

   state 3
	prog :  BEGIN list_END 
	list :  list_; stat 

	END  shift 6
	;  shift 7
	.  error


Only now will it discover that the next symbol, another ID, is not
legal.

Note that the only symbols which it now considers to be valid are
';' and END, although '+' and '*' would have been valid before the
default reductions were made.  How are we to infer that a class of
symbols which we choose to call "operators" would also be valid?
The information is gone.

In this example, state 3 can only be entered after having done
the default reductions which we saw. Thus the LALR(1) state 3 only
stands for one LR(1) state, which additionally has the lookahead tokens
'+' and '*'. But in general, a given LALR(1) state can stand in for
several LR(1) states with different LR(1)-lookahead sets. I have
demonstrated in previous postings how the entire LR(1)-lookahead
set can be calculated at runtime, as the file is parsed. But any such
calculation must take into account the input (or the states resulting
from the input). There is not enough information in the LALR(1) state alone.
[From djones@megatest.uucp (Dave Jones)]
-- 
--
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
{ decvax | harvard | yale | bbn }!ima.  Meta-mail to ima!compilers-request.
Please send responses to the author of the message, not the poster.