Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!decwrl!megatest!djones
From: djones@megatest.UUCP (Dave Jones)
Newsgroups: comp.lang.c
Subject: Re: error recovery
Message-ID: <4595@goofy.megatest.UUCP>
Date: 29 Apr 89 01:00:48 GMT
References: <1989Apr24.203827.5969@utzoo.uucp>
Organization: Megatest Corporation, San Jose, Ca
Lines: 118

From article <1989Apr24.203827.5969@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
> In article <4389@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
>>... error-recovery in a recursive descent parser is even
>>trickier than in an LR parser!
> 
> Nonsense.
>

Ahem.

The "sense" of it is that, to me, it's trickier.  Maybe you know
an untrickier way. If so, please do tell.  I can always stand a bit
of educating.

> If you insist on doing it as part of the parser, it gets messy,
> but there's an easy way around that.  Have the parser tell the scanner what
> kind of tokens it wants at each point, rather than just asking for "the 
> next token", and do the error recovery in the scanner.  The parser
> always sees a syntactically correct program, and never has to get into
> the messy business of popping an activation stack.

This corresponds to the action which an LR parser
takes as it gobbles tokens from the input until a token can legally be
shifted. In the scheme you describe, is there anything equivalent to poping
the LR-state stack?  That's done, as you may know, to cooperate
with the token-gobbling in an attempt to "snip out" sections of bad input,
often beginning somewhere before the first illegal token, and extending
somewhat beyond it. The grammar is written so that these snipped out
sections reduce to the nonterminal symbol which the code-writer probably
meant for them to produce.

The parser might undertake, for example, to snip out a defective statement:
When the parser is producing a statement, and a bad token is encountered, it
will first pop states until it has uncovered the state that started the 
statement, and it will produce an "error-statement". Only then 
will it gobble tokens until it finds a token that could legally
follow a statement.

Such recovery can be attempted at arbitrary levels: The parser might
first undertake to snip out some smaller part of the statement, perhaps the
expression on the right-hand-side of an assignment, etc.., and produce
an "error-assignment-statement". See the second example at the bottom
of this posting.

It would seem to me, that to accomplish this same kind of "snipping",
in an LL parser, some kind of longjump, or error induced short circuiting
would be necessary, in order to abort the productions which should not
be completed.  It seems wrong to force the completion of such productions,
willy nilly, with tokens from the input stream, and in doing so, perhaps
"stealing" tokens from other productions which might otherwise be completed
successfully. If I understand the proposed method correctly, I can even
think of situations in which rather simple mistakes would cause productions
to be done using tokens which the author obviously did not intend to go
together.

>  With the necessary cooperation from the
> parser, this is about a page of code all told.  It works well, too -- often
> much better than the messy contortions in yacc.  (Yes, I've done both.)

It would seem to be not only a matter of personal judgement, but also
a matter of Yungian "peak-experience", as to whether the yacc method
is "messy" or "contorted".   Although I have seen the method applied
very very badly, I find it to be extremely intuitive and straight forward,
once one is "in on the joke".  It's the catching on that seems to be
the big hurdle. That and the fact that most yaccs I have come across
will use default productions in states which can shift "error". That's
a bug (a typo) that screws up the whole deal. I've fixed it in my copy
of BSD4.2 yacc.

An example:

    identifier_list	
	        : IDENTIFIER
                     { $$ = List_new(); List_append($$, (Ptr)$1); }
                | identifier_list ',' IDENTIFIER
                      { $$ = $1; List_append($1, (Ptr)$3); }
                | error
                      { $$ = List_new(); }
                | identifier_list ',' error
		      /* $$ = $1; .. default */
	
;


Here's another...

expression      : IDENTIFIER
			{ $$ = Mpc_identifier_expr($1); }
                | structured_var_access
			
                | raw_constant	      
			{ $$ = Mpc_const_expr($1->name, $1->const); }
                | rvalue adding_operator rvalue	%prec '+'
			{$$ = Mpc_binary_oper_expr($1,$2,$3); }
		| rvalue multiplying_operator rvalue   	%prec '*'
			{$$ = Mpc_binary_oper_expr($1,$2,$3); }
		| rvalue relational_operator rvalue	%prec '='
			{$$ = Mpc_relational_oper_expr($1,$2,$3); }
		| unary_operator rvalue			%prec UNARY
			{$$ = Mpc_unary_oper_expr($1,$2); }
		| NOT rvalue				%prec NOT
			{$$ = Mpc_unary_oper_expr($1,$2); }
		| IDENTIFIER  actual_parameter_list 
			{ $$ = Mpc_function_expr($1,$2); }
		| set_constructor
		
		| '(' expression ')' { $$ = $2; }

		| error  
			{$$ = (Expr*)0; }
	
;


What could be easier? And it works really well.

Perhaps this discussion, if it is to continue, should move to comp.compilers,
or whatever.