Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!decwrl!megatest!djones From: djones@megatest.UUCP (Dave Jones) Newsgroups: comp.lang.c Subject: Re: error recovery Message-ID: <4595@goofy.megatest.UUCP> Date: 29 Apr 89 01:00:48 GMT References: <1989Apr24.203827.5969@utzoo.uucp> Organization: Megatest Corporation, San Jose, Ca Lines: 118 From article <1989Apr24.203827.5969@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer): > In article <4389@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes: >>... error-recovery in a recursive descent parser is even >>trickier than in an LR parser! > > Nonsense. > Ahem. The "sense" of it is that, to me, it's trickier. Maybe you know an untrickier way. If so, please do tell. I can always stand a bit of educating. > If you insist on doing it as part of the parser, it gets messy, > but there's an easy way around that. Have the parser tell the scanner what > kind of tokens it wants at each point, rather than just asking for "the > next token", and do the error recovery in the scanner. The parser > always sees a syntactically correct program, and never has to get into > the messy business of popping an activation stack. This corresponds to the action which an LR parser takes as it gobbles tokens from the input until a token can legally be shifted. In the scheme you describe, is there anything equivalent to poping the LR-state stack? That's done, as you may know, to cooperate with the token-gobbling in an attempt to "snip out" sections of bad input, often beginning somewhere before the first illegal token, and extending somewhat beyond it. The grammar is written so that these snipped out sections reduce to the nonterminal symbol which the code-writer probably meant for them to produce. The parser might undertake, for example, to snip out a defective statement: When the parser is producing a statement, and a bad token is encountered, it will first pop states until it has uncovered the state that started the statement, and it will produce an "error-statement". Only then will it gobble tokens until it finds a token that could legally follow a statement. Such recovery can be attempted at arbitrary levels: The parser might first undertake to snip out some smaller part of the statement, perhaps the expression on the right-hand-side of an assignment, etc.., and produce an "error-assignment-statement". See the second example at the bottom of this posting. It would seem to me, that to accomplish this same kind of "snipping", in an LL parser, some kind of longjump, or error induced short circuiting would be necessary, in order to abort the productions which should not be completed. It seems wrong to force the completion of such productions, willy nilly, with tokens from the input stream, and in doing so, perhaps "stealing" tokens from other productions which might otherwise be completed successfully. If I understand the proposed method correctly, I can even think of situations in which rather simple mistakes would cause productions to be done using tokens which the author obviously did not intend to go together. > With the necessary cooperation from the > parser, this is about a page of code all told. It works well, too -- often > much better than the messy contortions in yacc. (Yes, I've done both.) It would seem to be not only a matter of personal judgement, but also a matter of Yungian "peak-experience", as to whether the yacc method is "messy" or "contorted". Although I have seen the method applied very very badly, I find it to be extremely intuitive and straight forward, once one is "in on the joke". It's the catching on that seems to be the big hurdle. That and the fact that most yaccs I have come across will use default productions in states which can shift "error". That's a bug (a typo) that screws up the whole deal. I've fixed it in my copy of BSD4.2 yacc. An example: identifier_list : IDENTIFIER { $$ = List_new(); List_append($$, (Ptr)$1); } | identifier_list ',' IDENTIFIER { $$ = $1; List_append($1, (Ptr)$3); } | error { $$ = List_new(); } | identifier_list ',' error /* $$ = $1; .. default */ ; Here's another... expression : IDENTIFIER { $$ = Mpc_identifier_expr($1); } | structured_var_access | raw_constant { $$ = Mpc_const_expr($1->name, $1->const); } | rvalue adding_operator rvalue %prec '+' {$$ = Mpc_binary_oper_expr($1,$2,$3); } | rvalue multiplying_operator rvalue %prec '*' {$$ = Mpc_binary_oper_expr($1,$2,$3); } | rvalue relational_operator rvalue %prec '=' {$$ = Mpc_relational_oper_expr($1,$2,$3); } | unary_operator rvalue %prec UNARY {$$ = Mpc_unary_oper_expr($1,$2); } | NOT rvalue %prec NOT {$$ = Mpc_unary_oper_expr($1,$2); } | IDENTIFIER actual_parameter_list { $$ = Mpc_function_expr($1,$2); } | set_constructor | '(' expression ')' { $$ = $2; } | error {$$ = (Expr*)0; } ; What could be easier? And it works really well. Perhaps this discussion, if it is to continue, should move to comp.compilers, or whatever.