Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!yale!mintaka!spdcc!ima!esegue!compilers-sender
From: firth@sei.cmu.edu (Robert Firth)
Newsgroups: comp.compilers
Subject: Re: LL parsing questions
Keywords: parse, question, LL(1)
Message-ID: <9096@fy.sei.cmu.edu>
Date: 17 Oct 90 19:34:30 GMT
References: <9010160921.AA00952@turia.dit.upm.es>
Sender: compilers-sender@esegue.segue.boston.ma.us
Reply-To: firth@sei.cmu.edu (Robert Firth)
Organization: Software Engineering Institute, Pittsburgh, PA
Lines: 96
Approved: compilers@esegue.segue.boston.ma.us

In article <9010160921.AA00952@turia.dit.upm.es> esink@turia.dit.upm.es (Eric Wayne Sink) writes:

>I believe that few people write parsers without YACC (or am I wrong ?). 

(Well, I wouldn't go near yacc with a ten-meter pole, but who cares
 what I think!)

>Suppose you WERE writing a recursive descent parser without YACC, for
>the purpose of understanding how they work.  Take the example language
>to be C; with the following simple grammar:
>
>statement
>	: typename identifier ';'
>	| typename identifier '(' ')' ';'
>	| typename identifier '[' constant ']' ';'
>	;
>
>Obviously this grammar is not useful or complete, I have made it only
>complex enough to illustrate my problem.  Now, try parsing the following
>statements:
>
>int x;
>int x();
>int x[5];

Yes, I believe you are going about it the wrong way. The basic idea
of recursive descent is that you don't call a recogniser until you
are sure that (in the absence of errors) it will indeed find what
it expects.

For example, suppose you have recognisers for the IF statement, the
WHILE statement, and the GOTO statement.  You enter them with the
initial token already consumed, so the code that gets you there
looks like this

	IF nextis(iftoken) THEN readifstatement()
	ELSE IF nextis(whiletoken) THEN readwhilestatement()
	ELSE IF nextis(gototoken) THEN readgotostatement()
	ELSE error...

The auxiliary routine nextis(T) is about the most helpful thing to
have in such a parser: if the current token is T then it CONSUMES
that token and returns TRUE, otherwise it LEAVES THE TOKEN ALONE and
returns FALSE.

Now, when we get to readifstatement() we have consumed the IF, so
the body reads something like

	readcondition()
	nexthadbetterbe(thentoken)
	readstatement()
	IF nextis(elsetoken) THEN readstatement()

Let's turn now to your example.  At the top level, we read first
whatever is common in the left parts of the productions, since they
have to be there in all cases:

	readtype()
	readidentifier()

We now switch on the next token:

	IF nextis(semicolontoken) THEN -- we're done

	ELSE IF nextis(leftparentoken) THEN read_rest_of_case_2()

	ELSE IF nextis(leftbrackettoken) THEN read_rest_of_case_3()

	ELSE error...

There's more to it than that, of course: passing information down to
the recognisers, building parse trees, and all the error handling.  But
the above should give you the basic structure of the code.  As a final
useful trick: I always have the recognisers return the tree fragment
for the production they finish recognising, so in reality the code
tends to look like this

	IF nextis(iftoken) RETURN readifstatement()

...

readifstatement:

	c := readcondition()
	nexthadbetterbe(thentoken)
	s1 := readstatement()
	s2 := IF nextis(elsetoken) THEN readstatement() ELSE NIL

	RETURN buildcell(iftoken,c,s1,s2)

Good luck, and hope it all helps!

Robert Firth
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.