Path: utzoo!news-server.csri.toronto.edu!rutgers!att!pacbell.com!decwrl!llustig!objy!prefect!peter
From: peter@prefect.Berkeley.EDU (Peter Moore)
Newsgroups: comp.std.c++
Subject: Re: parsing C++ woes
Message-ID: <1991Mar12.220532.897@objy.com>
Date: 12 Mar 91 22:05:32 GMT
References: <12.UUL1.3#8618@softrue.UUCP>
Sender: news@objy.com
Reply-To: peter@objy.com
Organization: Objectivity Inc.
Lines: 61

> I think that everyone wants C++ to be parsable without having to write the
> complete internals of a compiler.  However, as things now stand, this may
> not be possible.  

I think you are making this sound harder than it really is.  The
typename/id distinction is not new to C++.  C has the same problem,
albeit to a lesser extent.  The solution is simple: you have a symbol
table that the lexer uses to classify an identifier as a typename or a
simple id.  The parser updates entries in this symbol table as
typenames go in and out of scope, and the lexer sees these changes
next time it looks up an id.  (I haven't thought very hard about it;
it might be technically possible to avoid the problem in C with a
clever grammar, but I think almost everyone makes life easier for
themselves and uses symbol table information instead.)
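
To make the lexer/parser feedback concrete, here is a minimal sketch in
C++.  The names (TokenKind, ScopeStack, declare, classify) are made up
for illustration and are not taken from any particular compiler; it
assumes one table per open scope and treats undeclared names as plain
ids.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical token classes the lexer can hand to the parser.
enum class TokenKind { Id, TypeName };

// One table per open scope; the innermost scope is at the back.
class ScopeStack {
public:
    ScopeStack() { push(); }                 // start with file scope
    void push() { scopes_.emplace_back(); }  // parser: entering a block
    void pop() { scopes_.pop_back(); }       // parser: leaving a block

    // Called by the parser when it finishes a declaration.
    void declare(const std::string& name, TokenKind kind) {
        scopes_.back()[name] = kind;
    }

    // Called by the lexer for every identifier it scans.
    TokenKind classify(const std::string& name) const {
        for (auto it = scopes_.rbegin(); it != scopes_.rend(); ++it) {
            auto found = it->find(name);
            if (found != it->end()) return found->second;
        }
        return TokenKind::Id;  // assumption: unknown names are plain ids
    }

private:
    std::vector<std::unordered_map<std::string, TokenKind>> scopes_;
};
```

After declare("T", TokenKind::TypeName) at file scope, classify("T")
yields TypeName until an inner scope redeclares T as an ordinary id,
and it yields TypeName again once that inner scope is popped.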

C++ scoping rules are more complicated than C's, but a symbol table is
still straightforward.  Each class has a symbol table associated with
it, with its table linked to the table of its superclass.  At any
time, the parser has a stack of the currently active scopes.  When
entering a member function, or interpreting a member name, the symbol
table of the class is pushed onto the stack.
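
A rough sketch of that arrangement, again in C++ with invented names
(NameKind, ClassScope, ActiveScopes).  It assumes single inheritance
and models file scope as just another table with no superclass; a real
front end needs more than this.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

enum class NameKind { Unknown, Object, Type };

// Symbol table for one class, linked to the table of its superclass
// (single inheritance only, to keep the sketch short).
struct ClassScope {
    const ClassScope* base = nullptr;
    std::unordered_map<std::string, NameKind> names;

    // Search this class first, then walk up the superclass chain.
    NameKind find(const std::string& name) const {
        for (const ClassScope* c = this; c != nullptr; c = c->base) {
            auto it = c->names.find(name);
            if (it != c->names.end()) return it->second;
        }
        return NameKind::Unknown;
    }
};

// Stack of the currently active scopes; the innermost is at the back.
struct ActiveScopes {
    std::vector<const ClassScope*> stack;

    // Pushed when entering a member function or interpreting a member name.
    void enter(const ClassScope& s) { stack.push_back(&s); }
    void leave() { stack.pop_back(); }

    // Innermost scope wins; each class scope also consults its superclasses.
    NameKind lookup(const std::string& name) const {
        for (auto it = stack.rbegin(); it != stack.rend(); ++it) {
            NameKind k = (*it)->find(name);
            if (k != NameKind::Unknown) return k;
        }
        return NameKind::Unknown;
    }
};
```

With a derived class's table on top of the stack, a name declared as a
type in its superclass hides an object of the same name at file scope;
once the class table is popped, the file-scope meaning is visible again.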

> Unfortunately, yacc doesn't work like that, because in yacc a token
> cannot be classified in two ways simultaneously.  Instead people have
> developed a hack in which the lexical analyzer classifies names as a
> CLASSname, TYPEDEFname, or NEITHERname.  Then every use of
> "identifier" in the grammar has to be expanded to possibly use any
> one of "CLASSname", "TYPEDEFname", or "NEITHERname".

For cases where either an id or a typedef name is acceptable, rules like:
id
	:   ID
	|   TYPEDEF
	    { $$ = id_from_typedef($1); } 
	;

typedef
	:   TYPEDEF
	|   ID
	    { $$ = typedef_from_id($1); }
	;

isolate most of the trouble.  (Yes, there are either some semantic
checks in the conversion routines, or checks further up in the grammar).

The above are the easy parts of C++ parsing.  The worst part is the
declaration/expression ambiguity that can require unbounded look-ahead
(see ARM pg 93).  A lesser problem (but one closer to my heart) is the
strain that

	a) collapsing the union/struct/enum tag name space into
	   the typedef name space.
	b) having typedef names drift into the object/function name
	   space in the special case of constructors/function-like casts
	c) allowing typedef names as member and variable names

puts on someone trying to parse member names.  In particular, making
sure that you don't interpret a constructor declaration as a field
declaration with an extra set of parentheses is painful.
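
A small illustration of that last point (the names T, A, B, x and value
are mine, chosen only for the example).  The two member declarations
below have the same token shape, a type name followed by a
parenthesized name, and only the symbol table tells them apart:

```cpp
typedef int T;

struct A {
    T(x);       // field: same as "T x;" with an extra set of parentheses
};

struct B {
    int value;
    B(T);       // constructor: B names the enclosing class, T is the parameter type
};

// Out-of-line definition so the constructor can actually be called.
B::B(T v) : value(v) {}
```

In A the leading name is a typedef that is not the class's own name, so
the line declares a field x.  In B the leading name is the class itself,
so the same shape declares a constructor taking a T.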

	Peter Moore