Path: utzoo!news-server.csri.toronto.edu!rutgers!att!pacbell.com!decwrl!llustig!objy!prefect!peter
From: peter@prefect.Berkeley.EDU (Peter Moore)
Newsgroups: comp.std.c++
Subject: Re: parsing C++ woes
Message-ID: <1991Mar12.220532.897@objy.com>
Date: 12 Mar 91 22:05:32 GMT
References: <12.UUL1.3#8618@softrue.UUCP>
Sender: news@objy.com
Reply-To: peter@objy.com
Organization: Objectivity Inc.
Lines: 61

> I think that everyone wants C++ to be parsable without having to write the
> complete internals of a compiler.  However, as things now stand, this may
> not be possible.

I think you are making this sound harder than it really is.  The
typename/id distinction is not new to C++.  C has the same problem,
albeit to a lesser extent.  The solution is simple: you have a symbol
table that the lexer uses to classify an identifier as a typename or a
simple id.  The parser updates entries in this symbol table as
typenames go in and out of scope, and the lexer sees these changes the
next time it looks up an id.  (I haven't thought very hard about it; it
might be technically possible to avoid the problem in C with a clever
grammar, but I think almost everyone makes life easier for themselves
and uses symbol table information instead.)

C++ scoping rules are more complicated than C's, but a symbol table is
still straightforward.  Each class has a symbol table associated with
it, with its table linked to the table of its superclass.  At any time,
the parser has a stack of the currently active scopes.  When entering a
member function, or interpreting a member name, the symbol table of the
class is pushed onto the stack.

> Unfortunately, yacc doesn't work like that, because in yacc a token
> cannot be classified in two ways simultaneously.  Instead people have
> developed a hack in which the lexical analyzer classifies names as a
> CLASSname, TYPEDEFname, or NEITHERname.  Then every use of
> "identifier" in the grammar has to be expanded to possibly use any
> one of "CLASSname", "TYPEDEFname", or "NEITHERname".

For cases where either an id or a typedef name is acceptable, rules
like:

    id      : ID
            | TYPEDEF       { $$ = id_from_typedef($1); }
            ;

    typedef : TYPEDEF
            | ID            { $$ = typedef_from_id($1); }
            ;

isolate most of the trouble.  (Yes, there are either some semantic
checks in the conversion routines, or checks further up in the
grammar.)

The above are the easy parts of C++ parsing.  The worst part is the
declaration/expression ambiguity that can require unbounded look-ahead
(see ARM pg 93).  A lesser problem (but one closer to my heart) is the
strain that

    a) collapsing the union/struct/enum tag name space into the
       typedef name space,
    b) having typedef names drift into the object/function name space
       in the special case of constructors/function-like casts, and
    c) allowing typedef names as member and variable names

puts on someone trying to parse member names.  In particular, making
sure that you don't interpret a constructor declaration as a field
declaration with an extra set of parentheses is painful.

        Peter Moore
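
Editorial illustration (not part of the original article): the lexer
feedback described in the post can be sketched roughly as follows.  This
is a minimal sketch under stated assumptions, not the author's code.  It
uses one flat array with scope markers instead of the per-class linked
tables the post mentions, and every name in it (push_scope, pop_scope,
declare, classify, TOK_ID, TOK_TYPEDEF) is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Token kinds the lexer may return for a name (hypothetical names). */
enum tok { TOK_ID, TOK_TYPEDEF };

#define MAX_SYMS   64
#define MAX_SCOPES 16

struct sym { const char *name; enum tok kind; };

static struct sym syms[MAX_SYMS];     /* all visible declarations, oldest first */
static int nsyms;                     /* number of entries in use */
static int scope_start[MAX_SCOPES];   /* value of nsyms when each scope opened */
static int nscopes;

/* Parser action: a new scope (block, class body, ...) is entered. */
static void push_scope(void)
{
    assert(nscopes < MAX_SCOPES);
    scope_start[nscopes++] = nsyms;
}

/* Parser action: the innermost scope ends; its declarations vanish. */
static void pop_scope(void)
{
    assert(nscopes > 0);
    nsyms = scope_start[--nscopes];
}

/* Parser action: record a name as a typedef name or an ordinary id. */
static void declare(const char *name, enum tok kind)
{
    assert(nsyms < MAX_SYMS);
    syms[nsyms].name = name;
    syms[nsyms].kind = kind;
    nsyms++;
}

/* Lexer hook: classify a name by its innermost visible declaration.
   Names that were never declared are treated as plain identifiers. */
static enum tok classify(const char *name)
{
    for (int i = nsyms - 1; i >= 0; i--)
        if (strcmp(syms[i].name, name) == 0)
            return syms[i].kind;
    return TOK_ID;
}
```

In this sketch the parser would call declare("T", TOK_TYPEDEF) after
reducing "typedef int T;", so the lexer's next classify("T") returns
TOK_TYPEDEF.  An inner declaration such as "int T;" shadows it with
TOK_ID until pop_scope() runs at the closing brace, which is the "go in
and out of scope" behaviour the post describes.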