Path: utzoo!news-server.csri.toronto.edu!cs.utexas.edu!uunet!softrue!kearns
From: kearns@softrue.UUCP (Steven Kearns)
Newsgroups: comp.std.c++
Subject: parsing C++ woes
Message-ID: <12.UUL1.3#8618@softrue.UUCP>
Date: 10 Mar 91 17:19:04 GMT
Organization: Software Truth
Lines: 138

A Parsing Pitfall related to "C++ with nested types".
=============================================================

I think that everyone wants C++ to be parsable without having to write
the complete internals of a compiler. However, as things now stand,
this may not be possible.

On pages 93 and 96 of E&S, it is made clear that the definition of C++
syntax assumes that a lexical analyzer returns identifiers, and that
the parser can tell if an identifier is a CLASSname or TYPEDEFname or
neither. So all CLASSnames are also identifiers, and all TYPEDEFnames
are also identifiers, and an identifier may also be a CLASSname, a
TYPEDEFname, or neither, but not both. This distinction is necessary
in order to disambiguate the grammar, which is otherwise quite
ambiguous.

Unfortunately, yacc doesn't work like that, because in yacc a token
cannot be classified in two ways simultaneously. Instead people have
developed a hack in which the lexical analyzer classifies names as a
CLASSname, TYPEDEFname, or NEITHERname. Then every use of "identifier"
in the grammar has to be expanded to possibly use any one of
"CLASSname", "TYPEDEFname", or "NEITHERname". When yacc realizes that
a new name has been declared, it must inform the lexical analyzer as
quickly as possible so that future occurrences of that name can be
classified correctly. Gross!

Another possibility is to modify yacc so it understands "inherited
tokens" such as the relationship between CLASSname and identifier.
Yacc still has to update the name interpretation process when a new
variable is declared.

Now that C++ 2.1 insists on allowing type names to be nested just like
variable names, the lexical analyzer's task is much harder.
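
The feedback loop between parser and lexical analyzer described above
can be sketched roughly as follows. This is only an illustration under
assumptions: NameTable, declare, classify, enterScope and leaveScope
are hypothetical names, not anything from E&S or yacc, and a real
front end would need far more than one stack of scopes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// The three ways the lexer may classify a name, as in the post.
enum NameKind { NEITHERname, CLASSname, TYPEDEFname };

// Hypothetical scoped table shared by parser and lexer (an assumption).
class NameTable {
    typedef std::map<std::string, NameKind> Scope;
    std::vector<Scope> scopes_;

public:
    NameTable() : scopes_(1) {}          // start with the global scope

    void enterScope() { scopes_.push_back(Scope()); }
    void leaveScope() { if (scopes_.size() > 1) scopes_.pop_back(); }

    // Parser side: record a name as soon as its declaration is seen.
    void declare(const std::string& name, NameKind kind) {
        scopes_.back()[name] = kind;
    }

    // Lexer side: the innermost declaration of the name wins.
    NameKind classify(const std::string& name) const {
        for (std::size_t i = scopes_.size(); i-- > 0; ) {
            Scope::const_iterator it = scopes_[i].find(name);
            if (it != scopes_[i].end()) return it->second;
        }
        return NEITHERname;
    }
};

// Usage: mirrors "class outer { class inner { ... }; };".
void nameTableDemo() {
    NameTable table;
    table.declare("outer", CLASSname);
    table.enterScope();                       // inside class outer
    table.declare("inner", CLASSname);
    assert(table.classify("inner") == CLASSname);
    table.leaveScope();                       // back at file scope
    assert(table.classify("inner") == NEITHERname);
    assert(table.classify("outer") == CLASSname);
}
```

Note that the same spelling gets a different token kind depending on
which scopes are open, which is exactly what makes nested type names
painful for the lexical analyzer.
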
The classification of an identifier depends on the active scope, and
there are many places where the active scope changes when parsing a
C++ program: after a class qualifier, inside a class declaration,
inside a compound statement, etc.

The question of the day is: does the active scope change after the
"." (DOT) or "->" (ARROW) operators?

The "name" that appears after a "." or "->" must be a member of the
class referred to on the left side of the "." or "->". From this it
might be tempting to define the C++ SYNTAX so that the names on the
right hand side are interpreted in the scope of the object referred to
on the left hand side. After all, in "foo.a", we expect name "a" to be
a member of the class referred to by foo.

BUT THIS DEFINITION WOULD BE DISASTROUS!

The left side of "." and "->" might be an arbitrary expression,
potentially involving overloaded functions and the complete range of
C++ complexity, and changing to its scope would mean calculating its
type DURING PARSING. This would completely eliminate the possibility
of making a simple parser for C++.

Instead, names on the right hand side of a "." or "->" should be
interpreted in the same scope as the left hand side DURING PARSING.
Later, these names should be reinterpreted in the scope of the class
referred to by the left hand side.

The bottom line is that the syntax given for a "name" in E&S must be
changed. Here is the old definition:

/* grammar 1 */
name:
        identifier
        operator-function-name
        conversion-function-name
        ~ CLASSname
        qualified-name

The problem with this definition is that the definitions for
"conversion-function-name", "~ CLASSname", and "qualified-name" all
make use of the CLASSname token, thereby assuming that the lexical
analyzer can make this distinction in this context. However, I just
pointed out that it is infeasible to classify a name as a CLASSname or
not on the right side of a "." or "->", during parsing. Therefore the
syntax for "name" must be modified.

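To make the point about the left side of "." concrete, here is a small
sketch of an expression whose class is known only after overload
resolution. The types Celsius and Label and the overloaded function
make are hypothetical, invented purely for illustration.

```cpp
#include <cassert>
#include <string>

// Two unrelated classes that both happen to have a member "value".
struct Celsius { int value; };
struct Label   { std::string value; };

// Overloaded on the argument type: the class of "make(x)" cannot be
// known until overload resolution has been performed.
Celsius make(int degrees)      { Celsius c; c.value = degrees; return c; }
Label   make(const char* text) { Label l;   l.value = text;    return l; }

// To interpret "value" in the scope of the left operand, a parser
// would first have to pick the right overload of make.
int         readNumber() { return make(42).value; }     // Celsius::value
std::string readText()   { return make("hot").value; }  // Label::value

void overloadDemo() {
    assert(readNumber() == 42);
    assert(readText() == "hot");
}
```

Both member accesses have the same token sequence up to the argument,
so a parser that switched scope after "." would need full type
analysis just to classify the name that follows.
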
Here is one way, which basically involves rewriting the relevant
grammar rules so that all references to CLASSname tokens are replaced
by references to the less specific "identifier" token.

/* grammar 2 */
name:
        identifier
        operator-function-name
        conversion-function-name-2
        ~ identifier
        qualified-name-2

qualified-name-2:
        qualified-class-name-2 :: name

qualified-class-name-2:
        identifier
        identifier :: qualified-class-name-2

conversion-function-name-2:
        OPERATOR conversion-type-name-2

conversion-type-name-2:
        type-specifier-list-2 ptr-operator-2

...... it goes on and on and on and on ........

I would not be surprised if this change introduces more ambiguities
into the language.

Here is an illustrative example:

class outer {
    class inner {
        int i;
    };
    int j;
};

main () {
    outer o;
    o.j;                        // this must be legal
    int a;
    o.operator inner();         // are both of these legal?
    o.operator outer::inner();  // probably should be!
    o.a;                        // should be syntactically correct,
                                // semantically wrong
    o.operator bobble();        // syntactically correct,
                                // semantically wrong
};

********************************************************
* Steven Kearns            ....uunet!softrue!kearns    *
* Software Truth           softrue!kearns@uunet.uu.net *
********************************************************