Path: utzoo!news-server.csri.toronto.edu!cs.utexas.edu!uunet!softrue!kearns
From: kearns@softrue.UUCP (Steven Kearns)
Newsgroups: comp.std.c++
Subject: parsing C++ woes
Message-ID: <12.UUL1.3#8618@softrue.UUCP>
Date: 10 Mar 91 17:19:04 GMT
Organization: Software Truth
Lines: 138


A Parsing Pitfall related to "C++ with nested types".
=============================================================

I think that everyone wants C++ to be parsable without having to write the
complete internals of a compiler.  However, as things now stand, this may
not be possible.  

On pages 93 and 96 of E&S (Ellis and Stroustrup, "The Annotated C++
Reference Manual"), it is made clear that the definition of C++
syntax assumes that a lexical analyzer returns identifiers, and that
the parser can tell whether an identifier is a CLASSname, a
TYPEDEFname, or neither.  So every CLASSname and every TYPEDEFname is
also an identifier, but no identifier is both a CLASSname and a
TYPEDEFname.

This distinction is necessary in order to disambiguate the grammar,
which is otherwise quite ambiguous.

Unfortunately, yacc doesn't work like that, because in yacc a token
cannot be classified in two ways simultaneously.  Instead, people have
developed a hack in which the lexical analyzer classifies each name as
a CLASSname, TYPEDEFname, or NEITHERname.  Then every use of
"identifier" in the grammar has to be expanded to accept any one of
"CLASSname", "TYPEDEFname", or "NEITHERname".  When the yacc-generated
parser recognizes that a new name has been declared, it must inform
the lexical analyzer as quickly as possible so that future occurrences
of that name can be classified correctly.  Gross!
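
To make the hack concrete, here is a minimal sketch of the feedback
loop in C++.  The names NameKind, NameTable, declare, and classify are
made up for illustration; a real yacc/lex pair would call something
like them from a grammar action and from the scanner respectively.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// The three token kinds the lexer may return for a name.
enum class NameKind { ClassName, TypedefName, NeitherName };

// Hypothetical table shared by parser and lexer:
// the parser writes to it, the lexer reads from it.
class NameTable {
public:
    // Called from a parser action as soon as a declaration is recognized.
    void declare(const std::string& name, NameKind kind) { kinds_[name] = kind; }

    // Called by the lexer for every name it scans.
    NameKind classify(const std::string& name) const {
        auto it = kinds_.find(name);
        return it == kinds_.end() ? NameKind::NeitherName : it->second;
    }

private:
    std::unordered_map<std::string, NameKind> kinds_;
};

// Walk-through: "foo" is an ordinary name until the parser sees "class foo".
void lexer_feedback_demo() {
    NameTable table;
    assert(table.classify("foo") == NameKind::NeitherName);
    table.declare("foo", NameKind::ClassName);  // parser action for "class foo"
    assert(table.classify("foo") == NameKind::ClassName);
}
```

The timing is the delicate part: the declare call has to happen before
the lexer scans the next occurrence of the name, which is why the
parser must report declarations "as quickly as possible".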

Another possibility is to modify yacc so it understands "inherited
tokens" such as the relationship between CLASSname and identifier.
Yacc still has to update the name interpretation process when a new
variable is declared.

Now that C++ 2.1 insists on allowing type names to be nested just
like variable names, the lexical analyzer's task is much harder.  The
classification of an identifier depends on the active scope, and
there are many places where the active scope changes when parsing a
C++ program: after a class qualifier, inside a class declaration,
inside a compound statement, etc.
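
One way to picture the extra work is a stack of name tables, one per
active scope, searched innermost first.  This is only a sketch under
my own assumptions: Kind, ScopedNames, enter_scope, and leave_scope
are invented names, and a real parser would also have to handle class
qualifiers, which reopen a scope that is not on top of the stack.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class Kind { ClassName, TypedefName, NeitherName };

// Hypothetical scope-aware table: classification depends on the active scope.
class ScopedNames {
public:
    ScopedNames() { enter_scope(); }                // start with the global scope
    void enter_scope() { scopes_.emplace_back(); }  // e.g. on "{" of a class or block
    void leave_scope() { scopes_.pop_back(); }      // e.g. on the matching "}"

    // Record a name in the innermost active scope.
    void declare(const std::string& name, Kind kind) { scopes_.back()[name] = kind; }

    // Search from the innermost scope outward.
    Kind classify(const std::string& name) const {
        for (auto s = scopes_.rbegin(); s != scopes_.rend(); ++s) {
            auto it = s->find(name);
            if (it != s->end()) return it->second;
        }
        return Kind::NeitherName;
    }

private:
    std::vector<std::map<std::string, Kind>> scopes_;
};

// A nested class name is a CLASSname only while its enclosing scope is active.
void nested_scope_demo() {
    ScopedNames names;
    names.declare("outer", Kind::ClassName);
    names.enter_scope();                            // inside "class outer {"
    names.declare("inner", Kind::ClassName);
    assert(names.classify("inner") == Kind::ClassName);
    names.leave_scope();                            // after "};"
    assert(names.classify("inner") == Kind::NeitherName);
}
```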

The question of the day is: does the active scope change after the
"." (DOT) or "->" (ARROW) operators?

The "name" that appears after a "." or "->" must be a member of the
class referred to on the left side of the "." or "->".  From this it
might be tempting to define the C++ SYNTAX so that the names on the
right hand side are interpreted in the scope of the object referred to
on the left hand side.  After all, in "foo.a", we expect name "a" to
be a member of the class referred to by foo.  

BUT THIS DEFINITION WOULD BE DISASTEROUS!  The left side of "." and
"->" might be an arbitrary expression, potentially involving
overloaded functions and the complete range of C++ complexity, and
changing to its scope would mean calculating its type DURING PARSING.
This would completely eliminate the possibility of making a simple
parser for C++.
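
A small example of why the type of the left side is not available
cheaply.  The names Reading, Label, and pick are hypothetical; the
point is only that which class "value" belongs to is decided by
overload resolution on the call, not by anything the parser can see
in the token stream.

```cpp
#include <cassert>
#include <string>

struct Reading { int value; };
struct Label   { std::string value; };

// Two overloads with the same name but different class return types.
Reading pick(int n)            { return Reading{n}; }
Label   pick(const char* text) { return Label{text}; }

void overload_demo() {
    // Same spelling "pick(...).value", but the scope in which "value"
    // must be looked up differs: Reading in one case, Label in the other.
    assert(pick(3).value == 3);
    assert(pick("three").value == "three");
}
```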

Instead, names on the right hand side of a "." or "->" should be
interpreted in the same scope as the left hand side DURING PARSING.
Later, these names should be reinterpreted in the scope of the class
referred to by the left hand side.
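
Here is a rough sketch of that two-step approach.  It is deliberately
simplified: the left side is assumed to be a plain variable rather
than an arbitrary expression, and MemberAccess, Program, and resolves
are names I made up for the illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Step 1 (parsing): record the raw names only; no class scope is entered.
struct MemberAccess {
    std::string object;  // name on the left of "."
    std::string member;  // plain identifier on the right of "."
};

// Information collected from declarations, available after parsing.
struct Program {
    std::map<std::string, std::string> variable_class;            // variable -> class
    std::map<std::string, std::set<std::string>> class_members;   // class -> member names
};

// Step 2 (semantic pass): reinterpret the member name in the scope
// of the class of the object on the left.
bool resolves(const Program& p, const MemberAccess& a) {
    auto var = p.variable_class.find(a.object);
    if (var == p.variable_class.end()) return false;
    auto cls = p.class_members.find(var->second);
    return cls != p.class_members.end() && cls->second.count(a.member) > 0;
}

void two_pass_demo() {
    Program p;
    p.variable_class["o"] = "outer";
    p.class_members["outer"].insert("j");
    assert(resolves(p, MemberAccess{"o", "j"}));   // parses, and resolves later
    assert(!resolves(p, MemberAccess{"o", "a"}));  // parses, but fails later
}
```

Both accesses get through the parser unchanged; only the second pass
tells them apart.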

The bottom line is that the syntax given for a "name" in E&S must
be changed.  Here is the old definition:

/* grammar 1 */

name:  
	identifier
	operator-function-name
	conversion-function-name
	~ CLASSname
	qualified-name

The problem with this definition is that the definitions for
"conversion-function-name", "~ CLASSname", and "qualified-name" all
make use of the CLASSname token, thereby assuming that the lexical
analyzer can make this distinction in this context.  However, as
pointed out above, it is infeasible to classify a name as a CLASSname
or not on the right side of a "." or "->" during parsing.  Therefore
the syntax for "name" must be modified.  Here is one way: rewrite
the relevant grammar rules so that all references to CLASSname tokens
are replaced by references to the less specific "identifier" token.

/* grammar 2 */

name:  
	identifier
	operator-function-name
	conversion-function-name-2
	~ identifier
	qualified-name-2

qualified-name-2:
	qualified-class-name-2 :: name

qualified-class-name-2:
	identifier
	identifier :: qualified-class-name-2

conversion-function-name-2:
	OPERATOR conversion-type-name-2

conversion-type-name-2:
	type-specifier-list-2 ptr-operator-2

...... it goes on and on and on and on ........
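
To show what the relaxed rules buy, here is a sketch of a recognizer
for one slice of grammar 2: a qualified-name-2 whose final "name" is a
plain identifier.  The function names and the pre-split token vector
are my own assumptions, and keywords are not filtered out.  Note that
no symbol table is consulted anywhere.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// True if the token is spelled like an identifier.  Purely lexical:
// no CLASSname/TYPEDEFname lookup, and keywords are not excluded here.
bool is_identifier(const std::string& tok) {
    if (tok.empty()) return false;
    unsigned char first = static_cast<unsigned char>(tok[0]);
    if (!(std::isalpha(first) || first == '_')) return false;
    for (unsigned char c : tok)
        if (!(std::isalnum(c) || c == '_')) return false;
    return true;
}

// Recognizes: identifier ( "::" identifier )+
// i.e. qualified-class-name-2 "::" name, with name limited to an identifier.
bool is_qualified_name_2(const std::vector<std::string>& toks) {
    if (toks.size() < 3 || toks.size() % 2 == 0) return false;
    for (std::size_t i = 0; i < toks.size(); ++i) {
        bool ok = (i % 2 == 0) ? is_identifier(toks[i]) : toks[i] == "::";
        if (!ok) return false;
    }
    return true;
}

void grammar2_demo() {
    assert(is_qualified_name_2({"outer", "::", "inner", "::", "i"}));
    assert(is_qualified_name_2({"bobble", "::", "x"}));  // undeclared, still parses
    assert(!is_qualified_name_2({"outer", "::"}));
}
```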


I would not be surprised if this change introduces more ambiguities
into the language.  

Here is an illustrative example:

class outer {
public:            // members must be accessible for "o.j" to be legal
	class inner {
		int i;
	};
	int j;
};

int main () {
	outer o;
	o.j;   // this must be legal
	int a;

	o.operator inner();        // are both of these legal?
	o.operator outer::inner(); // probably should be!

	o.a;   // should be syntactically correct, semantically wrong
	o.operator bobble();  // syntactically correct, semantically wrong
}

********************************************************
* Steven Kearns            ....uunet!softrue!kearns    *
* Software Truth           softrue!kearns@uunet.uu.net *
********************************************************