Path: utzoo!utgpu!watserv1!watmath!att!pacbell!pacbell.com!ucsd!usc!snorkelwacker!bloom-beacon!eru!hagbard!sunic!mcsun!tuvie!vmars!alex
From: alex@vmars.tuwien.ac.at (Alexander Vrchoticky)
Newsgroups: comp.lang.c
Subject: Re: Summary: 'C', is it's grammar context sensitive ?
Message-ID: <1787@tuvie>
Date: 31 Aug 90 14:29:58 GMT
References: <1990Aug30.223440.7377@NCoast.ORG>
Sender: news@tuvie
Lines: 49

ramsey@NCoast.ORG (Cedric Ramsey) writes:

>If you guys agree that 'C' is context sensitive then what languages
>truely are context-free, if any. 

Uh, almost none, or almost every. It depends on your view.

I won't go into the details of formal language theory here,
but clarify a few points:

- Efficient syntax analysis of programming languages dictates the use
  of context free grammars for syntax specification.

- From a formal language point of view no practical programming language
  is context-free. (It can't be context free if it allows for an 
  unlimited number of identifiers that can appear three or more 
  times in a program.)

The apparent contradiction is usually resolved as follows:

- Build a parser based on a context free grammar that generates a 
  `reasonably small' superset of the language and use other techniques 
  (symbol tables) to `filter out' the cases that are illegal in the 
  language but which slip through the parser based on the superset.

The phrase `reasonably small' is not really very well-defined.
However I think few people would call the following a reasonable grammar
for C, even though it generates a superset of C:

S --> A S | \epsilon
A --> 'a' | 'b' | ... | 'z' (and the uppercase letters, and the digits, and ...)

On the other hand most people are very happy with mapping all possible 
identifiers into one token with the use of a lexical analyzer and
using table lookup to check for things like proper declarations.
This is done routinely in every compiler.

Fewer people are happy with the fact that two different tokens in the 
C language (identifier and typedef-name) have the same lexical representation
and can therefore not be told apart without resorting to table lookup.
But that's the way C is defined. 

And to answer the original question: Most (all?) newer languages have 
avoided the problem by a somewhat more generous use of keywords. 

--
Alexander Vrchoticky  Technical University Vienna, Dept. for Real-Time Systems
Voice:  +43/222/58801-8168   Fax: +43/222/569149
e-mail: alex@vmars.tuwien.ac.at or vmars!alex@relay.eu.net  (Don't use 'r'!)