Path: utzoo!utgpu!utcsri!norvell From: norvell@utcsri.UUCP (Theodore Stevens Norvell) Newsgroups: comp.lang.misc Subject: Re: Indentation for parsing (was Re: Block Closure (was Re: FOR loops)) Message-ID: <5992@utcsri.UUCP> Date: 28 Apr 88 03:08:34 GMT Article-I.D.: utcsri.5992 Posted: Wed Apr 27 23:08:34 1988 References: <918@rlgvax.UUCP> <2400015@otter.hple.hp.com> <11532@shemp.CS.UCLA.EDU> <3925@killer.UUCP> Reply-To: norvell@utcsri.UUCP (Theodore Stevens Norvell) Organization: CSRI, University of Toronto Lines: 107 Summary: In article <3925@killer.UUCP> elg@killer.UUCP (Eric Green) writes: [By the way, are you the Eric Green who used to live in Halifax?] >I have used a language that uses indentation as its block denotation (Promal). >It gets ugly, folks. That, instead of the formalism objections, should be what >[...] Free-form languages simply offer too many advantages to go back to >fixed-format languages (COBOL!!! GARGH!). David Gast writes > It might also be possible, to use the Grammar Rule Notation, a la > Prolog to describe the indentation levels. I do not know (because > I have not really considered it). The GNR is more powerful than > a CFG. An extra argument to the grammar rules could indicate the > level of indentation. > The discussion about using indentation to indicate program structure is interesting to me as I have designed a language which does exactly this and written a compiler for it. My feeling is that the parser should read programs (approximately) the same way that people do. In particular the most important information used by people to understand the control structure of programs -- namely the indentation structure -- should not be simply thrown out by the compiler. Many of the objections raised by people turn out not to be serious. First, the lack of formal notation is not serious. David Gast is right, a simple Definite Clause Grammar captures it in a fairly readable way. Let me give an example. I am assuming that tokens are records with at least two fields, one to identify the kind of token and one with the column that the first character of the token appeared in. The end-of-file token always starts in column 0. A special nonterminal called lookahead(T) recognizes the empty string iff T is the next token in the input string. statement(Indentation) --> [token(name, Indentation)], [token( := , _)], expression. statement(Indentation) --> [token(while, Indentation)], expression, block(Indentation). statement(Indentation) --> [token(if, Indentation)], expression, block(Indentation), moreif(Indentation). moreif(Indentation) --> [token(else, Indentation)], block(Indentation). moreif(Indentation) --> [token(elsif, Indentation)], expression, block(Indentation), moreif(Indentation). moreif(Indentation) --> lookahead(token(_, Ind)), {Ind <= Indentation}. block(Indentation) --> statement(Ind), {Ind > Indentation}, block(Indentation). block(Indentation) --> lookahead(token(_, Ind)), {Ind <= Indentation}. program --> block(0), [token(end-of-file,0)]. This says (i) that each statement in the body of a while loop must be indented more than the word "while", (ii) that each statement in an alternative of an if statement must be indented more than the word "if", and (iii) that the words "else" and "elsif" must come directly beneath the corresponding "if". (The last restriction is stronger than need be, but solves the else attachment problem cleanly and provides some redundancy required for error-detection/recovery). It is not quite a "free form" language, but it allows any style at all which would be considered "readable" by most people. It allows statements to be split across lines; it doesn't need statement separators or terminators; it allows more than one statement per line. It is no burden to the lexer to produce the column number of each token, as it should do so anyway for accurate error reporting. It is easy to translate such a grammar into a recursive decent parser or an explicit-stack top-down parser. It is probably not hard to translate it into a table driven top-down parser (akin to LL(k)). It is likely that there are bottom-up techniques too, but I haven't thought about it. The point of all this, is that many of the objections to languages with significant indentation raised are due to misconceptions (misconceptions that are understandable, given that most people haven't used such languages or have used languages, like Promal, with lots of other semi-related restrictions.) There are still problems with such a language. One is readability: It requires a lot of indentation to make it easy, e.g., to match else's to if's. Another is writablity: you must be careful not to accidentally over-indent the lines which (you think) follow an if or while statement, or they won't. A third is error recovery: but that is always tricky. A solution to all of these is to make the language less free form by insisting that all statements in the same block start at the same indentation level. Finally, such specification and parsing techniques are useful even in a compiler (or lint) for a "conventional" free form language. They can be used to warn the programmer that his indentation structure does not match his syntactic structure and they can be used to guide error recovery. Theodore Norvell (poor earnest academician) BITNET,CSNET: norvell@csri.toronto.edu CDNNET: norvell@csri.toronto.cdn UUCP: {alegra,cornell,decvax,linus,utzoo}!utcsri!norvell