Path: utzoo!utgpu!utcsri!norvell
From: norvell@utcsri.UUCP (Theodore Stevens Norvell)
Newsgroups: comp.lang.misc
Subject: Re: Indentation for parsing (was Re: Block Closure (was Re: FOR loops))
Message-ID: <5992@utcsri.UUCP>
Date: 28 Apr 88 03:08:34 GMT
Article-I.D.: utcsri.5992
Posted: Wed Apr 27 23:08:34 1988
References: <918@rlgvax.UUCP> <2400015@otter.hple.hp.com> <11532@shemp.CS.UCLA.EDU> <3925@killer.UUCP>
Reply-To: norvell@utcsri.UUCP (Theodore Stevens Norvell)
Organization: CSRI, University of Toronto
Lines: 107
Summary: 

In article <3925@killer.UUCP> elg@killer.UUCP (Eric Green) writes:
[By the way, are you the Eric Green who used to live in Halifax?]
>I have used a language that uses indentation as its block denotation (Promal).
>It gets ugly, folks. That, instead of the formalism objections, should be what
>[...] Free-form languages simply offer too many advantages to go back to
>fixed-format languages (COBOL!!! GARGH!).

David Gast writes
>   It might also be possible, to use the Grammar Rule Notation, a la
>   Prolog to describe the indentation levels.  I do not know (because
>   I have not really considered it). The GNR is more powerful than
>   a CFG.  An extra argument to the grammar rules could indicate the
>   level of indentation.
>

The discussion about using indentation to indicate program structure is
interesting to me as I have designed a language which does exactly this
and written a compiler for it.  My feeling is that the parser should
read programs (approximately) the same way that people do.  In particular
the most important information used by people to understand the control
structure of programs -- namely the indentation structure -- should
not be simply thrown out by the compiler.

Many of the objections raised by people turn out not to be serious.
First, the lack of formal notation is not serious. David Gast is
right, a simple Definite Clause Grammar captures it in a fairly
readable way.

Let me give an example.  I am assuming that tokens are records with at
least two fields, one to identify the kind of token and one with the
column that the first character of the token appeared in. The
end-of-file token always starts in column 0.  A special nonterminal
called lookahead(T) recognizes the empty string iff T is the next
token in the input string.

statement(Indentation) --> [token(name, Indentation)],
			   [token( := , _)],
			   expression.
statement(Indentation) --> [token(while, Indentation)],
			   expression,
			   block(Indentation).
statement(Indentation) --> [token(if, Indentation)],
			   expression,
			   block(Indentation),
			   moreif(Indentation).

moreif(Indentation) --> [token(else, Indentation)],
			block(Indentation).
moreif(Indentation) --> [token(elsif, Indentation)],
			expression,
			block(Indentation),
			moreif(Indentation).
moreif(Indentation) --> lookahead(token(_, Ind)), {Ind <= Indentation}.

block(Indentation) --> statement(Ind), {Ind > Indentation},
		       block(Indentation).
block(Indentation) --> lookahead(token(_, Ind)), {Ind <= Indentation}.

program --> block(0), [token(end-of-file,0)].

This says (i) that each statement in the body of a while loop must be
indented more than the word "while", (ii) that each statement in an
alternative of an if statement must be indented more than the
word "if", and (iii) that the words "else" and "elsif" must come directly
beneath the corresponding "if".  (The last restriction is stronger than
need be, but solves the else attachment problem cleanly and provides some
redundancy required for error-detection/recovery).

It is not quite a "free form" language, but it allows any style at all
which would be considered "readable" by most people.  It allows statements
to be split across lines; it doesn't need statement separators or
terminators;  it allows more than one statement per line.

It is no burden to the lexer to produce the column number of each token,
as it should do so anyway for accurate error reporting.

It is easy to translate such a grammar into a recursive decent parser
or an explicit-stack top-down parser.  It is probably not hard to
translate it into a table driven top-down parser (akin to LL(k)).
It is likely that there are bottom-up techniques too, but I haven't
thought about it.

The point of all this, is that many of the objections to languages
with significant indentation raised are due to misconceptions
(misconceptions that are understandable, given that most people
haven't used such languages or have used languages, like Promal,
with lots of other semi-related restrictions.)

There are still problems with such a language.  One is readability:
It requires a lot of indentation to make it easy, e.g.,  to match
else's to if's.  Another is writablity: you must be careful not
to accidentally over-indent the lines which (you think) follow an if
or while statement, or they won't.  A third is error recovery:
but that is always tricky.  A solution to all of these is to
make the language less free form by insisting that all 
statements in the same block start at the same indentation level.

Finally, such specification and parsing techniques are useful even
in a compiler (or lint) for a "conventional" free form language.
They can be used to warn the programmer that his indentation
structure does not match his syntactic structure and they can
be used to guide error recovery.

Theodore Norvell (poor earnest academician)
BITNET,CSNET:   norvell@csri.toronto.edu
CDNNET:         norvell@csri.toronto.cdn
UUCP:           {alegra,cornell,decvax,linus,utzoo}!utcsri!norvell