Xref: utzoo sci.lang:2582 comp.compilers:260
Path: utzoo!attcan!uunet!husc6!spdcc!ima!compilers-sender
From: bart@reed.dec.com (Bart Massey)
Newsgroups: sci.lang,comp.compilers
Subject: Re: Shallow Parsing
Message-ID: <1114@ima.ISC.COM>
Date: 16 Jun 88 08:43:49 GMT
Sender: compilers-sender@ima.ISC.COM
Reply-To: Bart Massey <bart@reed.dec.com>
Organization: Reed College, Portland OR
Lines: 93
Approved: compilers@ima.UUCP
Length: long
In-Reply-To: <1080@ima.ISC.COM>

> The suggestion was that humans memorise faster than generate and are better
> at handling large chunks in parallel than small chunks nested.
> ...
> What interests me about this is the possible application to compilers:
> humans parse an ambiguous and nondeterministic language in almost always
> linear time. Most programming languages are intentionally designed with a
> deterministic context free syntax which is LL(1) or LR(1). LR(1) parsing is
> all very interesting, but the resulting language is often cumbersome: the
> programmer must write out a sufficient left context to help out the parser
> even when he/she/it/they/te/se/... can look a little to right and know what
> is happen.
> ...
> To claim LL(1)/LR(1) is superior because of the linear time, O(n), ignores
> the fact that this is context free parse and must be followed symbol
> identification. Assuming the number of distinct symbols is logarithmic in the
> program length, the time necessary for a context sensitive parse is from
> O(n log log n) to O(n**2).

I don't believe that the above argument about time is correct.  My guess is
that a human parsing *sentences of text* is much like a compiler parsing
*sequences of statements*...  Certainly such a behavior will occur in
"linear time" on the number of sequences parsed.  In fact, this is probably
part of the reason why languages have statements and texts have sentences.

I understand that statements far away from a given statement in a traditional
language are kept track of by the language semantics (e.g. symbol table)
rather than the grammar itself.  Thus, I agree it is the implementation of
the semantics of the language which will determine how parse time grows with
*program* length.  But I'm not sure what this tells me -- in particular,
both humans and compilers seem to do quite well at recovering information
about distant tokens from memory in effectively *constant time* for the
sizes of texts and programs normally compiled.  It's usually safe to assume
that a compiler will compile a 1000 line program 10 times faster than a
10000 line program, or that a person will read a 100 page book 10 times faster
than a 1000 page one.

I believe that the chief advantage of single-level context-free grammars in
the context of computer languages is to avoid the possibilities for
ambiguity the author refers to in his (omitted) last paragraph.
Context-free grammars are so simple (stupid) that there is very little
possibility for ambiguous communication of a program to the machine.  In
other words, it is the *parser* which is "helping out" the *programmer*, in
the sense that the programmer *cannot* easily write an ambiguous program.

However, it is clear that humans use multilevel grammars with great effect.
Does anyone know of work on isolating levels of recognition of text (e.g.
natural language, mathematics) in humans?  I know that HEARSAY II used a
multilevel system for speech recognition, but I'm not sure whether the
levels were modelled on human speech recognition, or just designed a
priori...  For example, there's clearly a point in recognition in which
idiomatic text is converted to its corresponding meaning.  Is this a
seperate step, or part of some other?

I think maybe what the author is getting at is this:  In C, there is the
common programming idiom

	for(<var> = 0; <var> < <const>; <var>++) {
		<loop body>;
	}

Does it pay to recognize this idiom seperately from other for() loops?  Why
or why not?  In particular: 

	Where does such recognition belong in the programming process?
	
	Must this seperate recognition lead to ambiguity, or can one somehow
	guarantee that this for() loop still has the same semantics as other
	for loops?
  	
	Can we somehow speed parsing by recognizing these loops *before*
	doing a standard LL or LR parse?

I too am very interested to hear any answers anyone has to these questions.

					Bart Massey
					UUCP: ..tektronix!reed!bart
[I always thought that computers use unambiguous context free grammars because
we know how to parse them fast, and because they seem empirically to be
convenient for humans to try to express things like computer programs.  I
also think that something like the current parsing methods is probably
time-optimal for conventional computer architectures.  Ned Irons looked at
context-free parsing on small numbers of parallel processors and decided that
it wasn't worth the effort, because each processor ends up waiting for the
one to its left to tell it something crucial about the context for its
chunk of the program.  If the number of processors approximates the number
of tokens, then using some of the processors to look for special cases might
make more sense.  Has anyone looked at highly parallel compiling, as opposed
to compiling for highly parallel machines?  -John]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request