Xref: utzoo sci.lang:2603 comp.compilers:262
Path: utzoo!attcan!uunet!husc6!spdcc!ima!compilers-sender
From: smryan@garth.uucp (Steven Ryan)
Newsgroups: sci.lang,comp.compilers
Subject: Re: Shallow Parsing
Message-ID: <1118@ima.ISC.COM>
Date: 17 Jun 88 19:08:37 GMT
References: <1114@ima.ISC.COM>
Sender: compilers-sender@ima.ISC.COM
Reply-To: smryan@garth.uucp (Steven Ryan)
Organization: INTERGRAPH (APD) -- Palo Alto, CA
Lines: 72
Approved: compilers@ima.UUCP

In article <1114@ima.ISC.COM> Bart Massey <bart@reed.dec.com> writes:
>I don't believe that the above argument about time is correct.  My guess is
>that a human parsing *sentences of text* is much like a compiler parsing
>*sequences of statements*...  Certainly such a behavior will occur in
>"linear time" on the number of sequences parsed.  In fact, this is probably
>part of the reason why languages have statements and texts have sentences.

Yes, I'm thinking that we divide the stream into sentence-like entities.
In spoken language we use pauses to signify a break (and catch our breaths)
and variety of gestures and eye movements. In written language we endmarks
and paragraphs. Programming languages use begin/end/if/... keywords that
break out the overall structures.

Dividing the stream could be done in context free, deterministic fashion. It
may be the contents of each sentence are parsed differently.
 
>the semantics of the language which will determine how parse time grows with
>*program* length.  But I'm not sure what this tells me -- in particular,
>both humans and compilers seem to do quite well at recovering information
>about distant tokens from memory in effectively *constant time* for the

I don't know about humans, but most compilers use a hash table for the symbol
table. In this case the time to identify one symbol out of n distinct symbols
is O(n/m) for a hash table size m. If m is larger than n, it is constant time.
However if n is large enough, the hash table linearly decreases search time,
but the search down any hash entry chain remains n (usually) or log n (for a
sophisticated table).
 
>Does it pay to recognize this idiom seperately from other for() loops?  Why
>or why not?

Also, I'm wonderring about the tradeoff of determinstic versus nondeterministic
grammars or languages or parsing. For those unfamilar with these terms, if a
parser merrily trips through string left to right, given everything it is
seen so far, looking at the current symbol at maybe k symbols to the right, does
it know exactly which production it is in? If this true for a fixed k, the
parser is deterministic. If it requires arbritrarily large k (the lookahead),
it is nondeterminstic. Or it might be ambiguous. (Deterministic langauges
are not inherently ambiguous: each has an LR(1) grammar and LR(1) grammars are
not ambiguous.)
 
>[I always thought that computers use unambiguous context free grammars because
>we know how to parse them fast, and because they seem empirically to be
>convenient for humans to try to express things like computer programs.  I
>also think that something like the current parsing methods is probably
>time-optimal for conventional computer architectures.

I suspect it has more to do parse generators: a parser can be automatically
generated for an LR(1) grammar, but not so for a nondeterministical language.
(Personal remark:) I think Wirth has crippled a generation of programming
language by stripping them of anything that makes life tough for a
compiler writer (my occupation if you haven't already guessed).

>                                                       Ned Irons looked at
>context-free parsing on small numbers of parallel processors and decided that

I first thought about this with respect to CDC Cyber 205 and ETA-10x vector
machines. If somebody can find a way to vectorise parsing, I think it will
open up a vast field of applications which currently appear to be scalar or
sequential.

                                      Thank you for your support,
                                                 sm ryan
[Using Earley's algorithm you can parse ambigious grammars without trouble,
although it is much slower than LR or LALR for the grammars that the latter
two can handle.  -John]
[From smryan@garth.uucp (Steven Ryan)]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request