Xref: utzoo sci.lang:2551 comp.compilers:259
Path: utzoo!attcan!uunet!husc6!necntc!ima!compilers-sender
From: smryan@garth.uucp (Steven Ryan)
Newsgroups: sci.lang,comp.compilers
Subject: Shallow Parsing
Message-ID: <1080@ima.ISC.COM>
Date: 13 Jun 88 23:09:51 GMT
Sender: compilers-sender@ima.ISC.COM
Reply-To: smryan@garth.uucp (Steven Ryan)
Organization: INTERGRAPH (APD) -- Palo Alto, CA
Lines: 48
Approved: compilers@ima.UUCP
Church:  Our Lady of Reluctant Software.

I once heard a suggestion humans use a different parsing strategy than
compilers. Compilers use a deep parse using lotsa malenky rules and produce
a very impressive parse tree.

The suggestion was that humans memorise faster than generate and are better
at handling large chunks in parallel than small chunks nested.

What I take that to mean, we memorise words in each inflected form, even
regular forms, and possibly even groups words possibly up to words. Than
language generation consists of inserting these chunks into a verb frame
chunk, with each sentence form being a different chunk.

I think I'll call this suggestion shallow parsing: the parse tree will
only go down two or three levels before running into unanalysed (ig est
memorised) chunks. In terms of a productions, this would mean having
thousands of similar yet distinct productions instead factoring the
similarities to reduce the number of productions.

What interests me about this is the possible application to compilers:
humans parse an ambiguous and nondeterministic language in almost always
linear time. Most programming languages are intentionally designed with a
deterministic context free syntax which is LL(1) or LR(1). LR(1) parsing is
all very interesting, but the resulting language is often cumbersome: the
programmer must write out a sufficient left context to help out the parser
even when he/she/it/they/te/se/... can look a little to right and know what
is happen.

      Example: the Pascal 'var' is there to help the parser know this
               is declaration and still remain LL(1).
         var v1,v2,...,vn:type;

To claim LL(1)/LR(1) is superior because of the linear time, O(n), ignores
the fact that this is context free parse and must be followed symbol
identification. Assuming the number of distinct symbols is logarithmic in the
program length, the time necessary for a context sensitive parse is from
O(n log log n) to O(n**2).

It would be interesting if we could learn something from our own minds and
make computer languages more like natural languages (but excluding ambiguity).

                                    Make it simple--not simplistic.
                                                            sm ryan
[From smryan@garth.uucp (Steven Ryan)]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request