Xref: utzoo sci.lang:2551 comp.compilers:259 Path: utzoo!attcan!uunet!husc6!necntc!ima!compilers-sender From: smryan@garth.uucp (Steven Ryan) Newsgroups: sci.lang,comp.compilers Subject: Shallow Parsing Message-ID: <1080@ima.ISC.COM> Date: 13 Jun 88 23:09:51 GMT Sender: compilers-sender@ima.ISC.COM Reply-To: smryan@garth.uucp (Steven Ryan) Organization: INTERGRAPH (APD) -- Palo Alto, CA Lines: 48 Approved: compilers@ima.UUCP Church: Our Lady of Reluctant Software. I once heard a suggestion humans use a different parsing strategy than compilers. Compilers use a deep parse using lotsa malenky rules and produce a very impressive parse tree. The suggestion was that humans memorise faster than generate and are better at handling large chunks in parallel than small chunks nested. What I take that to mean, we memorise words in each inflected form, even regular forms, and possibly even groups words possibly up to words. Than language generation consists of inserting these chunks into a verb frame chunk, with each sentence form being a different chunk. I think I'll call this suggestion shallow parsing: the parse tree will only go down two or three levels before running into unanalysed (ig est memorised) chunks. In terms of a productions, this would mean having thousands of similar yet distinct productions instead factoring the similarities to reduce the number of productions. What interests me about this is the possible application to compilers: humans parse an ambiguous and nondeterministic language in almost always linear time. Most programming languages are intentionally designed with a deterministic context free syntax which is LL(1) or LR(1). LR(1) parsing is all very interesting, but the resulting language is often cumbersome: the programmer must write out a sufficient left context to help out the parser even when he/she/it/they/te/se/... can look a little to right and know what is happen. Example: the Pascal 'var' is there to help the parser know this is declaration and still remain LL(1). var v1,v2,...,vn:type; To claim LL(1)/LR(1) is superior because of the linear time, O(n), ignores the fact that this is context free parse and must be followed symbol identification. Assuming the number of distinct symbols is logarithmic in the program length, the time necessary for a context sensitive parse is from O(n log log n) to O(n**2). It would be interesting if we could learn something from our own minds and make computer languages more like natural languages (but excluding ambiguity). Make it simple--not simplistic. sm ryan [From smryan@garth.uucp (Steven Ryan)] -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request