Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!world!iecc!compilers-sender From: compres!chris@crackers.clearpoint.com (Chris Clark) Newsgroups: comp.compilers Subject: Technical question about yacc and LALR parsers Keywords: yacc, parse, errors Message-ID: <9104260319.AA00594@compres.uucp> Date: 21 Apr 91 02:04:33 GMT References: <9104222013.AA09131@florida.HQ.Ileaf.COM> Sender: compilers-sender@iecc.cambridge.ma.us Reply-To: compres!chris@crackers.clearpoint.com (Chris Clark) Organization: Compilers Central Lines: 58 Approved: compilers@iecc.cambridge.ma.us In Jim Roskind's previous posting, he describes a method for working back through a grammar to determine the source of a conflict and an interesting observation about the yacc states. His interesting observation is the fact that a given state is always entered on a transition on the same symbol. This means you'll never see a state like the one below in a yacc verbose output. state 727 parened_type : '(' VOID . ')' parened_type : '(' INT . ')' Several authors have offered proofs that this must be true. Jim's own proof brings up the key reason, which involves the meaning of a production and an item. Peter Garst properly labels this a tail property. Here is yet another explanation of why the property holds and a way of solving it by directly translating regular expression grammars. In most yacc variants, each unique right-hand-side is considered a unique production and is given a production number. Each symbol within a unique right-hand-side is given a number within the production. An item is simply an ordered pair (symbol number, production number). A yacc state is simply a unique collection of items. However, because two unique right-hand-size must have unique production numbers and thus unique items, the two states cannot be merged by the (LALR) algorithm. Fortunately, the "almost parsers" Jim wants are "easy" to create just by changing the definition of an item. For example, in Compiler Resources' Yacc++, we use regular expressions to solve the same problem. Because of our direct translation algorithm, you will often see states like the one below. state 727 parened_type : '(' (VOID | INT) . ')' This state will be reached on transitions of both symbols VOID and INT. We cannot quantify the number of states saved, except that it's a lot, perhaps as much as 50% for some grammars. The exact figures depend on how many productions have suffixes in common. A few productions with long common tails can truly increase the percentage. Other factors also come into play, so your mileage may vary. By the way, Jim's backtracking algorithm mirrors David Spector's algorithm for splitting LALR states into LR ones--it was published in SIGPLAN notices, Dec 1988, v 23 # 12, for those of you who are interested. By walking back only to the merged states, one can perform LR disambiguation. By walking back to the start state, one can determine the set of ambiguous prefixes, which is what Jim wants. Regards, Chris Clark & Barbara Zino (508) 435-5016 -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.