Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!samsung!olivea!uunet!world!iecc!compilers-sender
From: ejp@bohra.cpg.oz.au (Esmond Pitt)
Newsgroups: comp.compilers
Subject: Re: Parsing Cobol with yacc and lex
Keywords: Cobol, parse, yacc
Message-ID: <91-05-086@comp.compilers>
Date: 17 May 91 03:02:19 GMT
References: <9105060154.AA04463@bohra.cpg.oz.au> <91-05-082@iecc.cambridge.ma.us>
Sender: compilers-sender@iecc.cambridge.ma.us
Reply-To: ejp@bohra.cpg.oz.au (Esmond Pitt)
Organization: Software Division, Computer Power Group
Lines: 52
Approved: compilers@iecc.cambridge.ma.us

In article <91-05-082@iecc.cambridge.ma.us> kgs@dvncnms.cnms.dev.unisys.com
(Kenneth G. Salter) writes:

[following-up my posting on COBOL]

> Lex should categorize the words it scans with enough granularity that
> yacc is not confused. Candidate tokens are: DN_only, PIC_only,
> DN_or_PIC, INT_or_PIC, etc. Lex should recognize these tokens
> regardless of context. Context is a parser function. Yacc will
> need productions of the form:
>     DataName : DN_only | DN_or_PIC ;
>     PIC_string : PIC_only | DN_or_PIC | INT_or_PIC ;

If I have understood correctly, this would mean that the DN_or_PIC rule
has to unpick the single token returned by yylex() into the dataname
part, the left brace, the index, and the right brace, and then parse the
result, all without benefit of lex, yacc, clergy, ... You can do it all
right, but the result is not exactly lex+yacc parsing. It's more like
keeping a dog and barking yourself. I prefer the scanner to scan and the
parser to parse.

My own solution to this was to have the parser switch Lex's start states
whenever a picture-string is expected.

>> Yacc: Cobol is not LR(k) for any fixed k because of:
>>     (a) the WITH DATA phrase
>>     (b) the NOT {AT END/INVALID KEY/ON OVERFLOW/SIZE ERROR} phrases,
>
> These phrases are no more complex than IF ... THEN ... ELSE, and not
> much more complicated than parenthesized expressions.

They are not _complicated_ at all, but they all contain optional
noisewords, and they all require lookahead of > 1 token beyond the RHS
of a production. You get reduce/reduce conflicts.

The statement that COBOL is not LR(1) because of these elements was made
to me by a member of the ANSI X3-23.1985 committee. I don't have details
on-line; maybe I'll be able to followup with these later.

There are solutions to these problems; the task has been accomplished
several times. I've done it myself. My point was that the task is quite
a bit more complex than just sitting down and hacking out a lex & yacc
script as you might expect, and as the comp.compilers monthly message
used to say.

Most of this is because the basic structures of the language date from
before 1960, i.e. before compiler theory really got going, and the
subsequent revisions to the standard have not really addressed
compiler-theoretic issues.

--
Esmond Pitt, Computer Power Group
ejp@bohra.cpg.oz
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
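
Illustrative sketch of the start-state approach mentioned in the post
(the parser switches the scanner's state whenever a picture-string is
expected). This is not the poster's code. It is a minimal Python
stand-in for a lex scanner with two start states, where begin() plays
the role of lex's BEGIN. All names here (INITIAL, PICTURE, Scanner,
parse_data_entry, the token kinds) are assumptions made for the
example, and the token patterns are deliberately simplified.

```python
import re

# Two scanner modes standing in for lex start states (hypothetical names).
INITIAL, PICTURE = "INITIAL", "PICTURE"

# A word ends where the next character is not a letter, digit or hyphen.
END = r"(?![A-Za-z0-9-])"

# Simplified token patterns per mode, tried in order.
PATTERNS = {
    INITIAL: [
        ("PIC", re.compile(r"(?:PICTURE|PIC)" + END, re.I)),
        ("INT", re.compile(r"\d+" + END)),
        ("NAME", re.compile(r"[A-Za-z0-9][A-Za-z0-9-]*")),
        ("DOT", re.compile(r"\.")),
    ],
    PICTURE: [
        ("IS", re.compile(r"IS" + END, re.I)),  # optional noise word
        # The whole picture-string is one token. It may contain '.', so it
        # stops only before whitespace or before a final sentence period.
        ("PIC_STRING", re.compile(r"\S+?(?=\.?(?:\s|$))")),
    ],
}


class Scanner:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.mode = INITIAL

    def begin(self, mode):
        """Switch start state; the analogue of BEGIN in a lex action."""
        self.mode = mode

    def next_token(self):
        """Return (kind, text) for the next token, or None at end of input."""
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.text):
            return None
        for kind, pattern in PATTERNS[self.mode]:
            m = pattern.match(self.text, self.pos)
            if m:
                self.pos = m.end()
                return kind, m.group()
        raise SyntaxError(f"unexpected input at offset {self.pos}")


def parse_data_entry(text):
    """Toy driver standing in for the parser: it decides the scanner mode."""
    scanner = Scanner(text)
    tokens = []
    while (tok := scanner.next_token()) is not None:
        tokens.append(tok)
        if tok[0] == "PIC":
            scanner.begin(PICTURE)   # a picture-string is expected next
        elif tok[0] == "PIC_STRING":
            scanner.begin(INITIAL)   # back to ordinary scanning
    return tokens


if __name__ == "__main__":
    for tok in parse_data_entry("05 X PIC X."):
        print(tok)
```

In this sketch the text "X" comes back as NAME in one position and as
PIC_STRING in the other, so the grammar never has to take a combined
token apart. With a real yacc parser one would also have to make sure
the switch happens before yylex() is called for the next token, since
yacc may read a lookahead token before an action runs.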