Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!think.com!mintaka!spdcc!iecc!compilers-sender
From: kgs@dvncnms.cnms.dev.unisys.com (Kenneth G. Salter)
Newsgroups: comp.compilers
Subject: Re: Parsing Cobol with yacc and lex
Summary: Parsing COBOL with yacc seems feasible.
Keywords: Cobol, parse, yacc
Message-ID: <91-05-082@iecc.cambridge.ma.us>
Date: 15 May 91 00:07:51 GMT
References: <9105060154.AA04463@bohra.cpg.oz.au>
Sender: compilers-sender@iecc.cambridge.ma.us
Reply-To: kgs@dvncnms.cnms.dev.unisys.com (Kenneth G. Salter)
Organization: Compilers Central
Lines: 99
Approved: compilers@iecc.cambridge.ma.us

This is my first post. I'd like to comment on a response by Esmond Pitt
to a post by Carlos E. Galarce. While building a COBOL compiler is a large
task, prospects are not as bleak as one might think.

In article <9105060154.AA04463@bohra.cpg.oz.au>, ejp@bohra.cpg.oz.au (Esmond Pitt) writes:
> Cobol is neither regular, context-free, nor LR(k) for any k. This makes
> use of lex and yacc highly problematic. Discussion follows.

Ignoring semantics and focusing just on parsing, I think that the
Identification, Environment, and Data divisions are regular, and the
Procedure division is LR(1).

> Lex: At least four scanning modes are required.
> In addition to the default (normal) mode you need:
> (a) Comment-entry mode. Comment-entries in the Identification Division
> (AUTHOR etc) have their own lexical rules. ...
> (b) PICTURE mode. X(120) is either a single PICTURE-string token or
> 4 tokens representing an indexed identifier, depending on context.
> PICTURE mode is triggered by a preceding PIC(TURE)?(IS)?).

Lex should categorize the words it scans with enough granularity that yacc
is not confused. Candidate tokens are: DN_only, PIC_only, DN_or_PIC,
INT_or_PIC, etc. Lex should recognize these tokens regardless of context.
Context is a parser function.
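
The context-free categorization described above can be sketched in a few
lines. This is only an illustration, written in Python rather than as lex
rules so that it can be run directly. The function name classify, the extra
categories INT_only and OTHER, and the simplified character sets for data
names and PICTURE strings are all my assumptions, not anything from the
COBOL standard or from an existing scanner. Repeat counts such as (120) are
deliberately left out.

```python
import re

# Assumed, simplified word shapes; a real COBOL scanner needs the full rules.
DATA_NAME = re.compile(r"[A-Z0-9]+(?:-+[A-Z0-9]+)*")  # letters, digits, inner hyphens
INTEGER = re.compile(r"[0-9]+")
# PICTURE symbols without repeat counts; CR and DB are two-letter symbols.
PICTURE = re.compile(r"(?:CR|DB|[ABEGNPSVXZ901/,.+\-*$])+")


def classify(word: str) -> str:
    """Return a token category for one word, looking at no surrounding context."""
    w = word.upper()
    is_int = INTEGER.fullmatch(w) is not None
    # A data name is a user-defined word of at most 30 characters
    # that contains at least one letter.
    is_dn = (
        len(w) <= 30
        and DATA_NAME.fullmatch(w) is not None
        and any(c.isalpha() for c in w)
    )
    is_pic = PICTURE.fullmatch(w) is not None

    if is_dn and is_pic:
        return "DN_or_PIC"
    if is_int and is_pic:
        return "INT_or_PIC"
    if is_dn:
        return "DN_only"
    if is_int:
        return "INT_only"   # hypothetical category, covered by "etc." above
    if is_pic:
        return "PIC_only"
    return "OTHER"          # hypothetical catch-all


if __name__ == "__main__":
    for word in ["CUSTOMER-NAME", "S9V99", "999", "42", "$ZZ9.99"]:
        print(word, classify(word))
```

The scanner never asks whether PIC or PICTURE came before the word. It
reports every reading the word could have and leaves the choice to the
grammar.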
Yacc will need productions of the form:

    DataName   : DN_only | DN_or_PIC ;
    PIC_string : PIC_only | DN_or_PIC | INT_or_PIC ;

> (c) DECIMAL-POINT IS COMMA mode. This phrase changes the format of
> numeric literals.

Rather than hard-coding the definition of a number in a regular expression,
let any word that looks like a number trigger a lex function that scans
carefully, taking into account whether DECIMAL-POINT IS COMMA has appeared.

> The whole lexical process is greatly complicated by the rules for
> continued identifiers, numeric literals and alpha literals. Also, you
> have to lexically ignore sequence-number area and the area to the right of
> margin R (which, incidentally, is undefined except by universal
> agreement).

You could front-end lex with a filter that erases columns 1-6 and 73-80,
and that replaces comment lines by blank lines. Or, you might write an
input() function for lex to do the above and to merge continued lines into
a single line, so that the parser need not deal with continuation. This
input() might also solve the column sensitivity problem by prefixing
non-blank text appearing in Areas A and C with distinguished characters.
All of this has to be done in a way that does not lose the actual source
line numbers.

> The rules about Area A (indentation) are formally unnecessary except when
> looking for the end of a comment-entry. You still need to enforce these
> rules, however ...

See above.

> COPY REPLACING and REPLACE have their own significant peculiarities.
> Yacc: Cobol is not LR(k) for any fixed k because of:
> (a) the WITH DATA phrase
> (b) the NOT {AT END/INVALID KEY/ON OVERFLOW/SIZE ERROR} phrases,
> both of which are scope terminators requiring arbitrary lookahead, ...

These phrases are no more complex than IF ... THEN ... ELSE, and not much
more complicated than parenthesized expressions.

> (c) the syntax of abbreviated combined relational conditions.
> In one form of these the noise-word IS becomes syntactically
> significant, contrary to one of the stated objectives of Cobol-85.
> Other problems:
>
> (a) Yes, the grammar is enormous. Cobol-85 has over 400 reserved words.
> (b) The syntax for the I/O statements (READ, REWRITE, WRITE,
> DELETE, ...) is dependent on the ACCESS MODE of the file named.
> Depending on the access mode, either an INVALID KEY or an AT END
> phrase is the legal syntactic continuation. This gets important in
> COBOL-85 with the arbitrary nesting allowed; otherwise your parser
> will tie e.g. an INVALID KEY phrase to the closest READ statement
> instead of the one it really belongs to, and completely mess up the
> syntactic scope.

Good point. I'd have missed this. For initial efforts, I'd ignore it,
reasoning that the COBOL programmer can avoid the problem by delimiting
each READ with an END-READ. Long term, you'd need yacc actions to
straighten out the parse tree as soon as yacc reduces INVALID KEY, etc.

I haven't taken the time to address all the issues, just the easier ones.
Hope this helps.

-- Kenneth G. Salter, Unisys Corporation
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.