Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!think.com!mintaka!spdcc!iecc!compilers-sender
From: kgs@dvncnms.cnms.dev.unisys.com (Kenneth G. Salter)
Newsgroups: comp.compilers
Subject: Re: Parsing Cobol with yacc and lex
Summary: Parsing COBOL with yacc seems feasible.
Keywords: Cobol, parse, yacc
Message-ID: <91-05-082@iecc.cambridge.ma.us>
Date: 15 May 91 00:07:51 GMT
References: <9105060154.AA04463@bohra.cpg.oz.au>
Sender: compilers-sender@iecc.cambridge.ma.us
Reply-To: kgs@dvncnms.cnms.dev.unisys.com (Kenneth G. Salter)
Organization: Compilers Central
Lines: 99
Approved: compilers@iecc.cambridge.ma.us

This is my first post.  I'd like to comment on a response by Esmond Pitt
to a post by Carlos E. Galarce.  While building a COBOL compiler is a
large task, prospects are not as bleak as one might think.

In article <9105060154.AA04463@bohra.cpg.oz.au>, ejp@bohra.cpg.oz.au (Esmond Pitt) writes:
> Cobol is neither regular, context-free, nor LR(k) for any k. This makes
> use of lex and yacc highly problematic. Discussion follows.
 
Ignoring semantics and focusing just on parsing, I think that the
Identification, Environment, and Data divisions are regular, and the
Procedure division is LR(1).

> Lex: At least four scanning modes are required.
> In addition to the default (normal) mode you need:
 
>     (a) Comment-entry mode. Comment-entries in the Identification Division
>     (AUTHOR etc) have their own lexical rules. ...
 
>     (b) PICTURE mode. X(120) is either a single PICTURE-string token or
>     4 tokens representing an indexed identifier, depending on context.
>     PICTURE mode is triggered by a preceding PIC(TURE)?(IS)?).
 
Lex should categorize the words it scans with enough granularity that
yacc is not confused.  Candidate tokens are:  DN_only, PIC_only,
DN_or_PIC, INT_or_PIC, etc.  Lex should recognize these tokens
regardless of context; resolving context is the parser's job.  Yacc
will need productions of the form:
	DataName   : DN_only  | DN_or_PIC ;
	PIC_string : PIC_only | DN_or_PIC | INT_or_PIC ;
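
As a rough sketch of the lexer side, a C helper called from a lex action
might classify each word as below.  The character sets are a simplified
subset of the real data-name and PICTURE rules, and INT_only, OTHER, and
classify() are names made up here for illustration.

```c
#include <ctype.h>
#include <string.h>

/* Token categories; INT_only and OTHER are hypothetical additions. */
enum word_class { OTHER, INT_only, DN_only, PIC_only, DN_or_PIC, INT_or_PIC };

/* All digits, at least one. */
static int is_integer(const char *w)
{
    if (*w == '\0') return 0;
    for (; *w; w++)
        if (!isdigit((unsigned char)*w)) return 0;
    return 1;
}

/* Letters, digits, hyphens; at least one letter; no leading or
   trailing hyphen; at most 30 characters. */
static int is_data_name(const char *w)
{
    size_t n = strlen(w);
    int letters = 0;
    if (n == 0 || n > 30 || w[0] == '-' || w[n - 1] == '-') return 0;
    for (; *w; w++) {
        unsigned char c = (unsigned char)*w;
        if (isalpha(c)) letters++;
        else if (!isdigit(c) && c != '-') return 0;
    }
    return letters > 0;
}

/* Simplified PICTURE check: a few symbols, plus "(count)" groups.
   It does not validate symbol order, CR/DB, or other details. */
static int is_picture(const char *w)
{
    int in_paren = 0, count_digits = 0;
    if (*w == '\0' || *w == '(') return 0;
    for (; *w; w++) {
        int c = toupper((unsigned char)*w);
        if (in_paren) {
            if (c == ')') {
                if (count_digits == 0) return 0;
                in_paren = 0;
            } else if (isdigit(c)) {
                count_digits++;
            } else {
                return 0;
            }
        } else if (c == '(') {
            in_paren = 1;
            count_digits = 0;
        } else if (strchr("ABPSVXZ90+-.,/*$", c) == NULL) {
            return 0;
        }
    }
    return !in_paren;
}

/* Pick the most specific category without looking at context. */
enum word_class classify(const char *w)
{
    int dn = is_data_name(w), pic = is_picture(w), num = is_integer(w);
    if (dn && pic)  return DN_or_PIC;
    if (num && pic) return INT_or_PIC;
    if (dn)         return DN_only;
    if (pic)        return PIC_only;
    if (num)        return INT_only;
    return OTHER;
}
```

Under these assumptions X and S9V99 come back as DN_or_PIC, 99 as
INT_or_PIC, X(120) as PIC_only, and CUSTOMER-NAME as DN_only.  Reserved
words would have to be looked up before this check.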

>     (c) DECIMAL-POINT IS COMMA mode. This phrase changes the format of
>     numeric literals.
 
Rather than try to hard-code the definition of a number with a regular
expression, a word that looks like a number can trigger a lex function
that scans carefully, being sensitive to whether DECIMAL-POINT IS COMMA
has appeared.
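
A minimal sketch of such a function in C follows.  The name scan_number()
and the flag argument are assumptions for illustration; the flag would be
set once DECIMAL-POINT IS COMMA has been seen.  The sketch treats the
decimal point as part of the literal only when a digit follows it, so
that a sentence-ending period is left for the next token.

```c
#include <ctype.h>
#include <stddef.h>

/* Return the length of the numeric literal at the start of s,
   or 0 if s does not start with one. */
size_t scan_number(const char *s, int decimal_point_is_comma)
{
    char point = decimal_point_is_comma ? ',' : '.';
    size_t i = 0, digits = 0;

    if (s[i] == '+' || s[i] == '-')          /* optional sign */
        i++;
    while (isdigit((unsigned char)s[i])) {   /* integer part */
        i++;
        digits++;
    }
    /* Accept the decimal point only if a digit follows it. */
    if (s[i] == point && isdigit((unsigned char)s[i + 1])) {
        i++;
        while (isdigit((unsigned char)s[i])) {
            i++;
            digits++;
        }
    }
    return digits ? i : 0;
}
```

For example, scan_number("3,14 ", 0) consumes only the "3", while
scan_number("3,14 ", 1) consumes all of "3,14".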

> The whole lexical process is greatly complicated by the rules for
> continued identifiers, numeric literals and alpha literals.  Also, you
> have to lexically ignore sequence-number area and the area to the right of
> margin R (which, incidentally, is undefined except by universal
> agreement).
 
You could front-end lex with a filter that erases columns 1-6 and 73-80,
and that replaces comment lines by blank lines.  Or, you might write an
input() function for lex to do the above and to merge continued lines into
a single line so that the parser need not deal with continuation.
This input() might also solve the column sensitivity problem by prefixing
non-blank text appearing in Areas A and C with distinguished characters.
All of this has to be done in a way that does not lose the actual source
line numbers.
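
The simplest form of the filter could look something like the C sketch
below.  The name normalize_line() is made up for illustration, and the
sketch assumes that a comment line is marked by '*' or '/' in column 7.
It emits exactly one output line per input line, which is what keeps the
line numbers intact; it does not attempt continuation merging or the
Area marking.

```c
#include <string.h>

#define SEQ_COLS   6    /* columns 1-6: sequence-number area */
#define INDICATOR  6    /* zero-based index of column 7 */
#define LAST_COL   72   /* columns 73 and beyond are dropped */

/* Copy one source line from in to out, blanking columns 1-6,
   dropping columns 73 onward, and emptying comment lines.
   out must have room for LAST_COL + 1 characters. */
void normalize_line(const char *in, char *out)
{
    size_t n = strcspn(in, "\r\n");   /* ignore any line terminator */
    size_t i;

    if (n > LAST_COL)
        n = LAST_COL;
    if (n > INDICATOR && (in[INDICATOR] == '*' || in[INDICATOR] == '/'))
        n = 0;                        /* comment line becomes blank */
    for (i = 0; i < n; i++)
        out[i] = (i < SEQ_COLS) ? ' ' : in[i];
    out[n] = '\0';
}
```

Blanking the sequence area rather than deleting it keeps every remaining
character in its original column, so later stages can still tell Area A
from Area B by position.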

> The rules about Area A (indentation) are formally unnecessary except when
> looking for the end of a comment-entry. You still need to enforce these
> rules, however ...
 
See above.

> COPY REPLACING and REPLACE have their own significant peculiarities.
 
> Yacc: Cobol is not LR(k) for any fixed k because of:
>     (a) the WITH DATA phrase
>     (b) the NOT {AT END/INVALID KEY/ON OVERFLOW/SIZE ERROR} phrases,
> both of which are scope terminators requiring arbitrary lookahead, ...

These phrases are no more complex than IF ... THEN ... ELSE, and not
much more complicated than parenthesized expressions.

>     (c) the syntax of abbreviated combined relational conditions.
>     In one form of these the noise-word IS becomes syntactically
>     significant, contrary to one of the stated objectives of Cobol-85.
     
> Other problems:
> 
>     (a) Yes, the grammar is enormous. Cobol-85 has over 400 reserved words.
 
>     (b) The syntax for the I/O statements (READ, REWRITE, WRITE,
>     DELETE, ...) is dependent on the ACCESS MODE of the file named.
>     Depending on the access mode, either an INVALID KEY or an AT END
>     phrase is the legal syntactic continuation.  This gets important in
>     COBOL-85 with the arbitrary nesting allowed; otherwise your parser
>     will tie e.g.  an INVALID KEY phrase to the closest READ statement
>     instead of the one it really belongs to, and completely mess up the
>     syntactic scope.
 
Good point.  I'd have missed this.  For initial efforts, I'd ignore it,
reasoning that the COBOL programmer can avoid the problem by delimiting
each READ with an END-READ.  Long term, you'd need yacc actions to
straighten out the parse tree as soon as yacc reduces INVALID KEY, etc.

I haven't taken the time to address all the issues, just the easier ones.
Hope this helps.
--
Kenneth G. Salter, Unisys Corporation
