Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!samsung!olivea!uunet!world!iecc!compilers-sender
From: ejp@bohra.cpg.oz.au (Esmond Pitt)
Newsgroups: comp.compilers
Subject: Re: Parsing Cobol with yacc and lex
Keywords: Cobol, parse, yacc
Message-ID: <91-05-086@comp.compilers>
Date: 17 May 91 03:02:19 GMT
References: <9105060154.AA04463@bohra.cpg.oz.au> <91-05-082@iecc.cambridge.ma.us>
Sender: compilers-sender@iecc.cambridge.ma.us
Reply-To: ejp@bohra.cpg.oz.au (Esmond Pitt)
Organization: Software Division, Computer Power Group
Lines: 52
Approved: compilers@iecc.cambridge.ma.us

In article <91-05-082@iecc.cambridge.ma.us> kgs@dvncnms.cnms.dev.unisys.com (Kenneth G. Salter) writes:
[following up my posting on COBOL]

> Lex should categorize the words it scans with enough granularity that
> yacc is not confused. Candidate tokens are: DN_only, PIC_only,
> DN_or_PIC, INT_or_PIC, etc. Lex should recognize these tokens
> regardless of context. Context is a parser function. Yacc will 
> need productions of the form:
> 	DataName   : DN_only  | DN_or_PIC ;
> 	PIC_string : PIC_only | DN_or_PIC | INT_or_PIC ;

If I have understood correctly, this would mean that the DN_or_PIC rule
has to unpick the single token returned by yylex() into the dataname part,
the left brace, the index, and the right brace, and then parse the result,
all without benefit of lex, yacc, clergy, ... You can do it all right, but
the result is not exactly lex+yacc parsing.

It's more like keeping a dog and barking yourself. I prefer the scanner to
scan and the parser to parse. My own solution to this was to have the
parser switch Lex's start states whenever a picture-string is expected.
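
As a rough illustration of that division of labour, here is a small
hand-written C stand-in for a lex scanner with two start states. It is
only a sketch: the names (struct scanner, expect_picture, next_token,
the T_* codes) are invented for the example, and the picture-string rule
is simplified to "everything up to the next blank, less a trailing
period".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes and mode names, invented for this sketch. */
enum token { T_EOF, T_WORD, T_INTEGER, T_LPAREN, T_RPAREN, T_PICTURE, T_OTHER };
enum scan_mode { MODE_NORMAL, MODE_PICTURE };

struct scanner {
    const char *pos;      /* next unread character */
    enum scan_mode mode;  /* set by the parser, like BEGIN in lex */
    char text[64];        /* text of the last token, like yytext */
};

void scanner_init(struct scanner *s, const char *input)
{
    s->pos = input;
    s->mode = MODE_NORMAL;
    s->text[0] = '\0';
}

/* The parser calls this when the grammar says a picture-string is next. */
void expect_picture(struct scanner *s)
{
    s->mode = MODE_PICTURE;
}

enum token next_token(struct scanner *s)
{
    size_t n = 0;
    int all_digits = 1;

    assert(s != NULL && s->pos != NULL);
    while (isspace((unsigned char)*s->pos))
        s->pos++;
    s->text[0] = '\0';
    if (*s->pos == '\0')
        return T_EOF;

    if (s->mode == MODE_PICTURE) {
        /* The whole picture-string is one token (simplified rule). */
        while (*s->pos != '\0' && !isspace((unsigned char)*s->pos)
               && n < sizeof s->text - 1)
            s->text[n++] = *s->pos++;
        if (n > 1 && s->text[n - 1] == '.') {  /* leave the final period */
            n--;
            s->pos--;
        }
        s->text[n] = '\0';
        s->mode = MODE_NORMAL;                 /* one-shot switch */
        return T_PICTURE;
    }

    if (isalnum((unsigned char)*s->pos)) {
        /* Ordinary word or integer: letters, digits and hyphens. */
        while ((isalnum((unsigned char)*s->pos) || *s->pos == '-')
               && n < sizeof s->text - 1) {
            if (!isdigit((unsigned char)*s->pos))
                all_digits = 0;
            s->text[n++] = *s->pos++;
        }
        s->text[n] = '\0';
        return all_digits ? T_INTEGER : T_WORD;
    }

    /* Any other character is a single-character token. */
    s->text[0] = *s->pos++;
    s->text[1] = '\0';
    if (s->text[0] == '(')
        return T_LPAREN;
    if (s->text[0] == ')')
        return T_RPAREN;
    return T_OTHER;
}
```

With this, A(5) in a data-name position comes back as WORD, "(",
INTEGER, ")" for the parser to assemble, while the same characters after
PIC come back as a single picture token once the parser has called
expect_picture(). In a real yacc grammar the switch has to take effect
before the parser fetches its lookahead token, so the placement of the
action matters.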

>> Yacc: Cobol is not LR(k) for any fixed k because of:
>>     (a) the WITH DATA phrase
>>     (b) the NOT {AT END/INVALID KEY/ON OVERFLOW/SIZE ERROR} phrases,
>
> These phrases are no more complex than IF ... THEN ... ELSE, and not
> much more complicated than parenthesized expressions.

They are not _complicated_ at all, but they all contain optional
noisewords, and they all require lookahead of more than one token beyond
the RHS of a production. You get reduce/reduce conflicts.
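
To make the lookahead problem concrete, consider

	READ F AT END ADD 1 TO X NOT ...

On seeing NOT the parser cannot yet tell whether it begins NOT [ON] SIZE
ERROR for the ADD or NOT [AT] END for the READ. One commonly used
workaround, shown here purely as an assumption and not necessarily the
one any particular compiler uses, is to let the scanner look past the
noisewords and hand the parser a single combined token. The sketch below
works on an array of upper-case words; the names (classify, TOK_*) are
invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes, invented for this sketch. */
enum tok {
    TOK_WORD, TOK_NOT,
    TOK_NOT_AT_END, TOK_NOT_INVALID_KEY,
    TOK_NOT_ON_OVERFLOW, TOK_NOT_SIZE_ERROR
};

/* True if words[i] exists and equals s. */
static int word_is(const char *const *words, size_t n, size_t i,
                   const char *s)
{
    return i < n && strcmp(words[i], s) == 0;
}

/* Classify the token starting at words[*i] and advance *i past it.
   NOT followed by [AT] END, INVALID [KEY], [ON] OVERFLOW or
   [ON] SIZE ERROR is merged into one token; anything else is
   returned one word at a time. */
enum tok classify(const char *const *words, size_t n, size_t *i)
{
    size_t j, k;
    int at, on;

    assert(words != NULL && i != NULL && *i < n);
    if (strcmp(words[*i], "NOT") != 0) {
        *i += 1;
        return TOK_WORD;
    }

    j = *i + 1;                       /* word after NOT */
    at = word_is(words, n, j, "AT");
    on = word_is(words, n, j, "ON");
    k = j + (size_t)(at || on);       /* word after the optional noiseword */

    if (!on && word_is(words, n, k, "END")) {
        *i = k + 1;
        return TOK_NOT_AT_END;
    }
    if (!at && word_is(words, n, k, "OVERFLOW")) {
        *i = k + 1;
        return TOK_NOT_ON_OVERFLOW;
    }
    if (!at && word_is(words, n, k, "SIZE")
            && word_is(words, n, k + 1, "ERROR")) {
        *i = k + 2;
        return TOK_NOT_SIZE_ERROR;
    }
    if (!at && !on && word_is(words, n, j, "INVALID")) {
        /* KEY is an optional noiseword after INVALID. */
        *i = j + 1 + (size_t)word_is(words, n, j + 1, "KEY");
        return TOK_NOT_INVALID_KEY;
    }

    *i += 1;                          /* plain NOT, e.g. in a condition */
    return TOK_NOT;
}
```

Once the parser sees TOK_NOT_AT_END or TOK_NOT_SIZE_ERROR rather than a
bare NOT, one token of lookahead is enough to decide which statement the
phrase belongs to.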

The statement that COBOL is not LR(1) because of these elements was made
to me by a member of the ANSI X3.23-1985 committee. I don't have details
on-line; maybe I'll be able to follow up with these later.

There are solutions to these problems; the task has been accomplished
several times. I've done it myself.

My point was that the task is quite a bit more complex than just sitting
down and hacking out a lex & yacc script as you might expect, and as the
comp.compilers monthly message used to say. Most of this is because the
basic structures of the language date from before 1960, i.e. before
compiler theory really got going, and the subsequent revisions to the
standard have not really addressed compiler-theoretic issues.

-- 
Esmond Pitt, Computer Power Group
ejp@bohra.cpg.oz
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.