Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!unido!mikros!mwtech!martin
From: martin@mwtech.UUCP (Martin Weitzel)
Newsgroups: comp.unix.wizards
Subject: Re: yacc & lex - cupla questions
Keywords: yacc lex compiler parse
Message-ID: <869@mwtech.UUCP>
Date: 30 Jul 90 13:13:41 GMT
References: <1990Jul26.175831.1216@uicbert.eecs.uic.edu> <2481@onion.reading.ac.uk>
Reply-To: martin@mwtech.UUCP (Martin Weitzel)
Distribution: comp
Organization: MIKROS Systemware, Darmstadt/W-Germany
Lines: 204

In article <2481@onion.reading.ac.uk> ac1@rosemary.cs.reading.ac.uk (Andrew Cunningham) writes:

[Q+A for some problems with lex and yacc; refer to previous articles
in this thread for more details]

[reading from other source than `yyin']

>[You can] #define yyinput to be
>something which returns the character from your file.  Then, when
>lex.yy.c is compiled, instead of calling the yyinput function your
>#define is called instead.  E.g.
>
>#define yyinput my_yyinput
>int my_yyinput()
>       {
>       /* get the character you want and return it */
>       }
>
>You'll also have to redefine yyunput(c) if you want to do this.

From this and one more article in this thread I conclude that there's
a widespread misconception about how things work together.  Maybe the
above works with some versions of lex out there, but from looking at
the details of the generated lex-source (lex.yy.c) on several systems
(XENIX derived from SysIII; AT&T UNIX SysV; ISC 386/ix derived from
SysV), I see that the above CAN NOT WORK as desired.

Here are some details on how the individual routines and functions
call each other:

        main (from lex-library or own)
          |
          V
        yylex ------------------------+
          |                           |
          V                           V
        input, unput  <---+      yyless, yyreject    (in the lex-library)
        (macros)          |           |
                          |           V
                          +----- yyunput, yyinput    (in lex.yy.c)

What we should note first is: when the next character is needed in
yylex, input (NOT yyinput!) is called.  Normally, input is #defined as
a macro, but you can re-#define it, or #undef it and make a function
with this name visible when you compile and link lex.yy.c.

There is another macro, unput, that must properly undo the actions of
input, though unput is only called if your regular expressions require
look-ahead.  (If you are not *very* experienced with regular
expressions, assume that there will *always* be look-ahead.)

So if we want to change things here, we must find the right place for
our re-definition.  That is, we must write it somewhere into the
".l"-file (with the lex-source), so that it appears *after* the #define
that is automatically generated by lex, but *before* the first use of
input/unput.  As the order in which the parts of your ".l"-file appear
in lex.yy.c changed with the evolution of lex, you should check for the
right place when you try this for the first time!  The "safest" (i.e.
most portable) place I've found is right at the beginning of the second
part of the ".l"-file, immediately before the first regular expression.

file.l
------------------------------------------------
        first part
%%
%{
#undef input    /* ANSI-C requires that, though */
#undef unput    /* other compilers may do without */
#define input .... whatever ....
#define unput ... as you like ..
%}
first-regex     { ... action ... }
second-regex    { .... action ... }
        ... etc. ......
%%
        third part
------------------------------------------------------------

Now for the tricky part: as you see from the above, there are some
routines in the lex-library which sometimes need to input or unput
characters.  These routines *must* use exactly the re-defined versions
of input and unput.  How can these routines "link" to something that
is defined as a macro?
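Before answering that question, here is a minimal sketch of what the
"whatever" in the re-definition shown above might look like if the
scanner is to read from a string in memory instead of yyin.  All names
(src_text, src_len, src_pos, set_source, my_input, my_unput) are made up
for illustration and are not part of lex.  The sketch assumes that input
returns 0 at the end of the text, and it counts a read of the end marker
like any other read, so that an unput of the end marker, should the
generated scanner issue one, stays balanced.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state: the text being scanned and the read position. */
static const char *src_text = "";
static size_t src_len = 0;
static size_t src_pos = 0;

/* Point the scanner at a new NUL-terminated string. */
void set_source(const char *text)
{
    assert(text != NULL);
    src_text = text;
    src_len = strlen(text);
    src_pos = 0;
}

/* Stand-in for input: return the next character, or 0 at end of text.
   The position advances even at the end, so every read can be undone. */
int my_input(void)
{
    size_t i = src_pos++;
    return i < src_len ? (unsigned char)src_text[i] : 0;
}

/* Stand-in for unput: undo the most recent my_input.  Limitation of
   this sketch: the argument is ignored, so pushing back a character
   other than the one just read is not supported. */
void my_unput(int c)
{
    (void)c;
    if (src_pos > 0)
        src_pos--;
}
```

With these functions compiled into the program, the two #define lines
in the ".l"-file could then read:

        #define input() my_input()
        #define unput(c) my_unput(c)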
The solution can again be seen if we carefully study lex.yy.c, the
source generated by lex.  At the end of this source we find the
functions yyinput and yyunput (note the yy-prefix now!), which do no
more and no less than call input and unput.  As the two functions are
compiled where our macro-definitions are visible, they are the "stubs"
through which the functions in the lex-library access our macros.

Again: look at the above scheme showing the calling hierarchy and try
to understand the dependencies.  If necessary, study lex.yy.c a bit
further.  THEN you might consider writing your own input/unput macros!

>
>>---------------------------------------------------------------------

[managing yacc projects with make]

>You'll need to specify an explicit rule to do this.  Or, at the
>expense of some processor time you might want to run:
>
>y.tab.h: yacc.y
>       yacc yacc.y
             ^-- insert "-d"-switch
>       rm y.tab.c
>
>(This shouldn't take too long, yacc is *fast* compared with the cc stage)

First, the above is good advice in so far as it generally doesn't hurt
to run yacc only for the purpose of generating y.tab.h.  For that to
work the "-d" switch must be specified, but IMHO its absence is simply
a typo here.

What I have to criticize is that y.tab.h (as well as y.tab.c) is more
some kind of "workfile" (IMHO at least) and should be renamed to
something else.  So we get:

yacc.h: yacc.y
        yacc -d yacc.y
        mv y.tab.h yacc.h
        rm -f y.tab.c
           ^^----------- add this for portability, as on some older
                         systems the exit status of rm is not set
                         cleanly otherwise (`make' may complain).

(BTW: I'm not quite happy with file names like yacc.y, yacc.h etc. in
the presence of a command called yacc in the same lines here, but I
didn't want to change the original example too much.)

A fine point that I already mentioned in an earlier posting follows:
typical changes in yacc.y will change y.tab.c, but not y.tab.h (yacc.h
in the above example).
The latter changes only if new tokens or new types for the value stack
are introduced, which happens far less frequently than changes to the
actions of the grammar rules.  So it is advisable to extend the above
further to:

yacc.h: yacc.y
        yacc -d yacc.y
        test -f yacc.h && cmp -s y.tab.h yacc.h || mv y.tab.h yacc.h
        rm -f y.tab.c

Here the mv is only done if yacc.h doesn't exist or is different from
y.tab.h.

>>---------------------------------------------------------------------

[making yytext available in grammar actions]

>
>line3:
>       A {atext=strdup(yytext);}
>       B {btext=strdup(yytext);}
>       C {ctext=strdup(yytxet);}
>
>Note: if you're grammar is more comlex than this you can lead to
>all sorts of comflicts in the compiler - when the parser executes an
>action it is `committed' to that branch of the parse tree and cannot
>backtrack to resolve any ambiguity that might occur (the classic
>problem here is if ... then ... else in programming languages).

Again the poster says something very true here ... but forgets to
mention something *much* more important:

Never, again NEVER, again ***NEVER*** depend on unchanged contents of
yytext in the actions of yyparse(%):

In yyparse the calls to yylex, which in turn change the contents of
yytext, are slightly "asynchronous", i.e. there might be a read-ahead
of one token, and then yytext doesn't contain what you think!  (Note:
there's not ALWAYS a read-ahead; it depends on whether yyparse needs
one to decide what to do next!)  The only place where yytext is valid
is in the action-block following the regular expression in the
lex-source.

%: Small note to Chris Torek, who some time ago gave a similar
recommendation in one of his postings: you and a few others who
understand the LALR(1) parsing algorithm used by yyparse, and hence
can decide under which circumstances read-ahead will occur, are
explicitly exempt from the above "never"-rule :-)

>
>Hope this information helps.
>
>AndyC

Hope these corrections avoid frustration.

P.S. to AndyC: I didn't intend to make your recommendations look bad.
Topics like lex and yacc are really not well covered by the docs, or
at least you have to look very hard to get to the information you
need.  Stay tuned ...
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83