Path: utzoo!attcan!uunet!mcsun!unido!mikros!mwtech!martin
From: martin@mwtech.UUCP (Martin Weitzel)
Newsgroups: comp.unix.wizards
Subject: Re: yacc & lex - cupla questions
Keywords: yacc lex compiler parse
Message-ID: <861@mwtech.UUCP>
Date: 27 Jul 90 10:05:47 GMT
References: <1990Jul26.175831.1216@uicbert.eecs.uic.edu>
Reply-To: martin@mwtech.UUCP (Martin Weitzel)
Distribution: comp
Organization: MIKROS Systemware, Darmstadt/W-Germany
Lines: 204

In article <1990Jul26.175831.1216@uicbert.eecs.uic.edu> woodward@uicbert.eecs.uic.edu writes:
>
>i have been trying to parse a straightforward stream of bytes using the
>c-preprocessors lex & yacc.  being a new user of these utilities, i have
>a couple of problems for which i'd like to solicit your suggestions:

Since the "standard docs" for lex + yacc are very terse (not to say
incomplete in many places), I'll make this a followup rather than an
emailed answer. Now, let's see where the problems are ...

>---------------------------------------------------------------------
>1.) how does one redefine the i/o in a yacc/lex piece of code?  i.e.
>the code which is generated defaults to stdin and stdout for input and
>output, respectively.  i'd like to redefine these defaults w/o having
>to hack on the intermediate c-code, since this is a live production
>project; i'd like to be able to update and modify the program simply by
>saying "make".

The "calling" tree in a lex+yacc application, when it comes to reading
input and you do not change anything, is normally:

	(main or whatever)
	   ---> yyparse
	      ---> yylex
	         ---> input[Macro]
	            ---> getc(yyin)	[yyin defaults to stdin]

If you want to read from some source other than stdin, there are
several points where you can change something. (In the very simplest
case you could even change nothing and use the input redirection of
UNIX.)

.sidenote on
Though there are often good reasons, I sometimes wonder why a program
cares about file arguments at all instead of using stdin only.
I once found it annoying that there were programs like "tr" which don't
handle file arguments in the unix style ... until I learned that it's so
easy to put such programs in a shell wrapper like "cat $* | tr .....".
.sidenote off

If your lex-generated program has to read from some source other than
stdin, just fopen the file and assign the returned FILE-pointer to yyin.
The latter is a global symbol in the object file which results from
compiling what lex generated. If you prefer separate compilation, you
must declare it as "extern FILE *yyin;" in the module where you assign
to it. The right place for this will probably be the one where you call
yyparse in the above example. Note that the standard main program from
the yacc-library is not linked if you supply your own, so you can play
any games you like before calling yyparse.

The next step up in complexity is to change the input-macro of yylex.
This is sometimes useful, but I would not recommend doing so until you
have gathered a bit of experience with lex and understand the
implications (but I'm willing to answer questions on this by email).

Finally, you can consider avoiding lex altogether and rolling your own
version of yylex. If you have only the "ancient" lex which is supplied
with most unix systems (as opposed to the rewrite "flex", which AFAIK is
freely available), it could possibly be an advantage to do so, since
lex-generated programs are known to be less efficient than hand-written
scanners. (I have no exact metrics for that, and the comparisons that
have been made are often based on trivial scanners, which are easily
written by hand. In any case I would recommend using lex during
development as a prototyping tool.)

For the redirection of output I see no problem at all, since this is
fully under the control of the C-program fragments you write in the
actions of your lex+yacc source.

>---------------------------------------------------------------------
>2.) how can one get the automagically-defined #defines, which can
>normally be created from yacc with the -d flag, to come out when you
>use a makefile?  i.e. suppose i have lex.l and yacc.y lex and yacc
>source files, respectively, and i have object files defined in my makefile
>called lex.o and yacc.o such that "make" follows default rules to create
>these from the aforementioned source files.

If you use lex + yacc with the Unix tool make, you can add your own
explicit dependencies, or change the default rules and add your own
commands there. There is no "catch all" method for this - several
variations exist, all with their specific advantages and drawbacks -
but if you know make or are willing to learn about it, you can
determine the dependencies between the files generated by lex + yacc
in the same way as if they were ordinary sources. (BTW: I found the
book in the O'Reilly Nutshell Series, "Managing Projects with Make",
excellent for learning about make, though for the basics "The Unix
Programming Environment" by Kernighan + Pike [K+P] is sufficient. The
latter is also worth recommending because of its treatment of lex +
yacc.)

One thing is worth mentioning here (you will also find this in K+P):
During development it's far more likely that the actions in your
grammar will change than that you add new tokens or change the type of
the value stack. Hence, when running yacc, the contents of y.tab.c will
often change, but y.tab.h will stay the same. Since both are generated
in one run (yacc -d), and some other targets may depend on y.tab.h, you
will often have unnecessary compiles caused by this scheme. (BTW: This
is a mistake in the design of yacc. A better choice would have been to
let yacc -d create *only* y.tab.h. If GNU's replacement for yacc,
bison, hasn't already done this, it should add an option switch for
that purpose. This would ease "clean" integration into make-managed
projects.)

K+P has a solution for this.
Mine is basically the same, just in another package: Write two
shell-wrappers (or one with an option) for yacc which generate y.tab.c
and y.tab.h separately. For any grammar in a file "grammar.y", these
wrappers should generate the corresponding "grammar.c" and "grammar.h"
files. Since yacc writes its output into y.tab.c and y.tab.h, the
wrappers must rename these files, and before doing so for y.tab.h, this
file should be compared (e.g. with cmp(1) or diff(1)) to grammar.h (if
one already exists). Leaving the existing one in place if nothing has
changed keeps its modification time, which is what make looks at, and
so avoids the unnecessary re-compiles of other modules.

>
>---------------------------------------------------------------------
>3.) if i have a yacc construct such as:
>
>line3	:	A B C
>		{ yacc action sequence }
>
>
>which indicates that the construct line3 is composed of the 3 tokens
>A B and C, in that order ...
>
>how can i now assign the values of A, B, and C into local vars of my
>choice?  the problem lies in the fact that each of A B and C represent
>three calls to lex, and if i pass back a pointer to yytext[] from lex,
>i only retain the value of the last token in the sequence, in this case C,
>when i get to the action sequence in my yacc code.  what if i want to
>be able to select the EXACT ascii tokens for each of A B and C above in
>my yacc code.  how do i do that?

Yes, that's a frequently asked one. Transferring strings from yylex to
yyparse (or rather, to the action which has the relevant tokens on the
RHS of its grammar rule) must be done with care: Using pointers to
yytext is not feasible here, because yytext is overwritten with every
new token - you must copy the contents to a safe place. For that
purpose you could malloc some space in the action of yylex (not
yyparse!!) which recognizes the token (see the example following
below). Your C-standard-library may also contain strdup, which does
malloc and strcpy all in one, but it's not difficult to do without.
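
Doing without strdup might look roughly like the sketch below. The name
save_string is made up for this illustration; it is not part of lex,
yacc, or the C library. It takes the length as an argument so that a
lex action can pass yytext and yyleng directly.

```c
#include <stdlib.h>
#include <string.h>

/*
 * save_string: hypothetical helper (the name is an assumption, not a
 * lex or yacc facility). Returns a freshly malloc-ed, NUL-terminated
 * copy of the first len bytes of text, or NULL if malloc fails.
 * The caller owns the result and must free() it eventually.
 */
char *save_string(const char *text, size_t len)
{
    char *copy = malloc(len + 1);   /* +1 for the terminating NUL */

    if (copy == NULL)
        return NULL;                /* let the caller decide how to die */
    memcpy(copy, text, len);
    copy[len] = '\0';
    return copy;
}
```

In a lex action this would be used along the lines of
"yylval.str = save_string(yytext, yyleng);", followed by a check for a
NULL result before returning the token.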
Of course you must be careful here:

- malloc may return a NULL-pointer because of memory limits
- you must not forget to allocate space for the terminating NUL-byte;
  malloc(yyleng + 1) is the right thing!
- you must carefully plan for de-allocation, if your program should
  not run out of memory when it analyzes some large input

If you transfer pointers to the malloc-ed space via the value stack,
the last chance for free-ing is before the stack entries are discarded.
So, if you don't copy the pointers which correspond to A, B, and C in
the above example somewhere else, your last chance is in the grammar
action.

A short code excerpt should help to understand what is required:

lex-source -------------------------------------------------------
	........
%%
regex-for-token-A	{
		yylval.str = malloc(yyleng + 1);
		if (yylval.str == (char *)0) {
			/* scream and shout and die a horrible death */
		}
		strcpy(yylval.str, yytext);
		return(A);
	}
	......... etc, same for token B and C
-------------------------------------------------------------------

yacc-source ------------------------------------------------------
	......
%union {
	.....
	char *str;
	.....
}
	......
%token <str> A B C
	......
%%
	......
line3	: A B C {
		/*
		 * $1, $2, and $3 are now pointers to "safe" copies of
		 * the original tokens, but if you don't copy these
		 * pointers to variables that will SURVIVE THIS BLOCK,
		 * you must clean up before this action ends:
		 */
		free($1); free($2); free($3);
		/*
		 * Be especially careful if you create multiple
		 * references to the malloc-ed space or if you pass one
		 * of these further out, say: $$ = $1. In this case you
		 * must of course *not* free $1 here; instead, the
		 * action(s) of the rule(s) where the non-terminal
		 * "line3" appears on the RHS are now responsible for
		 * doing so.
		 */
	}
	......
-------------------------------------------------------------------

>
>any comments or suggestions would be most heartily appreciated.

Enough? Good, lex+yacc lesson ends for today :-).
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83