Path: utzoo!attcan!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!wuarchive!brutus.cs.uiuc.edu!ux1.cso.uiuc.edu!midway!mimsy!chris From: chris@mimsy.umd.edu (Chris Torek) Newsgroups: comp.lang.c Subject: Re: Using Lex (and Yacc) on a string. Message-ID: <25996@mimsy.umd.edu> Date: 12 Aug 90 22:09:16 GMT References: <1990Aug10.012927.5558@basho.uucp> <174@ittc.wec.com> Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742 Lines: 94 (This topic probably belongs elsewhere; perhaps comp.lang.misc or comp.unix.questions.... Ah well.) In article <174@ittc.wec.com> ptb@ittc.wec.com (Pat Broderick) writes: >The things needed might look something like: ># define input() (((yytchar=yysptr>yysbuf?U(*--yysptr):getc(yyin))==10? (yylineno++,yytchar):yytchar)==EOF?0:yytchar) /* standard defn from lex */ ># define input() (((yytchar=yysptr>yysbuf?U(*--yysptr):(*yynyy++))==10? (yylineno++,yytchar):yytchar)==EOF?0:yytchar) ^^^^^^^^^^ >This works fine for us, hope it helps. This should work, but is overkill. (It also does not address a question whose answer I myself am unsure about.) Here is what is going on: Lex (or Flex or other similar lexer of your choice) implements a DFSA (Deterministic Finite State Automaton) that is simply an optimized (in space if not time) variant on int table[NSTATES][128] = { huge amounts of junk }; yylex() { int state = 0; /* ignoring BEGINs that is */ yyleng = 0; for (;;) { c = input(); /* eat the next char */ yytext[yyleng++] = c; /* store it in yytext */ state = table[state][c];/* find the next state */ switch (state) { /* and see what to do */ ... cases that exactly match something ... do actions from C code; ... cases indicating we ate too much ... unput(some of the things we ate); do actions from C code; ... cases indicating `no match' ... output(the things we ate); break; } } } One noteworthy thing about this is that lex can never unput() something it has not input() `from the same place' (unless you put you own unput() actions into your lexer: a dangerous practise). Thus, if you are reading from a string in a buffer, your `unput' action can be much simpler, and likewise your input() macro can be simplified: #define input() ((yytchar = *mystring++) == '\n' ? (yylineno++, yytchar) : \ yytchar) #define unput(c) (mystring--) These two also take advantage of the fact that Lex wants `EOF' to be the value 0, rather than the (implementation-defined but usually) -1 that stdio returns. The end of a C string is the character '\0' which has the value 0. The question I have is whether lex might call input() again after reading EOF once. Since the end of a real file tends to remain the end of the file no matter how many times it is read per second, it seems possible that the implementation might invoke input() again after input() returns 0 but without an intervening unput(0)---i.e., it may depend on EOF being `sticky'. In this case the input macro must be more careful: #define input() ((yytchar = *mystr++) == 0 ? (mystr--, 0) : \ (yytchar == '\n' ? (yylineno++, '\n') : yytchar)) or as a GCC inline function: static inline input() { int c = *mystr++; if (c == 0) mystr--; else if (c == '\n') yylineno++; return c; } This requires a corresponding change to unput(), however, since now unput(0) should do nothing: #define unput(c) ((c) ? mystr-- : 0) All of these eliminate the need for yysbuf (an array of size YYLMAX that holds unput() characters since ungetc() only guarantees one character of pushback). Lex could be considerably more efficient by avoiding all this copying of text from one place to another; I believe flex does this. This usually means bypassing stdio, of course.... -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@cs.umd.edu Path: uunet!mimsy!chris (New campus phone system, active sometime soon: +1 301 405 2750)