Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site utcsstat.UUCP
Path: utzoo!utcsstat!ian
From: ian@utcsstat.UUCP (Ian F. Darwin)
Newsgroups: net.unix-wizards
Subject: Comment recognition in Lex, again
Message-ID: <1876@utcsstat.UUCP>
Date: Fri, 4-May-84 23:24:49 EDT
Article-I.D.: utcsstat.1876
Posted: Fri May 4 23:24:49 1984
Date-Received: Fri, 4-May-84 23:45:48 EDT
References: <245@uwvax.ARPA>
Organization: Univ of Toronto (UTCS)
Lines: 109

	From: anderson@uwvax.ARPA

	I have received several replies to my request for a lex
	expression to recognize /* ... */ comments. The only one that
	works (sent in by Jim Hogue) is

		"/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/"

	which I can't claim to fully understand. Nor do I understand
	why my original,

		"/*"([^*]|("*"/[^/]))*"*/"

	doesn't work. The idea is that each character in the string
	between /* and */ can either be something other than *, or *
	followed by something other than /.

	Can anyone come up with an expression simpler than Hogue's
	that works? By "works", I mean put it in a real "lex" program,
	as in:

		(your expr)	printf("recognized (%s)\n",yytext);

	and try it on inputs such as /***/, /*/*/, etc.

	-- David Anderson (uwvax!anderson)

It's not clear what the goal of this exercise actually is. On the
assumption that you are trying to build part of a compiler, here is a
simple, *readable* code fragment which inputs a C program and eats all
the comments, which is what you do in a real compiler. I have no wish
to strain my eyes on Jim Hogue's excellent APL program (nor the
original, for that matter), so I wrote this. It's actually more
complex than it needs to be; this is more a comment on the
simple-mindedness of lex than on my coding style (and no, I have
nothing better to offer than lex; flames to /dev/null given that I do
acknowledge lex's authors' accomplishments).

Our approach is to use the lex ``start condition''.
There can be many start conditions defined, but only one of them can
be active at any time, and you can neither turn off a start condition
(you can only say BEGIN 0, which returns to ``the normal state''), nor
use rule 0 in the <...> prefix. With that in mind, here is lexcom.l:

	%START INCOM NOTIN
	%%
	<INCOM>"*/"	{BEGIN NOTIN;}
	<INCOM>"/*"	unput('*');
	<INCOM>.	;
	<NOTIN>"/*"	{BEGIN INCOM;}
	"/*"		BEGIN INCOM;
	.		ECHO;
	%%

To understand this, read section 10 of the Lex manual carefully. If
you think you have a better (shorter) way, **try it before you post
it** because the most obvious optimisations do not work! Geoff Collyer
and I spent some time interpreting the manual and making the minimum
test case that would work *correctly*.

While our version is not as short as Jim Hogue's submission,

	%%
	"/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/"	;
	.	ECHO;
	%%

(I've made it into a full program), it has the following advantages:

1) it handles multi-line comments correctly.

2) it does not overflow lex's input buffer (see the admonition in
   section 5 of the Lex manual about ``Don't try to defeat this with
   expressions like ... ... or equivalents; the Lex-generated program
   will try to read the entire input file, causing internal buffer
   overflows.''). This is what the APL version does. Nor does it
   suffice to simply ``expand Lex's buffers''. This will not work on
   some machines and certainly not on binary-only systems (they do
   exist!).

Here are the test cases used. Our program passes both. 'To pass' means
to do the same thing that CPP does. (CPP is the zeroth pass of the C
compiler, the preprocessor, which normally eats the comments from live
C programs when they are being compiled.) I diffed the output as
follows:

	cc -P t.c		# produces t.i
	lexcom <t.c >t.i2
	diff t.i t.i2

(where t.c contains the first test case) and got no differences. The
second test case produces nothing but null lines (cpp and our program)
and a core dump (the one line program).
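
For readers without lex at hand, the two-state idea behind lexcom.l
can be sketched by hand in C. This is only an illustration under
stated assumptions, not the program from this article: the function
name strip_comments and its string-in, string-out interface are
hypothetical, and the sketch works on an in-memory string rather than
on standard input. Like the lex version, it keeps newlines that fall
inside a comment and gives string and character literals no special
treatment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* strcmp, handy when checking results by hand */

/* The two scanner states, named after the start conditions in lexcom.l. */
enum state { NOTIN, INCOM };

/*
 * Hypothetical helper (not from the original article): copy `in` to
 * `out` with C comments removed. `out` must have room for
 * strlen(in) + 1 bytes. Returns the number of bytes written, not
 * counting the terminating NUL. Newlines inside comments are kept, so
 * the output has the same number of lines as the input.
 */
size_t strip_comments(const char *in, char *out)
{
    enum state st = NOTIN;
    size_t n = 0;

    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        if (st == NOTIN) {
            if (in[0] == '/' && in[1] == '*') {
                st = INCOM;          /* comment opener: swallow both chars */
                in += 2;
            } else {
                out[n++] = *in++;    /* ordinary text: echo it */
            }
        } else {
            if (in[0] == '*' && in[1] == '/') {
                st = NOTIN;          /* comment closer: swallow both chars */
                in += 2;
            } else {
                if (*in == '\n')
                    out[n++] = '\n'; /* keep line structure intact */
                in++;                /* everything else is discarded */
            }
        }
    }
    out[n] = '\0';
    return n;
}
```

Given a buffer of sufficient size, calling strip_comments on the
string "/*/*/int j=3; /* init an int */" leaves "int j=3; " in the
output buffer, and an unterminated comment simply swallows the rest of
the input apart from its newlines.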

1) A simple C fragment:

	/************************************/
	/* this is a test program */
	/**/int i=2;	/* initialise an int */
	/*/*/int j=3;	/* init an int */
	/* bletch end of file in the middle of a comment - a good test.

2) a longer fragment, but still valid C:

	(echo '/*'; cat /etc/passwd; echo '*/') | a.out

   (assuming that */ does not appear in /etc/passwd at any given
   moment; ours doesn't). Our program produced many newlines, as did
   CPP. The one-liner dumped core!

Is there a moral in all this? I think it's that a program that is a
few lines longer, but is readable and conforms to the manual, is in
fact a better program.

Ian Darwin, Toronto Canada	{ihnp4|decvax}!utcsstat!ian
-- 
Ian Darwin, Toronto	uucp: utcsstat!ian	Arpa: decvax!utcsstat!ian@Berkeley