Path: utzoo!mnetor!uunet!husc6!cmcl2!arizona!mike
From: mike@arizona.edu (Mike Coffin)
Newsgroups: comp.lang.c
Subject: Re: LEX
Message-ID: <3703@megaron.arizona.edu>
Date: 3 Feb 88 18:21:46 GMT
References: <260@nyit.UUCP>
Organization: U of Arizona CS Dept, Tucson
Lines: 41
Keywords: LEX, comments, C, regular expression
Summary: beware of buffer overflow

In article <260@nyit.UUCP>, michael@nyit.UUCP (Michael Gwilliam) writes:
> I'm writing a C-like language to describe data structures. When I
> was writing the tokenizer using LEX, I got intrigued by a little
> problem. Is it possible to write a regular expression that will
> transform a /* comment */ into nothing?

I tried to mail this, but the mailer couldn't find you:

You can probably write a single regular expression to recognize C
comments, but it would be a bad idea. In general, comments can be
long. Lex, being a tokenizer, is not designed to recognize things
bigger than its internal buffer size, which is only a few thousand
characters. When presented with a token longer than that, a
lex-generated scanner overflows the buffer and dumps core.

Two possible solutions:

1) Upon recognizing "/*", call a C routine to eat the rest of the
   comment. Inside the routine, use the Lex macro "input()" to get
   characters.

2) Use Lex "start conditions". These allow you to specify several
   different tokenizers and switch between them explicitly.
   Untested code:

	<N>"/*"		{BEGIN BC;}
	<BC>[^*\n]*	;
	<BC>"*"		;
	<BC>"\n"	{lineno++;}
	<BC>"*/"	{BEGIN N;}

   Start condition <N> is the "normal" start condition, while <BC>
   is the "block-comment" condition. This is much safer than
   recognizing entire comments as one token; to overflow the buffer,
   a single line of a comment would have to be longer than the buffer.

-- 
Mike Coffin                             mike@arizona.edu
Univ. of Ariz. Dept. of Comp. Sci.      {allegra,cmcl2,ihnp4}!arizona!mike
Tucson, AZ  85721                       (602)621-4252
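
Solution 1 above (a C routine that eats the rest of the comment) might look roughly like the sketch below. It is untested against a real lex and makes some assumptions: the names skip_comment, src and lineno are hypothetical, and the input() defined here is only a stand-in that reads from a string so the routine can be tried on its own. In a real scanner you would delete the stand-in and let lex supply input(); check what your lex returns at end of file (0 in classic lex, which is what this sketch assumes).

```c
/* Hypothetical stand-in for lex's input(): returns the next character
   of a string, or 0 once the string is exhausted. Remove this (and src)
   when the routine lives inside a real lex specification. */
static const char *src = "";
static int lineno = 1;

static int input(void)
{
    return *src ? (unsigned char)*src++ : 0;
}

/* Call this after the scanner has matched the comment opener.
   It consumes characters up to and including the comment closer,
   one character at a time, so nothing accumulates in the token buffer.
   Returns 0 on success, -1 if the input ended inside the comment. */
static int skip_comment(void)
{
    int c;
    int prev = 0;   /* previous character; 0 means "none yet" */

    while ((c = input()) != 0) {
        if (c == '\n')
            lineno++;           /* keep line numbers right for diagnostics */
        if (prev == '*' && c == '/')
            return 0;           /* saw the closing star-slash pair */
        prev = c;
    }
    return -1;                  /* unterminated comment */
}
```

In the rules section, the action for the opener would then be something like `"/*" { skip_comment(); }`, probably with an error message when it returns -1. Because the only token lex itself sees is the two-character opener, the length of the comment no longer matters.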