Path: utzoo!mnetor!uunet!husc6!cmcl2!arizona!mike
From: mike@arizona.edu (Mike Coffin)
Newsgroups: comp.lang.c
Subject: Re: LEX
Message-ID: <3703@megaron.arizona.edu>
Date: 3 Feb 88 18:21:46 GMT
References: <260@nyit.UUCP>
Organization: U of Arizona CS Dept, Tucson
Lines: 41
Keywords: LEX, comments, C, regular expression
Summary: beware of buffer overflow

In article <260@nyit.UUCP>, michael@nyit.UUCP (Michael Gwilliam) writes:

> I'm writing a C-like language to describe data structures.  While
> writing the tokenizer using LEX, I got intrigued by a little
> problem.  Is it possible to write a regular expression that will
> transform a /* comment */ into nothing?

I tried to mail this, but the mailer couldn't find you:

You can probably write a single regular expression to recognize C
comments, but it would be a bad idea.  In general, comments can be
long.  Lex, being a tokenizer, is not designed to recognize tokens
bigger than its internal buffer, which holds only a few thousand
characters.  When presented with a longer token, the scanner overruns
the buffer and dumps core.  Two possible solutions:

1) Upon recognizing "/*", call a C routine to eat the rest of the
comment.  Inside the routine, use the Lex macro "input()" to get
characters.

2) Use Lex "start conditions".  These allow you to specify several
different tokenizers and switch between them explicitly.  Untested
code:

<N>"/*"			{BEGIN BC;}
<BC>[^*\n]+		;
<BC>"*"			;
<BC>"\n"		{lineno++;}
<BC>"*/"		{BEGIN N;}

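Start condition <N> is the "normal" start condition, while <BC> is the
"block-comment" condition.  Both have to be declared in the definitions
section (%Start N BC), and the scanner must execute BEGIN N once before
the first token, since it does not start out in <N>.  This is much
safer than recognizing entire comments as one token; to overflow the
buffer, a single run of text containing neither "*" nor a newline would
have to be longer than the buffer.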
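
For solution 1, here is a rough sketch of the comment-eating routine.
The names skip_comment and src are made up for illustration, and the
input() shown here is only a stand-in that reads from a string so the
fragment compiles on its own; inside a real scanner you would delete
the stand-in and let the routine call lex's own input().  I'm assuming
input() returns 0 at end of input, as traditional lex does; check what
yours returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for lex's input(): reads from a string so the
 * sketch can be compiled outside a lex-generated scanner.  Returns 0
 * at end of input (an assumption; check your lex). */
static const char *src = NULL;  /* illustrative input buffer */
static int lineno = 1;          /* same counter the lex rules bump */

static int input(void)
{
    assert(src != NULL);
    return *src ? (unsigned char)*src++ : 0;
}

/* Call this right after the scanner has matched the opening slash-star.
 * Eats characters up to and including the closing star-slash.
 * Returns 0 if the comment was closed, -1 if input ran out first. */
static int skip_comment(void)
{
    int c;
    int prev = 0;   /* previous character; 0 so a leading '/' can't close */

    while ((c = input()) != 0) {
        if (c == '\n')
            lineno++;               /* keep line numbers right */
        if (prev == '*' && c == '/')
            return 0;               /* found the end of the comment */
        prev = c;
    }
    return -1;                      /* unterminated comment */
}
```

The rule in the lex source would then be something like
"/*" {skip_comment();}, perhaps with a complaint if it returns -1.
Since no token is being accumulated, the length of the comment doesn't
matter.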


-- 

Mike Coffin				mike@arizona.edu
Univ. of Ariz. Dept. of Comp. Sci.	{allegra,cmcl2,ihnp4}!arizona!mike
Tucson, AZ  85721			(602)621-4252