Path: utzoo!attcan!uunet!lll-winken!ncis.llnl.gov!helios.ee.lbl.gov!ucsd!chem.ucsd.edu!tps
From: tps@chem.ucsd.edu (Tom Stockfisch)
Newsgroups: comp.lang.c
Subject: Re: not the way ... (was Re: Want a way to strip comments from a)
Message-ID: <425@chem.ucsd.EDU>
Date: 23 Mar 89 06:12:38 GMT
References: <7150@siemens.UUCP> <9900010@bradley> <4221@omepd.UUCP>
Reply-To: tps@chem.ucsd.edu (Tom Stockfisch)
Organization: Chemistry Dept, UC San Diego
Lines: 70

In article <4221@omepd.UUCP> merlyn@intelob.intel.com (Randal L. Schwartz @ Stonehenge) writes:
>| >Does anyone have a sed or awk script which we
>| > can use to preprocess the C source and get rid of all the comments
>| The following works in vi: :%s/\/\*.*\*\///g
>Nope.  Just try it on the line:
>	foo; bar; /* comment1 */ bletch; /* comment2 */
>'bletch;' disappears with the comments.
>The regexp that matches comments looks like (in egrep/lex notation):
>	[/][*]([*]*[^*/])*[*]+[/]
>Didn't we just go through this about nine months ago? :-)
>(And didn't I give the wrong answer at least twice? :-) :-)

You still don't have it right, I'm afraid.  This pattern won't work on

	/ /* / */

It is unbelievable how hard this task is in regular expressions, when
it is trivial to code by hand.

To convince yourself that a pattern is correct, I think you have to
show two things:

1.  That the body between the "/*" and "*/" cannot possibly contain a "*/",

2.  That the body can contain any other sequence of characters.

Various other patterns which have been posted (including ones by famous
net gurus) have failed to handle the following inputs correctly:

1.	/*****//hello world */
2.	/* hello /* /* world */
3.	/* */ hello /* */
4.	/**// /* this input should produce "/ \n" for output */
5.	/* */ hello */

So what works?  I haven't been able to break the pattern below, which also
correctly ignores comment delimiters inside strings and character constants.
If you want a practical program, use start states and don't match an
entire comment with one pattern -- you won't be in danger of overflowing
yytext[].  If you want to see how it's done with regular expressions,
study the following.

/* lex program that strips comments */

okslash		([^*/]"/"+)

%%
"/*""/"*([^/]|{okslash})*"*/"	;
\"((\\(.|\n))|[^\\"])*\"	ECHO;
\'((\\(.|\n))|[^\\'])*\'	ECHO;
.|\n	ECHO;

-- 

|| Tom Stockfisch, UCSD Chemistry	tps@chem.ucsd.edu
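
For comparison with the lex version, the remark that this task is
trivial to code by hand can be illustrated with a small state machine.
What follows is only a sketch in C, not code from this post: the name
strip_comments and the state names are made up for illustration.  It
assumes 1989-style C (only /* */ comments, no trigraph or
backslash-newline handling), and on malformed input such as an
unterminated comment or string it may not behave the way the lex
program does.

```c
#include <stddef.h>

/* Lexical states of the hand-written scanner (names are illustrative). */
enum state { CODE, SLASH, COMMENT, STAR, STR, STR_ESC, CHR, CHR_ESC };

/*
 * Copy src to dst with comments removed and return the number of
 * characters written.  dst must have room for strlen(src) + 1 bytes.
 * Assumption: an unterminated comment is silently dropped.
 */
size_t strip_comments(const char *src, char *dst)
{
    enum state st = CODE;
    size_t n = 0;

    for (; *src != '\0'; src++) {
        char c = *src;

        switch (st) {
        case CODE:
            if (c == '/') { st = SLASH; break; }    /* hold the '/' back */
            if (c == '"') st = STR;
            else if (c == '\'') st = CHR;
            dst[n++] = c;
            break;
        case SLASH:
            if (c == '*') { st = COMMENT; break; }  /* drop the held '/' */
            dst[n++] = '/';                         /* it was an ordinary '/' */
            if (c == '/') break;                    /* the new '/' is now held */
            st = (c == '"') ? STR : (c == '\'') ? CHR : CODE;
            dst[n++] = c;
            break;
        case COMMENT:
            if (c == '*') st = STAR;
            break;
        case STAR:                                  /* seen '*' inside a comment */
            if (c == '/') st = CODE;
            else if (c != '*') st = COMMENT;
            break;
        case STR:
            dst[n++] = c;
            if (c == '\\') st = STR_ESC;
            else if (c == '"') st = CODE;
            break;
        case STR_ESC:                               /* character after a backslash */
            dst[n++] = c;
            st = STR;
            break;
        case CHR:
            dst[n++] = c;
            if (c == '\\') st = CHR_ESC;
            else if (c == '\'') st = CODE;
            break;
        case CHR_ESC:
            dst[n++] = c;
            st = CHR;
            break;
        }
    }
    if (st == SLASH)
        dst[n++] = '/';     /* input ended right after an ordinary '/' */
    dst[n] = '\0';
    return n;
}
```

Because the scanner never looks at more than one character at a time,
there is no buffer like yytext[] to overflow, however long the comment.
To use it as a filter, read the source into a buffer, call
strip_comments(buf, out) with out at least as large as buf, and write
out to stdout.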