Path: utzoo!attcan!uunet!jarthur!usc!zaphod.mps.ohio-state.edu!rpi!dali.cs.montana.edu!milton!unicorn!ogicse!intelhf!littlei!omepd!inews!iwarp.intel.com!psueea!parsely!agora!markb
From: markb@agora.uucp (Mark Biggar)
Newsgroups: comp.lang.perl
Subject: Tokenizing in Perl
Message-ID: <1990Jul11.024939.25549@agora.uucp>
Date: 11 Jul 90 02:49:39 GMT
References: <1990Jul10.095016.2473@uvaarpa.Virginia.EDU>
Reply-To: markb@.UUCP (Mark Biggar)
Organization: Betazoid Central
Lines: 34

As Larry said, the simplest way to write a tokenizer in Perl is to use
s/^...// to chop a token off the front of your string. With that in mind,
the following is a set of useful regular expressions for this purpose:

m|/\*[^*]*\*+([^/][^*]*\*+)*/|
        matches just the first C-style non-nested comment on the line

/"[^"\\]*(\\.[^"\\]*)*"/
        matches just the first " string on the line, with \ escapes

/("[^"]*")+/
        matches just the first string on the line, Ada style

You can use the following to translate C-style \ escapes in a string
matched by the RE above. NOTE: the order of the alternatives in the RE
below is significant.

s/\\(([0-7]{1,3})|x([\da-fA-F]+)|(.))/&trans($2,$3,$4)/eg

sub trans {
    local($oct,$hex,$single) = @_;
    if ($oct ne '') {
        pack("c",oct($oct));
    }
    elsif ($hex ne '') {
        pack("c",hex($hex));
    }
    else { # singleton case must have matched if the others didn't
        substr($trans,ord($single),1);
        # def of $trans left as exercise for reader :-)
    }
}

--
Perl's Maternal Uncle
Mark Biggar
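
For anyone who wants to see the s/^...// idea end to end, here is a rough sketch in present-day Perl. It is only one possible arrangement, not the poster's own code: the tokenize routine, the token type names, and the extra "word" and "char" rules are assumptions made for illustration. The comment and string patterns are the ones given above, anchored with ^ and wrapped in a capture group.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): split a line into
# [type, text] pairs by repeatedly chopping one token off the front.
sub tokenize {
    my ($s) = @_;
    my @tokens;
    while (length $s) {
        if ($s =~ s/^\s+//) {
            next;                                   # skip whitespace
        }
        elsif ($s =~ s{^(/\*[^*]*\*+([^/][^*]*\*+)*/)}{}) {
            push @tokens, ['comment', $1];          # C-style comment
        }
        elsif ($s =~ s/^("[^"\\]*(\\.[^"\\]*)*")//) {
            push @tokens, ['string', $1];           # "..." with \ escapes
        }
        elsif ($s =~ s/^(\w+)//) {
            push @tokens, ['word', $1];             # assumed extra rule
        }
        else {
            $s =~ s/^(.)//s;                        # fallback: one character
            push @tokens, ['char', $1];
        }
    }
    return @tokens;
}

# Usage: print each token's type and text.
for my $tok (tokenize('x = "a\"b"; /* c **/ y')) {
    print "$tok->[0]\t$tok->[1]\n";
}
```

The order of the branches matters in the same way as the order of the alternatives in the escape RE: the more specific patterns (comment, string) have to be tried before the single-character fallback, or a comment would be chopped up as a "/" followed by a "*".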