Path: utzoo!attcan!uunet!jarthur!usc!zaphod.mps.ohio-state.edu!rpi!dali.cs.montana.edu!milton!unicorn!ogicse!intelhf!littlei!omepd!inews!iwarp.intel.com!psueea!parsely!agora!markb
From: markb@agora.uucp (Mark Biggar)
Newsgroups: comp.lang.perl
Subject: Tokenizing in Perl
Message-ID: <1990Jul11.024939.25549@agora.uucp>
Date: 11 Jul 90 02:49:39 GMT
References: <1990Jul10.095016.2473@uvaarpa.Virginia.EDU>
Reply-To: markb@.UUCP (Mark Biggar)
Organization: Betazoid Central
Lines: 34

As Larry said, the simplest way to write a tokenizer in perl is to use
s/^...// to chop tokens off the front of your string.  With that in mind,
the following are a set of useful regular expressions for this purpose:

m|/\*[^*]*\*+([^/][^*]*\*+)*/|
	matches just the first C-style non-nested comment on the line

/"[^"\\]*(\\.[^"\\]*)*"/
	matches just the first "-quoted string on the line, with \ escapes

/("[^"]*")+/
	matches just the first Ada-style string on the line (a doubled ""
	inside the string stands for one embedded quote)
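
To show how these fit the s/^...// approach, here is a minimal sketch of
a tokenizer loop.  It is written for a current perl (my, references,
use strict), not the perl of this posting.  The name tokenize, the token
type labels, and the \w+ "word" rule are made up for illustration.  Each
RE is anchored with ^ and wrapped in parens so the chopped text lands
in $1.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into [type, text] pairs by
# repeatedly chopping one token off the front with s/^...//.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    while (length $src) {
        if ($src =~ s/^\s+//) {
            next;                               # skip leading whitespace
        } elsif ($src =~ s|^(/\*[^*]*\*+([^/][^*]*\*+)*/)||) {
            push @tokens, ['comment', $1];      # C-style comment RE from above
        } elsif ($src =~ s/^("[^"\\]*(\\.[^"\\]*)*")//) {
            push @tokens, ['string', $1];       # "-quoted string RE from above
        } elsif ($src =~ s/^(\w+)//) {
            push @tokens, ['word', $1];         # assumed rule for identifiers/numbers
        } else {
            $src =~ s/^(.)//s;                  # anything else: one character
            push @tokens, ['char', $1];
        }
    }
    return @tokens;
}

for my $tok (tokenize('x = "a\\"b"; /* note **/ y')) {
    print "$tok->[0]: $tok->[1]\n";
}
```

The final else branch always removes one character, so the loop cannot
get stuck on input that none of the other rules recognize (an
unterminated comment, for instance, just comes out as separate chars).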

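You can use the following to translate C-style \ escapes in a string matched
	by the \-escape string RE above. NOTE: the order of the alternatives in
	the RE below is significant: the catch-all (.) must come last, or it
	would swallow the first digit of an octal escape or the x of a hex
	escape.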


s/\\(([0-7]{1,3})|x([\da-fA-F]+)|(.))/&trans($2,$3,$4)/eg;
sub trans {
	local($oct,$hex,$single) = @_;
	if ($oct ne '') {
		pack("c",oct($oct));
	} elsif ($hex ne '') {
		pack("c",hex($hex));
	} else { #singleton case must have matched if others didn't
		substr($trans,ord($single),1);
			# def of $trans left as exercise for reader :-)
	}
}
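
For anyone who wants to try it, here is one possible answer to the
exercise, as a sketch for a current perl.  Everything about the table is
an assumption: $trans is built as a 256-character string where position
ord($c) holds whatever "\$c" should turn into, starting from the identity
mapping and then patching in the named C escapes.  The wrapper name
unescape, the switch from pack "c" to pack "C", and the & 0xFF masks are
also mine, not part of the code above.

```perl
use strict;
use warnings;

# Assumed lookup table: identity mapping for all 256 byte values, so that
# \" and \\ come out as " and \ with no special cases.
my $trans = join '', map { chr($_) } 0 .. 255;

# Patch in the named C escapes ("\013" is vertical tab).
my %named = (n => "\n", t => "\t", r => "\r", b => "\b",
             f => "\f", a => "\a", v => "\013");
substr($trans, ord($_), 1) = $named{$_} for keys %named;

sub trans {
    my ($oct, $hex, $single) = @_;
    if (defined $oct) {
        return pack("C", oct($oct) & 0xFF);   # keep only the low byte
    } elsif (defined $hex) {
        return pack("C", hex($hex) & 0xFF);   # keep only the low byte
    } else {
        # singleton case must have matched if the others didn't
        return substr($trans, ord($single), 1);
    }
}

# Hypothetical wrapper: apply the substitution to a copy of the string.
sub unescape {
    my ($s) = @_;
    $s =~ s/\\(([0-7]{1,3})|x([\da-fA-F]+)|(.))/trans($2,$3,$4)/eg;
    return $s;
}

print unescape(q{Hello,\tworld\x21\n});
```

Testing with defined rather than ne '' keeps use warnings quiet when a
group did not take part in the match; "\0" still works because the
matched text "0" is defined.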

--
Perl's Maternal Uncle
Mark Biggar