Path: utzoo!attcan!uunet!wuarchive!usc!elroy.jpl.nasa.gov!jpl-devvax!lwall
From: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall)
Newsgroups: comp.unix.questions
Subject: Re: Fuzzy grep?
Keywords: grep
Message-ID: <10240@jpl-devvax.JPL.NASA.GOV>
Date: 5 Nov 90 19:03:26 GMT
References: <242@locke.water.ca.gov>
Reply-To: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall)
Organization: Jet Propulsion Laboratory, Pasadena, CA
Lines: 87

In article <242@locke.water.ca.gov> rfinch@caldwr.water.ca.gov (Ralph Finch) writes:
: Is there something like grep, except it will (easily) search an entire
: file (not just line-by-line) for regexp's near each other? Ideally it
: would rank hits by how much or how close they match, e.g.
: 
: fzgrep 'abc.*123' filename
: 
: would return hits not by line number but by how close abc & 123 are
: found together.  Also it wouldn't matter what order the regexp's are.

I sincerely doubt you're going to find a specialized tool to do that.
But if you just slurp a file into a string in Perl, you can then
start playing with it.  For example, if your search strings are fixed,
you can use index:

	#!/usr/bin/perl
	undef $/;
	while (<>) {	# for each file
	    $posabc = index($_, "abc");
	    next if $posabc < 0;
	    $pos123 = index($_, "123");
	    next if $pos123 < 0;
	    $diff = $posabc - $pos123;
	    $diff = -$diff if $diff < 0;
	    print "$ARGV: $diff\n";
	}

Of course, you'd probably want to make a subroutine of that middle junk.
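A minimal sketch of such a subroutine follows; the name first_distance
and its return value of -1 for "not found" are my own assumptions, not
anything standard:

```perl
#!/usr/bin/perl
# Sketch only: first_distance is a hypothetical name.
# Returns the distance between the first occurrence of each fixed string,
# or -1 if either string is missing.
sub first_distance {
    local($text, $s1, $s2) = @_;
    local($pos1, $pos2, $diff);
    $pos1 = index($text, $s1);
    return -1 if $pos1 < 0;
    $pos2 = index($text, $s2);
    return -1 if $pos2 < 0;
    $diff = $pos1 - $pos2;
    $diff = -$diff if $diff < 0;
    return $diff;
}

# "abc" starts at offset 2 and "123" at offset 9
print &first_distance("xxabcyyyy123", "abc", "123"), "\n";
```

Inside the per-file loop the body would then shrink to something like
`$diff = &first_distance($_, "abc", "123"); next if $diff < 0;`.
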
Or you can say:

	#!/usr/bin/perl
	undef $/;
	while (<>) {	# for each file
	    tr/\n/ /;			# so . matches anything
	    (/(abc.*)123/ || /(123.*)abc/)
		&& print "$ARGV: " . (length($1)-3) . "\n"
	}
	
Those .*'s are going to be expensive, though.  Maybe

	#!/usr/bin/perl
	undef $/;
	while (<>) {	# for each file
	    next unless /abc/;
	    $posabc = length($`);
	    next unless /123/;
	    $pos123 = length($`);
	    $diff = $posabc - $pos123;
	    $diff = -$diff if $diff < 0;
	    print "$ARGV: $diff\n";
	}

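Of course, none of these solutions is going to find the closest pair,
necessarily: index and the plain matches only look at the first occurrence
of each string, and the greedy .* stretches to the last one.  To find the
closest pair, use a nested split, which also works with arbitrary regular
expressions.  Split the file on one pattern, split each chunk on the other,
and keep the shortest piece that touches a match of the first pattern: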


	#!/usr/bin/perl
	undef $/;
	while (<>) {	# for each file
	    $min = length($_);
	    @abc = split(/abc/, $_, 999999);
	    next if @abc == 1;		# no match
	    &try(shift(@abc), 0, 1);
	    &try(pop(@abc),   1, 0);
	    foreach $chunk (@abc) {
		&try($chunk, 1, 1);
	    }
	    next if $min == length($_);
	    print "$ARGV: $min\n";
	}

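	sub try {
	    local($hunk, $first, $last) = @_;
	    local(@pieces) = split(/123/, $hunk, 999999);
	    return if @pieces < 2;	# no 123 in this hunk
	    # first piece: gap from the preceding abc to the first 123
	    if ($first && $min > length($pieces[0])) {
		$min = length($pieces[0]);
	    }
	    # last piece: gap from the last 123 to the following abc
	    if ($last && $min > length($pieces[$#pieces])) {
		$min = length($pieces[$#pieces]);
	    }
	}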

Or something like that...
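
For comparison, here is a rough sketch of a different technique from the
nested split: record where every match of each pattern starts and ends,
then compare all pairs by brute force.  The names spans and min_gap are
made up for illustration, and the code assumes a Perl recent enough to
have my, m//g in scalar context, and the @- and @+ arrays:

```perl
#!/usr/bin/perl
# Sketch only: spans and min_gap are hypothetical helper names.
# Assumes a Perl that provides my, scalar m//g, and the @- / @+ arrays.

# Return a [start, end] offset pair for every match of $re in $text.
sub spans {
    my ($text, $re) = @_;
    my @spans;
    while ($text =~ /$re/g) {
        push @spans, [$-[0], $+[0]];
    }
    return @spans;
}

# Smallest number of characters between any match of $re1 and any match
# of $re2, in either order; -1 if either pattern never matches.
sub min_gap {
    my ($text, $re1, $re2) = @_;
    my @first  = spans($text, $re1);
    my @second = spans($text, $re2);
    return -1 unless @first && @second;
    my $min;
    foreach my $x (@first) {
        foreach my $y (@second) {
            # measure from the end of the earlier match to the start of the later
            my $gap = $x->[0] <= $y->[0] ? $y->[0] - $x->[1]
                                         : $x->[0] - $y->[1];
            $gap = 0 if $gap < 0;    # overlapping matches count as 0
            $min = $gap if !defined($min) || $gap < $min;
        }
    }
    return $min;
}

# The closest pair here is the second abc and the second 123, one character apart.
print min_gap("abc.....123..abc.123", "abc", "123"), "\n";
```

Comparing every pair is quadratic in the number of matches, so this is
only reasonable when matches are sparse.  To rank files the way the
original question asks, one could save min_gap's result per file and
sort on it.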

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov