Path: utzoo!attcan!uunet!wuarchive!usc!elroy.jpl.nasa.gov!jpl-devvax!lwall
From: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall)
Newsgroups: comp.unix.questions
Subject: Re: Fuzzy grep?
Keywords: grep
Message-ID: <10240@jpl-devvax.JPL.NASA.GOV>
Date: 5 Nov 90 19:03:26 GMT
References: <242@locke.water.ca.gov>
Reply-To: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall)
Organization: Jet Propulsion Laboratory, Pasadena, CA
Lines: 87

In article <242@locke.water.ca.gov> rfinch@caldwr.water.ca.gov (Ralph Finch) writes:
: Is there something like grep, except it will (easily) search an entire
: file (not just line-by-line) for regexp's near each other?  Ideally it
: would rank hits by how much or how close they match, e.g.
:
: fzgrep 'abc.*123' filename
:
: would return hits not by line number but by how close abc & 123 are
: found together.  Also it wouldn't matter what order the regexp's are.

I sincerely doubt you're going to find a specialized tool to do that.
But if you just slurp a file into a string in Perl, you can then start
playing with it.  For example, if your search strings are fixed, you can
use index:

    #!/usr/bin/perl
    undef $/;                       # slurp each file as one string
    while (<>) {                    # for each file
        $posabc = index($_, "abc");
        next if $posabc < 0;
        $pos123 = index($_, "123");
        next if $pos123 < 0;
        $diff = $posabc - $pos123;
        $diff = -$diff if $diff < 0;
        print "$ARGV: $diff\n";
    }

Of course, you'd probably want to make a subroutine of that middle junk.
Or you can say:

    #!/usr/bin/perl
    undef $/;
    while (<>) {                    # for each file
        tr/\n/ /;                   # so . matches anything
        (/(abc.*)123/ || /(123.*)abc/)
            && print "$ARGV: " . (length($1)-3) . "\n";
    }

Those .*'s are going to be expensive, though.  Maybe

    #!/usr/bin/perl
    undef $/;
    while (<>) {                    # for each file
        next unless /abc/;
        $posabc = length($`);       # $` is the text before the match
        next unless /123/;
        $pos123 = length($`);
        $diff = $posabc - $pos123;
        $diff = -$diff if $diff < 0;
        print "$ARGV: $diff\n";
    }

Of course, none of these solutions is going to find the closest pair,
necessarily.
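
A minimal sketch of why the closest pair can be missed, assuming the
index-based approach with fixed strings.  The helper name first_match_diff
and the sample string are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: start-to-start distance between the FIRST "abc"
# and the FIRST "123" in $text, or undef if either string is absent.
sub first_match_diff {
    my ($text) = @_;
    my $posabc = index($text, "abc");
    my $pos123 = index($text, "123");
    return undef if $posabc < 0 || $pos123 < 0;
    return abs($posabc - $pos123);
}

# Made-up sample: the first "abc" is far from the only "123",
# but a second "abc" sits immediately before it.
my $sample = "abc" . ("x" x 20) . "abc123";
print first_match_diff($sample), "\n";    # prints 26
```

In this sample the second abc starts only 3 characters before 123, yet the
first-match approach reports 26, because index never looks past the first
occurrence of each string.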
To find the closest pair, use a nested split, which also works with
arbitrary regular expressions:

    #!/usr/bin/perl
    undef $/;
    while (<>) {                    # for each file
        $min = length($_);
        @abc = split(/abc/, $_, 999999);
        next if @abc < 2;           # no match
        &try(shift(@abc), 0, 1);    # before first abc: only the trailing gap counts
        &try(pop(@abc), 1, 0);      # after last abc: only the leading gap counts
        foreach $chunk (@abc) {     # between two abc's: both gaps count
            &try($chunk, 1, 1);
        }
        next if $min == length($_);
        print "$ARGV: $min\n";
    }

    sub try {
        ($hunk, $first, $last) = @_;
        @pieces = split(/123/, $hunk, 999999);
        return if @pieces < 2;      # no 123 in this chunk
        if ($first && $min > length($pieces[0])) {
            $min = length($pieces[0]);
        }
        if ($last && $min > length($pieces[$#pieces])) {
            $min = length($pieces[$#pieces]);
        }
    }

Or something like that...

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov