Path: utzoo!attcan!uunet!samsung!brutus.cs.uiuc.edu!ux1.cso.uiuc.edu!uxh.cso.uiuc.edu!ejk
From: ejk@uxh.cso.uiuc.edu (Ed Kubaitis)
Newsgroups: comp.lang.perl
Subject: Re: Fuzzy pattern matching
Message-ID: <1990Apr7.160946.13651@ux1.cso.uiuc.edu>
Date: 7 Apr 90 16:09:46 GMT
Sender: usenet@ux1.cso.uiuc.edu (News)
Reply-To: ejk@uxh.cso.uiuc.edu (Ed Kubaitis)
Followup-To: Fuzzy pattern matching
Organization: University of Illinois at Urbana
Lines: 38

My application involves locating the start of each article listed in the
table of contents of machine-readable issues of a publication intended for
human readers, not programs.  (The problem is that these documents are
apparently entered or edited by hand, so titles in the table of contents do
not always exactly match the titles in the text: misspellings, abbreviations
in one place but not the other, different punctuation, etc.)

Here's a fuzzy match that has (so far) served well. The algorithm is to
calculate the ratio of two-letter substrings in string A occurring anywhere
in string B to the total number of two-letter substrings in the shorter of
the two strings. A ratio of 0.8 or greater empirically seems to do what
I want. You may want to change the way of dealing with differences in
lengths of the two strings, or try 3-character substrings, etc. It may not
be the best, but it's small, easy to understand, and fairly fast.
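
As a rough worked example of that ratio, here is the calculation spelled out
for one pair of titles. The two titles are made up for illustration, and the
sketch uses index() for a literal search rather than a pattern match:

```perl
use strict;
use warnings;

# Two hypothetical titles: an abbreviated one and its full form.
my $short = "Intro. to Perl";
my $long  = "Introduction to Perl";

# A string of length n has n-1 two-letter substrings.
my $total = length($short) - 1;
my $found = 0;

for my $k (0 .. $total - 1) {
    my $pair = substr($short, $k, 2);
    # index() searches for the literal text, so no regex escaping is needed.
    $found++ if index($long, $pair) >= 0;
}

printf "%d of %d pairs found, ratio %.2f\n", $found, $total, $found / $total;
```

Only the pairs "o." and ". " are missing from the longer title, so 11 of 13
pairs are found and the ratio (about 0.85) clears the 0.8 threshold.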

Ed Kubaitis
Computing Services Office
University of Illinois
------------------------------------------------------------------------------
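# fmatch(A, B): returns 1 if strings A and B are a fuzzy match, else 0.
# $l is the length of the shorter string, $w the substring width,
# $m the number of substrings of A found in B.
sub fmatch {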
   local($A, $B) = @_;
   local($l) = (length($A) < length($B)) ? length($A) : length($B);
   local($m) = 0;
   local($w) = 2;
   local($k);

   $A eq $B && return(1.0);
   $l > $w || return(0.0);

   for $k(0..$l-$w) { 
      local($s) = substr($A, $k, $w);
      #---literal search, so r.e. characters in string A need no escaping---
      index($B, $s) >= 0 && $m++;
      }

   #---the loop above tested $l-$w+1 substrings---
   ($m/($l-$w+1) >= 0.80) ? 1 : 0;
   }
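
For anyone who wants to try the variations mentioned above (a different
substring width, or a different way of handling unequal lengths), a minimal
sketch follows. The name ngram_ratio and its optional width argument are
hypothetical, not part of the routine above. It assumes the substrings are
always taken from the shorter string, and it returns the raw ratio so the
caller can pick the threshold:

```perl
use strict;
use warnings;

# ngram_ratio is a hypothetical helper, not part of the original routine.
# It returns the fraction of $w-character substrings of the shorter string
# that occur anywhere in the longer string.
sub ngram_ratio {
    my ($x, $y, $w) = @_;
    $w = 2 unless defined $w;

    # Assumption: always take substrings from the shorter string.
    my ($short, $long) = length($x) <= length($y) ? ($x, $y) : ($y, $x);

    my $total = length($short) - $w + 1;   # number of substrings to test
    return 0 if $total < 1;                # too short to compare

    my $found = 0;
    for my $k (0 .. $total - 1) {
        # Literal search; no regex escaping required.
        $found++ if index($long, substr($short, $k, $w)) >= 0;
    }
    return $found / $total;
}

# Usage with made-up titles: width 2 (the default), then width 3.
printf "%.2f\n", ngram_ratio("Intro. to Perl", "Introduction to Perl");
printf "%.2f\n", ngram_ratio("Intro. to Perl", "Introduction to Perl", 3);
```

On this pair the width-3 ratio comes out lower than the width-2 ratio (0.75
against roughly 0.85), so the 0.8 cutoff would probably need retuning if the
width is changed.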