Xref: utzoo comp.unix.wizards:20064 comp.lang.c:24957
Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!cs.utexas.edu!tut.cis.ohio-state.edu!rutgers!cbmvax!mitchell
From: mitchell@cbmvax.commodore.com (Fred Mitchell - PA)
Newsgroups: comp.unix.wizards,comp.lang.c
Subject: Re: fuzzy strcmp
Message-ID: <9245@cbmvax.commodore.com>
Date: 8 Jan 90 22:30:42 GMT
References: <297@hhb.UUCP>
Reply-To: mitchell@cbmvax.commodore.com (Fred Mitchell - PA)
Organization: Commodore, West Chester, PA
Lines: 71

In article <297@hhb.UUCP> istvan@hhb.UUCP (Istvan Mohos) writes:
>tchrist@convexe.uucp (Tom Christiansen @ Convex Computer) writes:
>>I'm looking for an algorithm that would allow me to determine
>>whether two strings were similar.  Thus 
>>
>>	"abcde" !~ "xyzzy"
>>	"this old man can read" =~ "that old man can't read"
>>
>>... perhaps just
>>    float   strfzcmp(string1,string2)

I will give a general description of an algorithm that can accomplish
what you ask.

NOTE: I am doing this from the 'top of my head'. Some refinement may
be in order.

You will do a weighted comparision based on two factors:
a) The number of characters each string has in common
b) The number of matches the strings have in sequence

Let's take two arbitrary strings:
	"This old man can't read"    string alpha
	"That silly man can't read"  string beta

Notice that they are of different lengths. Also, there is an alignment
shift. The following algorithm _should_ properly handle this:

a) Count the number of occurences of each character in each string.
   Compare the count of each character in alpha to that in beta in the
   following way:
     For each different character from alpha & beta:
	Normalize the counts so that the greater is normalized to 1.
	Multiply the two normalized values toghether.
	Add this product to a running total.
     Normalize this total by the the length of the longest string
     Multiply that by weight w1 yeilding (we shall call it) yw1

b) Compare the two strings byte for byte as follows:
     Start at the beginning of alpha & beta (let's use i and j as indexes)
     initialize k to 0
     Until we hit the end of either string:
	if (alpha[i] == beta[j])
		++k, ++i, ++j
	else 
		we scan forward on beta to find a byte that matchess
		our current location on alpha. If so, adjust j to index
		it. If nothing was found, do vice-versa for alpha.
     When the above loop is completed,
     Normalize k against the size of the longest string
     and multiply it by weight w2 yeilding (we shall call it) yw2.

     return (yw1 + yw2) / (w1 + w2)


This should produce a good evaluation of how closely matched alpha
is to beta, taking into account mis-alignment. The value returned
will be between 0 and 1 inclusive. Weight w1 should be less than w2.
Perhaps w1 = w2 / 3. Experiment.

As I said, this is off the top of my head, so some refinement is in
order, as should be evident during the implementation/testing phase.

If anyone knows of a better way to accomplish this, let's hear it!

	-Mitchell
	 mitchell@cbmvax.UUCP
	 "The shortening of the Way."