Xref: utzoo comp.unix.wizards:20064 comp.lang.c:24957 Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!cs.utexas.edu!tut.cis.ohio-state.edu!rutgers!cbmvax!mitchell From: mitchell@cbmvax.commodore.com (Fred Mitchell - PA) Newsgroups: comp.unix.wizards,comp.lang.c Subject: Re: fuzzy strcmp Message-ID: <9245@cbmvax.commodore.com> Date: 8 Jan 90 22:30:42 GMT References: <297@hhb.UUCP> Reply-To: mitchell@cbmvax.commodore.com (Fred Mitchell - PA) Organization: Commodore, West Chester, PA Lines: 71 In article <297@hhb.UUCP> istvan@hhb.UUCP (Istvan Mohos) writes: >tchrist@convexe.uucp (Tom Christiansen @ Convex Computer) writes: >>I'm looking for an algorithm that would allow me to determine >>whether two strings were similar. Thus >> >> "abcde" !~ "xyzzy" >> "this old man can read" =~ "that old man can't read" >> >>... perhaps just >> float strfzcmp(string1,string2) I will give a general description of an algorithm that can accomplish what you ask. NOTE: I am doing this from the 'top of my head'. Some refinement may be in order. You will do a weighted comparision based on two factors: a) The number of characters each string has in common b) The number of matches the strings have in sequence Let's take two arbitrary strings: "This old man can't read" string alpha "That silly man can't read" string beta Notice that they are of different lengths. Also, there is an alignment shift. The following algorithm _should_ properly handle this: a) Count the number of occurences of each character in each string. Compare the count of each character in alpha to that in beta in the following way: For each different character from alpha & beta: Normalize the counts so that the greater is normalized to 1. Multiply the two normalized values toghether. Add this product to a running total. Normalize this total by the the length of the longest string Multiply that by weight w1 yeilding (we shall call it) yw1 b) Compare the two strings byte for byte as follows: Start at the beginning of alpha & beta (let's use i and j as indexes) initialize k to 0 Until we hit the end of either string: if (alpha[i] == beta[j]) ++k, ++i, ++j else we scan forward on beta to find a byte that matchess our current location on alpha. If so, adjust j to index it. If nothing was found, do vice-versa for alpha. When the above loop is completed, Normalize k against the size of the longest string and multiply it by weight w2 yeilding (we shall call it) yw2. return (yw1 + yw2) / (w1 + w2) This should produce a good evaluation of how closely matched alpha is to beta, taking into account mis-alignment. The value returned will be between 0 and 1 inclusive. Weight w1 should be less than w2. Perhaps w1 = w2 / 3. Experiment. As I said, this is off the top of my head, so some refinement is in order, as should be evident during the implementation/testing phase. If anyone knows of a better way to accomplish this, let's hear it! -Mitchell mitchell@cbmvax.UUCP "The shortening of the Way."