Path: utzoo!attcan!uunet!mcvax!enea!kth!draken!bmc1!kuling!waldau From: waldau@kuling.UUCP (Mattias Waldau) Newsgroups: comp.lang.prolog Subject: Re: Soundex revisited Message-ID: <816@kuling.UUCP> Date: 30 Aug 88 12:05:23 GMT References: <1178@tuhold> Reply-To: waldau@kuling.UUCP (Mattias Waldau) Organization: UPMAIL, Computing Science Department, Uppsala University, Sweden Lines: 166 Thom Fruewirth and R.O.Keefe have both specified following soundex algorithm with Horn clauses. Following the tradition of Keith CLark, Sten-Ake Tarnlund, Zohar Manna, Richard Waldinger, and Alan Bundy (see his last lecture at the logic conference in Seattle) I have tried to specify the same algorithm in first order predicate logic with equality. The idea is that that specification should be easier to understand, and therefor more likely to be correct. The question to the newsgroup is whether this is true or not. /* This description of the Soundex algorithm is taken from Knuth, "The Art of Computer Programming", Vol 3 "Sorting and Searching", page 392 of the first edition. 1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions. 2. Assign the following numbers to the remaining letters after the first: b, f, p, v -> 1 l -> 4 c, g, j, k, q, s, x, z -> 2 m, n -> 5 d, t -> 3 r -> 6 3. If two or more letters with the same code were adjacent in the original name (before step 1), omit all but the first. 4. Convert to the form "letter, digit, digit, digit" by adding trailing zeros (if there are fewer than three digits) or by dropping rightmost digits (if there are more than three). */ Syntax: ======= I use ~, (x), (e x), &, |, ->, <-, <-> for not, forall, exists, and, or, onlyif, if, and iff respectively. Not, exists, forall binds titest, then and, then or, then onlyif and if, and at last iff. So a<->b|c&(e x)d&e->f is the same as a<->((b|(c&((e x)d)&e))->f). Variable names start with a capital letter, constants, functions and predicates are written with lower-case only. Free variables are quantified with forall. I use standard Prolog list syntax. * is the append function. *, eq, teq, eq2, teq2, in, =, and < are written as infix operators. Specification: ============== % soundex(Sequence_of_chars, Soundex_code), Soundex_code is the % soundex code of the Sequence_of_chars soundex([Letter|Rest], [Letter|Digits])) <-> (e T)(e V) {without_adjacent_identical_codes([Letter|Rest], [Letter|T]) & without_vowels(T, V) & three_first_digits(V, Digits)} % without_identical_adjancent_codes(With, Without), Without is the % shortest sequence without identical adjancent codes that is identical % to With. without_adjancent_identical_codes(With, Without) <-> With teq Without & ~(e Shorter){With teq Shorter & Shorter (e Before)(e After)(e Dup1)(e Dup2) {With=Before*[Dup1, Dup2]*After & Without=Before*[Dup1]*After & code(Dup1)=code(Dup2)} A teq B <-> A=B | (e C){A eq C & C teq B} % A (e Before)(e After)(e Element) {B=Before*[Element]*After & (A=Before*After | A With teq2 Without & (Element){Element in Without -> ~vowel(Element)} % A eq2 B, B is the same sequence as A, but with one vowel removed. % A teq2 B, teq2 is the transitive closure of eq2. A eq2 B <-> (e Before)(e Vowel)(e After) {A=Before*[Vovel]*After & B=Before*After & vowel(Vowel)} A teq2 B <-> A=B | (e C){A eq2 C & C teq2 B} % three_digits(V, Digits), V is a sequence of letters, Digits is a % sequence of the codes of the three first letters. If there are fewer % than three letters, add trailing zeroes to Digits. three_digits(V, Digits) <-> V=[] & Digits=[0, 0, 0] | V=[Char1] & Digits=[code(char1), 0, 0] | V=[Char1, Char2] & Digits=[code(Char1), code(Char2), 0] | (e Garbage){V=[Char1, Char2, Char3|Garbage] & Digits=[code(Char1), code(Char2), code(Char3)]} % vowel(V), V is a vowel (almost, h and w isn't) vowel(V) <-> {V=a | V=e | V=h | V=i | V=o | V=u | V=w | V=y } % code(Letter)=Code, Letter has the code Code. code(L)=C <-> (L=b | L=f | L=p | L=v -> C=1) & (L=c | L=g | L=j | L=k | L=q | L=s | L=x | L=z -> C=2) & (L=d | L=t -> C=3) & (L=l -> C=4) & (L=m | L=n -> C=5) & (L+r -> C=6) % A*B=C, C is the concatenation of the sequences A and B. A*B=C <-> A=[] & B=C | (e H)(e T){A=[H|T] & C=[H|T*B]} % E in S, E is a element of the sequence S. E in S <-> (e H)(e T){S=[H|T] & (E=H | E in T)} /Mattias Waldau -- Mattias Waldau mattias-waldau@aida.uu.se or Computing Science Department waldau@kuling.uucp P.O. Box 520, S-751 20 Uppsala, Sweden Phone: +46-18-181055