Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!zodiac!joyce!sri-unix!quintus!ok
From: ok@quintus.uucp (Richard A. O'Keefe)
Newsgroups: comp.lang.prolog
Subject: Re: Soundex revisited
Message-ID: <323@quintus.UUCP>
Date: 30 Aug 88 01:13:18 GMT
References: <1178@tuhold>
Sender: news@quintus.UUCP
Reply-To: ok@quintus.UUCP (Richard A. O'Keefe)
Organization: Quintus Computer Systems, Inc.
Lines: 56

In article <1178@tuhold> thom@tuhold (Thom Fruehwirth) writes:
>Richard O'Keefe's implementation is based on a different
>specification, which is easier to implement.

My program was based on the specification of the Soundex function
that Knuth provides.  Anything which computes a different function
is not Soundex.

>Note that the resulting version is arbitrarily faster than the current
version depending on the length of the word one codes.  The reason for
>for this is that my version never codes more than three digits, while
>O'Keefe's version codes the whole word.

Since the Soundex algorithm is primarily intended for encoding surnames,
this is not much of a problem.  Since the specification of the Soundex
function requires that duplicate characters be discarded, it is possible
to construct strings of arbitrary length _all_ of which must be inspected
even if only 3 codes are produced.  (Stick "sss.......sss" in front of any
word you like.)

However, Fruehwirth is quite right that I didn't take my "implementation"
version as far as I could have.  Here's my "optimised" version: the
soundex_atoms/2 predicate and char_to_code/2 table didn't change at all.  


soundex_chars([Char|Chars], [Char|Codes]) :-
	char_to_code(Char, Code),
	soundex_chars(Chars, Code, 3, Codes).

%   soundex_chars(+Letters, +PreviousCode, +DigitsWanted, -Digits)

soundex_chars([], _, N, Zeros) :-
	zeros(N, Zeros).
soundex_chars([Char|Chars], Prev, N, Codes) :-
	char_to_code(Char, Code),
	(   Code =:= Prev ->		% Discard duplicate characters
	    soundex_chars(Chars, Code, N, Codes)
	;   Code =:= 0 ->		% Ignore vowels
	    soundex_chars(Chars, Code, N, Codes)
	;   N =:= 1 ->			% Stop when 3 digits have been done
	    Codes = [Code]
	;   M is N-1,			% otherwise, convert 1 more character
	    Codes = [Code|Codes1],
	    soundex_chars(Chars, Code, M, Codes1)
	).

zeros(1, "0").
zeros(2, "00").
zeros(3, "000").

Ironically, this has even fewer cuts than the version I started with!
It provides even less support for the idea that ugliness is warranted
for efficiency's sake.

Testing the two versions on a random sample of 20 names drawn from
/usr/dict/words, the improved version was a little under twice as
fast as the version I first thought of.