Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!zodiac!joyce!sri-unix!quintus!ok From: ok@quintus.uucp (Richard A. O'Keefe) Newsgroups: comp.lang.prolog Subject: Re: Soundex revisited Message-ID: <323@quintus.UUCP> Date: 30 Aug 88 01:13:18 GMT References: <1178@tuhold> Sender: news@quintus.UUCP Reply-To: ok@quintus.UUCP (Richard A. O'Keefe) Organization: Quintus Computer Systems, Inc. Lines: 56 In article <1178@tuhold> thom@tuhold (Thom Fruehwirth) writes: >Richard O'Keefe's implementation is based on a different >specification, which is easier to implement. My program was based on the specification of the Soundex function that Knuth provides. Anything which computes a different function is not Soundex. >Note that the resulting version is arbitrarily faster than the current version depending on the length of the word one codes. The reason for >for this is that my version never codes more than three digits, while >O'Keefe's version codes the whole word. Since the Soundex algorithm is primarily intended for encoding surnames, this is not much of a problem. Since the specification of the Soundex function requires that duplicate characters be discarded, it is possible to construct strings of arbitrary length _all_ of which must be inspected even if only 3 codes are produced. (Stick "sss.......sss" in front of any word you like.) However, Fruehwirth is quite right that I didn't take my "implementation" version as far as I could have. Here's my "optimised" version: the soundex_atoms/2 predicate and char_to_code/2 table didn't change at all. soundex_chars([Char|Chars], [Char|Codes]) :- char_to_code(Char, Code), soundex_chars(Chars, Code, 3, Codes). % soundex_chars(+Letters, +PreviousCode, +DigitsWanted, -Digits) soundex_chars([], _, N, Zeros) :- zeros(N, Zeros). soundex_chars([Char|Chars], Prev, N, Codes) :- char_to_code(Char, Code), ( Code =:= Prev -> % Discard duplicate characters soundex_chars(Chars, Code, N, Codes) ; Code =:= 0 -> % Ignore vowels soundex_chars(Chars, Code, N, Codes) ; N =:= 1 -> % Stop when 3 digits have been done Codes = [Code] ; M is N-1, % otherwise, convert 1 more character Codes = [Code|Codes1], soundex_chars(Chars, Code, M, Codes1) ). zeros(1, "0"). zeros(2, "00"). zeros(3, "000"). Ironically, this has even fewer cuts than the version I started with! It provides even less support for the idea that ugliness is warranted for efficiency's sake. Testing the two versions on a random sample of 20 names drawn from /usr/dict/words, the improved version was a little under twice as fast as the version I first thought of.