Path: utzoo!attcan!uunet!lll-winken!lll-tis!helios.ee.lbl.gov!pasteur!ucbvax!decwrl!labrea!sri-unix!quintus!ok
From: ok@quintus
Newsgroups: comp.lang.prolog
Subject: Soundex revisited
Message-ID: <305@quintus.UUCP>
Date: 23 Aug 88 01:46:30 GMT
Sender: news@quintus.UUCP
Reply-To: ok@quintus ()
Organization: Quintus Computer Systems, Inc.
Lines: 113


In a recent article, Thom Fruehwirth suggested that it is a good idea
to distinguish between Prolog as a specification language and Prolog
as an implementation language.  Quite right.  When he suggests that
you should write a specification version and an implementation version
of every predicate, I think he goes rather too far.  In particular, it
is not necessary for the implementation of Soundex to be quite as ugly
as his implementation version.  Proof by illustration:

/*  This description of the Soundex algorithm is taken from
    Knuth, "The Art of Computer Programming", Vol 3 "Sorting and Searching",
    page 392 of the first edition.

    1.  Retain the first letter of the name,
	and drop all occurrences of a, e, h, i, o, u, w, y in other positions.

    2.	Assign the following numbers to the remaining letters after the first:
	b, f, p, v -> 1			l -> 4
	c, g, j, k, q, s, x, z -> 2	m, n -> 5
	d, t -> 3			r -> 6

    3.	If two or more letters with the same code were adjacent in the
	original name (before step 1), omit all but the first.

    4.	Convert to the form "letter, digit, digit, digit" by adding
	trailing zeros (if there are fewer than three digits) or by
	dropping rightmost digits (if there are more than three).

    Note that this specification refers only to lower-case letters, so
    that is what the Prolog code below implements.  To handle upper-
    case letters as well, see the comment in chars_to_codes/2.
*/

soundex_atoms(Name, Soundex) :-
	name(Name, Chars),
	soundex_chars(Chars, Code),
	name(Soundex, Code).

soundex_chars([Char|Chars], [Char|Digits]) :-
	chars_to_codes([Char|Chars], [Code|Codes]),
	omit_duplicates_and_vowels(Codes, Code, CodesWithoutDuplicates),
	three_digits(CodesWithoutDuplicates, Digits).

/*  To handle both cases of letters, do
	to_lower(Char, Lower),
	char_to_code(Lower, Code)
    if you have to_lower/2, or do
	Lower is Char \/ 8'040,
	char_to_code(Lower, Code)
    if the input uses ISO 8859/1 and you haven't got to_lower/2.
*/
chars_to_codes([], []).
chars_to_codes([Char|Chars], [Code|Codes]) :-
	char_to_code(Char, Code),
	chars_to_codes(Chars, Codes).

char_to_code(0'a, 0).
char_to_code(0'e, 0).
char_to_code(0'h, 0).
char_to_code(0'i, 0).
char_to_code(0'o, 0).
char_to_code(0'u, 0).
char_to_code(0'w, 0).
char_to_code(0'y, 0).
char_to_code(0'b, 0'1).
char_to_code(0'f, 0'1).
char_to_code(0'p, 0'1).
char_to_code(0'v, 0'1).
char_to_code(0'c, 0'2).
char_to_code(0'g, 0'2).
char_to_code(0'j, 0'2).
char_to_code(0'k, 0'2).
char_to_code(0'q, 0'2).
char_to_code(0's, 0'2).
char_to_code(0'x, 0'2).
char_to_code(0'z, 0'2).
char_to_code(0'd, 0'3).
char_to_code(0't, 0'3).
char_to_code(0'l, 0'4).
char_to_code(0'm, 0'5).
char_to_code(0'n, 0'5).
char_to_code(0'r, 0'6).

%   Note that omit_duplicates_and_vowels/3 uses if->then;elses, but that
%   the "then" arrows could be replaced by conjunction without harming
%   the logic.

omit_duplicates_and_vowels([], _, []).
omit_duplicates_and_vowels([Code|Codes], Prev, Without) :-
	(   Code =:= Prev -> Without = More	% omit a duplicate
	;   Code =:= 0    -> Without = More	% omit a vowel
	;   Code =\= Prev, Code =\= 0 -> Without = [Code|More]
	),
	omit_duplicates_and_vowels(Codes, Code, More).


%   The cuts in three_digits/2 are all "green" ones.  That is, they
%   prune away (failed) proofs, not solutions.  Deeper indexing would
%   obviate the need for such cuts.

three_digits([],		"000").
three_digits([D1],	    [D1|"00"]) :- !.
three_digits([D1,D2],	  [D1,D2|"0"]) :- !.
three_digits([D1,D2,D3|_], [D1,D2,D3]).


/*  Test cases (from Knuth):
	euler, ellery		-> e460
	gauss, ghosh		-> g200
	hilbert, heilbronn	-> h416
	knuth, kant		-> k530
	lloyd, ladd		-> l300
	lukasiewicz, lissajous	-> l222
*/