Path: utzoo!attcan!uunet!lll-winken!lll-tis!helios.ee.lbl.gov!pasteur!ucbvax!decwrl!labrea!sri-unix!quintus!ok From: ok@quintus Newsgroups: comp.lang.prolog Subject: Soundex revisited Message-ID: <305@quintus.UUCP> Date: 23 Aug 88 01:46:30 GMT Sender: news@quintus.UUCP Reply-To: ok@quintus () Organization: Quintus Computer Systems, Inc. Lines: 113 In a recent article, Thom Fruehwirth suggested that it is a good idea to distinguish between Prolog as a specification language and Prolog as an implementation language. Quite right. When he suggests that you should write a specification version and an implementation version of every predicate, I think he goes rather too far. In particular, it is not necessary for the implementation of Soundex to be quite as ugly as his implementation version. Proof by illustration: /* This description of the Soundex algorithm is taken from Knuth, "The Art of Computer Programming", Vol 3 "Sorting and Searching", page 392 of the first edition. 1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions. 2. Assign the following numbers to the remaining letters after the first: b, f, p, v -> 1 l -> 4 c, g, j, k, q, s, x, z -> 2 m, n -> 5 d, t -> 3 r -> 6 3. If two or more letters with the same code were adjacent in the original name (before step 1), omit all but the first. 4. Convert to the form "letter, digit, digit, digit" by adding trailing zeros (if there are fewer than three digits) or by dropping rightmost digits (if there are more than three). Note that this specification refers only to lower-case letters, so that is what the Prolog code below implements. To handle upper- case letters as well, see the comment in chars_to_codes/2. */ soundex_atoms(Name, Soundex) :- name(Name, Chars), soundex_chars(Chars, Code), name(Soundex, Code). soundex_chars([Char|Chars], [Char|Digits]) :- chars_to_codes([Char|Chars], [Code|Codes]), omit_duplicates_and_vowels(Codes, Code, CodesWithoutDuplicates), three_digits(CodesWithoutDuplicates, Digits). /* To handle both cases of letters, do to_lower(Char, Lower), char_to_code(Lower, Code) if you have to_lower/2, or do Lower is Char \/ 8'040, char_to_code(Lower, Code) if the input uses ISO 8859/1 and you haven't got to_lower/2. */ chars_to_codes([], []). chars_to_codes([Char|Chars], [Code|Codes]) :- char_to_code(Char, Code), chars_to_codes(Chars, Codes). char_to_code(0'a, 0). char_to_code(0'e, 0). char_to_code(0'h, 0). char_to_code(0'i, 0). char_to_code(0'o, 0). char_to_code(0'u, 0). char_to_code(0'w, 0). char_to_code(0'y, 0). char_to_code(0'b, 0'1). char_to_code(0'f, 0'1). char_to_code(0'p, 0'1). char_to_code(0'v, 0'1). char_to_code(0'c, 0'2). char_to_code(0'g, 0'2). char_to_code(0'j, 0'2). char_to_code(0'k, 0'2). char_to_code(0'q, 0'2). char_to_code(0's, 0'2). char_to_code(0'x, 0'2). char_to_code(0'z, 0'2). char_to_code(0'd, 0'3). char_to_code(0't, 0'3). char_to_code(0'l, 0'4). char_to_code(0'm, 0'5). char_to_code(0'n, 0'5). char_to_code(0'r, 0'6). % Note that omit_duplicates_and_vowels/3 uses if->then;elses, but that % the "then" arrows could be replaced by conjunction without harming % the logic. omit_duplicates_and_vowels([], _, []). omit_duplicates_and_vowels([Code|Codes], Prev, Without) :- ( Code =:= Prev -> Without = More % omit a duplicate ; Code =:= 0 -> Without = More % omit a vowel ; Code =\= Prev, Code =\= 0 -> Without = [Code|More] ), omit_duplicates_and_vowels(Codes, Code, More). % The cuts in three_digits/2 are all "green" ones. That is, they % prune away (failed) proofs, not solutions. Deeper indexing would % obviate the need for such cuts. three_digits([], "000"). three_digits([D1], [D1|"00"]) :- !. three_digits([D1,D2], [D1,D2|"0"]) :- !. three_digits([D1,D2,D3|_], [D1,D2,D3]). /* Test cases (from Knuth): euler, ellery -> e460 gauss, ghosh -> g200 hilbert, heilbronn -> h416 knuth, kant -> k530 lloyd, ladd -> l300 lukasiewicz, lissajous -> l222 */