Path: utzoo!censor!geac!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!zaphod.mps.ohio-state.edu!maverick.ksu.ksu.edu!ux1.cso.uiuc.edu!s.psych.uiuc.edu!amead From: amead@s.psych.uiuc.edu (alan mead) Newsgroups: comp.lang.pascal Subject: Re: efficient string searches Keywords: clever, Boyer-Moore, string searches Message-ID: <1991Feb7.045231.11671@ux1.cso.uiuc.edu> Date: 7 Feb 91 04:52:31 GMT Sender: news@ux1.cso.uiuc.edu (News) Distribution: comp Organization: University of Illinois at Urbana Lines: 137 Here is a file from TUG002.ZIP which can be had at chyde. The other files mentioned in this document (the source code) are there as well as many other useful files. It describes what I find to be a clever search algorithm (it would only be worth it to use this for searching many characters): TUG PDS CERT 1.01 (Documentation) ========================================================================== TUG PUBLIC DOMAIN SOFTWARE CERTIFICATION The Turbo User Group (TUG) is recognized by Borland International as the official support organization for Turbo languages. This file has been verified by the TUG library staff. We are reasonably certain that the information contained in this file is public domain material, but it is also subject to any restrictions applied by its author. This diskette contains information determined to be in the PUBLIC DOMAIN, provided as a service of TUG for the use of its members. The Turbo User Group will not be liable for any damages, including any lost profits, lost savings or other incidental or consequential damages arising out of the use of or inability to use the contents, even if TUG has been advised of the possibility of such damages, or for any claim by any other party. To the best of our knowledge, the information in this file is accurate. If you discover an error in this file, we would appreciate it if you would report it to us. To report bugs, or to request information on membership in TUG, please contact us at: Turbo User Group PO Box 1510 Poulsbo, Washington USA 98370 -------------------------------------------------------------------------- F i l e I n f o r m a t i o n * DESCRIPTION Documentation file explaining about Boyer-MOORE. * ASSOCIATED FILES SEARCHES.PAS BOYER.DOC SEARCHES.DOC * CHECKED BY DRM - 08/14/88 * KEYWORDS TURBO PASCAL V4.0 DOCUMENTATION SEARCH INLINE ========================================================================== } W. Vann Hall Pala Designs P.O. Box 10804 Arlington, VA 22210 CIS 70346,1144 MCI WVHALL ABOUT Boyer-MOORE: Boyer-Moore is an attempt to speed up the usually serial text search. Traditionally, text searches proceed from the first character onward. (In these examples, "string" refers to string we're trying to locate, "text" the array of characters we're trying to find the string in. These all ignore case sensitivity, matching around line boundaries, punctuation, etc.) String to find: STRING Text to find it in: HOW YOU SEARCH FOR A STRING Our StringPointer and TextPointer both start at 1. We compare String[1] and Text[1]. Since "S" and "H" don't match, we increment the text pointer and compare again. At TextPointer = 9, the "S" of "STRING" and the "S" of "SEARCH" match, so we increment TextPointer *and* StringPointer and compare String[2] and Text[10]. "T" and "E" don't match, so we reset StringPointer and increment TextPointer again. And so on. You can see that it takes 22 comparisons before we locate the correct "S" and a full 27 before we know that we have located "STRING." Boyer-Moore, on the other hand, works from the end of the string (but still the beginning of the text). First, it creates a look-up table of values corresponding to the distance of the first occurence of each character from the end of the string. For instance, the table for "STRING" would read: A=6 B=6 C=6 D=6 E=6 F=6 G=6 H=6 I=2 J=6 K=6 L=6 M=6 N=1 O=6 P=6 Q=6 R=3 S=5 T=4 U=6 V=6 W=6 X=6 Y=6 Z=6 Note that characters not located in the string are set to Length(String), as it the final character. Since a 6-character string can't be located entirely within the first 5 characters, we start by comparing the String[6] -- "G" -- with Text[6]: "O". They don't match, so we increment TextPointer by the value associated with "O" in the table; that is, by 6. Our next compare is "G" with the "R" in "SEARCH". We now increment TextPointer by "R"'s value, 3, and compare "S" to " ". And so on. Here's a tracing of the progression; the letter to be compared is highlighted by lower casing: STRINg HOW YoU SEARCH FOR A STRING STRINg HOW YOU SEArCH FOR A STRING STRINg HOW YOU SEARCH^FOR A STRING STRINg HOW YOU SEARCH FOR A STRING STRINg HOW YOU SEARCH FOR A STRINg STRIng HOW YOU SEARCH FOR A STRIng STRing HOW YOU SEARCH FOR A STRing STring HOW YOU SEARCH FOR A STring String HOW YOU SEARCH FOR A String string HOW YOU SEARCH FOR A string There; 10 comparisons total. You can see how this would speed searching a long text -- enough to make up for the additional overhead of compiling the look-up table.