Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!cis.ohio-state.edu!pacific.mps.ohio-state.edu!linac!att!rutgers!bagate!cbmvax!cbmehq!cbmdeo!lenler!moria!bojsen From: bojsen@moria.uucp (Per Bojsen) Newsgroups: comp.sys.amiga.programmer Subject: The Boyer-Moore string searching algorithm Re: Source code for search Message-ID: <19453c80.ARN360b@moria.uucp> Date: 8 Jun 91 23:51:12 GMT References: <492@regina.uregina.ca> Reply-To: cbmehq!moria!bojsen Followup-To: comp.sys.amiga.programmer Organization: IDUN-Soft Aps. Lines: 248 In article <492@regina.uregina.ca>, Dave Plummer writes: > Would anyone happen to have source code for a (very) fast search routine? > Basically, what I need is something like strstr(), but I need to be able > to modify it to make it non-case sensitive. I'm basically looking through > a small (512-2048 bytes) buffer for an ASCII string. > The fastest-known single-keyword string searching algorithm both in theory and practice is the Boyer-Moore string matching algorithm described in Boyer, R.S. and J.S. Moore, ``A Fast String Searching Algorithm'', Comm. ACM 20(10) (1977) pp. 62-72. I'll try to give a short description of the algorithm. Consider the problem of finding a pattern p consisting of a single keyword, i.e., not a regular expression, and an input string s. That is if s = xpy for some x and y p occurs in s, else not. Let p = p p ... p and s = s s ... s , where p 1 2 m 1 2 n i (there's no subscripts in ASCII) is the ith character of the pattern, and s j the jth character of the input string. The basic idea of the Boyer-Moore algorithm is to look for a match by comparing characters in the keyword from *right*to*left* against the input string. We start by comparing p with s . Now, if s does not occur in p there can be m m m *no* match for the pattern beginning at any of the first m characters of s. So we can immediately proceed by comparing p with s , thereby avoiding m-1 m 2m character comparisons. Now for the general case. We have just shifted the keyword to the right and are going to compare p with s : m k | V p ... p 1 m s ... s ... s ... s 1 k-m+1 k n ^ | j 1) We find that p and s do not match. If the rightmost occurence of s m k k in the pattern is p , we can shift the pattern g positions to the right. m-g This aligns p and s (which by definition match). We then continue m-g k with comparing p with s : m k+g | V p ... p ... p 1 m-g m s ... s ... s ... s ... s 1 k-m+1 k k+g n ^ | j As the special case, if sk did not occur in the pattern, the pattern is shifted m positions to the right. 2) Now assume the last m-i characters of the keyword matches the last m-i characters of the input string ending at position k. I.e.: p p ... p = s s ... s i+1 i+2 m k-m+i+1 k-m+i+2 k If i == 0 we are finished! We found a match. If i>0 and p != s i k-m+i we are left with two cases. (a) If the rightmost occurence of the character s in the pattern k-m+1 is p , then as in case 1 we shift the keyword g positions to the i-g right so that p and s are aligned and continue by comparing i-g k-m+i p with s : m k+g | V p ... p ... p 1 i-g m s ... s ... s ... s ... s 1 k-m+g+1 k-m+i k+g n ^ | j If p is to the right of p , that is g < 0, then we should just i-g i shift the pattern one position to the right and continue matching by comparing p with s . m k+1 (b) A longer shift than that obtained in the previous case may sometimes be possible. If the suffix p p ... p reoccurs as the sub- i+1 i+2 m string p p ...p in the keyword and p != p (if there're i+1-g i+2-g m-g i i-g more than one occurence then take the rightmost one). Then we may get a bigger shift than in the previous case by aligning p p ...p above s s ...s and continuing the i+1-g i+2-g m-g k-m+i+1 k-m+i+2 k search by comparing p with s : m k+g | V p ... p ... p ... p 1 i+1-g m-g m s ... s ... s ... s ... s ... s 1 k-m+g+1 k-m+i+1 k k+g n ^ | j The details of this algorithm may be formulated as (in pseudocode): begin j := m while j <= n do begin i := m while i > 0 cand p == s do i j begin i := i - 1; j := j - 1 end if i := 0 then return "yes" else j := j + max(d [s ], d [i]) 1 j 2 end return "no" end The `cand' is a conditional and, aka a shortcut and, or the && operator in C. The algorithm uses two tables d and d in determining how far the keyword 1 2 should be shifted to the right when p != s . The first table is indexed by i j characters. For every character c occuring in p, d [p] is m-i, where i is 1 the largest i such that p == c; for all c *not* occuring in p, d [p] is m. i 1 This table takes care of cases 1 and 2a. The second table takes care of case 2b. This table is indexed by positions in the keyword. For each 1 <= i <= m, d [i] gives the minimum shift g such 2 that when we position p above s , the substring p p ...p of m k+g i+1-g i+2-g m-g the keyword matches the substring s s ...s of the input string, k-m+i+1 k-m+i+2 k under the assumption that p did not match s . i k-m+i Formally: d [i] = min{g+m-i | g >= 1 and (g >= i or p != p ) 2 i-g i and ((g >= k or p == p ) for i < k <= m)} k-g k To compute this table use the following algorithm due to Donald Knuth: begin for i := 1 to n do d [i] := 2m - i 2 j := m; k := m + 1 while j > 0 do begin f[j] := k while k <= m cand p != p do j k begin d [k] := min(d [k], m - j) 2 2 k := f[k] end j := j - 1; k := k - 1 end for i := 1 to k do d [i] := min(d [i], m + k - i) 2 2 j := f[k] while k <= m do begin while k <= j do begin d [k] := min(d [k], j - k + m) 2 2 k := k + 1 end j := f[j] end end The intermediate table f computed by the algorithm has the property that f[m] == m + 1 and for 1 <= j < m, f[j] == min{i | j < i <= m and p p ...p == p p ...p } i+1 i+2 m j+1 j+2 m+j-i So much for the description of the algorithm. There are at least one caveat when you try to implement this algorithm in C. All arrays in the description has a first element index of 1. I have not had the time to convert the algorithm to the C standard, yet. As for the performance of the algorithm it is a theorem that it solves our problem in O(m+n) time and O(m) space. This is the worst case behavior, though, and the average behavior is much better. In average the algorithm need only compare about n/m input string characters. Compare this to the obvious brute-force algorithm, where you would compare *all* characters of the input string. As a nice example of how finding a better algorithm is much more important than trying to handoptimize code it has been experimentally observed how the Boyer-Moore algorithm outperformed simpler algorithms that were implemen- ted to take advantage of special-purpose machine instructions. The description above is adapted from A.V. Aho ``Algorithms for Finding Patterns in Strings'', pp. 267--271 in ``Algorithms and Complexity'' Vol. A of ``Handbook of Theoretical Computer Science''. As for C source code, I have just finished a small program today that uses the Boyer-Moore algorithm. If I get the time I'll extract the Boyer- Moore parts and post the source, if there's interest. Otherwise you might want to look at the GNU regex.c module, which includes code for the Boyer- Moore algorithm. Hope this helps :-) -- .------------------------------------------------------------------------------. | Greetings from Per Bojsen. | +------------------------------+-----------------------------------------------+ | EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of | | Or: bojsen@dc.dth.dk | conjuring images of places we have not seen" | `------------------------------+-----------------------------------------------'