Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!cis.ohio-state.edu!pacific.mps.ohio-state.edu!linac!att!rutgers!bagate!cbmvax!cbmehq!cbmdeo!lenler!moria!bojsen
From: bojsen@moria.uucp (Per Bojsen)
Newsgroups: comp.sys.amiga.programmer
Subject: The Boyer-Moore string searching algorithm
     Re: Source code for search
Message-ID: <19453c80.ARN360b@moria.uucp>
Date: 8 Jun 91 23:51:12 GMT
References: <492@regina.uregina.ca>
Reply-To: cbmehq!moria!bojsen
Followup-To: comp.sys.amiga.programmer
Organization: IDUN-Soft Aps.
Lines: 248

In article <492@regina.uregina.ca>, Dave Plummer writes:

> Would anyone happen to have source code for a (very) fast search routine?
> Basically, what I need is something like strstr(), but I need to be able
> to modify it to make it non-case sensitive.  I'm basically looking through
> a small (512-2048 bytes) buffer for an ASCII string.
>
The fastest-known single-keyword string searching algorithm both in theory
and practice is the Boyer-Moore string matching algorithm described in
Boyer, R.S. and J.S. Moore, ``A Fast String Searching Algorithm'', Comm.
ACM 20(10) (1977) pp. 62-72.

I'll try to give a short description of the algorithm.  Consider the problem
of finding a pattern p consisting of a single keyword, i.e., not a regular
expression, and an input string s.  That is if s = xpy for some x and y p
occurs in s, else not.  Let p = p p  ... p  and s = s s  ... s , where p
                                 1 2      m          1 2      n         i
(there's no subscripts in ASCII) is the ith character of the pattern, and s
                                                                           j
the jth character of the input string.

The basic idea of the Boyer-Moore algorithm is to look for a match by comparing
characters in the keyword from *right*to*left* against the input string.  We
start by comparing p  with s .  Now, if s  does not occur in p there can be
                    m       m            m
*no* match for the pattern beginning at any of the first m characters of s.
So we can immediately proceed by comparing p  with s  , thereby avoiding m-1
                                            m       2m
character comparisons.

Now for the general case.  We have just shifted the keyword to the right and
are going to compare p  with s :
                      m       k

                          |
                          V
               p      ... p
                1          m
        s  ... s      ... s  ... s
         1      k-m+1      k      n
                          ^
                          |
                          j

1) We find that p  and s  do not match.  If the rightmost occurence of s
                 m      k                                               k
   in the pattern is p   , we can shift the pattern g positions to the right.
                      m-g
   This aligns p    and s  (which by definition match).  We then continue
                m-g      k
   with comparing p  with s   :
                   m       k+g
                                   |
                                   V
               p      ... p    ... p
                1          m-g      m
        s  ... s      ... s    ... s    ... s
         1      k-m+1      k        k+g      n
                                   ^
                                   |
                                   j

   As the special case, if sk did not occur in the pattern, the pattern is
   shifted m positions to the right.

2) Now assume the last m-i characters of the keyword matches the last m-i
   characters of the input string ending at position k.  I.e.:

        p   p    ... p  = s       s        ... s
         i+1 i+2      m    k-m+i+1 k-m+i+2      k

   If i == 0 we are finished!  We found a match.  If i>0 and p  != s
                                                              i     k-m+i
   we are left with two cases.

   (a) If the rightmost occurence of the character s      in the pattern
                                                    k-m+1
       is p   , then as in case 1 we shift the keyword g positions to the
           i-g
       right so that p    and s      are aligned and continue by comparing
                      i-g      k-m+i
       p  with s   :
        m       k+g
                                       |
                                       V
               p        ... p      ... p
                1            i-g        m
        s  ... s        ... s      ... s    ... s
         1      k-m+g+1      k-m+i      k+g      n
                                       ^
                                       |
                                       j

       If p    is to the right of p , that is g < 0, then we should just
           i-g                     i
       shift the pattern one position to the right and continue matching by
       comparing p  with s   .
                  m       k+1

   (b) A longer shift than that obtained in the previous case may sometimes
       be possible.  If the suffix p   p    ... p  reoccurs as the sub-
                                    i+1 i+2      m
       string p     p     ...p    in the keyword and p  != p    (if there're
               i+1-g i+2-g    m-g                     i     i-g
       more than one occurence then take the rightmost one).  Then we may get
       a bigger shift than in the previous case by aligning
       p     p     ...p    above s       s       ...s  and continuing the
        i+1-g i+2-g    m-g        k-m+i+1 k-m+i+2    k
       search by comparing p  with s   :
                            m       k+g
                                                  |
                                                  V
               p        ... p        ... p    ... p
                1            i+1-g        m-g      m
        s  ... s        ... s        ... s    ... s    ... s
         1      k-m+g+1      k-m+i+1      k        k+g      n
                                                  ^
                                                  |
                                                  j

The details of this algorithm may be formulated as (in pseudocode):


        begin
          j := m
          while j <= n do
          begin
            i := m
            while i > 0 cand p == s  do
                              i    j
              begin
                i := i - 1; j := j - 1
              end
            if i := 0 then return "yes"
            else j := j + max(d [s ], d [i])
                               1  j    2
          end
          return "no"
        end


The `cand' is a conditional and, aka a shortcut and, or the && operator in C.

The algorithm uses two tables d  and d  in determining how far the keyword
                               1      2
should be shifted to the right when p  != s .  The first table is indexed by
                                     i     j
characters.  For every character c occuring in p, d [p] is m-i, where i is
                                                   1
the largest i such that p  == c; for all c *not* occuring in p, d [p] is m.
                         i                                       1
This table takes care of cases 1 and 2a.

The second table takes care of case 2b.  This table is indexed by positions
in the keyword.  For each 1 <= i <= m, d [i] gives the minimum shift g such
                                        2
that when we position p  above s   , the substring p     p     ...p    of
                       m        k+g                 i+1-g i+2-g    m-g
the keyword matches the substring s       s       ...s  of the input string,
                                   k-m+i+1 k-m+i+2    k
under the assumption that p  did not match s     .
                           i                k-m+i

Formally:

        d [i] = min{g+m-i | g >= 1 and (g >= i or p    != p )
         2                                         i-g     i
                            and ((g >= k or p    == p ) for i < k <= m)}
                                             k-g     k

To compute this table use the following algorithm due to Donald Knuth:


        begin
          for i := 1 to n do d [i] := 2m - i
                              2
          j := m; k := m + 1
          while j > 0 do
          begin
            f[j] := k
            while k <= m cand p  != p  do
                               j     k
            begin
              d [k] := min(d [k], m - j)
               2            2
              k := f[k]
            end
            j := j - 1; k := k - 1
          end
          for i := 1 to k do d [i] := min(d [i], m + k - i)
                              2            2
          j := f[k]
          while k <= m do
          begin
            while k <= j do
            begin
              d [k] := min(d [k], j - k + m)
               2            2
              k := k + 1
            end
            j := f[j]
          end
        end


The intermediate table f computed by the algorithm has the property that
f[m] == m + 1 and for 1 <= j < m,

        f[j] == min{i | j < i <= m and p   p   ...p  == p   p   ...p     }
                                        i+1 i+2    m     j+1 j+2    m+j-i


So much for the description of the algorithm.  There are at least one
caveat when you try to implement this algorithm in C.  All arrays in the
description has a first element index of 1.  I have not had the time to
convert the algorithm to the C standard, yet.

As for the performance of the algorithm it is a theorem that it solves
our problem in O(m+n) time and O(m) space.  This is the worst case behavior,
though, and the average behavior is much better.  In average the algorithm
need only compare about n/m input string characters.  Compare this to
the obvious brute-force algorithm, where you would compare *all* characters
of the input string.

As a nice example of how finding a better algorithm is much more important
than trying to handoptimize code it has been experimentally observed how
the Boyer-Moore algorithm outperformed simpler algorithms that were implemen-
ted to take advantage of special-purpose machine instructions.

The description above is adapted from A.V. Aho ``Algorithms for Finding
Patterns in Strings'', pp. 267--271 in ``Algorithms and Complexity'' Vol. A
of ``Handbook of Theoretical Computer Science''.

As for C source code, I have just finished a small program today that
uses the Boyer-Moore algorithm.  If I get the time I'll extract the Boyer-
Moore parts and post the source, if there's interest.  Otherwise you might
want to look at the GNU regex.c module, which includes code for the Boyer-
Moore algorithm.

Hope this helps :-)

--
.------------------------------------------------------------------------------.
|  Greetings from Per Bojsen.                                                  |
+------------------------------+-----------------------------------------------+
|  EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of      |
|     Or: bojsen@dc.dth.dk     |  conjuring images of places we have not seen" |
`------------------------------+-----------------------------------------------'