Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!think.com!zaphod.mps.ohio-state.edu!unix.cis.pitt.edu!dsinc!bagate!cbmvax!cbmehq!cbmdeo!lenler!moria!bojsen From: bojsen@moria.uucp (Per Bojsen) Newsgroups: comp.sys.amiga.programmer Subject: Re: The Boyer-Moore string searching algorithm Message-ID: <194e4efd.ARN377f@moria.uucp> Date: 15 Jun 91 21:00:29 GMT References: <492@regina.uregina.ca> <19453c80.ARN360b@moria.uucp> <1991Jun10.140028.9664@ira.uka.de> <1947e763.ARN36eb@moria.uucp> <1991Jun12.143556.4127@ira.uka.de> Reply-To: cbmehq!moria!bojsen Followup-To: comp.sys.amiga.programmer Organization: IDUN-Soft Aps. Lines: 51 In article <1991Jun12.143556.4127@ira.uka.de>, |S| Angela Schmidt writes: > In <1947e763.ARN36eb@moria.uucp> bojsen@moria.uucp writes: > > > I'm sorry, but I believe Boyer-Moore is O(m+n) like KMP. But Boyer-Moore > > has a much better average behavior than KMP. For sufficiently large > > alphabets Boyer-Moore only needs to examin about n/m characters of the > > input string on the average. > > > Sorry, but I'm *very* shure it has O(n*m) since - in the worst case - it > always tries to match the whole string and only recognizes at the last > character, that there is a mismatch. Doing this n times causes O(m*n). ~~~~~~~~~~~~~~~~~~ This is the fallacy in your argument! We don't need to do this n times, but rather just about n/m times giving a behavior of approximately O(n) in this case. See below: > aaaaaaaa String you search in called A > baaa String you search for called B > ^ means that the character matches > ! means that the character does not match > > Strings Characters that will be checked Number of checks > --------------------------------------------------------------- > aaaaaaaa > baaa > !^^^ A[3] A[2] A[1] - mismatch at A[0] m checks in 1st round > > aaaaaaaa > baaa A[4] A[3] A[2] - mismatch at A[1] m checks in 2nd round > !^^^ > Nope! The Boyer-Moore algorithm will skip the first m characters of the input string entirely at this point: aaaaaaaa baaa A[8] A[7] A[6] - mismatch at A[4] m checks in 2nd round !^^^ If you read my article you would know why it is possible to do this. The reason is that there cannot possibly be a match starting from any of the first four characters of the input string since the `b' does not occur anywhere among the last three characters of the search string. -- .------------------------------------------------------------------------------. | Greetings from Per Bojsen. | +------------------------------+-----------------------------------------------+ | EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of | | Or: bojsen@dc.dth.dk | conjuring images of places we have not seen" | `------------------------------+-----------------------------------------------'