Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!relay.nswc.navy.mil!oasys!mimsy!mimsy.umd.edu!mpf From: mpf@triplea.cs.umd.edu (Martin Farach) Newsgroups: comp.theory Subject: Re: Longest repeated substring problem. Message-ID: Date: 16 Apr 91 19:06:56 GMT References: <70943@eerie.acsu.Buffalo.EDU> <1991Apr15.200508.757@bingvaxu.cc.binghamton.edu> <1991Apr16.154935.21123@bingvaxu.cc.binghamton.edu> Sender: news@mimsy.umd.edu Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742 Lines: 79 In-reply-to: kym@bingvaxu.cc.binghamton.edu's message of 16 Apr 91 15:49:35 GMT In article <1991Apr16.154935.21123@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes: In article mpf@triplea.cs.umd.edu (Martin Farach) writes: > In article <70943@eerie.acsu.Buffalo.EDU> ajay@cs.Buffalo.EDU ( Ajay Shekhawat ) writes: > > > >Sometime ago I stumbled across this problem: > >Given a string, find the longest repeated substring of > >that string. Can this problem be solved in time better > >than O(n^2) ( where n = length of the string ) ? > > > >Any references would be appreciated. >Actually, it can done in linear time whether or not it is consecutive >repetitions. Just build a suffix tree. Please elucidate. -kym OK. In: @Article{W-73, author = "P. Weiner", title = "Linear Pattern Matching Algorithm", journal = "Proc. 14 IEEE Symposium on Switching and Automata Theory", year = 1973, pages = "1-11" } or, more legibly in: @InCollection{CS-85, author = "M. T. Chen and J. Seiferas", title = "Efficient and Elegant Subword Tree Construction", booktitle = "Combinatorial Algorithms on Words", publisher = "NATO ASI Series F: Computer and System Sciences", year = 1985, editor = "A. Apostolico and Z. Galil", chapter = 12, pages = "97-107" } there is an algorithm for constructing the suffix tree of a string of lenght n in O(n) time (actually, this assumes a finite alphabet - for an unbounded alphabet S, it takes time O(n log S)). In a suffix tree, each node has a string associated it. The string s(n) of node n is defined as follows. If v is the least common ancestor of n and m, then s(v) is the longest common prefix of s(n) and s(m). Furthermore, for any suffix of the orginal string, there is a leaf whose string is that suffix. Now consider the longest repeated substring, r, of a string S. Since the string is repeated (and maximal), there will be some node n in S's suffix tree such that s(n) = r. Furthermore, for any (internal) node v in S's suffix tree |s(v)| \leq |s(n)| = |r|. So you can build the suffix tree in linear time and then scan through the internal nodes to find which one has the longest associated string - also in linear time. For finding longest _consecutive_ repeated sequence (note that this is usually called a square), you need a different algorithm. It can also be done in linear time, further, in time O(n log n) you can find *all* the squares of a string: @ARTICLE{AP-83, AUTHOR = "A. Apostolico and F.P. Preparata", TITLE = "Optimal Off-line Detection of Repetitions in a String", JOURNAL = "Theoretical Computer Science", YEAR = 1983, VOLUME = 22, PAGES = "297-315", } -- Martin Farach mpf@cs.umd.edu University of Maryland "Triviality and certainty are the Department of Computer Science kinderkrankheiten of knowledge." College Park, Maryland 20742 -Imre Lakatos