Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!relay.nswc.navy.mil!oasys!mimsy!mimsy.umd.edu!mpf
From: mpf@triplea.cs.umd.edu (Martin Farach)
Newsgroups: comp.theory
Subject: Re: Longest repeated substring problem.
Message-ID: <MPF.91Apr16150656@triplea.cs.umd.edu>
Date: 16 Apr 91 19:06:56 GMT
References: <70943@eerie.acsu.Buffalo.EDU>
	<1991Apr15.200508.757@bingvaxu.cc.binghamton.edu>
	<MPF.91Apr15210650@triplea.cs.umd.edu>
	<1991Apr16.154935.21123@bingvaxu.cc.binghamton.edu>
Sender: news@mimsy.umd.edu
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 79
In-reply-to: kym@bingvaxu.cc.binghamton.edu's message of 16 Apr 91 15:49:35 GMT

In article <1991Apr16.154935.21123@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:

   In article <MPF.91Apr15210650@triplea.cs.umd.edu> mpf@triplea.cs.umd.edu (Martin Farach) writes:
   >   In article <70943@eerie.acsu.Buffalo.EDU> ajay@cs.Buffalo.EDU ( Ajay Shekhawat ) writes:
   >   >
   >   >Sometime ago I stumbled across this problem: 
   >   >Given a string, find the longest repeated substring of
   >   >that string. Can this problem be solved in time better
   >   >than O(n^2) ( where  n = length of the string ) ?
   >   >
   >   >Any references would be appreciated.
   >Actually, it can done in linear time whether or not it is consecutive
   >repetitions.  Just build a suffix tree.

   Please elucidate.

   -kym

OK.  In:

@Article{W-73,
  author = 	"P. Weiner",
  title = 	"Linear Pattern Matching Algorithm",
  journal = 	"Proc. 14 IEEE Symposium on Switching and Automata Theory",
  year = 	1973,
  pages = 	"1-11"
}

or, more legibly in:

@InCollection{CS-85,
  author = 	"M. T. Chen and J. Seiferas",
  title = 	"Efficient and Elegant Subword Tree Construction",
  booktitle = 	"Combinatorial Algorithms on Words",
  publisher = 	"NATO ASI Series F: Computer and System Sciences",
  year = 	1985,
  editor = 	"A. Apostolico and Z. Galil",
  chapter = 	12,
  pages = 	"97-107"
}


there is an algorithm for constructing the suffix tree of a string of
lenght n in O(n) time (actually, this assumes a finite alphabet - for
an unbounded alphabet S, it takes time O(n log S)).  In a suffix tree,
each node has a string associated it.  The string s(n) of node n is
defined as follows.  If v is the least common ancestor of n and m,
then s(v) is the longest common prefix of s(n) and s(m).  Furthermore,
for any suffix of the orginal string, there is a leaf whose string is
that suffix.  

Now consider the longest repeated substring, r, of a string S.  Since
the string is repeated (and maximal), there will be some node n in S's
suffix tree such that s(n) = r.  Furthermore, for any (internal) node
v in S's suffix tree |s(v)| \leq |s(n)| = |r|.  So you can build the
suffix tree in linear time and then scan through the internal nodes to
find which one has the longest associated string - also in linear
time.

For finding longest _consecutive_ repeated sequence (note that this is
usually called a square), you need a different algorithm.  It can also
be done in linear time, further, in time O(n log n) you can find *all*
the squares of a string:

@ARTICLE{AP-83,
	AUTHOR = "A. Apostolico and F.P. Preparata",
	TITLE =  "Optimal Off-line Detection of Repetitions in a String",
	JOURNAL = "Theoretical Computer Science",
	YEAR = 1983,
	VOLUME = 22,
	PAGES = "297-315",
}

--
Martin Farach			mpf@cs.umd.edu
University of Maryland		"Triviality and certainty are the
Department of Computer Science	 kinderkrankheiten of knowledge."
College Park, Maryland 20742			  -Imre Lakatos