Path: utzoo!mnetor!uunet!husc6!bloom-beacon!gatech!psuvax1!gondor.cs.psu.edu!schwartz From: schwartz@gondor.cs.psu.edu (Scott E. Schwartz) Newsgroups: comp.unix.questions Subject: Re: Unix diff(1) algorithm Message-ID: <3238@psuvax1.psu.edu> Date: 22 Jan 88 19:42:01 GMT References: <11229@brl-adm.ARPA> <1571@aurora.UUCP> <3219@psuvax1.psu.edu> <23307@cca.CCA.COM> Sender: netnews@psuvax1.psu.edu Reply-To: schwartz@gondor.cs.psu.edu (Scott E. Schwartz) Organization: Penn State University Lines: 29 In article <23307@cca.CCA.COM> g-rh@CCA.CCA.COM.UUCP (Richard Harter) writes: >But, but, but... Finding the longest common subsequence is a linear >process. Or at least it is sufficiently close to being linear that >it makes no difference. I don't know how diff does it or fcomp does >it, but the algorithm that I use builds a ordered list of intermediate >subsequence heads. Each new item goes onto a list head. The trick to >making it quasi-linear is to do an interpolation search when placing >a new item into the list of intermediates. [This isn't really linear, >but, given the conditions, might as well be.] I don't know if finding the longest common subsequence is a linear process, in general. All of the published algorythms I've seen have some sort of dependency on the input data that stops one from making that claim. Which conditions are you referring to? Also, when you say quasi-linear, what does that really mean? Diff runs in O( (n+r) log n) time, where n is the size of the input, and r is the number of pairs of positions in which the the two sequences match (c.f. Hunt & Szymanski). >As I remarked earlier, my experience is that the time to find the >longest common subsequence is dominated by the sort time, even if a >linear sort if being used. (By about a factor of 4 to 1, as I recall.) >Am I missing something? Do diff and/or fcomp bypass the sort? Diff does a sort, so you are basically correct there. Fcomp does no sorting at all, and uses a completely different algorythm in any case. -- Scott Schwartz schwartz@gondor.cs.psu.edu