Path: utzoo!utgpu!water!watmath!clyde!bellcore!faline!ulysses!allegra!princeton!udel!rochester!bbn!husc6!cca!g-rh From: g-rh@cca.UUCP Newsgroups: comp.unix.questions Subject: Re: Unix diff(1) algorithm Message-ID: <23568@cca.CCA.COM> Date: 23 Jan 88 08:02:22 GMT References: <11229@brl-adm.ARPA> <1571@aurora.UUCP> <3219@psuvax1.psu.edu> <23307@cca.CCA.COM> <3238@psuvax1.psu.edu> Reply-To: g-rh@CCA.CCA.COM.UUCP (Richard Harter) Organization: Computer Corp. of America, Cambridge, MA Lines: 109 Posted: Sat Jan 23 03:02:22 1988 In article <3238@psuvax1.psu.edu> schwartz@gondor.cs.psu.edu (Scott E. Schwartz) writes: >In article <23307@cca.CCA.COM> g-rh@CCA.CCA.COM.UUCP (Richard Harter) writes: >>But, but, but... Finding the longest common subsequence is a linear >>process. Or at least it is sufficiently close to being linear that >>it makes no difference. I don't know how diff does it or fcomp does >>it, but the algorithm that I use builds a ordered list of intermediate >>subsequence heads. Each new item goes onto a list head. The trick to >>making it quasi-linear is to do an interpolation search when placing >>a new item into the list of intermediates. [This isn't really linear, >>but, given the conditions, might as well be.] > >I don't know if finding the longest common subsequence is a linear process, >in general. All of the published algorythms I've seen have some sort of >dependency on the input data that stops one from making that claim. > >Which conditions are you referring to? Also, when you say quasi-linear, >what does that really mean? Diff runs in O( (n+r) log n) time, where >n is the size of the input, and r is the number of pairs of positions >in which the the two sequences match (c.f. Hunt & Szymanski). > Since what I use may not be the same as diff, and may be an unpublished algorithm, I will run through the steps and take it from there. (1) Hash each line in each file into an integer to produce two arrays of integers, representing the files. Also create an index array for each file array. (2) Sort the file arrays by permuting the index arrays. (3) Go through the index arrays to get two arrays of matches. (4) Sort the match list for the old file, and permute the match list for the new file to correspond. (5) Extract the longest ascending subsequence of the match list for the new file. This is the longest common subsequence of the two files. (6) Make a backfill pass to account for identical lines that were misaligned. (7) Convert the l.c.s to difference format. The step in question is step five, which is a permuted array of indices with some integers missing. This is a little tricky, so I am going to lift the documentation of the routine that does it. "The algorithm runs as follows: Suppose that, for a given sequence, we have determined ascending subsequences of all possible lengths such that each is headed by the largest possible value for subsequences of that length. Now consider a new sequence formed by adding a new value to the head of the first sequence. There are three possible case -- the new value .le. the head of the longest subsequence; it is .gt. that the head of the longest but .le. than some shorter subsequence, or it is .gt. the heads of all subsequences. In the first case the new subsequence has a subsequence which is one value longer than the longest subsequence, headed by the added value. In the second case let k be the length of the subsequence such that the new value is .le. its head but .gt. than the next longer subsequence. In this case the new subsequence of length k+1 headed by the new value and followed by the old subsequence of length k. In the final case the new value is a new 'best' sequence of length one." There is more, but that's the key. There are various ingenuities which are not relevant here. The point is that the algorithm constructs an ordered array of list heads, all data being integers in the range 1 to n. At each step you get a new value which must be placed in the array; if it is smaller it extends the array, otherwise it replaces the value it is just larger than. The factor that is important is the time it takes to locate the right position for the new item. This particular code checks to see if the item is a new longest or a new 'best' of length one. (If the files match each item is an extension). In the general case one has to do a search. The code uses an interpolation search. Since the values are integers in the range 1...n the search is fast. This is why I said quasi-linear. The task of determining the actual O() is not trivial. I don't know how close this is to the algorithm that diff uses. The length of the array of subsequence heads ends up being the number of matches (by position, not by value). The n is the number of matches by value rather than the file length(s). If you used a general purpose binary seach, you would get O(n *log(r)) for the l.a.s. algorithm. This suggests that diff is, you should excuse the expression, doing it differ- ently. This code actually uses three sorts -- one on each of the hash arrays and one on the match indices of the old file. In this situation an O(n) math sort is feasible. Let n_o be the old file length, n_n the new file length, n_v the number of matches before order is taken into account, and r the number of matches after order is taken into account. If the search is O(n*log(log(r)), which I suspect it is, the time is O(n_o) +O(n_n) + O(n_v) + O(n_v*log(log(r)). In principle the latter factor dominates. In practice, the sort times dominate. >Diff does a sort, so you are basically correct there. Fcomp does no >sorting at all, and uses a completely different algorythm in any case. Color me curious. I understand l.c.s difference schemes and sliding window schemes. Can you give a quick summary of the idea of fcomp? -- In the fields of Hell where the grass grows high Are the graves of dreams allowed to die. Richard Harter, SMDS Inc.