Path: utzoo!utgpu!water!watmath!clyde!bellcore!faline!ulysses!allegra!princeton!udel!rochester!bbn!husc6!cca!g-rh
From: g-rh@cca.UUCP
Newsgroups: comp.unix.questions
Subject: Re: Unix diff(1) algorithm
Message-ID: <23568@cca.CCA.COM>
Date: 23 Jan 88 08:02:22 GMT
References: <11229@brl-adm.ARPA> <1571@aurora.UUCP> <3219@psuvax1.psu.edu> <23307@cca.CCA.COM> <3238@psuvax1.psu.edu>
Reply-To: g-rh@CCA.CCA.COM.UUCP (Richard Harter)
Organization: Computer Corp. of America, Cambridge, MA
Lines: 109
Posted: Sat Jan 23 03:02:22 1988

In article <3238@psuvax1.psu.edu> schwartz@gondor.cs.psu.edu (Scott E. Schwartz) writes:
>In article <23307@cca.CCA.COM> g-rh@CCA.CCA.COM.UUCP (Richard Harter) writes:
>>But, but, but...  Finding the longest common subsequence is a linear
>>process.  Or at least it is sufficiently close to being linear that
>>it makes no difference.  I don't know how diff does it or fcomp does
>>it, but the algorithm that I use builds a ordered list of intermediate
>>subsequence heads.  Each new item goes onto a list head.  The trick to
>>making it quasi-linear is to do an interpolation search when placing
>>a new item into the list of intermediates.  [This isn't really linear,
>>but, given the conditions, might as well be.]
>
>I don't know if finding the longest common subsequence is a linear process,
>in general.  All of the published algorythms I've seen have some sort of
>dependency on the input data that stops one from making that claim.
>
>Which conditions are you referring to?  Also, when you say quasi-linear,
>what does that really mean?  Diff runs in O( (n+r) log n) time, where
>n is the size of the input, and r is the number of pairs of positions 
>in which the the two sequences match (c.f. Hunt & Szymanski).
>
	Since what I use may not be the same as diff, and may be an
unpublished algorithm, I will run through the steps and take it from
there.

(1)	Hash each line in each file into an integer to produce two
arrays of integers, representing the files.  Also create an index
array for each file array.

(2)	Sort the file arrays by permuting the index arrays.

(3)	Go through the index arrays to get two arrays of matches.

(4)	Sort the match list for the old file, and permute the match
list for the new file to correspond.

(5)	Extract the longest ascending subsequence of the match list
for the new file.  This is the longest common subsequence of the two
files.

(6)	Make a backfill pass to account for identical lines that were
misaligned.

(7)	Convert the l.c.s to difference format.

	The step in question is step five, which is a permuted array
of indices with some integers missing.  This is a little tricky, so
I am going to lift the documentation of the routine that does it.

"The algorithm runs as follows:  Suppose that, for a given sequence,
we have determined ascending subsequences of all possible lengths
such that each is headed by the largest possible value for subsequences
of that length.  Now consider a new sequence formed by adding a new value
to the head of the first sequence.  There are three possible case --
the new value .le. the head of the longest subsequence; it is .gt.
that the head of the longest but .le. than some shorter subsequence,
or it is .gt. the heads of all subsequences.  In the first case the
new subsequence has a subsequence which is one value longer than the
longest subsequence, headed by the added value.  In the second case
let k be the length of the subsequence such that the new value is .le.
its head but .gt. than the next longer subsequence.  In this case the
new subsequence of length k+1 headed by the new value and followed
by the old subsequence of length k.  In the final case the new value
is a new 'best' sequence of length one."

	There is more, but that's the key.  There are various ingenuities
which are not relevant here.  The point is that the algorithm constructs
an ordered array of list heads, all data being integers in the range
1 to n.  At each step you get a new value which must be placed in the
array; if it is smaller it extends the array, otherwise it replaces
the value it is just larger than.

	The factor that is important is the time it takes to locate
the right position for the new item.  This particular code checks to
see if the item is a new longest or a new 'best' of length one.  (If
the files match each item is an extension).  In the general case one
has to do a search.  The code uses an interpolation search.  Since
the values are integers in the range 1...n the search is fast.  This
is why I said quasi-linear.  The task of determining the actual O() is
not trivial.

	I don't know how close this is to the algorithm that diff uses.
The length of the array of subsequence heads ends up being the number
of matches (by position, not by value).  The n is the number of matches
by value rather than the file length(s).  If you used a general purpose
binary seach, you would get O(n *log(r)) for the l.a.s. algorithm.  This
suggests that diff is, you should excuse the expression, doing it differ-
ently.

	This code actually uses three sorts -- one on each of the hash
arrays and one on the match indices of the old file.  In this situation
an O(n) math sort is feasible.  Let n_o be the old file length, n_n
the new file length, n_v the number of matches before order is taken
into account, and r the number of matches after order is taken into
account.  If the search is O(n*log(log(r)), which I suspect it is, the
time is  O(n_o) +O(n_n) + O(n_v) + O(n_v*log(log(r)).  In principle
the latter factor dominates.  In practice, the sort times dominate.


>Diff does a sort, so you are basically correct there.  Fcomp does no
>sorting at all, and uses a completely different algorythm in any case.

	Color me curious.  I understand l.c.s difference schemes and
sliding window schemes.  Can you give a quick summary of the idea of
fcomp?
-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.