Path: utzoo!utgpu!water!watmath!clyde!rutgers!umn-d-ub!uwvax!husc6!cca!g-rh From: g-rh@cca.CCA.COM (Richard Harter) Newsgroups: comp.unix.questions Subject: Re: Unix diff(1) algorithm Summary: Probably more than you wanted to know Message-ID: <23225@cca.CCA.COM> Date: 13 Jan 88 19:23:29 GMT References: <11229@brl-adm.ARPA> <7068@brl-smoke.ARPA> Reply-To: g-rh@CCA.CCA.COM.UUCP (Richard Harter) Organization: Computer Corp. of America, Cambridge, MA Lines: 65 In article <7068@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) ) writes: >In article <11229@brl-adm.ARPA> @RELAY.CS.NET,@humus.huji.ac.IL:amoss@batata.bitne (Amos Shapira) writes: >>Does any one knows where to find a description of the algorithm >>used in Unix diff(1)? > >It is described in Bell Labs CSTR #41, "An Algorithm for Differential >File Comparison" by J. W. Hunt and M. D. McIlroy (June 1976). >Unfortunately this seems to be out of print. I think there were some >errors in the report, too. I attempted to rewrite "diff" once, and >found some minor speed improvements that could be made to the stock >UNIX version. The comments in the UNIX version were fairly accurate >and complete, and they're probably more useful than the CSTR anyway. > >I seem to recall a CACM article some time in the past several years >that presented a supposedly better diff algorithm, and it may have >also discussed the UNIX diff algorithm. I used to have a copy of the Bell report, which of course had disappeared when I had occasion to actually write a diff program. My recollection is that the original is rather obscure. However the essential idea is fairly simple, and has four steps: (1) Hash each line in each file to produce an array of integers. (2) Sort the two arrays of hash integers, retaining pointers back to the original lines. (3) Compare the two lists and extract the longest common ascending subsequence. (4) Relate the l.a.s. back to the original files. There is a straightforward algorithm which is mathematically rather pretty for step three -- if the hash integers are all unique. The tricky part is what to do when there are duplicates, i.e. multiple occurences of the same text line in the file. The solution I used was to ignore the possibility of mis-assignment of duplicates during the las pass and do a gap fill afterwards; I don't know diff handles this. This was written as part of a larger program, so I don't have a comparison of 'mydiff' versus diff. However I did do timing analyses on 'mydiff'. It turns out that the critical part of the code is the sort. In this case, the arrays being sorted are uniformly distributed numbers, and a histogram math sort wins, since you can write an O(n) sort. The l.c.a.s code can be made O(n) is space and time. The gap fill code is very fast but is tricky to get right. The sort that I wrote, by the way, has a line of code that breaks the optimizer on the Masscomp C compiler (unless they have finally gotten a fix in on it.) It runs as follows for (i=0;i