Xref: utzoo comp.unix.questions:26593 comp.lang.perl:2809
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!zaphod.mps.ohio-state.edu!maverick.ksu.ksu.edu!rutgers!cmcl2!kramden.acf.nyu.edu!brnstnd
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
Newsgroups: comp.unix.questions,comp.lang.perl
Subject: Re: Need help ** removing duplicate rows **
Summary: sort (-m) -u -t: +0 -1 +2. Why bother with Perl?
Message-ID: <28220:Oct3105:18:3290@kramden.acf.nyu.edu>
Date: 31 Oct 90 05:18:32 GMT
References: <1990Oct30.234654.23547@agate.berkeley.edu> <1990Oct31.003627.641@iwarp.intel.com> <10182@jpl-devvax.JPL.NASA.GOV>
Organization: IR
Lines: 39

In article <10182@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
> In article <1990Oct31.003627.641@iwarp.intel.com> merlyn@iwarp.intel.com (Randal Schwartz) writes:
> : In article <1990Oct30.234654.23547@agate.berkeley.edu>, c60b-3ac@web (Eric Thompson) writes:
  [ if multiple (consecutive?) rows of colon-separated columns ]
  [ have the same second column, scrap 'em ]
> : perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'
> : Fast enough?
  [ as happens with every Perl program posted to the net, Larry points ]
  [ out how inefficient this can be: ]
> Maybe, but he said they were very long files, and that may mean more than
> you'd want to store in an associative array, even with virtual memory.
> Presuming the files are sorted reasonably, you can get away with this:
> perl -ne '($this = $_) =~ s/:[^:]*//; print if $this ne $that; $that = $this'

That does look like what Eric was asking for, but what if the file is
not sorted? Is there a fast Perl solution?

> Of course, someone will post a solution using cut and uniq, which will be
> fine if you don't mind losing the second field. Or swapping the first
> two fields around.

cut? uniq? Why? There's already a tool perfectly matched to the job:

  sort -u -t: +0 -1 +2

sort already knows how to work in limited memory.
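
For anyone whose sort rejects the old +pos key notation, the same keys can
be spelled with -k. This is only a minimal sketch, assuming a POSIX sort; the
five-row sample and the variable names are made up for illustration and are
not from Eric's files:

```shell
#!/bin/sh
# Hypothetical sample: colon-separated rows. The two "a" rows and the two
# "b" rows differ only in their second field.
sample='b:x:2
a:one:1
a:two:1
b:y:2
c:z:3'

# -k1,1 is the -k spelling of "+0 -1" (field 1 only);
# -k3   is the -k spelling of "+2"    (field 3 through end of line).
# -u keeps one row per distinct (field 1, field 3..) key.
deduped=$(printf '%s\n' "$sample" | sort -u -t: -k1,1 -k3)

printf '%s\n' "$deduped"
```

This should print three rows, one for each distinct pair of first and third
fields. Which second field survives within a set of duplicates is up to the
sort implementation, so don't rely on it.
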
If the input is already sorted,

  sort -m -u -t: +0 -1 +2

should do the trick. Both of these solutions are easy to figure out, easy
to type, very fast even on long files, and quite portable.

> I'll leave the awk and sed solutions to someone else.

Yes, I seem to always be defending the classic tools against this
onslaught of Perl code that nobody but you can ever optimize.

---Dan