Xref: utzoo comp.unix.questions:26593 comp.lang.perl:2809
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!zaphod.mps.ohio-state.edu!maverick.ksu.ksu.edu!rutgers!cmcl2!kramden.acf.nyu.edu!brnstnd
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
Newsgroups: comp.unix.questions,comp.lang.perl
Subject: Re: Need help ** removing duplicate rows **
Summary: sort (-m) -u -t: +0 -1 +2. Why bother with Perl?
Message-ID: <28220:Oct3105:18:3290@kramden.acf.nyu.edu>
Date: 31 Oct 90 05:18:32 GMT
References: <1990Oct30.234654.23547@agate.berkeley.edu> <1990Oct31.003627.641@iwarp.intel.com> <10182@jpl-devvax.JPL.NASA.GOV>
Organization: IR
Lines: 39

In article <10182@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
> In article <1990Oct31.003627.641@iwarp.intel.com> merlyn@iwarp.intel.com (Randal Schwartz) writes:
> : In article <1990Oct30.234654.23547@agate.berkeley.edu>, c60b-3ac@web (Eric Thompson) writes:
      [ if multiple (consecutive?) rows of colon-separated columns ]
      [ differ only in the second column, scrap the duplicates ]
> : perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'
> : Fast enough?
  [ as happens with every Perl program posted to the net, Larry points ]
  [ out how inefficient this can be: ]
> Maybe, but he said they were very long files, and that may mean more than
> you'd want to store in an associative array, even with virtual memory.
> Presuming the files are sorted reasonably, you can get away with this:
> perl -ne '($this = $_) =~ s/:[^:]*//; print if $this ne $that; $that = $this'

That does look like what Eric was asking for, but what if the file is
not sorted? Is there a fast Perl solution?

> Of course, someone will post a solution using cut and uniq, which will be
> fine if you don't mind losing the second field.  Or swapping the first
> two fields around. 

cut? uniq? Why? There's already a tool perfectly matched to the job:

  sort -u -t: +0 -1 +2

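The keys are the first field (+0 -1) and everything from the third
field on (+2), so the second field is ignored. sort already knows how
to work in limited memory. If the input is already sorted,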

  sort -m -u -t: +0 -1 +2

should do the trick. Both of these solutions are easy to figure out, easy
to type, very fast even on long files, and quite portable.
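
On a sort that takes the POSIX -k key syntax, +0 -1 +2 would be written
-k1,1 -k3. Here is a minimal sketch under that assumption. The function
name and the sample rows are made up for illustration, and the cut at
the end is only there to display the keys, since which second field
survives -u is up to the implementation:

```shell
# Hypothetical helper, not part of the original post.
# Keys: field 1 alone (-k1,1), then field 3 through end of line (-k3).
# -u keeps one line per distinct key; LC_ALL=C pins the collation order.
dedup_ignoring_field2() {
    LC_ALL=C sort -u -t: -k1,1 -k3 "$@"
}

# Sample rows: the two a:*:x rows differ only in the second field.
printf '%s\n' 'b:1:y' 'a:1:x' 'a:2:x' 'a:3:z' |
    dedup_ignoring_field2 |
    cut -d: -f1,3-    # show only the key fields
```

Four rows go in and three come out, one for each distinct pair of
first field and remainder.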

> I'll leave the awk and sed solutions to someone else.

Yes, I seem to always be defending the classic tools against this
onslaught of Perl code that nobody but you can ever optimize.

---Dan