Xref: utzoo comp.unix.questions:26585 comp.lang.perl:2803
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!usc!ucsd!ucbvax!iwarp.intel.com!news
From: merlyn@iwarp.intel.com (Randal Schwartz)
Newsgroups: comp.unix.questions,comp.lang.perl
Subject: Re: Need help ** removing duplicate rows **
Message-ID: <1990Oct31.003627.641@iwarp.intel.com>
Date: 31 Oct 90 00:36:27 GMT
References: <1990Oct30.234654.23547@agate.berkeley.edu>
Sender: news@iwarp.intel.com
Reply-To: merlyn@iwarp.intel.com (Randal Schwartz)
Organization: Stonehenge; netaccess via Intel, Beaverton, Oregon, USA
Lines: 32
In-Reply-To: c60b-3ac@web.berkeley.edu (Eric Thompson)

In article <1990Oct30.234654.23547@agate.berkeley.edu>, c60b-3ac@web (Eric Thompson) writes:
| I have a few very long files that contain rows of ASCII data.  Each row
| looks something like this (not the actual data here):
| 
| a:A:b:c:d:e:f:g:h:i:j:k:l:m
| a:B:b:c:d:e:f:g:h:i:j:k:l:m
| a:C:b:c:d:e:f:g:h:i:j:k:l:m
| a:D:b:c:d:e:f:g:h:i:j:k:l:m
| b:A:n:o:p:q:s:t:u:v:w:x:y:z
| c:A:x:a:x:b:x:c:d:a:m:l:v:x
| d:A:m:l:k:j:i:h:g:f:e:d:c:b
| d:B:m:l:k:j:i:h:g:f:e:d:c:b
| d:C:m:l:k:j:i:h:g:f:e:d:c:b
| 
| It's the second column that's important.  If there are multiple rows that
| are exactly the same except for the second column, I want to GET RID of them.
| If the row is unique (for example, the ones starting with "b" and "c" above)
| then it should stay.  Sounds like what I need is a way to filter out rows
| that are duplicate except in the second column.

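A one-liner in Perl, which keeps the first row of each group and drops
the later rows that differ from it only in the second column: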

perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'

Fast enough?

print "Just another Perl hacker,"
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel put the 'backward' in 'backward compatible'..."=========/