Xref: utzoo comp.unix.questions:26592 comp.lang.perl:2808
Path: utzoo!utgpu!watserv1!watmath!att!att!emory!wuarchive!usc!elroy.jpl.nasa.gov!jpl-devvax!lwall
From: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall)
Newsgroups: comp.unix.questions,comp.lang.perl
Subject: Re: Need help ** removing duplicate rows **
Message-ID: <10182@jpl-devvax.JPL.NASA.GOV>
Date: 31 Oct 90 01:26:06 GMT
References: <1990Oct30.234654.23547@agate.berkeley.edu> <1990Oct31.003627.641@iwarp.intel.com>
Reply-To: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall)
Organization: Jet Propulsion Laboratory, Pasadena, CA
Lines: 38

In article <1990Oct31.003627.641@iwarp.intel.com> merlyn@iwarp.intel.com (Randal Schwartz) writes:
: In article <1990Oct30.234654.23547@agate.berkeley.edu>, c60b-3ac@web (Eric Thompson) writes:
: | I have a few very long files that contain rows of ASCII data. Each row
: | looks something like this (not the actual data here):
: |
: | a:A:b:c:d:e:f:g:h:i:j:k:l:m
: | a:B:b:c:d:e:f:g:h:i:j:k:l:m
: | a:C:b:c:d:e:f:g:h:i:j:k:l:m
: | a:D:b:c:d:e:f:g:h:i:j:k:l:m
: | b:A:n:o:p:q:s:t:u:v:w:x:y:z
: | c:A:x:a:x:b:x:c:d:a:m:l:v:x
: | d:A:m:l:k:j:i:h:g:f:e:d:c:b
: | d:B:m:l:k:j:i:h:g:f:e:d:c:b
: | d:C:m:l:k:j:i:h:g:f:e:d:c:b
: |
: | It's the second column that's important. If there are multiple rows that
: | are exactly the same except for the second column, I want to GET RID of them.
: | If the row is unique (for example, the ones starting with "b" and "c" above)
: | then it should stay. Sounds like what I need is a way to filter out rows
: | that are duplicate except in the second column.
:
: A one-liner in Perl:
:
: perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'
:
: Fast enough?

Maybe, but he said they were very long files, and that may mean more than
you'd want to store in an associative array, even with virtual memory.
Presuming the files are sorted reasonably, you can get away with this:

perl -ne '($this = $_) =~ s/:[^:]*//; print if $this ne $that; $that = $this'

Of course, someone will post a solution using cut and uniq, which will be
fine if you don't mind losing the second field. Or swapping the first two
fields around.

I'll leave the awk and sed solutions to someone else.

Larry