Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!convex!convex.COM
From: tchrist@convex.COM (Tom Christiansen)
Newsgroups: comp.sources.wanted
Subject: Re: removing duplicate lines from a text file???
Keywords: duplicate lines sort uniq perl
Message-ID: <100886@convex.convex.com>
Date: 27 Mar 90 15:47:34 GMT
References: <1990Mar25.182039.25565@jarvis.csri.toronto.edu> <2309@network.ucsd.edu> <3081@auspex.auspex.com> <90Mar26.232441est.2199@smoke.cs.toronto.edu>
Sender: news@convex.com
Reply-To: tchrist@convex.COM (Tom Christiansen)
Organization: CONVEX Software Development, Richardson, TX
Lines: 43

In article <90Mar26.232441est.2199@smoke.cs.toronto.edu> moraes@cs.toronto.edu (Mark Moraes) writes:
>>>>    Is there any simple way to remove duplicate lines from a text file?
>>>    sort -u orig_file > new_file
>>Assuming, of course, that the order of the lines in the file isn't
>>important.
>
>In that case, perhaps something like
>
>awk '{printf "%8d %s\n", NR, $0}' | 
>	sort -u +1 | 
>	sort -n | 
>	sed 's/^.........//'
>
>assuming, of course, that none of the lines are longer than the
>maximum lengths your awk/sed can handle.

1.  This seems like truly massive overkill.
2.  That max line length thing can be a real bitch.

If your duplicate lines are adjacent, just use uniq(1).
If not, this seems much clearer and cheaper:

    perl -ne 'print unless $seen{$_}++;' 

If the duplicate lines ARE adjacent AND you don't have uniq(1) AND
you don't want to chew up as much memory as the previous one-liner
(which keeps every distinct line in %seen), do this:

    perl -ne 'print $last = $_ unless $_ eq $last;'

Perl doesn't have the silly arbitrary line-length restrictions 
of sed and awk, and the code is often much clearer: compare
the logic of the awk/sort/sort/sed example with that of the perl ones.

Followups either to comp.lang.perl or alt.religion.computers, 
depending on your religion. :-)


--tom
--

    Tom Christiansen                       {uunet,uiucdcs,sun}!convex!tchrist 
    Convex Computer Corporation                            tchrist@convex.COM
		 "EMACS belongs in <sys/errno.h>: Editor too big!"