Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!convex!convex.COM
From: tchrist@convex.COM (Tom Christiansen)
Newsgroups: comp.sources.wanted
Subject: Re: removing duplicate lines from a text file???
Keywords: duplicate lines sort uniq perl
Message-ID: <100886@convex.convex.com>
Date: 27 Mar 90 15:47:34 GMT
References: <1990Mar25.182039.25565@jarvis.csri.toronto.edu> <2309@network.ucsd.edu> <3081@auspex.auspex.com> <90Mar26.232441est.2199@smoke.cs.toronto.edu>
Sender: news@convex.com
Reply-To: tchrist@convex.COM (Tom Christiansen)
Organization: CONVEX Software Development, Richardson, TX
Lines: 43

In article <90Mar26.232441est.2199@smoke.cs.toronto.edu> moraes@cs.toronto.edu (Mark Moraes) writes:
>>>> Is there any simple way to remove duplicate lines from a text file?
>>> sort -u orig_file > new_file
>>Assuming, of course, that the order of the lines in the file isn't
>>important.
>
>In that case, perhaps something like
>
>awk '{printf "%8d %s\n", NR, $0}' |
>    sort -u +1 |
>    sort -n |
>    sed 's/^.........//'
>
>assuming, of course, that none of the lines are longer than the
>maximum lengths your awk/sed can handle.

1.  This seems like truly massive overkill.
2.  That max line length thing can be a real bitch.

If your duplicate lines are adjacent, just use uniq(1).
If not, this seems much clearer and cheaper:

    perl -ne 'print unless $seen{$_}++;'

If the duplicate lines ARE adjacent AND you don't have uniq(1)
AND you don't want to chew up as much memory as the previous
line, do this:

    perl -ne 'print $last = $_ unless $_ eq $last;'

Perl doesn't have the silly arbitrary line-length restrictions of
sed and awk, and the code is often much clearer: compare the logic
of the awk/sort/sort/sed example with that of the perl ones.

Followups either to comp.lang.perl or alt.religion.computers,
depending on your religion. :-)

--tom
--

    Tom Christiansen                       {uunet,uiucdcs,sun}!convex!tchrist
    Convex Computer Corporation                            tchrist@convex.COM
                 "EMACS belongs in : Editor too big!"