Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!ub!galileo.cc.rochester.edu!rochester!cornell!ressler
From: ressler@CS.Cornell.EDU (Gene Ressler)
Newsgroups: comp.binaries.ibm.pc.d
Subject: Re: Need a program to delete duplicate lines of text
Keywords: duplicate, text, deletes, lines, non-document, ascii file
Message-ID: <1991Jun14.162117.23364@cs.cornell.edu>
Date: 14 Jun 91 16:21:17 GMT
References: <1991Jun14.012035.6708@disk.uucp> <1991Jun14.131006.22487@midway.uchicago.edu>
Sender: news@cs.cornell.edu (USENET news user)
Distribution: usa
Organization: Cornell Univ. CS Dept, Ithaca NY 14853
Lines: 32
Nntp-Posting-Host: cello.cs.cornell.edu

>For the past weeks I have been creating dictionary lists for a friend's word
>game (on his BBS). I've been taking data files for various programs and
>converting them to word lists. Each word must be alone on its own line. Often
>I have many duplicate words in every file. I would love to find a program that
>will quickly delete all the duplicates. ...

This is a canonical example for pipes in several Unix texts. You say

    sort file | uniq > file_with_no_dups

Uniq is a very simple filter that looks at the current line and prints
it iff it differs from the previous line. It only catches duplicates
that are adjacent, which is why the file goes through sort first.
Both sort (which is faster/nicer than DOS sort) and uniq are available
in the gnuish MSDOS ports of gnu utilities by Thorsten Ohl.
See wsmr-simtel20.army.mil, pd1:.

An alternative to uniq, if you have awk (also available on simtel),
is the following awk program:

    { if (NR == 1 || $0 != last) print $0; last = $0 }

Of course it's almost as easy in any other language.

Awk is a wonderful tool for this sort of thing. For instance,
the program

    { for (i = 1; i <= NF; ++i) print $i }

will print the words (whitespace-separated fields) one per line.
It's not much harder to strip punctuation, force to lower case,
or surround with quotes and comma separate for that matter!

Gene
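
P.S. Here is a rough sketch of the "strip punctuation, force to lower case" variant, glued to the sort | uniq pipe. It is only one way to do it, under a few assumptions: your awk has tolower() and gsub() (gawk and other POSIX awks do; very old awks may not), the words are plain ASCII, and anything that is not a letter counts as a separator, so "friend's" comes out as "friend" and "s". The name wordlist is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin, write a sorted word list on
# stdout, one lower-case word per line, with duplicates removed.
# Assumes a POSIX awk (tolower, gsub) and ASCII input.
# Usage: wordlist < raw_text > file_with_no_dups
wordlist() {
    LC_ALL=C awk '{
        line = tolower($0)              # force to lower case
        gsub(/[^a-z]+/, " ", line)      # every run of non-letters becomes one blank
        n = split(line, w, " ")         # split on blanks, ignoring leading/trailing ones
        for (i = 1; i <= n; ++i)        # one word per output line
            print w[i]
    }' | LC_ALL=C sort | uniq           # sort so duplicates are adjacent, then drop them
}
```

LC_ALL=C is there only to keep the letter range and the sort order predictable from one machine to the next. Drop it if you want your local collating order.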