Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!ub!galileo.cc.rochester.edu!rochester!cornell!ressler
From: ressler@CS.Cornell.EDU (Gene Ressler)
Newsgroups: comp.binaries.ibm.pc.d
Subject: Re: Need a program to delete duplicate lines of text
Keywords: duplicate, text, deletes, lines, non-document, ascii file
Message-ID: <1991Jun14.162117.23364@cs.cornell.edu>
Date: 14 Jun 91 16:21:17 GMT
References: <1991Jun14.012035.6708@disk.uucp> <1991Jun14.131006.22487@midway.uchicago.edu>
Sender: news@cs.cornell.edu (USENET news user)
Distribution: usa
Organization: Cornell Univ. CS Dept, Ithaca NY 14853
Lines: 32
Nntp-Posting-Host: cello.cs.cornell.edu

>For the past weeks I have been creating dictionary lists for a friend's word
>game (on his BBS).   I've been taking data files for various programs and 
>converting them to word lists.   Each word must be alone on its own line.  Often
>I have many duplicate words in every file.  I would love to find a program that
>will quickly delete all the duplicates.
...
This is a canonical example for pipes in several Unix texts.
You say

sort file | uniq > file_with_no_dups

Uniq is a very simple filter that looks at the current line and
prints it iff it differs from the previous line.  That is why the
sort comes first: it brings the duplicates together.
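
A small sketch of why the sort matters, using a made-up three-line
input.  The function name dedup_lines is only an illustration, not a
standard command:

```shell
# dedup_lines is a hypothetical helper name, not a standard command.
# Sort stdin so equal lines become adjacent, then let uniq drop the repeats.
dedup_lines() {
    sort | uniq
}

# Without the sort, the two "apple" lines are not adjacent and both survive:
printf 'apple\npear\napple\n' | uniq

# With the sort, only one "apple" remains:
printf 'apple\npear\napple\n' | dedup_lines
```

Most versions of sort also accept -u, which should give the same
result as sort | uniq in one step.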

Both sort (which is faster/nicer than DOS sort) and uniq are 
available in the gnuish MSDOS ports of GNU utilities by Thorsten
Ohl.  See wsmr-simtel20.army.mil,  pd1:<msdos.gnuish>.

An alternative to uniq, if you have awk (also available on 
simtel), is the following awk program:

{ if (NR == 1 || $0 != last) print $0; last = $0 }

Of course it's almost as easy in any other language.  Awk is
a wonderful tool for this sort of thing.  For instance, the program

{ for (i = 1; i <= NF; ++i) print $i }

will print each whitespace-separated word of the input on its own
line.  It's not much harder to strip punctuation, force to lower
case, or wrap each word in quotes and separate them with commas,
for that matter!
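
For instance, here is a rough sketch of the lower-case and
strip-punctuation variant.  It assumes an awk that has tolower() and
gsub() (gawk and other "new" awks do), and the name words_per_line is
made up for the example.  Note that it keeps only the letters a-z, so
"it's" comes out as "its":

```shell
# words_per_line is a hypothetical helper name used only for this sketch.
# Assumes an awk with tolower() and gsub() (POSIX awk, gawk, nawk).
words_per_line() {
    awk '{
        for (i = 1; i <= NF; ++i) {
            w = tolower($i)              # force to lower case
            gsub(/[^a-z]/, "", w)        # drop everything except letters a-z
            if (w != "") print w         # skip fields that were all punctuation
        }
    }'
}

# Split into words, then remove duplicates as before:
printf 'Hello, World!\nhello again.\n' | words_per_line | sort | uniq
```

Piping the result through sort and uniq gives the finished word list
with each word appearing once.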

Gene