Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!dali.cs.montana.edu!uakari.primate.wisc.edu!sdd.hp.com!wuarchive!uunet!bionet!lhc!ncifcrf!fcs260c2!toms
From: toms@fcs260c2.ncifcrf.gov (Tom Schneider)
Newsgroups: bionet.molbio.gene-org
Subject: Re: (none)
Message-ID: <2148@fcs280s.ncifcrf.gov>
Date: 6 May 91 18:37:10 GMT
References: <9105021738.AA02234@genbank.bio.net>
Sender: news@ncifcrf.gov
Organization: NCI Supercomputer Facility, Frederick, MD
Lines: 31

In article <9105021738.AA02234@genbank.bio.net> UDAA420@hazel.cc.kcl.ac.uk writes:

>	I am looking for a publically accessible nucleotide database that has
>been screened to eliminate any duplications/redundancy.

GenBank itself should be doing this, and has been saying for years that it
would, BUT IT HAS FAILED TO DO IT.  If everybody in the biological community
made enough noise about this, maybe GenBank would actually do it!

In the early years, before GenBank, many people entered sequences independently
of everyone else.  PhiX174 was probably entered a hundred times worldwide.
Finally GenBank came into existence and this redundancy was removed.

Today, everyone who wants to do a statistical analysis of the database must
deal with the huge redundancy in the database, but little to nothing is being
done to eliminate this.  People talk about keeping the original sequences
and creating a 'view' of the data which is merged, but the actual work
to do this is rarely done.
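
To make the 'view' idea concrete, here is a minimal sketch in Python.  It is
only an illustration under stated assumptions: the (accession, sequence) record
layout and the name merged_view are hypothetical, not any real GenBank format
or tool, and it merges only entries whose sequences are exactly identical.

```python
from collections import defaultdict

def merged_view(records):
    """Build a merged view over (accession, sequence) pairs.

    The record layout is a hypothetical one chosen for illustration.
    The original records are not modified; the view only groups them.
    """
    groups = defaultdict(list)
    for accession, sequence in records:
        # Normalize case and strip whitespace so trivial formatting
        # differences do not hide an exact duplicate.
        key = "".join(sequence.split()).upper()
        groups[key].append(accession)
    # One row per distinct sequence: (first accession seen, sequence,
    # list of every accession that carries that sequence).
    return [(accs[0], seq, accs) for seq, accs in groups.items()]

if __name__ == "__main__":
    demo = [("A1", "acgt"), ("B2", "ACGT"), ("C3", "ttga")]
    for row in merged_view(demo):
        print(row)
```

Exact matching is the easy part.  Overlapping fragments, reverse complements
and entries that differ by a few bases would need real alignment work, which
is where the actual effort lies.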

Fortunately, Kenn Rudd (rudd@bio.nlm.nih.gov) is working on a merged E. coli
database.  Bravo Kenn!  We need much more effort along these lines to avoid
being flooded with this problem.

>Phil Cunningham
>King's College London

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov