Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!dali.cs.montana.edu!uakari.primate.wisc.edu!sdd.hp.com!wuarchive!uunet!bionet!lhc!ncifcrf!fcs260c2!toms
From: toms@fcs260c2.ncifcrf.gov (Tom Schneider)
Newsgroups: bionet.molbio.gene-org
Subject: Re: (none)
Message-ID: <2148@fcs280s.ncifcrf.gov>
Date: 6 May 91 18:37:10 GMT
References: <9105021738.AA02234@genbank.bio.net>
Sender: news@ncifcrf.gov
Organization: NCI Supercomputer Facility, Frederick, MD
Lines: 31

In article <9105021738.AA02234@genbank.bio.net> UDAA420@hazel.cc.kcl.ac.uk writes:
> I am looking for a publicly accessible nucleotide database that has
>been screened to eliminate any duplications/redundancy.

GenBank itself should be doing this, and has been saying for years that
it would, BUT THEY HAVE FAILED TO DO IT. If everybody in the biological
community made enough noise about this, maybe they would actually do it!

In the early years, before GenBank, many people entered sequences
independently of everyone else. PhiX174 was probably entered a hundred
times worldwide. Finally GenBank came into existence and this redundancy
was removed. Today, everyone who wants to do a statistical analysis of
the database must deal with the huge redundancy in it, but little to
nothing is being done to eliminate it. People talk about keeping the
original sequences and creating a merged 'view' of the data, but the
actual work to do this is rarely done.

Fortunately, Kenn Rudd (rudd@bio.nlm.nih.gov) is working on a merged
E. coli database. Bravo Kenn! We need much more effort along these
lines to avoid being swamped by this problem.

>Phil Cunningham
>King's College London

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland 21702-1201
  toms@ncifcrf.gov