Newsgroups: bionet.molbio.genbank
Path: utzoo!utgpu!lamoran
From: lamoran@gpu.utcs.utoronto.ca (L.A. Moran)
Subject: Quality of submitted data
Message-ID: <1990Aug15.204400.26622@gpu.utcs.utoronto.ca>
Organization: UTCS Public Access
Date: Wed, 15 Aug 90 20:44:00 GMT

David Kristofferson (kristoff@genbank.BIO.NET) writes:

"The basic fact which has been brought up by journal editors repeatedly is that the vast majority of reviewers who get a paper containing sequence data in hardcopy are not going to take the time to enter the data into a computer."

and Roy Smith (roy@alanine.phri.nyu.edu) replies:

"You seem to be implying that this is the fault of the reviewers, that they are not taking the time to do their job properly (or are implying that the editors are implying that)."

Ellen Murphy (murphy@phri.nyu.edu) also responded:

"Surely you are not suggesting that reviewers are expected to type sequences into their computers whenever they get a sequence paper to review? And to what end? Just to verify that what the author claims to be an ORF really is?"

If an incorrect sequence is published, the most guilty party is the author. In many cases it was impossible for the reviewer to recognize that the sequence was wrong. However, there are examples in the literature that clearly reflect incompetence on the part of the reviewer (IMHO). Allow me to present some case histories for discussion.

I. The sequence of a gene is published and aligned with that of an orthologous gene from another species. Many deletions and insertions are added to align the sequences. These include one and two base pair deletions in the coding region which destroy the reading frame. The authors do not mention that their gene could not possibly encode a homologous protein. (published in NAR)
CONCLUSION: the authors are stupid and the reviewers incompetent

II. The sequence of a gene is published with a transposition of a 75bp fragment from one part of the gene to another.
The figure showing the predicted amino acid sequence is correct. (published in PNAS)
CONCLUSION: the authors were careless, reviewers could have detected it

III. A new sequence is aligned with that of a homologous gene from another species and the alignment includes deletions and substitutions. A much better alignment can be obtained by assuming a small number of sequencing errors. (published in MCB)
CONCLUSION: the authors were careless, and so was the reviewer

IV. The sequence of a gene fragment is published and the authors recognize that it is closely related (probably orthologous) to a gene in another species. The gene is decoded from the first methionine codon in the available sequence because there is an upstream in-frame stop codon. The presumed initiation codon is clearly an internal methionine and the stop codon is a sequencing error. (published in NAR)
CONCLUSION: the authors were stupid and the reviewers were careless

V. A new sequence is published which is similar to one which is already in the database but the similarity is not noted. (published in JCB)
CONCLUSION: ?

VI. The complete sequence of a gene is published by a laboratory that has previously published a partial sequence of the same gene. There are several differences in the regions which overlap but these are not mentioned. (published in NAR)
CONCLUSION: ?

Now here's a hypothetical problem for you to grapple with. I have a database of many examples of a highly conserved gene. Assume that I receive a paper to review which includes a new sequence of one of these genes. From my analysis of the sequence I recognize that it almost certainly contains many errors because there are nucleotide substitutions in highly conserved regions, unusual codons, and because the sequence does not fit into the phylogeny. How do I respond as a reviewer of this paper?

There are many examples of sequences in the GenBank database which I know to be incorrect (see above).
Is there any way that my doubts can be communicated to users of the database?

The accuracy of the sequences that I analyze ranges from 95% to 100%, and half of the sequences have an accuracy of less than 99.6% (more than 4 errors in every 1000 nucleotides). These are sequences of genes in a highly conserved gene family where workers are able to compare their data with published sequences. Imagine what the accuracy of sequences of newly discovered genes must be!

-Larry Moran
Dept. of Biochemistry
Faculty of Medicine
University of Toronto
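
The frame check that case I calls for is mechanical once the alignment is in a computer. The following is only a minimal sketch in Python, assuming the aligned coding region is available as a plain string with "-" marking each gap position; the function name frame_breaking_gaps and the example sequences are hypothetical and come from none of the papers mentioned. It reports every gap run whose length is not a multiple of three, since such a gap shifts the reading frame downstream of it.

```python
def frame_breaking_gaps(aligned_seq, gap_char="-"):
    """Return (start, length) for each gap run whose length is not a multiple of 3.

    aligned_seq is one row of a pairwise alignment of a coding region
    (an assumption of this sketch), with gap_char marking gap positions.
    """
    bad = []
    i = 0
    n = len(aligned_seq)
    while i < n:
        if aligned_seq[i] == gap_char:
            # Walk to the end of this run of gap characters.
            j = i
            while j < n and aligned_seq[j] == gap_char:
                j += 1
            # A run of 1, 2, 4, 5, ... positions shifts the reading frame.
            if (j - i) % 3 != 0:
                bad.append((i, j - i))
            i = j
        else:
            i += 1
    return bad


# Hypothetical two-row alignment: a one-base gap in the first row,
# a three-base (frame-preserving) gap in the second.
new_gene = "ATGGCC-AAATTT"
ortholog = "ATGGCCGAAA---"

print(frame_breaking_gaps(new_gene))
print(frame_breaking_gaps(ortholog))
```

Both rows of the alignment need to be checked, because a gap in either one is an insertion or deletion relative to the other. This is a crude screen: two nearby frame-breaking gaps can restore the frame between them, so a flagged gap is a reason to look at the region, not proof of an error.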