Newsgroups: bionet.molbio.genbank
Path: utzoo!utgpu!lamoran
From: lamoran@gpu.utcs.utoronto.ca (L.A. Moran)
Subject: Quality of submitted data
Message-ID: <1990Aug15.204400.26622@gpu.utcs.utoronto.ca>
Organization: UTCS Public Access
Date: Wed, 15 Aug 90 20:44:00 GMT


David Kristofferson (kristoff@genbank.BIO.NET) writes:
     "The basic fact which has been brought up by journal editors repeatedly 
      is that the vast majority of reviewers who get a paper containing 
      sequence data in hardcopy are not going to take the time to enter the 
      data into a computer."

and Roy Smith (roy@alanine.phri.nyu.edu) replies:
     "You seem to be implying that this is the fault of the reviewers,
      that they are not taking the time to do their job properly (or are 
      implying that the editors are implying that)."

Ellen Murphy (murphy@phri.nyu.edu) also responded:
     "Surely you are not suggesting that reviewers are expected to type
      sequences into their computers whenever they get a sequence paper to
      review?  And to what end?  Just to verify that what the author claims
      to be an ORF really is?"


     If an incorrect sequence is published, the most guilty party is the author.
In many cases it was impossible for the reviewer to recognize that the sequence
was wrong. However, there are examples in the literature that clearly reflect
incompetence on the part of the reviewer (IMHO). Allow me to present some case
histories for discussion.

I. The sequence of a gene is published and aligned with that of an orthologous
   gene from another species. Many deletions and insertions are added to align
   the sequences. These include one and two base pair deletions in the coding
   region which destroy the reading frame. The paper does not mention that 
   the gene as published could not possibly encode a homologous protein.
   (published in NAR)
   CONCLUSION: the authors are stupid and the reviewers incompetent

II. The sequence of a gene is published with a transposition of a 75 bp 
   fragment from one part of the gene to another. The figure showing the 
   predicted amino acid sequence is correct. (published in PNAS)
   CONCLUSION: the authors were careless, reviewers could have detected it

III. A new sequence is aligned with that of a homologous gene from another
  species and the alignment includes deletions and substitutions. A much 
  better alignment can be obtained by assuming a small number of sequencing
  errors. (published in MCB)
  CONCLUSION: the authors were careless, and so was the reviewer

IV. The sequence of a gene fragment is published and the authors recognize that
  it is closely related (probably orthologous) to a gene in another species.
  The gene is decoded from the first methionine codon in the available 
  sequence because there is an upstream in-frame stop codon. The presumed 
  initiation codon is clearly an internal methionine and the stop codon is a
  sequencing error. (published in NAR)
  CONCLUSION: the authors were stupid and the reviewers were careless

V. A new sequence is published which is similar to one which is already in 
  the database but the similarity is not noted. (published in JCB)
  CONCLUSION: ?

VI. The complete sequence of a gene is published by a laboratory that has
  previously published a partial sequence of the same gene. There are several
  differences in the regions which overlap but these are not mentioned.
  (published in NAR)
  CONCLUSION: ?
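
The problem in case I is one that a reviewer could screen for mechanically
once the alignment is in a computer. What follows is only a minimal sketch
in Python, not a method taken from any of the papers above. It assumes the
aligned coding region is a single string with '-' as the gap character; the
function name and the example sequence are made up for illustration. A gap
run whose length is not a multiple of three shifts the reading frame unless
a compensating gap nearby restores it, so every hit still has to be checked
by eye.

```python
import re

def frameshift_gaps(aligned_cds):
    """Return (start, length) for each gap run whose length is not a
    multiple of 3, i.e. a gap that shifts the reading frame by itself."""
    return [(m.start(), len(m.group()))
            for m in re.finditer(r"-+", aligned_cds)
            if len(m.group()) % 3 != 0]

# Hypothetical aligned coding sequence: gaps of 1, 3 and 2 bases.
example = "ATGGCC-AAATTT---GGG--TGA"
print(frameshift_gaps(example))   # [(6, 1), (19, 2)]
```

The three-base gap is not reported because it removes a whole codon and
leaves the frame intact.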

    Now here's a hypothetical problem for you to grapple with. I have a 
database of many examples of a highly conserved gene. Assume that I receive a 
paper to review which includes a new sequence of one of these genes. From my 
analysis of the sequence I recognize that it almost certainly contains many 
errors because there are nucleotide substitutions in highly conserved regions, 
unusual codons, and because the sequence does not fit into the phylogeny. 
How do I respond as a reviewer of this paper?
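
To make the first of those criteria concrete, here is a rough Python sketch
of how substitutions at conserved positions might be flagged. It assumes
the known sequences and the new one are already aligned to the same length;
the function name and the toy sequences are hypothetical, and "conserved"
is taken here to mean identical in every known sequence, which is stricter
than one would want in practice.

```python
def conserved_mismatches(known, candidate):
    """List (position, conserved_base, candidate_base) wherever every known
    sequence has the same base but the candidate has a different one."""
    hits = []
    for i, column in enumerate(zip(*known)):
        if len(set(column)) == 1 and candidate[i] != column[0]:
            hits.append((i, column[0], candidate[i]))
    return hits

# Toy pre-aligned data: position 5 varies among the known sequences.
known = ["ATGGCTAAA", "ATGGCCAAA", "ATGGCAAAA"]
new = "ATGCCTAAG"
print(conserved_mismatches(known, new))   # [(3, 'G', 'C'), (8, 'A', 'G')]
```

A hit of this kind is a reason to ask the authors to recheck that region of
their data, not proof of an error.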

    There are many examples of sequences in the GenBank database which I know
to be incorrect (see above). Is there any way that my doubts can be 
communicated to users of the database? The accuracy of the sequences that I
analyze ranges from 95% to 100%, and half of them have an accuracy of less
than 99.6%, i.e. more than 4 errors in every 1000 nucleotides. These are
sequences of genes in a highly conserved gene family, where workers are able
to compare their data with published sequences. Imagine what the accuracy of
sequences of newly discovered genes must be!
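
For anyone who wants to check the arithmetic, the conversion from percent
accuracy to errors per 1000 nucleotides is (100 - accuracy) x 10. A short
Python sketch, with a function name of my own choosing:

```python
def errors_per_kb(accuracy_percent):
    """Convert percent accuracy into expected errors per 1000 nucleotides."""
    # Rounding hides floating-point noise such as 4.000000000000057.
    return round((100 - accuracy_percent) * 10, 6)

for accuracy in (100, 99.6, 95):
    print(accuracy, errors_per_kb(accuracy))
```

So 99.6% accuracy corresponds to 4 errors per 1000 nucleotides, and the 95%
end of the range corresponds to 50.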


-Larry Moran
Dept. of Biochemistry
Faculty of Medicine
University of Toronto