Path: utzoo!utgpu!watserv1!watmath!uunet!snorkelwacker!bionet!LANL.GOV!pgil%histone
From: pgil%histone@LANL.GOV (Paul Gilna)
Newsgroups: bionet.molbio.genbank
Subject: Re:  Quality of submitted data
Message-ID: <9008141623.AA00744@histone.lanl.gov.LANL.GOV>
Date: 14 Aug 90 16:23:26 GMT
Sender: daemon@genbank.BIO.NET
Lines: 87

The link between GenBank data and the published literature is a
historical artifact: it existed simply because the literature was the
only source from which the databank could obtain sequence data. As
such, it was assumed that these data were "peer reviewed" and complied
with editorial directives, such as data determination from both strands.

The reality is that the vast majority of sequence data per se are
not subjected to any rigorous form of integrity review by the
conventional editorial peer review process. In fact, in the past, most
of the verification was performed after the fact by the annotation
process conducted by databank staff. Our estimates are that as much as
30% of the data appearing in the literature was incorrect.

However, this is changing: the combined efforts of the databanks and the
journals to encourage direct submissions are beginning to pay off.
GenBank currently receives 80% of its data by direct submission. The
majority of this data comes in early enough, and is processed fast
enough, that any errors spotted by our in-house verification processes
can be passed back to the author in time to have the errors corrected
for publication.  So in a sense the databanks themselves have become an
adjunct to the conventional peer-review process.

All e-mail submissions are currently passed back to the author for
review upon completion of the annotation process (usually within weeks
of receipt of the submission). We hope to expand this system soon to
encompass all submissions.

At the GenBank project it is clear to us that journal publication will
not be the primary forum for dissemination of sequence data in the
future:  Journals are already exercising editorial prudence on the
quantity of data they choose to print--we estimate that about 10% of
the sequence data we receive will not be printed in the published
article which reports those data.


Accordingly, we have identified the need to apply an even greater degree
of data validation than is presently used. While we currently check
entities such as ORFs (and data does not enter the database until it
passes these checks), we have embarked on the development of a Sequence
Validation Suite of software which will be used to check all sequence
data coming in and will incorporate such checks as promoter
verification, vector contamination (coming soon), splice site
verification, and more. Our recent announcement of the pilot phase of
the curator program represents our thoughts on how to enhance further
the quality and depth of data integrity by enlisting and supporting the
help of the scientific community.
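
To make the idea of an ORF check concrete, here is a minimal sketch in
Python. It is an illustration only and does not describe the checks
GenBank actually runs: the function name check_orf, the choice of tests,
and the restriction to the standard genetic code with an ATG start are
all assumptions.

```python
# Hypothetical ORF sanity check; not GenBank's actual validation code.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code only


def check_orf(seq, start, end):
    """Return a list of problems found in the region start..end.

    Coordinates are 1-based and inclusive. An empty list means the
    region passed every test in this sketch.
    """
    region = seq[start - 1:end].upper()
    problems = []

    if len(region) % 3 != 0:
        problems.append("length is not a multiple of 3")

    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(region) - len(region) % 3
    codons = [region[i:i + 3] for i in range(0, usable, 3)]

    if not codons or codons[0] != "ATG":
        problems.append("does not begin with ATG")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        problems.append("internal stop codon")

    return problems
```

A submission pipeline along these lines could hold back any entry for
which check_orf returns a non-empty list and pass the list back to the
author.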

The primary message here is that the quality of submitted data is
probably better than that of printed data, and is destined to increase
through the use of enhanced in-house QC software, through the curator
program, and through the use of automated submission processes which
will free up our staff to pay more attention to the quality and depth
of sequence annotation and integrity.


Finally, to answer your specific questions:

Our rule of thumb is that ambiguous sequence data must not represent
more than 10% of the length of the sequence; beyond that we will split
the sequence and note the ambiguous span in the features table. Be aware
that we also accept the IUPAC codes for uncertainty in nucleotide
sequences--these may have contributed to your "other" figures.
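
The 10% rule of thumb can be sketched in a few lines of Python. This is
an illustration under stated assumptions, not the procedure used by
databank staff: the function names are hypothetical, and "ambiguous" is
taken here to mean any IUPAC nucleotide code other than A, C, G, T or U.

```python
# Hypothetical helpers illustrating the 10% rule of thumb.
# Assumption: "ambiguous" means any IUPAC code other than A, C, G, T, U.
IUPAC_AMBIGUOUS = set("RYSWKMBDHVN")


def ambiguity_fraction(seq):
    """Return the fraction of residues that are IUPAC ambiguity codes."""
    seq = seq.upper()
    if not seq:
        return 0.0
    ambiguous = sum(1 for base in seq if base in IUPAC_AMBIGUOUS)
    return ambiguous / len(seq)


def ambiguous_spans(seq):
    """Return (start, end) pairs, 1-based inclusive, one per ambiguous run."""
    spans = []
    start = None
    for position, base in enumerate(seq.upper(), start=1):
        if base in IUPAC_AMBIGUOUS:
            if start is None:
                start = position  # a new ambiguous run begins here
        elif start is not None:
            spans.append((start, position - 1))
            start = None
    if start is not None:
        spans.append((start, len(seq)))  # run extends to the end
    return spans


def exceeds_rule_of_thumb(seq, limit=0.10):
    """True if more than `limit` of the sequence is ambiguous."""
    return ambiguity_fraction(seq) > limit
```

The spans returned by ambiguous_spans are the sort of coordinates one
might record when noting an ambiguous stretch in an annotation.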

We do not at this point apply an editorial standard which dictates that
both strands be sequenced to qualify for entry into the database. That
datum is easy to collect (through Authorin) and we would prefer to
denote the quality attributes rather than use them as editorial
standards.  With the advent of automated sequencing and image
processing techniques, it is conceivable that all data could come with
a statistical "confidence index" generated by the base calling
software; again this is information that we could easily track and
gather.


Though it may seem as if I have digressed in my answer to what might
have been conceived as a simple issue, your question touches upon a
great many complex issues, many of which have yet to be properly
answered.  Our approach has been to try to anticipate most of the
possible answers in the design of our database systems, and to design in
the flexibility to incorporate the answers to as yet unasked
questions.

Regards,

Paul Gilna
GenBank Biology Domain Leader
GenBank/LANL