Path: utzoo!utgpu!watserv1!watmath!uunet!snorkelwacker!bionet!LANL.GOV!pgil%histone
From: pgil%histone@LANL.GOV (Paul Gilna)
Newsgroups: bionet.molbio.genbank
Subject: Re: Quality of submitted data
Message-ID: <9008141623.AA00744@histone.lanl.gov.LANL.GOV>
Date: 14 Aug 90 16:23:26 GMT
Sender: daemon@genbank.BIO.NET
Lines: 87

The link between GenBank data and the published literature is a historical artifact: it existed simply because the literature was the only forum from which the databank could obtain sequence data. As such, it was assumed that these data were "peer reviewed" and complied with editorial directives, such as determination of the data from both strands. The reality is that the vast majority of sequence data per se are not subjected to any rigorous form of integrity review by the conventional editorial peer-review process. In fact, in the past most of the verification was performed after the fact, through the annotation process conducted by databank staff. Our estimates are that as much as 30% of the data appearing in the literature was incorrect.

However, this is changing: the combined efforts of the databanks and the journals to encourage direct submissions are beginning to pay off. GenBank currently receives 80% of its data by direct submission. The majority of this data comes in early enough, and is processed fast enough, that any errors spotted by our in-house verification processes can be passed back to the author in time to have them corrected for publication. So in a sense the databanks themselves have become an adjunct to the conventional peer-review process. All e-mail submissions are currently passed back to the author for review upon completion of the annotation process (usually within weeks of receipt of the submission). We hope to expand this system soon to encompass all submissions.

At the GenBank project it is clear to us that journal publication will not be the primary forum for dissemination of sequence data in the future: journals are already exercising editorial prudence on the quantity of data they choose to print--we estimate that about 10% of the sequence data we receive will not be printed in the published article which reports those data. Accordingly, we have identified the need to apply an even greater degree of data validation than is presently used. While we currently check entities such as ORFs (and data does not enter the database until it passes these checks), we have embarked on the development of a Sequence Validation Suite of software which will be used to check all incoming sequence data and will incorporate such checks as promoter verification, vector contamination (coming soon), splice site verification, and more. Our recent announcement of the pilot phase of the curator program represents our thoughts on how to further enhance the quality and depth of data integrity by enlisting and supporting the help of the scientific community.

The primary message here is that the quality of submitted data is probably better than that of printed data, and is destined to improve through the use of enhanced in-house QC software, through the curator program, and through the use of automated submission processes which will free up our staff to pay more attention to the quality and depth of sequence annotation and integrity.

Finally, to answer your specific questions: our rule of thumb is that ambiguous sequence data must not represent more than 10% of the length of the sequence; beyond that we will split the sequence and note the ambiguous span in the features table. Be aware that we also accept the IUPAC codes for uncertainty in nucleotide sequences--these may have contributed to your "other" figures. We do not at this point apply an editorial standard which dictates that both strands be sequenced to qualify for entry into the database.
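
As an illustration only (this is not GenBank's actual software), the 10% rule of thumb for ambiguous bases described above could be sketched in Python roughly as follows. The function names are hypothetical, and the sketch assumes that "ambiguous" means any IUPAC nucleotide code that stands for more than one base (R, Y, S, W, K, M, B, D, H, V, N); the splitting and feature-table annotation steps are not modeled.

```python
# Hypothetical sketch of the 10% ambiguity rule of thumb; not GenBank code.
# Assumption: a position is "ambiguous" if it holds an IUPAC nucleotide code
# that stands for more than one possible base.
IUPAC_AMBIGUOUS = set("RYSWKMBDHVN")


def ambiguous_fraction(seq: str) -> float:
    """Return the fraction of positions holding an IUPAC ambiguity code."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in IUPAC_AMBIGUOUS for base in seq) / len(seq)


def within_rule_of_thumb(seq: str, limit: float = 0.10) -> bool:
    """True if ambiguous positions make up no more than `limit` of the sequence."""
    return ambiguous_fraction(seq) <= limit
```

Under these assumptions, a 10-base sequence with one N passes the check, while one with two N's would be a candidate for splitting.
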
Whether both strands were sequenced is easy to record (through Authorin), and we would prefer to note such quality attributes rather than use them as editorial standards. With the advent of automated sequencing and image processing techniques, it is conceivable that all data could come with a statistical "confidence index" generated by the base-calling software; again, this is information that we could easily track and gather.

Though it may seem as if I have digressed in my answer to what might have been conceived as a simple issue, your question touches upon a great many complex issues, many of which have yet to be properly answered. Our approach has been to try to anticipate most of the possible answers in the design of our database systems, and to design in the flexibility to incorporate the answers to as-yet-unasked questions.

Regards,

Paul Gilna
GenBank Biology Domain Leader
GenBank/LANL