Path: utzoo!utgpu!watserv1!watmath!uunet!bionet!benton
From: benton@genbank.BIO.NET (David Benton)
Newsgroups: bionet.molbio.genbank
Subject: Re: GenBank Release 64 was incomplete
Message-ID:
Date: 25 Aug 90 17:35:42 GMT
References: <9008241347.AA08860@acme.med.unc.edu> <0093BAB0.F3070380@aclcb.purdue.edu> <1990Aug24.193617.7712@murdoch.acc.Virginia.EDU> <6967.26d645d3@mcclb0.med.nyu.edu>
Organization: GenBank Online Service
Lines: 50

After checking the GenBank Release 64 files which were on-line in the
genbank.bio.net ftp directory (~ftp/pub/db/gb-rel64) from early July to
mid-August, I can say unequivocally that GenBank Release 64.0 was *not*
incomplete, either as it appeared in that directory or as distributed on
mag tape.

In the course of preparing floppy-disk-format files from those GenBank
data files, we discovered a systematic error in the way certain feature
locations were formatted in the files. This error affected about 1250 of
the 185,079 features in Release 64.0. I therefore applied a global
correction to the files and replaced the files in the ftp directory with
the corrected files. All but one or two of the annotated divisions were
affected, and the total number of affected entries is probably greater
than 1000.

The number of entries in each division and the number of lines (and the
number of words) in each data file did not change. The only change was
to certain locations, which were originally written as (for example)
357357 when they should have been 357. So each of the affected files has
shrunk by a small number of bytes. (I think Bill Pearson's results can
be explained by the fact that he compared the sizes of the compressed
files, and the amount of L-Z compression is sensitive to the content.)

Due to my own oversight, I failed to post a notice to this newsgroup
announcing the availability of the new files and the reason for the
correction. I apologize for the inconvenience this has caused.
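
For readers who want to picture the kind of correction described above
(a location written as 357357 that should have been 357), here is a
minimal Python sketch. It is an illustration only, not the procedure
actually applied to the release files: the function name, the regular
expression, and the sample feature line are all assumptions.

```python
import re

# Hypothetical pattern (an assumption, not the actual fix): a run of digits
# immediately followed by an identical run, e.g. "357357", that is not part
# of a longer number.
DOUBLED = re.compile(r"(?<!\d)(\d+)\1(?!\d)")


def collapse_doubled_locations(line: str) -> str:
    """Replace each doubled number such as 357357 with a single copy (357).

    Caveat: a legitimate value such as 1212 also matches this pattern, so a
    real correction would have to be limited to the fields known to be bad.
    """
    return DOUBLED.sub(r"\1", line)


if __name__ == "__main__":
    # Made-up feature line, used only to exercise the function.
    before = "     CDS             357357..1200"
    after = collapse_doubled_locations(before)
    print(after)
    # Same number of words, fewer bytes.
    print(len(before.split()), len(after.split()))
    print(len(before) - len(after))
```

A substitution of this kind removes characters inside an existing token,
so line and word counts stay the same while the byte count drops
slightly, which matches the description above.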

While I am no longer in a position to guarantee that this won't happen
in the future, Dave Kristofferson assures me that the new management of
the project will be more vigilant.

Our philosophy has always been that, since GenBank is a human endeavor,
any snapshot of the database will contain "errors", but we bend all our
efforts toward removing known errors before distribution of releases.
Now that a more continuously updated GenBank is widely available in many
forms, we have attempted to correct errors as soon as they are known to
us and notify recipients of the corrections as soon as possible.

In general, as Dr. Smith recommended, because these corrections are
applied to single entries (by the GenBank annotation staff at Los Alamos
National Laboratory), the corrected entries are posted to
bionet.molbio.genbank.updates. In the present case, however, because the
corrections were globally applied to the entire database, I never had
the 1000+ entries in my hand to individually post to the updates
newsgroup. It may be, in cases like this one, that if extracting each
changed entry and posting it is a requirement for making a change to the
database, we will be forced to decide not to make the corrections until
the next release simply because the overhead (imposed by this
requirement) is too great.

I'm sure Dave Kristofferson will be happy to hear from users of the
updates newsgroup on their requirements for the operation of that group.

Sincerely,

David Benton
GenBank Staff
benton@karyon.bio.net