Path: utzoo!utgpu!watserv1!watmath!att!pacbell.com!ucsd!swrinde!zaphod.mps.ohio-state.edu!julius.cs.uiuc.edu!apple!bionet!LANL.GOV!pgil%histone
From: pgil%histone@LANL.GOV (Paul Gilna)
Newsgroups: bionet.molbio.genbank
Subject: Re: Fewer new sequences in Oct and Nov
Message-ID: <9012062223.AA00377@histone.lanl.gov>
Date: 6 Dec 90 22:23:21 GMT
Sender: daemon@genbank.bio.net
Lines: 109

J. Michael Cherry writes with concern on the recent drop in the number of entries passed to the servers from the GenBank project. It is correct to assume that events surrounding the RDBMS conversion have led to an apparent drop in our output. However, this drop is about to reverse dramatically, and this is an appropriate point to place the events of the past few weeks in perspective.

Firstly, some definitions. When we here at GenBank speak of the "conversion" to the RDBMS, we are in fact speaking of a number of conversions that occurred in parallel:

1. Conversion to internal maintenance of data in RDBMS format; this occurred by translation of the conventional flat file into the RDBMS tables.

2. Conversion to internal input of data into the RDBMS; while (1) could continue by simply passing flat files in and out (and indeed this is exactly how the previous two releases were generated), we designed and implemented a new annotation interface to the RDBMS, the annotators' workbench.

3. Conversion to external output of the RDBMS. Releases 64 and 65 of GenBank were created by translating in the flat file and then writing it out again in the new Feature Table format. Release 66 will consist only of flat files written from the database, where all data since release 65 have been entered only through the workbench.

4. Conversion to the new external flat-file format, the new Feature Table. In contrast to the old FT format, the new format exists only as a report on the database; we do not enter the features as they are written. Our annotators know about the features, but do not have to work with the syntax.
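As an aside for readers unfamiliar with conversion (1), the translation step can be pictured as parsing each flat-file record into rows for relational tables. The sketch below is purely illustrative: the field names, table layout, and parsing rules are hypothetical simplifications, not GenBank's actual schema or software.

```python
# Illustrative sketch of conversion (1): translating a flat-file entry
# into relational-style rows. Table layout and field names are
# hypothetical, not the actual GenBank schema.

def flatfile_to_tables(entry_text):
    """Parse one flat-file entry into an 'entry' row and 'feature' rows."""
    entry_row = {}
    feature_rows = []
    for line in entry_text.splitlines():
        tag, _, rest = line.partition(" ")
        rest = rest.strip()
        if tag == "LOCUS":
            entry_row["locus"] = rest.split()[0]
        elif tag == "DEFINITION":
            entry_row["definition"] = rest
        elif tag == "FEATURES":
            continue  # header line of the feature table
        elif line.startswith("     ") and rest:
            # crude: a feature key and its location on one indented line
            parts = rest.split()
            if len(parts) == 2:
                feature_rows.append({"locus": entry_row.get("locus"),
                                     "key": parts[0],
                                     "location": parts[1]})
    return entry_row, feature_rows

sample = """LOCUS       HUMHBB 73308 bp
DEFINITION  Human beta globin region.
FEATURES    Location/Qualifiers
     CDS 100..500
"""
entry, feats = flatfile_to_tables(sample)
```

Once the data live in tables like these, a flat file becomes just one possible report written from the database, which is exactly the relationship described in (4).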
These four conversion factors are also presented in order of priority: it is important to note that we concentrated on input first and worked on output second; the maxim "garbage in, garbage out" helps clarify that particular development philosophy!

Secondly, there are really two classes of "entry" passed to the servers (under which lie a number of sub-classes): new data, i.e., new sequences, and updated data, i.e., updates to existing, publicly available entries, error corrections, citation updates, etc.

In any conversion of this scale, a drop in productivity is inevitable. When we came to the point of beta test and conversion, we really had two choices of approach. We could conduct a beta test in which we worked in parallel on the two databases (RDBMS and FLATFILE), repeating the work in both, where only the flat-file version continued to be released to the world, and where we waited until we were absolutely confident that we could throw away the old tools before true conversion. Alternatively, we could conduct a live beta test, in which we worked only once on the data within the RDBMS, but had some failsafe mechanisms in place in case things went wrong. The former mechanism had the disadvantage of extreme redundancy of work and a significant effect on production, in addition to a prolonged learning curve for annotators. We chose the latter option, effectively throwing ourselves in at the deep end.

Our failsafe mechanism was a temporary flat-file report from the database, which could be both saved and used to continue distribution to the servers. It is the generation and handling of this flat file that has created the primary effect on performance. To cut a long story short, the flat files had to be examined and tweaked to pass our flat-file integrity checking programs.
To minimise the effect on performance we made the following call: only new data would be passed to the servers until we could distribute data directly to the servers from the RDBMS without recourse to the temporary flat file; all modifications to existing data (e.g., citation updates to entries already available, etc.) would not be passed to the servers until this could be done automatically. About 4 out of 5 entries which we have handled over the past few weeks have consisted of citation updates to existing released submitted data, and hence have not been passed to the servers.

That point of automation is about to occur: as of Monday of next week we will cease "tweaking" flat files. Throughout the next week, we will also release the 1000 or so entries that have not gone out to the servers as a result of the above actions. Freedom from this manual flat-file work should also result in a marked increase in output of all entries over the ensuing weeks. Distribution of flat files (in the new Feature Table format) will now happen automatically each night.

Dr. Cherry, you are correct to raise concern over your observations, and indeed we apologise for the events which have led to this concern. I hope, however, that the cause for such concern will have vanished over the course of the next few weeks.

Regards,

Paul Gilna, Ph.D., Biology Domain Leader
GenBank, Los Alamos.