Path: utzoo!utgpu!watserv1!watmath!att!pacbell.com!ucsd!swrinde!zaphod.mps.ohio-state.edu!julius.cs.uiuc.edu!apple!bionet!LANL.GOV!pgil%histone
From: pgil%histone@LANL.GOV (Paul Gilna)
Newsgroups: bionet.molbio.genbank
Subject: Re: Fewer new sequences in Oct and Nov
Message-ID: <9012062223.AA00377@histone.lanl.gov>
Date: 6 Dec 90 22:23:21 GMT
Sender: daemon@genbank.bio.net
Lines: 109

J. Michael Cherry writes with concern on the recent drop in the number of entries passed to the servers from the GenBank project. It is correct to assume that events surrounding the RDBMS conversion have led to an apparent drop in our output. However, this drop is about to reverse dramatically, and this is an appropriate point to place the events of the past few weeks in perspective.

Firstly, some definitions. When we here at GenBank speak of the "conversion" to the RDBMS, we are in fact speaking of a number of conversions that occurred in parallel:

1. Conversion to internal maintenance of data in RDBMS format; this occurred by translation of the conventional flat file into the RDBMS tables.

2. Conversion to internal input of data into the RDBMS; while (1) could continue by simply passing flat files in and out (and indeed this is exactly how the previous two releases were generated), we designed and implemented a new annotation interface to the RDBMS, the annotators' workbench.

3. Conversion to external output of the RDBMS. Releases 64 and 65 of GenBank were created by translating in the flat file and then writing it out again in the new Feature Table format. Release 66 will consist only of flat files written from the database, where all data since release 65 have been entered only through the workbench.

4. Conversion to the new external flat-file format, the new Feature Table. In contrast to the old FT format, the new format exists only as a report on the database; we do not enter the features as they are written. Our annotators know about the features, but do not have to work with the syntax.
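As an aside for readers unfamiliar with conversion (1), the translation step can be pictured as parsing each flat-file record into rows for relational tables. The sketch below is purely illustrative: the field names, table layout, and parsing rules are hypothetical simplifications, not GenBank's actual schema or software.

```python
# Illustrative sketch of conversion (1): translating a flat-file entry
# into relational-style rows. Table layout and field names are
# hypothetical, not the actual GenBank schema.

def flatfile_to_tables(entry_text):
    """Parse one flat-file entry into an 'entry' row and 'feature' rows."""
    entry_row = {}
    feature_rows = []
    for line in entry_text.splitlines():
        tag, _, rest = line.partition(" ")
        rest = rest.strip()
        if tag == "LOCUS":
            entry_row["locus"] = rest.split()[0]
        elif tag == "DEFINITION":
            entry_row["definition"] = rest
        elif tag == "FEATURES":
            continue  # header line of the feature table
        elif line.startswith("     ") and rest:
            # crude: a feature key and its location on one indented line
            parts = rest.split()
            if len(parts) == 2:
                feature_rows.append({"locus": entry_row.get("locus"),
                                     "key": parts[0],
                                     "location": parts[1]})
    return entry_row, feature_rows

sample = """LOCUS       HUMHBB 73308 bp
DEFINITION  Human beta globin region.
FEATURES    Location/Qualifiers
     CDS 100..500
"""
entry, feats = flatfile_to_tables(sample)
```

Once the data live in tables like these, a flat file becomes just one possible report written from the database, which is exactly the relationship described in (4).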
These four conversion factors are also presented in order of priority: it is important to note that we concentrated on input first and worked on output second; the maxim "garbage in, garbage out" helps clarify that particular development philosophy!

Secondly, there are really two classes of "entry" passed to the servers (under which lie a number of sub-classes): new data, i.e., new sequences, and updated data, i.e., updates to existing, publicly available entries, error corrections, citation updates, etc.

In any conversion of this scale, a drop in productivity is inevitable. When we came to the point of beta test and conversion, we really had two choices of approach. We could conduct a beta test in which we worked in parallel on the two databases (RDBMS and FLATFILE), repeating the work in both, where only the flat-file version continued to be released to the world, and where we waited until we were absolutely confident that we could throw away the old tools before true conversion. Alternatively, we could conduct a live beta test, in which we worked only once on the data within the RDBMS, but had some failsafe mechanisms in place in case things went wrong. The former mechanism had the disadvantage of extreme redundancy of work and a significant effect on production, in addition to a prolonged learning curve for annotators. We chose the latter option, effectively throwing ourselves in at the deep end.

Our failsafe mechanism was a temporary flat-file report from the database, which could be both saved and used to continue distribution to the servers. It is the generation and handling of this flat file that has created the primary effect on performance. To cut a long story short, the flat files had to be examined and tweaked to pass our flat-file integrity checking programs.
To minimise the effect on performance we made the following call: only new data would be passed to the servers until we could distribute data directly to the servers from the RDBMS without recourse to the temporary flat file; all modifications to existing data (e.g., citation updates to entries already available, etc.) would not be passed to the servers until this could be done automatically. About 4 out of 5 entries which we have handled over the past few weeks have consisted of citation updates to existing released submitted data, and hence have not been passed to the servers.

That point of automation is about to occur: as of Monday of next week we will cease "tweaking" flat files. Throughout the next week, we will also release the 1000 or so entries that have not gone out to the servers as a result of the above actions. Freedom from this manual flat-file work should also result in a marked increase in output of all entries over the ensuing weeks. Distribution of flat files (in the new Feature Table format) will now happen automatically each night.

Dr. Cherry, you are correct to raise concern over your observations, and indeed we apologise for the events which have led to this concern. I hope, however, that the cause for such concern will have vanished over the course of the next few weeks.

Regards,

Paul Gilna, Ph.D., Biology Domain Leader
GenBank, Los Alamos.