Path: utzoo!utgpu!watserv1!watmath!att!pacbell.com!ucsd!sdd.hp.com!zaphod.mps.ohio-state.edu!julius.cs.uiuc.edu!apple!bionet!AARDVARK.UCS.UOKNOR.EDU!BROE
From: BROE@AARDVARK.UCS.UOKNOR.EDU (Bruce Roe)
Newsgroups: bionet.molbio.genbank
Subject: Re: A question for FTP users
Message-ID: <9101232140.AA25920@genbank.bio.net>
Date: 23 Jan 91 14:24:00 GMT
Sender: daemon@genbank.bio.net
Lines: 94

Hi,

Obviously the problem with the databases, their formats, and
programs to access the databases continues.  As with most things
in life there are no simple solutions until, of course, the solution
is found and then everyone says:

"My, that solution was so simple, why didn't we think of it before?"

The solution is rather simple: a common, stable database format.

Without this, vendors of software have two choices:
	1. Reformat the databases to fit their software
	2. Change their software to read the distributed databases.
	
Until now the choice of the vendors has been the former, mainly
because the format of the databases was (and still is) in a state
of change.  It is more efficient to write a program to change the
database format than it is to change the multitude of code for dealing
with the databases.
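
As a rough illustration of how small such a reformatting program can be,
here is a minimal sketch in Python that turns GenBank flat-file entries
into FASTA-style records.  It is not any vendor's actual tool.  The
function name genbank_to_fasta and the 60-column line width are
assumptions made for the example, and it reads only the LOCUS,
DEFINITION (first line), and ORIGIN sections up to the "//" terminator.

```python
def genbank_to_fasta(text, width=60):
    """Convert GenBank flat-file text to FASTA-style text.

    Hypothetical helper for illustration only: it keeps the locus name,
    the first DEFINITION line, and the residues listed after ORIGIN.
    """
    out = []
    name, definition, residues, in_origin = "", "", [], False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # Start of a new entry: second field is the locus name.
            name = line.split()[1]
            definition, residues, in_origin = "", [], False
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of entry: emit a header line and the wrapped sequence.
            seq = "".join(residues)
            out.append(">" + (name + " " + definition).strip())
            out.extend(seq[i:i + width] for i in range(0, len(seq), width))
            in_origin = False
        elif in_origin:
            # Sequence lines carry position numbers and blanks; keep letters.
            residues.extend(ch for ch in line if ch.isalpha())
    return "".join(line + "\n" for line in out)
```

Writing one converter like this per target format is the "reformat the
database" choice; the alternative is to put equivalent parsing inside
every program that reads the database.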

John and the folks at GCG have provided tools for converting GenBank
to GCG format and for inter-converting individual sequences from one
format to another.
The Staden programs read the databases stored in the PIR format but
can analyze individual sequences stored in any of several formats.
I do not know what IG does in their package, but I am sure they have
a similar approach.  Or do they use GenBank without reformatting?


David K. has written:
>        As I am sure you are aware, it is not in GenBank's charter to
>supply the databank in any commercial format.  Reformatting costs
>money regardless of who does it.  If we were required to reformat the
>database as you suggest, we would be obligated to provide it for
>*every* commercial vendor.  This is clearly impractical.  Also since
>many users do not have access to FTP, they would still have to rely on
>tape or CDROM distributions.  The net effect of this would be to delay
>the production of GenBank tremendously.  Reformatting GenBank clearly
>belongs where it is right now, in the hands of the commercial vendors.

Give me a break.  How many vendors is *every*?  Do folks really
search the entire GenBank from their PCs?  Some search the protein
databases on their PCs/Macs, but the entire GenBank?

Could we at least concentrate our discussion on mainframe computer
programs and the databases on them?  Maybe I'm mistaken, but I count
three mainframe program sets as the vast majority used: GCG, IG, and
NBRF/PIR.  A few sites have the Staden programs, but most of us who use
the Staden programs use them for purposes other than database searching.

In reality, Bill Pearson's FASTA and companion programs are probably
used the most, and they handle the GCG-formatted databases.

I think what we need here is a survey of what's out there.  Let us limit
our discussion to mainframe programs and FTP sites, and deal with sites
rather than individual users.  I also do not think we should
consider other forms of the databases, such as those which require
pre-processing for the NLM BLAST programs or GCG's QUICKSEARCH.

The problem is time and money.  If GCG supplies users with tapes for
$1600 they make money, but they sure save me lots of time and I get ALL
the databases we want and need in a format we can use.  I also do not
have to worry about transmission errors which may corrupt an ftp-ed
database.  If I get the GenBank tapes I still have to pay (although less),
but then I have to spend time re-formatting databases and also get additional
tapes from PIR and maybe others, which could bring the cost in tapes and
effort to a figure greater than the cost from GCG.  No matter what, it
looks like the NIH is going to pay the bills, either from individual
grants or from contracts to GenBank/IG.

I'd like to hear from the funding agencies and would also like comments
from those who supply databases to the rest of us.

My overall conclusions are:
(1) pay the money to GCG and get quarterly database updates on tape as
it is the least hassle for me and our system folks.  
(2) encourage users to search the latest databases using FASTA-Mail, etc.
(3) continue to join with others to encourage discussions which will
result in a common, stable database format.

Best to one and all,

        Bruce A. Roe
        Professor of Chemistry and Biochemistry
        INTERNET: BROE@aardvark.ucs.uoknor.edu
        BITNET:   BROE@uokucsvx
        AT&TNET:  405-325-4912 or 405-325-7610
        SnailNet: Department of Chemistry and Biochemistry
                  University of Oklahoma
                  620 Parrington Oval, Rm 208
                  Norman, Oklahoma 73019
        FAXnet:   405-325-6111
        ICBMnet:  35 deg 14 min North, 97 deg 27 min West