Path: utzoo!attcan!uunet!tut.cis.ohio-state.edu!zaphod.mps.ohio-state.edu!rpi!sci.ccny.cuny.edu!phri!roy
From: roy@phri.nyu.edu (Roy Smith)
Newsgroups: bionet.molbio.genome-program
Subject: Re: local copies of genbank
Message-ID: <1990Mar20.182202.11824@phri.nyu.edu>
Date: 20 Mar 90 18:22:02 GMT
References: <4448@mace.cc.purdue.edu>
Sender: news@phri.nyu.edu (News System)
Organization: Public Health Research Institute, New York City
Lines: 68

cjv@mace.cc.purdue.edu (Rick Westerman) writes:
> I'd like to respond to Dave Kristofferson's recent posting on why he
> thinks local copies of the database should be discouraged.

	Both Dave and Rick make good points.  This is basically the
centralized vs. distributed computing argument all over again.  The latest
incarnation of this argument is raging right now in (I think) comp.arch,
and involves X-terminals vs. workstations, but the gist is the same.  For
some people, one solution will be the right one, but not for everybody.
For somebody with something like a PC/AT class machine with a 40 Meg disk,
I don't think there is any doubt that accessing a central fasta server is
the only rational way to go.  I am almost totally ignorant of the ways of
PC's, but I assume that there is some sort of software one can run on a PC
to allow you to send and receive mail.

	I suspect, however, that if everybody who currently maintains their
own GB database were to suddenly switch to banging on the bionet Solbourne
for fasta searches, that machine would quickly roll over and die.  That may
not be a fair argument, however, since presumably the switch would be slow
and you can always buy more Solbournes to keep up with the load.  But then
the question becomes, if we (in the global sense) are going to buy 10 more
Solbournes for the central fasta site, why not buy me and 9 other research
institutes Solbournes instead and let us use them to number crunch when
we're not running fasta?

	Of course, while that might make sense for CPU cycles, it doesn't
work for disk: when I'm not doing fasta searches, I can't use the 100+ Meg
genbank takes up for something else, at least not without a lot of fuss to
shuffle things to tape, or what have you.  One could, I guess, just NFS
mount the
genbank partition directly from the bionet server, but that's probably not
a good way to make use of network bandwidth (although, with the plans afoot
to make the whole NSFNet 45 Mbps or even 1 Gbps, we're going to have to
find something to do with all that bandwidth!).  On the other hand, it
might make sense for a half dozen sites within a single University-wide 10
Mbps LAN to share one copy on disk.

> other programs that need to access the entire database besides fasta

	This one's the kicker.  While I think Dave is right that the vast
majority of what people want to do with genbank is run fasta, there are
enough other uses to make me want to keep my own copy.  For example, we
have a program given to us years ago by Jim Fickett which parses the
genbank features table and generates a protein data base by translating the
annotated reading frames.  People can then search that derived database.
Yes, a lot of our derived data base overlaps with dayhoff/PIR, but there is
a lot of stuff which doesn't make it into PIR.  You could run tfasta, but
people around here say they prefer using fasta on the derived database,
claiming that it finds things that tfasta doesn't.  In practice, if people
are serious about doing a protein search, they use all three methods and
merge the results.
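
	To give a rough idea of what translating an annotated reading frame
involves, here is a minimal Python sketch.  It is not Jim Fickett's
program.  The function name translate_cds and its arguments are made up
for illustration, and it assumes the simplest case: a single contiguous,
forward-strand coding region given as 1-based inclusive coordinates,
translated with the standard genetic code.  Real features tables also
have joined and complementary-strand locations, which this ignores.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate_cds(seq, start, end):
    """Translate seq[start..end] (1-based, inclusive, forward strand).

    Hypothetical helper: stops at the first stop codon and writes 'X'
    for any codon containing something other than A, C, G or T.
    """
    cds = seq[start - 1:end].upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)
```

Running something like this over every coding region in the features
table and writing out the results is, in outline, how a derived protein
database of the kind described above could be built.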

	One thing that strikes me about genbank is that it's about an order
of magnitude bigger than it has to be.  If all you want to do is run fasta
locally, you don't need the annotations.  Right there, you cut the size of
the files in half.  Next, with the database stored as ascii, each base
takes up 8 bits when it really only needs 2.  Another factor of 4 savings.
Maybe what people like me should be doing is storing a binary version of
just the sequence data to run fasta against and throwing away the ascii
files?  I could then retrieve the full annotated ascii version of any
interesting loci from a central server after a fasta run is finished.  PIR,
by the way, is even worse.  For some reason that I have never figured out,
they put a blank space between every residue in the ascii version of the
data base!  This makes the files bigger without adding any information.
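
	For what it's worth, the 2-bits-per-base idea can be sketched in a
few lines of Python.  This is only an illustration under a loud
assumption: the sequence contains nothing but A, C, G and T.  Real entries
can contain N and other ambiguity codes, which would need some kind of
escape mechanism that this sketch does not attempt.  The names pack,
unpack and strip_blanks are made up for the example.

```python
# Sketch only: store an unambiguous DNA string at 2 bits per base.
# Assumption: only A, C, G, T appear (no N or other ambiguity codes).

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Return (length, bytes): four bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        out[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return len(seq), bytes(out)


def unpack(length, data):
    """Inverse of pack(); the length is needed because the last byte may be padded."""
    return "".join(
        BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(length)
    )


def strip_blanks(residues):
    """Drop the blank between residues, as in the PIR layout described above."""
    return residues.replace(" ", "")
```

A 10-base sequence packs into 3 bytes instead of 10, which is where the
factor of 4 comes from (less a little padding in the last byte).  Combined
with dropping the annotations, that gives roughly a factor of 8 overall.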
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"