Xref: utzoo bionet.molbio.genbank:72 news.software.nntp:456
Path: utzoo!utstat!helios.physics.utoronto.ca!jarvis.csri.toronto.edu!mailrus!iuvax!rutgers!phri!roy
From: roy@phri.nyu.edu (Roy Smith)
Newsgroups: bionet.molbio.genbank,news.software.nntp
Subject: Re: Distributing GenBank over the Internet
Message-ID: <1989Dec11.160609.5436@phri.nyu.edu>
Date: 11 Dec 89 16:06:09 GMT
References: <1989Dec7.213027.8591@phri.nyu.edu> <1364@uvm-gen.UUCP>
Sender: news@phri.nyu.edu (News System)
Reply-To: roy@alanine.UUCP (Roy Smith)
Followup-To: bionet.molbio.genbank
Organization: Public Health Research Institute, NYC
Lines: 69

[NOTE: this is only marginally related to nntp issues, so I've directed
followups to bionet.molbio.genbank only]

In  <1364@uvm-gen.UUCP> cavrak@uvm-gen.UUCP (Steve Cavrak) writes:
>Taking the suggestion one step further, why "distribute" the database at
>all?  Why not pursue a "server" model where queries against the
>database could be directed to one (or several) "database servers".

	Several reasons.  First, the stupid one, but possibly the one which
will prove most significant. People want their own copy on their own disk.
Never mind that they don't really need it, they want it.  Similar arguments
recently surfaced in another forum on the "central 50 MIPS machine with an
X-terminal on each desk vs. lots of 2 MIPS Unix workstations" issue.

	That said, the real problem with a query server is that you limit
the types of queries you allow people to do.  I haven't used the currently
available servers, but I gather they allow you to retrieve an entry by locus
name or accession number, or do searches using one or another of the
fasta family of programs.  That's great, but what if you want to do
something different?

	One of the things we do is translate the whole genbank database
into a protein database using some code kindly provided by Jim Fickett
years ago which parses the feature tables.  True, with tfasta you get sort
of the same effect, but running tfasta against genbank is a lot slower than
running fasta against our ficketized database.  There are advantages and
disadvantages to both ways, but the point is with just a query server, we
would not have had the option to do it the way we do.  Sometimes people do
searches by just grepping the genbank files; the keyword indices don't
always have what you want and sometimes it's nice to just grep the
definition or comment lines.  Maybe it would be possible to make the
databases available via a publicly (read-only!) mountable NFS file system?
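
	To make the "just grep it" case concrete, here is a minimal sketch
of searching only the definition and comment lines of a flat file.  It
assumes the usual flat-file layout (keywords such as LOCUS, DEFINITION and
COMMENT starting in column 1, continuation lines indented, entries ending
with "//").  The function name grep_genbank and the sample entries are made
up for illustration; this is not the code we actually run.

```python
import re

def grep_genbank(lines, pattern, keywords=("DEFINITION", "COMMENT")):
    """Yield (locus, line) for lines inside the chosen sections that match.

    Hypothetical helper: assumes keywords start in column 1 and that
    continuation lines are indented, as in the GenBank flat-file layout.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    locus = None
    section = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):          # end of entry
            locus = None
            section = None
            continue
        if line[:1] not in (" ", ""):      # a new keyword in column 1
            section = line.split(None, 1)[0]
            if section == "LOCUS":
                parts = line.split()
                locus = parts[1] if len(parts) > 1 else None
        if section in keywords and regex.search(line):
            yield locus, line.strip()

# Made-up entries, for illustration only.
SAMPLE = """\
LOCUS       EXAMPLE1     120 bp    DNA
DEFINITION  Hypothetical kinase gene, complete cds.
COMMENT     Made-up entry for illustration only.
ORIGIN
        1 acgtacgtac
//
LOCUS       EXAMPLE2     90 bp    DNA
DEFINITION  Hypothetical tRNA gene.
//
"""

hits = list(grep_genbank(SAMPLE.splitlines(), "kinase"))
for locus, text in hits:
    print(locus, "|", text)
```

	The point is that this kind of ad hoc search needs nothing more than
read access to the files, which is exactly what a fixed-menu query server
would not give you.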

	On the other hand, we are able to devote significant amounts of disk
space to the databases (our /usr/database file system is something over 100
Mbytes) and have the CPU power and time to make use of the material.  I
would imagine that for people with a PC and a 40 Meg hard disk in their lab,
a query server might be exactly what they need.  I honestly don't know which
type of installation is more typical.

>The other alternative is to just publish the database on CD-ROM and
>distribute it that way.

	CD-ROM is nice, but doesn't really solve the problems that tape has.
You still have to get a physical object from point A to point B, and you
still have to produce those objects.  How long does it take to press CDs
compared to the time it takes to cut tapes?  Also, from what I know of CDs,
they are much slower than magnetic hard disks.  Nor am I sure that
CD-ROM is really practical yet.  Maybe in a couple of years, but it's still
pretty much a specialty item today.

> the bandwidth of a 747 loaded with floppy disks, was nothing to yawn at.

	I've always heard it expressed in terms of a station wagon full of
mag tapes, but the point is well taken.  In the best case, FedEx can get a
magtape from me to you in about 16 hours.  I usually figure you can fit
about 150 Mbytes on a 2400' reel at 6250bpi with a large blocking factor.
Unless I did the math wrong, that works out to an effective bandwidth of
about 22 kbps.  Of course, both the magtape and the serial link can gain a
factor of 2-4 by using L-Z compression.  But then again, is it unreasonable
to assume that most of the people who want genbank have 1.5Mbps or (at the
very least) 56kbps connections to something connected to NSFNet, or will
have such within a couple of years (i.e. the same time scale I hypothesize
for the ubiquitization of CD-ROMs)?
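
	For anyone who wants to check the math, here is the arithmetic as a
short sketch.  It assumes a Mbyte is 2**20 bytes, uses the nominal link
rates quoted above, and ignores protocol overhead, so the link times are
best-case figures.  The function names are just for illustration.

```python
def effective_bps(nbytes, hours):
    """Bits per second achieved by moving nbytes in the given time."""
    return nbytes * 8 / (hours * 3600)

def hours_to_send(nbytes, link_bps):
    """Hours needed to push nbytes over a link, ignoring overhead."""
    return nbytes * 8 / link_bps / 3600

TAPE_BYTES = 150 * 2**20      # one 2400' reel at 6250bpi (assumed 2**20 per Mbyte)
FEDEX_HOURS = 16              # best-case door-to-door time

rate = effective_bps(TAPE_BYTES, FEDEX_HOURS)
print(f"FedEx magtape: {rate / 1000:.1f} kbps")

# L-Z compression helps the tape and the wire alike (factor of 2-4).
for factor in (2, 4):
    print(f"  with {factor}x compression: {rate * factor / 1000:.1f} kbps")

# The same 150 Mbytes over the links mentioned above.
for name, bps in (("56kbps", 56_000), ("1.5Mbps", 1_500_000)):
    print(f"{name} link: {hours_to_send(TAPE_BYTES, bps):.2f} hours")
```

	Uncompressed, the tape comes out just under 22 kbps, and a 56kbps
line that could be kept full would move the same data in well under the 16
hours FedEx needs.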
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"