Path: utzoo!attcan!uunet!wuarchive!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!cica!iuvax!noose.ecn.purdue.edu!mentor.cc.purdue.edu!mace.cc.purdue.edu!cjv
From: cjv@mace.cc.purdue.edu (westerman)
Newsgroups: bionet.molbio.genome-program
Subject: local copies of genbank
Message-ID: <4448@mace.cc.purdue.edu>
Date: 20 Mar 90 16:17:44 GMT
Reply-To: cjv@mace.cc.purdue.edu (westerman)
Organization: Purdue University
Lines: 79


As a system manager who plans to keep the Genbank database online on my
systems and who plans *not* to use the fasta and retrieval capabilities
of genbank.bio.net except rarely, I'd like to respond to Dave
Kristofferson's recent posting on why he thinks local copies of the
database should be discouraged.

First, my circumstances:

   1) We are running the GCG (Wisconsin) sequence analysis package on
	VAX/VMS systems.

   2) My systems are not overloaded; we have spare CPU power and disk
	space.

   3) I do weekly updates of the database via ftp. This takes about 5
        minutes of my time and about 1 1/2 hours of machine time (done
        in the background at very low priority, maybe 15 minutes of 
	actual CPU time).

   4) I have looked at/installed Clark's shells. They are very nice
	and hide the "dirty details" from the user.


My objections to using the genbank server are threefold:

   1) Time. 
	While it takes only a little more time to retrieve a database
	entry from the server than from our local database (I estimate
	twice as long, which isn't bad considering the emailing that
	needs to be done), this delay is irritating when you are sitting
	looking at a blank CRT.

	While I haven't done fasta timing tests, I suspect that the genbank
	computer is faster than mine; on the other hand, having 4 computers
	at my disposal means I can do 4 searches simultaneously. Besides,
	fasta searching is not time critical -- a search of 1/2 hour (via
	genbank) or 2 hours (maximum via my computers) still means that I
	must walk away from my desk and/or do something else; either way
	I am not sitting around just waiting (unlike in the retrieval case
	above).

   2) Formatting
	Retrieval results from genbank come back in a form that I cannot
	immediately use for further processing; instead I must extract
	the sequence from my mail and then convert the sequence to a
	form the GCG package can use. Granted, these steps are minor, but
	they are extra steps and irritating because of that.

   3) Other uses of the database
	I have other programs that need to access the entire database
	besides fasta. One of these is the GCG program "FIND", which finds 
	short matches in sequences; one of my group is using this program
	to try to find various promoter sites. By having a local copy of
	the database, we can do theoretical analysis of the database.
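
To illustrate the extraction step in point 2, here is a minimal sketch
in Python. It assumes the mailed reply contains one entry in the
standard GenBank flat-file layout (an ORIGIN line, numbered sequence
lines, and a closing "//"). The function name extract_sequence is
hypothetical, and the conversion to GCG format is deliberately left
out.

```python
def extract_sequence(mail_text):
    """Return the bases of the first GenBank entry found in mail_text.

    Assumes the flat-file layout: sequence lines follow an ORIGIN line
    and the entry ends with a line starting with "//".
    """
    bases = []
    in_origin = False
    for line in mail_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True      # sequence lines start after this line
            continue
        if in_origin:
            if line.startswith("//"):
                break             # end of the entry
            # Drop the position numbers and spaces, keep only letters.
            bases.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(bases)
```

The result is a bare string of bases; it would still need to be written
out in whatever form the local package expects.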


A further comment:

   4) I suspect that the reason genbank is currently a feasible option
      is that it is not overloaded, much in the same manner as my system
      is able to handle a minimum of 4 fasta searches at a time; however
      if we started getting over 6 searches we would start bogging down;
      and if genbank started getting over XXX (60? ten times my load?)
      searches at a time, they would bog down too. (BTW: I have about a 10
      MIPS system)
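
The guess of 60 above is just linear scaling, which can be sketched in
Python as follows. The assumption that search capacity grows in
proportion to MIPS is a rough one, and the 100 MIPS server figure is
purely hypothetical ("ten times my load"); the real capacity of the
genbank machine is not known here.

```python
def estimated_search_limit(local_limit, local_mips, server_mips):
    """Scale a local concurrent-search limit linearly by CPU rating.

    Assumes capacity is proportional to MIPS, which ignores disk and
    memory limits.
    """
    return local_limit * server_mips / local_mips


# About 6 searches bog down a 10 MIPS system; a hypothetical server
# rated ten times higher would then bog down at around 60.
limit = estimated_search_limit(local_limit=6, local_mips=10, server_mips=100)
```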


I wish I could contribute further to this thread of netnews, but I am off on
vacation for a week or so. 

-- Rick


-- 

Rick Westerman                        AIDS Center Laboratory for Computational
Internet: cjv@mace.cc.purdue.edu      Biochemistry, Biochemistry building,
(317) 494-0505                        Purdue University, W. Lafayette, IN 47907