Path: utzoo!attcan!uunet!tut.cis.ohio-state.edu!zaphod.mps.ohio-state.edu!rpi!sci.ccny.cuny.edu!phri!roy
From: roy@phri.nyu.edu (Roy Smith)
Newsgroups: bionet.molbio.genome-program
Subject: Re: local copies of genbank
Message-ID: <1990Mar20.182202.11824@phri.nyu.edu>
Date: 20 Mar 90 18:22:02 GMT
References: <4448@mace.cc.purdue.edu>
Sender: news@phri.nyu.edu (News System)
Organization: Public Health Research Institute, New York City
Lines: 68

cjv@mace.cc.purdue.edu (Rick Westerman) writes:
> I'd like to respond to Dave Kristofferson's recent posting on why he
> thinks local copies of the database should be discouraged.

Both Dave and Rick make good points. This is basically the centralized vs. distributed computing argument all over again. The latest incarnation of this argument is raging right now in (I think) comp.arch, and involves X-terminals vs. workstations, but the gist is the same. For some people, one solution will be the right one, but not for everybody.

For somebody with something like a PC/AT class machine with a 40 Meg disk, I don't think there is any doubt that accessing a central fasta server is the only rational way to go. I am almost totally ignorant of the ways of PCs, but I assume that there is some sort of software one can run on a PC to allow you to send and receive mail.

I suspect, however, that if everybody who currently maintains their own GB database were to suddenly switch to banging on the bionet Solbourne for fasta searches, that machine would quickly roll over and die. That may not be a fair argument, however, since presumably the switch would be slow and you can always buy more Solbournes to keep up with the load. But then the question becomes, if we (in the global sense) are going to buy 10 more Solbournes for the central fasta site, why not buy me and 9 other research institutes Solbournes instead and let us use them to number crunch when we're not running fasta?

Of course, while that might make sense, it's not true that when I'm not doing fasta searches I can use the 100+ Meg of disk genbank takes up for something else, or at least not without a lot of fuss to shuffle things to tape, or what have you. One could, I guess, just NFS mount the genbank partition directly from the bionet server, but that's probably not a good way to make use of network bandwidth (although, with the plans afoot to make the whole NSFNet 45 Mbps or even 1 Gbps, we're going to have to find something to do with all that bandwidth!). On the other hand, it might make sense for a half dozen sites within a single University-wide 10 Mbps LAN to share one copy on disk.

> other programs that need to access the entire database besides fasta

This one's the kicker. While I think Dave is right that the vast majority of what people want to do with genbank is run fasta, there are enough other uses to make me want to keep my own copy. For example, we have a program given to us years ago by Jim Fickett which parses the genbank features table and generates a protein database by translating the annotated reading frames. People can then search that derived database. Yes, a lot of our derived database overlaps with dayhoff/PIR, but there is a lot of stuff which doesn't make it into PIR. You could run tfasta, but people around here say they prefer using fasta on the derived database, claiming that it finds things that tfasta doesn't. In practice, if people are serious about doing a protein search, they use all three methods and merge the results.

One thing that strikes me about genbank is that it's about an order of magnitude bigger than it has to be. If all you want to do is run fasta locally, you don't need the annotations. Right there, you cut the size of the files in half. Next, with the database stored as ascii, each base takes up 8 bits when it really only needs 2. That's another factor of 4 savings, or roughly a factor of 8 overall.

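The 2-bits-per-base point can be sketched roughly as follows. This is only a minimal illustration in Python, not anything fasta or genbank provides: the names `pack` and `unpack` and the particular A/C/G/T code assignment are hypothetical, and it assumes the sequence contains nothing but those four bases (ambiguity codes such as N would need separate handling).

```python
# Hypothetical sketch: store a nucleotide sequence at 2 bits per base
# instead of one 8-bit ascii character per base.
# Assumption: only A, C, G and T occur (no ambiguity codes like N).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a sequence into bytes, four bases per byte, low bits first."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed data."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


if __name__ == "__main__":
    seq = "GATTACAGATTACA"
    packed = pack(seq)
    # The ascii form needs len(seq) bytes; the packed form needs a quarter
    # of that, rounded up.
    print(len(seq), len(packed))
    print(unpack(packed, len(seq)) == seq)
```

The sequence length has to be kept alongside the packed bytes, since the last byte may be only partly used.
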
Maybe what people like me should be doing is storing a binary version of just the sequence data to run fasta against and throwing away the ascii files? I could then retrieve the full annotated ascii version of any interesting loci from a central server after a fasta run is finished.

PIR, by the way, is even worse. For some reason that I have never figured out, they put a blank space between every residue in the ascii version of the data base! This makes the files bigger without adding any information.
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"