Xref: utzoo bionet.molbio.genbank:72 news.software.nntp:456
Path: utzoo!utstat!helios.physics.utoronto.ca!jarvis.csri.toronto.edu!mailrus!iuvax!rutgers!phri!roy
From: roy@phri.nyu.edu (Roy Smith)
Newsgroups: bionet.molbio.genbank,news.software.nntp
Subject: Re: Distributing GenBank over the Internet
Message-ID: <1989Dec11.160609.5436@phri.nyu.edu>
Date: 11 Dec 89 16:06:09 GMT
References: <1989Dec7.213027.8591@phri.nyu.edu> <1364@uvm-gen.UUCP>
Sender: news@phri.nyu.edu (News System)
Reply-To: roy@alanine.UUCP (Roy Smith)
Followup-To: bionet.molbio.genbank
Organization: Public Health Research Institute, NYC
Lines: 69

[NOTE: this is only marginally related to nntp issues, so I've directed
followups to bionet.molbio.genbank only]

In <1364@uvm-gen.UUCP> cavrak@uvm-gen.UUCP (Steve Cavrak) writes:
>Taking the suggestion one step further, why "distribute" the database at
>all ? Why not pursue a "server" model where queries against the
>database could be directed to one (or several) "database servers".

	Several reasons. First, the stupid one, but possibly the one which
will prove most significant. People want their own copy on their own disk.
Never mind that they don't really need it, they want it. Similar arguments
recently surfaced in another forum on the "central 50 MIPS machine with an
X-terminal on each desk vs. lots of 2 MIPS Unix workstations" issue.

	That said, the real problem with a query server is that you limit
the types of queries you allow people to do. I haven't used the currently
available servers, but I gather they allow you to retrieve an entry by
locus name or accession number, or do searches using one or another of the
fasta family of programs. That's great, but what if you want to do
something different? One of the things we do is translate the whole
genbank database into a protein database using some code kindly provided
by Jim Fickett years ago which parses the feature tables.
True, with tfasta you get sort of the same effect, but running tfasta
against genbank is a lot slower than running fasta against our ficketized
database. There are advantages and disadvantages to both ways, but the
point is that with just a query server, we would not have had the option
to do it the way we do. Sometimes people do searches by just grepping the
genbank files; the keyword indices don't always have what you want and
sometimes it's nice to just grep the definition or comment lines. Maybe it
would be possible to make the databases available via a publicly
(read-only!) mountable NFS file system?

	On the other hand, we are able to devote significant amounts of
disk space to the databases (our /usr/database file system is something
over 100 Mbytes) and have the CPU power and time to make use of the
material. I would imagine that for people with a PC and a 40 Meg hard disk
in their lab, a query server might be exactly what they need. I honestly
don't know which type of installation is more typical.

>The other alternative is to just publish the database on CD-ROM and
>distribute it that way.

	CD-ROM is nice, but doesn't really solve the problems that tape
has. You still have to get a physical object from point A to point B, and
you still have to produce those objects. How long does it take to press
CDs compared to the time it takes to cut tapes? Also, from what I know of
CDs, they are much slower than magnetic hard disks, and I'm not sure that
CD-ROM is really practical yet. Maybe in a couple of years, but it's still
pretty much a specialty item today.

> the bandwidth of a 747 loaded with floppy disks, was nothing to yawn at.

	I've always heard it expressed in terms of a station wagon full of
mag tapes, but the point is well taken. In the best case, FedEx can get a
magtape from me to you in about 16 hours. I usually figure you can fit
about 150 Mbytes on a 2400' reel at 6250bpi with a large blocking factor.
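As a rough check on the tape arithmetic, here is a minimal Python sketch. It uses only the two figures given above (150 Mbytes per reel, 16 hours door to door). Everything else is an assumption made for illustration: the function names are hypothetical, and "Mbyte" is taken to mean 2^20 bytes, which the post does not specify.

```python
# Sketch only: effective bandwidth of shipping one magtape by courier.
# Assumption: 1 Mbyte = 2**20 bytes (the post does not say which is meant).

def effective_bandwidth_bps(payload_bytes: float, transit_hours: float) -> float:
    """Bits delivered per second when a payload takes transit_hours to arrive."""
    return payload_bytes * 8 / (transit_hours * 3600)


def transfer_hours(payload_bytes: float, link_bps: float) -> float:
    """Hours needed to push the same payload over a link of link_bps bits/s."""
    return payload_bytes * 8 / link_bps / 3600


TAPE_BYTES = 150 * 2**20   # about 150 Mbytes on a 2400' reel at 6250 bpi
FEDEX_HOURS = 16           # best-case courier time quoted in the post

if __name__ == "__main__":
    tape_bps = effective_bandwidth_bps(TAPE_BYTES, FEDEX_HOURS)
    print(f"tape by courier: {tape_bps / 1000:.1f} kbps")
    # Compare against the two network link speeds mentioned in the post.
    for name, bps in (("56 kbps", 56_000), ("1.5 Mbps", 1_500_000)):
        print(f"{name} link: {transfer_hours(TAPE_BYTES, bps):.2f} hours per reel")
```

Reading Mbyte as 10^6 bytes instead lowers the tape figure by about 5%, so the comparison with a 56 kbps link does not depend much on that choice. The link times assume the full nominal rate is available, which a shared line would not deliver.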
Unless I did the math wrong, that works out to an effective bandwidth of
about 22 kbps. Of course, both the magtape and the serial link can gain a
factor of 2-4 by using L-Z compression. But then again, is it unreasonable
to assume that most of the people who want genbank have 1.5Mbps or (at the
very least) 56kbps connections to something connected to NSFNet, or will
have such within a couple of years (i.e. the same time scale I hypothesize
for the ubiquitization of CD-ROMs)?
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"