Path: utzoo!utgpu!water!watmath!uunet!ig!daemon
From: TSPRINGER@BIONET-20.ARPA
Newsgroups: bionet.molbio.proteins
Subject: SWISS-PROT via fastp-mail for homology searches, PLEASE.
Message-ID: <5899@ig.ig.com>
Date: 15 Apr 88 05:36:14 GMT
Sender: daemon@presto.ig.com
Lines: 136

From: Timothy Springer

Searching for homologies in protein databases is much trickier than I ever anticipated. The problem is that all are incomplete, and the degree of incompleteness varies greatly. The local protein database with which I started using fastp, which is based on NBRF and is assuredly updated very frequently, lacked the sequences most interesting to me. When I started searching supposedly the same database on the NBRF computer in Washington, D.C., some very interesting homologies appeared. The protein sequence, published some 2 years ago, somehow was not present in our own version of the database.

The NBRF computer is excellent in responsiveness. An account for $400 may be had on NBRF just like on Intelligenetics. (See PIR bulletins.) We have given up on Intelligenetics for any online work (but see about batch jobs below). There we typically wait 6 to 14 hours on the terminal for a fastp search. The postdocs and graduate students in the lab will patiently do shifts, checking the terminal every 30 min for any sign of a response. If there is one, they have to respond to the interactive prompt, lest they be timed out and lose all. I have given up in disgust after waiting for 6 hours with no response (during nonpeak time). Either the system is hopelessly overloaded or the efficient Lipman-Pearson algorithm is implemented awkwardly; I also hear mumbling about lots of file openings and closings taking lots of time. On NBRF, the online search is completed within about 2 minutes. The PIR as well as the NEW database may be used.
The NBRF also has an excellent program for alignment, ALIGN, which is quick, and alignments with randomized sequences readily generate the statistical significance of the alignment. I have yet to see anything giving meaningful statistics on the Intelligenetics programs.

My enthusiasm for Intelligenetics skyrocketed last week when I read the PROTEIN-ANALYSIS bulletin board messages by Amos Bairoch on the SWISS-PROT database he has compiled and made available. While cDNA sequences may appear within a few months of publication in the nucleic acid databases, it may take a year or two for them to be translated and appear in the protein databases (even though the translated sequence was right there in the publication). So Dr. Bairoch (as have others for other databases) has translated the nucleic acid databases via computer program and included these additional sequences in SWISS-PROT, which with 1,654,416 residues in 6,102 sequences is the biggest nonredundant database I have seen.

Eager to try it, I ran into the same endless hours of waiting at the terminal with no response. Pouring out my frustrations to Vickie Johncox at Intelligenetics, I learned about fastp-mail and batch jobs. (Documented under HELP FASTP-MAIL and HELP BATCH.) Fastp-mail is a way of sending the search to the Sun computer via mail, with results coming back in less than a minute, exactly as advertised. (Why can't we get responsiveness like that online?) However, fastp-mail cannot search the SWISS-PROT database! For this you must write a batch file, and I don't recommend using the BFASTP command to build your file, because it was set up before the availability of SWISS-PROT and doesn't have that option in it. Instead, you should write your own batch file. See documentation in HELP BAT-FASTP.
A model batch file follows which probes for homologies with a sequence in the file YOUR.PEP:

@TAKE BATCH.CMD
@XFASTP
*YOUR.PEP
*2
*
*YOUR.REC
*50
*
*
*20
@LOGOUT

You could name this file YOUR.CTL and send it from your word processor via kermit, or edit it yourself on Bionet. You may then submit it to the batch queue:

SUBMIT YOUR.CTL /TIME 00:10:00

This allows for 10 minutes of cpu time, much more than the 5 minutes needed. You can check on status with INFO BATCH. Your results, with the 50 best scores and 20 best alignments, will appear in the file YOUR.REC.

Note that YOUR.PEP should be in the Intelligenetics file format rather than in the format recommended for fastp: lines of comments preceded by ";", one line of title not preceded by a ";", and then the sequence with no more than 499 residues in one-letter code, followed by a "1". If you have what the computer thinks is an extra sequence (like a line preceded by a "<") you will get an extra query about which sequence to search, which will throw off the batch file.

Our results? An exciting hit with a sequence in SWISS-PROT not present in other databases, and which was published in 1987.

The moral? Intelligenetics/Bionet/good guys/gals, could we get SWISS-PROT availability on fastp-mail? The PIR database is antiquated by SWISS-PROT. And batch files are a real pain compared to fastp-mail. Who wants to wait overnight for one search?

Even better? Could we get responsiveness online? Could we be connected directly to programs and computers that would do this for us rather than having to use a mail connection? The only problem I can see is that you would become so popular that you would be swamped with users and demands for your time and help.
And rather than just using Bionet as a way to get a taste of molecular biology computing before becoming frustrated by its slowness and moving on to other resources, or other program families such as those by the University of Wisconsin Genetics Computer Group, users might become devoted, longtime customers. -------
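
A footnote on the YOUR.PEP layout described above (comment lines starting with ";", one title line, then at most 499 one-letter residues ending in "1"). For anyone preparing the file on a local machine before sending it over by kermit, here is a minimal Python sketch that writes that layout and refuses input that would upset the batch job. The function name to_ig_format, the 60-column line wrap, and putting the "1" immediately after the last residue are my own assumptions for illustration; they are not taken from the Intelligenetics documentation, so check HELP BAT-FASTP before relying on it.

```python
def to_ig_format(title, sequence, comments=(), max_residues=499, width=60):
    """Return one protein sequence as text in the layout described in the post.

    Assumptions (not from the Intelligenetics documentation): lines are
    wrapped at `width` columns and the terminating "1" is placed directly
    after the last residue.
    """
    # Drop whitespace inside the sequence and normalize to upper case.
    residues = "".join(sequence.split()).upper()
    if not residues.isalpha():
        raise ValueError("sequence must contain only one-letter codes")
    if len(residues) > max_residues:
        raise ValueError(
            f"sequence has {len(residues)} residues; limit is {max_residues}"
        )

    title = title.strip()
    # A title starting with ';' would be read as a comment, and the post warns
    # that a line starting with '<' can be taken for an extra sequence.
    if not title or title[0] in ";<":
        raise ValueError("title must be non-empty and not start with ';' or '<'")

    lines = [f"; {c}" for c in comments]   # comment lines, each preceded by ';'
    lines.append(title)                    # exactly one title line
    for i in range(0, len(residues), width):
        lines.append(residues[i:i + width])
    lines[-1] += "1"                       # terminator after the last residue
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example sequence, for illustration only.
    print(to_ig_format("MYPROT", "MKVLAT GHWE", comments=["query for XFASTP"]), end="")
```

The length and title checks are there because a batch job cannot answer an unexpected prompt; it is cheaper to fail locally than to find out from an empty YOUR.REC.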