Path: utzoo!utgpu!watserv1!watmath!att!news.cs.indiana.edu!samsung!usc!apple!agate!ucbvax!postgres.Berkeley.EDU!aoki From: aoki@postgres.Berkeley.EDU (Paul M. Aoki) Newsgroups: bionet.molbio.genome-program Subject: Re: General Reference Message-ID: <39971@ucbvax.BERKELEY.EDU> Date: 10 Dec 90 08:19:39 GMT References: <1990Dec10.005756.2694@agate.berkeley.edu> Sender: usenet@ucbvax.BERKELEY.EDU Reply-To: aoki@postgres.Berkeley.EDU (Paul M. Aoki) Organization: U. C. Berkeley (alumnus) Lines: 96 chiueh@sprite.Berkeley.EDU (Tzi-cker Chiueh) writes: > I am a new reader to this group. I am wondering if somebody in the net can > briefly explain this genome program or refer me to the associated arcticles/books. > In particular, I am interested in knowing how as a computer scientist can help > in this project (e.g. database and CAD) and what is the state of the art? I'm going to take a stab at this, not because I claim to be an expert on the subject, but because I've been digging into this issue (database support for genome mapping) for a couple of weeks and want to bounce my results against the experts .. First, some specific advice. Frank Olken up on the hill (Lawrence Berkeley Labs) is involved with the project and can probably steer you in the appropriate direction. I assume you're still at UCB (I'm not) so he should be easy to find. There are a bunch of pop-science books on the US Human Genome Project; you might get some background out of them, but none of them seem to focus much on computational aspects of the problem :-) One reference I found particularly useful was a journal called "Computer Applications in the Biological Sciences" (often cited as CABIOS). The articles are written by a mishmash of molbio types, CS theorists and hackers. Not much background, but a lot of stuff on what people are working on in terms of real software. Other journals on computers in medical or biological science tend to focus on problems other than genome mapping. The idea behind genome mapping is that if you can find the areas in a given section of DNA that control certain biological functions, you can perform all sorts of genetic engineering (elimination of genetic/inherited diseases, selection of "good" genetic traits, etc.). To this end many nations, including the US, are pouring hundreds of millions of dollars into their various national projects. The biggest problem appears to be sequence matching. Once molecular biologists extract gene segments and identify the base pair (nucleic acids) or amino acid sequences, you wind up with a bunch of long text strings ("ACCGTAA...") that describe these sequences. A variety of archives and formats exist for these sequences (for more information, poke around on genbank.bio.net); suffice it to say that people are compiling sequence maps of many different species and annotating the database entries with what they think different parts of the sequences do. The function of new sequences can be determined by comparing them to current database entries, finding sequences that are "close" by some metric or that share many close subsequences. For example, one might find that the intestinal peptides of horses and humans are very similar and that much of the mapping done on the horse protein can be applied to the corresponding human protein. Some algorithms to compute "closeness" are: 1. Alignment: compute the "evolutionary distance" between one protein and another in terms of simple operations (substitution, insertion, deletion) and cost functions for the operations. Various dynamic programming techniques are commonly used. 2. Subsequence search: proteins that share many subsequences get high scores. Hashing is useful here. This method is limited, however. First, applying it to long sequences and large databases is hard because the distance functions tend to be computationally expensive. There are many optimizations (hybrid dynamic programming/hashing, complex data structures like that found in the latest issue of "ACM Trans. on Information Systems") but it's still expensive to compare against the whole database. Second, plain sequence matching ignores a great deal of physical and chemical information. More complex comparison methods take these factors into account and are even slower. Third, distance (using any distance metric) is only a push in the right direction. The "closest" match in terms of genetic function may well be one of the top 10 matches by (say) the Dayhoff PAM metrics but not the top. The insensitivity of most metrics to "weak matches" is commonly cited. So, what it to be done. For some reason, AI techniques aren't very popular -- people like brute-force, optimal-cost methods. Parallel programming is popular, since the dynamic programming computations are easily parallelized (one group is using a Connection Machine, another uses a Sequent, yet another uses the ICL DAP array processor). Most database technology flies right out the window because the databases are still small enough that a system that goes to disk a lot will have horrible performance relative to more ad-hoc, main-memory- oriented search software. It seems to me the best bet for database research is in the development of data structures like that in the TOIS article (but more space-efficient). There are other computational problems, but most are like the basic form of sequence matching in that they involve some kind of text string matching. (Computers are also being used to automate the sequencing process, controlling lab equipment and whatnot, but that isn't much of a CS research problem.) Anyway, I'll stop here and let people take potshots at this. Let me know what I'm misunderstanding and what I'm missing in terms of major approaches or research areas. Do bear in mind that two weeks ago I didn't know a ribosome from a RFLP so be patient with my misuse of biological terminology. Also, I'd be very interested in hearing what people are working on now. (I noticed at least one USENETter, Roy Smith at PHRI, published in CABIOS ..) -- Paul M. Aoki | aoki@postgres.Berkeley.EDU | ...!ucbvax!aoki