Path: utzoo!utgpu!watserv1!watmath!att!news.cs.indiana.edu!samsung!usc!apple!agate!ucbvax!postgres.Berkeley.EDU!aoki
From: aoki@postgres.Berkeley.EDU (Paul M. Aoki)
Newsgroups: bionet.molbio.genome-program
Subject: Re: General Reference
Message-ID: <39971@ucbvax.BERKELEY.EDU>
Date: 10 Dec 90 08:19:39 GMT
References: <1990Dec10.005756.2694@agate.berkeley.edu>
Sender: usenet@ucbvax.BERKELEY.EDU
Reply-To: aoki@postgres.Berkeley.EDU (Paul M. Aoki)
Organization: U. C. Berkeley (alumnus)
Lines: 96

chiueh@sprite.Berkeley.EDU (Tzi-cker Chiueh) writes:
> I am a new reader to this group. I am wondering if somebody in the net can 
> briefly explain this genome program or refer me to the associated arcticles/books.
> In particular, I am interested in knowing how as a computer scientist can help
> in this project (e.g. database and CAD) and what is the state of the art?

I'm going to take a stab at this, not because I claim to be an expert on
the subject, but because I've been digging into this issue (database
support for genome mapping) for a couple of weeks and want to bounce my
results against the experts ..

First, some specific advice.  Frank Olken up on the hill (Lawrence
Berkeley Labs) is involved with the project and can probably steer
you in the appropriate direction.  I assume you're still at UCB (I'm
not) so he should be easy to find.

There are a bunch of pop-science books on the US Human Genome Project;
you might get some background out of them, but none of them seem to 
focus much on computational aspects of the problem :-)  One reference 
I found particularly useful was a journal called "Computer Applications 
in the Biological Sciences" (often cited as CABIOS).  The articles are 
written by a mishmash of molbio types, CS theorists and hackers.  Not 
much background, but a lot of stuff on what people are working on in 
terms of real software.  Other journals on computers in medical or
biological science tend to focus on problems other than genome mapping.

The idea behind genome mapping is that if you can find the areas in
a given section of DNA that control certain biological functions,
you can perform all sorts of genetic engineering (elimination of
genetic/inherited diseases, selection of "good" genetic traits,
etc.).  To this end many nations, including the US, are pouring
hundreds of millions of dollars into their various national projects.

The biggest problem appears to be sequence matching.  Once molecular 
biologists extract gene segments and identify the base pair (nucleic 
acids) or amino acid sequences, you wind up with a bunch of long text 
strings ("ACCGTAA...") that describe these sequences.  A variety of
archives and formats exist for these sequences (for more information,
poke around on genbank.bio.net); suffice it to say that people are 
compiling sequence maps of many different species and annotating
the database entries with what they think different parts of the
sequences do.  The function of new sequences can be determined by
comparing them to current database entries, finding sequences that
are "close" by some metric or that share many close subsequences.
For example, one might find that the intestinal peptides of horses
and humans are very similar and that much of the mapping done on
the horse protein can be applied to the corresponding human protein.
Some algorithms to compute "closeness" are:
	1. Alignment: compute the "evolutionary distance" between
	   one protein and another in terms of simple operations
	   (substitution, insertion, deletion) and cost functions
	   for the operations.  Various dynamic programming 
	   techniques are commonly used.
	2. Subsequence search: proteins that share many subsequences
	   get high scores.  Hashing is useful here.
This method is limited, however.  First, applying it to long sequences
and large databases is hard because the distance functions tend to be
computationally expensive.  There are many optimizations (hybrid
dynamic programming/hashing, complex data structures like that found
in the latest issue of "ACM Trans. on Information Systems") but it's
still expensive to compare against the whole database.  Second, plain
sequence matching ignores a great deal of physical and chemical
information.  More complex comparison methods take these factors into
account and are even slower.  Third, distance (using any distance
metric) is only a push in the right direction.  The "closest" match
in terms of genetic function may well be one of the top 10 matches
by (say) the Dayhoff PAM metrics but not the top.  The insensitivity
of most metrics to "weak matches" is commonly cited.

So, what it to be done.  For some reason, AI techniques aren't very
popular -- people like brute-force, optimal-cost methods.  Parallel
programming is popular, since the dynamic programming computations
are easily parallelized (one group is using a Connection Machine,
another uses a Sequent, yet another uses the ICL DAP array processor).
Most database technology flies right out the window because the
databases are still small enough that a system that goes to disk a lot
will have horrible performance relative to more ad-hoc, main-memory-
oriented search software.  It seems to me the best bet for database
research is in the development of data structures like that in the 
TOIS article (but more space-efficient).

There are other computational problems, but most are like the basic
form of sequence matching in that they involve some kind of text
string matching.  (Computers are also being used to automate the
sequencing process, controlling lab equipment and whatnot, but that 
isn't much of a CS research problem.)

Anyway, I'll stop here and let people take potshots at this.  Let
me know what I'm misunderstanding and what I'm missing in terms of
major approaches or research areas.  Do bear in mind that two weeks
ago I didn't know a ribosome from a RFLP so be patient with my misuse
of biological terminology.  Also, I'd be very interested in hearing 
what people are working on now.  (I noticed at least one USENETter,
Roy Smith at PHRI, published in CABIOS ..)
--
    Paul M. Aoki   |   aoki@postgres.Berkeley.EDU   |   ...!ucbvax!aoki