Path: utzoo!utgpu!watserv1!watmath!uunet!wuarchive!uwm.edu!bionet!lhc!ncifcrf!fcs260c2!toms
From: toms@fcs260c2.ncifcrf.gov (Tom Schneider)
Newsgroups: bionet.molbio.bio-matrix
Subject: Re: In defense of the Genome Boondoggle
Message-ID: <2050@fcs280s.ncifcrf.gov>
Date: 12 Feb 91 16:14:04 GMT
References: <9102111942.AA08834@genbank.bio.net> <12145@ur-cc.UUCP>
Sender: news@ncifcrf.gov
Organization: NCI Supercomputer Facility, Frederick, MD
Lines: 70

In article <12145@ur-cc.UUCP> elmo@troi.cc.rochester.edu (Eric Cabot) writes:
>In article <9102111942.AA08834@genbank.bio.net> gunnell@FCRFV1.NCIFCRF.GOV ("Gunnell, Mark") writes:
>>In article <9102111731.AA00773@genbank.bio.net>
>>Ellington@frodo.mgh.harvard.edu (Deaddog) writes:
>>
>>> 
>>> Make me a list of similar worth that has to do with the Genome Boondoggle.
>>
>>Catalogue all human genes! Discover the functions of mapped genes; see how 
>>genes evolve; evaluate molecular evolution theories and how species originate;
>>find amazing biological phenomena never before observed by human eyes.  Yes,
>>all these and more can ...  etc.,etc. 8-)

>You *must* be either kidding us or yourself!
>But seriously, item 1 is hardly possible, item
>2 is probably not possible, and the remaining items are not even
>close to possible from a mere sequence determination of the (a?)
>human genome.

I think that Mark is exactly correct, and you have missed the point.  Having a
huge database full of human sequences opens vistas for those of us who know how
to use statistical tools to analyse sequences.  There are many things that can
be done.  Some of them include learning how to identify genes from raw
sequences alone.  Predictions can be tested - which leads to rapid discovery of
new genes.  I have been involved in two cases of this already (see Stormo et al.,
NAR 10:2997, 1982, for the first example of gene identification by computer; the
second one is in preparation), and it will certainly happen more as people
use neural nets more.
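
To make "identify genes from raw sequences alone" concrete, here is a minimal
sketch of the idea: learn a weight for each base at each position from known
sites and known non-sites, then slide the weights along a new sequence and flag
the windows that score above zero.  It uses the perceptron learning rule (the
approach taken in the Stormo et al. paper cited above), but everything specific
here is an assumption made for illustration: the toy sequences, the 6-base
window, the zero threshold, and all function names.  It is not the published
weights or data.

```python
# Toy perceptron over one-hot encoded DNA windows.
# All sequences and names below are invented for illustration only.
BASES = "ACGT"

# Hypothetical training data: made-up "sites" and "non-sites", 6 bases each.
POSITIVES = ["AGGAGG", "AGGAGA", "GGGAGG"]
NEGATIVES = ["TTTACA", "CCATCT", "ACTTTC"]


def encode(seq):
    """One-hot encode a DNA string into a flat list of 0/1 features."""
    return [1 if base == b else 0 for base in seq for b in BASES]


def score(weights, seq):
    """Dot product of the weight vector with the encoded window."""
    return sum(w * x for w, x in zip(weights, encode(seq)))


def train(positives, negatives, max_epochs=100):
    """Perceptron rule: add a missed site, subtract a falsely accepted non-site."""
    width = len(positives[0])
    weights = [0] * (4 * width)
    for _ in range(max_epochs):
        changed = False
        for seq in positives:
            if score(weights, seq) <= 0:
                weights = [w + x for w, x in zip(weights, encode(seq))]
                changed = True
        for seq in negatives:
            if score(weights, seq) > 0:
                weights = [w - x for w, x in zip(weights, encode(seq))]
                changed = True
        if not changed:  # every training example is classified correctly
            break
    return weights


def scan(weights, sequence, width):
    """Return start positions of windows whose score is above zero."""
    return [i for i in range(len(sequence) - width + 1)
            if score(weights, sequence[i:i + width]) > 0]


weights = train(POSITIVES, NEGATIVES)
candidates = scan(weights, "TTTTTTAGGAGGTTTTTT", 6)
```

The positions in `candidates` are predictions, not discoveries; the point of
the paragraph above is that each one can then be tested at the bench.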

A straight sequencing of the genome will avoid the terrible biases that we
currently have in the GenBank database.  For example, the database is missing
the insides of introns.  If you think that these are not important, then you
may well be in for some super surprises later.  The phrase "junk DNA" is a
statement of ignorance, not scientific fact.  People currently chop off the
bases near the 3' sides of introns and don't report them in the database.  The
proof is that the reported intron sequences often stop 10, 20 or 30 bases from
the splice junction, round numbers that would not turn up so often if people
reported all their data.  Unfortunately, this means
that people have thrown out important parts of splice junctions BECAUSE THEY
THOUGHT THEY WERE UNIMPORTANT.  Do you follow?  People think something is not
important, so they don't report it in the database, or limit the reports, so
nobody discovers that it IS important!  Another example is the reporting of
only the coding sequence of a procaryotic gene, even though we KNOW that there
is a region upstream (the Shine/Dalgarno) which is important for translational
initiation.  Any statistical analysis of human sequences must be done carefully
to avoid biases from the highly over-represented immunoglobulin and MHC
sequences.  I'm sure you can think of other examples.  A complete sequence,
without any bias, is the best way to get around this.  I think that that alone
justifies the project.
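
The truncation argument above can be checked mechanically.  The sketch below
is only an illustration under stated assumptions: it supposes you have already
extracted, for each database entry, the number of intron bases reported next to
a 3' splice junction (the list `REPORTED_FLANKS` is made-up data, and the
function names are hypothetical).  If nobody were trimming by hand, lengths that
are exact multiples of 10 should make up roughly one tenth of the entries; a
much larger share points to deliberate chopping.

```python
from collections import Counter

# Hypothetical flank lengths (bases of intron reported next to the
# 3' splice junction).  These numbers are invented for illustration.
REPORTED_FLANKS = [10, 20, 30, 17, 20, 45, 30, 10]


def flank_length_counts(flank_lengths):
    """Tally how often each reported flank length occurs."""
    return Counter(flank_lengths)


def round_number_fraction(flank_lengths, step=10):
    """Fraction of entries whose flank length is a positive multiple of step."""
    if not flank_lengths:
        return 0.0
    hits = sum(1 for n in flank_lengths if n > 0 and n % step == 0)
    return hits / len(flank_lengths)


counts = flank_length_counts(REPORTED_FLANKS)
fraction = round_number_fraction(REPORTED_FLANKS)
# Rough share expected if flank lengths were not trimmed to round numbers.
expected_if_untrimmed = 1 / 10
```

Comparing `fraction` with `expected_if_untrimmed` on real entries would show
how strong the reporting bias is; a complete genomic sequence has no such
cutoff to detect in the first place.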

The second major justification is the enormous boost to sequencing technology
that the project is making.  We are eventually going to be able to sequence
everybody's DNA in a few minutes.  This will have enormous medical implications,
since it will remove much guesswork from medicine.

I also used to think that the project was foolish, but these reasons have
convinced me that it is worthwhile.

There is also the spirit of adventure.  Fred Blattner once pointed out that it
would be really neat (my words, not his) to have the entire sequence of E. coli
- simply because it would be the first time that we knew the entire
specification of a living organism.  (Viruses don't count since they are
dependent on the host.)

>Eric Cabot elmo@uhura.cc.rochester.edu elmo@uordbv.bitnet

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov