Path: utzoo!utgpu!watserv1!watmath!uunet!bionet!lhc!ncifcrf!fcs260c2!toms
From: toms@fcs260c2.ncifcrf.gov (Tom Schneider)
Newsgroups: bionet.molbio.bio-matrix
Subject: Re: In defense of the Genome Boondoggle
Message-ID: <2055@fcs280s.ncifcrf.gov>
Date: 14 Feb 91 22:57:52 GMT
References: <12145@ur-cc.UUCP> <2050@fcs280s.ncifcrf.gov> <12180@ur-cc.UUCP>
Sender: news@ncifcrf.gov
Organization: NCI Supercomputer Facility, Frederick, MD
Lines: 108

In article <12180@ur-cc.UUCP> elmo@troi.cc.rochester.edu (Eric Cabot) writes:
>In article <2050@fcs280s.ncifcrf.gov> toms@fcs260c2.ncifcrf.gov (Tom Schneider) writes:
>>I think that Mark is exactly correct, and you have missed the point.  Having a
>>huge database full of human sequences opens vistas for those of us who know how
>>to use statistical tools to analyse sequences.  There are many things that can
>>be done.  Some of them include learning how to identify genes from raw
>
>(much stuff deleted)

>Ok, I agree that it is possible to use statistical methods to infer that
>a given sequence contains a "gene". If I read your perspective correctly
>(and ignoring the self-back patting) the main goal is to beef up the

Sorry about that.

>database so that we can find new genes, whether functional or not.
>I'm sorry, but I just don't see that as cost effective, given that we won't
>have the slightest inkling of what most of these genes are supposed to
>do. 

I would not take it as the main goal; it is one of many worthy goals.

Also, once a gene is located, it can be deleted (in mouse or wherever), and
then one can play the usual powerful genetic tricks to figure out the
function.  The sequence is merely a starting point.  Since it (supposedly!)
contains all the information about the biology, in encrypted form, it is nice
to have it for starters.  I've always been amazed when people would put off
sequencing "their" gene for a long time, since one gets such a huge amount of
solid data from the sequence.

>>A straight sequencing of the genome will avoid the terrible biases that we
>>currently have in the GenBank database.  For example, the database is missing
>
>Oh really? Wouldn't you say that concentrating on coli, fly, worm, yeast
>human and maybe a plant species puts a bit of bias into the database?

Interesting point.  I suppose it comes from my biased view of analyzing the
binding sites from one species at a time so as to avoid the assumption that the
recognizer (i.e., DNA binding protein, ribosome, polymerase, repressor or
whatever) is the same in all species.  (We know lots of cases where it's not.)
So I am happy if I have a complete genome to work within.  But after finishing
one, one needs to do the others to answer evolutionary questions, and you are
right, there is a huge diversity out there to be sequenced.  So until we can
sequence genomes quickly (minutes), I suppose the best we can do is to choose
the few organisms which have had lots of good genetics done on them.  I'm glad
to see that these other organisms are considered part of the project!  When I
first heard of the project I disliked it because I thought that coli wouldn't
get done first as a 'pilot'.

>>the insides of introns.  If you think that these are not important, then you
>>may well be in for some super surprises later.  The phrase "junk DNA" is a
>>statement of ignorance, not scientific fact.  People currently chop off the
>>bases near the 3' sides of introns and don't report them in the database.  The
>>proof is that they often end 10, 20 or 30 bases from the splice junction.  This
>>would not happen if people reported all their data.  Unfortunately, this means
>>that people have thrown out important parts of splice junctions BECAUSE THEY
>>THOUGHT THEY WERE UN-IMPORTANT.  Do you follow?  People think something is not
>>important, so they don't report it in the database, or limit the reports, so
>>nobody discovers that it IS important!  Another example is the reporting of

>(Nothing deleted because I am in complete agreement. Oh how I have ranted
>and raved about missing intron sequences.)  But
>frankly, I don't follow if this is part of the defense of the genome project.
>Sure it'd be great to have chromosome long tracts of sequences to infer
>genome organization but will we really be able to make sense out of
>it all using the sequence data alone? Take the case of upstream control
>regions, their significance was worked out for the most part by experimental
>techniques.  Those results are the stuff that are used to generate rules
>for sequence analysis. Not the other way around. 

That's because theoretical concepts have not been strong enough to date.  I
think that this will change.  Not to be back patting (will you excuse me?? :-),
but the example I know best is my own.  E. coli ribosome binding sites have
about 11.0 bits of pattern.  I was pretty surprised to find that the
information needed to locate the sites in the genome is about 10.6 bits!  This
correlation seems to hold for other genetic systems.  The idea (working
hypothesis) is that the amount of pattern at binding sites is in general just
enough to locate the sites in the genome.  Then I studied T7 RNA polymerase
promoters and found that they contained too much sequence pattern (35 bits of
pattern) compared to what is needed to locate them in the genome (16 to 17
bits).  This meant that either the hypothesis was wrong or something
interesting was happening at T7 promoters.  Perhaps another protein binds
there, and this accounts for the "excess" information.  If so, I should be able
to delete the excess information.  It took me a while, but I did the experiment
and found that 18 +/- 2 bits are all that the polymerase needs!  So the
hypothesis survived.  The experiment would not have been done without the
theoretical analysis.  I have another case like this that I'm writing up now.
So the idea of doing experiments first is only a tradition of molecular
biology.  Theoretical understanding can also play a role.  References to this
story can be found in:

@article{Schneider.Stephens.Logo,
author = "T. D. Schneider and R. M. Stephens",
title = "Sequence Logos: A New Way to Display Consensus Sequences",
journal = "Nucl. Acids Res.",
volume = "18",
pages = "6097-6100",
year = "1990"}
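
For readers who want to try the comparison described above, here is a minimal
Python sketch of the two quantities: the pattern in a set of aligned sites and
the information needed to locate those sites in a genome.  It assumes the usual
definitions (2 bits minus the per-position uncertainty, summed over the site,
and log2 of the number of genome positions divided by the number of sites) and
leaves out the small-sample correction, so it is an illustration rather than
the analysis used for the numbers quoted above.  The function names and the toy
sequences are made up for the example.

```python
import math
from collections import Counter


def bits_to_locate(genome_positions, num_sites):
    """Information (bits) needed to pick num_sites sites out of
    genome_positions possible positions: log2(G / gamma)."""
    return math.log2(genome_positions / num_sites)


def bits_of_pattern(aligned_sites, alphabet="ACGT"):
    """Information (bits) in a set of aligned, equal-length sites.

    For each column: maximum uncertainty (2 bits for DNA) minus the
    observed uncertainty of the base frequencies in that column.
    No small-sample correction is applied (a simplifying assumption).
    """
    max_bits = math.log2(len(alphabet))
    total = 0.0
    for column in zip(*aligned_sites):
        n = len(column)
        counts = Counter(column)
        uncertainty = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += max_bits - uncertainty
    return total


if __name__ == "__main__":
    # Toy alignment (hypothetical data): columns 1-2 conserved, columns 3-4 not.
    sites = ["ACAA", "ACCC", "ACGG", "ACTT"]
    print("pattern:", bits_of_pattern(sites), "bits")
    # Hypothetical genome: 1024 positions containing a single site.
    print("needed: ", bits_to_locate(1024, 1), "bits")
```

With real data one would pass in the aligned binding sites for one species and
that species' genome size and site count, then see whether the two numbers come
out close, as they did for the ribosome binding sites.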

>Eric Cabot                              |    elmo@uhura.cc.rochester.edu
>      :-):-):-):-):-):-):-):-)          |    elmo@uordbv.bitnet

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov