Path: utzoo!utgpu!watserv1!watmath!att!pacbell.com!decwrl!uunet!bionet!lhc!ncifcrf!fcs260c2!toms
From: toms@fcs260c2.ncifcrf.gov (Tom Schneider)
Newsgroups: bionet.molbio.bio-matrix
Subject: Re: In defense of the Genome Boondoggle
Message-ID: <2054@fcs280s.ncifcrf.gov>
Date: 14 Feb 91 22:09:47 GMT
References: <12145@ur-cc.UUCP> <2050@fcs280s.ncifcrf.gov> <5714@husc6.harvard.edu>
Sender: news@ncifcrf.gov
Organization: NCI Supercomputer Facility, Frederick, MD
Lines: 178

In article <5714@husc6.harvard.edu> Ellington@Frodo.MGH.Harvard.EDU (Deaddog):
>In article <2050@fcs280s.ncifcrf.gov> toms@fcs260c2.ncifcrf.gov (Tom 
>Schneider):

>> learning how to identify genes from raw
>> sequences alone.  Predictions can be tested - which leads to rapid 
>> discovery of new genes.
>
>As does PCR amplification or hybridization:  the analogue versions of your
>digital statistical analyses.

Wrong.  Those techniques only allow one to jump from previously identified
sequences in other species to the human sequence.  This is a wonderful thing,
but it does not allow one to take a pure raw sequence and identify the genetic
control systems in it.  The difference is that those techniques are only
techniques, not theoretical understanding.  And if you are going to pooh-pooh
theoretical understanding, then I have some papers for you to read!  Start
with:

@article{StormoPerceptron1982,
author = "G. D. Stormo
 and T. D. Schneider
 and L. Gold
 and A. Ehrenfeucht",
title = "Use of the `Perceptron' algorithm to distinguish translational
initiation sites in {E. coli.}",
year = "1982",
journal = "Nucl. Acids Res.",
volume = "10",
pages = "2997-3011"}
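
For readers who have not met the method, here is a minimal sketch of the
textbook perceptron update rule applied to aligned sequences.  It is not
necessarily the exact procedure of the paper above.  The function names, the
one-hot encoding, the zero threshold and the toy sequences are all assumptions
made for illustration.

```python
BASES = "ACGT"


def encode(seq):
    """One-hot encode a DNA string as a flat list of 4 * len(seq) numbers."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def score(weights, seq):
    """Dot product of the weight vector with the encoded sequence."""
    return sum(w * x for w, x in zip(weights, encode(seq)))


def train_perceptron(sites, nonsites, threshold=0.0, max_epochs=1000):
    """Adjust weights until sites score above and non-sites below threshold.

    Hypothetical helper: all sequences must have the same length.  Training
    stops early once a full pass makes no change, which is only guaranteed
    to happen when the two sets are linearly separable.
    """
    length = len(sites[0])
    weights = [0.0] * (4 * length)
    for _ in range(max_epochs):
        changed = False
        for seq in sites:
            if score(weights, seq) <= threshold:
                # A true site scored too low: add it to the weights.
                weights = [w + x for w, x in zip(weights, encode(seq))]
                changed = True
        for seq in nonsites:
            if score(weights, seq) >= threshold:
                # A non-site scored too high: subtract it from the weights.
                weights = [w - x for w, x in zip(weights, encode(seq))]
                changed = True
        if not changed:
            break
    return weights


# Made-up toy data, not real initiation sites.
toy_sites = ["GGAG", "GGAA"]
toy_nonsites = ["TTTT", "CCCC", "ACGT"]
toy_weights = train_perceptron(toy_sites, toy_nonsites)
```

The trained weights can then be used to score a raw sequence window by
window, which is the sense in which a prediction can be made from sequence
alone and then tested.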

>  The question is not whether some genes will
>be identified, the question is (a) how many could already be identified 
>without the sequence of the genome, and (b) whether the (IMO paltry)
>number that remain be worth the enormous cost?

I'm sure that we can continue on the blind route we are following and find lots
of interesting things eventually.  The US road system comes to mind.  Sure, we
could have survived without a network of major roads.  But having started on
the big project, we were able to become much more integrated as a society, and
now it is hard to imagine not having superhighways (or are they merely
PARKways?  And why is the place one parks the car in the DRIVEway?? :-).
Similar things could be said about a uniform telephone system:  we have (had??)
the best in the world because people at Bell labs thought big.  A third example
is the improvement in making maps that Landsat and other satellites have given
us.  And, yes, the Arpanet turned into the Internet.  In all these cases we start off
ad hoc and then eventually learn to do things systematically.  Consider the cow
paths you use to get to work!  (I refer to the roads of Boston.)  Would you
like to use muddy winding paths?  The genome project is merely a recognition
that we are close to the time that we can make our maps in a direct logical way
rather than piecemeal.

>Statisticians drool at the mounds of data to be created.

And so might the rest of the biologists.  They can use the data to direct
their experiments more effectively.  If they are afraid of math and computers
(is that your problem?? :-) then there are plenty of theoretical-types whom
they can team up with.

>>  avoid the terrible biases that we
>> currently have in the GenBank database.
>
>I'm sorry, but this does not seem like a terribly important
>problem.  GenBank is skewed.  Big deal.  It gets the job done.
>We find genes, we miss some stuff.  Science slops along and
>we still find those self-splicing introns and centromeres and
>other cool things.  Without the sequence of the human genome.
>And with many people happily employed (for now) producing 
>gobs of worthwhile data. 

The problem is here, and getting worse.  You apparently haven't tried
to make a consistent dataset from the data in GenBank.  It's a tough job!
The point about the genome project is that we don't need to miss anything
anymore.  You seem to have the idea that some genes are not important,
and that 'junk' DNA exists in the genome.  Consider that this is merely
a way for you to express to the rest of us how ignorant you are.
(We are also, but we admit it.  Do you admit that you are ignorant?)

>I mean, what's a good example of what we have missed?  We know 
>the Shine/Dalgarno sequences.

Well, you missed the other statistically important features that were
discovered by looking at the sites more carefully.  See:

@article{Gold1981,
author = "L. Gold
 and D. Pribnow
 and T. Schneider
 and S. Shinedling
 and B. S. Singer
 and G. Stormo",
title = "Translational initiation in prokaryotes.",
year = "1981",
journal = "Annu. Rev. Microbiol.",
volume = "35",
pages = "365-403"}

@article{StormoInitiation1982,
author = "G. D. Stormo
 and T. D. Schneider
 and L. M. Gold",
title = "Characterization of translational initiation sites
in {E. coli.}",
year = "1982",
journal = "Nucl. Acids Res.",
volume = "10",
pages = "2971-2996"}

@article{Schneider1986,
author = "T. D. Schneider
 and G. D. Stormo
 and L. Gold
 and A. Ehrenfeucht",
title = "Information content of binding sites on nucleotide
sequences",
year = "1986",
journal = "J. Mol. Biol.",
volume = "188",
pages = "415-431"}
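
As a rough illustration of the kind of measure the 1986 paper deals with, the
sketch below computes the information at each position of a set of aligned
sites as 2 bits minus the Shannon uncertainty of the base frequencies there.
It assumes equiprobable bases in the genome (hence the 2 bits) and leaves out
the small-sample correction, so treat it as a simplified sketch rather than the
published method.  The function name and the example sites are made up.

```python
import math
from collections import Counter


def information_content(sites):
    """Return the information (in bits) at each position of aligned sites.

    Hypothetical helper: assumes equal-length DNA strings, equiprobable
    background bases, and no small-sample correction.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("sites must be aligned and of equal length")
    n = len(sites)
    per_position = []
    for pos in range(length):
        counts = Counter(s[pos] for s in sites)
        # Shannon uncertainty H = -sum f * log2(f) over the bases observed.
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        per_position.append(2.0 - h)
    return per_position


# Made-up aligned sites, for illustration only.
example_sites = ["AGGAGG", "AGGAGA", "TGGAGG", "AGGTGG"]
per_base = information_content(example_sites)
total_bits = sum(per_base)
```

A position where every site has the same base contributes the full 2 bits, and
a position where all four bases are equally frequent contributes none.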

> We have learned far more from 
>mutation than we would by sequencing a bacterial genome (note:  
>sequencing the Coli genome is indeed a cool thing to do). 

This is a completely flip statement, with no foundation since you didn't
quantitate your answer and the experiment has not been done.  (But I do agree
that getting that sequence will be cool.)  Genetics is certainly a powerful way
to approach biological problems.  But once one has defined a biologically
interesting system, direct methods can produce answers that would be difficult
if not impossible to get by genetics.  For example, the sequence of a gene, or
exactly what bases are important for a promoter to function.  See:

@article{Schneider1989,
author = "T. D. Schneider
 and G. D. Stormo",
title = "Excess Information at Bacteriophage {T7} Genomic Promoters
Detected by a Random Cloning Technique",
year = "1989",
journal = "Nucl. Acids Res.",
volume = "17",
pages = "659-674"}

>And will the "insides of introns" generate data for 2 PNAS papers and 
>a TIBS review,

Yes.  The work of Andrzej Konopka is an example you seem to have missed.

>or will it actually be worth the billions of 
>dollars it will take to properly correct this horrific accounting
>error?

Your mistake here is to suggest that the genome project would give only these
data.  It would give much other data as well.

>> The second major justification is the enormous boost to sequencing 
>> technology that the project is making.
>
>Good sequencing technology stands on its own.  It does not need the Genome
>Boondoggle to help it along.

You have missed the point.  The project will focus more people on the
problems of sequencing, and the art will improve as a result.

>> We are eventually going to be able to sequence
>> everybody's DNA in a few minutes.
>
>Matrix-teers:  Is this nuts or what?  I've never seen this before, but
>if it is even remotely true, I'll eat the small plastic rats that reside 
>on the top of my terminal.

Ever heard of nanotechnology?  Well, bone up if you are ignorant.  I'll forgive
you, you don't need to eat those rats.

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov