Path: utzoo!utgpu!watserv1!watmath!uunet!bionet!lhc!ncifcrf!fcs260c2!toms
From: toms@fcs260c2.ncifcrf.gov (Tom Schneider)
Newsgroups: bionet.molbio.genbank
Subject: Re: Eukaryotic cis-acting transcription regulatory elements
Message-ID: <2042@fcs280s.ncifcrf.gov>
Date: 7 Feb 91 22:03:41 GMT
References: <9102062259.AA01137@histone.lanl.gov> <Feb.6.16.08.05.1991.11751@genbank.bio.net>
Sender: news@ncifcrf.gov
Organization: NCI Supercomputer Facility, Frederick, MD
Lines: 60

In article <Feb.6.16.08.05.1991.11751@genbank.bio.net> kristoff@genbank.bio.net (David Kristofferson) writes:
>	Ouch!!  Are we bad or what 8-) 8-)??  The thing that I always
>find entertaining about this field is that when I was at the NCBI
>developers meeting last July, GenBank was being excoriated **for**
>including annotations for things like promoter sequences precisely for
>reasons along the lines I mentioned above.  It did not appear then
>that NCBI intended to include this type of information in their up and
>coming GenInfo database, preferring instead a less elaborately
>annotated entry.  However, the latest version I have heard indicated
>that their position was under revision due to input from yet other
>sections of the community.
>
>	Never a dull moment, is there? 8-)  In the absence of a
>concrete consensus, GenBank could spend a considerable amount of time
>doing and then undoing things.

Well!  If last July they were objecting to annotations of promoter LOCATIONS,
then some people have gone completely bonkers (besides me! :-).  The
coordinates of many binding sites are very well specified, and it would be a
great service to molecular biologists to record the ones for which there is
experimental evidence (along with what the evidence is).  What should NOT be
recorded in any way is the sequence of the site (because that's redundant) or
any consensus derived from the sites.  Consensuses are total guesses for the
most part.  The best example I can give you is the T7 promoters I work on.  We
know the coordinates at which the T7 RNA polymerase initiates transcription,
down to the base.  What is wrong is to assume that the patterns around that
point are in fact at all associated with transcription!  Now, that may seem to
be a silly statement, but bear with me.  Consider position -3 relative to the
initiation base (0).  In the natural promoters it is always an A, and a
consensus would place an A there.  But it turns out that extremely strong,
normal promoters can have any of the other 3 bases there.  SO THAT A IS NOT A
PART OF THE PROMOTER!  To read more about this, look at NAR 17: 659-674 (1989).
Consensus bites the dust.

Also, only one point on the site should be specified.  Anything else
is interpretation.  Just where does a ribosome binding site end?  That's
an open scientific question, not one to be decided arbitrarily.  The solution
is simply to record one point as the 'zero coordinate', perhaps with
an orientation.
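
To make that concrete, here is a rough sketch of what such a record might hold.
Everything in it is my own illustration and not any GenBank or NCBI format: the
class name, the field names, the entry name "EXAMPLE" and the coordinate 1000
are all hypothetical.  The point is only that one coordinate, a direction and
the evidence are enough; any position such as -3 can be computed from them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteAnnotation:
    """Hypothetical record: one point and a direction, no site boundaries."""
    entry: str        # name of the database entry (made up below)
    zero: int         # coordinate of the zero base within the entry
    orientation: int  # +1 = same strand as the entry, -1 = complement
    evidence: str     # what experiment located the zero base

    def relative(self, coordinate: int) -> int:
        """Position of an entry coordinate relative to the zero base."""
        return (coordinate - self.zero) * self.orientation


# Hypothetical site: initiation base at coordinate 1000 on the entry strand.
site = SiteAnnotation("EXAMPLE", 1000, +1, "hypothetical: mapped transcript start")
print(site.relative(997))  # prints -3
```

Note that nothing here says where the site begins or ends, and no sequence or
consensus is stored.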

Another example is the initiation codon in E. coli.  A few percent (5 or 7?) of
the time it is GTG; most of the time it is ATG.  The consensus model THROWS OUT
DATA and ignores the GTGs, even though they exist.  So no matter how you define
consensus, anything less than the observed frequencies destroys data.
So I object to having consensus in GenBank because it is a horrible model.
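
A small sketch of the difference, assuming a made-up sample of 100 initiation
codons with 5 GTGs (chosen only to match "a few percent"; these are not real
counts).  The function names are mine.  The frequency table keeps the GTGs; the
consensus, taken here as the most common codon, cannot.

```python
from collections import Counter


def frequencies(codons):
    """Fraction of the sample accounted for by each codon."""
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def consensus(codons):
    """Most common codon only; every other observation is discarded."""
    return Counter(codons).most_common(1)[0][0]


# Made-up sample, NOT real E. coli data: 95 ATG and 5 GTG.
starts = ["ATG"] * 95 + ["GTG"] * 5

print(frequencies(starts))  # {'ATG': 0.95, 'GTG': 0.05}
print(consensus(starts))    # ATG
```

You can always get the consensus from the frequencies, but never the other way
around.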

The bottom line:  only experimentally verified data should be stored in
GenBank.  If you store anything else, you'll have to fix it later (and be red
in the face).

I understand that NCBI is going for things that they can do WELL, and that
they are not averse to doing more later.  In the longer run, we will want
these 'signals' because it gets so hard to do surveys as the database gets
bigger.

>Dave

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov