Path: utzoo!utgpu!watserv1!watmath!uunet!bionet!lhc!ncifcrf!fcs260c2!toms
From: toms@fcs260c2.ncifcrf.gov (Tom Schneider)
Newsgroups: bionet.molbio.genbank
Subject: Re: Eukaryotic cis-acting transcription regulatory elements
Message-ID: <2042@fcs280s.ncifcrf.gov>
Date: 7 Feb 91 22:03:41 GMT
References: <9102062259.AA01137@histone.lanl.gov>
Sender: news@ncifcrf.gov
Organization: NCI Supercomputer Facility, Frederick, MD
Lines: 60

In article kristoff@genbank.bio.net (David Kristofferson) writes:
>	Ouch!!  Are we bad or what 8-) 8-)??  The thing that I always
>find entertaining about this field is that when I was at the NCBI
>developers meeting last July, GenBank was being excoriated **for**
>including annotations for things like promoter sequences precisely for
>reasons along the lines I mentioned above.  It did not appear then
>that NCBI intended to include this type of information in their up and
>coming GenInfo database, preferring instead a less elaborately
>annotated entry.  However, the latest version I have heard indicated
>that their position was under revision due to input from yet other
>sections of the community.
>
>	Never a dull moment, is there? 8-)  In the absence of a
>concrete consensus, GenBank could spend a considerable amount of time
>doing and then undoing things.

Well!  If last July they were objecting to annotations of promoter
LOCATIONS then some people have gone completely bonkers (besides me! :-).
The coordinates of many binding sites are very well specified, and it
would be a great service to molecular biologists to record the ones for
which there is experimental evidence (along with what the evidence is).

What should NOT be recorded in any way is the sequence of the site
(because that's redundant) or any consensus derived from the sites.
Consensus sequences are total guesses for the most part.

The best example I can give you is the T7 promoters I work on.
We know the coordinates at which the T7 RNA polymerase initiates
transcription, down to the base.  What is wrong is to assume that the
patterns around that point are in fact at all associated with
transcription!  Now, that may seem to be a silly statement, but bear
with me.  Consider position -3 relative to the initiation base (0).  It
is always an A, and a consensus would place an A there.  But it turns
out that extremely strong, normal promoters can be had which have any of
the other 3 bases there.  SO THAT A IS NOT A PART OF THE PROMOTER!  To
read more about this, look at NAR 17: 659-674 (1989).  Consensus bites
the dust.

Also, only one point on the site should be specified.  Anything else is
interpretation.  Just where does a ribosome binding site end?  That's an
open scientific question, not one to be decided arbitrarily.  The
solution is simply to record one point as the 'zero coordinate', perhaps
with an orientation.

Another example is the initiation codon in E. coli.  A few percent (5 or
7?) of the time it is GTG; most of the time it is ATG.  The consensus
model THROWS OUT DATA and ignores the GTGs, even though they exist.  So
no matter how you define consensus, anything less than the frequencies
will require data-destruction.  So I object to having consensus in
GenBank because it is a horrible model.

The bottom line: only experimentally verified data should be stored in
GenBank.  If you store anything else, you'll have to fix it later (and
be red in the face).

I understand that NCBI is going for things that they can do WELL, and
that they are not averse to doing more later.  In the longer run, we
will want these 'signals' because it gets so hard to do surveys as the
database gets bigger.

>Dave

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov
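
The initiation codon argument in the post can be illustrated with a minimal Python sketch. The counts below (93 ATG, 7 GTG) are hypothetical, chosen only to match the "a few percent" figure mentioned in the post, and the function names are made up for this example. It shows that a consensus keeps only the most common base at each position, while a frequency table keeps the minority GTG starts.

```python
from collections import Counter

def column_counts(sites):
    """Count the bases seen at each position of equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(length)]

def consensus(sites):
    """Most common base at each position; all minority bases are discarded."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(sites))

def frequencies(sites):
    """Fraction of each base at each position; nothing is discarded."""
    n = len(sites)
    return [{b: c[b] / n for b in "ACGT"} for c in column_counts(sites)]

# Hypothetical counts (an assumption for illustration, not measured data):
# 93 ATG starts and 7 GTG starts.
codons = ["ATG"] * 93 + ["GTG"] * 7

print(consensus(codons))       # the GTG starts vanish from this summary
print(frequencies(codons)[0])  # first position still records the G fraction
```

With these made-up counts the consensus is ATG, and the 7% of sites that start with G appear only in the frequency table.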