Path: utzoo!utgpu!watserv1!watmath!uunet!bionet!lhc!ncifcrf!fcs260c2!toms
From: toms@fcs260c2.ncifcrf.gov (Tom Schneider)
Newsgroups: bionet.molbio.bio-matrix
Subject: Re: In defense of the Genome Boondoggle
Message-ID: <2055@fcs280s.ncifcrf.gov>
Date: 14 Feb 91 22:57:52 GMT
References: <12145@ur-cc.UUCP> <2050@fcs280s.ncifcrf.gov> <12180@ur-cc.UUCP>
Sender: news@ncifcrf.gov
Organization: NCI Supercomputer Facility, Frederick, MD
Lines: 108

In article <12180@ur-cc.UUCP> elmo@troi.cc.rochester.edu (Eric Cabot) writes:
>In article <2050@fcs280s.ncifcrf.gov> toms@fcs260c2.ncifcrf.gov (Tom Schneider) writes:
>>I think that Mark is exactly correct, and you have missed the point. Having a
>>huge database full of human sequences opens vistas for those of us who know how
>>to use statistical tools to analyse sequences. There are many things that can
>>be done. Some of them include learning how to identify genes from raw
>
>(much stuff deleted)
>Ok, I agree that it is possible to use statistical methods to infer that
>a given sequence contains a "gene". If I read your perspective correctly
>(and ignoring the self-back patting) the main goal is to beef up the

Sorry about that.

>database so that we can find new genes, whether functional or not. I
>I'm sorry, but I just see that as cost effective, given that we won't
>have the slightest inkling of what most of these genes are supposed to
>do.

I would not take it as the main goal; it is one of many worthy goals. Also,
once a gene is located, it can be deleted (in mouse or wherever), and then
one can play the usual powerful genetic tricks to figure out its function.
The sequence is merely a starting point. Since it (supposedly!) contains all
the information about the biology, in encrypted form, it is nice to have it
for starters. I've always been amazed when people put off sequencing "their"
gene for a long time, since one gets such a huge amount of solid data from
the sequence.

>>A straight sequencing of the genome will avoid the terrible biases that we
>>currently have in the GenBank database. For example, the database is missing
>
>Oh really? Wouldn't you say that concentrating on coli, fly, worm, yeast
>human and maybe a plant species puts a bit of bias into the database?

Interesting point. I suppose it comes from my biased view of analyzing the
binding sites from one species at a time, so as to avoid the assumption that
the recognizer (i.e. DNA binding protein, ribosome, polymerase, repressor or
whatever) is the same in all species. (We know lots of cases where it's
not.) So I am happy if I have a complete genome to work within. But after
finishing one, one needs to do the others to answer evolutionary questions,
and you are right, there is a huge diversity out there to be sequenced. So
until we can sequence genomes quickly (in minutes), I suppose the best we
can do is to choose the few organisms which have had lots of good genetics
done on them. I'm glad to see that these other organisms are considered part
of the project! When I first heard of the project I disliked it because I
thought that coli wouldn't get done first as a 'pilot'.

>>the insides of introns. If you think that these are not important, then you
>>may well be in for some super surprises later. The phrase "junk DNA" is a
>>statement of ignorance, not scientific fact. People currently chop off the
>>bases near the 3' sides of introns and don't report them in the database. The
>>proof is that they often end 10, 20 or 30 bases from the splice junction. This
>>would not happen if people reported all their data. Unfortunately, this means
>>that people have thrown out important parts of splice junctions BECAUSE THEY
>>THOUGHT THEY WERE UN-IMPORTANT. Do you follow? People think something is not
>>important, so they don't report it in the database, or limit the reports, so
>>nobody discovers that it IS important!
>>Another example is the reporting of
>(Nothing deleted because I am in complete agreement. Oh how I have ranted
>and raved about missing intron sequences.) But
>frankly, I don't follow if this is part of the defense of the genome project.
>Sure it'd be great to have chromosome long tracts of sequences to infer
>gemone organization but will we really be able to make sense out of
>it all using the sequence data alone? Take the case of upstream control
>regions, their significance was worked for the most part by experimental
>techinques. Those results are the stuff that are used to generate rules
>for sequence analysis. Not the other way around.

That's because theoretical concepts have not been strong enough to date. I
think that this will change. Not to be back patting (will you excuse me??
:-), but the example I know best is my own. E. coli ribosome binding sites
have about 11.0 bits of pattern. I was pretty surprised to find that the
information needed to locate the sites in the genome is about 10.6 bits!
This correlation seems to hold for other genetic systems. The idea (working
hypothesis) is that the amount of pattern at binding sites is in general
just enough to locate the sites in the genome.

Then I studied T7 RNA polymerase promoters and found that they contained too
much sequence pattern (35 bits of pattern) compared to what is needed to
locate them in the genome (16 to 17 bits). This meant that either the
hypothesis was wrong or something interesting was happening at T7 promoters.
Perhaps another protein binds there, and this accounts for the "excess"
information. If so, I should be able to delete the excess information. It
took me a while, but I did the experiment and found that 18 +/- 2 bits are
all that the polymerase needs! So the hypothesis survived. The experiment
would not have been done without the theoretical analysis. I have another
case like this that I'm writing up now.

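For anyone who wants to play with the comparison described above, here is a
rough Python sketch of the two quantities: the pattern at a set of aligned
sites, and the information needed to locate those sites. It is only an
illustration, not the program behind the numbers quoted in this post. The
function names, the toy sequences and the counts are all made up. It assumes
four equally likely bases (so at most 2 bits per position) and leaves out
any small-sample correction, so it will overestimate the pattern when there
are only a few sites.

```python
import math
from collections import Counter


def r_sequence(sites):
    """Bits of pattern in aligned sites: sum over positions of (2 - H).

    H is the Shannon entropy of the observed base frequencies at a position.
    Assumption: equiprobable background bases, no small-sample correction.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must be aligned to the same length")
    total = 0.0
    for pos in range(length):
        counts = Counter(s[pos] for s in sites)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total


def r_frequency(possible_positions, number_of_sites):
    """Bits needed to pick number_of_sites out of possible_positions."""
    return math.log2(possible_positions / number_of_sites)


if __name__ == "__main__":
    # Hypothetical toy data, not real binding sites or genome counts.
    toy_sites = ["AGGAGG", "AGGAGA", "TGGAGG", "AGGTGG"]
    print("pattern at sites (bits):", r_sequence(toy_sites))
    print("needed to locate (bits):", r_frequency(1024, 4))
```

The working hypothesis says the two printed numbers should come out about
equal for a real system. When the first is much larger than the second, as
with the T7 promoters, that is the hint that something else is going on.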
So the idea of doing experiments first is only a tradition of molecular
biology. Theoretical understanding can also play a role. References to this
story can be found in:

@article{Schneider.Stephens.Logo,
author = "T. D. Schneider and R. M. Stephens",
title = "Sequence Logos: A New Way to Display Consensus Sequences",
journal = "Nucl. Acids Res.",
volume = "18",
pages = "6097-6100",
year = "1990"}

>Eric Cabot                        |    elmo@uhura.cc.rochester.edu
>      :-):-):-):-):-):-):-):-)    |    elmo@uordbv.bitnet

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland 21702-1201
  toms@ncifcrf.gov