Path: utzoo!mnetor!uunet!yale!cmcl2!beta!dd
From: dd@beta.UUCP (Dan Davison)
Newsgroups: sci.bio
Subject: similarity searching; statistical significance
Message-ID: <18202@beta.UUCP>
Date: 20 Apr 88 05:25:27 GMT
Organization: Los Alamos Natl Lab, Los Alamos, N.M.
Lines: 250
Keywords: DNA, RNA, protein, statistical significance

The following discussion appeared recently on the MOLECULAR-EVOLUTION
bboard on the arpanet, and I thought sci.bio readers would be
interested in it.

dan davison
theoretical biology t-10 ms k710
los alamos natl lab
los alamos nm 87545
dd@lanl.gov  ...cmcl2!lanl!dd

-------------------------------------------------------------------

Date: Wed 30 Mar 88 17:47:28-PST
From: Winston Hide
Subject: Random sequence homology.

In all your learned opinions, what do you think the value is for %
homology of two pieces of completely random DNA sequence? I would
appreciate it if you could state what you think the most random value
would be, and by what process.

Thank you for your indulgence.

From: Jack Kramer
Subject: Statistical significance of "PERCENT" homology.

Winston,

Percent similarity can be used to imply homology (CELL 50:667 Aug 87)
as well as analogy. The statistical significance of sequence
similarity has been extensively analysed. A couple of good places to
start looking are the appropriate chapters in von Heijne's "Sequence
Analysis in Molecular Biology", reviews and articles in Computer
Applications in the Biosciences (e.g. 3:1 Mar 87) and the Application
of Computers to Research on Nucleic Acids I, II and III supplements
to NAR.

Jack

Date: Fri 1 Apr 88 18:58:17-PST
From: Dan Davison
Subject: statistical significance of "%" "homology" (acckkk gasp)

> From: Jack Kramer
> Subject: Statistical significance of "PERCENT" homology.
>
> The statistical significance of sequence similarity has been
> extensively analysed.

I would add, and still not well understood.
There has been a lot of important work by Arratia, Waterman, and
Galas, but I would not say that the problem is well understood. The
distribution of nucleic acid similarities is described by extreme
value theory and the Erdos-Renyi law (with shifts), but there is no
well-defined distribution (a la binomial or Poisson) for them. Yet.

I'd be glad to hear about any work in this field.

Date: Wed, 06 Apr 88 13:51:12 BST
From: MJB1%PHX.CAM.AC.UK@CUNYVM.CUNY.EDU

Surely the point here is that there are an infinite number of
statistical models of sequence similarity. There is no problem in
assigning significance under a particular model, though there may
well be a problem in assessing its biological relevance.

I think the questions being asked should be

(1) What is a good model for the similarity of molecular sequences?

(2) How can one assess the biological relevance of statistical
    significance in relation to a particular model?

Put in this way, one soon realises that the original problem has been
framed in too broad a way. What are the conditions relating to the
comparison? Surely not just that we have sequenced two bits of DNA
and want to know how similar they are (though it could be that if you
insist).

People should worry more about the conditions relating to the
particular problem and try to get experimental evidence about
biologically relevant parameters. To emphasise the point about
conditions, consider the old coin tossing problem. We all know that
it comes up heads half the time and tails half the time. But does
it... the coin rolled down the drain and the result was
indeterminate. My friend has made a ballistic machine which tosses
the coin so that the way it lands depends on which way it was placed
on the machine before tossing.

How much more complex then are the conditions under which DNA
evolves. Trying to improve our knowledge about that for specific gene
families would be a good thing to attempt.
A completely general model is too broad and naive to be useful, I
suspect.

Date: Wed 6 Apr 88 08:39:39-PDT
From: Jack Kramer
Subject: [Dan Davison : Re: Statistical significance of "PERCENT" homology.]

You might check out the method of Manske and Chapman, in the second
issue of the current volume of the J. of Molecular Evolution. It's an
interesting attempt in the general direction you describe.

Would it be OK to post your note to MOLECULAR-EVOLUTION? I wouldn't
mind starting a fight or two. I also think it's unlikely that
*anything* other than Dr. Who's Tardis would allow "the detection of
evolutionary relationships objectively". Perhaps I'm too jaded from
the SUNY Stony Brook [phenetics vs. cladistics] wars.

Date: Thu 31 Mar 88 17:54:09-PST
From: Jack Kramer
Subject: Re: Statistical significance of "PERCENT" homology.

Dan,

I strongly second your "not well understood". A significant reason
for this, in my opinion, is the scalar alphabet used as the basis for
almost all studies. I am very interested in the assignment of
multiple-coefficient attribute vectors to the alphabet and "words",
and in the application of neural net semantic pattern recognition AI
techniques to this problem. Some very rudimentary work along these
lines has been done by Stormo and Wold et al.

The great flurry of current activity in massively parallel
fine-grained hardware and software architectures will eventually
percolate into the molecular biology arena. Most of the advances now
being made along these lines in speech recognition will directly
apply. Elucidation of the syntactic and semantic patterns in
biomacromolecules should be a direct fallout from this work and, if
combined with cladistic clustering, will finally allow the detection
of evolutionary relationships objectively. (My opinions - and I love
to argue, especially about mol evol.)

I think it would be very interesting to participate in electronic
debate a la Farris/Felsenstein through this medium.
Anybody out there want to start stirring things up?

Jack Kramer
Center for Gene Research
Oregon State University

Date: Thu, 7 Apr 88 12:31:50 MDT
From: dbd%benden@LANL.GOV (Dan Davison)

> Surely the point here is that there are an infinite number of
> statistical models of sequence similarity.

Yes.

> There is no problem in assigning significance under a particular
> model, though there may well be a problem in assessing its
> biological relevance.

One of the most common mistakes I encounter among molecular
biologists who are looking at search results is "This result isn't
statistically significant, so I can ignore it". Enhancer and TATA
boxes are examples of statistically insignificant matches that are
biologically significant. In these cases there is an additional
element--position--that determines the biological significance. Only
the biologist (or an AI tool) can do such analysis.

> I think the questions being asked should be
>
> (1) What is a good model for the similarity of molecular sequences?

How about "what is a good model for assessing the similarity of
molecular sequences"? I think this is what you mean.

> (2) How can one assess the biological relevance of statistical
>     significance in relation to a particular model?
>
> Put in this way, one soon realises that the original problem has
> been framed in too broad a way. What are the conditions relating to
> the comparison? Surely not just that we have sequenced two bits of
> DNA and want to know how similar they are (though it could be that
> if you insist).

As you can tell from my remarks above, I agree with this statement,
with a caveat. I have next to my desk a printout 14 inches thick of
8 point type. It is the (accidental) result of specifying too low a
similarity criterion to a library search routine. Suppose that this
search was an enhancer core against all of GenBank. Every bit of that
printout would be potentially biologically significant.
However, it would take a month (or more) of effort to check the
biological significance of each result.

We must have ways of sifting through the incredible amount of output
that will be generated by similarity comparisons. The best method at
the moment is by using statistical significance. The quality of the
statistical model used will determine how much of the search space is
*productively* examined. This certainly cuts out much information
that is of biological significance, but *at present there is no
automated way of assessing biological significance*.

> People should worry more about the conditions relating to the
> particular problem and try to get experimental evidence about
> biologically relevant parameters. To emphasise the point about
> conditions, consider the old coin tossing problem. We all know that
> it comes up heads half the time and tails half the time. But does
> it... the coin rolled down the drain and the result was
> indeterminate. My friend has made a ballistic machine which tosses
> the coin so that the way it lands depends on which way it was
> placed on the machine before tossing.

We use the statistical methods and the parameter choices in
similarity searching to do precisely this, i.e. to make up for the
lack of time to get the experimental evidence about biologically
relevant parameters. No one has the time or the necessary expertise
in all 20,000 sequences in the nucleic acid databanks.

> How much more complex then are the conditions under which DNA
> evolves. Trying to improve our knowledge about that for specific
> gene families would be a good thing to attempt. A completely
> general model is too broad and naive to be useful, I suspect.

I'm not sure what "a completely general model" refers to here, but if
you mean a completely general model of statistical similarity of
genetic sequences: Yes, it would be naive, but not "too broad". It
would lack the biological knowledge, which is what the "too broad"
probably refers to. The quantification of knowledge is a risky
business.
In this context, biologists are not going to be unemployed for a
long, long time.

Given the concerns we have both stated, can you imagine how much fun
it is going to be to have complete "real" (mycoplasma & up) genomes
to analyze?

dan davison / theoretical biology / los alamos national laboratory

Date: 8-APR-1988 10:51:24 GMT
From: DBO%VAX.LEICESTER.AC.UK@CUNYVM.CUNY.EDU

Having seen Winston Hide's query on homology between random
sequences, I believe I may have a partial answer, albeit in a
somewhat simplified form.

The most basic method of obtaining percentage homology between 2
sequences is simply to line them up and count the matches. For random
DNA this will approach 25% with increasing length of the sequences
compared.

As there are 2 mutually exclusive events here, match and mismatch,
binomial probability theory is applicable, and I have therefore
calculated the percentage homology that is required for confidence
that the sequences are not random but are in fact homologous. This,
predictably, decreases with increasing sequence length. The figures I
arrived at are given below.

Sequence    Confidence at    Confidence at    Confidence at
Length      95% level        99% level        99.9% level

100         32.0%            35.0%            39.0%
200         30.0%            31.5%            32.0%

It was interesting to note that as the sequence length reached 300
the % homology required for confidence had dropped below 25%, and
this suggests that, as I am certain that I am applying the
probability formulae properly, my initial assumption of the homology
level of truly random DNA above is incorrect.

Perhaps if the % homology required for confidence could be expressed
as a function of sequence length (which I lack both the time and
inclination to do) it could be shown to converge on a limit as length
approaches infinity.
This would, I suspect, be close to what is required to answer the
original question.

As I am not a statistician by trade, please don't take all this
without a few pinches of salt, but if there are any statisticians
reading this, the comments they make on my dabbling in their field
should make interesting reading over the next few weeks!

Dave Booth
University of Reading
UK

-- 
dan davison/theoretical biology/t-10 ms k710/los alamos national laboratory
los alamos, nm 87545/dd@lanl.gov (arpa)/dd@lanl.uucp(new)/..cmcl2!lanl!dd
"I think, therefore I am confused"
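
As a rough check on the binomial calculation described in Dave Booth's
message above, here is a minimal sketch of how such thresholds could be
computed. It is not his method, which the post does not spell out. The
assumptions are mine: a single fixed, ungapped alignment, equal base
frequencies (match probability 0.25), and a one-sided exact binomial
tail. The function names binom_pmf and min_matches are made up for
illustration.

```python
import math

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space to avoid overflow."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_coeff + k * math.log(p) + (n - k) * math.log(1 - p))

def min_matches(n, alpha, p=0.25):
    """Smallest k with P(X >= k) <= alpha; returns n + 1 if no such k exists.

    p = 0.25 is the assumed chance of a match between two random bases.
    """
    tail = 0.0
    k_min = n + 1
    # Accumulate the upper tail from k = n downwards until it exceeds alpha.
    for k in range(n, -1, -1):
        tail += binom_pmf(n, k, p)
        if tail > alpha:
            break
        k_min = k
    return k_min

if __name__ == "__main__":
    # Percent identity needed at the 95%, 99% and 99.9% levels.
    for n in (100, 200, 300, 1000):
        row = [100.0 * min_matches(n, a) / n for a in (0.05, 0.01, 0.001)]
        print(n, " ".join(f"{pct:5.1f}%" for pct in row))
```

Under these assumptions the required percentage behaves roughly like
25% + z * sqrt(0.1875 / n), so it shrinks toward 25% from above as the
length n grows and should not fall below it. If that is right, a
threshold under 25% at length 300 would point to a slip in the
arithmetic rather than to a different baseline for random DNA. Note
also that this treats one fixed alignment only. Once the sequences are
slid against each other or searched against a library, the best match
is an extreme value, which is the harder problem raised earlier in the
thread.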