Path: utzoo!mnetor!uunet!yale!cmcl2!beta!dd
From: dd@beta.UUCP (Dan Davison)
Newsgroups: sci.bio
Subject: similarity searching; statistical significance
Message-ID: <18202@beta.UUCP>
Date: 20 Apr 88 05:25:27 GMT
Organization: Los Alamos Natl Lab, Los Alamos, N.M.
Lines: 250
Keywords: DNA, RNA, protein, statistical significance


The following discussion appeared recently on the MOLECULAR-EVOLUTION
bboard on the arpanet, and I thought sci.bio readers would be interested
in it.

dan davison theoretical biology t-10 ms k710 los alamos natl lab
los alamos nm 87545 dd@lanl.gov ...cmcl2!lanl!dd
-------------------------------------------------------------------
Date: Wed 30 Mar 88 17:47:28-PST
From: Winston Hide <LYAGER.HIDE@BIONET-20.ARPA>
Subject: Random sequence homology.

In all your learned opinions, what do you think the value is for %
homology of two pieces of completely random sequence of DNA?
I would appreciate it if you could state what you think the most
random value would be, and by what process.
Thank you for your indulgence.

From: Jack Kramer  <CMATHEWS.KRAMER@BIONET-20.ARPA>
Subject: Statistical significance of "PERCENT" homology.

Winston,

	Percent similarity can be used to imply homology (CELL 50:667 Aug 87)
as well as analogy.  The statistical significance of sequence similarity
has been extensively analysed. A couple of good places to start looking
are the appropriate chapters in von Heijne's "Sequence Analysis in Molecular
Biology", reviews and articles in Computer Applications in the Biosciences
(e.g. 3:1 Mar 87) and the Application of Computers to Research on Nucleic
Acids I, II and III supplements to NAR.

Jack

Date: Fri 1 Apr 88 18:58:17-PST
From: Dan Davison <GOAD.DAVISON@BIONET-20.ARPA>
Subject: statistical significance of "%" "homology" (acckkk gasp)

> From: Jack Kramer  <CMATHEWS.KRAMER@BIONET-20.ARPA>
> Subject: Statistical significance of "PERCENT" homology.
>
>     The statistical significance of sequence similarity has been
> extensively analysed.

I would add, and still not well understood.  There has been a lot of 
important work by Arratia, Waterman, and Galas, but I would not say 
that the problem is well understood.  The distribution of nucleic acid 
similarities is described by extreme value theory and the Erdos-Renyi
law (with shifts) but there is no well defined distribution
(a la binomial or Poisson) for them.  Yet.  
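
[A minimal sketch of what "with shifts" refers to, assuming independent,
equiprobable bases, which real DNA does not satisfy.  For two random
sequences of length n, the longest run of matches in one fixed alignment
grows roughly like log4(n); when every relative shift is allowed, the
longest run grows roughly like 2*log4(n).  The function names, the seed
and the length n are illustrative choices, not taken from the discussion.]

```python
import math
import random

def random_dna(n, rng):
    # Uniform, independent bases: an assumption of this sketch, not a claim about real DNA.
    return "".join("ACGT"[int(rng.random() * 4)] for _ in range(n))

def longest_run_fixed(a, b):
    # Longest run of consecutive matches with the sequences aligned position by position.
    best = cur = 0
    for x, y in zip(a, b):
        cur = cur + 1 if x == y else 0
        best = max(best, cur)
    return best

def longest_run_shifted(a, b):
    # Longest exact common substring, i.e. the best run of matches over all relative shifts.
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        row = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                row[j] = prev[j - 1] + 1
                best = max(best, row[j])
        prev = row
    return best

rng = random.Random(1988)  # arbitrary seed
n = 400                    # arbitrary length
a, b = random_dna(n, rng), random_dna(n, rng)
print("fixed alignment:", longest_run_fixed(a, b), "  log4(n) =", round(math.log(n, 4), 1))
print("all shifts:     ", longest_run_shifted(a, b), "  2*log4(n) =", round(2 * math.log(n, 4), 1))
```

[Single runs scatter around these values, which is where the extreme
value behaviour shows up.]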

I'd be glad to hear about any work in this field.

Date:       Wed, 06 Apr 88 13:51:12 BST
From:       MJB1%PHX.CAM.AC.UK@CUNYVM.CUNY.EDU

Surely the point here is that there are an infinite number of statistical
models of sequence similarity.  There is no problem in assigning significance
under a particular model, though there may well be a problem in assessing
its biological relevance.  I think the questions being asked should be
(1) What is a good model for the similarity of molecular sequences.
(2) How can one assess the biological relevance of statistical significance
in relation to a particular model.

Put in this way, one soon realises that the original problem has been framed
in too broad a way.  What are the conditions relating to the comparison,
surely not just that we have sequenced two bits of DNA and want to know
how similar they are (though it could be that if you insist).

People should worry more about the conditions relating to the particular
problem and try to get experimental evidence about biologically relevant
parameters.  To emphasise the point about conditions consider the old coin
tossing problem. We all know that we come up heads half the time and tails
half the time.  But do we... the coin rolled down the drain and the
result was indeterminate.  My friend has made a ballistic machine which
tosses the coin so that the way it lands depends which way it was placed
on the machine before tossing.

How much more complex then are the conditions under which DNA evolves.
Trying to improve our knowledge about that for specific gene families
would be a good thing to attempt.  A completely general model is
too broad and naive to be useful, I suspect.

Date: Wed 6 Apr 88 08:39:39-PDT
From: Jack Kramer  <CMATHEWS.KRAMER@BIONET-20.ARPA>
Subject: [Dan Davison <GOAD.DAVISON@BIONET-20.ARPA>: Re: Statistical significance of "PERCENT" homology.]

You might check out the method of Manske and Chapman, in the second issue
of the current volume of the J. of Molecular Evolution.  It's an interesting
attempt in the general direction you describe.  

Would it be OK to post your note to MOLECULAR-EVOLUTION?  I wouldn't mind
starting a fight or two.  I also think it's unlikely that *anything*
other than Dr. Who's Tardis would allow "the detection of evolutionary
relationships objectively".  Perhaps I'm too jaded from the SUNY
Stony Brook [phenetics vs. cladistics] wars.


Date: Thu 31 Mar 88 17:54:09-PST
From: Jack Kramer  <CMATHEWS.KRAMER@BIONET-20.ARPA>
Subject: Re: Statistical significance of "PERCENT" homology.

Dan,

	I strongly second your "not well understood".

	A significant reason for this, in my opinion, is the scalar
alphabet used as the basis for almost all studies.  I am very
interested in the assignment of multiple-coefficient attribute vectors
to the alphabet and "words" and the application of neural net semantic
pattern recognition AI techniques to this problem.  Some very
rudimentary work along these lines has been done by
Stormo and Wold et al.  The great flurry of current activity in
massively parallel fine grained hardware and software architectures will
eventually percolate into the molecular biology arena.  Most of the 
advances now being made along these lines in speech recognition will 
directly apply.  Elucidation of the syntactic and semantic patterns in
biomacromolecules should be a direct fallout from this work and if 
combined with cladistic clustering will finally allow the detection
of evolutionary relationships objectively.  (my opinions - and I love
to argue, especially about mol evol).

	I think it would be very interesting to participate in electronic
debate a la Farris/Felsenstein through this medium.  Anybody out there
want to start stirring things up?

Jack Kramer
Center for Gene Research
Oregon State University


Date: Thu, 7 Apr 88 12:31:50 MDT
From: dbd%benden@LANL.GOV (Dan Davison)


 Surely the point here is that there are an infinite number of statistical
 models of sequence similarity. 

Yes.  

 There is no problem in assigning significance
 under a particular model, though there may well be a problem in assessing
 its biological relevance. 

One of the most common mistakes I encounter among molecular biologists who
are looking at search results is "This result isn't statistically significant,
so I can ignore it".  Enhancer and TATA boxes are examples of statistically
insignificant matches that are biologically significant.  In these cases
there is an additional element--position--that determines the biological
significance.  Only the biologist (or an AI tool) can do such analysis.


 I think the questions being asked should be
 (1) What is a good model for the similarity of molecular sequences.

How about "what is a good model for assessing the similarity of molecular
sequences"? I think this is what you mean.

 (2) How can one assess the biological relevance of statistical significance
 in relation to a particular model.

 Put in this way, one soon realises that the original problem has been framed
 in too broad a way.  What are the conditions relating to the comparison,
 surely not just that we have sequenced two bits of DNA and want to know
 how similar they are (though it could be that if you insist).

As you can tell from my remarks above, I agree with this statement, with a
caveat.  I have next to my desk a printout 14 inches thick of 8 point type.
It is the result (accidental) of specifying too low a similarity criterion 
to a library search routine.  Suppose that this search was an enhancer core
against all of GenBank.  Every bit of that printout would be potentially
biologically significant.  However, it would take a month (or more) of effort
to check the biological significance of each result.  We must have ways of
sifting through the incredible amount of output that will be generated by
similarity comparisons.  The best method at the moment is by using statistical
significance.  The quality of the statistical model used will determine how
much of the search space is *productively* examined.  This certainly cuts
out much information that is of biological significance, but *at present there
is no automated way of assessing biological significance*.

  People should worry more about the conditions relating to the particular
  problem and try to get experimental evidence about biologically relevant
  parameters.  To emphasise the point about conditions consider the old coin
  tossing problem. We all know that we come up heads half the time and tails
  half the time.  But do we... the coin rolled down the drain and the
  result was indeterminate.  My friend has made a ballistic machine which
  tosses the coin so that the way it lands depends which way it was placed
   on the machine before tossing.

We use the statistical methods and the parameter choices in similarity
searching to do precisely this, i.e. to make up for the lack of time to
get the experimental evidence about biologically relevant parameters.
No one has the time or the necessary expertise in all 20,000 sequences
in the nucleic acid databanks.  
	
 How much more complex then are the conditions under which DNA evolves.
 Trying to improve our knowledge about that for specific gene families
 would be a good thing to attempt.  A completely general model is
 too broad and naive to be useful, I suspect.
	
I am not sure what "a completely general model" refers to here, but if you
mean a completely general model of statistical similarity of genetic 
sequences: Yes, it would be naive, but not "too broad".  It would lack
the biological knowledge, which is what the "too broad" probably refers
to.  The quantification of knowledge is a risky business.  In this context,
biologists are not going to be unemployed for a long, long time.

Given the concerns we have both stated, can you imagine how much fun it is
going to be to have complete "real" (mycoplasma & up) genomes to analyze?

dan davison / theoretical biology / los alamos national laboratory

Date:     8-APR-1988 10:51:24 GMT
From:     DBO%VAX.LEICESTER.AC.UK@CUNYVM.CUNY.EDU

Having seen Winston Hide's query on homology between random sequences I believe
I may have a partial answer albeit in a somewhat simplified form.  The most
basic method of obtaining percentage homology between 2 sequences is to
simply line them up and count the matches.  For random DNA this will approach
25% with increasing length of sequences compared.
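
[A minimal sketch of where the 25% figure comes from, assuming ungapped
comparison and independent bases.  The chance that two independently drawn
bases agree is the sum of the squared base frequencies, which is 0.25 when
all four bases are equally likely and higher for a skewed composition.
The AT-rich frequencies and all function names below are hypothetical,
chosen only for illustration.]

```python
import random

def expected_identity(freqs):
    # Chance that two independently drawn bases agree: sum of squared frequencies.
    return sum(p * p for p in freqs.values())

def observed_identity(a, b):
    # Ungapped comparison: fraction of aligned positions that match.
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n

def random_dna(n, rng):
    # Uniform, independent bases (an assumption of this sketch).
    return "".join("ACGT"[int(rng.random() * 4)] for _ in range(n))

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}  # hypothetical composition

print("expected, uniform bases:", expected_identity(uniform))
print("expected, AT-rich bases:", round(expected_identity(at_rich), 3))

rng = random.Random(0)  # arbitrary seed
for n in (100, 1000, 20000):
    a, b = random_dna(n, rng), random_dna(n, rng)
    print(n, round(observed_identity(a, b), 3))
```

[Short sequences scatter noticeably around the expected value; the scatter
shrinks as the length grows.]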

As there are 2 mutually exclusive events here, match and mismatch, binomial
probability theory is applicable and I have therefore calculated the
percentage homology that is required for confidence that the sequences are
not random but are in fact homologous.  This, predictably, decreases with
increasing sequence length.  The figures I arrived at are given below.

Sequence    % homology required for confidence at the
Length      95% level    99% level    99.9% level

100         32.0%        35.0%        39.0%
200         30.0%        31.5%        32.0%

It was interesting to note that as the sequence length reached 300 the %
homology required for confidence had dropped below 25% and this suggests that
as I am certain that I am applying the probability formulae properly, my
initial assumption of the homology level of truly random DNA above is incorrect.
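
[A minimal sketch of one way to set up such a calculation, assuming an
ungapped alignment in which each position matches independently with
probability 0.25.  It is not necessarily the procedure used for the figures
above.  It finds the smallest match count whose one-sided binomial tail
probability is at or below a chosen level.  The function names are
illustrative.]

```python
from math import comb

def upper_tail(n, k, p=0.25):
    # P(X >= k) for X ~ Binomial(n, p): chance of k or more matches in n positions.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_matches(n, alpha, p=0.25):
    # Smallest k with P(X >= k) <= alpha; n + 1 means no count reaches that level.
    for k in range(n + 1):
        if upper_tail(n, k, p) <= alpha:
            return k
    return n + 1

for n in (100, 200, 300):
    row = []
    for alpha in (0.05, 0.01, 0.001):
        k = min_matches(n, alpha)
        row.append(f"{100 * k / n:.1f}%")
    print(n, *row)
```

[Under these assumptions the required percentage falls as the length grows
but stays above 25% for any level below one half, because the tail
probability at the expected number of matches is roughly one half.]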

Perhaps if the % homology required for confidence could be expressed as a
function of sequence length (which I lack both the time and inclination to do)
it could be shown to converge on a limit as length approaches infinity.  This
would, I suspect, be close to what is required to answer the original question.
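
[A sketch of such a function, using the normal approximation to the
binomial and assuming independent positions with match probability p = 0.25.
Here h_alpha(n) is the fraction of matches required at one-sided level
alpha for sequence length n, and z_alpha is the standard normal quantile
(about 1.645, 2.326 and 3.090 for the 95%, 99% and 99.9% levels).  The
symbols are introduced for this illustration.]

```latex
h_\alpha(n) \;\approx\; p + z_\alpha \sqrt{\frac{p(1-p)}{n}},
\qquad
\lim_{n \to \infty} h_\alpha(n) = p = 0.25
```

[For n = 100 at the 95% level this gives about 0.25 + 1.645 * 0.0433 = 0.32.
The excess over 25% shrinks in proportion to 1/sqrt(n).]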

As I am not a statistician by trade please don't take all this without a few
pinches of salt but if there are any statisticians reading this the comments
they make on my dabbling in their field should make interesting reading over
the next few weeks!

Dave Booth  University of Reading UK

-- 
dan davison/theoretical biology/t-10 ms k710/los alamos national laboratory
los alamos, nm 87545/dd@lanl.gov (arpa)/dd@lanl.uucp(new)/..cmcl2!lanl!dd
"I think, therefore I am confused"