Path: utzoo!utgpu!water!watmath!uunet!ig!daemon
From: CMATHEWS.KRAMER@BIONET-20.ARPA
Newsgroups: bionet.molbio.evolution
Subject: [Jack Kramer  <CMATHEWS.KRAMER@BIONET-20.ARPA>: Re: [Dan Davison <GOAD.DAVISON@BIONET-20.ARPA>: Re: Statistical significance of "PERCENT" homology.]]
Message-ID: <6258@ig.ig.com>
Date: 13 May 88 16:56:48 GMT
Sender: daemon@presto.ig.com
Lines: 91

From: Jack Kramer  <CMATHEWS.KRAMER@BIONET-20.ARPA>

Mail-From: CMATHEWS.KRAMER created at 13-May-88 09:24:18
Date: Fri 13 May 88 09:24:18-PDT
From: Jack Kramer  <CMATHEWS.KRAMER@BIONET-20.ARPA>
Subject: Re: [Dan Davison <GOAD.DAVISON@BIONET-20.ARPA>: Re: Statistical significance of "PERCENT" homology.]
To: CMATHEWS.KRAMER@BIONET-20.ARPA
In-Reply-To: <12388310066.28.CMATHEWS.KRAMER@BIONET-20.ARPA>
Message-ID: <12398017522.38.CMATHEWS.KRAMER@BIONET-20.ARPA>

Dan,

	I finally had a chance to go back and review the Manske and
Chapman article.  As I thought, it is not along the lines of my interests.

	The idea I was trying to convey was that a single letter is a
very poor representation of an amino acid for the purposes of all but the
very simplest sequence analysis tasks.  Each amino acid is a real molecule
which can be described by a set of many chemical and physical
parameters.  Thus I feel that the best way to represent this is by using a
vector which is some linear combination of all these properties for each 
sequence element at the primary level.  At higher levels groups of
contiguous vectors can be used  to extend syntactic analysis.  Abstractions
of these which cluster with biological properties would then open the 
door to semantic analysis.
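
	A minimal sketch of the vector representation, in Python.  The choice
of three properties, the table name PROPERTIES, the helpers encode() and
distance(), and the weights are all illustrative assumptions, not a
published scheme.  Hydropathy values follow the Kyte-Doolittle scale;
masses are approximate average residue masses; charge is a nominal
side-chain charge near neutral pH.

```python
import math

# Illustrative table: residue -> (hydropathy, approx. residue mass in Da, charge).
# Hydropathy is the Kyte-Doolittle scale; His is treated as uncharged here.
PROPERTIES = {
    "A": (1.8, 71.1, 0),   "R": (-4.5, 156.2, 1),  "N": (-3.5, 114.1, 0),
    "D": (-3.5, 115.1, -1), "C": (2.5, 103.1, 0),  "Q": (-3.5, 128.1, 0),
    "E": (-3.5, 129.1, -1), "G": (-0.4, 57.1, 0),  "H": (-3.2, 137.1, 0),
    "I": (4.5, 113.2, 0),  "L": (3.8, 113.2, 0),   "K": (-3.9, 128.2, 1),
    "M": (1.9, 131.2, 0),  "F": (2.8, 147.2, 0),   "P": (-1.6, 97.1, 0),
    "S": (-0.8, 87.1, 0),  "T": (-0.7, 101.1, 0),  "W": (-0.9, 186.2, 0),
    "Y": (-1.3, 163.2, 0), "V": (4.2, 99.1, 0),
}

def encode(sequence):
    """Return one property vector per residue of a one-letter sequence."""
    return [PROPERTIES[residue] for residue in sequence.upper()]

def distance(u, v, weights=(1.0, 0.05, 1.0)):
    """Weighted Euclidean distance between two residue vectors.

    The weights are arbitrary; they only keep mass from dominating.
    """
    return math.sqrt(sum((w * (a - b)) ** 2 for w, a, b in zip(weights, u, v)))

ile, leu, lys = encode("ILK")
print(distance(ile, leu), distance(ile, lys))
```

	With letters alone, I, L and K are simply three different symbols.  In
this representation Ile and Leu lie close together while Ile and Lys lie
far apart, which is the information the single-letter code throws away.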
	To make the idea more concrete the concept is very similar to the
manual graphically assisted analysis performed by plotting several
parameters such as hydropathy, structural propensity predictors, mass, etc.
for the concerned sequences on the same scale and then visually comparing
the graphs to detect any pattern similarity.  The Modelevsky article in the
recent April edition of CABIOS presents another first step approach.
DeLisi's paper describes some efforts to use perceptrons as automated
learning machines to extend the concept to the pattern analysis of the 
biological semantics level.  Many papers have recently described initial
attempts at using multivariate statistical analysis of vector representations
of sequences.  I cite Gribskov, Kubota, and Sjostrom.  The Venn diagram
classification approach of Taylor could provide a basis for initial
weighting matrices for learning based on the primary sequence elements.
Another paper by the same Taylor (which I remember but don't have the 
citation at hand) describes another extension to abstract secondary
structure domains and the semantic database type analyses that would be
possible.
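
	The graphical comparison described above can be roughed out
numerically.  The sketch below, in Python, smooths one parameter
(Kyte-Doolittle hydropathy) with a sliding window and scores two profiles
with a Pearson correlation in place of comparing them by eye.  The names
KD, profile() and correlation(), the window of 7, and the example
sequences are hypothetical; the sequences are assumed to be already
aligned without gaps.  It is not the method of any of the papers cited.

```python
import math

# Kyte-Doolittle hydropathy scale, one value per residue.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def profile(sequence, scale=KD, window=7):
    """Sliding-window mean of a per-residue scale along a sequence."""
    values = [scale[residue] for residue in sequence.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def correlation(xs, ys):
    """Pearson correlation over the overlapping part of two profiles."""
    n = min(len(xs), len(ys))
    if n == 0:
        return 0.0
    xs, ys = xs[:n], ys[:n]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no pattern to correlate
    return sxy / math.sqrt(sxx * syy)

# Two made-up sequences that differ only by conservative substitutions.
first = profile("MKTLLVAGAVLLSDEKRQ")
second = profile("MRSIIVAGALIISEDRKN")
print(correlation(first, second))
```

	Repeating this for several scales at once gives one profile per
parameter, which is the multivariate picture the plots convey.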
	I believe the increasing availability of supercomputers and
vector and array processors to those working in these areas will make
these multivariate techniques the basis of the next generation of 
sequence analysis software.  (I'm sure that could elicit some response.)

Modelevsky and Akers(1988) 3-D Multivariate data display tool as a protein
design aid.  CABIOS 4:2 308 April 1988

DeLisi(1988)  Computers in Molecular Biology.  Science 240:47-52 April 1988

Gribskov et al  Profile analysis: Detection of distantly related proteins.
PNAS 84:4355-4358   July 1987

Kubota et al  Correspondence of homologies in amino acid sequence and
tertiary structure of protein molecules.  Biochim Biophys Acta 701:242-252
1982

Sjostrom and Wold(1987)  Signal peptide amino acid sequences in E. coli
contain information related to final protein localization. A multi-variate
data analysis.  EMBO Journal 6:3 823-831

Sjostrom and Wold(1985)  A multivariate study of the relationship between the
genetic code and the physical-chemical properties of amino acids.
J Mol Evol  22:272-277

	This list is by no means exhaustive but does provide a reasonable 
intro to the possibilities and (current) limitations of the ideas I meant to 
describe initially.

	If you feel that others may be interested please feel free to forward
this message to appropriate bboards.

Jack Kramer

PS  I had to add one comment on one of my pet peeves, the misuse of the
word homology as it applies to sequence comparisons.  The series of 
messages on the statistical significance of sequence comparison demonstrates
the confusion that results when "homology" is diluted to include analogy
and similarity and even more.  Here it is even worse because not only
are similarity, homology and analogy indiscriminately intermixed but the
two different levels of sequence comparison and the subsequent phylogeny
inference were not distinguished.  Maintaining these distinctions is absolutely
necessary for these discussions (and those in the literature) to make
sense across the related multidisciplinary fields.

et al et et al  "Homology" in Proteins and Nucleic Acids: A terminology 
muddle and a way out of it.  CELL 50:667 Aug 28, 1987
-------
-------