Path: utzoo!utgpu!water!watmath!uunet!ig!daemon
From: CMATHEWS.KRAMER@BIONET-20.ARPA
Newsgroups: bionet.molbio.evolution
Subject: [Jack Kramer : Re: [Dan Davison : Re: Statistical significance of "PERCENT" homology.]]
Message-ID: <6258@ig.ig.com>
Date: 13 May 88 16:56:48 GMT
Sender: daemon@presto.ig.com
Lines: 91

From: Jack Kramer
Mail-From: CMATHEWS.KRAMER created at 13-May-88 09:24:18
Date: Fri 13 May 88 09:24:18-PDT
From: Jack Kramer
Subject: Re: [Dan Davison : Re: Statistical significance of "PERCENT" homology.]
To: CMATHEWS.KRAMER@BIONET-20.ARPA
In-Reply-To: <12388310066.28.CMATHEWS.KRAMER@BIONET-20.ARPA>
Message-ID: <12398017522.38.CMATHEWS.KRAMER@BIONET-20.ARPA>

Dan,

I finally had a chance to go back and review the Manske and Chapman article. As I thought, it is not along the lines of my interests. The idea I was trying to convey was that a single letter is a very poor representation of an amino acid for all but the very simplest sequence analysis tasks. Each amino acid is a real molecule which can be described by a set of many chemical and physical parameters. Thus I feel that the best way to represent a sequence at the primary level is with a vector for each sequence element, formed as some linear combination of all these properties. At higher levels, groups of contiguous vectors can be used to extend syntactic analysis. Abstractions of these which cluster with biological properties would then open the door to semantic analysis.

To make the idea more concrete: the concept is very similar to the manual, graphically assisted analysis performed by plotting several parameters such as hydropathy, structural propensity predictors, mass, etc. for the sequences concerned on the same scale and then visually comparing the graphs to detect any pattern similarity. The Modelevsky article in the recent April issue of CABIOS presents another first-step approach.
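
For anyone who wants to try the idea, here is a minimal sketch of the "plot several parameters and compare the graphs" procedure described above. It is only an illustration under stated assumptions, not the method of any of the cited papers: the two properties (hydropathy and residue mass), the approximate textbook values in the table, the window size, and the function names are all my own choices, and the two example sequences are made up.

```python
import math

# Assumed property table: (Kyte-Doolittle hydropathy, average residue mass in Da).
# Values are approximate and given for illustration; check them before real use.
PROPERTIES = {
    "A": (1.8, 71.08), "R": (-4.5, 156.19), "N": (-3.5, 114.10), "D": (-3.5, 115.09),
    "C": (2.5, 103.14), "Q": (-3.5, 128.13), "E": (-3.5, 129.12), "G": (-0.4, 57.05),
    "H": (-3.2, 137.14), "I": (4.5, 113.16), "L": (3.8, 113.16), "K": (-3.9, 128.17),
    "M": (1.9, 131.19), "F": (2.8, 147.18), "P": (-1.6, 97.12), "S": (-0.8, 87.08),
    "T": (-0.7, 101.10), "W": (-0.9, 186.21), "Y": (-1.3, 163.18), "V": (4.2, 99.13),
}


def encode(sequence):
    """Replace each one-letter code with its property vector."""
    return [PROPERTIES[residue] for residue in sequence.upper()]


def smooth(vectors, window=3):
    """Average each property over a sliding window of contiguous residues."""
    if window < 1 or window > len(vectors):
        raise ValueError("window must be between 1 and the sequence length")
    n_props = len(vectors[0])
    profile = []
    for start in range(len(vectors) - window + 1):
        chunk = vectors[start:start + window]
        profile.append(tuple(sum(v[k] for v in chunk) / window for k in range(n_props)))
    return profile


def profile_correlation(a, b, k):
    """Pearson correlation of property k between two equal-length profiles.

    Returns 0.0 when either profile is constant in that property.
    """
    xs = [v[k] for v in a]
    ys = [v[k] for v in b]
    if len(xs) != len(ys):
        raise ValueError("profiles must have the same length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Two made-up peptides that differ at two positions.
    first = smooth(encode("MKTAYIAKQR"))
    second = smooth(encode("MKSAYLAKQR"))
    for k, name in enumerate(("hydropathy", "mass")):
        print(name, round(profile_correlation(first, second, k), 3))
```

The correlation here merely stands in for the visual comparison of the plotted curves; it assumes the two sequences are already aligned and of equal length, which a real analysis could not take for granted.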

DeLisi's paper describes some efforts to use perceptrons as automated learning machines to extend the concept to pattern analysis at the level of biological semantics. Many papers have recently described initial attempts at using multivariate statistical analysis of vector representations of sequences. I cite Gribskov, Kubota, and Sjostrom. The Venn diagram classification approach of Taylor could provide a basis for initial weighting matrices for learning based on the primary sequence elements. Another paper by the same Taylor (which I remember but don't have the citation at hand) describes another extension to abstract secondary structure domains and the semantic database type analyses that would be possible.

I believe the increasing availability of supercomputers and vector and array processors to those working in these areas will make these multivariate techniques the basis of the next generation of sequence analysis software. (I'm sure that could elicit some response.)

Modelevsky and Akers (1988) 3-D Multivariate data display tool as a protein design aid. CABIOS 4:2 308 April 1988

DeLisi (1988) Computers in Molecular Biology. Science 240:47-52 April 1988

Gribskov et al. Profile analysis: Detection of distantly related proteins. PNAS 84:4355-4358 July 1987

Kubota et al. Correspondence of homologies in amino acid sequence and tertiary structure of protein molecules. Biochim Biophys Acta 701:242-252 1982

Sjostrom and Wold (1987) Signal peptide amino acid sequences in E. coli contain information related to final protein localization. A multivariate data analysis. EMBO Journal 6:3 823-831

Sjostrom and Wold (1985) A multivariate study of the relationship between the genetic code and the physical-chemical properties of amino acids. J Mol Evol 22:272-277

This list is by no means exhaustive, but it does provide a reasonable introduction to the possibilities and (current) limitations of the ideas I meant to describe initially.
If you feel that others may be interested, please feel free to forward this message to appropriate bboards.

Jack Kramer

PS I had to add one comment on one of my pet peeves, the misuse of the word homology as it applies to sequence comparisons. The series of messages on the statistical significance of sequence comparison demonstrates the confusion that results when "homology" is diluted to include analogy and similarity and even more. Here it is even worse, because not only are similarity, homology and analogy indiscriminately intermixed, but the two different levels of sequence comparison and the subsequent phylogeny inference were not distinguished. Maintaining these distinctions is absolutely necessary for these discussions (and those in the literature) to make sense across the related multidisciplinary fields.

et al. "Homology" in proteins and nucleic acids: A terminology muddle and a way out of it. CELL 50:667 Aug 28, 1987
-------
-------