Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!bionet!FCRFV1.NCIFCRF.GOV!gribskov
From: gribskov@FCRFV1.NCIFCRF.GOV ("Gribskov, Michael")
Newsgroups: bionet.molbio.genbank
Subject: consensus sequences, motifs, and patterns
Message-ID: <9102071611.AA03799@genbank.bio.net>
Date: 7 Feb 91 16:07:00 GMT
Sender: daemon@genbank.bio.net
Lines: 39

In response to recent questions about whether GenBank and PIR should
establish and maintain entries that correspond to motifs and patterns in
sequences: 

My opinion is that it is now well established that consensus sequences,
that is, the representation of a pattern as a single sequence with the
majority or plurality residue/base at each position, are an extremely
poor way to represent patterns.  To this extent I agree with Tom
Schneider.
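
To make the point concrete, here is a minimal sketch in Python.  The
five-sequence alignment is invented for illustration and is not taken
from any database; the function names are likewise just placeholders.
It collapses the alignment to a plurality consensus and also keeps the
per-position counts that the consensus throws away.

```python
from collections import Counter

# Toy alignment, invented for illustration (not real database entries).
alignment = [
    "TATAAT",
    "TATGAT",
    "TACAAT",
    "GATACT",
    "TATATT",
]

def column_counts(seqs):
    """Count each base at each aligned position."""
    return [Counter(col) for col in zip(*seqs)]

def consensus(seqs):
    """Plurality base per column; ties go to the alphabetically first base."""
    return "".join(
        min(c, key=lambda b: (-c[b], b)) for c in column_counts(seqs)
    )

counts = column_counts(alignment)
print(consensus(alignment))
for position, c in enumerate(counts, start=1):
    # Positions 2 and 5 both read "A" in the consensus, yet one is
    # invariant (5 of 5) and the other is only a 3-of-5 plurality.
    print(position, dict(sorted(c.items())))
```

The consensus string treats every position as equally certain, while
the counts (the raw material for a weight matrix) still show which
positions are conserved and which are variable.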

I think that a database of patterns represented as alignments or weight
matrices would be VERY valuable. Such a database would be a great
improvement on the current situation where, at best, you have to search
the database to find each sequence that references a given motif, and
then construct alignments de novo. In cases where keywords and
terminology have not been standardized, this is quite difficult.  The
superfamily searching mechanism of PSQ is of course valuable, but is not
perfectly adapted to describing patterns that may be much smaller than
the entire sequence, and hence difficult to locate.  In addition, many
protein sequences show enough divergence that even when you know a motif
is present, it is still difficult to get them correctly aligned.

However, you have to keep in mind that we have not yet heard the final
word on the best way to represent sequence patterns.  I think that it is
therefore critical that any set of patterns maintain a set of pointers
that directly enable you to access the original sequences used to derive
the pattern.  PROSITE is a good example of how this might be done,
although for protein motifs it would be nice if there were a good compact
encoding that could be used to reconstruct aligned sequences.
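
As a rough illustration of what such a pointer-based encoding could
look like, the Python sketch below stores, for each member of a
pattern, only an identifier for the source entry, the start position of
the motif, and the alignment columns where that member has a gap.  This
is one hypothetical layout, not the PROSITE format; the Member fields,
the identifiers SEQ_A and SEQ_B, and the sequences are all made up for
the example.

```python
from dataclasses import dataclass

@dataclass
class Member:
    accession: str   # pointer to the source entry (placeholder IDs here)
    start: int       # 1-based position of the motif in that sequence
    gaps: tuple = () # 0-based alignment columns where this member has a gap

def aligned(member, sequences, width):
    """Rebuild one aligned row of the motif from the source sequence."""
    seq = sequences[member.accession]
    pos = member.start - 1
    row = []
    for col in range(width):
        if col in member.gaps:
            row.append("-")
        else:
            row.append(seq[pos])
            pos += 1
    return "".join(row)

# Stand-in for the sequence database; both entries are invented.
sequences = {"SEQ_A": "MKGAGKSTLLA", "SEQ_B": "TTGSGKTALQ"}

# The pattern entry itself holds pointers and offsets, not residues.
motif = [Member("SEQ_A", 3), Member("SEQ_B", 3, (1,))]

rows = [aligned(m, sequences, 6) for m in motif]
for row in rows:
    print(row)
```

Because the entry holds only pointers and offsets, the alignment can be
regenerated from the current sequence entries, and a different pattern
representation can be derived later from the same set of sequences.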


I guess it is clear that what I'm suggesting is not to add this
information to the existing sequence entries, but to have either a
separate section or a distinct kind of entry.  This would seem to be
especially appropriate since it is clearly derived information and
will require a lot of judgement calls in defining the patterns.

Michael Gribskov
gribskov@ncifcrf.gov