Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!bionet!FCRFV1.NCIFCRF.GOV!gribskov
From: gribskov@FCRFV1.NCIFCRF.GOV ("Gribskov, Michael")
Newsgroups: bionet.molbio.genbank
Subject: consensus sequences, motifs, and patterns
Message-ID: <9102071611.AA03799@genbank.bio.net>
Date: 7 Feb 91 16:07:00 GMT
Sender: daemon@genbank.bio.net
Lines: 39

In response to recent questions about whether GenBank and PIR should establish and maintain entries that correspond to motifs and patterns in sequences:

My opinion is that it is now well established that consensus sequences, that is, the representation of a pattern as a single sequence with the majority or plurality residue/base at each position, are an extremely poor way to represent patterns. To this extent I agree with Tom Schneider.

I think that a database of patterns represented as alignments or weight matrices would be VERY valuable. Such a database would be a great improvement on the current situation where, at best, you have to search the database to find each sequence that references a given motif, and then construct alignments de novo. In cases where keywords and terminology have not been standardized this is quite difficult. The superfamily searching mechanism of PSQ is of course valuable, but is not perfectly adapted to describing patterns that may be much smaller than the entire sequence, and hence difficult to locate. In addition, many protein sequences show enough divergence that even when you know a motif is present, it is still difficult to get them correctly aligned.

However, you have to keep in mind that we have not yet heard the final word on the best way to represent sequence patterns. I think that it is therefore critical that any set of patterns maintain a set of pointers that directly enable you to access the original sequences used to derive the pattern.
PROSITE is a good example of how this might be done, although for protein motifs it would be nice if there were a good compact encoding that could be used to reconstruct aligned sequences.

I guess it is clear that what I'm suggesting is not to add this information to the existing sequence entries, but to have either a separate section or a distinct kind of entry. This would seem to be especially appropriate since it is clearly derived information and will require a lot of judgement calls in defining the patterns.

Michael Gribskov
gribskov@ncifcrf.gov
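
To make the contrast between a consensus sequence and a weight matrix concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the aligned sequences and their identifiers (SEQ1 to SEQ5) are made up, and the log-odds scoring with a pseudocount and a uniform background is just one common way to build a weight matrix, not a method taken from the post above. The pattern keeps its source identifiers alongside the alignment, in the spirit of keeping pointers back to the original sequences.

```python
from collections import Counter
import math

ALPHABET = "ACGT"

# Hypothetical identifiers and aligned motif instances (invented data).
SOURCES = {
    "SEQ1": "TATAAT",
    "SEQ2": "TATGAT",
    "SEQ3": "TACAAT",
    "SEQ4": "GATACT",
    "SEQ5": "TATATT",
}
ALIGNED = list(SOURCES.values())


def column_counts(aligned):
    """Count each base at each column of the alignment."""
    return [Counter(col) for col in zip(*aligned)]


def consensus(aligned):
    """Plurality base at each column; ties go to the earlier base in ALPHABET."""
    return "".join(
        max(ALPHABET, key=lambda b: (counts[b], -ALPHABET.index(b)))
        for counts in column_counts(aligned)
    )


def weight_matrix(aligned, pseudocount=1.0, background=0.25):
    """Log2-odds weight per base per column (uniform background assumed)."""
    total = len(aligned) + pseudocount * len(ALPHABET)
    return [
        {b: math.log2((counts[b] + pseudocount) / total / background)
         for b in ALPHABET}
        for counts in column_counts(aligned)
    ]


def score(matrix, seq):
    """Sum of the column weights for the bases in seq."""
    return sum(col[b] for col, b in zip(matrix, seq))


def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    cons = consensus(ALIGNED)
    matrix = weight_matrix(ALIGNED)
    print("consensus:", cons)
    # Both candidates differ from the consensus at exactly one position,
    # but only the first substitution is ever seen in the alignment.
    for candidate in ("TATGAT", "TCTAAT"):
        print(candidate, mismatches(cons, candidate),
              round(score(matrix, candidate), 3))
```

Both candidates are one mismatch away from the consensus, so the consensus alone cannot tell them apart. The weight matrix scores the first higher, because its substitution occurs at a column that varies in the alignment, while the second changes a column that is invariant in these (invented) data.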