Path: utzoo!utgpu!watserv1!watmath!uunet!bionet!lhc!ncifcrf!fcs260c2!toms
From: toms@fcs260c2.ncifcrf.gov (Tom Schneider)
Newsgroups: bionet.molbio.genbank
Subject: Re: consensus sequences, motifs, and patterns
Message-ID: <2045@fcs280s.ncifcrf.gov>
Date: 8 Feb 91 18:00:59 GMT
References: <9102071611.AA03799@genbank.bio.net>
Sender: news@ncifcrf.gov
Organization: NCI Supercomputer Facility, Frederick, MD
Lines: 34

In article <9102071611.AA03799@genbank.bio.net>
gribskov@FCRFV1.NCIFCRF.GOV ("Gribskov, Michael") writes:

I agree with Michael for the most part. The only point where we
differ is:

>...since it is clearly derived information and
>will require a lot of judgement calls in defining the patterns.

I think that a list of acceptable criteria can be drawn up to define
the locations. For example, CAP (CRP) binding sites can be defined
genetically, by DNase footprinting, by methylation protection or
interference, and by ethylation phosphate blockage. Splice junctions
are well defined (for the most part) by comparing DNA to spliced RNA
sequences. The only tricky step is the final alignment, but in most
cases this can be done closely enough not to be a problem, and the
sites could be realigned by a researcher if desired. Giving
approximate alignments would begin to address the problem Mike brings
up about difficulty in aligning.

So I don't think that judgement calls are so important - simply list
the kind of data used to define the location. In general the sequence
data would be used only at the last step to get the exact location.
(See my previous note about being fooled as to what is a pattern.)
Clearly, using sequence data alone is not a good idea at this stage.
That is, purely derived information should either be kept out of the
database or be marked as such. Then a smart program or researcher
could simply ignore the guesses.

As Mike points out, the last word on site definitions is not in yet.
So we must store pointers to the original raw data locations in the
database.
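
To make the suggestion concrete, here is a rough sketch of what such a site entry might look like to a program. It is only an illustration: the record layout, the field names, the evidence categories as an enumeration, and the example sites and coordinates are all hypothetical, not an existing GenBank format. Each site lists the kinds of data that define it, carries pointers to the raw data, and a site supported by sequence pattern alone is flagged as derived so it can be skipped.

```python
from dataclasses import dataclass, field
from enum import Enum


class Evidence(Enum):
    """Kinds of data that can define a site (hypothetical category list)."""
    GENETIC = "genetic"
    DNASE_FOOTPRINT = "DNase footprinting"
    METHYLATION = "methylation protection or interference"
    ETHYLATION = "ethylation phosphate blockage"
    RNA_COMPARISON = "comparison of DNA to spliced RNA"
    SEQUENCE_ONLY = "sequence pattern alone"  # purely derived


@dataclass
class SiteAnnotation:
    """One annotated site; all field names are illustrative assumptions."""
    name: str
    start: int
    end: int
    evidence: set                     # set of Evidence members
    approximate: bool = True          # alignment may be refined later
    sources: list = field(default_factory=list)  # pointers to raw data

    @property
    def is_derived(self) -> bool:
        # Derived means no experimental evidence is listed: the set is
        # empty or contains nothing but SEQUENCE_ONLY.
        return self.evidence <= {Evidence.SEQUENCE_ONLY}


def experimentally_supported(sites):
    """Return only the sites backed by at least one experimental method."""
    return [s for s in sites if not s.is_derived]


# Placeholder entries, not real database records.
sites = [
    SiteAnnotation("site_A", 100, 121,
                   {Evidence.GENETIC, Evidence.DNASE_FOOTPRINT},
                   sources=["placeholder-ref-1"]),
    SiteAnnotation("site_B", 300, 321, {Evidence.SEQUENCE_ONLY}),
]

print([s.name for s in experimentally_supported(sites)])  # ['site_A']
```

Marking the guesses rather than dropping them keeps them available to anyone who wants them, while a program that wants only experimentally defined sites can filter on the flag.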
>Michael Gribskov
>gribskov@ncifcrf.gov

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland 21702-1201
  toms@ncifcrf.gov