Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!dali.cs.montana.edu!uakari.primate.wisc.edu!sdd.hp.com!cs.utexas.edu!uunet!bionet!lhc!ncifcrf!fcs260c2!toms
From: toms@fcs260c2.ncifcrf.gov (Tom Schneider)
Newsgroups: bionet.molbio.genbank
Subject: Re: Software for automated subseqence extraction
Message-ID: <2147@fcs280s.ncifcrf.gov>
Date: 6 May 91 18:27:59 GMT
References: <2140@fcs280s.ncifcrf.gov>
Sender: news@ncifcrf.gov
Organization: NCI Supercomputer Facility, Frederick, MD
Lines: 107

In article kristoff@genbank.bio.net (David Kristofferson) writes:
>I ...
>must admit to not understanding your comment about the lack of a
>coordinate system.  For example, coding sequences are clearly
>annotated in the features table and one can extract these subsequences
>from an entry while also carrying along the annotations which refer to
>their position in the original sequence.  What do you mean by the lack
>of a coordinate system???

There are many ways to use a genetic sequence database.  Most people are
interested in a single sequence, and for this the current methods work
reasonably well.  However, more and more people are interested in
studying collections of sequences.  For example, we have a huge
collection of splice junctions.  To analyze these statistically, we
would like to extract only a minimal region around each junction.  If we
were to do this by hand, we would be likely to make errors, and the
process would be very tedious.  To avoid errors, we create a set of
instructions that define the regions we want to study.  We used the
feature table to make the instructions.

But what should the output of such an extraction look like?  Many years
ago, Jeff Haemer and I realized that the best form for the output of an
extraction is one identical to the form of the input!  Thus if I want
bases 57 through 89 of a particular GenBank entry, the most useful
output would look like a GenBank entry, but would contain only bases 57
through 89.
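The idea that the output of an extraction has the same form as its input
can be sketched in a few lines of Python.  This is only an illustration,
not Delila or any GenBank tool: the `Entry` record, the `extract`
function and the `gc_fraction` analysis are hypothetical names, and the
sequence is made up.  The point is that `extract` returns the same kind
of object it was given, so it can be applied again to its own output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a sequence database entry."""
    name: str
    sequence: str


def extract(entry: Entry, first: int, last: int) -> Entry:
    """Return bases first..last (1-based, inclusive) as a new Entry."""
    if not 1 <= first <= last <= len(entry.sequence):
        raise ValueError("requested range lies outside the entry")
    return Entry(f"{entry.name}:{first}-{last}", entry.sequence[first - 1:last])


def gc_fraction(entry: Entry) -> float:
    """Example analysis; it cannot tell a full entry from an extracted one."""
    return sum(base in "GC" for base in entry.sequence) / len(entry.sequence)


whole = Entry("EXAMPLE", "ACGT" * 25)   # made-up 100-base entry
piece = extract(whole, 57, 89)          # an Entry again, 33 bases long
smaller = extract(piece, 1, 10)         # so it can be extracted from again

print(len(piece.sequence), gc_fraction(whole), gc_fraction(piece))
```

Note that in this naive version `smaller` is addressed as 1 through 10 of
`piece`; the original numbers are already lost.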

The power of this is that it allows one to use the same search or other
analysis program on GenBank as one uses on a subset.  Using written sets
of instructions (instead of interactive input), one can automatically
create sub-databases and sub-sub-databases.  The subsets would be
equivalent in form to the main database.  For example, we created a
subset of E. coli sequences that were the transcribed RNA.  Further
extractions of ribosome binding sites, to create a sub-sub-database,
were therefore guaranteed to give us sequences that were always RNA.
These were the initial steps toward creating a database which we used to
train the Perceptron (a neural net) to locate ribosome binding sites
(Stormo, NAR 10: 2997, 1982).

In the present GenBank scheme, this means that the numbering of the
extracted fragment 57 through 89 would implicitly become 1 through
89-57+1 = 33.  If we made a nice printed listing of open reading frames
of the original entry, then we would have to keep doing subtractions to
find things in our sub-sequence.  If you have ever had to do this, you
know how painful it is.

So the idea came up that the extracted entry should carry a coordinate
system.  This is a set of numbers that defines the original number of
each base in the extracted sequence.  But if the extracted entries have
coordinate systems, then so too should the main library, in keeping with
the principle of equivalence between database and sub-databases.

To implement such a scheme today, we would have to add a coordinate
system to the extracted GenBank entries.  This is equivalent to carrying
along the annotations, but makes it more explicit.  A true coordinate
system does not depend on any 'features'.  With today's GenBank, we
would also have to have each analysis program check for a coordinate
system and, if none is found, assume that the numbering is 1 to n.  This
is possible, but is obviously a messy design, forced by the lack of an
explicitly defined coordinate system in the main database.
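
A sketch of what such a check might look like, again in Python with
hypothetical names (`Entry`, `coordinates`, `extract`); GenBank defines
no such field.  I assume a contiguous fragment, so that recording the
original number of the first base is enough to number every base.  An
entry with no recorded origin is treated as 1 to n.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical entry; `origin` is the original number of its first base."""
    sequence: str
    origin: Optional[int] = None   # None models an entry with no coordinate system


def coordinates(entry: Entry) -> range:
    """Numbers of the bases, falling back to 1..n when no origin is recorded."""
    start = 1 if entry.origin is None else entry.origin
    return range(start, start + len(entry.sequence))


def extract(entry: Entry, first: int, last: int) -> Entry:
    """Extract the bases numbered first..last in the entry's own coordinates."""
    coords = coordinates(entry)
    if first > last or first not in coords or last not in coords:
        raise ValueError("requested range lies outside the entry")
    offset = first - coords.start
    # The fragment keeps the original number of its first base.
    return Entry(entry.sequence[offset:offset + (last - first + 1)], origin=first)


main = Entry("ACGT" * 25)           # no coordinate system: numbered 1..100
fragment = extract(main, 57, 89)    # 33 bases that still answer to 57..89
inner = extract(fragment, 60, 65)   # addressed by original numbers, no subtraction

print(coordinates(fragment), coordinates(inner))
```

The `if entry.origin is None` branch is exactly the per-program check
described above: every analysis routine would need it for as long as the
main database has no explicit coordinate system.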

You might ask: why not simply implement this program check and be done
with it?  Well, if nothing else, having a coordinate system would allow
GenBank to extend an old sequence before base 1 without modifying any
other coordinates.  (There is nothing wrong with having a zero
coordinate.)

These ideas were implemented in the Delila system before GenBank came
into existence (NAR 10:3013, 1982; 12:129, 1984).  I don't expect
GenBank to write software, since that goal of GenBank was dropped for
political/funding (?) reasons many years ago.  However, GenBank should
be creating a database which is usable for many purposes.  The ability
to automatically create specialized databases is becoming more and more
important.  Unfortunately, it often means the creation of a completely
new database, rather than one extracted from the original database.

The trouble with absolute coordinate systems is that if two GenBank
entries fuse together, the numbering of at least one sequence must
change.  Any instructions that refer to the old numbers become out of
date.  The way to avoid this is to have landmarks on the sequence which
do not change.  For this reason I urged that every feature in GenBank
have a name.  I see that at least the latest entry I extracted does have
a name, but I don't know if this is true of all features (I suspect it
isn't).  If each feature had a unique name, then the instructions for
extracting fragments would remain the same.  For example, I could say:

  organism 'E. coli';
  chromosome 'main';
  gene lacZ;
  get from gene beginning -20 to gene beginning + 10;

This is pseudo-Delila code, since the names don't exist and the use of
quote marks is not implemented yet.  However, with the right database,
these instructions would last forever, since the names E. coli and lacZ
are universal and not likely to change.  The best names to use are the
currently accepted genetic names (since they are the most stable), but
provision must be made for using alternative names.
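
How instructions of this kind might be resolved against named landmarks
can be sketched as follows.  This is not Delila: the `Entry` record, the
`get_around` helper, the sequences and the positions given for lacZ are
all invented for illustration.  The sketch assumes each entry records
the number of its first base and the coordinate at which each named
feature begins.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Entry:
    """Hypothetical entry: sequence, number of its first base, named landmarks."""
    sequence: str
    origin: int
    landmarks: Dict[str, int]   # feature name -> coordinate of its beginning


def get_around(entry: Entry, name: str, before: int, after: int) -> Entry:
    """Return the fragment from landmark+before to landmark+after, inclusive."""
    anchor = entry.landmarks[name]
    first, last = anchor + before, anchor + after
    lo, hi = first - entry.origin, last - entry.origin
    if first > last or lo < 0 or hi >= len(entry.sequence):
        raise ValueError("requested range lies outside the entry")
    # Keep only the landmarks that fall inside the fragment.
    kept = {n: c for n, c in entry.landmarks.items() if first <= c <= last}
    return Entry(entry.sequence[lo:hi + 1], first, kept)


# Made-up entry with a feature named lacZ beginning at coordinate 31.
old = Entry("T" * 30 + "ATGACCATGATTACG" + "C" * 30, origin=1,
            landmarks={"lacZ": 31})
# A later release fuses 50 bases onto the front, shifting absolute numbers by 50.
new = Entry("G" * 50 + old.sequence, origin=1, landmarks={"lacZ": 81})

a = get_around(old, "lacZ", -20, +10)
b = get_around(new, "lacZ", -20, +10)
print(a.sequence == b.sequence, a.origin, b.origin)
```

The same instruction picks out the same 31 bases from both releases,
even though the absolute numbers of those bases differ between them.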

The fragment defined by these instructions would, of course, have
whatever numbering (coordinate system) the current database allowed, so
that one could compare the results from several different analyses.

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov