Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!dali.cs.montana.edu!uakari.primate.wisc.edu!sdd.hp.com!cs.utexas.edu!uunet!bionet!lhc!ncifcrf!fcs260c2!toms
From: toms@fcs260c2.ncifcrf.gov (Tom Schneider)
Newsgroups: bionet.molbio.genbank
Subject: Re: Software for automated subseqence extraction
Message-ID: <2147@fcs280s.ncifcrf.gov>
Date: 6 May 91 18:27:59 GMT
References: <CMM.0.88.672964473.kristoff@genbank.bio.net> <2140@fcs280s.ncifcrf.gov> <May.1.10.08.02.1991.16403@genbank.bio.net>
Sender: news@ncifcrf.gov
Organization: NCI Supercomputer Facility, Frederick, MD
Lines: 107

In article <May.1.10.08.02.1991.16403@genbank.bio.net> kristoff@genbank.bio.net
(David Kristofferson) writes:
>I
...
>must admit to not understanding your comment about the lack of a
>coordinate system.  For example, coding sequences are clearly
>annotated in the features table and one can extract these subsequences
>from an entry while also carrying along the annotations which refer to
>their position in the original sequence.  What do you mean by the lack
>of a coordinate system???

There are many ways to use a genetic sequence database.  Most people are
interested in a single sequence, and for this the current methods work
reasonably well.  However, more and more people are interested in studying
collections of sequences.  For example, we have a huge collection of splice
junctions.  To analyze these statistically, we would like to extract only a
minimum region around the junctions.  If we were to do this by hand, then we
would be likely to make errors, and the process would be very tedious.  To
avoid errors, we create a set of instructions that define the regions we want
to study.  We used the feature table to make the instructions.  But what should
the output of such an extraction look like?

Many years ago, Jeff Haemer and I realized that the best form for the output
of an extraction is one identical to the input!  Thus if I want bases 57 through
89 of a particular GenBank entry, the most useful output would look like a
GenBank entry, but would only contain bases 57 through 89.
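
To make the "output looks like input" idea concrete, here is a minimal sketch
in Python.  It is not Delila code and not the GenBank format: the Entry class,
extract() and count_gc() are hypothetical names invented for illustration, and
the sequence is made up.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for a database entry (not the GenBank format)."""
    name: str
    bases: str


def extract(entry: Entry, first: int, last: int) -> Entry:
    """Return bases first..last (1-based, inclusive) as a new Entry."""
    if not 1 <= first <= last <= len(entry.bases):
        raise ValueError("requested range lies outside the entry")
    return Entry(f"{entry.name}[{first}..{last}]", entry.bases[first - 1:last])


def count_gc(entry: Entry) -> int:
    """An example analysis that accepts any Entry, whole or extracted."""
    return sum(base in "GC" for base in entry.bases)


whole = Entry("EXAMPLE", "ACGT" * 25)   # 100 made-up bases
piece = extract(whole, 57, 89)          # still an Entry
smaller = extract(piece, 1, 10)         # a sub-sub-entry, same form again
print(len(piece.bases), count_gc(whole), count_gc(piece))
```

Because extract() returns the same kind of object it was given, count_gc()
needs no special case for the whole entry, the piece, or a piece of a piece.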

The power of this is that it allows one to use the same search or other
analysis program on GenBank as one uses on a subset.  Using written sets of
instructions (instead of interactive input), one can automatically create
sub-databases and sub-sub databases.  The subsets would be equivalent to the
main database.  For example, we created a subset of E. coli sequences that were
the transcribed RNA.  Further extractions of ribosome binding sites, to create a
sub-sub-database, were therefore guaranteed to give us sequences that were
always RNA.  These were the initial steps toward creating a database which we
used to train the Perceptron (a neural net) to locate ribosome binding sites
(Stormo, NAR 10: 2997, 1982).

In the present GenBank scheme, this means that the numbering of the extracted
fragment 57 through 89 would implicitly become 1 through 89-57+1 = 33.  If we
made a nice printed listing of open reading frames of the original entry, then
we would have to keep doing subtractions to find things in our sub-sequence.
If you have ever had to do this, you know how painful it is.

So the idea came up that the extracted entry should carry a coordinate system.
This is a set of numbers that defines the original number of each base in the
extracted sequence.

But if the extracted entries have coordinate systems, then so too should the
main library, in keeping with the principle of equivalence between database
and sub-databases.
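
A small Python sketch of a fragment that carries its coordinate system may
help here.  It assumes the simplest case, a single contiguous piece, so that
storing the original number of the first base is enough; the Numbered class
and its fields are hypothetical and do not describe how Delila stores things.

```python
from dataclasses import dataclass


@dataclass
class Numbered:
    """Bases plus the original number of the first base (hypothetical layout)."""
    bases: str
    start: int = 1   # a full entry simply starts at 1

    @property
    def end(self) -> int:
        """Original number of the last base."""
        return self.start + len(self.bases) - 1

    def extract(self, first: int, last: int) -> "Numbered":
        """Extract first..last, both given in ORIGINAL coordinates."""
        if not self.start <= first <= last <= self.end:
            raise ValueError("range outside this piece")
        lo = first - self.start
        return Numbered(self.bases[lo:lo + (last - first + 1)], first)


entry = Numbered("ACGT" * 25)        # 100 made-up bases, numbered 1..100
fragment = entry.extract(57, 89)     # 89 - 57 + 1 = 33 bases, still numbered 57..89
inner = fragment.extract(60, 70)     # no subtraction needed by the user
```

The main entry and the fragments are the same kind of object; the only
difference is the value of start, so no subtractions are left to the reader.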

To implement such a scheme today, we would have to add a coordinate system to
the extracted GenBank entries.  This is equivalent to carrying along the
annotations, but makes it more explicit.  A true coordinate system does not
depend on any 'features'.  With today's GenBank, we would also have to have
each analysis program check for a coordinate system, and if it is not found,
assume that the numbering is 1 to n.  This is possible, but is obviously a
messy design, forced by the lack of an explicitly defined coordinate system in
the main database.
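
The check described above might look roughly like this in Python.  The entry
is represented as a plain dictionary and the field name "coordinate_start" is
an assumption made for this sketch; nothing of the kind exists in GenBank.

```python
def coordinates(entry: dict) -> range:
    """Return the original base numbers, falling back to 1..n if none are given."""
    n = len(entry["sequence"])
    start = entry.get("coordinate_start")   # hypothetical optional field
    if start is None:
        return range(1, n + 1)              # plain entry: assume 1 to n
    return range(start, start + n)          # extracted entry: keep its numbers


plain = {"sequence": "ACGT"}
extracted = {"sequence": "ACGT", "coordinate_start": 57}
print(list(coordinates(plain)), list(coordinates(extracted)))
```

Every analysis program would need this fallback, which is the messiness
complained about above.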

You might ask: why not simply implement this program check and be done with
it?  Well, if nothing else, having a coordinate system would allow GenBank to
extend an old sequence before base 1 and not modify any other coordinates.
(There is nothing wrong with having a zero coordinate.)
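
As an illustration of extending before base 1, here is a short Python sketch.
The function names and the bases are made up; the only point is that the old
numbers stay put while the new bases receive zero and negative coordinates.

```python
def extend_left(bases: str, start: int, new_bases: str):
    """Prepend new_bases and return (bases, start) with old numbers unchanged."""
    return new_bases + bases, start - len(new_bases)


def base_at(bases: str, start: int, coordinate: int) -> str:
    """Look up one base by its coordinate (zero and negatives allowed)."""
    index = coordinate - start
    if not 0 <= index < len(bases):
        raise IndexError("coordinate outside the sequence")
    return bases[index]


old_bases, old_start = "ACGT", 1
new_bases, new_start = extend_left(old_bases, old_start, "TTG")
print(new_start, base_at(new_bases, new_start, 0), base_at(new_bases, new_start, 1))
```

Any instruction written against the old entry, such as "bases 1 through 4",
still picks out the same bases after the extension.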

These ideas were implemented in the Delila system before GenBank came
into existence (NAR 10:3013, 1982; 12:129, 1984).

I don't expect GenBank to write software, since that goal of GenBank was
dropped for political/funding (?) reasons many years ago.  However, GenBank
should be creating a database which is useable for many purposes.  The ability
to automatically create specialized databases is becoming more and more
important.  Unfortunately, building one today often means creating a completely
new database, rather than extracting one from the original database.

The trouble with absolute coordinate systems is that if two GenBank entries
fuse together, the numbering of at least one sequence must change.  Any
instructions become out of date.  The way to avoid this is to have landmarks on
the sequence which do not change.  For this reason I urged that every feature
in GenBank have a name.  I see that at least the latest entry I extracted does
have a name, but I don't know if this is true of all features (I suspect it
isn't).  If each feature had a unique name, then the instructions for
extracting fragments would remain the same.

For example, I could say:

organism 'E. coli';  chromosome 'main';
gene lacZ;
get from gene beginning -20 to gene beginning + 10;

This is pseudo-Delila code since the names don't exist and the use of quote
marks is not implemented yet.  However, with the right database, these
instructions would last forever since the names E. coli and lacZ are universal
and not likely to change.  The best names to use are the currently accepted
genetic names (since they are the most stable), but provision must be made for
using alternative names.

The fragment defined by these instructions would, of course, have whatever
numbering (coordinate system) the current database allowed, so that one could
compare the results from several different analyses.
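
To show why a named landmark keeps instructions valid, here is a rough Python
sketch of "get from gene beginning -20 to gene beginning +10".  The function,
the landmark table, and the bases are all invented for illustration; this is
not how Delila resolves names.

```python
def get_around(bases, start, landmarks, name, before, after):
    """Return (fragment, first_coordinate) for landmark-before .. landmark+after."""
    anchor = landmarks[name]                 # current coordinate of the landmark
    first, last = anchor - before, anchor + after
    lo, hi = first - start, last - start + 1
    if lo < 0 or hi > len(bases):
        raise ValueError("requested range runs off the sequence")
    return bases[lo:hi], first


gene = "ATGAAACCCGGGTTTACGTA"                # made-up bases standing in for a gene
old = ("C" * 30 + gene + "A" * 10, 1, {"lacZ": 31})
# The same sequence after another entry (100 bases) was fused onto its front:
new = ("G" * 100 + old[0], 1, {"lacZ": 131})

old_fragment, old_first = get_around(*old, "lacZ", 20, 10)
new_fragment, new_first = get_around(*new, "lacZ", 20, 10)
print(old_fragment == new_fragment, old_first, new_first)
```

The instruction itself never changes.  Only the numbering attached to the
fragment differs between the two databases, which is what one would expect
after a fusion.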

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov