Path: utzoo!utgpu!watserv1!watmath!uunet!bionet!icrf.ac.uk!G_LENNON
From: G_LENNON@icrf.ac.uk
Newsgroups: bionet.molbio.genbank
Subject: (none)
Message-ID: <9102012116.AA16359@genbank.bio.net>
Date: 1 Feb 91 18:34:51 GMT
Sender: daemon@genbank.bio.net
Lines: 34


 Don Gilbert asks ....
Is there a consensus view on the proper way to enter discontinuous
sequences to GenBank?  An otherwise continuous length of
molecule contains regions which were not sequenced, of many bases
in length.  Options seem to be
  a) enter in databank under one accession number, with feature
     notations indicated where regions with no data exist. Drawback:
     users can miss feature info and incorrectly use such data
     as a continuous sequence.
  b) enter in databank under separate accession numbers for each
     continuous region.  Drawback: sequential nature of data is
     obscured by separate entries.
  c) enter as one accession, with unsequenced regions (whose size
     is known, I believe, by alignment with related sequences)
     indicated with "N" or other symbol.  Drawback: the
     N symbol may not be appropriate.

**************************************************************************
My vote is for option (b).  First, the "sequential nature" of the entire
sequence of a chromosome (from telomere to telomere) is in practice already
being reduced to a set of sequence files in GenBank. Second, those working
with immunoglobulin gene segments have settled on option (b) for representing
the segments that recombine somatically to form a complete Ig gene, using the
features table to indicate that other sequence can be found in certain other
files. Presumably, this feature will also be the one used to link all the files
comprising megabases' worth of sequence, when genome sequencing really starts
producing data. Third, guessing the number of N's should be discouraged, as
should increasing the number of N's in GenBank (they tend to foul up certain
programs unless precautions are taken), so option (c) is unwise. Lastly,
numerous programs rely on the length of a sequence in positional calculations,
and option (a) will misrepresent that length.
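
To make the last point concrete, here is a rough Python sketch of how a
positional calculation goes wrong when the sequenced pieces of option (a) are
read as one continuous string. The segment data, the gap size of 40 bases and
both function names are invented for illustration; nothing here reflects an
actual GenBank entry or any particular program.

```python
# Hypothetical data: two sequenced stretches of one molecule, separated by an
# unsequenced region assumed to be 40 bases long (all values are illustrative).
segments = [
    {"start": 1, "seq": "ACGTACGTAC"},    # molecule positions 1..10
    {"start": 51, "seq": "GGATCCTTAA"},   # molecule positions 51..60
]


def concatenated_position(segments, motif):
    """Option (a) read naively: join the pieces, report a 1-based hit."""
    joined = "".join(s["seq"] for s in segments)
    idx = joined.find(motif)
    return idx + 1 if idx >= 0 else None


def segment_position(segments, motif):
    """Option (b): search each piece and add that piece's own start offset."""
    for s in segments:
        idx = s["seq"].find(motif)
        if idx >= 0:
            return s["start"] + idx
    return None


naive = concatenated_position(segments, "GGATCC")
actual = segment_position(segments, "GGATCC")
naive_length = sum(len(s["seq"]) for s in segments)
true_span = segments[-1]["start"] + len(segments[-1]["seq"]) - 1

print("naive position:", naive, " position on molecule:", actual)
print("naive length:", naive_length, " span on molecule:", true_span)
```

Under these assumptions the naive reading puts the site 40 bases too early and
reports a length of 20 for a molecule that spans 60, which is the kind of
error a user who misses the feature notes would carry into later work.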
So much for my two cents,
g_lennon@icrf.ac.uk         Greg Lennon