Path: utzoo!utgpu!watserv1!watmath!uunet!bionet!icrf.ac.uk!G_LENNON
From: G_LENNON@icrf.ac.uk
Newsgroups: bionet.molbio.genbank
Subject: (none)
Message-ID: <9102012116.AA16359@genbank.bio.net>
Date: 1 Feb 91 18:34:51 GMT
Sender: daemon@genbank.bio.net
Lines: 34

Don Gilbert asks ....

Is there a consensus view on the proper way to enter discontinuous
sequences to GenBank? An otherwise continuous length of molecule
contains regions which were not sequenced, of many bases in length.
Options seem to be

a) enter in databank under one accession number, with feature
   notations indicated where regions with no data exist.
   Drawback: users can miss feature info and incorrectly use such
   data as a continuous sequence.

b) enter in databank under separate accession numbers for each
   continuous region.
   Drawback: sequential nature of data is obscured by separate
   entries.

c) enter as one accession, with unsequenced regions (whose size is
   known, I believe, by alignment with related sequences) indicated
   with "N" or other symbol.
   Drawback: the N symbol may not be appropriate.

**************************************************************************

My vote is for option (b). After all, the "sequential nature" of the
entire sequence of a chromosome (from telomere to telomere) is in
practice being reduced to a set of sequence files in GenBank currently.

Second, those working with immunoglobulin gene segments have settled on
option (b) for representing the segments that recombine somatically to
form a complete ig gene, using the features table to indicate that other
sequence can be found in certain other files. Presumably, this feature
will also be the one used to link all the files comprising megabases
worth of sequence, when genome sequencing really starts producing data.

Third, guessing the number of N's should be discouraged, as should
increasing the number of N's in GenBank (they tend to foul up certain
programs unless precautions are taken), so option (c) is unwise.

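
As a rough illustration of the point about N's, here is a minimal Python
sketch of how a run of guessed N's can skew a simple calculation unless
the program takes precautions. The function names and the example
sequence are made up for illustration; they are not taken from any
GenBank software.

```python
# Hypothetical example: names and data are assumptions, not GenBank code.

def gc_fraction_naive(seq):
    """Fraction of G+C over the full length, counting N's as real bases."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_fraction_guarded(seq):
    """Fraction of G+C over determined bases only (A, C, G, T)."""
    seq = seq.upper()
    known = sum(seq.count(base) for base in "ACGT")
    if known == 0:
        raise ValueError("no determined bases in sequence")
    return (seq.count("G") + seq.count("C")) / known


# Two sequenced stretches joined by a guessed run of 8 N's, as in option (c).
padded = "GCGC" + "N" * 8 + "ATGC"

print(gc_fraction_naive(padded))    # 0.375
print(gc_fraction_guarded(padded))  # 0.75
```

The naive version treats the padding as real sequence, so the answer
depends on how many N's were guessed; the guarded version ignores them.
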
Lastly, numerous programs rely on the length of a sequence in positional
calculations, which option (a) will incorrectly represent.

So much for my two cents,

g_lennon@icrf.ac.uk
Greg Lennon