Path: utzoo!utgpu!watserv1!watmath!uunet!bionet!benton
From: benton@genbank.BIO.NET (David Benton)
Newsgroups: bionet.molbio.genbank
Subject: Re: GenBank Release 64 was incomplete
Message-ID: <Aug.26.18.33.01.1990.555@genbank.BIO.NET>
Date: 27 Aug 90 01:33:01 GMT
References: <1990Aug25.220335.25701@ccu.umanitoba.ca>
Organization: GenBank Online Service
Lines: 79

To Brian Fristensky's question:

> Is it correct to say that the location
> formatting error you spoke of affected Release 64.0 on ALL media released
> from early July to mid Aug, and not just the floppy disk version?
> Specifically, would this error be reflected in SUN tar tapes dated Jun
> 1990?

the answer is, unfortunately, "yes".  The problem was discovered after
those tapes had been shipped.  If you want to patch these location
errors, I'll append an awk script which reads a GenBank .seq file and
writes (to standard output) a sequence file in which the errors that
occur in simple spans (about 1224 of the 1250) are corrected.  If
anyone is interested, I can post a second awk program which detects,
but does not correct, the remaining errors.  If you choose to use the
attached program, I'd recommend diffing the output against the
original file (especially if you've made any changes to that file)
before you throw the original away.  I've tested the program on the
Rel 64.0 files as distributed and found no side effects, but it is
"use at your own risk."

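As a minimal sketch of the correction described above, the following
shell function applies the same kind of direct-repeat test to a single
"to" position.  The name halve_repeat and its two arguments are
hypothetical and are not part of the appended script; the sketch also
assumes an awk that accepts -v assignments on the command line.

```shell
# Hypothetical helper (not part of the posted script): apply the
# direct-repeat test to one "to" position.
#   $1 = "to" position as it appears in the feature table
#   $2 = sequence length taken from the LOCUS line
halve_repeat() {
	awk -v to="$1" -v len="$2" 'BEGIN {
		half = int(length(to) / 2)
		# halve only when the value overruns the sequence, has an
		# even number of characters, and both halves are identical
		if (to + 0 > len && length(to) % 2 == 0 &&
		    substr(to, 1, half) == substr(to, half + 1))
			to = substr(to, 1, half)
		print to
	}'
}

halve_repeat 456456 1000	# doubled and too large: prints 456
halve_repeat 456456 500000	# fits within the sequence: prints 456456
halve_repeat 12345 1000		# too large but not a repeat: prints 12345
```

A value is changed only when it both exceeds the sequence length and
splits into two identical halves, so a legitimate position such as
456456 in a long sequence is left alone.
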
By the way, no floppy disk version of Release 64.0 was distributed.
We will be shipping the floppy-format
files for Release 64.1 (corrected), as well as the floppy-format files
for Release 63.0, and the magnetic tape format files for GenBank Rel. 64.1
and GenPept Rel. 64.2 on a CD ROM in about two weeks.  But that is the
subject for another announcement.

The feature table in Rel 64 was created by automatic translation of
the old feature tables (used in Rel 63 and before) to the new feature
table format.  Brian's comments on the use of the prim_transcript and mRNA
feature keys are accurate (as I understand those keys) except that
the prim_transcript is intended to be used to annotate the primary
(initial) transcript, before any processing.  In most cases, I would
guess, the primary transcript is unknown.  I will leave more detailed
(and knowledgeable) comments for the GenBank annotation staff.

David Benton
GenBank
benton@karyon.bio.net

------------------------cut here---------------------------------------

# awk program to find any feature location "to" position which is greater
# than the sequence length, determine if it is a direct repeat ("456456")
# and, if so, divide it (as a string) in half ("456")
# note that this works only on simple spans


BEGIN	{alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	numeric = "-1234567890"}

/^LOCUS /	{len = $3 + 0
		print
		next}

/^FEATURES /,/^BASE COUNT  / {if ($1 == "FEATURES" || $1 == "BASE" || substr($0,6,1) == " "){
	print
	next}
    else{
	inlin = $0
	dot = index(inlin,"..")
	if ((index(alpha,substr(inlin,22,1)) != 0) || (dot == 0)){
		print
		next}

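	# split the line after "..": "from" keeps everything up to and including
	# the dots, "to" is the end position (plus anything that follows it)
	from = substr(inlin,1,dot+1)
	to = substr(inlin,dot+2)
	# move any leading non-numeric characters (e.g. ">") from "to" to "from";
	# stop when "to" is empty, so that a line with nothing numeric after
	# ".." cannot keep the loop running
	while (length(to) > 0 && index(numeric,substr(to,1,1)) == 0){
		from = from substr(to,1,1)
		to = substr(to,2)}
	half = int(length(to) / 2)

	# halve "to" only if it exceeds the sequence length, has an even
	# number of characters, and its two halves are identical
	if ((to + 0 > len) && (length(to) % 2 == 0) && (substr(to,1,half) == substr(to,half+1))){
		to = substr(to,1,half)}
	print from to
	next}}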

	{ print }