Path: utzoo!utgpu!watserv1!watmath!uunet!bionet!benton
From: benton@genbank.BIO.NET (David Benton)
Newsgroups: bionet.molbio.genbank
Subject: Re: GenBank Release 64 was incomplete
Message-ID:
Date: 27 Aug 90 01:33:01 GMT
References: <1990Aug25.220335.25701@ccu.umanitoba.ca>
Organization: GenBank Online Service
Lines: 79

To Brian Fristensky's question:

> Is it correct to say that the location
> formatting error you spoke of affected Release 64.0 on ALL media released
> from early July to mid Aug, and not just the floppy disk version?
> Specifically, would this error be reflected in SUN tar tapes dated Jun
> 1990?

the answer is, unfortunately, "yes". The problem was discovered after
those tapes had been shipped.

If you want to patch these location errors, I'll append an awk script
which reads a GenBank .seq file and writes (to standard output) a
sequence file in which the errors that occur in simple spans (about 1224
of the 1250) are corrected. If anyone is interested, I can post a second
awk program which detects, but does not correct, the remaining errors.

If you choose to use the attached program, I'd recommend diffing the
output against the original file (especially if you've made any changes
to that file) before you throw the original away. I've tested the
program on the Rel 64.0 files as distributed and found no side effects,
but it is "use at your own risk."

By the way, no floppy disk version of Release 64.0 was distributed on
floppy disks. We will be shipping the floppy-format files for Release
64.1 (corrected), as well as the floppy-format files for Release 63.0,
and the magnetic tape format files for GenBank Rel. 64.1 and GenPept
Rel. 64.2, on a CD ROM in about two weeks. But that is the subject for
another announcement.

The feature table in Rel 64 was created by automatic translation of the
old feature tables (used in Rel 63 and before) to the new feature table
format.
Brian's comments on the use of the prim_transcript and mRNA feature keys
are accurate (as I understand those keys), except that prim_transcript
is intended to be used to annotate the primary (initial) transcript,
before any processing. In most cases, I would guess, the primary
transcript is unknown. I will leave more detailed (and knowledgeable)
comments for the GenBank annotation staff.

	David Benton
	GenBank
	benton@karyon.bio.net

------------------------cut here---------------------------------------
# awk program to find any feature location "to" position which is greater
# than the sequence length, determine if it is a direct repeat ("456456")
# and, if so, divide it (as a string) in half ("456")
# note that this works only on simple spans

BEGIN {
    alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    numeric = "-1234567890"
}

# remember the sequence length given on the LOCUS line
/^LOCUS / {
    len = $3 + 0
    print
    next
}

/^FEATURES /,/^BASE COUNT / {
    # pass through the FEATURES and BASE COUNT lines and any line
    # with a blank in column 6
    if ($1 == "FEATURES" || $1 == "BASE" || substr($0,6,1) == " ") {
        print
        next
    } else {
        inlin = $0
        dot = index(inlin,"..")
        # skip lines with a letter in column 22 or with no ".."
        if ((index(alpha,substr(inlin,22,1)) != 0) || (dot == 0)) {
            print
            next
        }
        from = substr(inlin,1,dot+1)
        to = substr(inlin,dot+2)
        # move any leading non-numeric characters from "to" onto "from"
        while (index(numeric,substr(to,1,1)) == 0) {
            from = from substr(to,1,1)
            to = substr(to,2)
        }
        half = length(to) / 2
        # halve "to" only if it exceeds the sequence length and
        # its two halves are identical
        if ((to + 0 > len) && (substr(to,1,half) == substr(to,half+1))) {
            to = substr(to,1,half)
        }
        print from to
        next
    }
}

{ print }
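
A minimal sketch of the halving test used in the script above, run on three made-up location strings rather than on real GenBank records. The sequence length of 500 and the sample spans are assumptions chosen for illustration, and the sketch leaves out the script's handling of LOCUS lines, column checks, and leading non-numeric characters.

```shell
# Hypothetical sequence length; the real script reads it from the LOCUS line.
len=500

# Three invented simple spans: a doubled "to" position, an in-range
# position, and an out-of-range position that is not a direct repeat.
out=$(printf '%s\n' '10..456456' '20..450' '30..9876' |
awk -v len="$len" '{
    dot  = index($0, "..")
    from = substr($0, 1, dot + 1)    # everything up to and including ".."
    to   = substr($0, dot + 2)       # the "to" position as a string
    half = length(to) / 2
    # Halve only when "to" exceeds the sequence length and the two
    # halves of the string are identical.
    if ((to + 0 > len) && (substr(to, 1, half) == substr(to, half + 1)))
        to = substr(to, 1, half)
    print from to
}')

printf '%s\n' "$out"
```

Only the first span should change (to `10..456`); the other two pass through untouched. To apply the full script to a real file, something like `awk -f fixloc.awk gbpri.seq > gbpri.fixed` followed by `diff gbpri.seq gbpri.fixed` would follow the advice above about comparing against the original; both file names here are placeholders, not names from the distribution.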