Path: utzoo!attcan!uunet!samsung!usc!apple!bionet!bio.embnet.se!mats
From: Mats.Sundvall@bio.embnet.se (Mats Sundvall)
Newsgroups: bionet.molbio.genbank,bionet.molbio.pir
Subject: Re: GenBank gets big and PIR format has problems!
Message-ID: <52.26222fec@bio.embnet.se>
Date: 10 Apr 90 18:11:40 GMT
References: <6588@wehi.dn.mu.oz> <1990Apr10.032155.22233@phri.nyu.edu>
Organization: Embnet node in Sweden, Biomedical Center, University of Uppsala, Sweden
Lines: 52

In article <1990Apr10.032155.22233@phri.nyu.edu>, roy@phri.nyu.edu (Roy Smith) writes:
> TONY@wehi.dn.mu.oz (Tony Kyne, Walter and Eliza Hall Institute) writes:
> 
> 	The need to rebuild the database each time it is updated is a
> problem which has not escaped our attention.  I can guess how Ross Smith and
> Dave Kristofferson (my partners in crime on the daily updates experiment),
> would answer your question, but I'll let them talk for themselves.  As for
> my part, what we have done is to keep essentially a complete separate
> database just for the daily updates.  That makes the size of the index file
> rebuilds managable.  We currently have a mishmosh of all the updates in one
> file, but we envision probably doing something like a 3-tier system.
> 

We have made the changes needed to APPEND entries to a PIR format
database. We use only the GCG package, not NAQ or PSQ, so we do
not know whether this works with them.

The idea is quite simple. You append an entry to the sequential file.
Then you update the byte pointer in the index file to point to the new
entry instead of the old one. This of course leaves you with some
old entries in the seq file that no pointer refers to. This is not a big
problem with the programs that use the index files to retrieve
entries. Of course you are in trouble when using database searching
programs like wordsearch and FASTA that read the database sequentially
to run faster. You get several matches to the same sequence, but in the
second pass of the program, when it retrieves the sequence, it will fetch
only the current one. This will screw up the statistics, but we feel this
is a minor problem compared to the other solutions offered.
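
As a rough illustration of the append-and-repoint idea, here is a
minimal Python sketch. It is not the code we actually run: the
in-memory dict standing in for the index file, the (offset, length)
pairs, and the names append_entry and fetch_entry are all assumptions
made for the example, and the real GCG index file layout is different.

```python
import io

def append_entry(seq_file, index, name, record):
    """Append record (bytes) at the end of seq_file and repoint index[name].

    Any earlier copy of the entry stays in the file, but nothing points
    to it any more. (Hypothetical helper, for illustration only.)
    """
    seq_file.seek(0, io.SEEK_END)
    offset = seq_file.tell()
    seq_file.write(record)
    index[name] = (offset, len(record))
    return offset

def fetch_entry(seq_file, index, name):
    """Retrieve an entry through the index, so only the newest copy is seen."""
    offset, length = index[name]
    seq_file.seek(offset)
    return seq_file.read(length)

# Usage: two new entries, then an update to the first one.
db = io.BytesIO()          # stands in for the sequential file
index = {}                 # stands in for the index file
append_entry(db, index, "A12345", b">A12345 v1\nMKT\n")
append_entry(db, index, "B00001", b">B00001 v1\nGGA\n")
append_entry(db, index, "A12345", b">A12345 v2\nMKTL\n")
```

Index-based retrieval of A12345 now returns only the second version,
while a sequential read of the file still passes over both copies.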

Of course you will need some sort of garbage collection after a while.
There are ways to do this, but at the moment we plan to let the delivery
of new tapes be our garbage collector. We just install the new tape
and start all over again.
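
For sites that cannot wait for the next tape, one possible way to do
the garbage collection is to copy only the entries the index still
points to into a fresh file. The sketch below assumes an index held as
a dict of name -> (offset, length); that representation and the
function name compact are made up for the example and are not what we
have implemented.

```python
import io

def compact(old_file, index, new_file):
    """Copy only the records still referenced by index into new_file.

    Returns a new index with offsets valid for new_file. Records are
    copied in their original file order. (Hypothetical helper.)
    """
    new_index = {}
    by_offset = sorted(index.items(), key=lambda item: item[1][0])
    for name, (offset, length) in by_offset:
        old_file.seek(offset)
        record = old_file.read(length)
        new_index[name] = (new_file.tell(), length)
        new_file.write(record)
    return new_index

# Usage: "x1\n" is an orphaned old version of entry x.
old = io.BytesIO(b"x1\n" + b"y1\n" + b"x2\n")
old_index = {"x": (6, 3), "y": (3, 3)}
new = io.BytesIO()
new_index = compact(old, old_index, new)
```

The old file is left untouched, so the swap to the compacted file and
its index can be done once the copy has finished.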

Of course the problem with duplicate matches only occurs when you get
updates to already existing entries.
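
If the duplicate matches are a nuisance, a search program could in
principle compare each hit's file position with the index and discard
hits on superseded copies. This is only a sketch of that idea, not
something wordsearch or FASTA does as far as we know; the hit tuples
(name, offset, score) and the function drop_stale_hits are assumptions.
It removes the duplicates from the hit list, but any statistics
computed over the whole sequential scan are still affected.

```python
def drop_stale_hits(hits, index):
    """Keep only hits whose file offset is the one the index points to.

    hits  : list of (name, offset, score) from a sequential scan
    index : dict mapping name -> offset of the current copy
    (Both structures are hypothetical, for illustration only.)
    """
    return [hit for hit in hits if index.get(hit[0]) == hit[1]]

# Usage: A12345 was updated, so its copy at offset 0 is stale.
current = {"A12345": 30, "B00001": 15}
scan_hits = [("A12345", 0, 88.0), ("B00001", 15, 40.0), ("A12345", 30, 91.0)]
kept = drop_stale_hits(scan_hits, current)
```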

Questions about the availability of the "fixes" should be addressed to
Peter.Gad@Bio.embnet.SE, who did the actual coding. He may read
this and can post some info himself.

> --
> Roy Smith, Public Health Research Institute
> 455 First Avenue, New York, NY 10016
> roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
> "Don't Worry, Be Happy"


	Mats Sundvall
	Biomedical Center
	Uppsala University
	Sweden