Path: utzoo!attcan!uunet!samsung!usc!apple!bionet!bio.embnet.se!mats
From: Mats.Sundvall@bio.embnet.se (Mats Sundvall)
Newsgroups: bionet.molbio.genbank,bionet.molbio.pir
Subject: Re: GenBank gets big and PIR format has problems!
Message-ID: <52.26222fec@bio.embnet.se>
Date: 10 Apr 90 18:11:40 GMT
References: <6588@wehi.dn.mu.oz> <1990Apr10.032155.22233@phri.nyu.edu>
Organization: Embnet node in Sweden, Biomedical Center, University of Uppsala, Sweden
Lines: 52

In article <1990Apr10.032155.22233@phri.nyu.edu>, roy@phri.nyu.edu (Roy Smith) writes:
> TONY@wehi.dn.mu.oz (Tony Kyne, Walter and Eliza Hall Institute) writes:
>
> The need to rebuild the database each time it is updated is a
> problem which has not escaped our attention. I can guess how Ross Smith and
> Dave Kristofferson (my partners in crime on the daily updates experiment),
> would answer your question, but I'll let them talk for themselves. As for
> my part, what we have done is to keep essentially a complete separate
> database just for the daily updates. That makes the size of the index file
> rebuilds managable. We currently have a mishmosh of all the updates in one
> file, but we envision probably doing something like a 3-tier system.
>

We have made the changes needed to APPEND entries to the PIR format
database. We only use the GCG package and not NAQ and PSQ, so we do not
know if this works with them.

The idea is quite simple. You append an entry to the sequential file.
Then you update the byte pointer in the index file to point to the new
entry instead of the old one. This of course leaves you with some old
entries in the sequence file that no pointer refers to. This is not a
big problem with the programs that use the index files to retrieve
entries.

Of course you are in trouble when using database searching programs
like wordsearch and FASTA that read the database sequentially to run
faster.
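
The append-and-repoint idea can be sketched roughly as follows. This is only an illustration in Python, not the actual fixes: it assumes a toy two-line record (a ">" ID line followed by one sequence line) rather than the real PIR format, an in-memory dict standing in for the index file, and hypothetical function names (append_entry, fetch, scan, live_hits, compact).

```python
import io


def append_entry(seq_file, index, entry_id, sequence):
    """Append a record at the end of the flat file and repoint the index at it."""
    seq_file.seek(0, io.SEEK_END)
    offset = seq_file.tell()
    seq_file.write(f">{entry_id}\n{sequence}\n".encode("ascii"))
    index[entry_id] = offset  # an older copy, if any, stays in the file unreferenced
    return offset


def fetch(seq_file, index, entry_id):
    """Index-based retrieval: always returns the newest copy of an entry."""
    seq_file.seek(index[entry_id])
    seq_file.readline()  # skip the ">ID" line
    return seq_file.readline().decode("ascii").rstrip("\n")


def scan(seq_file):
    """Sequential read that ignores the index, so stale copies show up too."""
    seq_file.seek(0)
    while True:
        offset = seq_file.tell()
        header = seq_file.readline()
        if not header:
            break
        sequence = seq_file.readline().decode("ascii").rstrip("\n")
        yield offset, header.decode("ascii")[1:].rstrip("\n"), sequence


def live_hits(seq_file, index):
    """Keep only the records that the index still points to."""
    return [(eid, seq) for off, eid, seq in scan(seq_file) if index.get(eid) == off]


def compact(seq_file, index):
    """Garbage collection: copy the live records into a fresh file and index."""
    new_file, new_index = io.BytesIO(), {}
    for eid, seq in live_hits(seq_file, index):
        append_entry(new_file, new_index, eid, seq)
    return new_file, new_index


# Made-up entries: "A1" is added, then updated by appending a second copy.
db, idx = io.BytesIO(), {}
append_entry(db, idx, "A1", "MKV")
append_entry(db, idx, "B2", "GGA")
append_entry(db, idx, "A1", "MKVL")

print(fetch(db, idx, "A1"))               # newest copy only
print([eid for _, eid, _ in scan(db)])    # sequential scan sees A1 twice
print(live_hits(db, idx))                 # stale copy filtered out
```

In this sketch a sequential search could avoid the duplicate matches by checking each record's offset against the index, as live_hits does, at the cost of an index lookup per record.
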
You get several matches to the same sequence, but in the second round
of the program, when it retrieves the sequence, it will fetch only the
right one. This will screw up the statistics, but we feel this is a
minor problem compared to other solutions offered.

Of course you will need some sort of garbage collection after a while.
There are ways to do this, but at the moment we plan to let the
delivery of new tapes be our garbage collector. We just install the new
tape and start all over again.

Of course the problems with duplicated matches only occur when you get
updates to already existing entries.

Questions about availability of the "fixes" should be addressed to
Peter.Gad@Bio.embnet.SE, who did the actual coding. He may read this
and can post some info himself.

> --
> Roy Smith, Public Health Research Institute
> 455 First Avenue, New York, NY 10016
> roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
> "Don't Worry, Be Happy"

Mats Sundvall
Biomedical Center
Uppsala University
Sweden