Xref: utzoo comp.unix.questions:25697 comp.sys.sequent:710
Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!ames!haven!decuac!e2big.mko.dec.com!bacchus.pa.dec.com!decwrl!sdd.hp.com!uakari.primate.wisc.edu!crdgw1!montnaro
From: montnaro@spyder.crd.ge.com (Skip Montanaro)
Newsgroups: comp.unix.questions,comp.sys.sequent
Subject: Re: Checkpoints for large jobs
Message-ID: <MONTNARO.90Aug11091123@spyder.crd.ge.com>
Date: 11 Aug 90 13:11:23 GMT
References: <3193@syma.sussex.ac.uk>
Sender: news@crdgw1.crd.ge.com
Reply-To: montanaro@crdgw1.ge.com (Skip Montanaro)
Followup-To: comp.unix.questions
Organization: GE Corporate Research & Development, Schenectady, NY
Lines: 17
In-reply-to: william@syma.sussex.ac.uk's message of 6 Aug 90 16:06:18 GMT


On a case-by-case basis, you may be able to modify your applications so that
they can recover from an interruption. For instance, if your application is
an iterative solver of some sort, you may be able to checkpoint the
intermediate data periodically. When the program is restarted, a flag can be
set so that it initializes from the saved intermediate solution rather than
from scratch.
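
To make that concrete, here is a rough sketch of application-level
checkpointing, not anything taken from a real solver. Everything in it is
made up for illustration: the fixed-point iteration x = cos(x) stands in for
the real computation, and the names solve, save_checkpoint, load_checkpoint,
the file name solver.ckpt, the --restart flag, and the stop_after parameter
(which only simulates a crash) are all hypothetical.

```python
import json
import math
import os
import sys
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old
    checkpoint, so a crash in mid-write never leaves a truncated file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_name = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, path)


def load_checkpoint(path):
    """Read back the state written by save_checkpoint."""
    with open(path) as f:
        return json.load(f)


def solve(path, restart=False, tol=1e-12, max_iter=10000, every=10,
          stop_after=None):
    """Iterate x = cos(x) until successive values differ by less than tol.

    The state is saved every `every` iterations. With restart=True the
    run resumes from the checkpoint file if one exists. stop_after is a
    hypothetical knob that abandons the run after that many steps, to
    imitate a crash; it returns None in that case.
    """
    if restart and os.path.exists(path):
        state = load_checkpoint(path)
    else:
        state = {"iteration": 0, "x": 1.0}

    steps = 0
    while state["iteration"] < max_iter:
        if stop_after is not None and steps >= stop_after:
            return None  # simulated interruption
        new_x = math.cos(state["x"])
        delta = abs(new_x - state["x"])
        state = {"iteration": state["iteration"] + 1, "x": new_x}
        steps += 1
        if delta < tol:
            return state
        if state["iteration"] % every == 0:
            save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    # Run with --restart to pick up from the last checkpoint.
    result = solve("solver.ckpt", restart="--restart" in sys.argv)
    print(result)
```

The write-then-rename step is the part worth copying: if the machine goes
down while a checkpoint is being written, the previous checkpoint is still
intact. How often to checkpoint is a trade-off between the I/O cost and the
amount of work you are willing to redo.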

There was a system developed a few years ago (maybe 1986?) at the University
of Wisconsin that allowed jobs to be restarted (modulo some special I/O
situations). It was reported at a USENIX conference of that era.

Also, UNICOS on the CRAY has a checkpointing facility. You might investigate
it, and ask Sequent why they haven't got something similar.


--
Skip (montanaro@crdgw1.ge.com)