Xref: utzoo comp.unix.questions:25697 comp.sys.sequent:710
Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!ames!haven!decuac!e2big.mko.dec.com!bacchus.pa.dec.com!decwrl!sdd.hp.com!uakari.primate.wisc.edu!crdgw1!montnaro
From: montnaro@spyder.crd.ge.com (Skip Montanaro)
Newsgroups: comp.unix.questions,comp.sys.sequent
Subject: Re: Checkpoints for large jobs
Message-ID:
Date: 11 Aug 90 13:11:23 GMT
References: <3193@syma.sussex.ac.uk>
Sender: news@crdgw1.crd.ge.com
Reply-To: montanaro@crdgw1.ge.com (Skip Montanaro)
Followup-To: comp.unix.questions
Organization: GE Corporate Research & Development, Schenectady, NY
Lines: 17
In-reply-to: william@syma.sussex.ac.uk's message of 6 Aug 90 16:06:18 GMT

On a case-by-case basis, you may be able to modify your applications so
they will recover. For instance, if your application is an iterative
solver of some sort, you may be able to checkpoint the intermediate data
periodically. When the program is restarted, a flag can be set so the
program initializes from the intermediate solution data.

There was a system a few years ago (maybe 1986?) developed at the
University of Wisconsin that allowed jobs to be restarted (modulo some
special I/O situations). It was reported in a USENIX conference of that
era.

Also, UNICOS on the CRAY has a checkpointing facility. You might
investigate it, and ask Sequent why they haven't got something similar.

--
Skip (montanaro@crdgw1.ge.com)
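
To make the application-level approach concrete, here is a minimal sketch in Python of an iterative solver that checkpoints its intermediate data and can be told to restart from it. Everything named here is an assumption for illustration only: the functions save_checkpoint, load_checkpoint and solve, the restart and every parameters, the JSON file layout, and the choice of a Newton iteration for a square root as a stand-in for a real solver. The checkpoint is written to a temporary file and then renamed over the old one, so a crash during the write should not leave a half-written checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic replacement of the previous checkpoint


def load_checkpoint(path):
    """Read back the intermediate solution data saved by save_checkpoint."""
    with open(path) as f:
        return json.load(f)


def solve(a, path, restart=False, every=5, max_iter=100, tol=1e-12):
    """Newton iteration for sqrt(a), checkpointing every `every` iterations.

    With restart=True and an existing checkpoint file, the solver
    initializes from the saved state instead of from scratch.
    """
    if restart and os.path.exists(path):
        state = load_checkpoint(path)
    else:
        state = {"iteration": 0, "x": float(a)}

    while state["iteration"] < max_iter:
        x = state["x"]
        new_x = 0.5 * (x + a / x)
        state = {"iteration": state["iteration"] + 1, "x": new_x}
        if state["iteration"] % every == 0:
            save_checkpoint(path, state)
        if abs(new_x - x) < tol:
            break
    return state
```

A run that gets cut short (simulated here with a small max_iter) can then be picked up with something like solve(2.0, "solver.ckpt", restart=True). At most `every` iterations of work are lost, so the checkpoint interval trades I/O cost against how much has to be recomputed after a crash.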