Path: utzoo!attcan!utgpu!news-server.csri.toronto.edu!mailrus!cs.utexas.edu!usc!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!crdgw1!montnaro
From: montnaro@spyder.crd.ge.com (Skip Montanaro)
Newsgroups: comp.unix.wizards
Subject: Re: Checkpoint/Restart
Message-ID:
Date: 18 Aug 90 18:56:33 GMT
References: <24193@adm.BRL.MIL> <13611@smoke.BRL.MIL> <17543@ucsd.Edu>
Sender: news@crdgw1.crd.ge.com
Reply-To: montanaro@crdgw1.ge.com (Skip Montanaro)
Organization: GE Corporate Research & Development, Schenectady, NY
Lines: 64
In-reply-to: gkn@ucsd.Edu's message of 17 Aug 90 20:42:02 GMT

In article <17543@ucsd.Edu> gkn@ucsd.Edu (Gerard K. Newman) writes:

   I think it's a bit unfair for every user of a system to have to invent a way to do this specific to their particular application. In many cases it may not be possible (the above "canned software" problem being an example).

I would agree with the above statements if

a) the effort of creating a programmer/user-transparent general-purpose solution were not much greater than that of writing a programmer/user-visible application-specific solution,
b) it were impossible (or nearly so) to create application-specific solutions to the problem, or
c) most applications actually needed it.

However, as has been discussed in this and other newsgroups off and on over the past couple of years,

a) it is very hard to solve the general-purpose problem, systems like CRAY's checkpoint/restart facility and the University of Wisconsin's RU/Condor systems notwithstanding,
b) for most applications that need such facilities, they aren't terribly difficult to write, and
c) very few applications actually need such facilities.

Given the difficulty of adding a general solution to (various flavors of) Unix, it is probably wiser to do it on a case-by-case basis. It is unlikely that most of the relatively few applications that need checkpoint/restart capabilities will need the full range of capabilities that a general solution would have to account for.
As a common case, consider many scientific applications. They typically read in a large data set, munch on it in an iterative manner for a long period of time, then write out another large data set. Checkpointing an application of this sort is pretty trivial. Just write out the intermediate state of the computation "every so often". If it must be restarted, it can be directed to read the checkpointed data, restarting the computation from that point. If the application crashes during the initial input phase, no expensive computation has been lost. There's a checkpoint facility in place during the iterative solve phase. During the final output phase, if an error occurs (such as a full disk, head crash, or system failure), you fall back to the last checkpoint during the compute phase (if you can recover it from the disk).

Another example is text editors. Most editors I've used over the past several years (Emacs of several flavors, vi, EDT), provided some sort of checkpoint or playback facilities. (EDT's playback was fun to watch.)

As to the second point (canned software packages), checkpoint/restart capabilities should be treated as a competitive advantage of one package over another. If your vendor(s) don't provide such facilities, and you need them, lean on them. If there's a vendor that does, factor that into your evaluation. They won't provide it until they realize you need it. The best way to get them to realize it is with your pocketbook.

-- 
Skip (montanaro@crdgw1.ge.com)
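
P.S. To make the scientific-application case above concrete, here is a minimal sketch of application-level checkpointing in Python. Everything named in it (save_checkpoint, load_checkpoint, solve, the JSON file layout, the crash_at knob used to simulate a failure) is hypothetical and chosen only for illustration. The logistic-map update is a stand-in for whatever the real iterative computation does. Writing to a temporary file and renaming it over the old checkpoint is my own assumption about how to avoid leaving a half-written checkpoint behind; it is not something the discussion above prescribes.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file in the same directory, then rename it
    over the previous checkpoint so a crash mid-write leaves the old
    checkpoint intact."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def solve(data, total_iters, ckpt_path, every=100, crash_at=None):
    """Iterate total_iters times, checkpointing every `every` iterations.

    If a checkpoint already exists at ckpt_path, resume from it instead
    of starting over. crash_at (hypothetical) simulates a failure just
    before the given iteration runs.
    """
    state = load_checkpoint(ckpt_path)
    if state is None:
        state = {"iter": 0, "values": list(data)}
    values = state["values"]
    for i in range(state["iter"], total_iters):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        # Stand-in for the real computation: one logistic-map step.
        values = [3.9 * v * (1.0 - v) for v in values]
        done = i + 1
        if done % every == 0:
            save_checkpoint(ckpt_path, {"iter": done, "values": values})
    return values
```

A run that dies partway through loses at most `every` iterations of work: calling solve again with the same ckpt_path picks up from the last checkpoint written. A real application would also have to decide what counts as "state" (here it is just the iteration count and the working array) and how often checkpointing is worth its I/O cost.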