Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!spool.mu.edu!uunet!uunet.UU.NET!sef
From: breynolds@UCSD.EDU (Bill Reynolds)
Newsgroups: comp.std.unix
Subject: Checkpointing for Unix?
Message-ID: <130482@uunet.UU.NET>
Date: 26 Apr 91 19:11:11 GMT
Sender: usenet@uunet.UU.NET
Organization: Institute for Nonlinear Science, UCSD
Lines: 37
Approved: sef@uunet.uu.net (Moderator, Sean Eric Fagan - comp.std.unix)
Nntp-Posting-Host: uunet.uu.net
X-Submissions: std-unix@uunet.uu.net
Originator: sef@uunet.UU.NET

Submitted-by: breynolds@UCSD.EDU (Bill Reynolds)

I originally posted this to comp.unix.questions. It was then
recommended to me that I post here as well.

>Greetings,
>	We are a computational physics group running a network of Sun
>and SGI workstations. We often have long-running jobs on many of our
>machines. This leads to problems when a machine that has a job in the
>third day of a five-day run needs to be taken down. What we would like
>is a routine to checkpoint a job to a disk file for later reloading
>into memory. I've looked at undump, but it isn't adequate; we need to
>restart the job where it was interrupted. I've also looked at Condor,
>but it seems to be a fly-with-a-sledgehammer type solution. I'm
>wondering if there are any simple Unix/Sun/SGI utilities to do
>checkpointing. (I know that such facilities exist for Crays.)

I would also like to add that such a facility would have to support
Fortran and would have to be simple enough that someone with only a
background in scientific computing could use it (i.e. no system calls,
no calls to C routines from Fortran, etc.).

It has also been suggested that I modify the code to undump. I find
this a daunting task (any takers?). (By the way, I have not actually
gotten an undump working for the Sun or the SGI.)
--
_______________________________________________________________________
| Bill Reynolds | bill@inls1.ucsd.edu

[ First of all, there is Dan Bernstein's Poor Man's Checkpointing
Package, posted to alt.sources (I think) a month or three ago. Also,
one of the POSIX subgroups specifies checkpointing, that being the main
reason I'm posting this. I will let others (who are likely to be more
knowledgeable about it) comment further, if they wish. -- mod ]

Volume-Number: Volume 23, Number 47