Xref: utzoo comp.unix.questions:24632 comp.sys.sequent:688
Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!uwm.edu!uwvax!cream.cs.wisc.edu!mike
From: mike@cream.cs.wisc.edu (Mike Litzkow)
Newsgroups: comp.unix.questions,comp.sys.sequent
Subject: Re: Checkpoints for large jobs
Keywords: checkpoint interrupt signal
Message-ID: <1990Aug14.104718@cream.cs.wisc.edu>
Date: 14 Aug 90 15:47:18 GMT
References: <3193@syma.sussex.ac.uk>
Sender: news@spool.cs.wisc.edu
Reply-To: mike@cream.cs.wisc.edu (Mike Litzkow)
Organization: U of Wisconsin CS Dept
Lines: 23

Yes, checkpointing is one part of the Condor system (previously called
RU). Condor uses cycles on idle workstations by migrating processes to
them. When those workstations come back into use by their normal users,
the Condor jobs are checkpointed and later moved to another idle
workstation to continue execution.

The checkpointing is accomplished by causing the process to dump core,
then combining parts of the core file with parts of the original
executable. The software also keeps track of which files have been
opened and re-opens them on return from a checkpoint. This is
accomplished by linking the user program with special versions of
"crt0.o" and "libc.a".

Condor is available without charge by anonymous ftp from
"shorty.cs.wisc.edu" (128.105.2.8). Just log in as "ftp" and give your
user name as the password. Then "cd" to the condor directory and take a
look at the Readme file. You will be instructed to fetch a compressed
binary file; remember to set your ftp to "binary" mode for that.

The checkpointing is set up so you can use it without process migration
or remote execution if that is desired. It compiles and runs on a
Sequent Symmetry.

-- mike
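
To make the file-tracking idea above a little more concrete, here is a rough sketch in Python. It is not Condor's code (Condor does this inside its special "libc.a"); the names FileTable, snapshot, restore and demo are made up for illustration. It only shows the general pattern: record each open, save the path, flags and offset at checkpoint time, then re-open and seek on restart. A real implementation would also have to restore the same descriptor numbers, which this sketch does not attempt.

```python
import os
import tempfile


class FileTable:
    """Hypothetical helper: records every file opened through it."""

    def __init__(self):
        self.entries = {}  # fd -> (absolute path, open flags)

    def open(self, path, flags, mode=0o644):
        fd = os.open(path, flags, mode)
        self.entries[fd] = (os.path.abspath(path), flags)
        return fd

    def close(self, fd):
        os.close(fd)
        del self.entries[fd]

    def snapshot(self):
        """Return plain data describing each open file and its current offset."""
        return [
            {"path": path, "flags": flags,
             "offset": os.lseek(fd, 0, os.SEEK_CUR)}
            for fd, (path, flags) in self.entries.items()
        ]

    @classmethod
    def restore(cls, snapshot):
        """Re-open each recorded file and seek back to the saved offset."""
        table = cls()
        for entry in snapshot:
            # Drop flags that would recreate or truncate an existing file.
            flags = entry["flags"] & ~(os.O_CREAT | os.O_TRUNC | os.O_EXCL)
            fd = table.open(entry["path"], flags)
            os.lseek(fd, entry["offset"], os.SEEK_SET)
        return table


def demo():
    """Write a file, 'checkpoint', close it, 'restart', and read on."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        before = FileTable()
        fd = before.open(path, os.O_RDWR | os.O_CREAT)
        os.write(fd, b"hello world")
        os.lseek(fd, 6, os.SEEK_SET)
        saved = before.snapshot()          # taken at "checkpoint" time
        before.close(fd)                   # the original process goes away

        after = FileTable.restore(saved)   # done at "restart" time
        (new_fd,) = after.entries
        data = os.read(new_fd, 5)          # continues from the saved offset
        after.close(new_fd)
        return data
```

Calling demo() reads from where the first "process" left off, so the restarted side sees the tail of the file rather than its beginning.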