Path: utzoo!attcan!uunet!husc6!bbn!rochester!pt.cs.cmu.edu!PLAY.MACH.CS.CMU.EDU!bsy
From: bsy@PLAY.MACH.CS.CMU.EDU (Bennet Yee)
Newsgroups: comp.unix.wizards
Subject: Re: Process restart.
Keywords: Fork, Process, migration, remote, a.out, core
Message-ID: <3571@pt.cs.cmu.edu>
Date: 13 Nov 88 01:14:58 GMT
References: <16@elgar.UUCP> <8831@smoke.BRL.MIL> <17@elgar.UUCP> <8857@smoke.BRL.MIL> <18@elgar.UUCP>
Organization: Cranberry Melon
Lines: 55

In article <18@elgar.UUCP> ag@elgar.UUCP (Keith Gabryelski) writes:
>
>Long running processes that don't have any means of shutdown/restart
>built into them are what I am thinking of.
>
>Let's say we have this process computing prime numbers (or some other
>simple case) and the system needs to be shutdown because of some fatal
>error.  Can a snapshot be done?

I did exactly this about two years ago.  My implementation of
M.O.Rabin's probabilistic primality test ran for about a week of real
time on a uVax II, surviving multiple reboots/system crashes before
finding a 1000 digit probabilistic prime....  I don't know how much
real CPU time it took -- the machine was a general purpose machine (I
ran my program niced 19) and I didn't keep track of timing info.  In
retrospect it would have been easy:  I had it checkpoint every 5
minutes of CPU time anyway, so all I needed to do was increment a
counter at each checkpoint.

Anyway, since the program's I/O behavior is very simple (it generated
output only just before completing, and I only redirected its stdout
to a file), it was particularly simple to checkpoint the process.

I thought about the case of replacing open/close with library routines
and syscall'ing the traps after saving state; at a checkpoint, we can
lstat the known descriptors so we can restore.  This would work only
for files, of course, and I didn't bother.  I may do this at a later
date....  The code that I _do_ have simply checkpoints the data/stack
portion of the address space.
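
The descriptor bookkeeping mentioned above (wrapping open/close and
recording enough state to reposition each file at restart) might look
roughly like the following.  This is only a minimal sketch and not the
code from the post: the names ckpt_open, ckpt_close, ckpt_note_offsets
and ckpt_reopen_all, the table sizes, and the choice to record the path
and the lseek offset are all assumptions made for illustration.  It
handles plain files only, and assumes paths still resolve to the same
files at restart.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CKPT_MAXFD   64    /* assumed limit on tracked descriptors */
#define CKPT_MAXPATH 256   /* assumed limit on path length */

/* Hypothetical per-descriptor record; not from the original code. */
struct ckpt_file {
    int   in_use;
    int   flags;                 /* flags originally passed to open() */
    off_t offset;                /* position noted at the last checkpoint */
    char  path[CKPT_MAXPATH];
};

/* Static data, so a data-segment checkpoint would save this table too. */
static struct ckpt_file ckpt_table[CKPT_MAXFD];

/* Wrapper for open(): remember how the file was opened. */
int ckpt_open(const char *path, int flags, mode_t mode)
{
    int fd;

    if (strlen(path) >= CKPT_MAXPATH)
        return -1;
    fd = open(path, flags, mode);
    if (fd < 0)
        return -1;
    if (fd >= CKPT_MAXFD) {      /* too many descriptors for this sketch */
        close(fd);
        return -1;
    }
    ckpt_table[fd].in_use = 1;
    ckpt_table[fd].flags = flags;
    ckpt_table[fd].offset = 0;
    strcpy(ckpt_table[fd].path, path);
    return fd;
}

/* Wrapper for close(): forget the descriptor. */
int ckpt_close(int fd)
{
    if (fd >= 0 && fd < CKPT_MAXFD)
        ckpt_table[fd].in_use = 0;
    return close(fd);
}

/* Call just before saving the data segment: note each file position. */
int ckpt_note_offsets(void)
{
    int fd, status = 0;

    for (fd = 0; fd < CKPT_MAXFD; fd++) {
        if (!ckpt_table[fd].in_use)
            continue;
        ckpt_table[fd].offset = lseek(fd, (off_t)0, SEEK_CUR);
        if (ckpt_table[fd].offset == (off_t)-1)
            status = -1;         /* not seekable, e.g. a pipe */
    }
    return status;
}

/* Call after the saved data has been reloaded: reopen and reposition. */
int ckpt_reopen_all(void)
{
    int fd, newfd;

    for (fd = 0; fd < CKPT_MAXFD; fd++) {
        if (!ckpt_table[fd].in_use)
            continue;
        /* Never create or truncate the file a second time. */
        newfd = open(ckpt_table[fd].path,
                     ckpt_table[fd].flags & ~(O_CREAT | O_TRUNC | O_EXCL));
        if (newfd < 0)
            return -1;
        if (newfd != fd) {       /* put it back on the old descriptor number */
            if (dup2(newfd, fd) < 0) {
                close(newfd);
                return -1;
            }
            close(newfd);
        }
        if (lseek(fd, ckpt_table[fd].offset, SEEK_SET) == (off_t)-1)
            return -1;
    }
    return 0;
}
```

In use, the program would call ckpt_open/ckpt_close wherever it now
calls open/close, call ckpt_note_offsets() immediately before writing
out a checkpoint, and call ckpt_reopen_all() right after the restore
routine has reloaded the saved data.  Keeping the table in static
storage means it rides along with the rest of the checkpointed data,
so no separate file format is needed for it.
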
Note that the saved data includes the stdio buffers etc., so if I
_did_ decide to save file descriptor states, all I would need to do at
restart is to lseek to the old location... assuming the program
doesn't lseek around also.  If it did, I'd have to copy all the files
to get _their_ state at the time of the checkpoint (bleh).

Restart is performed by running the program with a switch specifying
the checkpoint file, whereupon the state from the file is loaded into
the current address space (i.e., your program would have to recognize
a flag and call my restore function).

I have versions of this code running on Vaxen and IBM RTs.  I
currently have 3 1000 digit probabilistic primes.  Does any factoring
wizard want a 2000 digit compos... :-)

To generate 100 digit probabilistic primes (probability 1 - 2^-40), it
takes
	129.3u 0.7s 2:28 87%
on an IBM RT/APC and
	290.2u 0.1s 8:49 54%
on a uVax III.

The primality code uses the cmump library package developed here at
CMU (cmump is based on the mp package from BTL), so it probably won't
be useful unless you have a source license or you're willing to
rewrite it.  As for the checkpointing code, I'm willing [and able] to
share.  I only use Unix syscalls and the code should have no Mach
dependencies.

-bsy
-- 
Internet: bsy@cs.cmu.edu
Bitnet:   bsy%cs.cmu.edu%smtp@interbit
CSnet:    bsy%cs.cmu.edu@relay.cs.net
Uucp:     ...!seismo!cs.cmu.edu!bsy
USPS:     Bennet Yee, CS Dept, CMU, Pittsburgh, PA 15213-3890
Voice:    (412) 268-7571
--