Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!tut.cis.ohio-state.edu!uc!shamash!zeke
From: zeke@shamash.cdc.com (Robert Scott)
Newsgroups: comp.unix.internals
Subject: Re: restarting processes
Summary: Checkpointing in UNIX is tough.
Message-ID: <25381@shamash.cdc.com>
Date: 4 Sep 90 14:17:19 GMT
References: <24374@adm.BRL.MIL> <1990Sep3.235815.17361@wrl.dec.com>
Organization: Control Data Corporation, Arden Hills, MN.
Lines: 55

In article <1990Sep3.235815.17361@wrl.dec.com>, vixie@wrl.dec.com (Paul Vixie) writes:
> I'd like to do this also. But if your process has pipes open to other
> processes, then those other processes would have to be restarted in the
> same state if your process was to be restarted "correctly". If you had
> files open, those same files would have to be there when you restarted,
> with the same contents. If you had a physical device file open, the
> results could be confusing (let's say someone else dismounts your tape
> and mounts one of their own -- can you get your tape back to the same
> "state" it was in when you restart your program?). And of course, if
> you had any network connections open, then all of this stickiness extends
> to whatever processes you're talking to on (the) remote machine(s).
>
> This kind of restartability wasn't on the UNIX designers' minds, and the
> system call interface has absolutely no architectural support for it.
> The thing you're trying to do is usually done at the application layer,
> as in "commit" operations in databases, and like that.
>

Stuff deleted...

On most Control Data machines running NOS or NOS/VE, and the old Cyber 205
supercomputer, there is a facility called "checkpointing" the system. When
the operator does this, the state of all running processes is saved,
complete with open file info and everything.
After the checkpoint, the system can be brought down for maintenance or
whatever, and then restored to its running state as of the checkpoint by
reloading the system and going through a "restart" process to reload and
restore the executing jobs.

I believe that on the Cyber 205 we could also checkpoint individual jobs.
The big difference between UNIX and VSOS (the 205 OS), though, was that
each 205 job was almost always a single process unless it was a system
task.

As Paul writes above, UNIX poses many possible problems for this kind of
operation. Remember, UNIX was written basically as a small computer OS for
interactive access, and wasn't originally intended to be running weather
models or other large programs that might have to run on a supercomputer
for 24 hours before completing.

On the large mainframes, particularly in the scientific computing arena,
huge data reduction or repetitive calculation jobs are the norm, as is
batch input/output. As a matter of normal operations on these giant pieces
of iron, programs and entire OS states need to be saved so that the machine
can be serviced or a higher priority program run.

Checkpoint on UNIX would be a nice idea, though.

Zeke

~~~~~~~~~~~ From the Shrine of the "Last Gasp of ETA Systems" ~~~~~~~~~~~~~
Extra zesty disclaimer: MINE! MINE! ALL MINE!
Robert K. "Zeke" Scott                    internet: zeke@eta.cdc.com
Control Data Corp, Supercomputer Support Group
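
For what it's worth, the application-layer approach Paul mentions can be
sketched in a few lines. This is only an illustration in Python, not how
NOS, NOS/VE, or VSOS did it: the job periodically writes its own state to a
file and picks up from that file when restarted. The names save_checkpoint,
load_checkpoint, and run, the JSON state format, and the stop_after
parameter (which stands in for the operator bringing the machine down) are
all hypothetical. It also assumes the job's whole state fits in one small
record, which sidesteps the open pipes, devices, and network connections
discussed above.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` (a JSON-serializable dict) to `path` atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first, so a crash
    # mid-write never leaves a half-written checkpoint at `path`.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp, path)  # atomic rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every=100, stop_after=None):
    """Toy long-running job: sums 0..total_steps-1, resuming from `path`.

    `stop_after` (hypothetical) ends this run after that many steps, to
    imitate the machine being taken down partway through the job.
    """
    state = load_checkpoint(path, {"step": 0, "accum": 0})
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break
        state["accum"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    # Final checkpoint on the way out, whether finished or interrupted.
    save_checkpoint(path, state)
    return state
```

Calling run("job.ckpt", 1000, stop_after=250) and then run("job.ckpt", 1000)
finishes the same job across two separate runs. The point of the
write-then-rename step is that a restart always sees either the old
checkpoint or the new one, never a partial file.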