Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!cs.utexas.edu!usc!sdd.hp.com!decwrl!ucbvax!bloom-beacon!world!bzs
From: bzs@world.std.com (Barry Shein)
Newsgroups: comp.unix.wizards
Subject: Re: Checkpoint/Restart
Message-ID:
Date: 22 Aug 90 05:19:41 GMT
References: <24239@adm.BRL.MIL>
Sender: bzs@world.std.com (Barry Shein)
Organization: The World
Lines: 48
In-Reply-To: mike@BRL.MIL's message of 21 Aug 90 04:25:26 GMT

TOPS-20 made this sort of thing trivial via the SAVE command. It just
rolled all of your current foreground processes' virtual memory into a
file. Kinda like a core dump, but re-executable.

Actually, the foreground processes' virtual memory was always just
kind of there, sort of like being able to TSTP a process and then adb
(ahem, DDT) it. Not horribly different than adb (et al) defaulting to
"core", tho I think you could continue stepping a stopped job (CMS
also had that virtual memory quality, certainly before TOPS-20, but I
don't remember any easy way to save it to a file and restart it.)

TOPS-20 would issue an interrupt (signal) when the program was
restarted which could be trapped to re-init anything you wanted,
again, not that different from SIGCONT, but across a checkpoint.

*BUT*, it was surely fraught with all the problems mentioned for Unix,
nothing magic, the process had to be able to reinit itself when it got
a restart interrupt, and hope that nothing in the external state had
changed much. So experience bears out what people are trying to say.

Some of the problems with checkpoint/restart are probably also
potential problems with SIGTSTP'd jobs (try seeing how long you can ^Z
a local uucico process and still continue where you left off.)

Another concern is that it seems to me that once TOPS-20 had a SAVE
facility it tended to get in the way of other design decisions. An
answer to a question "why doesn't TOPS-20 do this" was sometimes
answered with "if they did that then SAVE couldn't work right."
I seem to remember this coming up in some peculiarities with the
RESCAN buffer design (sort of like Unix's argv/argc, or maybe it was
just that it never worked quite right on restarted jobs.)

That's the real design problem, it has the potential of becoming an
enormous, draconian tail wagging a quite harried dog if the OS should
promise to do this. I vote for the library routine and applications
being responsible.

(History buffs, earn points for valuable prizes! Didn't OS/MVT do this
kind of cold/warm reboot, where warm reboots, when possible, just
continued everything other than perhaps the job active when the system
crashed?)

-- 
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD
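
As a rough illustration of the application-side approach voted for above, here is a minimal Python sketch of a program that traps a restart signal and re-initializes its external state afterwards. It uses SIGCONT as a stand-in for the TOPS-20 restart interrupt. The class RestartAwareApp, its method names, and the dictionary standing in for an external resource are all hypothetical, invented for this example, and nothing here performs an actual checkpoint.

```python
import os
import signal
import time


class RestartAwareApp:
    """Hypothetical app that rebuilds external state after being resumed."""

    def __init__(self):
        self.restart_pending = False
        self.reinit_count = 0
        self.resource = self._open_resource()

    def _open_resource(self):
        # Stand-in (an assumption, not from the post) for anything that
        # lives outside the process image: a socket, a tty mode, a lock.
        return {
            "pid": os.getpid(),
            "opened_at": time.time(),
            "generation": self.reinit_count,
        }

    def on_restart(self, signum, frame):
        # Keep the handler tiny: only record that a restart happened.
        self.restart_pending = True

    def install(self, signum=signal.SIGCONT):
        # SIGCONT plays the role of the restart interrupt in this sketch.
        signal.signal(signum, self.on_restart)

    def poll(self):
        # Called from the main loop; re-inits once per observed restart.
        if self.restart_pending:
            self.restart_pending = False
            self.reinit_count += 1
            self.resource = self._open_resource()
        return self.resource
```

A main loop would call install() once at startup and poll() before touching the resource. Whether the re-opened resource is still in a state the program can continue from is exactly the part that no handler can guarantee.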