Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!ucsd!ucbvax!agate!darkstar!PLAY.MACH.CS.CMU.EDU
From: bsy@PLAY.MACH.CS.CMU.EDU (Bennet Yee)
Newsgroups: comp.os.research
Subject: Re: OSs supporting checkpointing: looking for examples.
Message-ID: <5581@darkstar.ucsc.edu>
Date: 31 Jul 90 05:59:48 GMT
Sender: usenet@darkstar.ucsc.edu
Organization: Cranberry Melon, School of Cucumber Science
Lines: 47
Approved: comp-os-research@jupiter.ucsc.edu

In article <5536@darkstar.ucsc.edu>, gkn@ucsd.Edu (Gerard K. Newman) writes:
|>
|> In article <5496@darkstar.ucsc.edu> vic@cs.arizona.edu (Vicraj T. Thomas) writes:
|> >
|> >I am looking for examples of "traditional" operating systems (i.e. centralized
|> >OSs) that allowed user processes to periodically checkpoint their state so
|> >that, in case of a failure and subsequent recovery, they could be restarted
|> >from the last checkpoint.  ...
|>
|> UNICOS, from Cray Research is one such.  Also, though less in the mainstream,
|> CTSS (from NERSC/LLNL) also allows this.

Do you mean to include just the state of the process from the point of
view of the OS, or do you include any external servers with which the
process has communicated via IPC?  [I'm including the file system as
part of the traditional OS.]

To be more concrete, if we use BSD Unix as an example, is saving the
following sufficient: the current working directory, the installed
signal handlers, the state of the various alarms, the position of all
open file descriptors (and the contents of the files, I presume
[expensive]), the contents of the address space of the process
(presumably just the data and stack segments), and the contents of the
user registers?  The general case of including IPC sockets is certainly
_much_ more complicated.

If what I described above satisfies your definition, then I'd claim that
traditional BSD Unix can be made to perform generic checkpointing with a
little bit of user code.
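
To make the list above a little more concrete, here is a minimal sketch in
C of saving and restoring two of the cheaper items: the current working
directory and the position of one open file descriptor.  This is not the
code described in this post.  The names (struct ckpt, ckpt_capture,
ckpt_write, ckpt_read, ckpt_restore, CKPT_PATH_MAX) are hypothetical, and
it assumes a single tracked file whose path and open flags the caller
remembers.  It does not attempt file contents, signal handlers, alarms, the
address space, or the registers.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CKPT_PATH_MAX 1024   /* assumed limit for this sketch */

/* Hypothetical record of the state this sketch chooses to save. */
struct ckpt {
    char  cwd[CKPT_PATH_MAX];   /* current working directory */
    char  path[CKPT_PATH_MAX];  /* path of the one tracked file */
    int   flags;                /* flags to reopen the file with */
    off_t offset;               /* file position at checkpoint time */
};

/* Record the cwd and the position of fd.  The caller supplies path and
 * flags, since they cannot be portably recovered from a descriptor. */
int ckpt_capture(struct ckpt *c, int fd, const char *path, int flags)
{
    memset(c, 0, sizeof *c);
    if (getcwd(c->cwd, sizeof c->cwd) == NULL) return -1;
    if (strlen(path) >= sizeof c->path) return -1;
    strcpy(c->path, path);
    /* Never create, truncate, or exclusively open on restore. */
    c->flags = flags & ~(O_CREAT | O_TRUNC | O_EXCL);
    c->offset = lseek(fd, 0, SEEK_CUR);
    return c->offset == (off_t)-1 ? -1 : 0;
}

/* Write the record to a temporary file, then rename it into place, so a
 * crash mid-write leaves the previous checkpoint intact. */
int ckpt_write(const struct ckpt *c, const char *file)
{
    char tmp[CKPT_PATH_MAX];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", file);
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    int ok = write(fd, c, sizeof *c) == (ssize_t)sizeof *c
             && fsync(fd) == 0;
    if (close(fd) != 0) ok = 0;
    if (!ok || rename(tmp, file) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}

/* Read a record back; fails unless a whole record is present. */
int ckpt_read(struct ckpt *c, const char *file)
{
    int fd = open(file, O_RDONLY);
    if (fd < 0) return -1;
    ssize_t n = read(fd, c, sizeof *c);
    close(fd);
    return n == (ssize_t)sizeof *c ? 0 : -1;
}

/* Re-establish the cwd, reopen the file, and seek to the saved position.
 * Returns the new descriptor, or -1 on failure. */
int ckpt_restore(const struct ckpt *c)
{
    if (chdir(c->cwd) != 0) return -1;
    int fd = open(c->path, c->flags);
    if (fd < 0) return -1;
    if (lseek(fd, c->offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A long-running job would call ckpt_capture and ckpt_write periodically,
and on restart call ckpt_read followed by ckpt_restore before resuming
work.  The record is written in the host's native layout, so the sketch
only covers restarting on the same machine and binary.
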
A few years ago, I implemented a restricted form of this which
saves/restores only the address space and the registers, to allow some
long-running jobs to survive reboots/crashes.  With the help of my rc
file, I had the system continue from the last checkpoint automatically
after reboot.  Extending my code to save the state of the file
descriptors, file contents, etc. could easily be done by replacing the C
stubs for certain syscalls, and the techniques used can easily be applied
to other flavors of Unix as well.

Also, any OS that supports transaction processing could be argued to have
this property....

-*-*-
Bennet S. Yee, +1 412 268-7571
School of Cucumber Science, Cranberry Melon, Pittsburgh, PA 15213-3890
Internet: bsy+@cs.cmu.edu		Uunet: ...!seismo!cs.cmu.edu!bsy+
Csnet: bsy+%cs.cmu.edu@relay.cs.net	Bitnet: bsy+%cs.cmu.edu@cmuccvma