Xref: utzoo comp.sys.isis:409 comp.os.mach:530 comp.lang.misc:5493 Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!rice!titan.rice.edu!dbj From: dbj@titan.rice.edu (Dave Johnson) Newsgroups: comp.sys.isis,comp.os.mach,comp.lang.misc Subject: Re: Recovery (was Re: Data sharing among lightweight tasks) Keywords: sender-based message logging, optimistic recovery Message-ID: <1990Sep21.145447.20809@rice.edu> Date: 21 Sep 90 14:54:47 GMT References: <45212@cornell.UUCP> <1990Aug30.164153.25008@swbatl.sbc.com> <1990Sep13.150241.16073@arnor.uucp> Sender: news@rice.edu (News) Organization: Rice University, Houston Lines: 90 My work with Willy Zwaenepoel has recently been referred to in this newsgroup in messages by Ken Birman and Rob Strom. However, we have progressed well beyond what is represented in those messages, and I would like to take this opportunity to comment on some of our recent results. In article <45212@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes: >Willy Zwaenapoel >(Rice) has recently done a version of this in his work on "Sender based >message logging", which is cheap, but doesn't support rollback and hence >isn't an appropriate tool for solving this particular problem (he is more >interested in a cheap transparent fault-tolerance mechanism, and anyhow, >he assumes a deterministic, non-threaded, execution). Sender-based message logging was our initial work in this area, and it is, as Ken points out, transparent and very fast, but restricted to deterministic processes and can guarantee recovery from at most a single failure at a time. A recent paper describing the implementation and performance of sender-based message logging is available as technical report Rice COMP TR90-119. Beyond sender-based message logging, we have also developed optimistic recovery methods, as mentioned by Rob Strom. As with Strom and Yemini's system, our optimistic system supports recovery from any number failures, including a total failure. Furthermore, our our current optimistic recovery system no longer requires deterministic process execution, and can thus support arbitrary multi-threaded processes. The principle in our work remains the same as in Strom and Yemini's original paper. As with their method, but unlike ISIS, we provide transparent fault tolerance, relieving the programmer from having to worry about it. Our current design stresses reducing overhead during failure-free execution in exchange for possibly somewhat slower recovery times. Each process logs its own input messages, and takes occasional checkpoints, independently of other processes. Our system differs from Strom and Yemini's in that we use a more space-efficient dependency encoding method (requiring only one extra integer per message), and do not rely on an underlying reliable message delivery protocol. Our system also guarantees recovery to the maximum consistent system state available from the surviving processes and the information on stable storage. We have completed a full implementation of optimistic recovery in the V-System, although this implementation uses our earlier recovery algorithm that does not yet support nondeterministic process execution. Based on measured performance, this method is very efficient. Since logging is asynchronous, very little overhead is imposed on individual communication operations. A 32-byte Send-Receive-Reply (similar to an RPC) between two SUN-3/60's takes 1.6 milliseconds with logging vs. 1.4 milliseconds without, for an overhead of 14 percent. Similarly, a bulk data transfer of 64 kilobytes takes 89 milliseconds with logging vs. 87 milliseconds without, for an overhead of only 2 percent. The cost of checkpointing is reduced by doing pre-copying of the address space to disk before freezing the process, similarly to the method used by Theimer for process migration. The resulting overhead is somewhat application-dependent but is in general quite small: the process is typically frozen for less than 50 milliseconds. We have also measured the performance of several distributed application programs with the system. For 300 x 300 Gaussian elimination with partial pivoting running on 8 machines, the message logging adds less than 1 percent overhead, and checkpointing each process (independently) every 15 seconds adds less than another 2 percent, a small price to pay for the ability to recover from any number of failures. A complete account of this system is included in my Ph.D. dissertation, available as technical report Rice COMP TR89-101. A new implementation is currently underway. This implementation will support nondeterministic processes, and will provide better recovery times and reduced output latency, in addition to further improved failure-free performance. Nondeterministic processes are supported by allowing a process to dynamically turn off message logging, and rely on checkpointing alone for recovery. Output latency and recovery times are improved by loosely coordinating the message logging and/or checkpointing. Details may be found in a recent technical report Rice COMP TR90-118. We are also considering the possibility of implementing this algorithm under Unix or Mach, as well. Finally, another member of our group, Mootaz Elnozahy, has been experimenting with a new pessimistic protocol, applicable to both message logging and replication. Although we do not yet have an implementation of this method, it seems to combine many of the advantages of optimistic and pessimistic protocols. His thesis proposal containing the first draft of this method is available as Rice COMP TR90-120. All of the above technical reports or any further information can be obtained by sending mail to me (dbj@rice.edu) or Willy Zwaenepoel (willy@rice.edu). David B. Johnson Department of Computer Science Rice University --