Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!tut.cis.ohio-state.edu!snorkelwacker!husc6!m2c!umvlsi!dime!yodaiken
From: yodaiken@freal.cs.umass.edu (victor yodaiken)
Newsgroups: comp.arch
Subject: Re: Fault Tolerance
Message-ID: <9660@dime.cs.umass.edu>
Date: 5 Feb 90 19:11:11 GMT
References: <13910004@hpisod2.HP.COM> <13910009@hpisod2.HP.COM> <35300@mips.mips.COM> <1990Feb2.035201.21073@tandem.com> <7840@pt.cs.cmu.edu>
Sender: news@dime.cs.umass.edu
Reply-To: yodaiken@freal.cs.umass.edu (victor yodaiken)
Organization: University of Massachusetts, Amherst
Lines: 50

In article <7840@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>
>What ever happened to the Auragen Unix kernel? They did checkpointing
>between process pairs, and synchronized them at intervals. (Each Unix
>signal caused a synch, because it had to interrupt both processes at
>exactly the same instruction.) Synchronization also involved paging
>out all dirty pages: certainly an argument against the VAX, which
>doesn't know who's dirty.
>
>I believe the Auragen people also pulled some kernel functions into
>server processes, where it was easier to make them survive. This
>makes the various kernelization projects (such as Mach) sound ever
>more attractive.
>--
>Don D.C.Lindsay Carnegie Mellon Computer Science

The Auragen idea was very simple and, in my biased (I worked for Auragen)
opinion, is still the best plan for a fault tolerant system. Although
Auragen died dismally, the o.s. lives on in a Nixdorf machine (Nixdorf is
also reported to be in trouble, which makes you wonder).

The basic idea is to force all process i/o to go through messages. A
primary process is associated with an inactive backup process on another
machine. All messages transmitted by the primary process must be
delivered to 3 sites: the destination, the destination's backup, and the
backup of the transmitting process.
The sender's backup can discard the message and just keep a count of how
many messages the primary has sent since the last checkpoint. Every
message accepted by the primary process must also have been delivered to
its backup and to the backup of the sender.

When a primary process dies, its backup is restarted, and whenever it
sends a message the count of messages sent by the primary is consulted.
If this count is non-zero, the count is decremented and the message is
discarded: the process is unaware of the difference, but the o.s. knows
the message was previously transmitted and does not need to be re-sent.
Whenever the process tries to read a message, it should find the messages
previously read by the primary already on its input queue.

Whenever the queues of messages get too big, or the count gets too high,
or whatever, the backup can be synced; that is, the pages of the primary
can be written out to backed-up store, and the backup's count and input
message queue can be cleared. We had a message bus which forced
3-or-none acking of messages, but this is not strictly necessary. There
was a recent article in the ACM SIGOPS newsletter on how to apply the
Auragen scheme to Mach.

There are a lot of complications hidden in the simplicity of this method,
and I don't know how fast it could work in a generic distributed system
architecture. For example, "time" system calls must go to a backed-up
system server, i.e. must involve a message transaction; otherwise the
backup will not see the same time as the primary, and the recovery might
disintegrate. On the other hand, perhaps the generic distributed system
architecture can't run any o.s. fast.
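
To make the bookkeeping concrete, here is a toy, single-machine sketch of
the scheme as described above. It is not Auragen code: the names
(Process, Backup, take_over, step) are made up for illustration, the
"checkpoint" is a single integer rather than dirty pages, and atomic
3-way delivery is simply assumed. It only shows the send count being
decremented on replay, so that the restarted backup does not re-send
what the dead primary already sent.

```python
from collections import deque


class Backup:
    """State held for an inactive backup on another machine (hypothetical)."""

    def __init__(self, checkpoint):
        self.sent_count = 0           # messages the primary sent since last sync
        self.inbox = deque()          # copies of messages delivered to the primary
        self.checkpoint = checkpoint  # primary's state as of the last sync


class Process:
    def __init__(self, name, state=0):
        self.name = name
        self.state = state
        self.inbox = deque()
        self.backup = Backup(state)
        self.suppress = 0             # sends still to be discarded after takeover

    def send(self, dest, msg):
        if self.suppress > 0:
            # The dead primary already sent this one: count down and discard.
            self.suppress -= 1
            return
        # Three-way delivery: destination, destination's backup, and the
        # sender's backup, which only counts the message.
        dest.inbox.append(msg)
        dest.backup.inbox.append(msg)
        self.backup.sent_count += 1

    def read(self):
        return self.inbox.popleft()

    def sync(self):
        """Checkpoint the primary and clear the backup's bookkeeping."""
        self.backup.checkpoint = self.state
        self.backup.sent_count = 0
        # Assumption: messages the primary has not read yet stay queued.
        self.backup.inbox = deque(self.inbox)


def take_over(dead):
    """Restart the backup from the last checkpoint after the primary dies."""
    b = dead.backup
    new = Process(dead.name, state=b.checkpoint)
    new.inbox = deque(b.inbox)        # replay what the primary had been given
    new.suppress = b.sent_count       # discard this many re-sent messages
    return new                        # (creating a fresh backup is omitted)


def step(proc, sink):
    """Deterministic work: add the next input to a running total, report it."""
    proc.state += proc.read()
    proc.send(sink, proc.state)


def demo():
    client, worker, sink = Process("client"), Process("worker"), Process("sink")
    for n in (1, 2, 3):
        client.send(worker, n)
    step(worker, sink)                # primary handles 1
    step(worker, sink)                # primary handles 2, then dies
    worker = take_over(worker)        # backup restarts from the checkpoint
    while worker.inbox:
        step(worker, sink)            # first two sends are discarded
    return list(sink.inbox), worker.state
```

The sketch only works because step() is deterministic given its input
messages. That is the same reason the "time" call has to be a message
transaction: anything the process can observe outside the message stream
would let the replay diverge from what the primary actually did.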