Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!tut.cis.ohio-state.edu!snorkelwacker!husc6!m2c!umvlsi!dime!yodaiken
From: yodaiken@freal.cs.umass.edu (victor yodaiken)
Newsgroups: comp.arch
Subject: Re: Fault Tolerance
Message-ID: <9660@dime.cs.umass.edu>
Date: 5 Feb 90 19:11:11 GMT
References: <13910004@hpisod2.HP.COM> <13910009@hpisod2.HP.COM> <35300@mips.mips.COM> <1990Feb2.035201.21073@tandem.com> <7840@pt.cs.cmu.edu>
Sender: news@dime.cs.umass.edu
Reply-To: yodaiken@freal.cs.umass.edu (victor yodaiken)
Organization: University of Massachusetts, Amherst
Lines: 50

In article <7840@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>
>What ever happened to the Auragen Unix kernel? They did checkpointing
>between process pairs, and synchronized them at intervals. (Each Unix
>signal caused a synch, because it had to interrupt both processes at
>exactly the same instruction.) Synchronization also involved paging
>out all dirty pages: certainly an argument against the VAX, which
>doesn't know who's dirty.
>
>I believe the Auragen people also pulled some kernel functions into
>server processes, where it was easier to make them survive. This
>makes the various kernelization projects (such as Mach) sound ever
>more attractive.
>-- 
>Don		D.C.Lindsay 	Carnegie Mellon Computer Science


The Auragen idea was very simple and, in my biased opinion (I worked for
Auragen), is still the best plan for a fault-tolerant system. Although
Auragen died dismally, the o.s. lives on in a Nixdorf machine (Nixdorf
is also reported to be in trouble, which makes you wonder).
The basic idea is to force all process i/o to go through messages. Each
primary process is associated with an inactive backup process on another
machine. Every message transmitted by a primary must be delivered to
3 sites: the destination, the destination's backup, and the backup of the
transmitting process. The sender's backup can discard the message and just
keep a count of how many messages the primary has sent since the last
checkpoint. Equivalently, every message accepted by a primary process must
also have been delivered to its backup and to the backup of the sender.

When a primary process dies, its backup is restarted, and whenever it sends
a message the count of messages sent by the primary is consulted. If this
count is non-zero, the count is decremented and the message is discarded:
the process is unaware of the difference, but the o.s. knows the message
was previously transmitted and does not need to be re-sent. Whenever the
process tries to read a message, it should find the messages previously
read by the primary already on its input queue.

Whenever the queues of messages get too big, or the count gets too high,
or whatever, the backup can be synced: that is, the pages of the primary
can be written out to backed-up store, and the backup's count and input
message queue can be cleared.
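
To make the bookkeeping concrete, here is a toy sketch in Python. It is not
Auragen code: the names (Process, Backup, fail_over, worker, suppress) are all
made up for illustration, the "machines" are just objects in one address
space, and it assumes the saved count is decremented once per suppressed
send so that it eventually reaches zero. It also assumes a sync keeps any
messages the primary has not yet read, which the description above does not
spell out.

```python
from collections import deque


class Backup:
    """State held for an inactive backup (hypothetical structure)."""

    def __init__(self):
        self.sent_count = 0   # messages the primary sent since the last sync
        self.inbox = deque()  # messages delivered to the primary since the last sync

    def sync(self, primary):
        # Assumption: the primary's pages have been saved elsewhere, so only
        # messages the primary has not read yet still need to be remembered.
        self.sent_count = 0
        self.inbox = deque(primary.inbox)


class Process:
    def __init__(self, name):
        self.name = name
        self.inbox = deque()
        self.backup = Backup()
        self.suppress = 0     # sends still to be swallowed after a failover

    def send(self, dest, msg):
        if self.suppress > 0:
            # Replaying after a crash: the primary already sent this one.
            self.suppress -= 1
            return False
        # Three-way delivery: destination, destination's backup, own backup.
        dest.inbox.append(msg)
        dest.backup.inbox.append(msg)
        self.backup.sent_count += 1
        return True

    def receive(self):
        return self.inbox.popleft()


def fail_over(dead):
    """Start the backup of a dead primary from its saved count and queue."""
    fresh = Process(dead.name)
    fresh.inbox = deque(dead.backup.inbox)
    fresh.suppress = dead.backup.sent_count
    return fresh


def worker(proc, peer):
    """A deterministic program: read two numbers, send their sum."""
    a = proc.receive()
    b = proc.receive()
    return proc.send(peer, a + b)


client = Process("client")
server = Process("server")
client.send(server, 2)
client.send(server, 3)
first_run = worker(server, client)        # primary really sends the sum
replacement = fail_over(server)           # primary "dies"; backup takes over
second_run = worker(replacement, client)  # same reads, but the send is swallowed
```

After this runs, the client's queue holds the sum exactly once: the
replacement re-read the same two inputs from its restored queue, and its
duplicate send was counted off and dropped. A real system would also have to
give the replacement a backup of its own, which this sketch skips.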

We had a message bus which forced 3-or-none acking of messages, but this is
not strictly necessary. There was a recent article in the ACM SIGOPS
newsletter on how to apply the Auragen scheme to Mach. There are a lot of
complications hidden in the simplicity of this method, and I don't know how
fast it could work in a generic distributed system architecture. For
example, "time" system calls must go to a backed-up system server, i.e.
must involve a message transaction; otherwise the backup will not see the
same time as the primary, and the recovery might disintegrate. On the
other hand, perhaps the generic distributed system architecture can't run
any o.s. fast.
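
The "time" problem can be shown with another toy sketch in Python. Again the
names (Clock, run_direct, run_via_messages) are hypothetical and the clock is
a simple counter, not a real one. The point is only that a value read
directly from the clock differs on replay, while a value that arrives as a
message is logged for the backup and replays identically.

```python
import itertools


class Clock:
    """Stand-in for a hardware clock: every read returns a later value."""

    def __init__(self):
        self._ticks = itertools.count(100)

    def now(self):
        return next(self._ticks)


def run_direct(clock):
    # Reads the clock directly, bypassing the message system.
    return "even" if clock.now() % 2 == 0 else "odd"


def run_via_messages(inbox):
    # The time arrives as a reply message from a (hypothetical) time server.
    t = inbox.pop(0)
    return "even" if t % 2 == 0 else "odd"


clock = Clock()

# Direct reads: the replaying backup sees a later time and takes another path.
primary_direct = run_direct(clock)
backup_direct = run_direct(clock)

# Via messages: the server's one reply is delivered to primary and backup.
reply = clock.now()
primary_inbox = [reply]
backup_inbox = [reply]
primary_msg = run_via_messages(primary_inbox)
backup_msg = run_via_messages(backup_inbox)
```

Here the two direct runs disagree, while the two message-driven runs agree,
at the cost of a message transaction for every time call.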