Path: utzoo!attcan!uunet!cs.utexas.edu!tut.cis.ohio-state.edu!ucbvax!hplabs!hpda!hpcupt1!hpisod2!dhepner
From: dhepner@hpisod2.HP.COM (Dan Hepner)
Newsgroups: comp.arch
Subject: Re: Re: Fault Tolerance [LONG]
Message-ID: <13910014@hpisod2.HP.COM>
Date: 7 Feb 90 20:12:00 GMT
References: <1990Feb2.035201.21073@tandem.com>
Organization: Hewlett Packard, Cupertino
Lines: 55

From: rtrauben@cortex.Sun.COM (Richard Trauben)

[hopefully someone can answer the excellent "just how do you get 
 it stopped" questions]

>Presumably no-one is interested in dumping the state of a failed PE-pairs
>write-back$; execution would resume from last process checkpoint. 

Hmm. If what you mean by "PE-pairs" is what is generally called
Quad Modular Redundancy (QMR), with two lockstep processors constituting
a PE, and two of those constituting a logical processor, there is
no requirement for checkpointing; the instruction will be successfully
executed.

If what you mean however by a PE-pair is two lockstepped processors,
which upon detection of a miscomparison take themselves offline,
indeed a checkpoint needs to be done to preserve the state for
some backup processor.
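
To make the distinction concrete, here is a rough sketch of the two
arrangements. It is only a software model under my own assumptions: the
names Miscompare, lockstep_pair and qmr are made up for illustration,
processors are modelled as plain functions, and the two pairs are tried
one after the other, whereas real hardware would run them at the same time.

```python
class Miscompare(Exception):
    """Raised when the two halves of a lockstep pair disagree."""


def lockstep_pair(cpu_a, cpu_b, instr):
    # Both processors execute the same instruction; any disagreement
    # takes the whole pair offline (modelled here as an exception).
    a, b = cpu_a(instr), cpu_b(instr)
    if a != b:
        raise Miscompare(instr)
    return a


def qmr(pair1, pair2, instr):
    # Two self-checking pairs form one logical processor. If one pair
    # drops out, the other still delivers the result, so no checkpoint
    # or restart is needed for this instruction.
    for cpu_a, cpu_b in (pair1, pair2):
        try:
            return lockstep_pair(cpu_a, cpu_b, instr)
        except Miscompare:
            continue
    raise Miscompare(instr)  # both pairs failed
```

With a single pair, the Miscompare is the point at which a backup has to
pick up from the last checkpoint.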

Part of the checkpointing mechanism is that all effects of processing
done by the failed processor after the checkpoint must be abandoned,
and that includes any write-back cache state.

>How about resuming from the checkpoint and unintentionally resending 
>redundant mass store and datacom messages.

The checkpoint itself must be atomic, in that it must complete
fully or not at all.  "Half-checkpoints" must be seen as effects
of processing done after the last successful checkpoint, and
be abandoned.

The IO request atomicity can be addressed as part of the problem of 
checkpoint atomicity. Once the atomic checkpoint mechanism is developed, 
the initiation of IO requests can be incorporated, so that the initiation 
of an IO request happens only at the time of a successful checkpoint.
From the recovery processor's point of view, either the checkpoint/
IO request happened or it didn't, and that is discernible.
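
A minimal sketch of that idea, with everything in it assumed for
illustration (CheckpointStore, Primary, recover and the in-memory "stable
store" are hypothetical, not any vendor's mechanism). IO requests are only
queued while processing, and are released as part of the checkpoint commit:

```python
import copy


class CheckpointStore:
    """Hypothetical stable store visible to both primary and backup."""

    def __init__(self):
        self.seq = 0      # sequence number of the last committed checkpoint
        self.state = {}   # process state as of that checkpoint
        self.issued = []  # IO requests released by committed checkpoints

    def commit(self, seq, state, io_requests):
        # Build the complete new record first, then publish it in one step.
        # The single assignment stands in for whatever mechanism makes the
        # real checkpoint all-or-nothing.
        record = (seq, copy.deepcopy(state), self.issued + list(io_requests))
        self.seq, self.state, self.issued = record


class Primary:
    def __init__(self, store):
        self.store = store
        self.state = copy.deepcopy(store.state)
        self.pending = []  # IO requested since the last checkpoint

    def write(self, block, data):
        self.state[block] = data
        self.pending.append(("write", block, data))  # queued, not issued

    def checkpoint(self):
        # State and IO initiation commit together or not at all.
        self.store.commit(self.store.seq + 1, self.state, self.pending)
        self.pending = []


def recover(store):
    # The backup resumes from the last committed checkpoint. Work done
    # after it, including queued IO, is simply never seen.
    return Primary(store)
```

If the primary dies between write() and checkpoint(), the backup sees
neither the state change nor the IO request and redoes both; if it dies
after checkpoint(), the backup sees both and redoes neither.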

This has covered the case of processor failure, and guaranteed
that the request has been issued once and only once.  As noted,
reissuing a disk write after an arbitrary amount of other activity
has happened could raise real havoc.

Left uncovered is the potential for the requestee of the IO request
to lose it, but that's a different question.

>I/O caching and TCPIP
>packet sequence numbers might conceal some of these problems but probably
>not all.

WRT disks, it seems essential to get it perfect.  Some comm might be different.

>Richard

Dan Hepner