Path: utzoo!attcan!uunet!cs.utexas.edu!tut.cis.ohio-state.edu!ucbvax!hplabs!hpda!hpcupt1!hpisod2!dhepner
From: dhepner@hpisod2.HP.COM (Dan Hepner)
Newsgroups: comp.arch
Subject: Re: Re: Fault Tolerance [LONG]
Message-ID: <13910014@hpisod2.HP.COM>
Date: 7 Feb 90 20:12:00 GMT
References: <1990Feb2.035201.21073@tandem.com>
Organization: Hewlett Packard, Cupertino
Lines: 55

From: rtrauben@cortex.Sun.COM (Richard Trauben)

[hopefully someone can answer the excellent "just how do you get it
 stopped" questions]

>Presumably no-one is interested in dumping the state of a failed PE-pairs
>write-back$; execution would resume from last process checkpoint.

Hmm. If what you mean by "PE-pairs" is what is generally called
Quad Modular Redundancy (QMR), with two lockstep processors
constituting a PE, and two of those constituting a logical processor,
there is no requirement for checkpointing; the instruction will be
successfully executed.

If what you mean however by a PE-pair is two lockstepped processors,
which upon detection of a miscomparison take themselves offline,
indeed a checkpoint needs to be done to preserve the state for some
backup processor. Part of the checkpointing mechanism is the necessity
to abandon all effects of processing done after the checkpoint by the
failed processor, which includes any write-back state.

>How about resuming from the checkpoint and unintentionally resending
>redundant mass store and datacom messages.

The checkpoint itself must be atomic, in that it must complete fully
or not at all. "Half-checkpoints" must be seen as effects of processing
done after the last successful checkpoint, and be abandoned.

The IO request atomicity can be addressed as part of the problem of
checkpoint atomicity. Once the atomic checkpoint mechanism is developed,
the initiation of IO requests can be incorporated, so that the
initiation of an IO request happens only at the time of a successful
checkpoint.
From the recovery processor's point of view, either the checkpoint/
IO request happened or it didn't, and that is discernible.

This has covered the case of processor failure, and guaranteed that
the request has been issued once and only once. As noted, reissuing
a disk write after an arbitrary amount of other activity has happened
could raise real havoc. Left uncovered is the potential for the
requestee of the IO request to lose it, but that's a different
question.

>I/O caching and TCPIP
>packet sequence numbers might conceal some of these problems but probably
>not all.

WRT disks, it seems essential to get it perfect. Some comm might be
different.

>Richard

Dan Hepner
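
To make the checkpoint/IO coupling described above concrete, here is a
minimal single-threaded sketch in Python. It is an illustration under
stated assumptions, not a description of any vendor's mechanism: the
names CheckpointStore, Processor, write_block and the fail_midway flag
are all hypothetical, the "backup memory" is just an in-process object,
and a single reference swap stands in for whatever hardware or protocol
makes the real checkpoint atomic.

```python
import copy


class CheckpointStore:
    """Hypothetical stand-in for the state held on behalf of a backup."""

    def __init__(self):
        # One reference holds (state, issued_io); swapping it is the commit.
        self._committed = ({}, [])

    def commit(self, state, pending_io, fail_midway=False):
        # Build the complete new checkpoint off to the side first.
        _, issued = self._committed
        candidate = (copy.deepcopy(state), issued + list(pending_io))
        if fail_midway:
            # Simulated failure: the half-built checkpoint is simply dropped.
            raise RuntimeError("processor failed during checkpoint")
        # State and IO requests become visible together, or not at all.
        self._committed = candidate

    def last(self):
        state, issued = self._committed
        return copy.deepcopy(state), list(issued)


class Processor:
    """Hypothetical processor that buffers IO until its next checkpoint."""

    def __init__(self, store):
        self.store = store
        self.state, _ = store.last()  # resume from last committed checkpoint
        self.pending_io = []          # requests not yet initiated

    def write_block(self, block, data):
        self.state[block] = data
        self.pending_io.append(("write", block, data))

    def checkpoint(self, fail_midway=False):
        self.store.commit(self.state, self.pending_io, fail_midway)
        self.pending_io = []


def demo():
    store = CheckpointStore()
    primary = Processor(store)
    primary.write_block(1, "a")
    primary.checkpoint()
    primary.write_block(2, "b")
    try:
        primary.checkpoint(fail_midway=True)  # primary dies mid-checkpoint
    except RuntimeError:
        pass
    backup = Processor(store)   # sees only block 1; block 2 was abandoned
    backup.write_block(2, "b")  # re-executes the lost work
    backup.checkpoint()
    return store.last()
```

In this sketch the backup can tell whether a request was issued simply
by looking at the committed checkpoint, so the write to block 2 is
initiated once even though it was attempted twice. It does not address
the request being lost by its recipient after initiation.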