Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!rutgers!usc!zaphod.mps.ohio-state.edu!tut.cis.ohio-state.edu!ucbvax!hplabs!hpda!hpcupt1!hpisod2!dhepner
From: dhepner@hpisod2.HP.COM (Dan Hepner)
Newsgroups: comp.arch
Subject: Re: Fault Tolerant Micros
Message-ID: <13910012@hpisod2.HP.COM>
Date: 6 Feb 90 18:10:53 GMT
References: <13910004@hpisod2.HP.COM>
Organization: Hewlett Packard, Cupertino
Lines: 52

From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>
>>Ideally FT would exist completely in the hardware, and present
>>a platform to the OS which looks like a non-FT machine.
>
>I'm not so sure. How would this catch the software bugs that Tandem's
>scheme does catch?

Tandem's checkpointing scheme catches SW bugs by deliberate SW design,
not because the scheme provides some inherent resistance to them. That
resistance has proven to be limited by its own bugs. What has not yet
been established is whether the schemes described for detecting SW bugs
are the best available or, if they are, that they can't be applied
equally well to a redundant CPU implementation.

We can assume we would prefer _all_ our applications to run reliably in
the face of HW failures.  In a checkpointing system, this means we must
incorporate checkpointing logic into each of them, whereas on a redundant
CPU machine they run unmodified.  If we need to "harden" certain SW,
making it resistant to SW bugs while already being impervious to
HW failures, we are free to do so on a redundant CPU machine.
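
To make "incorporate checkpointing logic" concrete, here is a minimal
sketch of the extra work an application carries under checkpointing.
Everything in it is hypothetical (Backup, run_with_checkpoints, take_over
and the fail_at parameter are my own illustrative names, not Tandem's
interfaces), and the "HW failure" is only simulated:

```python
# Hypothetical sketch: these names are illustrative, not any vendor's API.

class Backup:
    """Stands in for the backup process; holds the last checkpointed state."""

    def __init__(self):
        self.state = None

    def checkpoint(self, state):
        # Copy, so later changes in the primary cannot alter the saved state.
        self.state = dict(state)


def run_with_checkpoints(items, backup, fail_at=None):
    """Sum items on the 'primary', checkpointing after each one.

    fail_at simulates a HW failure of the primary just before it
    processes the item at that index. Returns the final state, or
    None if the primary died.
    """
    state = {"next": 0, "total": 0}
    backup.checkpoint(state)
    while state["next"] < len(items):
        if fail_at is not None and state["next"] == fail_at:
            return None  # primary is gone; the backup must take over
        state["total"] += items[state["next"]]
        state["next"] += 1
        backup.checkpoint(state)  # the extra logic the application carries
    return state


def take_over(items, backup):
    """Backup resumes from the last checkpoint rather than from scratch."""
    state = dict(backup.state)
    while state["next"] < len(items):
        state["total"] += items[state["next"]]
        state["next"] += 1
    return state
```

On a redundant CPU machine the same application would be just the
summing loop, with none of the checkpoint calls.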

>Lockstep redundancy is very simple to build, but it cannot catch
>Heisenbugs - order dependencies that aren't supposed to be in the
>software, but are anyway.

Lockstep redundancy, like the alternatives for that matter, is not designed
to catch SW bugs at all, Heisenbugs or otherwise.  Deliberate SW design to
achieve resistance is the only technique that catches SW bugs.
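
A toy illustration of the point, assuming a simple compare-the-outputs
model of lockstep (the function names and the fault_in_b parameter are
hypothetical, and real lockstep hardware compares at a much finer grain):
both CPUs execute the same buggy code, so they agree on the wrong answer,
and only a simulated HW fault produces a mismatch.

```python
def lockstep(program, inputs, fault_in_b=None):
    """Run the same program on two simulated CPUs and compare each result.

    fault_in_b is an input index at which CPU B's result is corrupted,
    simulating a HW fault. Returns the index of the first mismatch,
    or None if the two CPUs always agreed.
    """
    for i, x in enumerate(inputs):
        a = program(x)
        b = program(x)
        if fault_in_b == i:
            b ^= 1  # flip one bit to simulate a HW fault on CPU B
        if a != b:
            return i
    return None


def buggy_average(pair):
    # Deliberate SW bug: the divisor should be 2. The bug is
    # deterministic, so both CPUs compute the same wrong value.
    lo, hi = pair
    return (lo + hi) // 3
```

With no injected fault the comparison reports nothing, even though
every result is wrong; the comparator only sees disagreement.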

This is not to deny the conceptual difference between "single memory
space" and "multiple memory space" machines, although the line can 
be a bit blurry.  Intuitively, it sure seems more likely that the global 
state will be more easily irretrievably corrupted if there is only one 
memory space.  But is there any actual evidence that those techniques 
(they're all SW techniques) which minimize the potential of irretrievable 
corruption due to SW bugs apply equally well to both systems?

Ultimately we have to just trust the released SW.  If some programmer 
writes precisely the line of code which corrupts the entire system, and 
that line of code manages to get past whatever QA process is in
place, we have no defense.

>Looser redundancy schemes declare
>synchronization events at (say) a kilohertz. 
 [...]
>However, a loose redundancy scheme is essentially the same as a
>checkpoint scheme, except for latency.

Right.

Dan Hepner