Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!rutgers!usc!zaphod.mps.ohio-state.edu!tut.cis.ohio-state.edu!ucbvax!hplabs!hpda!hpcupt1!hpisod2!dhepner
From: dhepner@hpisod2.HP.COM (Dan Hepner)
Newsgroups: comp.arch
Subject: Re: Fault Tolerant Micros
Message-ID: <13910012@hpisod2.HP.COM>
Date: 6 Feb 90 18:10:53 GMT
References: <13910004@hpisod2.HP.COM>
Organization: Hewlett Packard, Cupertino
Lines: 52

From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>
>>Ideally FT would exist completely in the hardware, and present
>>a platform to the OS which looks like a non-FT machine.
>
>I'm not so sure. How would this catch the software bugs that Tandem's
>scheme does catch?

Tandem's checkpointing scheme catches SW bugs by deliberate SW design,
not because the scheme provides some inherent resistance to such.
This resistance has proven to be limited by its own bugs.

What is as yet unestablished is whether the schemes described for
detection of SW bugs are the best available, or if they are the best
available, that they can't be equally applied to a redundant CPU
implementation.

We can assume we would prefer _all_ our applications to run reliably
in the face of HW failures. In a checkpointing system, this means
we must incorporate checkpointing logic, as compared to running them
straight off on a redundant CPU machine. If we need to "harden"
certain SW, making it resistant to SW bugs while already being
impervious to HW failures, we are free to do so on a redundant CPU
machine.

>Lockstep redundancy is very simple to build, but it cannot catch
>Heisenbugs - order dependencies that aren't supposed to be in the
>software, but are anyway.

Lockstep redundancy, or the alternatives for that matter, are not
designed to catch SW bugs at all, Heisenbugs or whatever. Deliberate
SW design to achieve resistance is the only technique that catches
SW bugs.

This is not to deny the conceptual difference between "single memory
space" and "multiple memory space" machines, although the line can be
a bit blurry. Intuitively, it sure seems more likely that the global
state will be more easily irretrievably corrupted if there is only one
memory space. But is there any actual evidence that those techniques
(they're all SW techniques) which minimize the potential of
irretrievable corruption due to SW bugs apply equally well to both
systems?

Ultimately we have to just trust the released SW. If some programmer
writes precisely the line of code which corrupts the entire system,
and that line of code manages to get past whatever QA process that is
in place, we have no defense.

>Looser redundancy schemes declare
>synchronization events at (say) a kilohertz.
[...]
>However, a loose redundancy scheme is essentially the same as a
>checkpoint scheme, except for latency.

Right.

Dan Hepner