Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!zaphod.mps.ohio-state.edu!tut.cis.ohio-state.edu!pt.cs.cmu.edu!MATHOM.GANDALF.CS.CMU.EDU!lindsay
From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
Newsgroups: comp.arch
Subject: Re: Fault Tolerant Micros
Message-ID: <7843@pt.cs.cmu.edu>
Date: 5 Feb 90 16:32:12 GMT
References: <13910004@hpisod2.HP.COM> <13910010@hpisod2.HP.COM>
Organization: Carnegie-Mellon University, CS/RI
Lines: 34

In article <13910010@hpisod2.HP.COM> dhepner@hpisod2.HP.COM (Dan Hepner) writes:
>Ideally FT would exist completely in the hardware, and present
>a platform to the OS which looks like a non-FT machine.

I'm not so sure. How would this catch the software bugs that Tandem's
scheme does catch?

>Redundant processor machines always execute the instruction
>with more than one processor (3 for Tandem, 4 for Stratus),
>and compare results. Miscomparisons result in reliable detection
>of "crazy processors", including cache, logic, or whatever.

There is "lockstep" redundant execution, and then there are looser
forms. Lockstep redundancy is very simple to build, but it cannot
catch Heisenbugs - order dependencies that aren't supposed to be in
the software, but are anyway.

Looser redundancy schemes declare synchronization events at (say) a
kilohertz. This is nowhere near as clear cut, because processes may
not have been scheduled in the same order on all machines. Both
interrupts and traps become interesting topics in such a system. And
you can't expect all machines to reach the same synchronization event
at the exact same moment.

However, a loose redundancy scheme is essentially the same as a
checkpoint scheme, except for latency. A redundant process has been
out there fighting for cycles all along. Checkpoint systems recover
by running a shadow process forwards from the last checkpoint. So,
the Space Shuttle uses redundancy, because they don't want The Pause
That Refreshes to happen during reentry.
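
To make the lockstep-versus-loose distinction concrete, here is a rough
Python sketch. It is only an illustration under my own assumptions, not
how Tandem, Stratus, or the Shuttle actually do it: the names vote and
compare_at_sync_events are made up, and the "state digest" a replica
reports at each synchronization event is a hypothetical stand-in for
whatever a real system would compare.

```python
from collections import Counter


def vote(outputs):
    """Return (majority_value, dissenting_indices) for one comparison point.

    Raises ValueError when no value is held by a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replicas")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


def compare_at_sync_events(traces):
    """Compare replicas only at synchronization events.

    traces[r][k] is the (hypothetical) state digest that replica r
    reported at its k-th synchronization event. Replicas may arrive at
    event k at different wall-clock times; only the event index is used
    to line them up. Returns {event_index: dissenting_replica_indices}.
    """
    faults = {}
    for k, digests in enumerate(zip(*traces)):
        _, dissenters = vote(digests)
        if dissenters:
            faults[k] = dissenters
    return faults


# Lockstep style: the result of every step is compared.
print(vote([42, 42, 7]))  # (42, [2])

# Loose style: replica 1 diverges at synchronization event 2.
traces = [["a", "b", "c"], ["a", "b", "x"], ["a", "b", "c"]]
print(compare_at_sync_events(traces))  # {2: [1]}
```

The sketch dodges the hard part: it assumes that correct replicas
produce identical digests at each event, which is exactly what
scheduling order, interrupts, and traps make difficult in a loosely
synchronized system.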
--
Don		D.C.Lindsay 	Carnegie Mellon Computer Science