Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!zaphod.mps.ohio-state.edu!tut.cis.ohio-state.edu!pt.cs.cmu.edu!MATHOM.GANDALF.CS.CMU.EDU!lindsay
From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
Newsgroups: comp.arch
Subject: Re: Fault Tolerant Micros
Message-ID: <7843@pt.cs.cmu.edu>
Date: 5 Feb 90 16:32:12 GMT
References: <13910004@hpisod2.HP.COM> <13910010@hpisod2.HP.COM>
Organization: Carnegie-Mellon University, CS/RI
Lines: 34


In article <13910010@hpisod2.HP.COM> dhepner@hpisod2.HP.COM 
	(Dan Hepner) writes:
>Ideally FT would exist completely in the hardware, and present
>a platform to the OS which looks like a non-FT machine.

I'm not so sure. How would this catch the software bugs that Tandem's
scheme does catch?

>Redundant processor machines always execute the instruction
>with more than one processor (3 for Tandem, 4 for Stratus),
>and compare results.  Miscomparisons result in reliable detection
>of "crazy processors", including cache, logic, or whatever.

There is "lockstep" redundant execution, and then there are
looser forms.

Lockstep redundancy is very simple to build, but it cannot catch
Heisenbugs - order dependencies that aren't supposed to be in the
software, but are anyway. Looser redundancy schemes declare
synchronization events at (say) a kilohertz. This is nowhere near as
clear-cut, because processes may not have been scheduled in the same
order on all machines. Both interrupts and traps become interesting
topics in such a system. And you can't expect all machines to reach
the same synchronization event at the exact same moment.
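
To make the comparison step concrete, here is a minimal sketch of
voting at a synchronization event. It assumes each replica reports a
digest of its state once it reaches the same logical event (not the
same instant). The names vote and results are hypothetical, and this
is not a description of how Tandem or Stratus actually do it.

```python
from collections import Counter


def vote(results):
    """Return (majority_value, indices_of_dissenting_replicas).

    `results` holds one state digest per replica, each taken at the
    same logical synchronization event. Raises ValueError when no
    strict majority exists, since then nobody can be trusted.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among replicas")
    # Any replica that disagrees with the majority is a suspect.
    dissenters = [i for i, r in enumerate(results) if r != value]
    return value, dissenters
```

With three replicas, one bad digest is outvoted and identified. With
two, a mismatch can be detected but not attributed.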

However, a loose redundancy scheme is essentially the same as a
checkpoint scheme, except for latency.  A redundant process has been
out there fighting for cycles all along, so it is already current when
a failure hits.  Checkpoint systems recover by running a shadow process
forwards from the last checkpoint, and that replay takes time.  So,
the Space Shuttle uses redundancy, because they don't want The Pause
That Refreshes to happen during reentry.
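
A toy sketch of where that pause comes from, assuming the state is
just a running total and the shadow is handed every input the primary
consumes. The class Shadow and its methods are hypothetical names for
illustration only.

```python
class Shadow:
    """Hypothetical shadow process: holds the last checkpoint plus the
    inputs the primary has consumed since that checkpoint."""

    def __init__(self):
        self.total = 0     # state as of the last checkpoint
        self.pending = []  # inputs seen by the primary since then

    def checkpoint(self, total):
        # The primary ships its state; older inputs are now irrelevant.
        self.total = total
        self.pending.clear()

    def record(self, x):
        # The primary consumed input x after the last checkpoint.
        self.pending.append(x)

    def take_over(self):
        # Run forwards from the checkpoint. The work done here grows
        # with the number of pending inputs: this is the pause.
        replayed = len(self.pending)
        for x in self.pending:
            self.total += x
        self.pending.clear()
        return self.total, replayed
```

A redundant process has, in effect, zero pending inputs at takeover,
which is the latency difference.
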
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science