Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!mcgill-vision!bloom-beacon!snorkelwacker!tut.cis.ohio-state.edu!pt.cs.cmu.edu!MATHOM.GANDALF.CS.CMU.EDU!lindsay
From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
Newsgroups: comp.arch
Subject: Re: Fault Tolerant Micros
Message-ID: <7635@pt.cs.cmu.edu>
Date: 19 Jan 90 03:09:46 GMT
References: <13910004@hpisod2.HP.COM>
Organization: Carnegie-Mellon University, CS/RI
Lines: 60


In article <13910004@hpisod2.HP.COM> dhepner@hpisod2.HP.COM
	(Dan Hepner) writes:
>From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>>Live CPU recovery has become much less interesting since
>>multiprocessors came along. With the right software, a failed
>>processor does not imply a failed process. For example, Tandem
>>checkpoints each process regularly, so that a different processor can
>>do a prompt checkpoint-resumption.
>
>Tandem has apparently decided that this was not the correct model
>to implement fault tolerance, although I've searched for and not
>found yet an official statement on just how the S2 does do FT.

The most basic thing is to contain the damage, so it's usual to use
messages between machines that don't share memory. The next problem
is the data that was inside the failed thing. There are several ways
out:
	1- whatever it was doing is dead. (Rarely acceptable, but if
	   the customer is just buying a lottery ticket,
	   I suppose you can ask him to do it again.)
	2- another CPU recomputes the lost data from its copy of
	   the last checkpoint, and copies of all the messages
	   and signals sent to the dead machine. 
	3- another CPU has been computing in parallel, and contains
	   data that should be identical to the lost data.
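
A rough sketch of way 2, in Python, for anyone who wants it concrete.
It assumes the process is deterministic and changes state only when a
message arrives. The names (Account, Backup, run_and_recover) and the
message format are made up for illustration; this is not a description
of how Tandem or anyone else actually does it.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Account:
    """Toy deterministic process: state changes only when a message arrives."""
    balance: int = 0

    def handle(self, msg):
        kind, amount = msg
        if kind == "deposit":
            self.balance += amount
        elif kind == "withdraw":
            self.balance -= amount
        else:
            raise ValueError(f"unknown message kind: {kind!r}")


@dataclass
class Backup:
    """Held on a different CPU: the last checkpoint plus messages seen since."""
    checkpoint: Account = field(default_factory=Account)
    log: list = field(default_factory=list)

    def take_checkpoint(self, primary):
        # Copy the primary's state; older logged messages are now redundant.
        self.checkpoint = copy.deepcopy(primary)
        self.log.clear()

    def record(self, msg):
        # Keep a copy of every message sent to the primary.
        self.log.append(msg)

    def recover(self):
        # Recompute the lost state: start at the checkpoint, replay the log.
        survivor = copy.deepcopy(self.checkpoint)
        for msg in self.log:
            survivor.handle(msg)
        return survivor


def run_and_recover():
    primary, backup = Account(), Backup()
    for msg in [("deposit", 100), ("withdraw", 30)]:
        backup.record(msg)
        primary.handle(msg)
    backup.take_checkpoint(primary)
    for msg in [("deposit", 5), ("withdraw", 20)]:
        backup.record(msg)
        primary.handle(msg)
    lost = primary.balance  # what the primary held when it "died"
    return lost, backup.recover().balance
```

The replay only reproduces the lost state if the messages are logged
in the order the primary saw them, which is why the sketch records
each message before delivering it.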

Case 1 is an application-level choice, but still needs error
detection. Case 2 is done with software and with error detection.
Case 3 can be done like case 2, or it can be done with a master and
two checkers. As you pointed out, the three chips can hold a vote,
and decide that only two of the three contain valid data.  That can
be a lot of data: with on-chip caches, and PIDs, it can also be data
from several different processes.
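
The vote itself is simple enough to sketch. The following is a generic
two-out-of-three majority vote in Python, assuming each chip's result
can be compared as a plain value. The function name and return shape
are my own invention, and real hardware would do the comparison on the
pins, not in software.

```python
from collections import Counter


def vote(outputs):
    """Return (majority_value, indices_of_dissenters).

    Raises RuntimeError when no value is held by a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: cannot tell which units are valid")
    dissenters = [i for i, v in enumerate(outputs) if v != value]
    return value, dissenters
```

Called as vote([0xBEEF, 0xBEEF, 0xBEE7]), it picks 0xBEEF and flags
unit 2 as the odd one out. If all three disagree there is nothing to
trust, so the sketch raises instead of guessing.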

>>Well, nonstop machines are ruggedized and rated for e.g. sudden
>>overpressures (no kidding). This might influence a chip company to
>>change its chip packaging, but not its chip design.
>Maybe you could expand on this.

I'm probably out of touch with ruggedizing, but there are companies
that specialize in it. They like things that tolerate flexing. They
like sealed spaces (to keep out salt air and conductive dust and
tobacco tar). They like to coat things: a grease layer is used on
some automotive chips. They like hermetically sealed chips, and they
used to like ceramic over plastic. They used to worry about pin
corrosion, but I don't know what they think of TAB. The system specs
can involve higher ambient temperatures, explosion (overpressure),
shock, high voltage shorted onto the rack, ground currents on the
Ethernet shield, ambient RF, industrial-strength noise on the power
lines, locks on the rack door, and having to push a telex signal through
a one-henry coil.  I recall a system that failed from being too near
a gamma-ray source, but that wasn't in the spec.

(The gamma rays were just enough so that an EPROM would forget one
bit after about a month. We solved this with inverse-square-law
shielding. That is, we pushed the desk further down the hall.)
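
For the arithmetic behind moving the desk: radiation from a point
source falls off as one over the distance squared. A small Python
sketch, assuming an ideal point source and no absorption in between.
The function name and any distances you feed it are hypothetical; I
don't have the real numbers.

```python
def relative_dose(old_distance, new_distance):
    """Dose rate at new_distance as a fraction of the rate at old_distance.

    Assumes an ideal point source, so intensity scales as 1 / distance**2.
    Both distances must be positive and in the same units.
    """
    if old_distance <= 0 or new_distance <= 0:
        raise ValueError("distances must be positive")
    return (old_distance / new_distance) ** 2
```

Under those assumptions, tripling the distance cuts the dose rate to
one ninth.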

-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science