Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!hplabs!hpl-opus!hpcc05!hpyhde4!hpycla!hpcuhc!dhepner
From: dhepner@hpcuhc.cup.hp.com (Dan Hepner)
Newsgroups: comp.arch
Subject: Re: Fault-Tolerant Systems
Message-ID: <107340005@hpcuhc.cup.hp.com>
Date: 21 Jun 91 00:42:04 GMT
References: <1991Jun19.172757.20852@mips2.ma30.bull.com>
Organization: Hewlett Packard, Cupertino
Lines: 44

From: dowlati@mips2.ma30.bull.com (Saadat Dowlati)

>	- What are the symptoms of a failing CPU, i.e., fault types?

Impossible to predict, which means that any conceivable failure must 
be handled.  Early fault tolerant machines were able to react to total 
processor failure, but were vulnerable to, for example, a processor adding 
2+2 and getting 5, or a disk controller processor corrupting a bit in a 
sector on the way through.  Such machines don't meet modern expectations 
of a fault tolerant system.

>	- How soon a failing/failed CPU can be detected?

Failure must be detected before it affects anything else: before the 
state of external memory is changed, and before any IO request is issued.
It _can_ be detected sooner, either by internal checks or before 
modification of unshared cache memory, but need not be.  It is usually 
advantageous for performance not to check any more frequently (i.e., 
detect any sooner) than necessary. 
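
As a rough illustration of "check only at the boundary", here is a small
sketch in Python.  It is not any vendor's mechanism; the names
(run_checked, add, the "sum" field, the faulty parameter) are made up for
the example.  Two redundant copies do their work on private state, and a
single comparison happens just before the result would become externally
visible.

```python
def add(state, x):
    """One step of work that touches only the copy's private state."""
    state["sum"] += x


def run_checked(step, state, inputs, faulty=None):
    """Run `step` on two redundant copies; compare once, before commit.

    `faulty` is a hypothetical fault-injection knob: the index of the
    input after which copy B's private state is silently corrupted.
    """
    a = dict(state)  # private state of copy A
    b = dict(state)  # private state of copy B
    for i, x in enumerate(inputs):
        step(a, x)
        step(b, x)
        if faulty is not None and i == faulty:
            b["sum"] += 1  # the "2+2 and getting 5" kind of fault
    # No per-step checks: the only comparison is at the point where the
    # result would be written to external memory or trigger an IO request.
    if a != b:
        raise RuntimeError("mismatch detected before commit")
    return a  # safe to make externally visible
```

With no fault, run_checked(add, {"sum": 0}, [2, 2]) returns {"sum": 4}.
With faulty=1 the two copies disagree and the error is raised before
anything leaves the module.  Checking after every step would catch the
fault earlier, at the cost of a comparison per step.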
 
>	- What are the techniques used in detecting a failing/failed CPU?
>  	  (I know about processor-pair technique)

3 of the 4 major commercial FT architectures (Tandem Guardian, Stratus,
Sequoia, and Tandem S2) have processors check each other, albeit each in
a unique way.  The exception, Tandem's Guardian, uses a combination of
redundant components and parity checking.
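
For readers unfamiliar with parity checking, a minimal sketch follows.
It shows the general technique only, not how any Tandem hardware
implements it; the function names and the (word, parity) cell layout are
assumptions made for the example.

```python
def parity(word):
    """Even-parity bit: 1 if `word` has an odd number of 1 bits."""
    return bin(word).count("1") & 1


def store(word):
    """Keep the parity bit alongside the data word."""
    return (word, parity(word))


def load(cell):
    """Recompute parity on the way out; refuse to return corrupted data."""
    word, p = cell
    if parity(word) != p:
        raise ValueError("parity error")
    return word
```

A single flipped bit changes the parity and is caught by load().  Two
flipped bits in the same word cancel out and go unnoticed, which is one
reason parity is combined with other redundancy.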

>	- What are the techniques used to report a failed CPU to the OS?
>Saadat Dowlati		   Affiliation:	Bull HN Information Systems, Inc.

There are two reasons why the OS might care.  One is to support
diagnostic messages that prompt replacement; this is handled similarly
to "ordinary" degraded conditions.  The other reason, which is more fundamental
to the Guardian and Sequoia architectures, is to prevent future work from
being scheduled on the failed processor.  In this case, there is always the
option of the processor module ceasing to process any more instructions; this
is soon detected by the Sequoia OS, or by another Guardian.  Both react
by scheduling the in-progress work on another processor.  The Stratus and
S2 OSes need not solve this problem, as the processor module never fails
completely and continues, from the OS point of view, as if nothing had
happened.
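
The "goes silent, gets noticed, work gets moved" case can be sketched as
follows.  This is an assumed, simplified model and not the Guardian or
Sequoia mechanism: the timestamp table, the timeout, and the round-robin
placement are all invented for the example.

```python
def reschedule(last_heard, assignments, now, timeout):
    """Move work off processors that have been silent longer than `timeout`.

    last_heard:  processor name -> time it last showed signs of life
    assignments: task name -> processor name
    Returns a new task -> processor mapping.
    """
    alive = [p for p, t in last_heard.items() if now - t <= timeout]
    if not alive:
        raise RuntimeError("no surviving processor")
    result = {}
    moved = 0
    for task, proc in assignments.items():
        if proc in alive:
            result[task] = proc  # untouched: its processor is still running
        else:
            # Spread orphaned work over the survivors, round-robin.
            result[task] = alive[moved % len(alive)]
            moved += 1
    return result
```

For example, with last_heard = {"cpu0": 10.0, "cpu1": 3.0, "cpu2": 9.5},
now = 10.0 and timeout = 2.0, cpu1 is considered failed and its tasks are
handed to cpu0 and cpu2.  Nothing like this is needed in the Stratus and
S2 case described above, since the OS never sees the module fail.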

Dan Hepner