Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!hplabs!hpl-opus!hpcc05!hpyhde4!hpycla!hpcuhc!dhepner
From: dhepner@hpcuhc.cup.hp.com (Dan Hepner)
Newsgroups: comp.arch
Subject: Re: Fault-Tolerant Systems
Message-ID: <107340005@hpcuhc.cup.hp.com>
Date: 21 Jun 91 00:42:04 GMT
References: <1991Jun19.172757.20852@mips2.ma30.bull.com>
Organization: Hewlett Packard, Cupertino
Lines: 44

From: dowlati@mips2.ma30.bull.com (Saadat Dowlati)

> - What are the symptoms of a failing CPU, i.e., fault types?

Impossible to predict, which means that any conceivable failure must
be handled. Early fault-tolerant machines could react to total
processor failure, but were vulnerable to, for example, a processor
adding 2+2 and getting 5, or a disk controller processor corrupting
a bit in a sector on the way through. Such machines don't meet modern
expectations of a fault-tolerant system.

> - How soon a failing/failed CPU can be detected?

Failure must be detected before it affects anything else: before the
state of external memory is changed, and before any IO request is
issued. It _can_ be detected sooner, either by internal checks or
before modification of unshared cache memory, but need not be. For
performance reasons it is usually advantageous not to detect any
sooner (that is, not to check any more frequently) than necessary.

> - What are the techniques used in detecting a failing/failed CPU?
>   (I know about processor-pair technique)

3 of the 4 major commercial FT architectures (Tandem Guardian,
Stratus, Sequoia, and Tandem S2) use processors to check each other,
albeit each in a unique way. Tandem's Guardian uses a combination
of redundant components and parity checking.

> - What are the techniques used to report a failed CPU to the OS?

>Saadat Dowlati    Affiliation: Bull HN Information Systems, Inc.

There are two kinds of reasons why the OS might care. One is to
support diagnostic messages that prompt replacement; this is handled
similarly to "ordinary" degraded conditions.
The other reason, which is more fundamental to the Guardian and
Sequoia architectures, is to prevent future work from being scheduled
on the failed processor. In this case, there is always the option of
the processor module ceasing to process any more instructions; this is
soon detected by the Sequoia OS, or by another Guardian processor.
Both react by rescheduling the in-progress work on another processor.

The Stratus and S2 OSs need not solve this problem, as the processor
module never fails completely and continues, from the OS point of
view, as if nothing had happened.

Dan Hepner
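
For readers who want something concrete, here is a rough sketch of the
"stop processing, get noticed, get rescheduled" pattern described above
for Guardian and Sequoia. The post does not say how the silence is
detected; the sketch assumes a periodic heartbeat with a timeout. The
names Processor, Monitor, heartbeat, and check, and the least-loaded
placement policy, are all hypothetical and not taken from any of the
systems mentioned.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical model of one processor module and its queued work."""
    name: str
    last_heartbeat: float = 0.0
    alive: bool = True
    tasks: list = field(default_factory=list)


class Monitor:
    """Assumed detector: a processor that stays silent too long is failed."""

    def __init__(self, processors, timeout):
        self.processors = {p.name: p for p in processors}
        self.timeout = timeout

    def heartbeat(self, name, now):
        # Record that the named processor was still executing at time `now`.
        self.processors[name].last_heartbeat = now

    def check(self, now):
        """Declare silent processors failed and move their work elsewhere.

        Returns the names of processors newly declared failed.
        """
        failed = []
        for p in self.processors.values():
            if p.alive and now - p.last_heartbeat > self.timeout:
                p.alive = False  # no future work is scheduled here
                failed.append(p.name)
        for name in failed:
            self._reschedule(self.processors[name])
        return failed

    def _reschedule(self, dead):
        # Hand each in-progress task to the least-loaded surviving processor.
        survivors = [p for p in self.processors.values() if p.alive]
        if not survivors:
            raise RuntimeError("no surviving processor to take over work")
        while dead.tasks:
            target = min(survivors, key=lambda p: len(p.tasks))
            target.tasks.append(dead.tasks.pop(0))


if __name__ == "__main__":
    cpu0 = Processor("cpu0", tasks=["t1", "t2"])
    cpu1 = Processor("cpu1", tasks=["t3"])
    mon = Monitor([cpu0, cpu1], timeout=2.0)
    mon.heartbeat("cpu0", 1.0)
    mon.heartbeat("cpu1", 1.0)
    mon.heartbeat("cpu1", 3.0)  # cpu0 has gone silent by now
    print(mon.check(4.0), cpu1.tasks)
```

Note that this only covers the fail-stop case. It relies on the faulty
module actually going quiet, which is why the checking described
earlier has to catch the "2+2=5" kind of fault before anything leaves
the processor.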