Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!samsung!brutus.cs.uiuc.edu!apple!sun-barr!newstop!exodus!cortex.Sun.COM!rtrauben
From: rtrauben@cortex.Sun.COM (Richard Trauben)
Newsgroups: comp.arch
Subject: Re: Fault Tolerance
Message-ID: <35@exodus.Eng.Sun.COM>
Date: 6 Feb 90 21:19:10 GMT
References: <13910004@hpisod2.HP.COM> <13910009@hpisod2.HP.COM> <35300@mips.mips.COM> <1990Feb2.035201.21073@tandem.com> <7840@pt.cs.cmu.edu> <29059@amdcad.AMD.COM>
Sender: news@exodus.Eng.Sun.COM
Reply-To: rtrauben@cortex.EBay.Sun.COM (Richard Trauben)
Organization: Sun Microsystems, Inc. Mt. View, Ca.
Lines: 33

I am curious about the exact mechanism available to excise a bad
processor or bad processor pair once a bad processor element (PE) is
detected. This is especially important for non-TMR designs, say
PE-pairs, where the only thing reported is a difference, as a
kill-my-PE-pair request. Can anyone who has designed this explain the
typical FT kill-me mechanism?

There seem to be several possible kill-me schemes:

  1. reset-and-hold-me-down,
  2. tristate-me-and-never-let-me-go,
  3. relinquish-bus-ownership-and-stop-arbiter-from-ever-granting-me-again,
  4. interrupt-me-and-vector-to-branch-to-self.

Presumably no one is interested in dumping the state of a failed
PE-pair's write-back cache; execution would resume from the last
process checkpoint. But what about resuming from the checkpoint and
unintentionally resending redundant mass-storage and datacom messages?
I/O caching and TCP/IP packet sequence numbers might conceal some of
these problems, but probably not all.

Back to the voter/exciser: would the vote-tallying circuit itself be
duplicated (to stop an insane vote tallier from bringing down the
system)? Presumably redundant tally-clusters are required to stop
single-point failures and keep running.

In summary, can someone suggest a pointer into the FT literature beyond
Computer Structures: Principles and Examples by Bell et al.? This is a
fascinating area.

Thanks in advance,
Richard
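
To make two of the points above concrete, here is a rough software
model, not a description of any real machine. It sketches scheme 3 (a
comparator mismatch sets a sticky latch, and the arbiter never grants
that pair again) and the duplicate-suppression idea behind sequence
numbers when a restarted process replays its output. The names
PairArbiter and DedupReceiver, the round-robin policy, and the
set-based duplicate check are all my own assumptions for illustration.

```python
class PairArbiter:
    """Hypothetical round-robin bus arbiter with a sticky kill latch per PE-pair."""

    def __init__(self, n_pairs):
        self.n_pairs = n_pairs
        self.killed = [False] * n_pairs  # sticky latches: set once, never cleared here
        self.last = -1                   # index of the most recently granted pair

    def report(self, pair, out_a, out_b):
        """Compare the two lockstepped outputs of a pair.

        Any mismatch latches the kill bit. Returns True while the pair is
        still considered healthy, False once it has been killed.
        """
        if out_a != out_b:
            self.killed[pair] = True
        return not self.killed[pair]

    def grant(self, requests):
        """Grant the bus to the next requesting pair that is not killed.

        `requests` is a list of booleans, one per pair. Returns the granted
        pair index, or None if no live pair is requesting.
        """
        for step in range(1, self.n_pairs + 1):
            cand = (self.last + step) % self.n_pairs
            if requests[cand] and not self.killed[cand]:
                self.last = cand
                return cand
        return None


class DedupReceiver:
    """Hypothetical I/O endpoint that drops messages whose sequence number it has seen."""

    def __init__(self):
        self.seen = set()      # sequence numbers already accepted
        self.delivered = []    # payloads actually acted upon

    def accept(self, seq, payload):
        """Act on a message once; a replayed sequence number is ignored."""
        if seq in self.seen:
            return False
        self.seen.add(seq)
        self.delivered.append(payload)
        return True
```

In this model a pair that miscompares once stays excluded even if its
later outputs agree, which is the "never let me go" property the
hardware schemes are after. The receiver only hides replays that reuse
the original sequence numbers; a restarted process that renumbers its
messages would get through, which is one way the concealment could be
incomplete.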