Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!samsung!brutus.cs.uiuc.edu!apple!sun-barr!newstop!exodus!cortex.Sun.COM!rtrauben
From: rtrauben@cortex.Sun.COM (Richard Trauben)
Newsgroups: comp.arch
Subject: Re: Fault Tolerance
Message-ID: <35@exodus.Eng.Sun.COM>
Date: 6 Feb 90 21:19:10 GMT
References: <13910004@hpisod2.HP.COM> <13910009@hpisod2.HP.COM> <35300@mips.mips.COM> <1990Feb2.035201.21073@tandem.com> <7840@pt.cs.cmu.edu> <29059@amdcad.AMD.COM>
Sender: news@exodus.Eng.Sun.COM
Reply-To: rtrauben@cortex.EBay.Sun.COM (Richard Trauben)
Organization: Sun Microsystems, Inc.  Mt. View, Ca.
Lines: 33

I am curious about the exact mechanism used to excise a bad processor
or bad processor pair once a bad processing element (PE) is detected. This
is especially important for non-TMR designs, say PE pairs, where a mismatch
between the two halves is all that gets reported, as a kill-my-PE-pair signal.

Can anyone who has designed this explain the typical FT kill-me mechanism?

There seem to be several possible kill-me schemes:
1. reset-and-hold-me-down,
2. tristate-me-and-never-let-me-go,
3. relinquish-bus-ownership-and-stop-arbiter-from-ever-granting-me-again,
4. interrupt-me-and-vector-to-branch-to-self.
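
To make scheme 3 concrete, here is a minimal software sketch of what I have
in mind. It is a toy model, not a description of any real machine: the
class PairArbiter, its methods, and the pair names are all made up for
illustration, and I am assuming the comparator and the kill latch live in
the arbiter rather than on the PE board.

```python
class PairArbiter:
    """Toy model of scheme 3: a bus arbiter with a sticky per-pair kill latch.

    All names here are hypothetical; real hardware would do this in logic.
    """

    def __init__(self, pair_ids):
        # One kill latch per PE pair, all cleared at power-up.
        self.killed = {pid: False for pid in pair_ids}

    def compare(self, pair_id, out_a, out_b):
        """Lockstep comparator: any mismatch sets the pair's kill latch."""
        if out_a != out_b:
            self.killed[pair_id] = True  # sticky: nothing here ever clears it
        return not self.killed[pair_id]

    def grant(self, requests):
        """Grant the bus to the first requesting pair that is not killed."""
        for pid in requests:
            if not self.killed[pid]:
                return pid
        return None


arb = PairArbiter(["pair0", "pair1"])
ok_before = arb.compare("pair0", 0x1234, 0x1234)  # halves agree
ok_after = arb.compare("pair0", 0x1234, 0x1235)   # halves disagree
still_dead = arb.compare("pair0", 0x1, 0x1)       # agreement does not revive it
winner = arb.grant(["pair0", "pair1"])            # pair0 is never granted again
```

The point of the sticky latch is that the failed pair cannot talk its way
back onto the bus; only something outside the pair (an operator or a
service processor, say) would be allowed to clear it.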

Presumably no one is interested in dumping the state of a failed PE pair's
write-back cache; execution would resume from the last process checkpoint.

What about resuming from the checkpoint and unintentionally resending
redundant mass-storage and datacom messages? I/O caching and TCP/IP
packet sequence numbers might conceal some of these problems, but probably
not all.
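
As a rough illustration of how sequence numbers could hide the replay, here
is a sketch of a receiver that drops anything it has already applied. The
class DedupReceiver and the message format are hypothetical. It assumes
that the sender's sequence counter is part of the checkpointed state, that
messages from one sender arrive in order, and that re-execution is
deterministic, so a replayed sequence number carries the same payload.

```python
class DedupReceiver:
    """Hypothetical receiver that suppresses messages replayed after a restart."""

    def __init__(self):
        self.last_seq = {}  # sender id -> highest sequence number applied
        self.applied = []   # payloads actually acted upon, in order

    def receive(self, sender, seq, payload):
        """Apply the message unless its sequence number was already seen."""
        if seq <= self.last_seq.get(sender, -1):
            return False  # duplicate from a restarted sender: drop it
        self.last_seq[sender] = seq
        self.applied.append(payload)
        return True


rx = DedupReceiver()

# Normal operation: the PE pair sends messages 0, 1 and 2.
first_run = [rx.receive("pe0", n, f"write block {n}") for n in range(3)]

# The pair is excised and its process restarts from a checkpoint taken just
# after message 0, so messages 1 and 2 are re-sent before the new message 3.
replay = [rx.receive("pe0", n, f"write block {n}") for n in (1, 2, 3)]
```

This only covers receivers that keep such state. A dumb device that acts on
every command it sees would still perform the operation twice, which is the
part I suspect is not concealed.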

Back to the voter/exciser: would the vote-tallying circuit itself be
duplicated (to stop an insane vote tallier from bringing down the
system)? Presumably redundant tally clusters are required to avoid
single points of failure and keep running.
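
One arrangement I can imagine, sketched below, is to triplicate the tally
circuit and have whatever consumes the result take a majority of the three
tallies. This is only my guess at a plausible structure; the function names
and the 0xDEAD fault value are invented for the example.

```python
from collections import Counter


def majority(values):
    """Return the value held by a strict majority, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else None


def healthy_tally(pe_outputs):
    """A working tally circuit: plain majority over the PE outputs."""
    return majority(pe_outputs)


def insane_tally(pe_outputs):
    """A failed tally circuit that ignores its inputs (hypothetical fault)."""
    return 0xDEAD


pe_outputs = [7, 7, 9]  # one PE of three has gone bad

# Three independent copies of the tally circuit, one of them insane.
tallies = [t(pe_outputs) for t in (healthy_tally, insane_tally, healthy_tally)]

# The consumer votes again over the tallies, masking the insane one.
final = majority(tallies)
```

Of course this just moves the question to the final vote, so presumably
each consumer has to do that last step for itself rather than trusting one
shared circuit.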

In summary, can someone suggest a pointer into the FT literature beyond
Computer Structures: Principles and Examples by Bell et al.?
This is a fascinating area.

Thanks in advance,

Richard