Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!spool.mu.edu!munnari.oz.au!metro!dramba!janm
From: janm@dramba.neis.oz (Jan Mikkelsen)
Newsgroups: comp.arch
Subject: Re: Fault-Tolerant Systems
Message-ID: <1991Jun22.165349.27263@dramba.neis.oz>
Date: 22 Jun 91 16:53:49 GMT
References: <1991Jun19.172757.20852@mips2.ma30.bull.com>
Organization: Dramba Holdings, Lindfield, Australia
Lines: 67

In article <1991Jun19.172757.20852@mips2.ma30.bull.com> dowlati@mips2.ma30.bull.com (Saadat Dowlati) writes:
>
>I have been reading a lot of papers on fault-tolerant systems. One thing 
>they all have in common is the many worderful expectations that they have 
>from the underlying hardware: fail-stop processors, self-checking 
>components, non-partionable networks, etc. But none says how. So, I am 
>curious.

We just installed a Tandem S2 (MIPS based, fault-tolerant Unix),
and have had Tandem Guardian machines in our parent company
for some time.

>          I like to know, for example:
>
>	- What are the symptoms of a failing CPU, i.e., fault types?

In the S2, you have three CPUs, each executing the same code.  Whenever
they require access to "global" memory, they have a vote to confirm that
they are all attempting to do the same thing.

If one of the CPUs loses the vote, it is taken off-line, and processing
continues with two processors.  If these two disagree, then the system
is stopped.

So, I think the point here is that rather than looking for a specific type
of failure, this implementation goes with a majority decision.  It is of
course possible, but unlikely, that two will fail in the same way, and one
will succeed.
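
To make the voting rule concrete, here is a rough sketch in Python.  It is
only a toy model of the behaviour described above, not Tandem's
implementation: the class name, the CPU labels and the tuple used to
represent a "request" are all made up for illustration.  I have also assumed
that a three-way disagreement stops the system, which is my guess rather
than anything I know about the S2.

```python
from collections import Counter


class VotedProcessor:
    """Hypothetical model of one logical CPU built from redundant physical CPUs."""

    def __init__(self, cpu_ids=("A", "B", "C")):
        self.online = list(cpu_ids)
        self.stopped = False

    def vote(self, requests):
        """Compare the global access each on-line CPU wants to perform.

        requests maps a CPU id to a hashable description of the access.
        Returns the agreed request, or None if the system has stopped.
        """
        if self.stopped:
            return None
        ballots = {cpu: requests[cpu] for cpu in self.online}
        winner, count = Counter(ballots.values()).most_common(1)[0]
        if count == len(ballots):
            return winner  # everyone agrees
        if len(ballots) == 3 and count == 2:
            # One CPU lost the vote: take it off-line, continue with two.
            loser = next(c for c, r in ballots.items() if r != winner)
            self.online.remove(loser)
            return winner
        # Two CPUs disagree (or, by assumption, all three differ): stop.
        self.stopped = True
        return None
```

With three CPUs on-line a single dissenter is dropped and the access goes
ahead; once only two remain, any disagreement leaves no majority and the
model stops.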

>	- How soon a failing/failed CPU can be detected?

As soon as the instruction flow requires access to something outside of
a processor's local memory.  Memory in the S2 is organised into local and
global memory;  a vote is required whenever access is required to global
memory or when an I/O operation is attempted.  I am not sure of the size
of transfers between local and global memory.
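
The consequence for detection latency can be shown with a small simulation.
Again this is only a sketch under my own assumptions: the "program" is a
made-up list of (kind, value) steps, each simulated CPU just keeps a running
sum, and the fault is a single flipped bit.  The point is that the error
goes unnoticed through local steps and is caught at the next voted step.

```python
def run(program, faulty_cpu=None, fault_step=None):
    """Run program on three simulated CPUs.

    program is a list of (kind, value) pairs, where kind is "local" or
    "global" (I/O would count as "global").  Returns the index of the step
    at which a vote first fails, or None if no vote ever fails.
    """
    acc = [0, 0, 0]  # one accumulator per physical CPU
    for step, (kind, value) in enumerate(program):
        for cpu in range(3):
            acc[cpu] += value
            if cpu == faulty_cpu and step == fault_step:
                acc[cpu] ^= 1  # inject a single bit flip
        # Only global steps are compared; local steps run unchecked.
        if kind == "global" and len(set(acc)) > 1:
            return step
    return None


program = [("local", 1), ("local", 2), ("local", 3), ("global", 4), ("local", 5)]
```

Here run(program, faulty_cpu=1, fault_step=0) reports step 3, the first
global access after the fault, while a fault injected at step 4 is never
seen because no voted step follows it.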

>	- What are the techniques used in detecting a failing/failed CPU?
>  	  (I know about processor-pair technique)

See above.

>	- What are the techniques used to report a failed CPU to the OS?

I suspect that this would vary considerably between a machine with multiple
logical processors and a machine with a single logical processor.  Each
logical processor in the Tandem S2 architecture consists of three physical
CPUs.

In a machine with one logical CPU, the failure of a physical CPU should
probably not affect the OS.  The failure of a logical CPU obviously has
a more severe impact on a machine like this.

In a machine with multiple logical CPUs, the failure of a logical CPU
should involve the OS, which should start rescheduling jobs to working
CPUs.  How do machines like the Sequent or the Stratus i860 based machines
handle this?
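
For what it is worth, the kind of rescheduling I have in mind looks roughly
like the sketch below.  This is not how Sequent, Stratus or Tandem actually
do it (I don't know); the run-queue dictionary and the least-loaded
placement rule are purely my own assumptions for illustration.

```python
def reschedule(run_queues, failed_cpu):
    """Move the jobs of a failed logical CPU onto the remaining CPUs.

    run_queues maps a logical CPU id to its list of jobs and is modified
    in place.  Each orphaned job goes to whichever working CPU currently
    has the shortest queue.
    """
    orphans = run_queues.pop(failed_cpu)
    if not run_queues:
        raise RuntimeError("no working logical CPUs left")
    for job in orphans:
        target = min(run_queues, key=lambda cpu: len(run_queues[cpu]))
        run_queues[target].append(job)
    return run_queues
```

If the last logical CPU fails there is nowhere to move the work, which is
the single-logical-CPU case described above.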

>
>I also have similar questions about Buses, Disks, and the Memory subsystem. 

I think in essence, the Tandem philosophy is to have two or more of everything.

-- 
Jan Mikkelsen
janm@dramba.neis.oz.AU or janm%dramba.neis.oz@metro.ucc.su.oz.au
"She really is."