Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!spool.mu.edu!munnari.oz.au!metro!dramba!janm
From: janm@dramba.neis.oz (Jan Mikkelsen)
Newsgroups: comp.arch
Subject: Re: Fault-Tolerant Systems
Message-ID: <1991Jun22.165349.27263@dramba.neis.oz>
Date: 22 Jun 91 16:53:49 GMT
References: <1991Jun19.172757.20852@mips2.ma30.bull.com>
Organization: Dramba Holdings, Lindfield, Australia
Lines: 67

In article <1991Jun19.172757.20852@mips2.ma30.bull.com> dowlati@mips2.ma30.bull.com (Saadat Dowlati) writes:
>
>I have been reading a lot of papers on fault-tolerant systems. One thing
>they all have in common is the many wonderful expectations that they have
>from the underlying hardware: fail-stop processors, self-checking
>components, non-partitionable networks, etc. But none says how. So, I am
>curious.

We just installed a Tandem S2 (MIPS-based, fault-tolerant Unix), and
have had Tandem Guardian machines in our parent company for some time.

> I like to know, for example:
>
> - What are the symptoms of a failing CPU, i.e., fault types?

In the S2, you have three CPUs, each executing the same code. Whenever
they require access to "global" memory, they hold a vote to confirm that
they are all attempting to do the same thing. If one of the CPUs loses
the vote, it is taken off-line, and processing continues with two
processors. If these two disagree, then the system is stopped.

So, I think the point here is that rather than looking for a specific
type of failure, this implementation goes with a majority decision. It
is of course possible, but unlikely, that two will fail in the same way
and outvote the one that succeeded.

> - How soon a failing/failed CPU can be detected?

As soon as the instruction flow requires access to something outside of
a processor's local memory. Memory in the S2 is organised into local and
global memory; a vote is required whenever global memory is accessed or
an I/O operation is attempted.
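
The voting rule described above can be sketched in a few lines of
Python. This is an illustration only, not Tandem's implementation: the
names Voter, vote and SystemHalted are made up for the example, each
requested operation is modelled as a plain tuple, and halting on a
three-way disagreement is an assumption (the description only covers
the case where one CPU is outvoted).

```python
from collections import Counter


class SystemHalted(Exception):
    """Raised when the on-line CPUs disagree and no majority exists."""


class Voter:
    """Hypothetical voter sitting between the CPUs and global memory/I/O."""

    def __init__(self, cpu_ids=("A", "B", "C")):
        self.online = list(cpu_ids)

    def vote(self, requests):
        """Return the agreed operation.

        `requests` maps a CPU id to the operation that CPU wants to
        perform (any hashable value, e.g. a tuple).
        """
        # Only CPUs that are still on-line get a ballot.
        ballots = {cpu: requests[cpu] for cpu in self.online}
        winner, votes = Counter(ballots.values()).most_common(1)[0]

        if votes == len(ballots):
            return winner  # unanimous: nothing to do

        if votes * 2 <= len(ballots):
            # No strict majority: two survivors disagree, or (assumed)
            # all three CPUs want different things.
            raise SystemHalted("no majority among " + ", ".join(self.online))

        # A strict majority exists: take the outvoted CPU off-line.
        self.online = [cpu for cpu in self.online if ballots[cpu] == winner]
        return winner
```

For example, if A and C both request ("write", 0x1000, 42) and B
requests something else, vote() returns the majority operation and B is
dropped from the on-line list. A later disagreement between A and C
raises SystemHalted.
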
I am not sure of the size of transfers between local and global memory.

> - What are the techniques used in detecting a failing/failed CPU?
>   (I know about processor-pair technique)

See above.

> - What are the techniques used to report a failed CPU to the OS?

I suspect that this would vary considerably between a machine with
multiple logical processors and a machine with a single logical
processor. Each logical processor in the Tandem S2 architecture consists
of three physical CPUs.

In a machine with one logical CPU, any failure of a physical CPU should
probably not affect the OS. The failure of a logical CPU obviously has a
more severe impact on a machine like this.

In a machine with multiple logical CPUs, the failure of a logical CPU
should involve the OS, which should start rescheduling jobs to working
CPUs. How do machines like the Sequent or the Stratus i860-based
machines handle this?

>
>I also have similar questions about Buses, Disks, and the Memory subsystem.

I think in essence, the Tandem philosophy is to have two or more of
everything.

-- 
Jan Mikkelsen
janm@dramba.neis.oz.AU or janm%dramba.neis.oz@metro.ucc.su.oz.au
"She really is."