Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!usc!elroy.jpl.nasa.gov!swrinde!zaphod.mps.ohio-state.edu!think.com!cass.ma02.bull.com!mips2!mips2.ma30.bull.com!dowlati
From: dowlati@mips2.ma30.bull.com (Saadat Dowlati)
Newsgroups: comp.arch
Subject: Fault-Tolerant Systems
Message-ID: <1991Jun19.172757.20852@mips2.ma30.bull.com>
Date: 19 Jun 91 17:27:57 GMT
Sender: dowlati@mips2.ma30.bull.com (Saadat Dowlati)
Organization: Bull HN Information Systems Inc.
Lines: 22


I have been reading a lot of papers on fault-tolerant systems. One thing 
they all have in common is the many wonderful expectations they place 
on the underlying hardware: fail-stop processors, self-checking 
components, non-partitionable networks, etc. But none of them says how 
the hardware actually provides these properties. So I am curious. I would 
like to know, for example:

	- What are the symptoms of a failing CPU, i.e., what fault types occur?
	- How soon can a failing/failed CPU be detected?
	- What techniques are used to detect a failing/failed CPU?
	  (I know about the processor-pair technique.)
	- What techniques are used to report a failed CPU to the OS?

I also have similar questions about buses, disks, and the memory subsystem. 
I would especially like to hear from those who have actual experience.

Regards,
-- 
Saadat Dowlati		   Affiliation:	Bull HN Information Systems, Inc.
Voice:	(508) 294-3426			300 Concord Road, MA30-826A
Fax:	(508) 294-3807			Billerica, Massachusetts 01821-4186
E-mail:	S.Dowlati@ma30.bull.com       	U.S.A.