Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!usc!elroy.jpl.nasa.gov!swrinde!zaphod.mps.ohio-state.edu!think.com!cass.ma02.bull.com!mips2!mips2.ma30.bull.com!dowlati
From: dowlati@mips2.ma30.bull.com (Saadat Dowlati)
Newsgroups: comp.arch
Subject: Fault-Tolerant Systems
Message-ID: <1991Jun19.172757.20852@mips2.ma30.bull.com>
Date: 19 Jun 91 17:27:57 GMT
Sender: dowlati@mips2.ma30.bull.com (Saadat Dowlati)
Organization: Bull HN Information Systems Inc.
Lines: 22

I have been reading a lot of papers on fault-tolerant systems. One thing
they all have in common is the many wonderful things they expect from
the underlying hardware: fail-stop processors, self-checking components,
non-partitionable networks, etc. But none of them says how these are
achieved. So I am curious. I would like to know, for example:

- What are the symptoms of a failing CPU, i.e., the fault types?
- How soon can a failing/failed CPU be detected?
- What techniques are used to detect a failing/failed CPU?
  (I know about the processor-pair technique.)
- What techniques are used to report a failed CPU to the OS?

I also have similar questions about buses, disks, and the memory
subsystem. I would especially like to hear from those who have actual
experience.

Regards,
--
Saadat Dowlati                     Affiliation: Bull HN Information Systems, Inc.
Voice:  (508) 294-3426             300 Concord Road, MA30-826A
Fax:    (508) 294-3807             Billerica, Massachusetts 01821-4186
E-mail: S.Dowlati@ma30.bull.com    U.S.A.