Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!rutgers!uwm.edu!rpi!zaphod.mps.ohio-state.edu!tut.cis.ohio-state.edu!ucbvax!hplabs!hpda!hpcupt1!hpisod2!dhepner
From: dhepner@hpisod2.HP.COM (Dan Hepner)
Newsgroups: comp.arch
Subject: Re: Re: Fault Tolerance [LONG]
Message-ID: <13910011@hpisod2.HP.COM>
Date: 6 Feb 90 17:14:00 GMT
References: <1990Feb2.035201.21073@tandem.com>
Organization: Hewlett Packard, Cupertino
Lines: 69

From: jimbo@tandem.com (Jim Lyon)
>In article <35300@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>> a) Various people do fault-tolerance various ways. How about
>
>Now that the discussion is back to technology, I'll be happy to put
>in my two-cents worth.

We'll thank John Mashey for his contribution.

>fault in a component causing one of the following two behaviors:
>a) It stops dead. or
>b) It goes insane.
>The latter case is VERY difficult to deal with. People put it very
>much work to try to translate it into the first case. TMR schemes
>try to shoot the insane processor before it manages to poison the
>outside world.

Are you suggesting that the techniques available to a TMR/QMR designer
are not totally effective in isolating an insane processor? Could you
offer an example of a type of insanity which can spread through a
voting/comparison barrier?

[a lot of good stuff on software faults deleted]

>In summary, checkpointing not only allows you to survive most of your
>hardware failures, but also most of your operating system bugs, most
>of your database manager bugs, most of your communication protocol bugs,
>most of your transaction manager bugs, and even many of your application
>bugs.

and later:

>If you want the highest degree of fault tolerance possible, design it
>from the start to use checkpointing [If you come work for Tandem for
>a few years, you'll learn how].

Could you clarify the claim here?
It sure seems you are suggesting that the checkpointing system,
including HW and SW, is inherently more reliable than would be a TMR
system which had had an equivalent amount of effort devoted to
reliability enhancement of its SW.

But most of the excellent recommendations made WRT the SW are equally
applicable to both products, or even non-FT products. What is unique
to checkpointing is the notion that each SW layer has available to it
a backup process(es), and that the hardware checkpointing mechanism can
be used as a tool for abandoning work which led to some failure,
succeeding in avoiding the Heisenbug panic case.

As long as we're willing to pay substantially increased SW development
costs, we might consider what else we might get for our money. There
are other tools which can be used to attain high reliability, and the
basic save state/fall back on failure mechanism can be used in the
absence of even a backup process, let alone a process in a different
memory space.

Is there really something offered by checkpointing to another memory
space which makes such SW inherently more reliable? And from there, is
there really something offered by completely checkpointing HW/SW
systems which is not achievable on TMR/QMR systems?

>I hope this has been informative and hasn't sounded too much like a
>Tandem commercial. If not, well, I'll put on my asbestos suit now.
>
>-- Jim Lyon
>-- Tandem Computers
>-- jimbo@tandem.com

Thanks a lot.
Dan Hepner
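
For readers who want the save state/fall back on failure idea made
concrete, here is a minimal sketch of it inside a single process, with
no backup process and no second memory space. Everything in it is an
assumption made for illustration: the names TransientFault,
run_with_fallback and make_flaky_step are hypothetical, the "state" is
just a Python dict, and the fault is simulated. It is not a description
of how Tandem or any TMR/QMR product does it.

```python
import copy


class TransientFault(Exception):
    """Hypothetical stand-in for a failure that may not recur on retry."""


def run_with_fallback(state, step, max_attempts=3):
    """Run step(state) in place; on failure restore the saved state and retry.

    Returns the number of attempts used. Raises RuntimeError if every
    attempt fails, leaving state as it was before the first attempt.
    """
    for attempt in range(1, max_attempts + 1):
        saved = copy.deepcopy(state)      # save state before the risky work
        try:
            step(state)
            return attempt
        except TransientFault:
            state.clear()                 # fall back: abandon partial work
            state.update(saved)
    raise RuntimeError("step failed on every attempt")


def make_flaky_step(failures):
    """Build a step that fails `failures` times after a partial update."""
    remaining = [failures]

    def step(state):
        state["balance"] -= 10            # partial update happens first
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFault("simulated transient failure")
        state["log"].append("debit 10")

    return step


account = {"balance": 100, "log": []}
attempts = run_with_fallback(account, make_flaky_step(1))
print(attempts, account)
```

The first attempt debits the balance and then fails; the fall back
discards that half-done debit, so the retry starts from the saved state
rather than from a corrupted one. What this sketch cannot do is survive
the loss of the process or its memory, which is the part a backup
process in another memory space addresses.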