Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!rutgers!uwm.edu!rpi!zaphod.mps.ohio-state.edu!tut.cis.ohio-state.edu!ucbvax!hplabs!hpda!hpcupt1!hpisod2!dhepner
From: dhepner@hpisod2.HP.COM (Dan Hepner)
Newsgroups: comp.arch
Subject: Re: Re: Fault Tolerance [LONG]
Message-ID: <13910011@hpisod2.HP.COM>
Date: 6 Feb 90 17:14:00 GMT
References: <1990Feb2.035201.21073@tandem.com>
Organization: Hewlett Packard, Cupertino
Lines: 69

From: jimbo@tandem.com (Jim Lyon)

>In article <35300@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>        a) Various people do fault-tolerance various ways.  How about
>
>Now that the discussion is back to technology, I'll be happy to put
>in my two-cents worth. 

We'll thank John Mashey for his contribution.

>fault in a component causing one of the following two behaviors:
>a)  It stops dead.  or
>b)  It goes insane.
>The latter case is VERY difficult to deal with.  People put it very
>much work to try to translate it into the first case.  TMR schemes
>try to shoot the insane processor before it manages to poison the
>outside world.

Are you suggesting that the techniques available to a TMR/QMR
designer are not totally effective in isolating an insane processor?
Could you offer an example of a type of insanity which can
spread through a voting/comparison barrier?
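
For concreteness, the kind of barrier I have in mind can be sketched as
a majority vote over replica outputs. This is only an illustration, not
any vendor's design: the name `vote` and the list-of-outputs interface
are made up for the example, and it assumes outputs can be compared for
exact equality. Note that it only sees faults which show up as a
disagreement at the compared outputs.

```python
from collections import Counter

def vote(outputs):
    """Return (majority_value, dissenting_indices) for a list of replica outputs.

    Hypothetical helper: raises ValueError when no strict majority exists,
    which is the case a real voter would have to treat as a system failure.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replicas")
    # Replicas that disagree with the majority are the ones to isolate.
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters
```

With three replicas, `vote([42, 42, 7])` yields the agreed value and
marks replica 2 as the one to shoot.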

[a lot of good stuff on software faults deleted]

>In summary, checkpointing not only allows you to survive most of your
>hardware failures, but also most of your operating system bugs, most
>of your database manager bugs, most of your communication protocol bugs,
>most of your transaction manager bugs, and even many of your application
>bugs.

and later:

>If you want the highest degree of fault tolerance possible, design it
>from the start to use checkpointing [If you come work for Tandem for
>a few years, you'll learn how].

Could you clarify the claim here?  It sure seems you are suggesting
that the checkpointing system, including HW and SW, is inherently
more reliable than would be a TMR system which had had an equivalent
amount of effort devoted to reliability enhancement of its SW. But
most of the excellent recommendations made WRT the SW are equally
applicable to both products, or even non-FT products.

What is unique to checkpointing is the notion that each SW layer has
available to it a backup process(es), and that the hardware checkpointing
mechanism can be used as a tool for abandoning work which led to
some failure, succeeding in avoiding the Heisenbug panic case.

As long as we're willing to pay substantially increased SW development
costs, we might consider what else we might get for our money.
There are other tools which can be used to attain high reliability,
and the basic save state / fall back on failure mechanism can be used
in the absence of even a backup process, let alone a process in a different 
memory space. Is there really something offered by checkpointing to another 
memory space which makes such SW inherently more reliable?  And from
there, is there really something offered by completely checkpointing
HW/SW systems which is not achievable on TMR/QMR systems?
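
To make the "no backup process" case concrete, here is a minimal sketch
of save state / fall back on failure inside a single process. The
function name `run_with_fallback`, the dict-shaped state, and the
retry count are all assumptions for illustration; nothing here is
Tandem's mechanism or anyone else's.

```python
import copy

def run_with_fallback(state, step, retries=1):
    """Apply step(state) in place; on failure, restore the saved state and retry.

    Hypothetical sketch: `state` is a dict, `step` mutates it and may raise.
    After the last failed attempt the state is restored and the error re-raised.
    """
    for attempt in range(retries + 1):
        saved = copy.deepcopy(state)      # the "checkpoint", kept in the same memory space
        try:
            step(state)
            return state
        except Exception:
            state.clear()                 # abandon the work which led to the failure
            state.update(saved)
            if attempt == retries:
                raise
```

A transient (Heisenbug-style) failure is survived by the retry, with no
second process involved. What this cannot survive is a fault which
corrupts or takes down the memory holding `saved`, which is presumably
where a checkpoint to another memory space would have to earn its keep.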

>I hope this has been informative and hasn't sounded too much like a
>Tandem commercial.  If not, well, I'll put on my asbestos suit now.
>
>-- Jim Lyon
>-- Tandem Computers
>-- jimbo@tandem.com

Thanks a lot.

Dan Hepner