Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!rutgers!usc!zaphod.mps.ohio-state.edu!tut.cis.ohio-state.edu!ucbvax!hplabs!hpda!hpcupt1!hpisod2!dhepner
From: dhepner@hpisod2.HP.COM (Dan Hepner)
Newsgroups: comp.arch
Subject: Re: Fault Tolerant Micros
Message-ID: <13910013@hpisod2.HP.COM>
Date: 6 Feb 90 19:42:17 GMT
References: <13910004@hpisod2.HP.COM>
Organization: Hewlett Packard, Cupertino
Lines: 54

From: donnh@ziggy.SanDiego.NCR.COM (Donn Holtzman)

>If
>the bug causes your kernel to hang, TMR or pair-and-spare approaches
>won't succeed.

This is an interesting case. 

Kernel "hangs" are actually double software bugs: the original bug,
plus the lack of an assertion that would have failed and detected it.

TMR/QMR schemes have a last defense against such hangs: a "fast
reboot" initiated by a deadman switch, which effectively converts
the hang into a "panic".
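
As a rough illustration of the deadman idea, here is a minimal sketch.
Everything in it is an assumption made for the example: the names
DeadmanSwitch, kick, fast_reboot and run_demo are invented, the
timeouts are arbitrary, and a real machine would do this in hardware
or firmware rather than with Python threads.

```python
import threading
import time


class DeadmanSwitch:
    """Calls on_expire if kick() is not called again within `timeout` seconds."""

    def __init__(self, timeout, on_expire):
        self.timeout = timeout
        self.on_expire = on_expire
        self._timer = None
        self._lock = threading.Lock()

    def kick(self):
        # A healthy kernel calls this periodically to restart the countdown.
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout, self.on_expire)
            self._timer.daemon = True
            self._timer.start()

    def stop(self):
        # Disarm the switch (e.g. on orderly shutdown).
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None


def run_demo(ticks_before_hang=3, timeout=0.2):
    """Simulate a kernel that kicks the switch a few times, then hangs."""
    events = []
    expired = threading.Event()

    def fast_reboot():
        # Hypothetical stand-in for the real reset path: the silent hang
        # is now a visible, handled event, i.e. a "panic".
        events.append("panic: deadman expired, fast reboot")
        expired.set()

    switch = DeadmanSwitch(timeout, fast_reboot)
    for i in range(ticks_before_hang):
        switch.kick()
        events.append("tick %d" % i)
        time.sleep(timeout / 10)

    # Simulated hang: nothing kicks the switch any more, so it expires.
    expired.wait(timeout * 20)
    switch.stop()
    return events
```

Calling run_demo() returns the tick events followed by the panic
entry. The point is only that the hang needs no cooperation from the
hung code to be noticed; the switch fires because the kicks stop.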

This brings up the comparison of how the two approaches react to the
more general case, the panic.

In order to make the case that loose redundancy is superior to
lock-step in its response time to panics, one must assert that
the backup loose processor will notice the failure of the primary,
and complete its takeover in less time than the primary could have
reset itself and achieved a similar state.  How does this comparison
turn out in real life?
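
To make the comparison concrete, the claim can be written as a simple
inequality: detection time plus takeover time for the backup must be
less than detection time plus reset time for the primary. A minimal
sketch follows; the function names are invented for this example, and
the numbers in the usage note are placeholders, not measurements of
any real system.

```python
def outage_seconds(detect, recover):
    """Total outage: time to notice the failure plus time to regain service."""
    return detect + recover


def loose_backup_wins(backup_detect, backup_takeover,
                      primary_detect, primary_reset):
    """True if the loosely coupled backup restores service sooner than
    the primary could have reset itself and reached a similar state."""
    backup_outage = outage_seconds(backup_detect, backup_takeover)
    primary_outage = outage_seconds(primary_detect, primary_reset)
    return backup_outage < primary_outage
```

For example, loose_backup_wins(0.5, 2.0, 1.0, 5.0) is True with these
made-up figures, while loose_backup_wins(2.0, 10.0, 1.0, 5.0) is
False. Which case is typical is exactly the open question.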

Whether the state of the machine is sound enough to return to seems
independent of the basic question. One can do a fast reboot that
leaves machine state mostly intact, and perhaps suffer a repeat of
the failure. Alternatively, one can bet on a checkpointed machine and
suffer the same repeat. Is there some fundamental difference that
makes the takeover from the checkpointed machine faster?

>Performance is certainly an issue but one can trade
>check pointing overhead for recovery speed (at least in the OLTP
>arena).

Maybe you could elaborate here.

>>an OS port to such a 
>>machine will always be more difficult than on a non-FT platform.
>>
>This is a good point. I would be surprised if Tandem didn't have to
>make kernel changes to make their machine work.

There is a real dividing line: can you port the next kernel, or
do you have to retrofit the new functionality into your existing
kernel, which is 80% proprietary code?

>Interesting stuff.

Yes!

>Donn Holtzman

Dan Hepner