Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!rutgers!usc!zaphod.mps.ohio-state.edu!tut.cis.ohio-state.edu!ucbvax!hplabs!hpda!hpcupt1!hpisod2!dhepner
From: dhepner@hpisod2.HP.COM (Dan Hepner)
Newsgroups: comp.arch
Subject: Re: Fault Tolerant Micros
Message-ID: <13910013@hpisod2.HP.COM>
Date: 6 Feb 90 19:42:17 GMT
References: <13910004@hpisod2.HP.COM>
Organization: Hewlett Packard, Cupertino
Lines: 54

From: donnh@ziggy.SanDiego.NCR.COM (Donn Holtzman)

>If the bug causes your kernel to hang, TMR or pair-and-spare approaches
>won't succeed.

This is an interesting case. Kernel "hangs" are actually double SW
bugs: a combination of some original bug and a lack of detection by
failed assertion. There is a last defense TMR/QMR schemes have against
such: the deadman-switch-initiated "fast reboot", which effectively
converts the hang into a "panic".

Which brings up the comparison of reaction to the more generic panic.
In order to make the case that loose redundancy is superior to
lock-step in its response time to panics, one must assert that the
backup loose processor will notice the failure of the primary, and
complete its takeover, in less time than the primary could have reset
itself and achieved a similar state. How does this comparison turn out
in real life?

The question of whether the state of the machine is sound enough to
return to seems independent of the basic question; one can do a fast
reboot and leave machine state mostly intact, maybe suffering a repeat.
Alternatively, one can bet on a checkpointed machine and suffer the
same repeat. Is there some fundamental difference that makes the
takeover from the checkpointed machine faster?

>Performance is certainly an issue but one can trade
>check pointing overhead for recovery speed (at least in the OLTP
>arena).

Maybe you could elaborate here.

>>an OS port to such a
>>machine will always be more difficult than on a non-FT platform.
>>
>This is a good point.
>I would be surprised if Tandem didn't have to
>make kernel changes to make their machine work.

There is a real dividing line: can you port the next kernel, or do you
have to retrofit the new functionality into your existing, 80%
proprietary code kernel?

>Interesting stuff.

Yes!

>Donn Holtzman

Dan Hepner
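
The response-time comparison posed above can be written down as a small
sketch. Everything in it is an assumption made for illustration: the
class names, the heartbeat-based failure detection, and the numbers are
hypothetical and not taken from any real system. It only states the
inequality: loose redundancy wins if detection time plus takeover time
is less than deadman timeout plus reboot time.

```python
from dataclasses import dataclass


@dataclass
class LooseBackup:
    """Hypothetical loosely coupled backup that watches primary heartbeats."""
    heartbeat_interval_s: float   # assumed: primary sends a heartbeat this often
    missed_beats_to_fail: int     # assumed: beats missed before declaring failure
    takeover_s: float             # time to resume from the last checkpoint

    def response_time_s(self) -> float:
        # Time to notice the failure, then time to complete the takeover.
        detection_s = self.heartbeat_interval_s * self.missed_beats_to_fail
        return detection_s + self.takeover_s


@dataclass
class FastReboot:
    """Hypothetical lock-step (TMR/QMR) machine that resets itself."""
    deadman_timeout_s: float      # 0.0 for a panic caught by a failed assertion
    reboot_s: float               # time to reset and reach a similar state

    def response_time_s(self) -> float:
        # A hang waits for the deadman switch; a panic starts the reboot at once.
        return self.deadman_timeout_s + self.reboot_s


def loose_is_faster(backup: LooseBackup, reboot: FastReboot) -> bool:
    """True if the backup finishes takeover before the primary could reboot."""
    return backup.response_time_s() < reboot.response_time_s()
```

With made-up numbers, a backup that needs 1.5 s to detect and 2.0 s to
take over beats a 10 s reboot, but loses to a 2 s fast reboot after an
immediately detected panic. Which regime real machines fall into is
exactly the open question.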