Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!usc!apple!motcsd!hpda!hpcupt1!hpisod2!dhepner
From: dhepner@hpisod2.HP.COM (Dan Hepner)
Newsgroups: comp.arch
Subject: Fault Tolerant Micros
Message-ID: <13910004@hpisod2.HP.COM>
Date: 17 Jan 90 21:24:59 GMT
Organization: Hewlett Packard, Cupertino
Lines: 51

From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>Some
>machines had redundant hardware (e.g. two ALUs) and could reconfigure
>to cut failed units out of the "processor complex". I don't see
>micros following this path.

Fault tolerant micro machines are following a path at least something
like this. The general scheme, as implemented by Stratus, Sequoia, and
Tandem [S2], involves redundant CPUs and a way of 'cutting out' any
offender who gets a wrong answer.

>Live CPU recovery has become much less interesting since
>multiprocessors came along. With the right software, a failed
>processor does not imply a failed process. For example, Tandem
>checkpoints each process regularly, so that a different processor can
>do a prompt checkpoint-resumption.

Tandem has apparently decided that this was not the correct model for
implementing fault tolerance, although I've searched for and have not
yet found an official statement on just how the S2 does do FT.

>The CPU and IO interconnects have
>to be up to it, of course (dual port those disks).

Dual porting is a classic question in FT. You're going to use
redundant disks, of course. Once you have redundant disks, and have
paid attention to your interconnect scheme to ensure that a path
failure won't take down both disks, then dual porting will not enhance
FT, if FT is defined as "the ability to sustain any single point of
failure".

>And besides: if a
>master/checker pair of CPUs disagree, which one was the one that
>failed? Better to ignore them both and force the board into self test
>mode.

This scheme works fine for Stratus, but one can get to roughly the
same place by using three, and tossing out any one which disagrees
with the other two.

>Well, nonstop machines are ruggedized and rated for e.g. sudden
>overpressures (no kidding). This might influence a chip company to
>change its chip packaging, but not its chip design.

Maybe you could expand on this.

Great. A discussion of fault tolerance.

Dan Hepner
dhepner@hpda.hp.com
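
A footnote on the three-CPU scheme mentioned above: what follows is a
minimal sketch in Python of two-out-of-three voting, assuming each
redundant unit hands back a comparable result for the same step. The
names vote and pair_agrees are made up for illustration; nothing here
describes how Stratus, Sequoia, or Tandem actually build their
hardware. It only shows why a pair can detect a fault while a triple
can also point at the offender.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def pair_agrees(master: Hashable, checker: Hashable) -> bool:
    """A master/checker pair can detect a disagreement, but cannot say
    which of the two units produced the wrong answer."""
    return master == checker


def vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Two-out-of-three vote over the results of three redundant units.

    Returns (majority_value, dissenter_indices). If no two units agree,
    the value is None and all three units are reported as suspect.
    """
    if len(outputs) != 3:
        raise ValueError("this sketch assumes exactly three redundant units")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # No majority at all: nothing can be trusted.
        return None, [0, 1, 2]
    # Any unit that differs from the majority is the one to cut out.
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters
```

Calling vote([7, 7, 9]) singles out the third unit as the dissenter
and keeps the answer 7, whereas pair_agrees(7, 9) can only report that
something is wrong.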