Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!mcgill-vision!bloom-beacon!snorkelwacker!tut.cis.ohio-state.edu!pt.cs.cmu.edu!MATHOM.GANDALF.CS.CMU.EDU!lindsay
From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
Newsgroups: comp.arch
Subject: Re: Fault Tolerant Micros
Message-ID: <7635@pt.cs.cmu.edu>
Date: 19 Jan 90 03:09:46 GMT
References: <13910004@hpisod2.HP.COM>
Organization: Carnegie-Mellon University, CS/RI
Lines: 60

In article <13910004@hpisod2.HP.COM> dhepner@hpisod2.HP.COM (Dan Hepner) writes:
>From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>>Live CPU recovery has become much less interesting since
>>multiprocessors came along. With the right software, a failed
>>processor does not imply a failed process. For example, Tandem
>>checkpoints each process regularly, so that a different processor can
>>do a prompt checkpoint-resumption.
>
>Tandem has apparently decided that this was not the correct model
>to implement fault tolerance, although I've searched for and not
>found yet an official statement on just how the S2 does do FT.

The most basic thing is to contain the damage, so it's usual to use
messages between machines that don't share memory. The next problem
is the data that was inside the failed thing. There are several ways
out:
1- whatever it was doing is dead. (Rarely acceptable, but if the
   customer is just buying a lottery ticket, I suppose you can ask
   him to do it again.)
2- another CPU recomputes the lost data from its copy of the last
   checkpoint, and copies of all the messages and signals sent to
   the dead machine.
3- another CPU has been computing in parallel, and contains data
   that should be identical to the lost data.

Case 1 is an application-level choice, but still needs error detection.
Case 2 is done with software and with error detection.
Case 3 can be done like case 2, or it can be done with a
master-and-two-checkers.
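
Case 2 can be sketched roughly as follows. This is only a toy illustration of "last checkpoint plus copies of the messages", not a description of how Tandem or anyone else actually does it. The names (Process, Backup, deliver) and the integer message format are made up for the example, and the process is assumed to be deterministic, so that replaying the same messages from the same checkpoint rebuilds the same state.

```python
import copy


class Process:
    """Toy deterministic process: state changes only by handling messages."""

    def __init__(self):
        self.state = {"total": 0, "count": 0}

    def handle(self, msg):
        # Hypothetical message format: an integer to accumulate.
        self.state["total"] += msg
        self.state["count"] += 1


class Backup:
    """Lives on a different CPU: last checkpoint plus messages sent since."""

    def __init__(self):
        self.checkpoint = None
        self.log = []

    def take_checkpoint(self, proc):
        self.checkpoint = copy.deepcopy(proc.state)
        self.log.clear()  # older messages are covered by the checkpoint

    def record(self, msg):
        self.log.append(msg)  # copy of every message sent to the primary

    def recover(self):
        """Rebuild the lost state: restore the checkpoint, replay the log."""
        proc = Process()
        proc.state = copy.deepcopy(self.checkpoint)
        for msg in self.log:
            proc.handle(msg)
        return proc


def deliver(primary, backup, msg):
    backup.record(msg)  # log first, so a crash during handle() loses nothing
    primary.handle(msg)


primary, backup = Process(), Backup()
backup.take_checkpoint(primary)
for m in (5, 7):
    deliver(primary, backup, m)
backup.take_checkpoint(primary)  # the log is emptied here
for m in (11, 13):
    deliver(primary, backup, m)

lost_state = copy.deepcopy(primary.state)  # kept only to check the result
del primary  # the primary CPU "fails"
recovered = backup.recover()
```

The sketch leaves out the error detection that decides the primary is dead, and it leans entirely on determinism: anything the process reads that is not a logged message (a clock, say) would have to be logged too.
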
As you pointed out, the three chips can hold a vote, and decide that
only two of the three contain valid data. That can be a lot of data:
with on-chip caches, and PIDs, it can also be data from several
different processes.

>>Well, nonstop machines are ruggedized and rated for e.g. sudden
>>overpressures (no kidding). This might influence a chip company to
>>change its chip packaging, but not its chip design.

>Maybe you could expand on this.

I'm probably out of touch with ruggedizing, but there are companies
that specialize in it. They like things that tolerate flexing. They
like sealed spaces (to keep out salt air and conductive dust and
tobacco tar). They like to coat things: a grease layer is used on
some automotive chips. They like hermetically sealed chips, and they
used to like ceramic over plastic. They used to worry about pin
corrosion, but I don't know what they think of TAB.

The system specs can involve higher ambient temperatures, explosion
(overpressure), shock, high voltage shorted onto the rack, ground
currents on the Ethernet shield, ambient RF, industrial-strength
noise on the power lines, locks on the rack door, having to push a
telex signal through a one-henry coil.

I recall a system that failed from being too near a gamma-ray source:
but that wasn't in the spec. (The gamma rays were just enough so that
an EPROM would forget one bit after about a month. We solved this
with squared-law shielding. That is, we pushed the desk further down
the hall.)
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science