Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!pt.cs.cmu.edu!MATHOM.GANDALF.CS.CMU.EDU!lindsay
From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
Newsgroups: comp.arch
Subject: Re: Reliability
Keywords: parity checkers detection
Message-ID: <7608@pt.cs.cmu.edu>
Date: 17 Jan 90 03:00:52 GMT
References: <34030@mips.mips.COM> <4322@nttmhs.ntt.JP>
 <39807@ames.arc.nasa.gov> <3101@umn-d-ub.D.UMN.EDU>
 <28674@amdcad.AMD.COM> <7566@pt.cs.cmu.edu> <34469@mips.mips.COM>
Organization: Carnegie-Mellon University, CS/RI
Lines: 56

In article <34469@mips.mips.COM> mash@mips.COM (John Mashey) writes:
> a) What are CPU differences between micros and mainframes in this area?
> Are there reliability features of current big machine CPUs that
> are impossible to duplicate in micros? hard to duplicate? easy, but,
> too expensive? Are there features of micros that make them easier to
> make reliable systems from?

Mainframes tend to put parity _everywhere_, and most micros have none
at all.

Ah, but what do you do when an error occurs? Microcoded machines drop
into diagnostic microcode, which analyzes, reports, and then tries to
resume/restart the macroinstruction. Some machines had redundant
hardware (e.g. two ALUs) and could reconfigure to cut failed units out
of the "processor complex". I don't see micros following this path.

> b) Are any of these reliability features from mainframes not so
> necessary when entire CPUs are on single chips?

Yes: the stuff above. Chips have failure modes, and "age", but to the
first order, the (un)reliability of a box depends on its chip-pin
count. Putting the CPU on one chip, instead of 2,000, has a serious
impact. Further, the micro solution allows tricks like master/checker
pairs, which you just wouldn't do if the processor were in a box 40
feet long.

The #1 reason for "parity everywhere" was to detect that you were in
trouble.
The #2 reason was to identify the field-replaceable module (which for
a micro is the whole CPU, or more). The trailing #3 reason was the
hope of live CPU recovery.

Live CPU recovery has become much less interesting since
multiprocessors came along. With the right software, a failed
processor does not imply a failed process. For example, Tandem
checkpoints each process regularly, so that a different processor can
promptly resume it from the last checkpoint. The CPU and IO
interconnects have to be up to it, of course (dual port those disks).

And besides: if a master/checker pair of CPUs disagree, which one was
the one that failed? Better to ignore them both and force the board
into self test mode.

> c) Beyond the CPUs, what are the issues that might be different
> at the system level?

Well, nonstop machines are ruggedized and rated for e.g. sudden
overpressures (no kidding). This might influence a chip company to
change its chip packaging, but not its chip design.

> d) ECC, parity, nothing: where are the boundaries on tradeoffs?

Well, the Cyclone uses cache refill as a way to fix cache parity
errors. And, they have extra cache RAMs that they can spare in. But it
would probably be OK (and simpler) if the machine just disabled a
quarter or a half of its cache, and then ran on one lung.
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science
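
For anyone who wants the refill trick from (d) spelled out, here is a
rough sketch in Python. It is a toy, not the Cyclone's actual design:
the class and method names are made up, and it assumes a write-through
cache so that main memory always holds a good copy to refetch. A parity
mismatch on a read is simply treated as a miss.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") & 1


class ParityCache:
    """Toy write-through cache (an assumption), one parity bit per entry."""

    def __init__(self, memory: dict):
        self.memory = memory           # backing store: address -> word
        self.lines = {}                # address -> (word, stored parity)
        self.refills_after_error = 0   # parity errors repaired by refill

    def _refill(self, addr: int) -> int:
        # Fetch the good copy from memory and recompute its parity.
        word = self.memory[addr]
        self.lines[addr] = (word, parity(word))
        return word

    def read(self, addr: int) -> int:
        if addr not in self.lines:
            return self._refill(addr)  # ordinary miss
        word, stored = self.lines[addr]
        if parity(word) != stored:
            # Parity error: discard the entry and refetch it.
            self.refills_after_error += 1
            return self._refill(addr)
        return word

    def write(self, addr: int, word: int) -> None:
        self.memory[addr] = word       # write-through keeps memory current
        self.lines[addr] = (word, parity(word))

    def flip_bit(self, addr: int, bit: int) -> None:
        """Simulate a single-bit upset in the cache RAM."""
        word, stored = self.lines[addr]
        self.lines[addr] = (word ^ (1 << bit), stored)
```

Note the limits: one parity bit only catches an odd number of flipped
bits, and with a write-back cache a dirty entry would have no clean
copy to refill from.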