Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!pt.cs.cmu.edu!MATHOM.GANDALF.CS.CMU.EDU!lindsay
From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
Newsgroups: comp.arch
Subject: Re: Reliability
Keywords: parity checkers detection
Message-ID: <7608@pt.cs.cmu.edu>
Date: 17 Jan 90 03:00:52 GMT
References: <34030@mips.mips.COM> <4322@nttmhs.ntt.JP> <39807@ames.arc.nasa.gov> <3101@umn-d-ub.D.UMN.EDU> <28674@amdcad.AMD.COM> <7566@pt.cs.cmu.edu> <34469@mips.mips.COM>
Organization: Carnegie-Mellon University, CS/RI
Lines: 56


In article <34469@mips.mips.COM> mash@mips.COM (John Mashey) writes:
> a) What are CPU differences between micros and mainframes in this area?
> Are there reliability features of current big machine CPUs that
> are impossible to duplicate in micros? hard to duplicate? easy, but,
> too expensive? Are there features of micros that make them easier to
> make reliable systems from?

Mainframes tend to put parity _everywhere_, and most micros are
completely without. Ah, but what do you do when an error occurs?
Microcoded machines drop into diagnostic microcode, which analyzes,
reports, and then tries to resume/restart the macroinstruction.  Some
machines had redundant hardware (e.g. two ALUs) and could reconfigure
to cut failed units out of the "processor complex". I don't see
micros following this path.
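
A minimal sketch of the detection half, assuming one even-parity bit
per stored word. The names (parity_bit, store, check) are made up for
illustration and do not describe any particular machine. It shows
that parity catches a single flipped bit but misses a double flip, so
it tells you that you are in trouble and not how to fix it.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word has an odd number of 1 bits."""
    return bin(word).count("1") & 1


def store(word: int) -> tuple[int, int]:
    """Model a stored word as (data, parity). Hypothetical layout."""
    return (word, parity_bit(word))


def check(stored: tuple[int, int]) -> bool:
    """True if the stored parity still matches the data."""
    word, p = stored
    return parity_bit(word) == p


good = store(0b1011)
single_flip = (good[0] ^ 0b0001, good[1])   # one bit corrupted
double_flip = (good[0] ^ 0b0011, good[1])   # two bits corrupted

print(check(good), check(single_flip), check(double_flip))
```

The double flip passes the check, which is the usual argument for ECC
where errors must be corrected rather than just noticed.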

> b) Are any of these reliability features from mainframes not so
> necessary when entire CPUs are on single chips?

Yes: the stuff above. Chips have failure modes, and "age", but to the
first order, the (un)reliability of a box depends on its chip-pin
count. Putting the CPU on one chip, instead of 2,000, has a serious
impact. Further, the micro solution allows tricks like master/checker
pairs, which you just wouldn't do if the processor was in a box 40
feet long.
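
A minimal sketch of a master/checker pair, assuming both units see
the same operands and a comparator watches the outputs. The function
names and the stuck-bit fault are hypothetical, chosen only to make
the point. The comparison detects a disagreement, but nothing in it
says which unit is the broken one.

```python
class LockstepMismatch(Exception):
    """Raised when the master and checker outputs differ."""


def run_lockstep(master, checker, *operands):
    """Run both units on the same operands and compare the results."""
    m = master(*operands)
    c = checker(*operands)
    if m != c:
        # Detection only: the comparator cannot tell which side is wrong.
        raise LockstepMismatch(f"master={m!r} checker={c!r}")
    return m


def good_adder(a, b):
    return (a + b) & 0xFFFF


def faulty_adder(a, b):
    # Hypothetical fault: result bit 2 is stuck at 1.
    return ((a + b) & 0xFFFF) | 0x4


print(run_lockstep(good_adder, good_adder, 1, 2))
try:
    run_lockstep(good_adder, faulty_adder, 1, 2)
except LockstepMismatch as err:
    print("mismatch:", err)
```

Note that the fault stays latent whenever the correct answer already
has bit 2 set, so a pair can agree for a while with one side broken.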

The #1 reason for "parity everywhere" was to detect that you were in
trouble. The #2 reason was to identify the field-replaceable module
(which for a micro is the whole CPU, or more). The trailing #3 reason
was the hope of live CPU recovery.

Live CPU recovery has become much less interesting since
multiprocessors came along. With the right software, a failed
processor does not imply a failed process. For example, Tandem
checkpoints each process regularly, so that a different processor can
do a prompt checkpoint-resumption. The CPU and IO interconnects have
to be up to it, of course (dual port those disks). And besides: if a
master/checker pair of CPUs disagree, which one was the one that
failed? Better to ignore them both and force the board into self test
mode.
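
A toy sketch of checkpoint-resumption, assuming a process is a list
of steps over a small state and a "processor" is just a callable that
runs one step. This is not Tandem's actual mechanism; every name here
(run_with_checkpoints, make_flaky, checkpoint_every) is invented for
illustration. When a processor fails, the work since the last
checkpoint is redone on the next processor.

```python
import copy


class ProcessorFailed(Exception):
    """Hypothetical signal that the current processor has died."""


def run_with_checkpoints(steps, state, processors, checkpoint_every=2):
    """Run steps in order, resuming from the last checkpoint on failure."""
    checkpoint = (0, copy.deepcopy(state))   # (next step index, saved state)
    i, cpu = 0, 0
    while i < len(steps):
        try:
            state = processors[cpu](steps[i], state)
        except ProcessorFailed:
            cpu += 1                         # cut the failed processor out
            if cpu >= len(processors):
                raise                        # nobody left to take over
            i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            continue
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = (i, copy.deepcopy(state))
    return state, cpu


def healthy(step, state):
    return step(state)


def make_flaky(fail_on_call):
    """Build a processor that dies on its Nth call."""
    calls = 0

    def cpu(step, state):
        nonlocal calls
        calls += 1
        if calls == fail_on_call:
            raise ProcessorFailed("cpu died")
        return step(state)

    return cpu


steps = [lambda s: s + 1] * 5
result = run_with_checkpoints(steps, 0, [make_flaky(4), healthy])
print(result)   # (final state, index of the processor that finished)
```

The failed process loses at most checkpoint_every steps of work, which
is the tradeoff against how often you pay for a checkpoint.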

> c) Beyond the CPUs, what are the issues that might be different
> at the system level?

Well, nonstop machines are ruggedized and rated for e.g. sudden
overpressures (no kidding). This might influence a chip company to
change its chip packaging, but not its chip design.

> d) ECC, parity, nothing: where are the boundaries on tradeoffs?

Well, the Cyclone uses cache refill as a way to fix cache parity
errors. And, they have extra cache RAMs that they can spare in.  But
it would probably be OK (and simpler) if the machine just disabled
a quarter or a half of its cache, and then ran on one lung. 
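
A minimal sketch of refill-on-parity-error, assuming the cache only
ever holds clean copies (e.g. write-through), so memory always has a
good version of the line. The ParityCache class and its fields are
hypothetical and are not a description of the Cyclone. A bad line is
simply treated as a miss and fetched again.

```python
def parity(word: int) -> int:
    """Even-parity bit for a word."""
    return bin(word).count("1") & 1


class ParityCache:
    def __init__(self, memory):
        self.memory = memory          # backing store: addr -> word
        self.lines = {}               # cached: addr -> (word, parity)
        self.refills_on_error = 0     # how many parity errors were repaired

    def read(self, addr):
        entry = self.lines.get(addr)
        if entry is not None:
            word, p = entry
            if parity(word) == p:
                return word           # clean hit
            self.refills_on_error += 1  # parity error: fall through to refill
        word = self.memory[addr]      # miss (or bad line): refill from memory
        self.lines[addr] = (word, parity(word))
        return word


mem = {0x10: 0b1010}
cache = ParityCache(mem)
cache.read(0x10)                                  # first read fills the line
cache.lines[0x10] = (0b1011, cache.lines[0x10][1])  # inject a single-bit flip
print(bin(cache.read(0x10)), cache.refills_on_error)
```

This only works for lines that are not dirty; a write-back cache with
modified data would need ECC or some other copy to recover from.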
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science