Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!decwrl!sgi!vjs@rhyolite.wpd.sgi.com
From: vjs@rhyolite.wpd.sgi.com (Vernon Schryver)
Newsgroups: comp.arch
Subject: Re: Workstation Data Integrity
Message-ID: <68604@sgi.sgi.com>
Date: 6 Sep 90 01:43:33 GMT
References: <6797.26d6edce@vax1.tcd.ie> <56qmo1w162w@zl2tnm.gp.govt.nz> <10397@pt.cs.cmu.edu>
Sender: guest@sgi.sgi.com
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 26

In article <10397@pt.cs.cmu.edu>, lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>
> Yes. Also, note that parity/ECC may catch problems with connectors,
> bus drivers, fans and filters (== overheating), system environment,
> and so on. ...

Exactly. There is a workstation in a lab near my office that was having
several parity errors per hour, until an unnamed idiot removed the extra
SIMMs he'd scrounged from a different model of the same brand of machine.
The diagnostics reported no problems, and the errors occurred only when the
machine got hot. Parity saved days of looking for strange, new kernel bugs,
which would have been the diagnosis without the parity error reports.

Parity errors caused by a timing problem figured prominently in the
resolution, after years of searching, of a problem in the old 68K SGI line.
Without the parity error reports, we would still be looking for a wild
pointer.

From reading the UNIX-on-PC-clones newsgroups, it seems to me that parity
errors are the main, and the most universally available and reliable, memory
diagnostic on such machines, detecting all kinds of speed, heat, and
compatibility problems.

Vernon Schryver, vjs@sgi.com