Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!apple!portal!cup.portal.com!Don_A_Corbitt
From: Don_A_Corbitt@cup.portal.com
Newsgroups: comp.arch
Subject: Re: Workstation Data Integrity
Message-ID: <33491@cup.portal.com>
Date: 3 Sep 90 05:28:21 GMT
References: <6797.26d6edce@vax1.tcd.ie> <56qmo1w162w@zl2tnm.gp.govt.nz> <19875@crg5.UUCP> <19208@dime.cs.umass.edu> <2201@lectroid.sw.stratus.com> <68362@sgi.sgi.com>
Organization: The Portal System (TM)
Lines: 57

> I think your system is more likely to be hit by lightning than to have
> sporadic crashes due to this failure mode.  Do we have any real hard
> numbers on how often this failure occurs?
>
> You're probably more likely to see this failure on the SIMM socket rather
> than on the chip leads.  In that case, there could easily be more than
> a single bit error and the parity detection could still fail to catch the
> error.
>
> But if a part whose only job is to decrease the rate of undetected failures
> does not make a significant improvement in the rate of undetected failures,
> then what good is it?
>
> If someone can show me that those parity chips really do significantly
> decrease the rate of undetected system failures, then I'll agree that
> they are necessary.  Even if they only make a 5% reduction in this
> rate, they may be an acceptable idea.
>
> Somehow I think that if they make any reduction at all, it's several
> places to the right of the decimal point.  E.g. .0001%.  Even worse
> though, they may actually be decreasing the overall reliability of systems.
>
>                         Bruce Karsh
>                         karsh@sgi.com

Well, I have some anecdotal evidence of the benefits of parity.  I've
been working with IBM PCs and clones (the original subject of this
discussion) since the early days.  I've probably been around machines
for 8 years * 4 machines (average), or 32 machine-years.

I've seen 5 or 6 machines that would tend to get parity errors.
Each time, it was possible to fix the problem by replacing one or more
RAM chips (with one exception).  These machines all passed their
power-on self-test (POST), but would fail every few minutes/hours/days.
Knowing that the hardware was broken, we were able to blindly swap RAMs
until things worked.  Without parity checking, we (being SW developers)
would have suspected our own software: bugs, pointer problems, etc.

Each machine treats parity errors differently.  Some show the suspected
address and RAM chip, others just say "Parity Error R)eboot or I)gnore".

The one time the problem wasn't bad RAM chips was when I installed a
memory expansion board improperly (the vendor sent the wrong docs).  It
used page mode RAM, but I had the page mode switch turned off.  This was
for the upper 4MB of an 8MB 386 machine.  POST worked fine, using the
RAM for a RAM disk worked fine, but OS/2 would crash with a parity error
when booting.  It appeared that the access pattern would change the
failure mode.

What's the point?  RAM is an area where the end-user often gets
involved.  Since it is so easy to damage chips when installing them, I
find it worthwhile to have some sanity checking on their operation.
Also, most of the transistors in a given system will be in the RAM
chips.  Parity gives an inexpensive way to reduce the number of "silent
wrong answers".
---
Don_A_Corbitt@cup.portal.com     Not a spokesperson for CrystalGraphics, Inc.
Mail flames, post apologies.     Support short .signatures, three lines max.
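
A minimal sketch of the single-bit versus multi-bit point raised in the
quoted text, assuming one even-parity bit per 8-bit byte.  Real hardware
may use odd parity instead; the detection property is the same.  The
helper names (parity_bit, store, check, flip) are made up for
illustration and do not correspond to any actual BIOS or memory
controller interface.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit: 1 if the byte has an odd number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte: int) -> tuple[int, int]:
    """Model a write: keep the data byte together with its parity bit."""
    return (byte & 0xFF, parity_bit(byte))


def check(stored: tuple[int, int]) -> bool:
    """Model a read: True if the data still matches its stored parity bit."""
    data, parity = stored
    return parity_bit(data) == parity


def flip(stored: tuple[int, int], *bits: int) -> tuple[int, int]:
    """Simulate a fault by inverting the given data bit positions (0..7)."""
    data, parity = stored
    for b in bits:
        data ^= 1 << b
    return (data, parity)


word = store(0b10110010)

print(check(word))              # True: no fault
print(check(flip(word, 3)))     # False: a single-bit error is detected
print(check(flip(word, 3, 5)))  # True: a double-bit error slips through
```

Any odd number of flipped bits changes the parity and is caught; any
even number leaves it unchanged and goes unnoticed.  That is the gap the
quoted text points at, and it is also why parity catches the
single-bad-chip failures described in the reply.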