Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!decwrl!sgi!karsh@trifolium.esd.sgi.com
From: karsh@trifolium.esd.sgi.com (Bruce Karsh)
Newsgroups: comp.arch
Subject: Re: Workstation Data Integrity
Message-ID: <68128@sgi.sgi.com>
Date: 30 Aug 90 09:09:18 GMT
References: <1990Aug3.204358.330@portia.Stanford.EDU> <40694@mips.mips.COM> <2399@crdos1.crd.ge.COM> <1990Aug10.171744.9639@zoo.toronto.edu> <2421@crdos1.crd.ge.COM> <1990Aug18.210132.25203@sco.COM> <2434@crdos1.crd.ge.COM> <6797.26d6edce@vax1.tcd.ie> <2469@crdos1.cr
Sender: news@sgi.sgi.com
Reply-To: karsh@trifolium.sgi.com (Bruce Karsh)
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 68

>I know that there are an awful lot of ways that a computer can produce
>wrong answers. That is no excuse for failing to catch the ones that it
>is practical to catch. Adding an extra bit to each byte (or whatever)
>seems like a small price to pay for a bit more confidence in the
>results.

But adding the extra bit has a reliability cost too:

  Memory boards need more pins on their connectors. Mechanical
  connections are a notorious failure point.

  More power is used, so the system runs hotter. There may need to be
  more reliance on fans (which are also notorious) to cool the system.

  The component count is increased, so there are more components which
  can potentially fail.

  Parity-checking circuitry, which can also fail, has been added.

  Multiple-bit errors may not be detected.

These may all be small effects, but with a modern, well-designed memory
system, parity errors are a small effect as well.

> Also, given the reliability of current memory and such, crashes
>due to parity errors would probably be a lot less frequent than crashes
>due to other random events (i.e. adding this feature probably wouldn't
>do much harm to the MTBF numbers for the system).

Given the reliability of current memory and such, how probable is the
event that parity protects against?
I don't have the answer to this, but I have to believe that someone has
studied this problem. Are memory parity errors in any way a significant
contributor to computer errors? It seems to me that there are so many
other sources of computer error which are so much more significant that
memory parity is just silly. We don't usually put parity on floating
point processors or internal CPU data paths and registers. Putting it
on memory seems like a very expensive "spit in the ocean".

Is there some real hard data which shows that memory is so failure-prone
that parity checking is called for? If so, why is it that a single bit
of parity checking is adequate? Is the failure mode such that odd-bit
failures are by far the most common kind? The few memory failures that
I've looked carefully at have been pretty massive, not single-bit.

Has memory parity become a senseless security blanket for the insecure
and uninformed?

>One final note: a lot of small computers are used for business applications
>like payroll, accounting, inventory and such. This may not be as flashy as
>simulating the space shuttle but silent failures in these applications can
>be pretty devastating to the business. Unfortunately, the users of such
>systems are probably the least likely to appreciate the value of knowing that
>the computer detected an error and aborted rather than giving wrong answers.

True, but if the protection is from an extremely unlikely event, it
makes more sense to spend the cost of that protection on guarding
against a more likely event. Or, alternatively, just leave it off
entirely. You'll never make a perfectly reliable computer. You have to
settle for some statistical level of reliability.

I'd like to see a comparison of the probability of a memory parity error
causing a business to make a significant financial mistake, versus the
probability of a software error causing the mistake.

			Bruce Karsh
			karsh@sgi.com
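
To make the single-parity-bit question above concrete, here is a minimal
sketch in Python. It assumes even parity computed over one 8-bit byte and
assumes the stored parity bit itself is never corrupted; the function names
parity_bit and parity_detects are made up for this illustration and do not
describe any particular memory controller.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit for an 8-bit value: 1 if the number of 1 bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def parity_detects(stored: int, flips: int) -> bool:
    """Flip the bits selected by the mask `flips` in `stored` and report
    whether the parity computed at write time mismatches at read time.
    Assumption: the parity bit itself is stored without error."""
    written_parity = parity_bit(stored)
    corrupted = (stored ^ flips) & 0xFF
    return parity_bit(corrupted) != written_parity


if __name__ == "__main__":
    data = 0b10110010
    print(parity_detects(data, 0b00000001))  # one bit flipped: detected
    print(parity_detects(data, 0b00000011))  # two bits flipped: missed
    print(parity_detects(data, 0b00000111))  # three bits flipped: detected
```

Under these assumptions, any odd number of flipped bits is caught and any
even number slips through, which is why the distribution of real failure
modes matters when judging whether one parity bit is adequate.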