Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!decwrl!sgi!karsh@trifolium.esd.sgi.com
From: karsh@trifolium.esd.sgi.com (Bruce Karsh)
Newsgroups: comp.arch
Subject: Re: Workstation Data Integrity
Message-ID: <68128@sgi.sgi.com>
Date: 30 Aug 90 09:09:18 GMT
References: <1990Aug3.204358.330@portia.Stanford.EDU> <40694@mips.mips.COM> <2399@crdos1.crd.ge.COM> <1990Aug10.171744.9639@zoo.toronto.edu> <2421@crdos1.crd.ge.COM> <1990Aug18.210132.25203@sco.COM> <2434@crdos1.crd.ge.COM> <6797.26d6edce@vax1.tcd.ie> <2469@crdos1.cr
Sender: news@sgi.sgi.com
Reply-To: karsh@trifolium.sgi.com (Bruce Karsh)
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 68

>I know that there are an awful lot of ways that a computer can produce
>wrong answers.  That is no excuse for failing to catch the ones that it
>is practical to catch.  Adding an extra bit to each byte (or whatever)
>seems like a small price to pay for a bit more confidence in the
>results.

But adding the extra bit has a reliability cost too:

    Memory boards need more pins on their connectors.  Mechanical connections
    are a notorious failure point.

    More power is used so the system runs hotter.  There may need to be more
    reliance on fans (which are also notorious) to cool the system.

    The component count is increased so there are more components which can
    potentially fail.

    Parity checking circuitry, which can itself fail, has been added.

    Multiple bit errors may not be detected.

These may all be small effects, but with a modern, well designed memory
system, parity errors are a small effect as well.

> Also, given the reliability of current memory and such, crashes
>due to parity errors would probably be a lot less frequent than crashes
>due to other random events (i.e. adding this feature probably wouldn't
>do much harm to the MTBF numbers for the system).

Given the reliability of current memory and such, how probable is the
event that parity protects against?  I don't have the answer to this, but
I have to believe that someone has studied this problem.  Are memory
parity errors in any way a significant contributor to computer errors?

It seems to me that there are so many other sources of computer error which
are so much more significant that memory parity is just silly.  We don't
usually put parity on floating point processors or internal CPU data paths
and registers.  Putting it on memory seems like a very expensive "spit in
the ocean".

Is there some real hard data which shows that memory is so failure-prone
that parity checking is called for?  If so, why is it that a single bit
of parity checking is adequate?  Is the failure mode such that odd-bit
failures are by far the most common kind?  The few memory failures that
I've looked carefully at have been pretty massive, not single-bit.
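
To make the single-bit question concrete, here is a minimal sketch of one
parity bit per byte.  It assumes even parity over 8 data bits; the function
names and the 0x5A test pattern are made up for illustration and do not
describe any particular memory board.  A fault that flips an odd number of
data bits is caught, while one that flips an even number goes unnoticed.

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 if the count of 1 bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte):
    """Model a 9-bit memory cell: 8 data bits plus one parity bit."""
    return (byte & 0xFF, parity_bit(byte))


def check(cell):
    """Return True if the stored parity bit still matches the data bits."""
    data, parity = cell
    return parity_bit(data) == parity


def flip(cell, *bit_positions):
    """Flip the given data bit positions (0-7) to simulate a memory fault."""
    data, parity = cell
    for pos in bit_positions:
        data ^= 1 << pos
    return (data, parity)


cell = store(0x5A)                       # hypothetical stored value
single_ok = check(flip(cell, 3))         # one bit flipped
double_ok = check(flip(cell, 3, 6))      # two bits flipped
triple_ok = check(flip(cell, 0, 3, 6))   # three bits flipped

# A False result means the parity check caught the fault.
print(check(cell), single_ok, double_ok, triple_ok)
```

In this sketch the double flip passes the check exactly as uncorrupted data
would, so whether one parity bit is worth having depends on how real
failures are distributed, which is the data I am asking for.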

Has memory parity become a senseless security blanket for the insecure and
uninformed?

>One final note:  a lot of small computers are used for business applications
>like payroll, accounting, inventory and such.  This may not be as flashy as
>simulating the space shuttle but silent failures in these applications can
>be pretty devastating to the business.  Unfortunately, the users of such
>systems are probably the least likely to appreciate the value of knowing that
>the computer detected an error and aborted rather than giving wrong answers.

True, but if the protection is from an extremely unlikely event, it makes
sense to put the cost of protection into protecting against a more likely
event.  Or, alternatively, just leave it off entirely.  You'll never make
a perfectly reliable computer.  You have to settle for some statistical level
of reliability.

I'd like to see a comparison of the probability of a memory parity error
causing a business to make a significant financial mistake, versus the
probability of a software error causing the mistake.

			Bruce Karsh
			karsh@sgi.com