Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!uwm.edu!zaphod.mps.ohio-state.edu!mips!sgi!karsh@trifolium.esd.sgi.com
From: karsh@trifolium.esd.sgi.com (Bruce Karsh)
Newsgroups: comp.arch
Subject: Re: Workstation Data Integrity
Message-ID: <68362@sgi.sgi.com>
Date: 1 Sep 90 09:28:18 GMT
References: <6797.26d6edce@vax1.tcd.ie> <56qmo1w162w@zl2tnm.gp.govt.nz> <19875@crg5.UUCP> <19208@dime.cs.umass.edu> <2201@lectroid.sw.stratus.com>
Sender: guest@sgi.sgi.com
Reply-To: karsh@trifolium.sgi.com (Bruce Karsh)
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 132

In article <2201@lectroid.sw.stratus.com> leavitt@mordor.sw.stratus.com (Will Leavitt) writes:

>All true...  but modern CMOS memory runs ridiculously cool, parity
>detects half of all multi bit errors, and as we'll see, there are very 
>definite reasons why memory chips fail more often.

CMOS devices run cool when they are switched slowly.  They can consume a lot
of power when they are switched rapidly.  Also, CMOS memory is expensive.
Large memory systems are not often CMOS.

>Data bits are stored as charge on a capacitor in the memory cell, and
>is read (sensed) by connecting the cell to a sense amplifier via a bit
>line shared by other cells.  Thus if an alpha particle zips through
>your capacitor, it can flip a bit in memory.  If it zips through the
>bitline while you are reading, you get wrong data, and the data gets
>written back wrong.  (internally, DRAM reads are destructive, and are
>always followed by a restore).  For CMOS 1 Meg parts, a typical error
>rate is 270 failures per 10^9 device hours. According to Siemens, bit
>line failures now dominate alpha sensitivity.  Both of these lead to
>INTERMITTENT SINGLE BIT ERRORS.

That works out to less than one single-bit error every 13 years of
continuous operation on a system with 4 megabytes of CMOS DRAM.  And in
most cases, that single-bit error would not even affect the operation
of the system.  Surely this is a spit in the ocean.  I doubt that most
people would ever observe one of these in their entire computing life.
Certainly there are sources of failure in most computer systems which
are much higher than this.  Like the electrical wall outlet!
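
As a rough check of that figure, here is the arithmetic as a short
Python sketch.  The assumptions are mine, not from the quoted article:
the 4 megabytes is built from 1 Mbit x1 parts with no parity chips, the
quoted 270 failures per 10^9 device hours applies to each part
independently, and a year is 8760 hours.

```python
# Back-of-the-envelope soft error interval for a small DRAM array.
# Assumed (not from the quoted article): 1 Mbit x1 parts, no parity
# chips, independent failures, 8760 hours per year.
FIT_PER_DEVICE = 270           # failures per 10^9 device hours (quoted)
DEVICE_BITS = 1 << 20          # one 1 Mbit DRAM chip
MEMORY_BYTES = 4 * (1 << 20)   # 4 megabytes of main memory
HOURS_PER_YEAR = 24 * 365


def years_between_errors(fit, memory_bytes, device_bits):
    """Mean years between single-bit errors for the whole array."""
    chips = memory_bytes * 8 // device_bits
    failures_per_hour = chips * fit / 1e9
    return 1.0 / failures_per_hour / HOURS_PER_YEAR


chips = MEMORY_BYTES * 8 // DEVICE_BITS
years = years_between_errors(FIT_PER_DEVICE, MEMORY_BYTES, DEVICE_BITS)
print(chips, round(years, 1))  # 32 13.2
```

With a different chip count or failure rate the interval scales
inversely, so a much larger memory at the same rate would see errors
correspondingly more often.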

If the failure rate of 4 Meg DRAMs is really a lot higher than this,
then perhaps some protection is called for.  But what good is parity?
It just replaces the damage caused by the memory error with the damage
caused by a catastrophic system crash.

>There is
>a story where Amdahl built a series of mainframes with no error
>correction.  Their DRAM vendor packaged a batch of DRAMs in ceramic
>DIPs with a good dose of uranium contaminating the ceramic, and the
>resulting mainframes wouldn't stay up for more than a day.  Of course,
>they failed a different way each time.  Amdahls now have ECC.

A company sent out a bad batch of DRAMs.  So what else is new?  It happens
all the time.  How common is this failure?  It sounds like a spit in the
ocean to me.

>2) DRAMs come in difficult to solder-inspect packages

>Crack open the top of your Sparcstation or Iris for this one...  The
>most popular package for DRAMs these days is the SOJ; the leads curl
>underneath the chip and are impossible to inspect.  The most popular
>packages for logic are either through hole (like pin grid arrays), or
>gull wing (plastic quad flat pack).  Those big gate arrays are PQFPs.
>Both are easy to inspect.  Now if not quite enough solder gets
>squeegeed through the solder mask when they make the board, and/or if
>the chip has a slightly non-coplanar lead, then instead of being
>soldered to the board, the lead ends up resting on a bump of solder
>below it.  Because of the springiness in the leads, this will work for
>a while, but eventually oxidation will cause intermittent contact.
>Now, most DRAMs used for main memory are 1 bit wide parts, so this
>results in INTERMITTENT SINGLE BIT ERRORS.

No doubt true, but at what rate does this failure mode occur?  There
are a lot of high density interconnect schemes now and even more are on
the way.  Are you suggesting that they are so failure prone that they
require error detecting logic?

In almost all cases, this failure would be detected during system thermal
testing and it would never make it out the door.  It is possible that a
certain number would slip through.  How common is this failure?  Would
a typical system ever have a failure because of this failure mode?  Is
it worth adding 12% to the cost and size of a memory system and making
it run more slowly because of this?  Couldn't that money be spent elsewhere
to more effectively improve the reliability of the system?

I think your system is more likely to be hit by lightning than to have
sporadic crashes due to this failure mode.  Do we have any real hard
numbers on how often this failure occurs?

You're probably more likely to see this failure on the SIMM socket rather
than on the chip leads.  In that case, there could easily be more than
a single bit error and the parity detection could still fail to catch the
error.
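
To make that concrete, here is a minimal Python sketch of a single
parity bit over one byte.  The function names and the sample byte are
made up for illustration, and it assumes even parity with the stored
parity bit itself left intact.  A single flipped bit changes the parity
and is caught; two flipped bits cancel out and pass unnoticed.

```python
# Single-bit parity over one byte: catches an odd number of flipped
# bits, misses an even number.  Names and values are illustrative only.

def parity_bit(byte):
    """Even parity: 1 if the byte has an odd number of 1 bits."""
    return bin(byte).count("1") % 2


def parity_detects(original, corrupted):
    """True if the parity stored at write time no longer matches."""
    return parity_bit(original) != parity_bit(corrupted)


word = 0b10110010                # byte as written to memory
one_flip = word ^ 0b00000100     # one bit corrupted
two_flips = word ^ 0b00010100    # two bits corrupted

print(parity_detects(word, one_flip))   # True
print(parity_detects(word, two_flips))  # False
```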

>3) DRAMs (like all dynamic circuits) are prone to forgetfulness

>DRAMs store a bit by either charging or not charging a tiny capacitor;
>the charge on the capacitor must be refreshed every 15ms before it
>dissipates.  Normally this works fine, but marginal chips are prone to
>data retention problems-- especially at high temperatures and out of
>spec voltage ranges.  Dynamic circuits are used in many CMOS
>microprocessors as well, but typically refreshing is not a problem (it
>happens on every clock tick, for example).

DRAMs, when properly used, are not any more prone to forgetfulness than
the other logic chips, unless the DRAM is defective.

A defective memory chip will have errors.  But are memory chips defective
at a rate so much higher than other chips that it is a problem? If not,
then why single out memory chips for parity protection?

>Probably.  Pretty soon error correction will be standard on all machines
>with significant memory sizes.

I suspect that won't happen.  Memory parity errors are a very rare
failure mode.  I don't think too many designers are going to add extra
cost to their systems to guard against this failure.  Especially not in
the price-competitive computer market of today.  There are just too many
better places to improve reliability.

>>I'd like to see a comparison of the probability of a memory parity error
>>causing a business to make a significant financial mistake, versus the
>>probability of a software error causing the mistake.

>True.  But soft memory errors, like bad disk blocks, are a solved problem.
>Software errors are not.

But if a part whose only job is to decrease the rate of undetected failures
does not make a significant improvement in the rate of undetected failures,
then what good is it?

If someone can show me that those parity chips really do significantly
decrease the rate of undetected system failures, then I'll agree that
they are necessary.  Even if they only make a 5% reduction in this
rate, they may be an acceptable idea.

Somehow I think that if they make any reduction at all, it's several
places to the right of the decimal point, e.g. 0.0001%.  Even worse
though, they may actually be decreasing the overall reliability of systems.

			Bruce Karsh
			karsh@sgi.com