Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!uwm.edu!zaphod.mps.ohio-state.edu!samsung!xylogics!transfer!lectroid!mordor!leavitt
From: leavitt@mordor.hw.stratus.com (Will Leavitt)
Newsgroups: comp.arch
Subject: Re: Workstation Data Integrity
Message-ID: <2201@lectroid.sw.stratus.com>
Date: 1 Sep 90 02:24:35 GMT
References: <6797.26d6edce@vax1.tcd.ie> <56qmo1w162w@zl2tnm.gp.govt.nz> <19875@crg5.UUCP> <19208@dime.cs.umass.edu>
Sender: usenet@lectroid.sw.stratus.com
Reply-To: leavitt@mordor.sw.stratus.com (Will Leavitt)
Organization: Stratus Computer, Hardware Engineering
Lines: 117

Concerning the additional hardware needed for parity on memory, Bruce
Karsh (karsh@sgi.com) writes:

>But adding the extra bit has a reliability cost too:
>
>  Memory boards need more pins on their connectors.  Mechanical connections
>  are a notorious failure point.
>
>  More power is used so the system runs hotter.  There may need to be more
>  reliance on fans (which are also notorious) to cool the system.
>
>  The component count is increased so there are more components which can
>  potentially fail.
>
>  Parity checking circuitry which can also fail has been added.
>
>  Multiple bit errors may not be detected.

All true...  but modern CMOS memory runs ridiculously cool, parity
detects about half of all possible multi-bit errors (every one that
flips an odd number of bits), and as we'll see, there are very definite
reasons why memory chips fail more often than logic chips.
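
To make the "half of all multi-bit errors" figure concrete, here is a
small Python sketch.  It is only an illustration: the function names are
made up, and it assumes one even-parity bit per 8-bit byte and treats
every error pattern as equally likely, which real failures are not.

```python
from itertools import combinations

DATA_BITS = 8                 # assumed: one parity bit per 8-bit byte
WORD_BITS = DATA_BITS + 1     # data bits plus the parity bit itself

def parity(value):
    """Return 1 if value has an odd number of 1 bits, else 0."""
    return bin(value).count("1") & 1

def encode(data):
    """Append an even-parity bit so the 9-bit word has an even number of 1s."""
    return (data << 1) | parity(data)

def error_detected(word):
    """A stored word is bad if its overall parity is no longer even."""
    return parity(word) == 1

def detection_rate(n_flips, data=0xA5):
    """Fraction of all n_flips-bit error patterns that parity detects."""
    good = encode(data)
    patterns = list(combinations(range(WORD_BITS), n_flips))
    caught = 0
    for bits in patterns:
        bad = good
        for b in bits:
            bad ^= 1 << b
        caught += error_detected(bad)
    return caught / len(patterns)

def overall_multibit_rate():
    """Detected fraction over every error pattern with 2 or more flipped bits."""
    good = encode(0xA5)
    masks = [m for m in range(1, 1 << WORD_BITS) if bin(m).count("1") >= 2]
    return sum(error_detected(good ^ m) for m in masks) / len(masks)

if __name__ == "__main__":
    for n in range(1, 5):
        print(n, "bit(s) flipped:", detection_rate(n))
    print("all multi-bit patterns:", round(overall_multibit_rate(), 3))
```

Every pattern with an odd number of flipped bits is caught and every
pattern with an even number is missed, so over all multi-bit patterns
the detected fraction comes out just under one half (247 of 502).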

>We don't usually put parity on floating point processors or internal
>CPU data paths and registers.  Putting it on memory seems like a very
>expensive "spit in the ocean".
>
>Is there some real hard data which shows that memory is so failure-prone
>that parity checking is called for?  If so, why is it that a single bit
>of parity checking is adequate.  Is the failure mode such that even-bit
>failures are by far the most common kind?  The few memory failures that
>I've looked carefully at have been pretty massive, not single-bit.

There are at least 3 DRAM failure mechanisms that aren't applicable to
floating point processors, CPU data paths, and registers.
  1) alpha particle flipped bits
  2) DRAMs come in difficult to solder-inspect packages
  3) DRAMs (like all dynamic circuits) are prone to forgetfulness

One at a time:

  1) alpha particle flipped bits

Quoting from Siemens Information report #6: "Alpha particles are
doubly charged helium nuclei emitted in the radioactive decay of many
radioactive elements (principally Uranium & Thorium).  Naturally
occurring alphas range in energy from about 2 to 9 MeV and are treated
as classical particles.  An alpha interacts electronically with
silicon creating a track of electron-hole pairs along the 25 um
straight line path of the particle."  A track of electron-hole pairs
conducts, by the way.

Data bits are stored as charge on a capacitor in the memory cell, and
are read (sensed) by connecting the cell to a sense amplifier via a bit
line shared by other cells.  Thus if an alpha particle zips through
your capacitor, it can flip a bit in memory.  If it zips through the
bit line while you are reading, you get wrong data, and the data gets
written back wrong.  (Internally, DRAM reads are destructive, and are
always followed by a restore.)  For CMOS 1 Meg parts, a typical error
rate is 270 failures per 10^9 device hours (270 FIT).  According to
Siemens, bit line failures now dominate alpha sensitivity.  Both of
these mechanisms lead to INTERMITTENT SINGLE BIT ERRORS.
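
To put 270 failures per 10^9 device hours in perspective, a
back-of-the-envelope calculation helps.  The sketch below uses my own
assumptions, not Siemens's: memory built from 1 Mbit x 1 parts with a
ninth parity chip per byte, and chips that fail independently at the
quoted rate.  The memory sizes are arbitrary examples.

```python
HOURS_PER_YEAR = 24 * 365

def chips_needed(megabytes, with_parity):
    """Number of 1 Mbit x 1 DRAMs for the given memory size (assumed organization)."""
    bits_per_byte = 9 if with_parity else 8
    return megabytes * bits_per_byte      # one 1 Mbit chip per bit position per Mbyte

def mean_hours_between_errors(n_chips, failures_per_1e9_hours=270):
    """Mean time between soft errors anywhere in the array, assuming independent chips."""
    return 1e9 / (failures_per_1e9_hours * n_chips)

def errors_per_year(n_chips, failures_per_1e9_hours=270):
    """Expected number of soft errors per year of continuous operation."""
    return HOURS_PER_YEAR / mean_hours_between_errors(n_chips, failures_per_1e9_hours)

if __name__ == "__main__":
    for mbytes in (8, 32, 128):
        n = chips_needed(mbytes, with_parity=True)
        print(f"{mbytes:4d} Mbytes: {n:5d} chips, "
              f"{errors_per_year(n):.2f} expected soft errors per year")
```

Under those assumptions an 8 Mbyte machine (72 chips) sees roughly one
soft error every six years, while a 128 Mbyte machine sees between two
and three a year.  The rate scales directly with the number of chips.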

Now, what is uranium doing next to your chips?  It can be a
contaminant in the silicon or aluminum, or in the package.  There is
a story where Amdahl built a series of mainframes with no error
correction.  Their DRAM vendor packaged a batch of DRAMs in ceramic
DIPs with a good dose of uranium contaminating the ceramic, and the
resulting mainframes wouldn't stay up for more than a day.  Of course,
they failed a different way each time.  Amdahls now have ECC.

  2) DRAMs come in difficult to solder-inspect packages

Crack open the top of your Sparcstation or Iris for this one...  The
most popular package for DRAMs these days is the SOJ; the leads curl
underneath the chip and are impossible to inspect.  The most popular
packages for logic are either through-hole (like pin grid arrays), or
gull wing (plastic quad flat pack).  Those big gate arrays are PQFPs.
Both are easy to inspect.  Now if not quite enough solder paste gets
squeegeed through the stencil when they make the board, and/or if
the chip has a slightly non-coplanar lead, then instead of being
soldered to the board, the lead ends up resting on a bump of solder
below it.  Because of the springiness in the leads, this will work for
a while, but eventually oxidation will cause intermittent contact.
Now, most DRAMs used for main memory are 1 bit wide parts, so each chip
supplies only one bit of any given word, and a bad joint results in
INTERMITTENT SINGLE BIT ERRORS.
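
The reason a bad joint shows up as a single-bit error is the memory
organization.  As a rough sketch (hypothetical names, and assuming a
bank of eight x1 data chips plus one x1 parity chip with even parity per
byte), each chip supplies exactly one bit position of every byte, so one
flaky data lead can corrupt at most one bit per read:

```python
import random

N_CHIPS = 9          # assumed: 8 data chips + 1 parity chip, each 1 bit wide

def parity(bits):
    """Even-parity bit for a list of 0/1 values."""
    return sum(bits) & 1

def write_byte(value):
    """Split a byte across the bank: chip i holds bit i, chip 8 holds parity."""
    data = [(value >> i) & 1 for i in range(8)]
    return data + [parity(data)]

def read_byte(stored, flaky_chip=None, rng=random):
    """Read the bank; a flaky lead makes that one chip's output bit random."""
    bits = list(stored)
    if flaky_chip is not None:
        bits[flaky_chip] = rng.getrandbits(1)   # modeled as floating to 0 or 1
    value = sum(b << i for i, b in enumerate(bits[:8]))
    parity_error = parity(bits[:8]) != bits[8]
    return value, parity_error

def undetected_corruptions(flaky_chip, trials=10000, seed=1):
    """Count reads that return wrong data without raising a parity error."""
    rng = random.Random(seed)
    missed = 0
    for _ in range(trials):
        original = rng.randrange(256)
        value, parity_error = read_byte(write_byte(original), flaky_chip, rng)
        if value != original and not parity_error:
            missed += 1
    return missed

if __name__ == "__main__":
    for chip in range(N_CHIPS):
        print("flaky chip", chip, "undetected:", undetected_corruptions(chip))
```

Whichever chip is flaky, no wrong byte gets past the parity check in
this model.  With x4 or wider parts one bad chip could disturb several
bits of the same word, and that guarantee would no longer hold.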

Why are DRAMs packaged in impossible-to-inspect packages?  Because SOJs
pack more densely than gull wing parts, and besides, PARITY WILL DETECT
ANY PROBLEMS ANYWAY.

  3) DRAMs (like all dynamic circuits) are prone to forgetfulness

DRAMs store a bit by either charging or not charging a tiny capacitor;
the charge on the capacitor must be refreshed every 15 ms before it
dissipates.  Normally this works fine, but marginal chips are prone to
data retention problems, especially at high temperatures and at
out-of-spec voltages.  Dynamic circuits are used in many CMOS
microprocessors as well, but typically refreshing is not a problem (it
happens on every clock tick, for example).

>Has memory parity become a sensless security blanket for the insecure and
>uninformed?

Probably.  Pretty soon error correction will be standard on all machines
with significant memory sizes.
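
For anyone who hasn't seen how correction differs from plain parity,
here is a toy sketch using the textbook Hamming(7,4) code.  It is not
any particular vendor's ECC; real memory systems use much wider codes
over whole words, and the function names here are just for
illustration.  The point is only that a few extra check bits can locate
a flipped bit, not merely detect it.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused so indices match positions
    code[3], code[5], code[6], code[7] = d
    for p in (1, 2, 4):
        # parity bit p covers every position whose index has bit p set
        code[p] = sum(code[i] for i in range(1, 8) if i & p) & 1
    return code[1:]

def hamming74_decode(codeword):
    """Return (nibble, corrected_position); position 0 means no error seen."""
    code = [0] + list(codeword)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) & 1:
            syndrome |= p
    if syndrome:
        code[syndrome] ^= 1             # the syndrome is the position of the bad bit
    d = (code[3], code[5], code[6], code[7])
    return sum(b << i for i, b in enumerate(d)), syndrome

if __name__ == "__main__":
    word = hamming74_encode(0b1011)
    word[4] ^= 1                        # flip the bit at position 5
    print(hamming74_decode(word))       # recovers the nibble and reports position 5
```

Any single flipped bit in the 7-bit codeword is located and repaired.
Two flipped bits would be miscorrected by this small code, which is why
practical memory ECC adds one more check bit to detect double errors.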

>I'd like to see a comparison of the probability of a memory parity error
>causing a business to make a significant financial mistake, versus the
>probability of a software error causing the mistake.

True.  But soft memory errors, like bad disk blocks, are a solved problem.
Software errors are not.

        -will
--
-----------------------------------------------------------------
leavitt@mordor.hw.stratus.com