Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!uwm.edu!zaphod.mps.ohio-state.edu!samsung!xylogics!transfer!lectroid!mordor!leavitt
From: leavitt@mordor.hw.stratus.com (Will Leavitt)
Newsgroups: comp.arch
Subject: Re: Workstation Data Integrity
Message-ID: <2201@lectroid.sw.stratus.com>
Date: 1 Sep 90 02:24:35 GMT
References: <6797.26d6edce@vax1.tcd.ie> <56qmo1w162w@zl2tnm.gp.govt.nz> <19875@crg5.UUCP> <19208@dime.cs.umass.edu>
Sender: usenet@lectroid.sw.stratus.com
Reply-To: leavitt@mordor.sw.stratus.com (Will Leavitt)
Organization: Stratus Computer, Hardware Engineering
Lines: 117

Concerning the additional hardware needed for parity on memory,
Bruce Karsh (karsh@sgi.com) writes:

>But adding the extra bit has a reliability cost too:
>
>  Memory boards need more pins on their connectors.  Mechanical connections
>  are a notorious failure point.
>
>  More power is used so the system runs hotter.  There may need to be more
>  reliance on fans (which are also notorious) to cool the system.
>
>  The component count is increased so there are more components which can
>  potentially fail.
>
>  Parity checking circuitry which can also fail has been added.
>
>  Multiple bit errors may not be detected.

All true... but modern CMOS memory runs ridiculously cool, parity
detects every multi-bit error that flips an odd number of bits (roughly
half of all multi-bit errors), and as we'll see, there are very definite
reasons why memory chips fail more often than other logic.

>We don't usually put parity on floating point processors or internal
>CPU data paths and registers.  Putting it on memory seems like a very
>expensive "spit in the ocean".
>
>Is there some real hard data which shows that memory is so failure-prone
>that parity checking is called for?  If so, why is it that a single bit
>of parity checking is adequate.  Is the failure mode such that even-bit
>failures are by far the most common kind?  The few memory failures that
>I've looked carefully at have been pretty massive, not single-bit.
There are at least 3 DRAM failure mechanisms that aren't applicable to
floating point processors, CPU data paths, and registers.

  1) alpha particle flipped bits
  2) DRAMs come in difficult to solder-inspect packages
  3) DRAMs (like all dynamic circuits) are prone to forgetfulness

One at a time:

1) alpha particle flipped bits

Quoting from Siemens Information report #6:

  "Alpha particles are doubly charged helium nuclei emitted in the
  radioactive decay of many radioactive elements (principally Uranium &
  Thorium).  Naturally occurring alphas range in energy from about 2 to
  9 MeV and are treated as classical particles.  An alpha interacts
  electronically with silicon creating a track of electron-hole pairs
  along the 25 um straight line path of the particle."

A track of electron-hole pairs conducts, by the way.  Data bits are
stored as charge on a capacitor in the memory cell, and are read
(sensed) by connecting the cell to a sense amplifier via a bit line
shared by other cells.  Thus if an alpha particle zips through your
capacitor, it can flip a bit in memory.  If it zips through the bit line
while you are reading, you get wrong data, and the data gets written
back wrong (internally, DRAM reads are destructive, and are always
followed by a restore).  For CMOS 1 Meg parts, a typical error rate is
270 failures per 10^9 device hours.  According to Siemens, bit line
failures now dominate alpha sensitivity.  Both mechanisms lead to
INTERMITTENT SINGLE BIT ERRORS.

Now, what is uranium doing next to your chips?  It can be a contaminant
in the silicon or aluminum, or in the package.  There is a story where
Amdahl built a series of mainframes with no error correction.  Their
DRAM vendor packaged a batch of DRAMs in ceramic DIPs with a good dose
of uranium contaminating the ceramic, and the resulting mainframes
wouldn't stay up for more than a day.  Of course, they failed a
different way each time.  Amdahls now have ECC.
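To put the 270 failures per 10^9 device hours figure in perspective,
here is a rough back-of-the-envelope sketch.  It assumes the figure
applies per chip, that chips fail independently at a constant rate, and
that memory is built from 1 Mbit x1 parts with one parity chip per eight
data chips.  Those assumptions, the function names, and the memory sizes
in the loop are mine for illustration, not from the Siemens report.

```python
# Rough soft-error arithmetic for a bank of DRAMs.
# ASSUMPTIONS (illustrative only): the quoted rate is per chip, chips
# fail independently at a constant rate, and memory is made of
# 1 Mbit x1 parts, i.e. 8 data chips per megabyte.

FAILURES_PER_1E9_HOURS = 270      # per-chip figure quoted in the post
HOURS_PER_YEAR = 24 * 365


def chips_for(megabytes, parity=True):
    """Number of 1 Mbit x1 chips needed for the given memory size."""
    data_chips = megabytes * 8
    parity_chips = data_chips // 8 if parity else 0   # one per byte lane
    return data_chips + parity_chips


def mean_hours_between_errors(chips, rate=FAILURES_PER_1E9_HOURS):
    """Mean time between soft errors anywhere in the array, in hours."""
    return 1e9 / (chips * rate)


if __name__ == "__main__":
    for mb in (8, 64):            # hypothetical memory sizes
        n = chips_for(mb)
        hours = mean_hours_between_errors(n)
        print(f"{mb:3d} MB: {n:4d} chips, one soft error every "
              f"{hours:,.0f} hours ({hours / HOURS_PER_YEAR:.1f} years)")
```

Under these assumptions the per-chip rate looks tiny, but it scales
linearly with chip count, so the interval between errors shrinks in
direct proportion to memory size.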
2) DRAMs come in difficult to solder-inspect packages

Crack open the top of your Sparcstation or Iris for this one...  The
most popular package for DRAMs these days is the SOJ; the leads curl
underneath the chip and are impossible to inspect.  The most popular
packages for logic are either through hole (like pin grid arrays), or
gull wing (plastic quad flat pack).  Those big gate arrays are PQFPs.
Both are easy to inspect.

Now if not quite enough solder gets squeegeed through the solder mask
when they make the board, and/or if the chip has a slightly non-coplanar
lead, then instead of being soldered to the board, the lead ends up
resting on a bump of solder below it.  Because of the springiness in the
leads, this will work for a while, but eventually oxidation will cause
intermittent contact.  Now, most DRAMs used for main memory are 1 bit
wide parts, so this results in INTERMITTENT SINGLE BIT ERRORS.

Why are DRAMs packaged in impossible to inspect packages?  Because they
are denser than gull wing, and besides PARITY WILL DETECT ANY PROBLEMS
ANYWAY.

3) DRAMs (like all dynamic circuits) are prone to forgetfulness

DRAMs store a bit by either charging or not charging a tiny capacitor;
the charge on the capacitor must be refreshed every 15ms before it
dissipates.  Normally this works fine, but marginal chips are prone to
data retention problems, especially at high temperatures and out of
spec voltage ranges.  Dynamic circuits are used in many CMOS
microprocessors as well, but typically refreshing is not a problem (it
happens on every clock tick, for example).

>Has memory parity become a senseless security blanket for the insecure and
>uninformed?

Probably.  Pretty soon error correction will be standard on all machines
with significant memory sizes.

>I'd like to see a comparison of the probability of a memory parity error
>causing a business to make a significant financial mistake, versus the
>probability of a software error causing the mistake.

True.
But soft memory errors, like bad disk blocks, are a solved problem.
Software errors are not.

   -will
--
-----------------------------------------------------------------
leavitt@mordor.hw.stratus.com