Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!snorkelwacker!apple!olivea!tymix!cirrusl!sunstorm!douglas
From: douglas%cirrusl@oliveb.ATC.olivetti.com (Douglas Lee)
Newsgroups: comp.arch
Subject: Re: Workstation Data Integrity
Message-ID: <2361@cirrusl.UUCP>
Date: 5 Sep 90 18:21:08 GMT
References: <6797.26d6edce@vax1.tcd.ie> <56qmo1w162w@zl2tnm.gp.govt.nz> <19875@crg5.UUCP> <19208@dime.cs.umass.edu> <2201@lectroid.sw.stratus.com> <68362@sgi.sgi.com> <1990Sep4.163619.24726@zoo.toronto.edu> <68505@sgi.sgi.com> <2483@crdos1.crd.ge.COM>
Sender: news@cirrusl.UUCP
Organization: Cirrus Logic Inc.
Lines: 89

DRAM manufacturers express reliability in terms of FITs (Failures In
Time). One FIT represents one error in one billion (10 ^ 9) hours of
operation. Toshiba claims a FIT rate of 252 for 1 Mb DRAMs. Clearpoint,
which makes add-in memory boards, claims the actual rate is 1000 FITs.
The FIT rate steadily decreased with each successive generation of DRAMs
until the 4 Mb. The FIT rate for 4 Mb DRAMs is higher than for 1 Mb.

From the FIT rate you can calculate the MTBF of any memory system. The
MTBF in hours for one DRAM is calculated as 10 ^ 9 / FIT. The MTBF for a
system is just the MTBF of each DRAM divided by the total number of
DRAMs in the system. We are only looking at single bit errors here.

Assuming a FIT rate of 252:

	# of DRAMs	Memory Size	MTBF
	    32		  4 MB		14.1 years
	    96		 12 MB		 4.7 years
	   160		 20 MB		 2.8 years

Assuming a FIT rate of 1000:

	# of DRAMs	Memory Size	MTBF
	    32		  4 MB		3.6 years
	    96		 12 MB		1.2 years
	   160		 20 MB		260 days

For most PCs (memories < 12 MB), single bit errors due to soft errors
should occur rarely. The FIT rate really only measures errors due to
alpha particle radiation. There can be more soft errors caused by power
supply spikes, dropouts, etc. that have not been accounted for here.
These push the effective FIT rate up, reducing the MTBF.

The thing to realize here is that parity will actually make the MTBF go
down.
This is because more parts are added, so more things can fail. Parity
does allow you to detect these errors, however.

Error detection and correction (EDAC) has been mentioned as an
alternative, and it is used in many workstations (e.g. Sun). One of the
most popular parts is the Am29C660 and its predecessor, the Am2960. This
part uses a modified Hamming code to detect and correct single bit
errors and to detect double bit errors. It will in fact detect many
multi-bit errors and catastrophic failures such as all 0's or all 1's.
The part appends 7 bits to a 32 bit word and 8 bits to a 64 bit word
(two parts are cascaded). For 32 bits the overhead is greater than
parity, 7 vs. 4, but at 64 bits you break even (8 vs. 8). Similar parts
are made by IDT, and many workstation manufacturers implement the same
function in gate arrays.

The advantage of this scheme is that all single bit errors are
corrected. Also, during refresh cycles the EDAC can scrub memory. This
is done by reading one memory location and correcting any single bit
error during each refresh cycle. By appropriately partitioning memory,
the entire memory can be scrubbed in a short time, which prevents single
bit errors from accumulating into double bit errors.

To calculate the probability of two bit errors occurring, the birthday
paradox is used. This gives the probability of two single bit errors
occurring in the same memory word.

Assuming 32 bit words and 252 FITs:

	# of DRAMs	Memory Size	MTBF
	    39		  4 MB		14,907 years
	   117		 12 MB		 8,607 years
	   195		 20 MB		 6,667 years

For 1000 FITs:

	# of DRAMs	Memory Size	MTBF
	    39		  4 MB		3,757 years
	   117		 12 MB		2,168 years
	   195		 20 MB		1,680 years

This increase is overstated, since you have added extra circuitry and
devices that can cause other failures to occur. The expected total
system MTBF is 50 to 60 times that of the non-EDAC system. If scrubbing
is used, then this will be even higher.
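The arithmetic behind the tables above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the
method the figures were necessarily produced with: the function names are
made up for the example, the DRAMs are assumed to be 1 bit wide (so each
check bit per word costs one extra DRAM per 32 data DRAMs), a year is
taken as 8760 hours, and the birthday paradox step uses the common
approximation sqrt(pi * N / 2) for the expected number of errors before
two of them fall in the same one of N words.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760; an assumption, not stated in the post


def dram_count(data_drams, data_bits, check_bits):
    """DRAMs needed once check bits are added to each word.

    Assumes 1-bit-wide DRAMs, so each extra bit per word costs one
    extra DRAM for every `data_bits` data DRAMs.
    """
    return data_drams * (data_bits + check_bits) // data_bits


def single_error_mtbf_years(fit, drams):
    """Mean time between single bit errors for the whole memory."""
    hours = 1e9 / fit / drams
    return hours / HOURS_PER_YEAR


def double_error_mtbf_years(fit, drams, words):
    """Mean time until two errors land in the same word (no scrubbing).

    Assumption: birthday-paradox approximation sqrt(pi * words / 2)
    for the expected number of errors before two share a word.
    """
    errors_until_collision = math.sqrt(math.pi * words / 2)
    return errors_until_collision * single_error_mtbf_years(fit, drams)


if __name__ == "__main__":
    for fit in (252, 1000):
        for megabytes in (4, 12, 20):
            data_drams = megabytes * 8       # 1 Mb parts, as in the tables
            words = megabytes * 2**20 // 4   # 32 bit words
            plain = single_error_mtbf_years(fit, data_drams)
            parity = single_error_mtbf_years(
                fit, dram_count(data_drams, 32, 4))
            edac = double_error_mtbf_years(
                fit, dram_count(data_drams, 32, 7), words)
            print(f"{fit:4d} FIT {megabytes:2d} MB: "
                  f"no check bits {plain:5.2f} y, "
                  f"parity {parity:5.2f} y, EDAC {edac:7.0f} y")
```

Under these assumptions the EDAC column comes out within about a year of
the figures in the tables, and the parity column shows the effect of the
extra parts: at 252 FITs a 4 MB memory drops from roughly 14 years
between single bit errors to roughly 12.6 years once four parity DRAMs
are added.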
What this also neglects is that many single bit errors can occur in
memory locations that are not used, or that are not read before they are
written again. Therefore, the system may not detect all the parity
errors that occur.

I would expect that most 64 bit memories will have EDC circuits,
especially memories using DRAMs > 1 Mb. Some PC companies have looked at
EDC, but found it too expensive to justify putting in the box.

I should disclose that I worked for Advanced Micro Devices supporting
the Am29C660. I am no longer affiliated with them.

I hope this answers some of the questions about memory reliability.

Douglas Lee