Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!know!zaphod.mps.ohio-state.edu!mips!orac!cprice
From: cprice@mips.COM (Charlie Price)
Newsgroups: comp.arch
Subject: Re: Workstation Data Integrity
Message-ID: <40694@mips.mips.COM>
Date: 8 Aug 90 19:59:33 GMT
References: <1990Aug3.204358.330@portia.Stanford.EDU>
Sender: news@mips.COM
Reply-To: cprice@mips.COM (Charlie Price)
Organization: MIPS Computer Systems, Inc.
Lines: 76

In article <1990Aug3.204358.330@portia.Stanford.EDU> jackk@shasta.stanford.edu (Jack Kouloheris) writes:
>I'm a bit puzzled by the lack of any type of memory error detection/
>correction on many workstations and high-end PCs. These workstations
>are beginning to have memories that rival or exceed those of
>the previous generation of minicomputers, which almost always used
>some sort of ECC protection. Do manufacturers feel that it isn't needed
>any more ?
>A 1Mbit DRAM chip may have a typical soft error rate of
>.001-.005 PPM/KPOH/bit. Suppose we have a workstation with
>16 Megabytes of memory ( = approx 1.34 * 10^ 8 bits). This
>yields a memory system error rate of .671 errors/KPOH, a non-negligible
>number. Servers may have even more memory than this, and may
>be running continually, so some errors are bound to occur. What
>happens if a bit flips, and then the data is paged out or written to
>a file ? The error is now permanent and can propagate.
>Why does no one worry about this ?
>
>Some SUNs have parity checking on the memory system, but what does
>the OS do when a parity error occurs, since correction is not
>possible ?

The answer seems to be that the user community "votes" with its money
for particular performance/reliability/cost configurations, and that
is what gets produced. Successful vendors of general-purpose systems
build systems that have market-success-defined "acceptable" error
rates and that sell for an "acceptable" amount of money.

MIPS, for example, produces both systems with parity and systems
with ECC.
The "little" machines, the tower-like servers and the workstations,
use parity. The tower-like machines have custom memory cards and the
workstations use SIMMs. The bigger machines, the M/2000, the RC3260,
and the RC6280, all use ECC with 1-bit correction and 2-bit detection
on large (9U) custom boards. The caches for all these machines are
parity-protected (and with a write-through cache, you just refetch
from main memory when you see a cache parity error).

Parity detects most memory errors (any odd number of flipped bits in
the protected unit), at the moderate cost of one extra bit per unit
of data (typically per byte, but it could be per word) and a fairly
simple parity tree to check/generate parity.

ECC is quite a bit more expensive than parity. You need several
extra bits per word, which makes SIMMs less easy to use, and you need
a more complicated device to generate and check ECC. With a fast
memory system you probably have to use multiple ECC chips (or VERY
fast ECC chips), since you use multiple memory banks to achieve high
memory bandwidth. This all adds to manufacturing cost, design cost,
testing cost, software cost...

Most PCs (including the Macs I've seen) don't have, or at least
don't use, parity. They silently accept occasional wrong computations
rather than stop a computation that gets a transient memory error.
Cost seems to be extremely important for PCs.

For some uses, real workstations among them, the acceptable level of
error seems to be "occasionally" having a computation explicitly fail
(system panic or process killed) rather than silently producing an
erroneous result. Cost also seems to be important for success in
workstations. Parity is OK for this environment then (at least by
demonstration).

A server, or a system that needs to support more reliable
computation, may include ECC to overcome alpha-particle hits.

Real fault tolerance is yet another topic, and though there are
companies that do well in that market, most of us don't want to pay
for it.
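To put rough numbers on the trade-off, here is a minimal Python sketch.
It is illustrative only and does not describe any vendor's hardware.
The function names are hypothetical, the error-rate arithmetic uses the
figures from the quoted article, and the check-bit count assumes a
textbook Hamming code extended with one overall parity bit (SEC-DED),
which may differ from the code a given memory board actually uses.

```python
def soft_errors_per_kpoh(mem_bytes, rate_ppm_per_kpoh_per_bit):
    """Expected soft errors per 1000 power-on hours for the whole memory.

    The rate is given in parts per million per KPOH per bit, as in the
    quoted article, so it is scaled by 1e-6.
    """
    bits = mem_bytes * 8
    return bits * rate_ppm_per_kpoh_per_bit * 1e-6


def even_parity_bit(byte):
    """Parity bit that gives the 9-bit group an even number of ones."""
    return bin(byte & 0xFF).count("1") & 1


def parity_ok(byte, stored_parity):
    """True if the byte still matches the parity bit stored with it."""
    return even_parity_bit(byte) == stored_parity


def secded_check_bits(data_bits):
    """Check bits for a Hamming SEC code plus one overall parity bit.

    Assumption: a textbook construction, where r check bits suffice
    for single-error correction when 2**r >= data_bits + r + 1.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # the extra bit adds double-error detection


if __name__ == "__main__":
    mem = 16 * 2**20  # the 16-Megabyte workstation from the quoted article
    print(soft_errors_per_kpoh(mem, 0.001))  # low end of the quoted range
    print(soft_errors_per_kpoh(mem, 0.005))  # high end, about 0.67

    good = 0b10110010
    p = even_parity_bit(good)
    print(parity_ok(good ^ 0b00000001, p))  # one flipped bit: caught
    print(parity_ok(good ^ 0b00000011, p))  # two flipped bits: missed

    for width in (8, 32, 64):
        print(width, "data bits:", secded_check_bits(width), "check bits")
```

Under these assumptions a 32-bit word needs 7 check bits where per-byte
parity needs 4, and a 64-bit word needs 8 either way, which suggests the
extra storage matters most for narrow words. A double-bit flip within
one byte slips past parity, which is the case the 2-bit detection of ECC
covers.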
-- 
Charlie Price    cprice@mips.mips.com    (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086