Xref: utzoo comp.sys.next:17909 comp.arch:22863
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!think.com!mintaka!ogicse!milton!mrc
From: mrc@milton.u.washington.edu (Mark Crispin)
Newsgroups: comp.sys.next,comp.arch
Subject: Re: parity is for farmers?
Message-ID: <1991May22.234515.24685@milton.u.washington.edu>
Date: 22 May 91 23:45:15 GMT
References: <1991May21.232331.24888@cs.umn.edu>
Organization: University of Washington, Seattle
Lines: 38

In article <1991May21.232331.24888@cs.umn.edu> scott@poincare.geom.umn.edu (Scott S. Bertilson) writes:
> Does anyone else get nervous about the fact that NeXT ships their machines
>with 8 megabytes of non-parity memory? Is memory so reliable today that
>parity doesn't give enough benefit to bother with? Does only ECC give a
>strong enough guarantee - and that is too expensive, so we should just
>go without?

With core memory, a single magnetic core failing would cause a single bit error at a specific location. Parity is great for detecting that kind of error. Chances are, it didn't happen at a critical location (critical for the operating system, anyway), so if your operating system is clever enough, it could abort the affected process (with suitable logging) and mark that memory page as bad (so that it is not used again).

Another possibility with core memory is the failure of a single line (row or column) that causes the loss of bit n in the locations in a particular memory range. This sort of failure has greater impact, but there is still the chance of a software recovery (albeit not of the process that hit the error) and the continuation of the system in a degraded mode.

Semiconductor memory is a different story. My experience with semiconductor memory suggests that failures are catastrophic and massive. Also, modern software using virtual memory tends to scatter kernel-critical pages throughout physical memory.
Put another way, if any of the SIMMs in a NeXT were to fail while the system was running, the resulting data scrambling would tend to cause an immediate failure of the system, probably before the parity trap code would get to run, much less print out any diagnostics.

Finally, note that you are not running a multi-user timesharing system. The crash of an individual NeXT is not as horrible an event as the crash of a timesharing system with 150 logged-in users. There are enough system-crash software bugs in 2.1 that crashes are to be expected.

The main danger of a memory error is one in which the error happens *without* the system crashing -- in effect, undetected.
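
To make the detected-versus-undetected distinction concrete, here is a minimal sketch of what a single parity bit can and cannot catch. It is only an illustration: the one-parity-bit-per-byte layout and the helper names (parity_bit, store, parity_ok, flip_bits) are assumptions made up for this example, not a description of any particular memory hardware.

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 if the number of 1 bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte):
    """Model a stored memory word as (data byte, parity bit computed at write time)."""
    return (byte & 0xFF, parity_bit(byte))


def parity_ok(word):
    """Recompute parity on read and compare it with the stored parity bit."""
    data, stored = word
    return parity_bit(data) == stored


def flip_bits(word, *positions):
    """Simulate a memory fault by flipping the given data bit positions."""
    data, stored = word
    for p in positions:
        data ^= 1 << p
    return (data, stored)


word = store(0b10110010)
single = flip_bits(word, 3)      # one bit flipped
double = flip_bits(word, 3, 5)   # two bits flipped

print(parity_ok(word))    # True: clean read
print(parity_ok(single))  # False: single-bit error is detected
print(parity_ok(double))  # True: double-bit error passes the check unnoticed
```

Flipping any odd number of bits changes the recomputed parity, so the check fails and the error is caught. Flipping an even number of bits leaves the parity unchanged, so the corrupted data reads back as if it were good.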