Xref: utzoo comp.sys.next:17909 comp.arch:22863
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!think.com!mintaka!ogicse!milton!mrc
From: mrc@milton.u.washington.edu (Mark Crispin)
Newsgroups: comp.sys.next,comp.arch
Subject: Re: parity is for farmers?
Message-ID: <1991May22.234515.24685@milton.u.washington.edu>
Date: 22 May 91 23:45:15 GMT
References: <1991May21.232331.24888@cs.umn.edu>
Organization: University of Washington, Seattle
Lines: 38

In article <1991May21.232331.24888@cs.umn.edu> scott@poincare.geom.umn.edu (Scott S. Bertilson) writes:
>  Does anyone else get nervous about the fact that NeXT ships their machines
>with 8 megabytes of non-parity memory?  Is memory so reliable today that
>parity doesn't give enough benefit to bother with?  Does only ECC give a
>strong enough guarantee - and that is too expensive, so we should just
>go without?

With core memory, a single magnetic core failing would cause a single
bit error at a specific location.  Parity is great for detecting that
kind of error.  Chances are, it didn't happen at a critical location
(critical for the operating system, anyway), so a sufficiently clever
operating system could abort the affected process, log the error, and
mark that memory page as bad so that it is not used again.
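
To make the single-bit case concrete, here is a rough sketch in Python.
It is only an illustration, not code from any real kernel: the
ParityMemory class, the ParityError exception, the 8-bit word, and the
4-word "page" are all assumptions made up for the example.  Each word
is stored with one even-parity bit, and a read that finds a mismatch
records the page as bad before raising.

```python
class ParityError(Exception):
    """Raised when a stored word no longer matches its parity bit."""


def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") & 1


class ParityMemory:
    """Toy memory: 8-bit words, one parity bit per word (hypothetical)."""

    def __init__(self, size: int, page_size: int = 4):
        self.data = [0] * size
        self.parity = [0] * size
        self.page_size = page_size
        self.bad_pages = set()          # pages retired after an error

    def write(self, addr: int, value: int) -> None:
        self.data[addr] = value & 0xFF
        self.parity[addr] = parity_bit(self.data[addr])

    def read(self, addr: int) -> int:
        value = self.data[addr]
        if parity_bit(value) != self.parity[addr]:
            # Record the page so an OS could stop using it, then trap.
            self.bad_pages.add(addr // self.page_size)
            raise ParityError(f"parity error at address {addr}")
        return value


mem = ParityMemory(size=16)
mem.write(5, 0b10110010)
mem.data[5] ^= 0b00001000               # simulate one failed bit
try:
    mem.read(5)
    detected = False
except ParityError:
    detected = True                      # single-bit error is caught
```

Note that the same check says nothing if two bits in one word flip,
since the parity of the word is then unchanged.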

Another possibility with core memory is the failure of a single line
(row or column) that causes the loss of bit n in locations in a
particular memory range.  This sort of failure has greater impact, but
there is still the chance of a software recovery (albeit not of the
process that hit the error) and the continuation of the system in a
degraded mode.
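
The line-failure case can be sketched the same way.  The following is
a toy model under stated assumptions (8-bit words, one even-parity bit
per word, and a failed line modelled as bit n being stuck at zero over
a range of addresses); the function names are invented for the
example.  It shows that parity flags only those words in the range
whose bit n was actually a 1.

```python
def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") & 1


def stuck_at_zero(words, bit, start, end):
    """Return a copy of words with the given bit forced to 0 in [start, end)."""
    mask = ~(1 << bit)
    return [w & mask if start <= i < end else w
            for i, w in enumerate(words)]


def parity_failures(original, damaged):
    """Addresses whose parity (computed at write time) no longer matches."""
    return [i for i, (o, d) in enumerate(zip(original, damaged))
            if parity_bit(o) != parity_bit(d)]


words = [0x0F, 0x10, 0x33, 0x04, 0x55, 0x00, 0xFF, 0x08]
damaged = stuck_at_zero(words, bit=2, start=2, end=7)
flagged = parity_failures(words, damaged)
print(flagged)                           # addresses 3, 4 and 6
```

Words in the failed range whose bit n was already zero read back
correctly and raise no error, so the failure shows up as a scattering
of parity traps across the range rather than one trap per location.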

Semiconductor memory is a different story.  My experience with
semiconductor memory suggests that failures are catastrophic and
massive.  Also, modern software using virtual memory tends to scatter
kernel-critical pages throughout physical memory.

Put another way, if any of the SIMMs in a NeXT were to fail while the
system was running, the resulting data scrambling would tend to cause
an immediate failure of the system, probably before the parity trap
code would get to run, much less print out any diagnostics.
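
Separately from the question of whether the trap code ever runs, it is
worth seeing how little a single parity bit can say about massive
scrambling.  The sketch below assumes one even-parity bit per 8-bit
byte and that only the data bits are corrupted (both are assumptions
for the example, as are the names).  It enumerates every nonzero error
pattern and counts which ones change the parity.

```python
def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") & 1


def is_detected(data: int, error: int) -> bool:
    """True if XORing error into data changes the byte's parity."""
    return parity_bit(data ^ error) != parity_bit(data)


WIDTH = 8
patterns = range(1, 1 << WIDTH)          # all nonzero 8-bit error patterns

# Detection depends only on the error pattern, not on the data byte:
# an error is caught exactly when it flips an odd number of bits.
detected = sum(1 for e in patterns if is_detected(0x00, e))
missed = sum(1 for e in patterns if not is_detected(0x00, e))
print(detected, missed)                  # 128 caught, 127 missed
```

In other words, if a failure scrambles a byte arbitrarily, parity
catches it only about half the time.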

Finally, note that you are not running a multi-user timesharing
system.  The crash of an individual NeXT is not as horrible an event
as the crash of a timesharing system with 150 logged-in users.  There
are enough system-crashing software bugs in 2.1 that crashes are to be
expected anyway.  The main danger is a memory error that happens
*without* the system crashing -- in effect, one that goes undetected.