Path: utzoo!attcan!uunet!dino!uxc.cso.uiuc.edu!tank!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.unix.questions
Subject: Re: WRT Disk errors on 11/750 running 4.3 BSD
Keywords: memory controller,memory,disk,11/750,vax,4.3
Message-ID: <18703@mimsy.UUCP>
Date: 24 Jul 89 06:08:51 GMT
References: <117@egrunix.UUCP>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 29

[re `mcr%d: soft ecc addr %x syn %x' errors]

In article <117@egrunix.UUCP> hacker@egrunix.UUCP (Thomas J Hacker) writes:
>... As long as you only see "soft" errors, and they don't occur "too
>often", you can just ignore them forever.

This is ill-advised.  The purpose behind error-detecting-and-correcting
memory is to fix the errors *and* provide a report so that failing chips
can be replaced when it is convenient to halt the machine, rather than
immediately after losing whatever was in progress.

>("too often": we had a 780 that would routinely report 10-12 of
>those mcr0 errors per hour, and other than wasting console paper,
>caused no other apparent problems.  It was like this for years.)

4BSD shuts off further error reports for ten minutes after each error,
so six reports per hour is the most it will print; a machine that
reports that many probably has at least one
hard failure (by this I mean `one chip that is really, truly bad':
both `soft' and `hard' ECC errors can be due to either `soft' or `hard'
hardware errors; a soft hardware error is like the noise your car makes
whenever it is *not* in the shop).  In this case a single stray cosmic
ray or alpha particle can bring the machine down with an uncorrectable
double-bit error, or, worse, corrupt two or more bits undetectably.
Running with a known hard failure is rather like driving your Honda
around when one cylinder is out---it works, but you should fix it as
soon as you possibly can.
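
To make the arithmetic concrete, here is a small sketch in C of a
ten-minute quiet period.  This is not the 4BSD kernel code:
reports_logged() and SUPPRESS_SECS are names made up for illustration,
and the model assumes errors arrive at a fixed period.

```c
#include <assert.h>

#define SUPPRESS_SECS (10 * 60)	/* assumed quiet period after each report */

/*
 * Hypothetical model: an error occurs every `period' seconds, starting
 * at time 0.  Return how many of them are reported during the first
 * `duration' seconds when each report silences the next SUPPRESS_SECS.
 */
static int
reports_logged(int period, int duration)
{
	int t, logged = 0;
	int quiet_until = 0;	/* reports are suppressed before this time */

	assert(period > 0);	/* otherwise the loop never advances */
	for (t = 0; t < duration; t += period) {
		if (t >= quiet_until) {
			logged++;
			quiet_until = t + SUPPRESS_SECS;
		}
	}
	return (logged);
}
```

With an error every second, reports_logged(1, 3600) comes to 6; with
one every fifteen minutes, reports_logged(900, 3600) comes to 4, since
none of those is suppressed.  Six per hour on the console thus says
only that the true rate is at least that high.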
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris