Path: utzoo!attcan!uunet!dino!uxc.cso.uiuc.edu!tank!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.unix.questions
Subject: Re: WRT Disk errors on 11/750 running 4.3 BSD
Keywords: memory controller,memory,disk,11/750,vax,4.3
Message-ID: <18703@mimsy.UUCP>
Date: 24 Jul 89 06:08:51 GMT
References: <117@egrunix.UUCP>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 29

[re `mcr%d: soft ecc addr %x syn %x' errors]

In article <117@egrunix.UUCP> hacker@egrunix.UUCP (Thomas J Hacker) writes:
>... As long as you only see "soft" errors, and they don't occur "too
>often", you can just ignore them forever.

This is ill-advised. The purpose behind error-detecting-and-correcting
memory is to fix the errors *and* provide a report so that failing chips
can be replaced when it is convenient to halt the machine, rather than
immediately after losing whatever was in progress.

>("too often": we had a 780 that would routinely report 10-12 of
>those mcr0 errors per hour, and other than wasting console paper,
>caused no other apparent problems. It was like this for years.)

4BSD shuts off further error reports for ten minutes after each error,
so a machine that reports six errors per hour probably has at least one
hard failure (by this I mean `one chip that is really, truly bad': both
`soft' and `hard' ECC errors can be due to either `soft' or `hard'
hardware errors; a soft hardware error is like the noise your car makes
whenever it is *not* in the shop). In this case a single stray cosmic
ray or alpha particle can bring the machine down with an uncorrectable
double-bit error, or, worse, corrupt two or more bits undetectably.
Running with a known hard failure is rather like driving your Honda
around when one cylinder is out---it works, but you should fix it as
soon as you possibly can.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris
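
The arithmetic behind "six errors per hour" can be checked with a small model. What follows is a rough sketch in Python, not the 4BSD kernel code: the function name count_reports, the constant HOLDOFF_SECONDS, and the assumption that reporting comes back on exactly 600 seconds after each report are all illustrative. It shows only that, with a ten-minute quiet period, six reports per hour is the ceiling, so a machine sitting at that ceiling is taking errors about as fast as it is allowed to say so.

```python
HOLDOFF_SECONDS = 10 * 60  # the ten-minute quiet period described in the post


def count_reports(error_times, holdoff=HOLDOFF_SECONDS):
    """Count how many corrected errors would actually be reported.

    error_times: ascending timestamps, in seconds, of corrected errors.
    Each report suppresses further reports for `holdoff` seconds
    (an assumed, simplified model of the behaviour described above).
    """
    reports = 0
    quiet_until = None  # no report has been issued yet
    for t in error_times:
        if quiet_until is None or t >= quiet_until:
            reports += 1
            quiet_until = t + holdoff
    return reports


if __name__ == "__main__":
    one_hour = 3600
    # A chip that fails constantly: one corrected error every second.
    print(count_reports(range(0, one_hour)))  # 6
    # A rare transient upset: a single error in the hour.
    print(count_reports([1234]))              # 1
```

Under this model a machine taking 3600 errors an hour and one taking exactly six well-spaced errors an hour print the same thing on the console, which is why a steady six per hour should be read as "at least six", not "only six".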