Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!umcp-cs!chris
From: chris@umcp-cs.UUCP (Chris Torek)
Newsgroups: net.unix,net.unix-wizards
Subject: Re: 4.2 "soft ecc" errors
Message-ID: <3712@umcp-cs.UUCP>
Date: Mon, 6-Oct-86 21:04:27 EDT
Article-I.D.: umcp-cs.3712
Posted: Mon Oct 6 21:04:27 1986
Date-Received: Wed, 8-Oct-86 06:45:40 EDT
References: <4072@brl-smoke.ARPA>
Reply-To: chris@umcp-cs.UUCP (Chris Torek)
Organization: University of Maryland, Dept. of Computer Sci.
Lines: 93
Xref: mnetor net.unix:5814 net.unix-wizards:8170

(Since I have seen no summary of replies, and since I can answer
most of these, I shall ignore the `reply by mail' request.)

In article <4072@brl-smoke.ARPA> vader!root@LBL-CSAM.arpa (RADIX System)
writes:
>... I get the following error message at about 10 minute intervals:
>
>	mcr0: soft ecc addr xxx syn yy
>
>I also get the following when we boot:
>
>	WARNING: should run interleaved swap with >= 2MB
>
>1) How do I "run interleaved"?

This refers to swap/paging partitions. If you have two or more
disc drives, you should set up swap areas on at least two. See
`Building Systems with Config'. Multiple swap areas is supposed
to be faster. Whether it is in fact faster is a function of many
variables.

>2) Is the boot message an indication of why I am getting the other
>messages?

No.

>3) If I go back to 4.1, I don't see the "ecc" message (or the other
>one, for that matter). Is there really something wrong with my memory
>boards?

Yes. 4.1 had less support for 750s, and presumably did not catch
750 ECC errors.

>4) I have discovered that the "ecc" message is (likely) from
>/usr/sys/vax/machdep.c

It is indeed.

>and I have found several
>	#if TRENDATA
>	...
>	#endif
>lines. But when I defined TRENDATA as an "optional" in my kernel
>configuration file (and reboot), the same error messages continue
>to come out. Am I missing some "bugfix" code for TRENDATA memory
>on a 750?
>(Looks like most of the TRENDATA mods are for 780 machines.)

The Trendata tables are for specific boards, probably for 780s.
Whether they apply to yours is questionable. In any case, Trendata
should have provided you with, or be able to provide you with,
decoding tables. If Trendata understands only VMS format errors,
just concatenate `xxx' and `yy' and pad with zeroes on the left:

	mcr0: soft ecc addr 54f90 syn e3

means the same as VMS's

	?VMS-W-WARNINGMESSAGE, ridiculously long error string that
	lets you know something is wrong, but is no more help than
	`soft ecc addr ...' when it comes to figuring out just what,
	but fortunately you can look it up in some manual, which
	will of course just tell you to call Field Service,
	ERR ADDR=054F90E3

>5) Besides risking the filling of my disk from /usr/adm/messages, is
>there any other danger in ignoring the error messages?

Yes. If another few chips fail, you will no longer get soft
(correctable) errors; you will get crashes.

Incidentally, just because you see the messages only once every
ten minutes does not mean the ECC correction is infrequent. The
code in /sys/vax/machdep.c disables ECC reporting after each error,
then re-enables it ten minutes later. This is controlled by the
variable `memintvl', which is in seconds:

	% su
	Password:
	# adb -w /vmunix /dev/kmem
	memintvl/W 1
	_memintvl:	_memintvl:	258	=	1
	$q
	#

will re-enable reporting after one second. Stand back from the
console, and have plenty of paper handy! Rebooting will restore
the ten minute interval; or you can use adb again to change it
back.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1516)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu
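
Appended sketch: the "concatenate `xxx' and `yy' and pad with zeroes
on the left" rule described above can be written out in a few lines of
Python. This is only an illustration, not anything from the kernel or
from Trendata. The function name vms_err_addr, the eight-digit width
(taken from the single example above), and the choice to pad the
syndrome to two hex digits before joining are all assumptions.

```python
import re

# Matches the console message as quoted in the post, e.g.
# "mcr0: soft ecc addr 54f90 syn e3".
_SOFT_ECC = re.compile(r"soft ecc addr ([0-9a-fA-F]+) syn ([0-9a-fA-F]+)")


def vms_err_addr(message, width=8):
    """Return the VMS-style ERR ADDR string for a soft ecc message.

    Hypothetical helper: the width of 8 comes from the one example in
    the post and is not a documented format.
    """
    m = _SOFT_ECC.search(message)
    if m is None:
        raise ValueError("not a soft ecc message: %r" % message)
    addr, syn = m.groups()
    # Assumption: the syndrome occupies two hex digits, so a one-digit
    # value is padded before it is appended to the address.
    joined = addr + syn.zfill(2)
    # Pad with zeroes on the left, as the post describes.
    return joined.upper().zfill(width)


if __name__ == "__main__":
    print("ERR ADDR=" + vms_err_addr("mcr0: soft ecc addr 54f90 syn e3"))
```

Run on the example message from the post, this prints
ERR ADDR=054F90E3, matching the VMS form shown above.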