Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!egrunix!hacker
From: hacker@egrunix.UUCP (Thomas J Hacker)
Newsgroups: comp.unix.questions
Subject: WRT Disk errors on 11/750 running 4.3 BSD
Keywords: memory controller,memory,disk,11/750,vax,4.3
Message-ID: <117@egrunix.UUCP>
Date: 21 Jul 89 14:09:35 GMT
Organization: Oakland University, Rochester, MI
Lines: 57

As promised, here is a posting of the responses.

Thanks to the following people for responding:

Larry Parmelee   parmelee@cs.cornell.edu
Guy Harris       guy@bootme.auspex.com

(Sorry if I forgot anyone else's name)

Re: Disk Problems on a 11/750 running 4.3 BSD

In article <115@egrunix.UUCP> you write:
> So, I thought I would wait a day or two to see if it would repeat,
> then this came up:
>
> Jul 11 18:07:30 unix vmunix: mcr0: soft ecc addr 1a72 syn 73

"mcr0" is "Memory ContRoller 0".  It is most likely not related to
your disk problems.  As long as you only see "soft" errors, and they
don't occur "too often", you can just ignore them forever.  ("too
often": we had a 780 that would routinely report 10-12 of those mcr0
errors per hour, and other than wasting console paper, that caused no
apparent problems.  It was like this for years.)

Soft/Hard- "soft" means the memory "ecc" (Error Check/Correction)
logic detected an error but was able to correct it (single bit error).
"hard" means the ecc detected an error but couldn't fix it (double bit
error).

"addr" and the following hex number, "1a72", can be used to figure out
which board was failing.  You need to know how much memory is on each
board, and multiply the "1a72" number by 4, since the ecc logic looks
at memory in 4-byte chunks:  (1a72*4) divided by (bytes per board),
dropping the remainder, gives you the number of the board which had
the error, counting from 0; the remainder is the byte offset within
that board.  Unfortunately I'm not sure how the boards are laid out in
a 750.

The "syn" (Syndrome) and the following number, "73", can be used to
figure out which chip on the board failed.

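The address arithmetic above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the 256 KB board
size is a placeholder (the post does not say how a 750's boards are
laid out), boards are assumed to be contiguous and numbered from 0, and
the helper names parse_mcr_line and locate_board are made up for this
example.

```python
import re

# Matches the soft-error console line quoted in the post, e.g.
#   "... vmunix: mcr0: soft ecc addr 1a72 syn 73"
MCR_SOFT = re.compile(r"mcr(\d+): soft ecc addr ([0-9a-f]+) syn ([0-9a-f]+)")


def parse_mcr_line(line):
    """Return (controller, addr, syndrome) as ints, or None if no match.

    The addr and syn fields are read as hexadecimal.
    """
    m = MCR_SOFT.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2), 16), int(m.group(3), 16)


def locate_board(addr, bytes_per_board):
    """Return (board index, byte offset within that board).

    addr counts 4-byte chunks, so it is multiplied by 4 to get a byte
    address. The quotient by the board size is the board index and the
    remainder is the offset inside the board.
    """
    byte_address = addr * 4
    return divmod(byte_address, bytes_per_board)


# ASSUMPTION: 256 KB per board, purely for illustration.
ASSUMED_BYTES_PER_BOARD = 256 * 1024

line = "Jul 11 18:07:30 unix vmunix: mcr0: soft ecc addr 1a72 syn 73"
controller, addr, syndrome = parse_mcr_line(line)
board, offset = locate_board(addr, ASSUMED_BYTES_PER_BOARD)
print(f"mcr{controller}: board {board}, byte offset {offset:#x}, "
      f"syndrome {syndrome:#x}")
# prints: mcr0: board 0, byte offset 0x69c8, syndrome 0x73
```

With any plausible board size the quoted error lands on the first
board, since 0x1a72 * 4 is only about 26 KB into memory.
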
One last note:  I say "failed" above, but be aware that this generally
only means that one single bit out of a large number happened to
change state.  With high density memory chips, this sort of thing is
not entirely unexpected, hence they build the boards with ecc logic to
correct the occasional expected bit flip.  Mcrx soft errors can be
ignored almost indefinitely, unless they start occurring in such
numbers that you think a whole chip has failed.  Even if a whole chip
fails, you can probably "limp along" for quite a while, assuming there
are no other problems on that memory board.

--
Thomas Hacker                    ...Weave a circle round him thrice,
Systems Programmer               And close your eyes with holy dread,
Oakland University               For he on honeydew hath fed,    --"Kubla Khan"
hackertj@unix.secs.oakland.edu   And drunk the milk of Paradise. -- ST Coleridge