Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site onfcanim.UUCP
Path: utzoo!watmath!watnot!watcgl!onfcanim!dave
From: dave@onfcanim.UUCP (Dave Martindale)
Newsgroups: net.unix-wizards
Subject: Re: strange problems (looking for help)
Message-ID: <14827@onfcanim.UUCP>
Date: Fri, 25-Apr-86 12:03:58 EST
Article-I.D.: onfcanim.14827
Posted: Fri Apr 25 12:03:58 1986
Date-Received: Sat, 26-Apr-86 06:05:06 EST
References: <279@entropy.UUCP>
Reply-To: dave@onfcanim.UUCP (Dave Martindale)
Organization: ONF, Montreal
Lines: 32
Summary:

In article <279@entropy.UUCP> hubert@entropy.UUCP (Steve Hubert) writes:
>I wonder if anyone recognizes the following symptoms as symptoms of
>something concrete I can try to fix. We are running 4.3BSD on a
>VAX11/785. The disks are 3 RA81s on a single UDA. The uda device
>driver is version 6.12 from Berkeley (9/16/85) which seems to be equal
>to or derived from a DEC driver from January 84. I am not getting any
>kernel error messages at all. Here is symptom number 1:
>
> [examples of cmp'ing a file with itself and getting non-repeatable errors,
> and C compiles which sometimes worked, sometimes not]

I had the same problem when installing our 780, and asked the disk controller vendor to swap controller boards (Emulex SC780, driving Eagles). The problem remained. The rest of the system passed DEC diagnostics, so I didn't know where to look next.

Then we started occasionally getting soft ECC errors. I like to keep the memory system error-free, so I figured out which memory array board the error was on and swapped it with another board, just to be sure. The error remained in the same place! So I swapped memory controllers, and the problem did move. (On the MS780-E memory system, there are two controllers, on either side of the central bus interface board).
So I pulled the bad controller entirely, the memory reverted to non-interleaved operation on the remaining half of the memory, and the mysterious data problems went away. DEC has since replaced the bad controller.

Moral of the story: a bad memory controller can mess up your data while still passing DEC diagnostics and without giving any sort of error. The memory ECC will catch bad RAM chips, and not much else. There are also a number of places in the CPU unprotected by parity checking where an intermittent hardware fault will damage data.
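
For anyone who wants to check for the quoted symptom (a file that does not compare equal to itself from one read to the next), here is a minimal sketch of a repeated-read check. It is not what either poster ran. The function names `file_digest` and `repeated_read_check` are hypothetical, and a SHA-256 digest stands in for a byte-by-byte cmp. Repeated reads may well be served from the buffer cache rather than the disk, so a mismatch only says that something in the data path is flaky, not which part.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks.

    Hypothetical helper for illustration; the digest stands in for cmp.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def repeated_read_check(path, passes=10):
    """Read `path` `passes` times in total and compare against the first read.

    Returns the list of pass indices (1-based after the reference read)
    whose digest differed from the first one. An empty list means every
    read agreed. Assumption: the file is not being modified meanwhile.
    """
    reference = file_digest(path)
    mismatches = []
    for i in range(1, passes):
        if file_digest(path) != reference:
            mismatches.append(i)
    return mismatches
```

Calling `repeated_read_check("/some/large/file", passes=50)` on a healthy machine should return an empty list. A non-empty list that changes from run to run would match the non-repeatable errors described in the quoted article.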