Path: utzoo!attcan!uunet!zaphod.mps.ohio-state.edu!usc!ucsd!ucbvax!agate!shelby!portia.stanford.edu!news
From: dennis@portia.Stanford.EDU (Dennis Michael)
Newsgroups: comp.unix.ultrix
Subject: Re: 8800 crashing way too often
Message-ID: <1990Nov6.185546.26468@portia.Stanford.EDU>
Date: 6 Nov 90 18:55:46 GMT
Sender: news@portia.Stanford.EDU (USENET News System)
Organization: Stanford University - AIR
Lines: 75

In article <1908@shodha.enet.dec.com> alan@shodha.enet.dec.com ( Alan's Home for Wayward Notes File.) writes:
>In article , stergios@portia.Stanford.EDU (Stergios) writes:
>>
>> [ Customer has a VAX 8800 crashing very frequently. ]
>
>	I have a VAX 8800 that crashed 96 days ago.  That's the
>	last time it was down.  The time before that was 80 days.
>	The I/O configuration is three VAXBIs with two KDB50s
>	(2 RA90s each) and CIBCA with HSC70 and a bunch of disks.
>	There's a DEBNI and DMB32 in there somewhere.  This kind
>	of uptime seems to be typical for my system.  Use it for
>	comparison purposes.
>>
>> Quite a number of dec people have and still are looking into the
>> problem.  Every board has been replaced, even a new bi bus installed.
>> dec software engineering is leaning towards a problem in the mscp
>> code.
>
>	Is it the same error each time, or a different one?  Which
>	one: panic, machine check or "it stops"?  What version of
>	ULTRIX are you running?  If V4.0, have you installed and
>	booted the mandatory upgrade?  Any non-DEC devices on
>	the VAXBI or KDB50s?  Is there a UNIBUS on the system?
>	Does it have anything important on it?  Could it be
>	replaced by a native VAXBI device?

It is the same error every time - a trap type 8, segmentation fault.
The footprint in the crash dump is the same every time.  We are running
ULTRIX 3.1 with every fix we could find.  There are no non-DEC devices
on the machine and no UNIBUS.  There are also no modifications to the
ULTRIX kernel.  Everything is 'vanilla' DEC.
The problem occurs in the connection block between the MSCP code and the
drivers.  I quote from a problem statement we received from ULTRIX
Engineering:

"The panics occurring at Stanford appear to be caused by a flink (forward
link) of a request packet getting corrupted while the request packet
sits on the active queue (queue of active requests) of the connection
block for the underlying device.  In each case, the low byte of the
flink is overwritten with the value '04' (this is always the case where
the corrupted flink was from a request packet that was the only or last
request packet queued in the active queue and therefore had a flink
pointing back to the active queue of the connection block)."

There are currently 4 possible explanations, and investigation is
continuing (and continuing, and continuing...).  Has anyone seen this
before?

>>
>> [ mentioning replacement systems - particularly a 5500]
>>
>
>	Actually most of the interesting I/O on a DECsystem 5500 will
>	stay off the Q-bus unless you insist upon using KDA50s for
>	most of the disks.  A couple of gigabyte SCSI disks and DSSI
>	disks should be very impressive.  A VAX 8800 is good for
>	moving bits between disk and memory, but a well configured
>	DECsystem 5500 should be able to do better.  You'll need more
>	memory to make up for the VAX to RISC switch.

We are looking at a 5500 with SCSI disks and possibly a DSSI swap disk.
We will definitely avoid using KDA50s on the Q-bus.  Our memory
configuration will be 128MB.

>> sm
>> stergios@jessica.stanford.edu
>
>
>--
>Alan Rollow				alan@nabeth.enet.dec.com
>

Dennis Michael
dennis@jessica.stanford.edu
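
For anyone who wants to picture the failure in the problem statement, here
is a minimal C sketch of a circular queue whose last entry has a flink
pointing back at the queue head, plus a check that finds an entry whose
flink has been damaged.  This is only an illustration under assumptions:
the structure and function names (queue_links, request_packet,
connection_block, find_corrupt_flink and so on) are hypothetical and are
not taken from the ULTRIX sources, and the check assumes the backward
links are still intact.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real ULTRIX structures, which the post
 * does not show.  Only the flink/active-queue idea comes from the text. */
struct queue_links {
    struct queue_links *flink;  /* forward link  */
    struct queue_links *blink;  /* backward link */
};

struct request_packet {
    struct queue_links links;   /* queue linkage for this request */
    int id;
};

struct connection_block {
    struct queue_links active;  /* head of the queue of active requests */
};

/* An empty queue is a head that points at itself in both directions. */
static void queue_init(struct queue_links *head)
{
    head->flink = head;
    head->blink = head;
}

/* Append an entry; the last entry's flink always points back at the head. */
static void queue_insert_tail(struct queue_links *head, struct queue_links *e)
{
    assert(head != NULL && e != NULL);
    e->flink = head;
    e->blink = head->blink;
    head->blink->flink = e;
    head->blink = e;
}

/* Simulate the reported damage: replace the low byte of a link value. */
static void clobber_low_byte(struct queue_links **link, uint8_t value)
{
    uintptr_t bits = (uintptr_t)*link;
    bits = (bits & ~(uintptr_t)0xFF) | value;
    *link = (struct queue_links *)bits;
}

/* Walk the queue backwards over the blinks (assumed intact) and compare
 * each flink with the entry that should follow it.  The damaged flink is
 * only compared, never followed.  Returns the first entry (or the head)
 * whose flink is wrong, or NULL if the queue is consistent. */
static struct queue_links *find_corrupt_flink(struct queue_links *head)
{
    struct queue_links *expected = head;  /* what e->flink should be */
    struct queue_links *e = head->blink;

    while (e != head) {
        if (e->flink != expected)
            return e;
        expected = e;
        e = e->blink;
    }
    return (head->flink == expected) ? NULL : head;
}
```

To try it, call queue_init() on a connection block's active queue, add a
couple of request packets with queue_insert_tail(), apply
clobber_low_byte(&last.links.flink, 0x04) to the last packet, and then
call find_corrupt_flink(), which should hand back that last packet.  A
forward walk would instead follow the damaged flink into the wrong
address, which is the sort of thing that could end in a segmentation
fault.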