Path: utzoo!attcan!uunet!zaphod.mps.ohio-state.edu!usc!ucsd!ucbvax!agate!shelby!portia.stanford.edu!news
From: dennis@portia.Stanford.EDU (Dennis Michael)
Newsgroups: comp.unix.ultrix
Subject: Re: 8800 crashing way too often
Message-ID: <1990Nov6.185546.26468@portia.Stanford.EDU>
Date: 6 Nov 90 18:55:46 GMT
Sender: news@portia.Stanford.EDU (USENET News System)
Organization: Stanford University - AIR
Lines: 75

In article <1908@shodha.enet.dec.com> alan@shodha.enet.dec.com ( Alan's Home for Wayward Notes File.) writes:
>In article <STERGIOS.90Oct29193129@kt22.Stanford.EDU>, stergios@portia.Stanford.EDU (Stergios) writes:
>> 
>> [ Customer has a VAX 8800 crashing very frequently. ]
>
>	I have a VAX 8800 that crashed 96 days ago.  That's the
>	last time it was down.  The time before that was 80 days.
>	The I/O configuration is three VAXBIs with two KDB50s
>	(2 RA90s each) and CIBCA with HSC70 and a bunch of disks.
>	There's a DEBNI and DMB32 in there somewhere.  This kind
>	of uptime seems to be typical for my system.  Use it for
>	comparison purposes.
>> 
>> Quite a number of dec people have been, and still are, looking into the
>> problem.  Every board has been replaced, even a new bi bus installed.
>> dec software engineering is leaning towards a problem in the mscp
>> code.
>
>	Is it the same error each time, or a different one?  Which
>	one: panic, machine check or "it stops"?  What version of
>	ULTRIX are you running?  If V4.0, have you installed and
>	booted the mandatory upgrade?  Any non-DEC devices on
>	the VAXBI or KDB50s?  Is there a UNIBUS on the system?
>	Does it have anything important on it?  Could it be
>	replaced by a native VAXBI device?

It is the same error every time - a trap type 8, segmentation fault.
The footprint in the crash dump is the same every time.  We are running
ULTRIX 3.1 with every fix we could find.  There are no non-DEC devices
on the machine, no UNIBUS.  There are also no modifications to the
ULTRIX kernel.  Everything is 'vanilla' DEC.

The problem occurs in the connection block between the MSCP code and
the drivers.  I quote from a problem statement we received from
ULTRIX Engineering: "The panics occurring at Stanford appear to be caused
by a flink (forward link) of a request packet getting corrupted while
the request packet sits on the active queue (queue of active requests)
of the connection block for the underlying device.  In each case,
the low byte of the flink is overwritten with the value '04' (this
is always the case where the corrupted flink was from a request packet
that was the only or last request packet queued in the active queue
and therefore had a flink pointing back to the active queue of the
connection block)."
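
To make the failure mode concrete, here is a minimal sketch in C of a
circular, doubly linked queue in which the only queued packet has a
flink pointing back at the queue head, followed by the kind of damage
described above.  The struct and function names (qent, q_init,
q_insert_tail, corrupt_flink_low_byte, q_find_bad_flink) are made up
for illustration; they are not the ULTRIX MSCP definitions, and the
consistency check is only one possible way to catch such a link, under
the assumption that the backward links are still intact.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queue entry: forward and backward links only. */
struct qent {
    struct qent *flink;   /* forward link */
    struct qent *blink;   /* backward link */
};

/* An empty circular queue has a head that points at itself. */
static void q_init(struct qent *head)
{
    head->flink = head;
    head->blink = head;
}

/* Append e at the tail; e's flink then points back at the head. */
static void q_insert_tail(struct qent *head, struct qent *e)
{
    assert(head != NULL && e != NULL);
    e->flink = head;
    e->blink = head->blink;
    head->blink->flink = e;
    head->blink = e;
}

/* Simulate the reported damage: replace the low-order byte of
 * e->flink with `value` (the problem statement reports 0x04). */
static void corrupt_flink_low_byte(struct qent *e, unsigned char value)
{
    uintptr_t bits = (uintptr_t)e->flink;

    bits = (bits & ~(uintptr_t)0xFF) | value;
    e->flink = (struct qent *)bits;
}

/* Walk the queue backward through the blinks (assumed intact) and
 * return the first entry whose flink does not point at its successor,
 * or NULL if every flink is consistent.  No flink is dereferenced,
 * so a corrupted one cannot cause a fault here. */
static struct qent *q_find_bad_flink(struct qent *head)
{
    struct qent *next = head;
    struct qent *cur;

    for (cur = head->blink; cur != head; cur = cur->blink) {
        if (cur->flink != next)
            return cur;
        next = cur;
    }
    return (head->flink == next) ? NULL : head;
}
```

With a single packet queued, its flink equals the address of the queue
head.  After corrupt_flink_low_byte() the flink points somewhere else
within the same 256-byte region, so any code that follows it treats
unrelated memory as a queue entry.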

There are currently 4 possible explanations, and investigation is
continuing (and continuing, and continuing...).

Anyone seen this before?

>> 
>> [ mentioning replacement systems - particularly a 5500]
>>
>
>	Actually most of the interesting I/O on a DECsystem 5500 will
>	stay off the Q-bus unless you insist upon using KDA50s for
>	most of the disks.  A couple of gigabyte SCSI disks and DSSI
>	disks should be very impressive.  A VAX 8800 is good for
>	moving bits between disk and memory, but a well configured
>	DECsystem 5500 should be able to do better.  You'll need more
>	memory to make up for the VAX to RISC switch.

We are looking at a 5500 with SCSI disks and possibly a DSSI swap disk.
We will definitely avoid using KDA50s on the Q-bus.  Our memory configuration
will be 128MB.

>> sm
>> stergios@jessica.stanford.edu
>
>
>-- 
>Alan Rollow				alan@nabeth.enet.dec.com
>

Dennis Michael
dennis@jessica.stanford.edu