Path: utzoo!mnetor!uunet!husc6!cmcl2!brl-adm!umd5!ames!ucbcad!ucbvax!LBL.GOV!nagy%warner.hepnet
From: nagy%warner.hepnet@LBL.GOV (Frank J. Nagy, VAX Wizard & Guru)
Newsgroups: comp.os.vms
Subject: RE: LAVC help/info request
Message-ID: <880105054006.22e03926@LBL.Gov>
Date: 5 Jan 88 13:40:06 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 54

Nigel Arnot (Dept. of Physics, Kings College) writes:

> We have a 4-node LAVC connected via a DELNI, which in turn is connected to
> thick wire Ethernet for comms. to other systems (not LAVC nodes). Recently,
> we had to extend the thickwire. I was most surprised that the fairly brief
> interruption to the thickwire caused all LAVC satellite nodes to perform a
> CLUEXIT bugcheck!
     
> 1 - Is my diagnosis right? (RECNXINTERVAL caused crash)
     
Not quite.  The system parameter which controls the polling for new cluster
boot nodes or failed cluster circuits is PAPOLLINTERVAL.  Don't be fooled
by the documentation talking about the CI; the major difference between
a CI VAXcluster and an LAVC is the PEDRIVER, which provides a CI Port
Emulator for the LAVC.  So the same "CI" parameters apply in an LAVC
also.

From the V4.4 Release Notes on RECNXINTERVAL: "This parameter specifies
the amount of time that the connection manager waits between the loss of
a connection to a remote node and the initiation of a cluster transition
to remove the failed node from the cluster."  In an LAVC, a satellite node
is defunct once communication to the boot node has been lost, so the
satellite nodes bugcheck with CLUEXIT.

> 2 - Is there any way to prevent a cluster crash for this reason? Assuming my
>     diagnosis is right, setting RECNXINTERVAL to something sensible like 300
>     should work - but why do DEC reduce it from the default 60 to 20 in the
>     first place? Has anyone out there actually tried this fix?
     
See answer #3 below.  Raising RECNXINTERVAL sounds plausible and is worth
a try at least.  Anyone want to experiment and report to the net?
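
For anyone who does want to experiment, here is a minimal sketch of the
SYSGEN session involved.  The value 300 is simply the one suggested in the
question, not a tested recommendation.  I am assuming the change is written
to the CURRENT parameter file, so it only takes effect at the next reboot,
and that it would have to be repeated on every node in the cluster (boot
node and satellites) so that all members agree.

```
$ ! Untested sketch: inspect RECNXINTERVAL, then raise it for the next boot
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE CURRENT
SYSGEN> SHOW RECNXINTERVAL
SYSGEN> SET RECNXINTERVAL 300
SYSGEN> WRITE CURRENT
SYSGEN> EXIT
```

SHOW PAPOLLINTERVAL in the same session displays the polling parameter
mentioned under question 1.  Note that a later AUTOGEN run may overwrite a
value set by hand in SYSGEN.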

> 3 - Why does a DELNI cause communication through itself to fail when the only
>     fault is on the thickwire to which it is connected? Is there any way to
>     prevent this action?
     
The DELNI is just replacing (up to) 8 transceivers and a length of EtherHose
(the thick yellow/orange cable).  It provides no electrical or protocol
buffering and (except for a time delay) acts just like a transceiver tapped
directly to the EtherHose.  Since your entire LAVC is connected to the
DELNI, you could simply have flipped the small switch on the DELNI to
local operation before the EtherHose was opened.  In this mode, the DELNI
ignores the tap on the EtherHose and the nodes on the DELNI can continue
to function (sans any outside connections).  When the EtherHose is online
again, you just flip the switch back to re-establish outside connections.
There are no problems with flipping the DELNI switch while the systems
are live; I have done this in the past (not on LAVCs, but I see no reason
why it shouldn't work there also).

= Frank J. Nagy   "VAX Guru & Wizard"
= Fermilab Research Division EED/Controls
= HEPNET: WARNER::NAGY (43198::NAGY) or FNAL::NAGY (43009::NAGY)
= BitNet: NAGY@FNAL
= USnail: Fermilab POB 500 MS/220 Batavia, IL 60510