Path: utzoo!mnetor!uunet!husc6!cmcl2!brl-adm!umd5!ames!ucbcad!ucbvax!LBL.GOV!nagy%warner.hepnet
From: nagy%warner.hepnet@LBL.GOV (Frank J. Nagy, VAX Wizard & Guru)
Newsgroups: comp.os.vms
Subject: RE: LAVC help/info request
Message-ID: <880105054006.22e03926@LBL.Gov>
Date: 5 Jan 88 13:40:06 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 54

Nigel Arnot (Dept. of Physics, Kings College) writes:

> We have a 4-node LAVC connected via a DELNI, which in turn is connected to
> thick wire Ethernet for comms. to other systems (not LAVC nodes). Recently,
> we had to extend the thickwire. I was most surprised that the fairly brief
> interruption to the thickwire caused all LAVC satellite nodes to perform a
> CLUEXIT bugcheck!

> 1 - Is my diagnosis right? (RECNXINTERVAL caused crash)

Not quite; the system parameter which controls the polling for new cluster
boot nodes or failed cluster circuits is PAPOLLINTERVAL. Don't be fooled
by the documentation talking about the CI; the major difference between a
CI VAXCluster and an LAVC is the PEDRIVER, which provides a CI Port
Emulator for the LAVC. So the same "CI" parameters apply in an LAVC also.

From the V4.4 Release Notes on RECNXINTERVAL:

    "This parameter specifies the amount of time that the connection
    manager waits between the loss of a connection to a remote node and
    the initiation of a cluster transition to remove the failed node
    from the cluster."

And since, in an LAVC, a satellite node is defunct once communication to
the boot node has been lost, the satellite nodes bugcheck with CLUEXIT.

> 2 - Is there any way to prevent a cluster crash for this reason? Assuming my
> diagnosis is right, setting RECNXINTERVAL to something sensible like 300
> should work - but why do DEC reduce it from the default 60 to 20 in the
> first place? Has anyone out there actually tried this fix?

See answer #3 below. Sounds plausible and worth a try at least. Anyone
want to experiment and report to the net?
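
For anyone who does want to experiment, a SYSGEN session along the
following lines is one way to try it. Treat this as an untested sketch:
the value 300 is simply the figure suggested in the question, not a
recommendation, and I am assuming the parameter can be changed with the
ordinary SYSGEN SET/WRITE commands on your VMS version. WRITE CURRENT
saves the value for the next boot; whether WRITE ACTIVE will also change
it on the running system depends on the parameter being dynamic in your
release, so check before relying on it.

```
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE CURRENT
SYSGEN> SHOW RECNXINTERVAL
SYSGEN> SET RECNXINTERVAL 300
SYSGEN> WRITE CURRENT
SYSGEN> EXIT
```

If you use AUTOGEN, the new value probably also needs to be recorded in
MODPARAMS.DAT, or a later AUTOGEN run may put the old one back. The
setting would presumably have to be made on every node in the LAVC,
boot node and satellites alike.
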

> 3 - Why does a DELNI cause communication through itself to fail when the only
> fault is on the thickwire to which it is connected? Is there any way to
> prevent this action?

The DELNI is just replacing (up to) 8 transceivers and a length of
EtherHose (the thick yellow/orange cable). It provides no electrical or
protocol buffering and (except for a time delay) acts just like a
transceiver tapped directly to the EtherHose.

Since your entire LAVC is connected to the DELNI, you could have just
flipped the small switch on the DELNI to local operation before the
EtherHose was opened. In this mode, the DELNI will ignore the tap on the
EtherHose and the nodes on the DELNI can continue to function (sans any
outside connections). When the EtherHose is online again, you just flip
the switch back to re-establish outside connections. There are no
problems with flipping the DELNI switch while the systems are live; this
is something I have done in the past (not on LAVCs, but there is no
reason why it shouldn't work there also).

= Frank J. Nagy   "VAX Guru & Wizard"
= Fermilab Research Division EED/Controls
= HEPNET: WARNER::NAGY (43198::NAGY) or FNAL::NAGY (43009::NAGY)
= BitNet: NAGY@FNAL
= USnail: Fermilab POB 500 MS/220 Batavia, IL 60510