Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!uwm.edu!zaphod.mps.ohio-state.edu!sdd.hp.com!hplabs!hpcc01!hpbbn!hpbbi4!markl
From: markl@hpbbi4.HP.COM (#Mark Lufkin)
Newsgroups: comp.sys.hp
Subject: Re: why do cluster clients panic after cluster servers die
Message-ID: <1720008@hpbbi4.HP.COM>
Date: 28 May 90 17:20:40 GMT
References: <7752@rasp.eng.cam.ac.uk>
Organization: Hewlett-Packard GmbH
Lines: 55

> >	Why _does_ a cluster client invoke panic() when its cluster server
> >        stops responding?  Why does it simply not sleep-and-retry?
> >
> >	The idea of doing this is to take into account the possibility
> >	that the LAN hardware on the client is bad. Say the client could
> >	SEND packets (so the server thinks it is alive) but cannot RECEIVE
> >	packets (so it sees the server as being dead). In this case it holds
> >	resources on the server but is effectively inoperable. To be certain,
> >	the client simply panics if it loses contact with the server (after
> >	going through a reasonable retry period and running the landiag 
> >	routines to see if there is a broken cable - in which case it WILL
> >	wait indefinitely).
> 
> Are you serious?  You mean that in order to guard against an unlikely
> failure mode of your Ethernet hardware (mtbf anyone?) HP-UX chooses to
> always crash the machine?  To paraphrase, what you're saying is that if
> the server loses contact for whatever reason (perhaps it crashes, is being
> dumped, or, say, freezes because of a nfs hard mount on a dead/dumping
> server, though not from broken cables) all the clients suicide on the
> off-chance that their Ethernet hardware has half broken.
> 	Wouldn't the alternative of simply waiting for something rather
> more likely to happen (e.g. the server recovers) be friendlier to your
> average user who was rather hoping to just wait five minutes for the
> server to reboot.. before getting back to what he/she was doing?
> 
> Sun, DEC (and other vendors) diskless workstations happily survive server
> crashes/dumps -- do we assume they have significantly more reliable
> Ethernet hardware ;-)

	The HP-UX diskless implementation is stateful; that is to say, the
	server keeps information on the resources being used by the client
	machines. That information includes things like which nodes have
	which files open and whether they are open for read or write. It
	is used to synchronise the buffer caches and provides an altogether
	more robust system. If the server crashes and comes back up again,
	it no longer has this information, and because of that the client
	MUST panic if it loses contact with the server. Note also that it
	does not panic immediately (the number of retries can be set in the
	kernel), and it first checks that the network has not simply been
	opened temporarily (a broken cable, for instance). Which
	implementation is right, stateful or stateless, is open for
	discussion, and opinions differ depending on who you talk to.
	Stateless is robust in the face of server crashes; the stateful
	implementation, however, allows true un*x semantics.
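
	To make the argument concrete, here is a rough sketch in Python. It
	is purely illustrative and is NOT the HP-UX code: every name in it
	(Server, keep_contact, ClientPanic, probe, cable_ok) is made up for
	the example. It only shows the two points above: the open-file
	table lives on the server and is gone after a reboot, and the
	client retries a configurable number of times, waits if the cable
	is broken, and otherwise gives up.

```python
# Hypothetical model only; none of these names exist in HP-UX.

class Server:
    """Stateful server: remembers which node has which file open."""

    def __init__(self):
        self.open_files = {}  # path -> {node: "r" or "w"}

    def open(self, node, path, mode):
        self.open_files.setdefault(path, {})[node] = mode

    def writers(self, path):
        """Nodes holding the file open for write (needed for cache sync)."""
        holders = self.open_files.get(path, {})
        return sorted(n for n, m in holders.items() if m == "w")

    def crash_and_reboot(self):
        # The table lived in memory, so it does not survive the crash.
        self.open_files = {}


class ClientPanic(Exception):
    """Stands in for the client kernel calling panic()."""


def keep_contact(probe, cable_ok, max_retries):
    """probe() is True when the server answers; cable_ok() is False
    when the LAN check finds a broken cable."""
    while True:
        for _ in range(max_retries):
            if probe():
                return "ok"
        if cable_ok():
            # Retries exhausted and the cable is fine: give up.
            raise ClientPanic("lost contact with cluster server")
        # Broken cable: keep waiting instead of panicking.
```

	After crash_and_reboot() the server can no longer say who had a
	file open for write, which is why a surviving client cannot simply
	carry on as if nothing had happened.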

> 
> tim marsland, <tpm@eng.cam.ac.uk>
> information engineering division,
> cambridge university engineering dept.
> ----------

Mark Lufkin
WG-EMC OS Technical Support
HP GmbH, Boeblingen, W.Germany