Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!uwm.edu!zaphod.mps.ohio-state.edu!sdd.hp.com!hplabs!hpcc01!hpbbn!hpbbi4!markl
From: markl@hpbbi4.HP.COM (#Mark Lufkin)
Newsgroups: comp.sys.hp
Subject: Re: why do cluster clients panic after cluster servers die
Message-ID: <1720008@hpbbi4.HP.COM>
Date: 28 May 90 17:20:40 GMT
References: <7752@rasp.eng.cam.ac.uk>
Organization: Hewlett-Packard GmbH
Lines: 55

> > Why _does_ a cluster client invoke panic() when its cluster server
> > stops responding? Why does it simply not sleep-and-retry?
> >
> > The idea of doing this is to take into account the possibility
> > that the LAN hardware on the client is bad. Say the client could
> > SEND packets (so the server thinks it is alive) but cannot RECEIVE
> > packets (so it sees the server as being dead). In this case it holds
> > resources on the server but is effectively inoperable. To be certain,
> > the client simply panics if it loses contact with the server (after
> > going through a reasonable retry period and running the landiag
> > routines to see if there is a broken cable - in which case it WILL
> > wait indefinitely).
>
> Are you serious? You mean that in order to guard against an unlikely
> failure mode of your Ethernet hardware (mtbf anyone?) HP-UX chooses to
> always crash the machine? To paraphrase, what you're saying is that if
> the server loses contact for whatever reason (perhaps it crashes, is being
> dumped, or, say, freezes because of a nfs hard mount on a dead/dumping
> server, though not from broken cables) ==all the clients suicide on the
> off-chance that their Ethernet hardware has half broken.
> Wouldn't the alternative of simply waiting for something rather
> more likely to happen (e.g. the server recovers) be friendlier to your
> average user who was rather hoping to just wait five minutes for the
> server to reboot.. before getting back to what he/she was doing?
>
> Sun, DEC (and other vendors) diskless workstations happily survive server
> crashes/dumps -- do we assume they have significantly more reliable
> Ethernet hardware ;-)

The HP-UX diskless implementation is stateful - that is to say, the
server keeps information on the resources being used by the client
machines. Thus, if the server crashes and comes back up again, it will
no longer have the information that it requires. The information kept
includes which nodes have which files open and whether they are open
for read or write. It is used to synchronise the buffer caches and
makes for an altogether more robust system.

Because of this state information, the client MUST panic if it loses
contact with the server. Note also that it does not panic immediately
(the number of retries can be set in the kernel), and that it first
checks that the network has not been temporarily opened or whatever.

Which implementation is right - stateful or stateless - is open for
discussion, and people have different opinions depending on who you
talk to. Stateless is robust in the face of server crashes; the
stateful implementation, however, allows true un*x semantics.

>
> tim marsland,
> information engineering division,
> cambridge university engineering dept.
>

----------
Mark Lufkin
WG-EMC OS Technical Support
HP GmbH, Boeblingen, W.Germany
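
For illustration only, here is a minimal Python sketch of the argument above: a server that keeps per-client open-file state loses that state when it crashes, so a surviving client would afterwards be working from assumptions the server no longer shares. This is not HP-UX code. The names StatefulServer, may_cache, crash_and_reboot and client_should_panic, the caching rule, and the file path are all hypothetical, chosen only to make the point concrete.

```python
from collections import defaultdict


class StatefulServer:
    """Toy server that remembers which nodes have which files open."""

    def __init__(self):
        # path -> {node: mode}, where mode is "r" (read) or "w" (write)
        self.open_files = defaultdict(dict)

    def open(self, node, path, mode):
        self.open_files[path][node] = mode

    def may_cache(self, node, path):
        """Allow caching only if no other node has the file open for write."""
        holders = self.open_files.get(path, {})
        return all(mode == "r" for other, mode in holders.items() if other != node)

    def crash_and_reboot(self):
        # Assumption: the state lived only in memory, so a crash wipes it.
        self.open_files.clear()


def client_should_panic(probe, retries):
    """Return True only if every one of `retries` probes of the server fails."""
    return not any(probe() for _ in range(retries))


server = StatefulServer()
server.open("nodeA", "/shared/log", "w")
server.open("nodeB", "/shared/log", "r")

# nodeA is writing, so nodeB must not cache the file.
before = server.may_cache("nodeB", "/shared/log")

server.crash_and_reboot()

# The rebooted server has forgotten nodeA's open-for-write, so it would
# now (wrongly) let nodeB cache a file that nodeA still thinks it is writing.
after = server.may_cache("nodeB", "/shared/log")

print(before, after)
```

In this toy model the answer flips from False to True across the reboot, which is the inconsistency a client avoids by giving up once its configured number of retries has been exhausted.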