Path: utzoo!attcan!uunet!nems!ark1!uakari.primate.wisc.edu!samsung!usc!snorkelwacker!ai-lab!zurich.ai.mit.edu!cph
From: cph@zurich.ai.mit.edu (Chris Hanson)
Newsgroups: comp.sys.hp
Subject: Re: why do cluster clients panic after cluster servers die
Message-ID:
Date: 1 Jun 90 06:11:35 GMT
References: <7752@rasp.eng.cam.ac.uk> <1720008@hpbbi4.HP.COM>
Sender: news@wheaties.ai.mit.edu
Organization: M.I.T. Artificial Intelligence Lab.
Lines: 52
In-reply-to: markl@hpbbi4.HP.COM's message of 28 May 90 17:20:40 GMT

In article <1720008@hpbbi4.HP.COM> markl@hpbbi4.HP.COM (#Mark Lufkin) writes:

   From: markl@hpbbi4.HP.COM (#Mark Lufkin)
   Newsgroups: comp.sys.hp
   Date: 28 May 90 17:20:40 GMT

   The HPUX diskless implementation is stateful - that is to say, the
   server keeps info on resources being used by the client machines.
   Thus if the server crashes and comes back up again it will not have
   info that it requires. The information kept is things like which
   nodes have which files open and whether they are open for read or
   write. This info is used to allow synchronisation of the buffer
   caches and provides an altogether more robust system. Because of
   this state information the client MUST panic if it loses contact
   with the server. Note also that it does not panic immediately (the
   number of retries can be set in the kernel) and it does a check to
   make sure that the network has not been temporarily opened or
   whatever.

   The discussion of what is the right implementation - stateful or
   stateless - is open for discussion and people have different
   opinions depending on who you talk to. Stateless is robust in the
   face of server crashes however the stateful implementation allows
   true un*x semantics.

   Mark Lufkin
   WG-EMC OS Technical Support
   HP GmbH, Boeblingen, W.Germany

In general I have found that HP's diskless implementation works very well, outperforming NFS by a considerable margin -- indeed I suspect that the "statefulness" of this implementation is necessary to achieve this kind of behavior.
We have been using the software for quite a while and are generally very happy with it. But having all of the machines crash whenever the server goes down or the network is broken is really a pain.

It seems to me that the diskless protocol could be extended in an upwards-compatible way so that the client supplies the server with the information it needs to resynchronize whenever the server has lost that information for one reason or another. Certainly the client knows all of the relevant state, such as what files are open.

Presumably there are some situations in which complete resynchronization is impossible -- such as the client having an enforcement-mode lock on some file which the server has given away to another client while the first client was out of touch -- and in such a case the client's processes that depend on that particular bit of state ought to get errors and lose, but the remaining processes should still be able to win. And I suspect that "impossible resynchronization" is really quite rare, so that even if it has fairly catastrophic consequences, that may not be much of a problem.

A little cleverness in the design of this software could eliminate a lot of headaches for customers.
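
To make the resynchronization idea concrete, here is a minimal sketch in Python of what such an extension might look like. It is an illustration only, not HP's diskless protocol: the names `OpenFile`, `Server`, and `resync`, the client names, and the file paths are all hypothetical. I assume the client sends the restarted server its list of open files, and that the server rejects only those entries that claim a lock some other client now holds.

```python
from dataclasses import dataclass, field

# Hypothetical model: none of these names come from the HP-UX diskless protocol.


@dataclass(frozen=True)
class OpenFile:
    path: str
    mode: str                 # "r" or "w"
    holds_lock: bool = False  # client believes it holds an enforcement-mode lock


@dataclass
class Server:
    locks: dict = field(default_factory=dict)       # path -> client holding the lock
    open_files: dict = field(default_factory=dict)  # client -> set of OpenFile

    def resync(self, client, claimed):
        """Rebuild the state for one client from what that client reports.

        Returns (accepted, rejected). Only entries that conflict with a lock
        now held by another client are rejected.
        """
        accepted, rejected = [], []
        for entry in claimed:
            owner = self.locks.get(entry.path)
            if entry.holds_lock and owner not in (None, client):
                rejected.append(entry)   # lock was given away: cannot resync
                continue
            if entry.holds_lock:
                self.locks[entry.path] = client
            accepted.append(entry)
        self.open_files[client] = set(accepted)
        return accepted, rejected


# Assumed scenario: client-a lost contact, and its lock went to client-b meanwhile.
server = Server()
server.locks["/data/ledger"] = "client-b"

claimed = [
    OpenFile("/etc/motd", "r"),
    OpenFile("/data/ledger", "w", holds_lock=True),
]
accepted, rejected = server.resync("client-a", claimed)
```

In this sketch the client would hand errors only to the processes that depend on the rejected entries (here, the one holding `/data/ledger`), while everything using the accepted entries carries on instead of the whole machine panicking.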