Path: utzoo!attcan!uunet!mailrus!cornell!ken
From: ken@gvax.cs.cornell.edu (Ken Birman)
Newsgroups: comp.sys.isis
Subject: Re: Detecting client failures
Keywords: isis client performance
Message-ID: <38992@cornell.UUCP>
Date: 23 Mar 90 14:54:27 GMT
References: <38866@cornell.UUCP> <38955@cornell.UUCP> <1990Mar23.093050.23923@oracle.com>
Sender: nobody@cornell.UUCP
Reply-To: ken@gvax.cs.cornell.edu (Ken Birman)
Organization: Cornell Univ. CS Dept, Ithaca NY
Lines: 69

In article <1990Mar23.093050.23923@oracle.com> jhanley@oracle.com (John Hanley) writes:
>I hope that the "are you alive?" probe can be done by either the server
>or client....

The implementation is pretty simple. Actually, nobody does any probing
at all. The client starts sending "I'm alive" messages to the server
once every seconds. The server expects these and if one is late by
seconds it kills off the client. I'll probably tune things so that
"I'm alive" messages are only sent if there has been no other traffic
to the server, but this isn't likely to be a major issue.

The client expects acks from the server, so it will notice within about
30 seconds if the server isn't acknowledging an "I'm alive"
transmission. This is also how it finds out that it has been killed
off. In either case (client was up but got killed off, or server died
but client survived) the client code calls isis_failed(), which has the
option of reconnecting to ISIS or aborting.

>As long as you're going to the trouble of pinging, it might be worthwhile to
>communicate some load metric to the client, such as load avg, or # of active
>users, or # of pages on the free list...

We are planning to solve this class of problem through Meta, which has
"sensors" that include the sorts of per-process load factors you cite.
Meta is a major ISIS application developed by Keith Marzullo and Mark
Wood. It provides a network-wide user-extensible database of sensors
(e.g.
things like load, but also potentially things like the amount of space
left on a disk or the length of a job-queue or even the temperature in
the machine room). It also has actuators (trigger actions). Meta has
several ways to query the database of sensors/actuators and supports a
"when" clause (will support, at any rate) that watches in a
fault-tolerant way for some event and triggers a specified action. Oh,
and you can also combine sensors to build composite ones, e.g. average
load or something...

We are mailing out a few technical reports in the next week or so, and
one covers Meta (another is on the bypass code, and another is just a
progress/status report for the project).

>Also, I suspect that the "drop offline for a little while" functionality
>isn't too hard to do by
> A) demanding an estimate of how long you plan to crunch offline for
> B) renegotiating the interval and timeout parameters to much larger values,
>    and then shrinking them back to normal when rejoining the Isis community

Well, this would force ISIS to buffer potentially large amounts of
data. With lots of such data ISIS would congest and kill the client to
shed load...

What Robert had in mind was more along the lines of three pieces: a way
for the remote client to tell its services (gracefully) that it wants
to drop offline, a period during which it would be offline and the
services would buffer or archive data for it, and a graceful re-join
mechanism that would bring it up to date again. Obviously, you could
implement this now, but the question is whether we couldn't come up
with a "generic" tool for this purpose.

>Finally, I would like to observe that the single most important server
>resource in our environment is physical memory.... Optimizing the set of
>pages touched whilst assuring others that "I'm up" would be a great boon.

I'm confused. Doesn't the isis_remote() stuff address this?
With the remote client code, protos won't be on the workstations at
all, only on the servers that run things like NFS or the main database
system. The client code is down to something a bit more modest by now,
168k of ISIS-related library text total. So, the typical workstation
dedicated to some application and running only as a "remote" client
pays 168k/process, which could be further reduced using shared
libraries (one of those things we ought to get around to...)

Ken
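
The keep-alive scheme described in the post (client sends "I'm alive",
server kills clients whose message is late, client notices missing acks
and calls isis_failed()) can be sketched roughly as below. This is only
an illustration in Python, not ISIS code: the class names, the
tick()/reap() methods, and the interval and timeout parameters are all
hypothetical, and on_failed merely stands in for isis_failed().

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ServerMonitor:
    """Server side (hypothetical): remembers when each client was last heard from."""
    timeout: float  # assumed: seconds of silence tolerated before a client is killed
    last_seen: Dict[str, float] = field(default_factory=dict)

    def register(self, client: str, now: float) -> None:
        self.last_seen[client] = now

    def on_alive(self, client: str, now: float) -> bool:
        """Handle an "I'm alive" message; True means an ack is sent back."""
        if client not in self.last_seen:
            return False  # unknown or already killed off: no ack
        self.last_seen[client] = now
        return True

    def reap(self, now: float) -> List[str]:
        """Kill off every client whose last message is older than the timeout."""
        dead = [c for c, t in self.last_seen.items() if now - t > self.timeout]
        for c in dead:
            del self.last_seen[c]
        return dead


@dataclass
class Client:
    """Client side (hypothetical): sends "I'm alive" and watches for acks."""
    name: str
    interval: float               # assumed: seconds between "I'm alive" messages
    ack_timeout: float            # assumed: seconds without an ack before giving up
    on_failed: Callable[[], None]  # stand-in for isis_failed()
    last_sent: float = 0.0
    last_ack: float = 0.0
    failed: bool = False

    def tick(self, server: Optional[ServerMonitor], now: float) -> None:
        """Call periodically; server=None models a server that has died."""
        if self.failed:
            return
        if now - self.last_sent >= self.interval:
            self.last_sent = now
            if server is not None and server.on_alive(self.name, now):
                self.last_ack = now
        # No ack for too long: either the server died or it killed us off.
        if now - self.last_ack > self.ack_timeout:
            self.failed = True
            self.on_failed()
```

Nobody probes anybody in this sketch: the server only reacts to
messages and to the passage of time, and the client learns that it was
killed off the same way it learns that the server died, by the absence
of acks. Passing the clock in as an argument is just a convenience for
testing the timing logic without real sleeps.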