Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!cs.utexas.edu!samsung!usc!apple!oracle!jhanley
From: jhanley@oracle.com (John Hanley)
Newsgroups: comp.sys.isis
Subject: Re: Detecting client failures
Summary: suggestions for client-ISIS implementations
Keywords: isis client performance
Message-ID: <1990Mar23.093050.23923@oracle.com>
Date: 23 Mar 90 09:30:50 GMT
References: <38866@cornell.UUCP> <38955@cornell.UUCP>
Sender: jhanley@oracle.com (JH)
Reply-To: jhanley@oracle.com (John Hanley)
Organization: Oracle Corp., 12th floor 150 Spear St, San Francisco, CA 94105
Lines: 84
Bell-net: +1 415 541 5552

In article <38955@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:
>Robert Cooper has an idea for a very fancy mechanism that would let
>a client (any client) drop "offline" for a while and then come back.
>...
>We are thinking about how this could be added to ISIS. It wouldn't be
>trivial, but could probably be done. Also being considered are ways to
>connect remote non-UNIX clients (PC's, etc)....

I hope that the "are you alive?" probe can be done by either the server or the client, but that the result "machine1 poked machine2" is available to both server and client, with a timestamp. If it would mean less work for the Isis server, I suppose the client should take primary responsibility for doing the pinging, though the overhead on the server is unclear to me.

For sanity's sake I _think_ it would make sense for one or the other of the machines to "usually" do the vast majority of the pings, with the other machine making a query and a log entry if no word has been received from its partner in a while. In the event that a fileserver in the machine room loses touch with 50 PCs because a thin-ether terminator came loose, it might become aware of this failure sooner, and with less scheduling overhead, if it were expecting pings from its clients, randomly distributed over a 60-second interval or whatever.
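
To make the "server expects pings from its clients" idea concrete, here is a minimal sketch in Python. It is not Isis code: the names ping_offsets and PingMonitor, the 60-second interval and the 90-second timeout are all assumptions made up for illustration. Each client gets a random phase within the interval, so the pings arrive spread out rather than in a burst, and the server only has to scan a table of "last heard from" timestamps.

```python
import random


def ping_offsets(clients, interval=60.0, seed=None):
    """Give each client a random phase (seconds) within the ping interval.

    Hypothetical helper: spreading the phases keeps 50 clients from all
    pinging the server in the same instant.
    """
    rng = random.Random(seed)
    return {client: rng.uniform(0.0, interval) for client in clients}


class PingMonitor:
    """Server-side record of when each client was last heard from."""

    def __init__(self, timeout):
        self.timeout = timeout    # seconds of silence tolerated
        self.last_heard = {}      # client name -> timestamp of last ping

    def record_ping(self, client, now):
        """Note that `client` poked us at time `now`."""
        self.last_heard[client] = now

    def overdue(self, now):
        """Return the clients that have been silent longer than the timeout."""
        return sorted(
            client
            for client, heard in self.last_heard.items()
            if now - heard > self.timeout
        )
```

A server using something like this would call record_ping() whenever a client ping arrives and run overdue() periodically; any client it returns is a candidate for the "query and log entry" step described above.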

It would be interesting to include a "network location" attribute in the sites file and to test the hypothesis "ether cable such-and-such failed" when a machine on that cable is observed to time out (by pinging its neighbors). This would be a biggie for detecting network partitions, say from an important router being disconnected. Loss of power to an entire building is also an interesting group-failure case.

One implication of this "usually pinging from one side" strategy is that, although the interval and timeout parameters might very well be different for the two sides, some sanity check and readjustment would be done at the time they are negotiated to ensure that 99% of all pings will come from the same side. Whether the server or the client does the pinging is your option, though I suspect that having a single-tasked MeSsy-DOS client speak up rather than listen attentively is an easier proposition.

As long as you're going to the trouble of pinging, it might be worthwhile to communicate some load metric to the client, such as load average, number of active users, or number of pages on the free list. If the metric were scaled to a uniform range of 1 to 100, applications that distribute calculations across a lot of workstations could use it, in a portable way, to avoid bogging down busy hosts. The advertised metric might be artificially increased by a workstation user, or by an agent on his behalf, to control how willing or unwilling the user is at the moment to donate resources to the Isis community.

Isis clients might test-ping alternate servers and change the host they are using as an Isis server based on excessive load (good-neighbor load levelling policy) or high observed latency on responses (selfish gimme-performance policy). The interval and timeout parameters might normally be very small, to rapidly detect failure of the partner or the network, with adaptive increases if the server is becoming very busy.
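
A rough sketch of the load-metric suggestions above, again in Python and again with invented names (scale_metric, advertised_load, idle_hosts, adapt_interval); none of this is an existing Isis interface. The linear scaling and the maximum stretch factor of 4 are assumptions picked only to keep the example small.

```python
def scale_metric(raw, lo, hi):
    """Map a raw host metric (e.g. load average) onto the uniform 1..100 range.

    `lo` and `hi` are the raw values treated as "idle" and "saturated";
    readings outside that span are clamped.
    """
    if hi <= lo:
        raise ValueError("hi must exceed lo")
    frac = min(max((raw - lo) / (hi - lo), 0.0), 1.0)
    return 1 + round(frac * 99)


def advertised_load(measured, user_bias=0):
    """Let a workstation user raise (never lower) the advertised load."""
    return min(100, measured + max(0, user_bias))


def idle_hosts(loads, threshold):
    """Pick the hosts whose advertised load is at or below the threshold."""
    return sorted(host for host, load in loads.items() if load <= threshold)


def adapt_interval(base, load, max_factor=4.0):
    """Stretch the ping interval linearly as load rises from 1 to 100.

    At load 1 the interval is `base`; at load 100 it is base * max_factor.
    """
    return base * (1.0 + (max_factor - 1.0) * (load - 1) / 99.0)
```

An application farming out work could call idle_hosts() on the most recently advertised loads, and a pinger could feed the partner's load into adapt_interval() so that a busy server is poked less often.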

The log noting loss of contact with the server might also note the load average ramping up on responses just before the timeout. A lack of correlation between the server's load metric and observed latencies may indicate that the server is quite healthy but that the network is melting down. In any event, the metric returned by the server would have to be very cheap to compute, and could probably be cached by the server's Isis so that the host operating system wouldn't be interrogated for the metric more often than every 5 seconds or so.

If you choose to have the server generate most of the pings, substantial efficiencies could be obtained by taking advantage of broadcast media. Another argument for having the server generate pings is that they could be generated by simply sending messages to a standard Isis group of which all clients are members.

Also, I suspect that the "drop offline for a little while" functionality isn't too hard to do by

  A) demanding an estimate of how long you plan to crunch offline for, and
  B) renegotiating the interval and timeout parameters to much larger values, and then shrinking them back to normal when rejoining the Isis community.

In trading off the lag in detecting a failure against the resources needed for pinging, widely varying client needs would argue for choosing to make (at least some) clients generate lots of server pings.

Finally, I would like to observe that the single most important server resource in our environment is physical memory. The greatest objection that people have to running a protos process on their workstation all the time is that it always has a resident set of several hundred K, regardless of whether it is doing any real work or not. Optimizing the set of pages touched whilst assuring others that "I'm up" would be a great boon.

And thanks for the good work, Ken.

--JH