Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!cs.utexas.edu!samsung!usc!apple!oracle!jhanley
From: jhanley@oracle.com (John Hanley)
Newsgroups: comp.sys.isis
Subject: Re: Detecting client failures
Summary: suggestions for client-ISIS implementations
Keywords: isis client performance
Message-ID: <1990Mar23.093050.23923@oracle.com>
Date: 23 Mar 90 09:30:50 GMT
References: <38866@cornell.UUCP> <38955@cornell.UUCP>
Sender: jhanley@oracle.com (JH)
Reply-To: jhanley@oracle.com (John Hanley)
Organization: Oracle Corp., 12th floor 150 Spear St, San Francisco, CA  94105
Lines: 84
Bell-net: +1 415 541 5552

In article <38955@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:
>Robert Cooper has an idea for a very fancy mechanism that would let
>a client (any client) drop "offline" for a while and then come back.
>...
>We are thinking about how this could be added to ISIS.  It wouldn't be
>trivial, but could probably be done.  Also being considered are ways to
>connect remote non-UNIX clients (PC's, etc)....

I hope that the "are you alive?" probe can be done by either the server
or client, but that the result "machine1 poked machine2" is available
to both server and client, with a timestamp.  If it would mean less work
for the Isis server I suppose the client should take primary responsibility
for doing the pinging, though the overhead on the server is unclear to me.
For sanity's sake I _think_ it would make sense for one or the other of
the machines to "usually" do the vast majority of the pings, with the other
machine making a query and a log entry if no word has been received from
its partner in a while.
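
To make the "quiet side" concrete, here is a minimal sketch in Python.  It is
not Isis code; the class name PeerWatch, the quiet_limit parameter and the
injectable clock are all invented for illustration.  It only shows the
bookkeeping: every ping is recorded as "who poked whom, and when", and the
quiet side asks a question only after a long silence.

```python
import time


class PeerWatch:
    """Hypothetical bookkeeping for the side that mostly listens."""

    def __init__(self, quiet_limit, clock=time.monotonic):
        self.quiet_limit = quiet_limit   # seconds of silence we tolerate
        self.clock = clock               # injectable so it can be tested
        self.last_heard = clock()
        self.log = []                    # (timestamp, pinger, pinged)

    def heard_from(self, pinger, pinged):
        # Record "pinger poked pinged" with a timestamp.
        now = self.clock()
        self.last_heard = now
        self.log.append((now, pinger, pinged))

    def should_query(self):
        # True once the partner has been silent for too long; the caller
        # would then send its own query and write a log entry.
        return self.clock() - self.last_heard > self.quiet_limit
```

In this sketch the log lives on one machine; making the same record available
to both server and client would need an exchange that is not shown here.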

In the event that a fileserver in the machine room loses touch with 50 PC's
because a thin-ether terminator came loose, it might become aware of this
failure sooner, with less scheduling overhead, if it was expecting pings
from its clients, randomly distributed over a 60-second interval or
whatever.  It would be interesting to include a "network location" attribute
in the sites file and, when a machine on some cable is observed to time out,
to test the hypothesis "ether cable such-and-such failed" by pinging that
machine's neighbors.  This would be a biggie for detecting network
partitions, say
from an important router being disconnected.  Loss of power to an entire
building is also an interesting group-failure case.
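
A rough sketch of the cable hypothesis test, again in Python and again
hypothetical: the dictionary stands in for a sites file with a "network
location" column (the real sites file has no such field as far as I know), and
suspect_cables is an invented name.  A cable is suspected when at least one
host on it has timed out and no host on it has answered a ping.

```python
from collections import defaultdict


def suspect_cables(sites, timed_out, answered):
    """sites maps host -> cable name (the assumed "network location").

    timed_out and answered are sets of host names.  Returns the sorted
    list of cables consistent with "this whole cable has failed".
    """
    by_cable = defaultdict(set)
    for host, cable in sites.items():
        by_cable[cable].add(host)

    suspects = []
    for cable, hosts in sorted(by_cable.items()):
        some_silent = bool(hosts & timed_out)
        any_alive = bool(hosts & answered)
        if some_silent and not any_alive:
            suspects.append(cable)
    return suspects
```

A single neighbor that answers is enough to reject the hypothesis for its
cable, which points to a failure of the individual machine instead.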

One implication of this "usually pinging from one side" strategy is that
although the interval and timeout parameters might very well be different
for the two sides, at the time they are negotiated some sanity check and
readjustment would be done to ensure that 99% of all pings will come from
the same side.  Deciding whether you prefer having the server or client do
it is your option, though I suspect that having a single-tasked MeSsy-DOS
client speak up rather than listen attentively is an easier proposition.
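
One possible form of that sanity check, as a Python sketch.  The function name
negotiate and the margin of 3 are my assumptions; the margin is a knob, not
something derived from the 99% figure.  The idea is simply that the listening
side's silence limit gets stretched until it comfortably exceeds the talking
side's ping interval, so the listener rarely has cause to ping.

```python
def negotiate(primary_interval, secondary_quiet_limit, margin=3.0):
    """Return (interval, quiet_limit) after readjustment.

    primary_interval: seconds between pings from the side that does
        most of the pinging.
    secondary_quiet_limit: seconds of silence after which the other
        side sends its own query.
    margin: assumed safety factor; the quiet limit is raised to at
        least margin * primary_interval.
    """
    if primary_interval <= 0 or secondary_quiet_limit <= 0:
        raise ValueError("intervals must be positive")
    floor = margin * primary_interval
    return primary_interval, max(secondary_quiet_limit, floor)
```

With a margin of 3 the listener speaks up only after roughly three expected
pings in a row have failed to arrive.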

As long as you're going to the trouble of pinging, it might be worthwhile to
communicate some load metric to the client, such as load avg, or # of active
users, or # of pages on the free list.  If the metric was scaled to a
uniform range of 1 to 100, applications that distribute calculations across
a lot of workstations could choose to not bog down busy hosts based on this,
in a portable way.  The advertised metric might be artificially increased by
a workstation user, or by an agent on his behalf, to control how willing or
unwilling the user is at the moment to donate resources to the Isis
community.  Isis clients might test-ping alternate servers and change the
host they are using as an Isis server based on excessive load (good-neighbor
load levelling policy) or high observed latency on responses (selfish
gimme-performance policy).  The interval and timeout parameters might
normally be very small, to rapidly detect failure of the partner or the
network, with adaptive increases if the server is becoming very busy.  The
log noting loss of contact with the server might also note the load average
ramping up on responses just before the timeout.  A lack of correlation
between the server's load metric and observed latencies may indicate that
the server is quite healthy but that the network is melting down.  In any
event, the metric returned by the server would have to be very cheap to compute,
and could probably be cached by the server's Isis so that the host operating
system wouldn't be interrogated for the metric more often than every 5
seconds or so.
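
The scaling and caching could look something like the following Python sketch.
Everything here is illustrative: scale_metric, CachedLoad, raw_max and the
bias field are invented names, and the linear scaling is just one choice.  The
bias is the "artificially increased" knob a workstation user might turn.

```python
import time


def scale_metric(raw, raw_max):
    """Clamp a raw reading to [0, raw_max] and map it onto 1..100."""
    if raw_max <= 0:
        raise ValueError("raw_max must be positive")
    frac = min(max(raw / raw_max, 0.0), 1.0)
    return 1 + round(frac * 99)


class CachedLoad:
    """Asks the host for its load at most once per max_age seconds."""

    def __init__(self, read_raw, raw_max, max_age=5.0, clock=time.monotonic):
        self.read_raw = read_raw     # callable returning the raw metric
        self.raw_max = raw_max       # raw value treated as "fully loaded"
        self.max_age = max_age
        self.clock = clock
        self.bias = 0                # user-set increase, in scaled units
        self._value = None
        self._stamp = None

    def get(self):
        now = self.clock()
        if self._stamp is None or now - self._stamp >= self.max_age:
            # Cache is empty or stale: interrogate the host once.
            self._value = scale_metric(self.read_raw(), self.raw_max)
            self._stamp = now
        return min(100, max(1, self._value + self.bias))
```

On a Unix host with Python, read_raw could be something like
lambda: os.getloadavg()[0], with raw_max chosen to suit the machine.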

If you choose to have the server generate most of the pings, substantial
efficiencies could be obtained by taking advantage of broadcast media.
Another argument for having the server generate pings is that they could be
generated by simply sending messages to a standard Isis group to which all
clients belong.

Also, I suspect that the "drop offline for a little while" functionality
isn't too hard to do by
   A) demanding an estimate of how long you plan to crunch offline for
   B) renegotiating the interval and timeout parameters to much larger values,
      and then shrinking them back to normal when rejoining the Isis community
In trading off the lag in detecting a failure against the resources needed
for pinging, widely varying client needs would argue for choosing to make
(at least some) clients generate lots of server pings.

Finally, I would like to observe that the single most important server
resource in our environment is physical memory.  The greatest objection that
people have to running a protos process on their workstation all the time is
that it always has a resident set of several hundred K, regardless of
whether it is doing any real work or not.  Optimizing the set of pages touched
whilst assuring others that "I'm up" would be a great boon.

And thanks for the good work, Ken.
						--JH