Path: utzoo!attcan!uunet!zephyr.ens.tek.com!uw-beaver!cornell!ken
From: ken@gvax.cs.cornell.edu (Ken Birman)
Newsgroups: comp.sys.isis
Subject: Re: detecting client failures
Message-ID: <38995@cornell.UUCP>
Date: 23 Mar 90 17:49:38 GMT
Sender: nobody@cornell.UUCP
Reply-To: ken@cs.cornell.edu (Ken Birman)
Distribution: comp
Organization: Cornell Univ. CS Dept, Ithaca NY
Lines: 65

> From: rich@sendai.ann-arbor.mi.us (K. Richard Magill) (via email)
>   From: ken@gvax.cs.cornell.edu (Ken Birman)
>   Newsgroups: comp.sys.isis
>   Date: 22 Mar 90 15:50:46 GMT
>   Reply-To: ken@gvax.cs.cornell.edu (Ken Birman)

>	   application calls	isis_probe(freq, timeout)

>Could be tricky to set these initially.  Maybe something based on
>typical round trip message time would be easier to deal with than hard
>seconds.

I've played with round-trip numbers and they really don't work for
crashes.  Crashes tend to be unusual events -- so are long delays --
and just because you were running fast a few seconds ago, I can't
safely assume that you won't spend 10 seconds updating a display
sometime soon...  ISIS is full of hard-coded constants at this level,
as is any protocol implementation.

>   3) Implementation:
>	   client starts sending a HELLO message every freq. seconds
>	   ISIS has a timer; if it doesn't get a HELLO within freq+timeout secs
>		   it "kills off" the client

>I'd prefer that isis poll the client before killing.  I can imagine a
>situation, perhaps macOs, where the tasking and time division is such
>that periodic hello's might be difficult, but where polls could be
>answered immediately.

Good idea.  I'll add this feature.
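
To make the poll-before-kill idea concrete, here is a rough sketch of the
bookkeeping it might need on the ISIS side.  None of this is ISIS code:
the struct, the function names, and the poll_wait parameter are all
hypothetical, and time is passed in as plain seconds so the logic can be
followed without a real clock.

```c
#include <assert.h>

/* Hypothetical per-client watchdog state; none of these names come from ISIS. */
enum client_state { CLIENT_ALIVE, CLIENT_POLLED, CLIENT_DEAD };

struct watchdog {
    long freq;        /* expected seconds between HELLO messages */
    long timeout;     /* extra grace period, in seconds */
    long poll_wait;   /* seconds to wait for a poll reply (assumed parameter) */
    long last_heard;  /* time of the last HELLO or poll reply */
    long polled_at;   /* time the poll was sent; valid in CLIENT_POLLED */
    enum client_state state;
};

void watchdog_init(struct watchdog *w, long freq, long timeout,
                   long poll_wait, long now)
{
    assert(freq > 0 && timeout >= 0 && poll_wait >= 0);
    w->freq = freq;
    w->timeout = timeout;
    w->poll_wait = poll_wait;
    w->last_heard = now;
    w->polled_at = now;
    w->state = CLIENT_ALIVE;
}

/* Any HELLO or poll reply proves liveness, unless the client was already
   declared dead (it would then have to reconnect explicitly). */
void watchdog_heard(struct watchdog *w, long now)
{
    if (w->state == CLIENT_DEAD)
        return;
    w->last_heard = now;
    w->state = CLIENT_ALIVE;
}

/* Called periodically.  A silent client is polled first and is only
   declared dead if the poll also goes unanswered. */
enum client_state watchdog_tick(struct watchdog *w, long now)
{
    if (w->state == CLIENT_ALIVE &&
        now - w->last_heard > w->freq + w->timeout) {
        w->state = CLIENT_POLLED;   /* the caller would send the poll here */
        w->polled_at = now;
    } else if (w->state == CLIENT_POLLED &&
               now - w->polled_at > w->poll_wait) {
        w->state = CLIENT_DEAD;
    }
    return w->state;
}
```

With freq=5, timeout=10 and poll_wait=3, a client that goes quiet at time 0
is polled at the first tick after time 15 and is killed only if more than 3
further seconds pass without a reply.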

>	   killed client that was really alive calls isis_failed, then panics
>		   with message "killed by ISIS" unless isis_failed traps the
>		   failure (i.e. by reconnecting).

>If the client's system has a clock, which it must in order to know
>frequency and timeout, then isis on the client can recognize when it
>has missed a poll.  This implies that some kind of dynamic backoff
>might be in order before isis outright "disowns" the client.

I'm not sure what you mean by this...  The clib clock is currently
driven by SIGALRM interrupts (once per second), but the "I am alive"
message is sent only when ISIS gets scheduled.  So, what I am really
defining as liveness is that the ISIS scheduler gets scheduled at least
once every "frequency" seconds...
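
A minimal sketch of that split, assuming a POSIX system: the SIGALRM
handler only counts seconds, and the "I am alive" message goes out from a
scheduler pass.  The function and variable names are made up for
illustration and are not the clib internals; the "send" is just a counter.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t ticks;  /* seconds seen by the SIGALRM handler */
static long last_hello;              /* tick count at the last "I am alive" */
static int  hellos_sent;             /* stand-in for really sending a message */

static void on_alarm(int signo)
{
    (void)signo;
    ticks++;    /* the handler only advances the clock, it never sends */
}

/* Install the handler.  A real client would also arm a once-per-second
   timer (e.g. with alarm() or setitimer()); that is left out here so the
   sketch stays deterministic. */
int liveness_install(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGALRM, &sa, NULL);
}

/* One pass of a hypothetical scheduler loop.  Returns 1 if an
   "I am alive" message was (notionally) sent on this pass. */
int scheduler_pass(long freq)
{
    long now = ticks;
    if (now - last_hello >= freq) {
        last_hello = now;
        hellos_sent++;
        return 1;
    }
    return 0;
}
```

The point of the sketch: if the scheduler is starved for 10 seconds, the
clock keeps ticking but no message is sent until the next scheduler pass,
and then only one goes out.  That is exactly the condition the other side
would see as a missed HELLO.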

>   Robert Cooper has an idea for a very fancy mechanism that would let
>   a client (any client) drop "offline" for a while and then come back...

>I know precisely the problem here and I'd be sorely disappointed to
>see you spend the time to solve it within isis rather than with isis.

>My recommendation would be for users with this need to do one of the
>following. Use a cross between isis bcast from the server to the client
>and an rpc-ish from the client to the server.  Or, withdraw from the group
>that receives the data then rejoin.  Personally, I think I'd stay in the
>group, but register and unregister for the bcasts from server.  That
>is, message flow control, and persistence becomes the server's
>problem, not isis's.

I don't know; seems like it would be more in the spirit of a toolkit to
provide the servers with a cleaner warning that the client wants to do
this and a way to spool data conveniently.  But, you might be right.  After
all, your company would be a typical user of this sort of facility,
so if you don't see it as a necessary add-on...