Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cornell!ken
From: ken@gvax.cs.cornell.edu (Ken Birman)
Newsgroups: comp.sys.isis
Subject: Re: Recommended values for restart timeout parameters?
Message-ID: <34165@cornell.UUCP>
Date: 13 Nov 89 15:26:48 GMT
References: <34157@cornell.UUCP>
Sender: nobody@cornell.UUCP
Reply-To: ken@gvax.cs.cornell.edu (Ken Birman)
Organization: Cornell Univ. CS Dept, Ithaca NY
Lines: 115

I guess this -f option has people a bit confused. Robert's comments
are correct, of course, but I wonder if it wouldn't also help to
explain the sequencing of events controlled by these flags.

Say that your value for the -f parameter is FTIMEOUT seconds and for
-A is ATIMEOUT. Also, assume that sites A, B and C are running during
this dialog:

1. A, B, C exchange messages per your code. If your code doesn't send
   any messages at all, A sends a message to B, B to C and C to A
   every FTIMEOUT seconds. With -f10, for example, this means that B
   will send to C every 10 seconds.

   The sites are organized as a ring, and each site hears from the
   site on its left and sends to the site on its right with this
   frequency. It follows that B can monitor the status of A and C can
   monitor the status of B.

   If X gets a message from Y and there isn't any message sent from
   X to Y within a short time, X sends an ACK-only message to Y.
   ACK-only messages are not themselves acknowledged.

   ISIS initially retransmits after 4 seconds, but it varies this to
   adapt to "average" delays before an ack is received. Say that the
   average measured delay before packets are acked is RTDELAY seconds.
   ISIS keeps a running average of RTDELAY and retransmits after this
   amount of time, but never after waiting less than 2 seconds.

2. Now, say that B becomes unresponsive or crashes.

2.1 Soon after, say at time t0, A or C will try to send a message
    to B.

2.2 Not getting an ACK, A or C will retransmit this message at time,
    say, t0+4secs, t0+8secs, etc. Say that A sent the message.

2.3a After retrying MAX_RET times, currently hardwired to 3, A logs
     the message:

        "Transmitted same packet %d times, giving up (len %d)\n"

     and declares B to have failed. For the default case, this means
     that a site is declared to be down if it doesn't respond within
     12 seconds after you send it a packet, but the value could drop
     as low as 6 seconds or rise much higher, depending on how
     sluggish the destination has been.

2.3b Alternatively, after not hearing from B for a period of
     max(30,RTDELAY*FTIMEOUT/2) seconds, C declares B to have failed.
     Note that in step 1, C was expecting to hear from B periodically.
     E.g., if the current average RTDELAY is 4 seconds and you
     specified -f10, this rule kicks in after max(30,4*10/2) = 30
     seconds. For -f60, the default, this timer kicks in after
     max(30,4*60/2) = 120 seconds. This time you get the logged
     message:

        "Timeout: site %d/%d unresponsive for %d secs\n"

3. OK, now C thinks B is down. Problem is, A doesn't know. So... we
   run the "failure detection protocol", which does a sort of 2-phase
   commit. If no other site is down, this protocol runs essentially
   instantly. However, if other sites are also down, we might not
   notice the problem until now, forcing a further delay. E.g., in a
   larger system, B and C could both fail, and although some site D
   would notice that C was down, since C was supposed to watch B, we
   wouldn't find out that B was down until step 2.3a for the failure
   agreement protocol itself, slowing things down even more!

4. In the case where B was just running very slowly, it now finds out
   that someone decided it was down and prints a message:

        "fd_iamdead: %d/%d told me to die"

   then shuts down.

5. If you specified the -A parameter, after ATIMEOUT minutes
   (default = 5) the isis monitor program restarts the site and it
   should come up on its own.

So, how long should you expect all this to take? ... Depends on
whether someone was sending to the site when it crashed.
If so, a good guess is that the failure detector will run after about
12 seconds and the site will be declared down more or less
immediately. If not, then for a value like -f10, a single failure will
be detected after about 30 seconds, when the expected "keepalive"
message is not received. For the default setting of FTIMEOUT = 60, the
delay would again be something between 12 seconds and 2 minutes, and
more for multiple failures.

We recommend that most sites use -f60 (the default) when using ISIS
for actual monitoring or control of an application. The problem is
that slow machines, NFS servers that hang while printing messages, YP
servers that hang briefly, and even clock resets can all trigger
cascades of failures if the parameters are set to detect failures
quickly. For networks of very uniform machines that never hang due to
NFS or YP problems, I guess one could run with a value like -f15 or
-f10, but obviously this won't be true for a network of overloaded
SUN 2 workstations...

Now, regarding one of the other issues that was raised, let me comment
on network partitions. A network partition occurs if for some reason
sites can't talk to each other. Say that B was temporarily unplugged
from the net and that this is what triggered all the problems. If a
large number of sites were up (> 3), ISIS kills off sites on the
minority side of the partition, so B would crash in this case with the
message:

   fd_localcommit: Possible partition, with this site being in minority partition

Otherwise, B might not be killed off and would just keep running as a
partition with a single site.

Even if B is killed off, after a while (after the ATIMEOUT parameter
expires) B will restart. At this point, it won't find the other ISIS
sites on the network and will restart itself. It will then be running
as a partition with one site in it. This last situation will be
corrected in the version of ISIS that IDS will release sometime in
1990.
Meanwhile, you just need to be aware that it represents a problem...

Please let us know if you observe event sequences that don't match
this algorithm. My guess is that it "explains" all behaviors people
have managed to get out of ISIS.

Ken
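
For anyone who wants to plug in their own numbers, the two timing rules
from steps 2.3a and 2.3b can be sketched roughly as below. This is only
an illustration of the arithmetic described in this posting, not code
from ISIS: the function names and the constants MIN_RTDELAY and
MIN_KEEPALIVE are made up for the example, and it assumes every retry
waits the same (floored) RTDELAY.

```python
# Illustrative sketch only; names below are hypothetical, not ISIS identifiers
# (except MAX_RET, which the posting says is hardwired to 3).
MAX_RET = 3            # retransmissions before giving up (step 2.3a)
MIN_RTDELAY = 2.0      # assumed name: never retransmit sooner than 2 seconds
MIN_KEEPALIVE = 30.0   # assumed name: the 30 in max(30, RTDELAY*FTIMEOUT/2)


def retransmit_giveup_secs(rtdelay):
    """Step 2.3a: time until the sender gives up on an unacked packet."""
    return MAX_RET * max(MIN_RTDELAY, rtdelay)


def keepalive_timeout_secs(rtdelay, ftimeout):
    """Step 2.3b: silence tolerated by the ring neighbor watching a site."""
    return max(MIN_KEEPALIVE, rtdelay * ftimeout / 2)


def detection_window_secs(rtdelay, ftimeout):
    """(best case, idle case) detection delay for a single failure."""
    return (retransmit_giveup_secs(rtdelay),
            keepalive_timeout_secs(rtdelay, ftimeout))


if __name__ == "__main__":
    for f in (10, 60):
        print("-f%d:" % f, detection_window_secs(4.0, f))
```

With RTDELAY at its initial 4 seconds this reproduces the figures quoted
above: 12 seconds when someone is sending to the failed site, and 30
seconds (-f10) or 120 seconds (-f60) when nobody is.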