Path: utzoo!attcan!uunet!lll-winken!ames!think!barmar
From: barmar@think.COM (Barry Margolin)
Newsgroups: comp.protocols.tcp-ip
Subject: Re: SO_KEEPALIVE considered harmful?
Message-ID: <20761@news.Think.COM>
Date: 25 May 89 16:32:31 GMT
References: <8905250638.AA21706@ucbvax.Berkeley.EDU>
Sender: news@Think.COM
Reply-To: barmar@kulla.think.com.UUCP (Barry Margolin)
Organization: Thinking Machines Corporation, Cambridge, MA
Lines: 49

In article <8905250638.AA21706@ucbvax.Berkeley.EDU> dcrocker@AHWAHNEE.STANFORD.EDU (Dave Crocker) writes:
>It is worth adding that the excessive use of keepalives has removed a
>feature that used to be in TCP and has been recently re-documented by
>Bob Braden:  TCP used to be remarkably robust against temporary
>outages.  If you were willing to wait, so was TCP.  Now, an outage of
>a very short time -- on some implementations, as short as 1-2 minutes --
>will abort the connection.

I dispute this claim.  TCP is only robust against temporary outages if
you don't try to use the connection during that period.  For instance,
if I'm using telnet, the connection will stay alive during outages if
I don't type anything to the client or the host doesn't try to send
any output.  If either end tries to use the connection, and the outage
lasts longer than TCP is willing to keep retransmitting unacknowledged
data, then the connection will die.  If I happen to know that the
network is having trouble I won't type anything, but how often is this
the case?  What it mostly means is that a temporary outage after I go
home won't break my connections.
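
To make the arithmetic concrete, here is a rough sketch of when a
sender gives up if no acknowledgement ever arrives.  The backoff
scheme (doubling with a cap) and all three default numbers are
assumptions picked for illustration; real implementations differ.
The function names are hypothetical.

```python
def seconds_until_abort(initial_rto=1.0, max_rto=64.0, max_retransmits=12):
    """Total time a sender waits before aborting if no ACK ever arrives.

    The segment is sent once, then retransmitted max_retransmits times.
    Each wait is double the previous one, capped at max_rto.
    All three defaults are hypothetical, chosen only for illustration.
    """
    total = 0.0
    rto = initial_rto
    # One wait after the original send plus one after each retransmission.
    for _ in range(max_retransmits + 1):
        total += rto
        rto = min(rto * 2, max_rto)
    return total


def connection_survives(outage_seconds, data_sent_during_outage, **timer):
    """An idle connection always survives; a busy one must outlast the outage.

    Simplification: data is assumed to be sent right as the outage starts.
    """
    if not data_sent_during_outage:
        return True
    return outage_seconds < seconds_until_abort(**timer)
```

With these made-up defaults seconds_until_abort() comes to 511.0
seconds, so an hour-long outage kills a connection that had data in
flight but leaves an idle one untouched.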

TCP's robustness is still a good idea.  It's nice to be able to swap
Ethernet cables without causing all the network connections to die.
But in my experience (which, I admit, isn't all that extensive), any
link that is down for more than a minute or two probably isn't
going to come back.

What I mostly care about, though, is finding out when the other end
has definitely reinitialized, e.g. it has crashed and been rebooted.
If it's a telnet server that crashed I can do this by typing into the
client, which will provoke a reset, and the client will abort.  But if
it's the telnet client or an X server that died, there's often no way
to force the other end to try to send something so it will get a
reset.

I think the right solution is a compromise.  What's needed is a way to
send a keepalive segment with infinite (or near-infinite, e.g. hours or
a day) retransmissions and a slow retransmit rate (one to two minutes).
This would allow idle connections to stay up across most network
failures, but they would still die within a minute or so of the other
end rebooting.
And, of course, it should be optional, so that applications that
perform frequent output of their own need not compound their network
use (although since keepalives need only be sent when there are no
normal packets in the retransmit queue, any application whose output
rate is higher than the keepalive rate will never invoke the keepalive
mechanism).
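
For illustration only, here is roughly how the compromise might be
expressed as per-socket options.  SO_KEEPALIVE is the option under
discussion; TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are tuning
options found on Linux and some other present-day systems, so they
are an assumption about the platform rather than anything this post
relies on.  The helper names and the default values (probe every two
minutes, give up after about four hours) are hypothetical.

```python
import socket


def enable_slow_keepalive(sock, idle=120, interval=120, probes=120):
    """Turn on keepalives with a slow probe rate and a long give-up time.

    idle, interval and probes are illustrative values, not recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options are platform-specific, so set each
    # one only if this platform's socket module exposes it.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


def tolerated_silence_seconds(idle=120, interval=120, probes=120):
    """Roughly how long an unreachable peer is tolerated before the abort."""
    return idle + interval * probes
```

An application that wants this behavior would call
enable_slow_keepalive(sock) once after creating the socket; one that
doesn't simply never sets the option.  A peer that has rebooted
answers the next probe with a reset, so that case is detected within
one probe interval instead of after the full give-up time.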

Barry Margolin
Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar