Path: utzoo!utgpu!watserv1!watmath!att!att!linac!pacific.mps.ohio-state.edu!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!emory!hubcap!gatech!purdue!mentor.cc.purdue.edu!noose.ecn.purdue.edu!en.ecn.purdue.edu!milton
From: milton@en.ecn.purdue.edu (Milton D Miller)
Newsgroups: comp.windows.x
Subject: Re: R5 wish list -- client recovery
Summary: keepalives, if done right are ok;
    see also the discussion in comp.protocols.tcp-ip
Keywords: Keep Alives, TCP
Message-ID: <1990Nov20.001707.11343@en.ecn.purdue.edu>
Date: 20 Nov 90 00:17:07 GMT
References: <9011191810.AA01979@milton.u.washington.edu>
Organization: Purdue University Engineering Computer Network
Lines: 69

In article <9011191810.AA01979@milton.u.washington.edu>,
    donn@MILTON.U.WASHINGTON.EDU (Donn Cave) writes:
>excerpts from <6220@lanl.gov> (Dale Carstensen):
>
>>  ... with X11R3 and X11R4 ... if the server dies ... remote clients that
>> had been connected to it continue to run, usually.
>
>Even if you wanted to use xdm, it isn't a complete cure for this one, since
>it only gets the clients directly descended from it - not clients started from
>the shell command line, not clients running on other hosts.
>
>We were able to install a fairly trivial patch to Xlib, so that the socket
>is created with a keep-alive option. 
...
>Unfortunately, it seems to slightly aggravate Problem 2:
>
>> On the other hand, unreliability in the network connection between client
>> and server can terminate clients while the server is still running.
>
>We get a lot of "Network down" (E_NETDOWN) errors, particularly on one host
>whose Ethernet hardware is not the industry's best.  
....


There is currently a discussion in comp.protocols.tcp-ip about (not using)
keepalives.  As was pointed out there today:

>	From: barmar@think.com (Barry Margolin)
>	Subject: Re: Warning: Keep-Alive considered harmful

[excerpt follows:]
... The connection shouldn't be killed as a result of keep-alive timeouts.
Instead, the purpose of keep-alives should be to elicit RSTs from the other
host.  Timeouts can be due to any number of reasons, but a RST indicates
unambiguously that the connection is unusable, because the other end
rebooted or closed the connection itself (perhaps network problems
prevented the FIN from getting through).  If a host crashes, the keepalive
won't actually notice this until it comes back up, which is probably good
enough.
[end of excerpt]
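
To make the distinction concrete, here is a rough sketch (in Python for
brevity; nothing like this is in Xlib, and the names enable_keepalive
and classify_error are made up for illustration).  It turns on
SO_KEEPALIVE and then sorts socket errors into "the peer sent a RST"
versus "the network may just be flaky".  Note that on most stacks the
kernel has already dropped the connection by the time a timeout is
reported, so this only tells the application *why* it lost the
connection; it cannot bring the connection back.

```python
import errno
import socket

def enable_keepalive(sock):
    """Ask TCP to probe an otherwise idle connection (SO_KEEPALIVE)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# The peer answered with a RST (or we wrote to a connection already
# known to be reset): the connection is unambiguously unusable.
PEER_RESET = {errno.ECONNRESET, errno.EPIPE}

# Timeouts and routing/interface trouble: the peer may still be alive.
# Which errors belong in this set is an assumption made for this sketch.
NETWORK_TROUBLE = {errno.ETIMEDOUT, errno.ENETDOWN,
                   errno.ENETUNREACH, errno.EHOSTUNREACH}

def classify_error(err):
    """Return 'peer-reset', 'network-trouble', or 'other' for an OSError."""
    if err.errno in PEER_RESET:
        return "peer-reset"
    if err.errno in NETWORK_TROUBLE:
        return "network-trouble"
    return "other"
```

A client could call enable_keepalive() right after creating the socket
and consult classify_error() in its I/O error handler, exiting quietly
only in the 'peer-reset' case.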

I agree :-)  Also notice that, in the case of X terminals, there is usually
a "close all connections" option, which is essentially a reboot of the
TCP.  The other end is probably not sent a FIN or RST, and the condition
won't show up until the other end is poked.  (For xterms, invoking "write"
to yourself will usually push the connection into destruction, and may
return a "Not logged on there" to your write command.)

>I'm told that
>ftp and telnet re-try when they encounter these conditions - would such
>re-trying be another trivial modification to Xlib?

It is the responsibility of TCP to do the retrying.  It *should* be up
to the application to decide when to give up (see also the Host
Requirements RFC, RFC 1122), but that control is not generally
available :-(.

>Are there fundamental
>reasons why X should give up immediately when it encounters this network error?
>

None that I can think of; the servers already buffer for each client,
and I don't see why the clients shouldn't buffer for the server when not
explicitly requesting a sync (do they do this already?).

They (clients) may need to give up if the buffering is taking too much
space; and response time may suffer with no apparent reason (to the
other servers/users) if multiple servers are open.
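
As a sketch of what "buffer, but give up when it takes too much space"
might look like (again Python for brevity; this is not how Xlib does
its output buffering, and OutputBuffer, BufferFull and max_bytes are
names invented here):

```python
class BufferFull(Exception):
    """Raised when queued output would exceed the configured limit."""

class OutputBuffer:
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes      # hypothetical per-server limit
        self.pending = bytearray()      # requests not yet accepted by TCP

    def queue(self, data):
        """Hold data for the server; give up if it would take too much space."""
        if len(self.pending) + len(data) > self.max_bytes:
            raise BufferFull("server is not draining its connection")
        self.pending += data

    def flush(self, send):
        """Push pending bytes through send(), which returns the count it took
        (like socket.send).  Stops when send() accepts nothing.
        Returns the number of bytes still pending."""
        while self.pending:
            taken = send(bytes(self.pending))
            if taken == 0:
                break
            del self.pending[:taken]
        return len(self.pending)
```

A client with several servers open would keep one such buffer per
connection, so that a stalled server costs memory but need not block
output to the others.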

milton