Path: utzoo!attcan!uunet!cs.utexas.edu!usc!apple!bloom-beacon!stellar.stellar.COM!stevep
From: stevep@stellar.stellar.COM (Steve Pitschke)
Newsgroups: comp.windows.x
Subject: Re: XIO errors again
Message-ID: <8907232312.AA17017@expire.lcs.mit.edu>
Date: 23 Jul 89 23:12:12 GMT
References: <579@elan.elan.com>
Sender: daemon@bloom-beacon.MIT.EDU
Organization: The Internet
Lines: 44


>> This subject has appeared before, but I never heard any real definitive
>> answers or solutions to the problem.  The problem is that sometimes an
>> X client seems to fall behind the server, or a very large amount of data
>> is being sent between the client and the server, and the server appears
>> to send a KillClient, and consequently the client dies.  I have heard
>> some say that there is a bug in writev and it returns an incorrect
>> error code.  Others have said that it is caused by buggy unix domain
>> sockets (we've gotten the error when client and server were on the same
>> machine and when they were not).  In any case, it is causing us a lot
>> of grief, so I was wondering if anyone has found a fix, a good explanation,
>> or even a "Fixed in R4" comment.  Thanks!
>> -- 
>> Jeff Lo, Elan Computer Group, Inc.
>> jlo@elan.com, ..!{ames,uunet}!elan!jlo
>> 888 Villa Street, Third Floor, Mountain View, CA 94041, 415-964-2200

I spent a fair amount of time tracking down cases of this for our implementation
and so have some information for you.  The general rule for the sample
implementation's server socket calls (in libos) is to perform the system call
and, if it returns an error, to silently close() the socket, leaving the user
in the dark.

(What we do here is send any error messages out through the syslog daemon :=)

Two things that can cause the error, both of which we have actually observed, are:

	1) Under heavy load the system (if it is a Unix (tm) derivative) returns
	   either ENOBUFS or ENOMEM when the X server tries to write into the
	   socket.

	2) During the X connection handshake, the server saves the time at
	   which the connection handshake started, and if the handshake does
	   not complete before a time out period (default 60 sec.), again
	   silently close()s the connection.
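
To make case 1 concrete, here is a rough sketch of a write wrapper that logs
the reason before closing.  This is not the sample server's code:
is_resource_error() and logged_write() are names I made up for illustration,
and a real server would also have to treat EINTR and EWOULDBLOCK as
non-fatal, which this sketch leaves out.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <syslog.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 when errno value `err` means the system
 * ran short of buffers or memory, as opposed to the peer going away. */
static int is_resource_error(int err)
{
    return err == ENOBUFS || err == ENOMEM;
}

/* Hypothetical wrapper: write to a client socket.  On failure, report the
 * reason through syslog and only then close the descriptor, instead of
 * closing silently.  Returns the byte count from write(), or -1 after
 * closing fd, with errno preserved from the failed write(). */
static ssize_t logged_write(int fd, const void *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL);
    n = write(fd, buf, len);
    if (n < 0) {
        int err = errno;                 /* save before syslog/close */
        int shortage = is_resource_error(err);

        syslog(shortage ? LOG_WARNING : LOG_ERR,
               "client fd %d: write failed: %s%s", fd, strerror(err),
               shortage ? " (system short of buffers or memory)" : "");
        close(fd);
        errno = err;                     /* close() may have changed it */
        return -1;
    }
    return n;
}
```

With something like this in place, an ENOBUFS under load shows up in the
system log as a warning rather than as an unexplained dead client.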

The two cases can be differentiated via the XIO message.  In the latter case,
0 requests will have been processed.  (As a heuristic, using time out values
in non-real time O.S.'s often works, but can infrequently fail. :=)

I believe what needs to be done is for the server implementor to write
meaningful error messages to a message log when either of these cases
occurs.  You may then be able to reconfigure your O.S. or your use of X to
avoid the heavy-load situations that cause the underlying problem.  Having an
error message is a necessary first step toward recognizing what the problem was.
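
For the handshake time-out case, the logging could look roughly like the
following.  Again this is only a sketch under my own assumptions: the
pending_conn record, the HANDSHAKE_TIMEOUT_SECS name and both functions are
invented for illustration, and whether the real check is ">" or ">=" I have
not verified.

```c
#include <assert.h>
#include <syslog.h>
#include <time.h>
#include <unistd.h>

/* Default time-out quoted above; the macro name is hypothetical. */
#define HANDSHAKE_TIMEOUT_SECS 60

/* Hypothetical record for a connection whose handshake is in progress. */
struct pending_conn {
    int    fd;       /* client socket, or -1 once closed */
    time_t started;  /* time at which the handshake began */
};

/* Returns 1 if the handshake has been pending longer than the time-out. */
static int handshake_expired(const struct pending_conn *c, time_t now)
{
    assert(c != NULL);
    return difftime(now, c->started) > HANDSHAKE_TIMEOUT_SECS;
}

/* Close an overdue connection, but say so in the log first.
 * Returns 1 if the connection was closed, 0 if it is still within time. */
static int reap_if_expired(struct pending_conn *c, time_t now)
{
    if (!handshake_expired(c, now))
        return 0;
    syslog(LOG_WARNING,
           "client fd %d: connection handshake not complete after %d sec; "
           "closing", c->fd, HANDSHAKE_TIMEOUT_SECS);
    close(c->fd);
    c->fd = -1;
    return 1;
}
```

A log line of this kind, together with the "0 requests processed" in the
client's XIO message, should be enough to tell the time-out apart from the
buffer shortage.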