Path: utzoo!attcan!uunet!cs.utexas.edu!usc!apple!bloom-beacon!stellar.stellar.COM!stevep
From: stevep@stellar.stellar.COM (Steve Pitschke)
Newsgroups: comp.windows.x
Subject: Re: XIO errors again
Message-ID: <8907232312.AA17017@expire.lcs.mit.edu>
Date: 23 Jul 89 23:12:12 GMT
References: <579@elan.elan.com>
Sender: daemon@bloom-beacon.MIT.EDU
Organization: The Internet
Lines: 44

>> This subject has appeared before, but I never heard any real definitive
>> answers or solutions to the problem. The problem is that sometimes an
>> X client seems to fall behind the server, or a very large amount of data
>> is being sent between the client and the server, and the server appears
>> to send a KillClient, and consequently the client dies. I have heard
>> some say that there is a bug in writev and it returns an incorrect
>> error code. Others have said that it is caused by buggy unix domain
>> sockets (we've gotten the error when client and server were on the same
>> machine and when they were not). In any case, it is causing us a lot
>> of grief, so I was wondering if anyone has found a fix, a good explanation,
>> or even a "Fixed in R4" comment. Thanks!
>> --
>> Jeff Lo, Elan Computer Group, Inc.
>> jlo@elan.com, ..!{ames,uunet}!elan!jlo
>> 888 Villa Street, Third Floor, Mountain View, CA 94041, 415-964-2200

I spent a fair amount of time tracking down cases of this for our
implementation, so I have some information for you.

The general rule for the sample implementation's server socket calls (in
libos) is to perform the system call and, if it returns an error, to
silently close() the socket, leaving the user in the dark. (What we do
here is send any error messages out through the syslog daemon :=)

Two things that can cause the error, both of which we have actually
observed, are:

1) Under heavy load, the system (if it is a Unix (tm) derivative) returns
   either ENOBUFS or ENOMEM when the X server tries to write into the
   socket.
2) During the X connection handshake, the server saves the time at which
   the handshake started and, if the handshake does not complete within a
   timeout period (60 seconds by default), again silently close()s the
   connection.

The two cases can be distinguished by the XIO message: in the latter case,
0 requests will have been processed. (As a heuristic, using timeout values
in non-real-time OSes often works, but can occasionally fail. :=)

I believe what needs to be done is for the server implementor to write
meaningful error messages to a message log when either of these cases
occurs. You may then be able to reconfigure your OS, or your use of X, to
avoid the heavy-load situations that cause the underlying problem. Having
an error message is a necessary first step toward recognizing what the
problem was.
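A minimal C sketch of the "log it instead of closing silently" idea, assuming a POSIX system with syslog(3). The names write_failure_reason() and write_or_log_close() are hypothetical, not part of the sample server's libos; the point is only to record errno (with ENOBUFS and ENOMEM flagged as likely heavy load) before the socket is closed.

```c
#include <errno.h>
#include <string.h>
#include <syslog.h>
#include <unistd.h>

/* Hypothetical helper: map a write() errno to a human-readable cause. */
static const char *
write_failure_reason(int err)
{
    switch (err) {
    case ENOBUFS:
    case ENOMEM:
        return "kernel out of buffer space (heavy load?)";
    case EPIPE:
    case ECONNRESET:
        return "client closed the connection";
    default:
        return "unexpected write error";
    }
}

/* Hypothetical wrapper: write to a client socket and, on a fatal error,
 * log the reason via syslog before closing it.
 * Returns the byte count on success.  Returns -1 with errno set on
 * failure; for EINTR/EAGAIN/EWOULDBLOCK the socket is left open,
 * for anything else it has already been closed. */
static ssize_t
write_or_log_close(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);
    if (n >= 0)
        return n;

    int err = errno;
    if (err == EINTR || err == EAGAIN || err == EWOULDBLOCK) {
        errno = err;
        return -1;              /* transient: caller may retry later */
    }

    syslog(LOG_WARNING, "X server: closing client fd %d: %s (%s)",
           fd, strerror(err), write_failure_reason(err));
    close(fd);
    errno = err;                /* syslog/close may have clobbered errno */
    return -1;
}
```

A caller that gets -1 with errno EINTR, EAGAIN or EWOULDBLOCK still owns the socket; any other -1 means the socket is gone and the reason is in the system log. Writing to a socket whose peer has disappeared also raises SIGPIPE unless that signal is ignored, in which case write() fails with EPIPE instead.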
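The handshake timeout in case 2 could log in the same way. This sketch assumes a hypothetical per-connection record, struct pending_client, and a hypothetical check, handshake_timed_out(); the sample server keeps its own bookkeeping, so treat this only as an illustration of the check, with the 60-second default taken from the text.

```c
#include <stdbool.h>
#include <syslog.h>
#include <time.h>

#define HANDSHAKE_TIMEOUT_SECS 60   /* default timeout cited in the post */

/* Hypothetical per-connection record; the real server has its own. */
struct pending_client {
    int    fd;
    time_t handshake_start;   /* when connection setup began */
    bool   setup_complete;    /* set once the setup reply has been sent */
};

/* Returns true if the connection should be dropped because the
 * handshake did not finish in time, logging the reason first
 * rather than closing silently.  The caller does the close(). */
static bool
handshake_timed_out(const struct pending_client *c, time_t now)
{
    if (c->setup_complete)
        return false;

    double elapsed = difftime(now, c->handshake_start);
    if (elapsed < HANDSHAKE_TIMEOUT_SECS)
        return false;

    syslog(LOG_NOTICE,
           "X server: client fd %d did not complete connection setup "
           "within %d s (%.0f s elapsed); closing",
           c->fd, HANDSHAKE_TIMEOUT_SECS, elapsed);
    return true;
}
```

Because the check compares wall-clock times, a heavily loaded machine can still trip it spuriously; the log line is what lets you tell that case apart from the ENOBUFS/ENOMEM one.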