Path: utzoo!attcan!uunet!husc6!mailrus!ames!pasteur!ucbvax!NCS.DND.CA!netcoor
From: netcoor@NCS.DND.CA (DRENET Coordinator)
Newsgroups: comp.protocols.tcp-ip
Subject: Mail Delivery Problems
Message-ID: <8811231833.AA14239@ncs.dnd.ca>
Date: 23 Nov 88 18:33:10 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The Internet
Lines: 112

Over the past several months I have noticed some problems sending mail
to some Internet sites. Initially it seemed to be specific to one host
in the DREnet (for which I am the Network Coordinator) and one other
host in the Internet. That problem was solved by routing mail through
an intermediate host which was able to deliver the message. Recently,
however, other users have brought some newer occurrences of this same
problem to my attention, each one affecting a different host. In each
case, mail from our networks cannot be sent to the affected host, but
mail from the host can reach us.

The problem is not a routing or a reachability problem. When I monitor
attempts to send mail to the affected hosts, I see that a TCP
connection is successfully established, but little or no data transfer
occurs. It appears to me that the initial SMTP handshaking is
occurring, but that none of the data (i.e., the message following the
DATA command) is getting through (note that I can't be sure of this; it
just makes sense given what I have seen). I see the send queue (via
netstat) grow to some size and then stay at that size until the
connection times out. The syslog entries show "read: reply error" when
the connection breaks, and sometimes the host at the other end will
follow up by sending a message saying that the receipt of the message
failed when a read timed out. I can telnet to the SMTP port on the
affected host and type the message myself without any problems.

It may be related to packet size.
I say this because all the TCP connection handshake packets and the
SMTP handshake are generally small, and the packets following the DATA
command would be significantly larger (given a reasonably sized
message). Also, I can FTP to any of the sites involved and can retrieve
files without difficulty (anonymous ftp). However, attempts to send
files to the only known affected system that permits it fail once the
actual transfer of the file starts. Directory listings, cwd's, and ftp
mode commands (i.e., the bin and ascii commands) all work.

I am at a loss to explain this. I can't see why this would happen,
given that a TCP connection is established successfully and some
packets can get through. I don't think it is our systems here that are
at fault, as they are able to mail to many other Internet sites without
problems. Further, we have a variety of systems here and all have the
same problem with the affected hosts.

If anyone can provide any clues or suggestions or answers to this
problem, I'd be glad to hear them. I admit that I am stumped. I have
included below a message I received from another DREnet user describing
his view of the problem.

Bob Bradford
DREnet Coordinator

============================================================================

From irwin@red.ipsa.dnd.ca Fri Nov 11 20:16:47 1988
Received: from red.ipsa.dnd.ca (red.ipsa.dnd.ca.ARPA) by ncs.dnd.ca; (4.12/4.7)
	id AA21100; Fri, 11 Nov 88 20:16:19 est
Received: by red.ipsa.dnd.ca; (5.54/4.7)
	id AA02802; Fri, 11 Nov 88 20:17:12 EST
Message-Id: <8811120117.AA02802@red.ipsa.dnd.ca>
To: drenet-problems@ncs.dnd.ca
Cc: dan@red.ipsa.dnd.ca, irwin@red.ipsa.dnd.ca
Subject: TCP packets lost
Date: Fri, 11 Nov 88 20:17:09 EST
From: irwin@red.ipsa.dnd.ca
Status: RO

We are experiencing a problem here in which packets sent from this host
do not reach the destination host. For some unknown reason, this only
shows up in sendmail to certain hosts.
When I try to send a mail message to one host (a BSD system), the
following sequence of events occurs:

1. There seems to be some difficulty in establishing the initial
   connection. Often, the initial SYN packet must be retransmitted a
   number of times, or the connection appears to be dropped and picked
   up again (I'm not sure about this).

2. Eventually, the connection is established, and this host sends 2-4
   packets that are acknowledged by data and/or ACK packets from the
   remote host. (The normal state of affairs.)

3. This host sends a further packet, and no acknowledgement is
   received.

4. This host retransmits the packet the maximum number of times (10 in
   BSD 4.2), receives no acknowledgement, and drops the connection.

I have turned on the kernel switches that enable the TCP protocol trace
printing code and watched this happen. There is another switch in the
kernel which tells it to do a steeper exponential retransmit timeout,
essentially

	numtries++;
	timeout = clip (timeout << numtries, MIN_TIMEOUT, MAX_TIMEOUT);

I tried this too, thinking that an acknowledgement might arrive if I
gave it more time. This did not work. (It's nice to have kernel source,
though :-).)

It seems, however, that packets can still be received. When sending
mail to a Tops-20 system, the mailer connection is automatically logged
out because it is idle too long. The packet carrying the autologout
message is received here (sendmail prints it), but I haven't traced
this, so I don't know what acknowledgement the packet carries.

One other point: I tried to send a message by talking directly to one
mailer with telnet from this host. That worked. I have no idea why this
only happens with sendmail.

As an aside, it seems to me that a steeper exponential retransmit
timeout is not a bad idea for a host like ours with an indirect
connection to the Internet. Any words of wisdom on the subject?

Irwin Meisels
irwin@red.ipsa.dnd.ca
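
For anyone who wants to see how quickly the steeper retransmit timeout
quoted above grows, here is a minimal sketch in C. It is not the BSD
kernel code. The struct rexmt type, the next_timeout() function, the
body of clip(), and the MIN_TIMEOUT and MAX_TIMEOUT values are all
assumptions made up for illustration; only the two-statement update
rule comes from the message.

```c
#include <assert.h>

/* Hypothetical bounds in arbitrary ticks; the message gives no real values. */
#define MIN_TIMEOUT 1
#define MAX_TIMEOUT 64

/* Hypothetical per-connection retransmit state. */
struct rexmt {
    int numtries;   /* retransmissions attempted so far */
    int timeout;    /* current retransmit timeout, in ticks */
};

/* Assumed meaning of clip(): force v into the range [lo, hi]. */
static int clip(int v, int lo, int hi)
{
    if (v < lo)
        return lo;
    if (v > hi)
        return hi;
    return v;
}

/*
 * Apply the update rule from the message once and return the new timeout.
 * The shift count grows by one on every try, so before clipping the
 * timeout is multiplied by 2, then 4, then 8, and so on, which is steeper
 * than plain doubling.
 */
static int next_timeout(struct rexmt *r)
{
    r->numtries++;
    /* Keep the shift small enough that it cannot overflow an int,
       given that timeout never exceeds MAX_TIMEOUT. */
    assert(r->numtries <= 16);
    r->timeout = clip(r->timeout << r->numtries, MIN_TIMEOUT, MAX_TIMEOUT);
    return r->timeout;
}
```

With these assumed bounds and a starting timeout of 1 tick, successive
calls return 2, 8 and 64, and then stay at 64 because of the clip. The
upper bound is reached after only three retransmissions, so most of the
remaining tries simply wait MAX_TIMEOUT each.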