Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!rutgers!ames!ucbcad!ucbvax!SUN.COM!nowicki
From: nowicki@SUN.COM.UUCP
Newsgroups: mod.protocols.tcp-ip
Subject: Congestion
Message-ID: <8702112124.AA00479@rose.sun.com>
Date: Wed, 11-Feb-87 16:24:18 EST
Article-I.D.: rose.8702112124.AA00479
Posted: Wed Feb 11 16:24:18 1987
Date-Received: Fri, 13-Feb-87 22:14:30 EST
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 35
Approved: tcp-ip@sri-nic.arpa

I am not sure which is the right group for this discussion, but the
recent congestion problems have brought up two important points.

First, Berkeley's MX record support for sendmail does not do any
caching of its own.  Perhaps they thought the local name server would
cache, but that does not help when the desired name server is down,
since there is no answer to cache.  For example, last week
Decwrl.DEC.COM was essentially unreachable from the Arpanet.  The
DEC.COM name servers are either on the other side of Decwrl (128.45),
or behind other unreliable gateways (net 36).  Thus mail started to
pile up, and we quickly had hundreds of messages sitting in the queue.
Each run through the queue did hundreds of MX lookups, each of which
had to time out.  I extended our simple cache (which already remembered
whether hosts are up or down) to cache the result of the MX request
(especially if the request timed out).  This got the queue flowing again.
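
As an illustration only (this is not the actual sendmail change), here
is a minimal Python sketch of a cache that remembers MX results,
including lookups that timed out.  The class name MXCache, the resolver
callable, and both lifetimes are assumptions made for the example.

```python
import time


class MXCache:
    """Cache MX lookup results, including timeouts (negative caching)."""

    def __init__(self, resolver, ttl=600.0, failure_ttl=300.0,
                 clock=time.monotonic):
        # resolver is a hypothetical callable: domain -> list of MX hosts.
        # It is assumed to raise TimeoutError when no name server answers.
        self._resolver = resolver
        self._ttl = ttl                  # lifetime of a good answer (assumed)
        self._failure_ttl = failure_ttl  # lifetime of a timeout (assumed)
        self._clock = clock
        self._entries = {}               # domain -> (expires_at, hosts or None)

    def lookup(self, domain):
        """Return the MX hosts, or None if the lookup is known to time out."""
        key = domain.lower()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]              # still fresh: no network traffic
        try:
            hosts = self._resolver(key)
            expires = now + self._ttl
        except TimeoutError:
            hosts = None                 # remember the failure as well
            expires = now + self._failure_ttl
        self._entries[key] = (expires, hosts)
        return hosts
```

With a cache like this, a queue run over hundreds of messages for one
unreachable domain pays for a single timeout; every other message sees
the cached failure until it expires.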

Second, there seems to be a bug in the HDH code of the PSNs (aka
IMPs).  During periods of congestion, the HDLC layer blocks us from
sending back the "Host Up" messages that are required in HDH.  The PSN
then declares us to be down, clears its buffers, then immediately hears
the Host Up message and declares us to be back up.  This happens every
few minutes during the day.  Throwing the buffered data away not only
increases congestion in the short term by causing more retransmissions,
but also causes higher-level instabilities.  If a host tries to send us
a TCP segment or ACK while the IMP thinks we are down, it gets a "Host
Dead" message and resets the TCP connection, which means the entire
mail message has to be retransmitted.  This just makes matters worse.
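
The flapping described above can be shown with a toy model in Python.
This is a sketch under assumptions, not the PSN's real logic: the
function name, the dead_after threshold, the time units, and the event
strings are all hypothetical.  It only shows how one delayed Host Up
produces a down transition followed at once by an up transition.

```python
def simulate(host_up_times, dead_after, end_time, step=1):
    """Return the (time, event) transitions seen by a toy PSN.

    host_up_times: times at which a Host Up actually reaches the PSN.
    dead_after:    silence (in the same units) after which the toy PSN
                   declares the host down (an assumed threshold).
    """
    events = []
    up = True
    last_heard = 0
    pending = sorted(host_up_times)
    t = 0
    while t <= end_time:
        # Deliver every Host Up that has arrived by time t.
        while pending and pending[0] <= t:
            last_heard = pending.pop(0)
            if not up:
                up = True
                events.append((t, "host up"))
        # Too long without a Host Up: declare the host down.
        if up and t - last_heard > dead_after:
            up = False
            events.append((t, "host down, buffers cleared"))
        t += step
    return events
```

For example, if the Host Up due at time 30 is held back until time 45,
simulate([0, 10, 20, 45, 50], dead_after=15, end_time=60) returns
[(36, "host down, buffers cleared"), (45, "host up")].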

I have tried to contact BBN about the second problem, since it is a bug
in their software, but I keep getting the run-around.  The NOC people
just say "must be congestion".  I KNOW it is congestion, but it still
is a bug!  Does anyone at BBN read these lists?

	-- Bill Nowicki
	   Sun Microsystems