Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!mit-eddie!bloom-beacon!oberon!cit-vax!elroy!ames!sdcsvax!ucbvax!UDEL.EDU!mills
From: mills@UDEL.EDU.UUCP
Newsgroups: comp.protocols.tcp-ip
Subject: NSFNET woe: causes and consequences
Message-ID: <8710030151.AA27202@ucbvax.Berkeley.EDU>
Date: Fri, 2-Oct-87 21:53:23 EDT
Article-I.D.: ucbvax.8710030151.AA27202
Posted: Fri Oct 2 21:53:23 1987
Date-Received: Sun, 4-Oct-87 02:31:57 EDT
Sender: daemon@ucbvax.BERKELEY.EDU
Distribution: world
Organization: The ARPA Internet
Lines: 70

Folks,

Things have been very bad around the NSFNET since last Thursday. After several 16-hour days and much experimentation, I think I understand at least some of the reasons. If I am correct, you are not going to like the consequences.

Last Thursday the primary NSFNET gateway psc-gw became increasingly flaky, eventually to the point where it and its seventy-odd nets disappeared from EGP updates. Backup gateways linkabit-gw and cu-arpa picked up the slack, but not without considerable losses and delays due to congestion. When the new ARPANET code was installed over the weekend, psc-gw and its PSN (14) both completely expired, reportedly due to "resource shortage," the usual BBN euphemism for insufficient storage or table overflow, especially for connection blocks which manage ARPANET virtual circuits. Apparently, BBN backed out of the new code, so the PSN is unchanged from Thursday.

Meanwhile, Maryland gateway terp, also connected to a PSN (20) running the new ARPAware, began behaving badly, so much so that terp was simply turned off, leaving another Maryland gateway to hump the load. At this time (Thursday evening) the gateway is still off. Since both psc-gw and terp have similar configurations, connectivity and PSN (X.25) interfaces, one would assume the same varmint bit both of them.

Meanwhile, I was sitting off PSN 96 trying to figure out what was going on and noticed linkabit-gw 10.0.0.111 and dcn-gw 10.2.0.96 could not reach psc-gw at its ARPANET address 10.4.0.14. However, both of these buzzards could reach other hosts with no problem. Furthermore, EGP updates received from the usual corespeakers revealed psc-gw was working just fine. I concluded something weird was spooking the ARPANET; however, I found that cu-arpa 10.3.0.96 and louie 10.0.0.96 could work psc-gw at its ARPANET address. I thought maybe X.25 was the key, since all of the other PSN 96 machines use 1822, and cranked up swamp-gw 10.9.0.96 using X.25, but found no joy with psc-gw either.

When Dave O'Leary of PSC called to tell me their ACC 5250 X.25 driver for the MicroVAX was spewing out error comments to the effect that insufficient virtual circuits were available, all the cards fell into place. The 5250 supports a maximum of 64 virtual circuits. Apparently the number of ARPANET gateways and other (host) clients has escalated to the point that the 64-circuit maximum was exceeded. Probably the PSN was groaning even before that, which might have led to the earlier problems over the weekend. The reason some gateways could work psc-gw anyway was that they had captured the virtual circuits due to significant traffic loads and frequent connection attempts. My tests were from lightly loaded host ports which couldn't break into the mayhem which must be going on in the psc-gw 5250 board.

I have looked at the 5250 driver code, which is pretty simplistic in how it manages the virtual-circuit inventory. It now appears to be of the highest priority that a more mature approach be implemented in the driver, so that virtual-circuit resources can be reclaimed on the basis of use, age, etc. In principle, this is not very hard, but would have to be done quickly. Meanwhile, I suspect a lot of X.25 client gateways (not just NSFNET) are or soon will be very sick indeed.
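
To make the idea of reclaiming circuits "on the basis of use, age, etc." concrete, here is a minimal sketch in Python, assuming a plain least-recently-used policy. It is not the ACC 5250 driver code and says nothing about how that driver is actually structured. The 64-circuit limit comes from the figure quoted above; the names VirtualCircuitTable, circuit_for and reclaimed are hypothetical and exist only for illustration.

```python
from collections import OrderedDict

MAX_CIRCUITS = 64  # board limit quoted in the post; everything else is assumed


class VirtualCircuitTable:
    """Hypothetical LRU table mapping destination address -> circuit id."""

    def __init__(self, capacity=MAX_CIRCUITS):
        self.capacity = capacity
        self.circuits = OrderedDict()  # least recently used first
        # Free circuit ids, arranged so that pop() hands out 0 first.
        self.free_ids = list(range(capacity - 1, -1, -1))
        self.reclaimed = 0  # circuits closed early to make room

    def circuit_for(self, dest):
        """Return the circuit id for dest, reclaiming one if the table is full."""
        if dest in self.circuits:
            self.circuits.move_to_end(dest)  # mark as most recently used
            return self.circuits[dest]
        if not self.free_ids:
            # Table full: close the least recently used circuit.
            # In a real driver, data queued on that circuit could be lost.
            _victim, vc_id = self.circuits.popitem(last=False)
            self.free_ids.append(vc_id)
            self.reclaimed += 1
        vc_id = self.free_ids.pop()
        self.circuits[dest] = vc_id
        return vc_id
```

With a table of capacity 2, calling circuit_for for 10.0.0.111, then 10.2.0.96, then 10.0.0.111 again, then 10.3.0.96 closes the circuit to 10.2.0.96, since it is the one that has gone longest without use. A lightly loaded port can then always get a circuit, at the price of someone else's circuit being torn down; the reclaimed counter gives a rough measure of how hard the table is thrashing.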

Note that reclamation requires that open circuits to one destination may have to be closed abruptly, which can result in loss of data, then reopened to another destination. Under thrashing conditions where the load is spread over lots of other gateways and virtual circuits are flapping like crazy, the cherished ARPANET reputation for reliable transport may be considerably tarnished.

Those of us who have pondered the wisdom of underlaying X.25 virtual circuits beneath a connectionless service have repeatedly said that this kind of problem was certain to occur sooner or later. There are now about 200 gateways and 300 networks out there. As the ARPANET evolves toward a gateway-gateway (many-to-many) service, rather than a host-gateway (few-to-many) service, the problem can only get much worse. I personally believe the ARPANET architects and engineers, as well as the host and gateway vendors, must quickly come to solid grips with this issue. Our most precious resource may not be packet buffers, but connection blocks.

Dave
-------