Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!zaphod.mps.ohio-state.edu!sample.eng.ohio-state.edu!purdue!haven.umd.edu!noc.sura.net!oleary
From: oleary@sura.net (dave o'leary)
Newsgroups: comp.protocols.tcp-ip
Subject: Re: Is the Internet usable for wide-area interactive conversations?
Keywords: tough routing problems
Message-ID: <1991Jun19.085123.24774@sura.net>
Date: 19 Jun 91 08:51:23 GMT
References: <2039.Jun1803.33.1391@kramden.acf.nyu.edu> <35911@ucsd.Edu> <17719.Jun1904.23.0991@kramden.acf.nyu.edu>
Organization: SURAnet, College Park, MD
Lines: 185

In article <17719.Jun1904.23.0991@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>In article <35911@ucsd.Edu> brian@ucsd.Edu (Brian Kantor) writes:
>> The real solution is to fix telnet and its ilk so that it doesn't kill
>> the connection when it gets what could well be a temporary error like
>> network unreachable. It's not at all unusual to get temporary errors
>> like that whilst rerouting is taking place.
>
>I agree, that would keep connections alive. Provided, that is, that you
>convince Sun (among others) to change this behavior, and replace all the
>old machines out there. But you missed my point.
>
>Just because service isn't interrupted doesn't mean it's usable. Folks,
>an average of 1 second round trip time, with 1.8 seconds standard
>deviation, is just abominable on a route that could easily handle
>several times more data with a round trip time under a quarter second.
>

Do you have stats on the utilization of the lines, the memory on the
gateways, etc.? I agree that a 1000 msec average is pretty unreasonable,
but I'm not sure what the basis for your claim is. What kind of
latencies do you anticipate through a loaded regional network gateway,
for example?

>Let me explain what's really happening on the NYU-Berkeley connection.
>On the average there's not too much data on each link: some segments of
>the ``optimal'' route are at a mere 50% or 90% capacity, with slightly
>sub-``optimal'' routes nearby lying unused. So I see round trip times of
>well under a second.
>

50% utilization on a point to point line, sampled less often than every
couple of seconds, is starting to look like real congestion. 90% is not
pretty: queuing delay, retransmissions due to delayed acks, dropped
packets (large swings in buffer allocations), etc. Life becomes harsh
very rapidly. If you sample frequently enough, you can see 100%
utilization. Without context these numbers don't mean a lot.

>Suddenly a few too many people start ftp requests in the same second.
>The ``optimal'' route is quickly overwhelmed, packets die like flies,
>and my round trip time goes down the drain. The sub-``optimal'' routes
>are still carrying almost no traffic. Our Super Duper Dynamic Routing
>Protocols see the disaster and respond, bravely throwing packets to
>those no longer suboptimal routes until a permanent lifeline has been
>established. In the meantime there's been a service interruption or
>delay of several seconds up to a few minutes.
>

I can only think of one case where something like this would happen,
and it would be pretty unusual. Distance vector protocols don't decide
whether a link is up or down; they receive that information from a link
level protocol. The link level protocol doesn't care if packets are
getting thrown away because the router doesn't have any more buffer
space. So the line stays up, routes remain installed in the router's
tables, and the "optimal" link is still used. The case where this
doesn't happen (brief consideration says that this applies to both link
state and distance vector protocols) is when the packets that are
getting thrown away are the routing updates.
However, there are at least a couple of things that prevent this from
being too common. First, typically multiple routing updates have to be
thrown away before routes time out. Also, since routing info flows in
the opposite direction from user traffic, the link has to be congested
in both directions for this to be a problem. Another reason, and
probably the clincher, is that CPUs can generate routing updates faster
than the interfaces can put packets in the queue. So if there is a lot
of buffer space on an interface, the CPU can queue a routing update into
that space rapidly, much faster than the interface can drain the queue
or another interface could fill the queue (via the CPU). Some of this
changes with routers that copy directly between interfaces using a local
interface CPU and such, but this only makes the proposed scenario less
probable.

>Soon the same thing happens again. Again the route is flooded. Again
>service disappears. Again the routers intercede and revert to their
>original routing decisions. And so it goes, on and on through the night.
>
>At higher loads, a funny thing happens. The load regularly bursts over
>the top of what the current route can handle. Within seconds a router
>changes its decisions---but the other end simultaneously comes to the
>opposite conclusions. By the time each burst of packets has made its
>round trip, the routers have changed their decisions again, feeding
>their already obsolete data back into the loop. And so the routes
>rapidly flap. Down goes the network.
>

Actually this did happen, in the "old" NSFnet backbone, with the
Fuzzballs, using Hello as the IGP and DDCMP as the link level protocol.
However, in this case, the link level protocol tried to be smart and
retransmit frames that were lost due to buffer problems at the other
end.
The Fuzzballs didn't have a lot of free memory lying around (despite
heroic efforts by a certain individual), and (I'm trying to remember;
this was a while ago, when I didn't understand this stuff very well :-)
buffer thrashing resulted and route flapping did occur, more or less as
you describe. They were 56kb lines, and trying to cram the load of a
busy Ethernet down 2 or 3 of these slow speed lines was not pretty.
Dave Mills can without doubt explain what was breaking much more
clearly than I ever could.

>In the meantime, any dolt can see that the network backbone is
>multiply connected. While one route degenerates, several parallel routes
>cruise along at 1% or 3% capacity. Sure, they didn't look ``optimal''
>five minutes before, because they meant some extra T1 or even 56kb hops.
>But if every router simply split its data between the three best routes,
>the whole network would be able to handle a far higher load before
>*anything* crashed.

How do you determine what is a "parallel route"? When your routing
protocol works at the network layer (or the IP layer, anyway), your
routers keep routing tables of IP routes, and forwarding decisions are
made using the destination IP address. However, congestion occurs on
links between a pair of routers (I think this is called a subnet in
ISO-ese). So the router can't really balance across the three best
links. What are the three "best"? Since you are forwarding *packets*
through the network, rather than streams, you have to keep track of a
lot of stuff, and you can't anticipate what is coming next. In a
circuit switching network (i.e. telephone calls), lots of assumptions
are made about the calls that allow the network to do this kind of
"balancing".

>A funny thing happens, by the way, when you start using split routes. It
>no longer matters much whether you dynamically optimize or not. If your
>optimal link goes down, who cares?
>You're already sending most of your
>packets along the three or four slightly suboptimal links. Think of it
>as a backup battery system. Not just a backup battery, but a constantly
>online backup battery---an uninterruptible power supply, in fact a
>supply with three or four big backup batteries that will keep you alive
>just as well as the power company.
>
>So there's no point in rushing to react to every little problem. That
>way lies inefficiency, route flapping, and madness. You might as well
>leave routes constant for a while---a day, say. Just keep track of how
>well the routes worked, and the next day adjust the packet flow by a
>little bit on each line, making sure never to overload one sensible
>route or to ignore another.
>

I think I'm missing something here (of course, it is getting a little
late :-( ). I consider reactions on the order of hours to be
engineering decisions, not routing decisions. Your proposal addresses
the problems I started to delineate above, but it isn't clear to me
what you are measuring. How do you respond to outages? How do you
"keep track of how well a route worked"? What is a "sensible route"?
Maybe you know one when you see one, but can you code that into a
routing protocol? Congestion on a link occurs on a second by second
basis (or more often in some cases), so correcting things on the order
of days won't really solve the problem.

Although if you are proposing a kind of dynamic bandwidth allocation...
well, that's been thought of too. Merit was going to do something like
that with the IDNXs of the original "new" NSFnet backbone, in late
1988. I'm not sure what really came of that, other than that instead of
trying to reallocate bandwidth they ended up just adding more
everywhere. The packet flows are bursty second to second, so being able
to handle the load now doesn't mean that you can handle it a second
from now. But you might be able to handle it all day tomorrow without
dropping anything.
How do you plan for that other than building in lots of extra capacity
(which just solves the problem anyway)?

>I've left out of this story any notes on why NYU-Berkeley was so slow---
>why the ``optimal'' routes were so close to capacity that they kept
>getting pushed over the edge. Suffice it to say that the entire net,
>rather than just isolated pockets, will be seeing similar loads within
>two or three years, unless we act now to split packets across every
>available line.
>
>---Dan

We *are* splitting traffic across every available line. Every line
isn't running at the same level of utilization, but when we are running
RIP we don't really have much of a choice. OSPF and IGRP allow the
costing of interfaces so that better balance can be achieved. Okay, so
we've started to address the problem within one autonomous system. Now
it's time to cross over into the NSFnet backbone. Or maybe the ESnet
backbone or Milnet. Of course, they are possibly using different IGPs,
and we lose all the costing information anyway when we cross over from
one AS to another. Is BGP the answer here? Maybe, if we (the network
service providers) can agree on how to assign OSPF costs consistently
across different routing domains. And on a system to translate between
OSPF and IGRP metrics (it's starting to look messy....).

The problems certainly aren't trivial. And neither are the answers.
Yes, we had better get to work.

dave o'leary
SURAnet NOC Manager
oleary@sura.net
(301)982-3214
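
A footnote to the sampling remark above (that 50% utilization measured
over a couple of seconds can hide a line that is saturated part of the
time). The following is only a minimal sketch under made-up
assumptions: the trace, the 100 ms slot size, and the helper name
utilization() are all hypothetical, chosen so the arithmetic is easy to
follow on a 56 kb/s line. It is not a measurement of any real link.

```python
# Sketch: the same traffic looks very different at different sampling
# intervals. All numbers below are invented for illustration.

LINE_BPS = 56_000                              # 56 kb/s line
SLOT_MS = 100                                  # finest measurement slot
SLOT_CAPACITY_BITS = LINE_BPS * SLOT_MS // 1000  # 5600 bits per slot


def utilization(bits_per_slot, slots_per_sample):
    """Average utilization (0..1) over consecutive windows of slots."""
    samples = []
    for start in range(0, len(bits_per_slot), slots_per_sample):
        window = bits_per_slot[start:start + slots_per_sample]
        samples.append(sum(window) / (SLOT_CAPACITY_BITS * len(window)))
    return samples


# Hypothetical trace: line saturated for 1 second, idle for 1 second,
# repeated five times (100 slots, i.e. 10 seconds).
trace = ([SLOT_CAPACITY_BITS] * 10 + [0] * 10) * 5

fine = utilization(trace, 1)     # one sample every 100 ms
coarse = utilization(trace, 20)  # one sample every 2 seconds

print("peak at 100 ms sampling:", max(fine))
print("peak at 2 s sampling:   ", max(coarse))
```

With this invented trace, the 2 second samples all report the same
moderate average, while the 100 ms samples show the line pinned at
capacity half the time, which is when queues build and packets get
dropped.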