Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!zaphod.mps.ohio-state.edu!sample.eng.ohio-state.edu!purdue!haven.umd.edu!noc.sura.net!oleary
From: oleary@sura.net (dave o'leary)
Newsgroups: comp.protocols.tcp-ip
Subject: Re: Is the Internet usable for wide-area interactive conversations?
Keywords: tough routing problems
Message-ID: <1991Jun19.085123.24774@sura.net>
Date: 19 Jun 91 08:51:23 GMT
References: <2039.Jun1803.33.1391@kramden.acf.nyu.edu> <35911@ucsd.Edu> <17719.Jun1904.23.0991@kramden.acf.nyu.edu>
Organization: SURAnet, College Park, MD
Lines: 185

In article <17719.Jun1904.23.0991@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>In article <35911@ucsd.Edu> brian@ucsd.Edu (Brian Kantor) writes:
>> The real solution is to fix telnet and its ilk so that it doesn't kill
>> the connection when it gets what could well be a temporary error like
>> network unreachable.  It's not at all unusual to get temporary errors
>> like that whilst rerouting is taking place.
>
>I agree, that would keep connections alive. Provided, that is, that you
>convince Sun (among others) to change this behavior, and replace all the
>old machines out there. But you missed my point.
>
>Just because service isn't interrupted doesn't mean it's usable. Folks,
>an average of 1 second round trip time, with 1.8 seconds standard
>deviation, is just abominable on a route that could easily handle
>several times more data with a round trip time under a quarter second.
>

Do you have stats on the utilization of the lines, the memory on the 
gateways, etc.?  I agree that a 1000 msec average is pretty unreasonable,
but I'm not sure what the basis for your claim is.  What kind of 
latencies do you anticipate through a loaded regional network gateway,
for example?

>Let me explain what's really happening on the NYU-Berkeley connection.
>On the average there's not too much data on each link: some segments of
>the ``optimal'' route are at a mere 50% or 90% capacity, with slightly
>sub-``optimal'' routes nearby lying unused. So I see round trip times of
>well under a second.
>

50% utilization on a point to point line sampled less frequently than
every couple of seconds is starting to look like real congestion.  90%
is not pretty.  You get queuing delay, retransmissions due to delayed
acks, dropped packets (large swings in buffer allocations), etc.  Life
becomes harsh very rapidly.  If you sample frequently enough, you
can see 100% utilization.  Without context these numbers don't mean
a lot.
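
To put rough numbers on the queuing delay, here is a minimal sketch
using the textbook M/M/1 queue, where the mean time a packet spends in
the system is the service time divided by (1 - utilization).  The M/M/1
model, the 576-byte packet size, and the 56 kb/s line speed are all
assumptions picked for illustration, not measurements of any real link.

```python
def mm1_delay(service_time_s, utilization):
    """Mean time in an M/M/1 system (queueing plus transmission), in seconds."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


# Assumed figures: a 576-byte packet on a 56 kb/s line.
PACKET_BITS = 576 * 8
LINE_BPS = 56_000
service_time = PACKET_BITS / LINE_BPS

for rho in (0.10, 0.50, 0.90, 0.99):
    delay_ms = mm1_delay(service_time, rho) * 1000
    print(f"utilization {rho:4.0%}: mean per-hop delay {delay_ms:8.1f} ms")
```

Under these assumptions the mean per-hop delay at 90% utilization is
five times the delay at 50%, and it grows without bound as utilization
approaches 100%.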

>Suddenly a few too many people start ftp requests in the same second.
>The ``optimal'' route is quickly overwhelmed, packets die like flies,
>and my round trip time goes down the drain. The sub-``optimal'' routes
>are still carrying almost no traffic. Our Super Duper Dynamic Routing
>Protocols see the disaster and respond, bravely throwing packets to
>those no longer suboptimal routes until a permanent lifeline has been
>established. In the meantime there's been a service interruption or
>delay of several seconds up to a few minutes.
>

I can only think of one case where something like this would happen
but it would be pretty unusual.  Distance vector protocols don't 
decide whether a link is up or down - they receive that information
from a link level protocol.  The link level protocol doesn't care if
packets are getting thrown away because the router doesn't have any
more buffer space.  So the line stays up, routes remain installed 
in the router's tables, and the "optimal" link is still used.  The
case where this doesn't happen (brief consideration says that this
is applicable to both link state and distance vector protocols) is 
when the packets that are getting thrown away are the routing updates.
However, there are at least a couple of things that prevent this from
being too common - first, typically multiple routing updates have 
to be thrown away before routes time out.  Also, since routing info
flows in the opposite direction from user traffic, the link has to 
be congested in both directions for this to be a problem.  Another
reason, and probably the clincher, is that CPUs can generate routing
updates faster than the interfaces can put packets in the queue.
So if there is a lot of buffer space on an interface, the CPU can 
queue a routing update into that space rapidly, much faster than 
the interface can drain the queue, or another interface could fill
the queue (via the CPU).  Some of this changes with routers that
copy directly between interfaces using a local interface CPU and 
such but this only makes the proposed scenario less probable.
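
The point about multiple updates having to be lost can be sketched
numerically.  The assumptions here are mine: updates are dropped
independently with a fixed probability (real congestion losses are
correlated, so this is optimistic), and the timers are the RIP ones
from RFC 1058 (an update every 30 seconds, a route timeout of 180
seconds, so 6 updates in a row must be lost).

```python
def route_timeout_probability(drop_prob, updates_before_timeout):
    """Chance that every update in one timeout window is dropped,
    assuming each drop is independent (an illustrative simplification)."""
    if not 0 <= drop_prob <= 1:
        raise ValueError("drop_prob must be in [0, 1]")
    return drop_prob ** updates_before_timeout


# Assumed RIP-style timers: 180 s timeout / 30 s update interval.
UPDATES_BEFORE_TIMEOUT = 180 // 30

for p in (0.1, 0.3, 0.5):
    prob = route_timeout_probability(p, UPDATES_BEFORE_TIMEOUT)
    print(f"drop probability {p:.0%}: route times out with probability {prob:.6f}")
```

Even with half of all updates being thrown away, a given route times
out in only about 1.6% of windows under these assumptions.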

>Soon the same thing happens again. Again the route is flooded. Again
>service disappears. Again the routers intercede and revert to their
>original routing decisions. And so it goes, on and on through the night.
>
>At higher loads, a funny thing happens. The load regularly bursts over
>the top of what the current route can handle. Within seconds a router
>changes its decisions---but the other end simultaneously comes to the
>opposite conclusions. By the time each burst of packets has made its
>round trip, the routers have changed their decisions again, feeding
>their already obsolete data back into the loop. And so the routes
>rapidly flap. Down goes the network.
>
Actually this did happen, in the "old" NSFnet backbone, with the 
Fuzzballs, using Hello as the IGP and DDCMP as the link level 
protocol.  However, in this case, the link level protocol tried to 
be smart, and retransmit frames that were lost due to buffer problems
at the other end.  The Fuzzballs didn't have a lot of free memory 
lying around (despite heroic efforts by a certain individual), so
(I'm trying to remember; this was a while ago, when I didn't understand 
this stuff very well :-) buffer thrashing resulted and route flapping
did occur, kind of as you describe.  They were 56kb lines, and trying
to cram the load of a busy Ethernet down 2 or 3 of these slow speed
lines was not pretty.  Dave Mills can without doubt explain what was
breaking much more clearly than I ever could.

>In the meantime, any dolt can see that the network backbone is
>multiply connected. While one route degenerates, several parallel routes 
>cruise along at 1% or 3% capacity. Sure, they didn't look ``optimal'' 
>five minutes before, because they meant some extra T1 or even 56kb hops.  
>But if every router simply split its data between the three best routes, 
>the whole network would be able to handle a far higher load before 
>*anything* crashed.  

How do you determine what is a "parallel route"?  When your routing
protocol works at the network layer (or the IP layer, anyway), your
routers keep routing tables of IP routes and forwarding decisions are
made using the destination IP address.  However, congestion occurs on
links between a pair of routers (I think this is called a subnet in
ISO-ese).  So the router can't really balance across the three best
links - what are the three "best"?  Since you are forwarding *packets*
through the network, rather than streams, you have to keep track of a
lot of stuff - and you can't anticipate what is coming next.  In 
a circuit-switched network (e.g. the telephone network) lots of
assumptions are made about the calls that allow the network to do
this kind of "balancing".  
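
A minimal sketch of what "forwarding by destination address" means in
practice.  The table entries and interface names are hypothetical, the
addresses are documentation prefixes, and the longest-prefix lookup is
just a compact way to write the example.  The thing to notice is that
the lookup takes only the destination; nothing about the load on any
link goes into the decision.

```python
import ipaddress

# Hypothetical routing table: destination prefix -> outgoing interface.
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): "serial0",
    ipaddress.ip_network("198.51.100.0/24"): "serial1",
    ipaddress.ip_network("0.0.0.0/0"): "serial2",  # default route
}


def next_hop(destination):
    """Pick the outgoing interface from the destination address alone."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in ROUTES if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)  # most specific wins
    return ROUTES[best]


for dst in ("192.0.2.7", "198.51.100.20", "203.0.113.9"):
    print(dst, "->", next_hop(dst))
```

Every packet for a given destination leaves on the same interface no
matter how full that interface's queue is, which is why "split across
the three best routes" needs more state than this table holds.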

>A funny thing happens, by the way, when you start using split routes. It
>no longer matters much whether you dynamically optimize or not. If your
>optimal link goes down, who cares? You're already sending most of your
>packets along the three or four slightly suboptimal links. Think of it
>as a backup battery system. Not just a backup battery, but a constantly
>online backup battery---an uninterruptible power supply, in fact a
>supply with three or four big backup batteries that will keep you alive
>just as well as the power company.
>
>So there's no point in rushing to react to every little problem. That
>way lies inefficiency, route flapping, and madness. You might as well
>leave routes constant for a while---a day, say. Just keep track of how
>well the routes worked, and the next day adjust the packet flow by a
>little bit on each line, making sure never to overload one sensible
>route or to ignore another.
>

I think I'm missing something here (of course, it is getting a little
late :-( ).  I consider decisions with a reaction time on the order of
hours to be engineering decisions, not routing decisions.  This
addresses the problems I started to delineate above, but it isn't
clear to me what you are measuring.  How do you respond to outages?
How do you "keep track of how well a route worked"?  What is a
"sensible route"?  Maybe you know one when you see one, but can you
code that into a routing protocol?  Congestion on a link occurs on a
second by second basis (or
more often in some cases), so correcting things on the order of days
won't really solve the problem.  Although if you are proposing a kind
of dynamic bandwidth allocation...well, that's been thought of too.
Merit was going to do something like that with the IDNXs of the
original "new" NSFnet backbone, late 1988.  I'm not sure what really
came of that, other than instead of trying to reallocate bandwidth
they ended up just adding more everywhere.  

The packet flows are bursty second to second, so if you can handle
the load now, it doesn't mean that you can handle it a second from
now.  But you might be able to handle it all day tomorrow without
dropping anything.  How do you plan for that other than building in 
lots of extra capacity (which just solves the problem anyway)?
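
Here is a small sketch of why the averaging interval matters so much
with bursty traffic.  The trace is made up: a line that is saturated
for 3 seconds and then idle for 3 seconds, repeated for a minute.  The
function name and the trace are mine, purely for illustration.

```python
def windowed_utilization(per_second_load, window_s):
    """Average a per-second utilization trace over non-overlapping windows."""
    last_start = len(per_second_load) - window_s
    return [
        sum(per_second_load[i:i + window_s]) / window_s
        for i in range(0, last_start + 1, window_s)
    ]


# Synthetic trace: 3 s at 100%, 3 s idle, repeated for 60 s.
trace = ([1.0] * 3 + [0.0] * 3) * 10

for window in (1, 6, 60):
    samples = windowed_utilization(trace, window)
    mean = sum(samples) / len(samples)
    print(f"{window:2d} s window: peak {max(samples):.0%}, mean {mean:.0%}")
```

Sampled every second, this line hits 100%; averaged over 6 seconds or
over the whole minute, the same line never reads above 50%.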

>I've left out of this story any notes on why NYU-Berkeley was so slow---
>why the ``optimal'' routes were so close to capacity that they kept
>getting pushed over the edge. Suffice it to say that the entire net,
>rather than just isolated pockets, will be seeing similar loads within
>two or three years, unless we act now to split packets across every
>available line.
>
>---Dan

We *are* splitting traffic across every available line.  Every line
isn't running at the same level of utilization, but when we are 
running RIP we don't really have much of a choice.  OSPF and IGRP
allow the costing of interfaces so that better balance can be 
achieved.  Okay, so we've started to address the problem within
one autonomous system.  Now it's time to cross over into the NSFnet
backbone.  Or maybe the ESnet backbone or Milnet.  Of course, they
are possibly using different IGPs, and we lose all the costing 
information anyway when we cross over from one AS to another.
Is BGP the answer here?  Maybe if we (the network service providers)
can agree on how to assign OSPF costs consistently across different
routing domains.  And on a system to translate between OSPF and 
IGRP metrics (it's starting to look messy....).  
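
As a rough illustration of how interface costs affect balance within
one routing domain, here is a sketch that finds which neighbors lie on
a least-cost path.  This is a generic shortest-path calculation, not
OSPF or IGRP themselves; the four-router topology, the router names,
and the cost values are all made up, and link costs are assumed to be
symmetric.

```python
import heapq


def shortest_costs(graph, source):
    """Dijkstra: least total cost from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def equal_cost_next_hops(graph, source, dest):
    """Neighbors of source that sit on some least-cost path to dest.
    Assumes symmetric link costs, so distances from dest can be reused."""
    to_dest = shortest_costs(graph, dest)
    best = to_dest[source]
    return sorted(
        n for n, cost in graph[source].items()
        if n in to_dest and cost + to_dest[n] == best
    )


# Hypothetical topology: A reaches D via B or via C.
balanced = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 1},
    "C": {"A": 1, "D": 1},
    "D": {"B": 1, "C": 1},
}
# Same topology, but the A-C link is costed higher.
skewed = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "D": 1},
    "C": {"A": 5, "D": 1},
    "D": {"B": 1, "C": 1},
}

print(equal_cost_next_hops(balanced, "A", "D"))
print(equal_cost_next_hops(skewed, "A", "D"))
```

With equal costs both B and C are usable next hops from A; raising the
cost of one link takes that path out of the set.  All of this depends
on everyone in the domain agreeing on what the costs mean, which is
exactly what gets lost at an AS boundary.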

The problems certainly aren't trivial.  And neither are the answers.
Yes, we had better get to work.

dave o'leary	SURAnet NOC Manager
oleary@sura.net       (301)982-3214