Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!rutgers!ames!ucbcad!ucbvax!LBL-CSAM.ARPA!van
From: van@LBL-CSAM.ARPA.UUCP
Newsgroups: mod.protocols.tcp-ip
Subject: Analyzing Acknowledgement Strategies, continued
Message-ID: <8701171908.AA11125@lbl-csam.arpa>
Date: Sat, 17-Jan-87 14:08:57 EST
Article-I.D.: lbl-csam.8701171908.AA11125
Posted: Sat Jan 17 14:08:57 1987
Date-Received: Sat, 17-Jan-87 23:48:32 EST
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 187
Approved: tcp-ip@sri-nic.arpa

Boy, was I wrong!  I've had about 100 messages in the past couple
of days saying this discussion should stay on the list.  But this
opus should convince the diehards that they made a mistake...

Dave -

You're absolutely right.  Linear filters, including Kalman
filters, do a lousy job on estimating round trip time (witness
all the spurious retransmissions on our poor, congested
Internet).  And median filters, because they have very high noise
immunity and contain no assumptions about the underlying data
distribution, should do a great job estimating r.t.t.  But,
(since I'm running 0-for-3, I have to say "But...") I was
thinking of estimating something slightly different from r.t.t. 
I pictured the "filter" estimating the coefficients of a
mathematical model that describes how a network works.  Then tcp
would use those coefficients to decide what to do when sending
and retransmitting.  Since I didn't even come close to explaining
this in the last message, I'll try here.  (The simple-minded
explanations illustrate my ignorance; they're not intended as an
insult anyone's intelligence.)


What's the problem?

If you don't get an ack for a packet within some reasonable period
of time, you will have to retransmit.  There are two possible reasons
for not getting an ack and they lead to diametrically opposed
retransmit strategies:

  1) The packet was damaged or lost in transit.  In this case you
     want to retransmit the packet as soon as you can because
     throughput goes down linearly while you wait.  Or to put it
     another way, in this case throughput will increase linearly
     with retransmit rate.

  2) The packet was discarded because of congestion at a gateway.  In
     this case you want to wait a while before retransmitting to give
     the congestion a chance to clear.  In this case throughput will
     decrease exponentially with retransmit rate.

"No ack" conveys exactly one bit of information (that a packet was
lost) and there's no way we can squeeze another bit out to tell us
if this is case 1 or 2.  But, if acks are not independent of
one another, we might squeeze something out of pattern of acks
(the ack "time series") to help us decide between cases.


What information can you pull out of the time series?

I'm about to play "mathematician", so I'd better define some
notation.  Lets say P(i) is the send time of the i'th packet,
A(i) is the arrival time of the i'th ack, and R(i) = A(i)-P(i) is
the round trip time of the i'th packet.  All times are measured
at the sender.  All packets are the same size and the receive window
is a constant W packets (packets i through i+W-1 can be sent
prior to A(i) or, equivalently, P(i+W) >= A(i)).  The time to
generate and consume packets is small compared to the network
propagation times (the sender will immediately send a packet when
the window opens and the receiver will immediately ack a packet
when it arrives). 

It's pretty clear that our traffic will introduce correlations in
ack time series.  This is because A(i), the ack for packet i, is
related to the send of i by
	A(i) = P(i) + R(i)
but P(i) was sent when the window opened, i.e., when the ack for
i-W arrived.  So P(i) = A(i-W) and
	A(i) = A(i-W) + R(i)
This is what time-series people would call an "auto-regressive"
or AR model.  It's the simplest case of an AR(W) model, one where
current events are related to events W steps in the past with R(i)
adding some randomness so you don't get complacent.

[The nice thing about time series models is that some smart
people have devised cookbook ways for a complete idiot to
construct them (a topic called "system identification").  For
example, "Time-series analysis: forecasting and control" by Box
and Jenkins leads you by the hand from `guessing possible models
from inspection of sample data' through `picking the best model
to describe the data' to `using the model once you have it'.]

It's also pretty clear that the R(i) time series will have
internal correlations, at least under congestion.  You can view
the round trip time as being proportional to the amount of "work"
the network is doing (i.e., the number of packets ahead of us in
the gateway queues).  The definition of congestion is that we
aren't done handling the work from time i-1 when the new work for
time i arrives.  The previous statements define an AR(1) model
for round trip time:
	R(i) = a R(i-1) + d + e(i)
where "a" is the fraction of work carried over from the last
step, "d" is the fixed, minimum, end-to-end propagation delay and
"e" is a random "noise" term to account for new traffic entering
and internal state changes (like adaptive routing). 

Extrapolating and waving my hands, I was going to make a model
something like
	R(i+1) = b (P(i) - P(i-1)) + a R(i) + e(i) + d
where "b" estimates how much our traffic rate affects the r.t.t.,
"a" estimates how much residual work in the network affects the
r.t.t.  and the variance of "e" estimates how much random
fluctuations contribute to the r.t.t.  (for people who like
acronyms, I think this is called an ARMAX model - Auto-Regressive,
Moving Average with eXogenous inputs).  The filter would estimate
a, b and var(e).  (The preceding sounds impressive and difficult
but it's not.  The whole thing consists of about 10 lines of
arithmetic to update 6 numbers which you do each time an ack
arrives.  With the help of a good introductory text, like Peter
Young's "Recursive Estimation and Time-Series Analysis", the
arithmetic even makes sense in a twisted sort of way.)


So What?

The model should get us at least three ways to choose retransmit
strategy: If a is large or da/di is positive, the network can't
handle its load and loss is probably congestive.  If e is small
or de/di is negative, the network is probably congested (this is
a consequence of the P-K theorem: as the network nears
saturation, "random" external perturbations have less effect on
it.  Intuitively, the variance of e is a measure of the random
way traffic arrives at and flows through the network.  But at
saturation you can no longer see these random arrivals: Data
arrives and sits in queues while some poor, constipated,
bottleneck gateway chugs on its backlog.  The whole network
behavior reduces to the (deterministic) behavior of this
gateway).  Finally, if db/di has the same sign as dR/di (i.e.,
sending packets faster lowers throughput) the network is probably
congested. 

There's a bit more information in the model.  "a" is a measure of
how fast congestion is growing so it tells you how much to change
your packet generation rate to help damp out the congestion.  In
the case of loss by damage, var(e) tells you how long to wait
before retransmitting.  E.g., say you're the only user of a
point-to-point link with 1 sec. mean round trip time and the
variation in the r.t.t. is +-50ms.  The tcp spec says to wait 2
sec. before retransmitting but you can be damn near certain you
need to retransmit after 1.2 sec.  The model should say a = b = 0,
d = 1 sec and var(e) = .1 sec.  If you set your retransmit timer
to d + 2var(e) instead of 2(d + var(e)), you won't do any more
retransmissions and thoughput could improve by up to 50%,
depending on the loss rate. 

There are many things in this that need to be checked by
experimentation.  One is the applicability of the network model. 
The whole idea is useless if someone has to do a new model
identification when an IMP is added to a network.  My hope is
that the model is "generic"; that anything serving the function
of "computer network" has a first-order description of this form.
The parameters will change from network to network but those are
estimated dynamically by the filter IF the model form is close to
correct. 

Another check is for the rate of convergence of the model.  This
will have been "an interesting academic exercise" if it takes 50
packets for the model to figure out the parameters.  My past
experience with these kind of filters (in control systems, not
networks) has been that they converge incredibly quickly (3 to 5
samples) so I'm hopeful. 

And finally, to get back to the start of this whole thing: acks. 
Send-to-ack round trip time is the only way the sender has of
probing the state of the network.  If the receiver can do
something random, acks only probe the composite state of
network+receiver.  If the receiver can go "maybe I'll generate 3
acks, just in case one gets lost", the information for the sender
in an ack is enormously reduced. 

I think this is a protocol design issue.  To do the type of
modelling I've been talking about, the actions of the receiver
have to be deterministic from the point of view of the sender. 
(Note that this prohibits things like delayed acks: the
receiver's behavior is determined by the packet interarrival
times, something the sender cannot know.  Since delayed acks are
a good thing on a local net, I'll modify the opening statement to
"the receiver's behavior should be deterministic when the transit
delay is long".)  In the case of tcp, I'd like the receiver's
behavior to be "one packet in, one ack out", just because I don't
know how to model anything more complicated. 

Congratulations to anyone who got this far.  I'm off to Usenix
for a week but there will be a test over this material a week
from Monday ;-).

 - Van