Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site ucbvax.BERKELEY.EDU
Path: utzoo!decvax!decwrl!ucbvax!tcp-ip
From: tcp-ip@ucbvax.UUCP
Newsgroups: mod.protocols.tcp-ip
Subject: ip fragmentation follies
Message-ID: <8512290442.AA12458@uw-beaver.arpa>
Date: Sat, 28-Dec-85 19:00:04 EST
Article-I.D.: uw-beave.8512290442.AA12458
Posted: Sat Dec 28 19:00:04 1985
Date-Received: Tue, 7-Jan-86 01:39:08 EST
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 55
Approved: tcp-ip@sri-nic.arpa

I've been playing with IP fragmentation/reassembly and have discovered
a major crock in the Berkeley way of doing things.  This may have been
noticed by someone before, but I hadn't really thought about it.

What caused me to notice this was claims by some people (namely Sun)
that using very large IP packets and using IP-level fragmentation
makes protocols like NFS run faster.  This makes some sense (less
context-switching, etc), so we decided to try it.  We quickly noticed
a problem, though: if a fragmented packet has to be retransmitted (e.g.
because one of the fragments was dropped somewhere), the fragments of
the retransmitted packet are not and cannot be merged with those of
the original packet!  Why?  Because the Berkeley code has no notion of
IP-level retransmission, and hence assigns a new IP-level packet
identifier to each and every IP packet it transmits!  And since the
IP-level identifier is the only way the receiver can tell whether two
fragments belong to the same packet, this means that the fragments of
a retransmitted packet can never be combined with those of the
original.
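
To make the mechanism concrete, here is a toy model of the receiver side.
It is only a sketch, not the Berkeley code: the names Reassembler and
transmit, the string addresses, and the reuse_id switch are all made up for
illustration. The one thing it borrows from the real protocol is that
fragments are grouped by the IP identifier (together with source,
destination, and protocol).

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Reassembler:
    """Toy receiver: fragments are grouped only by (src, dst, proto, ip_id)."""
    queues: dict = field(default_factory=dict)

    def receive(self, key, index, total):
        # key is (src, dst, proto, ip_id); index says which fragment arrived.
        got = self.queues.setdefault(key, set())
        got.add(index)
        if len(got) == total:
            del self.queues[key]
            return True    # every fragment for this identifier is here
        return False       # still waiting; the partial queue stays around


_next_id = count(1)        # sender's identifier counter (hypothetical)


def transmit(rx, src, dst, total, lost, reuse_id=None):
    """Send one packet as `total` fragments, dropping the indices in `lost`.

    Without reuse_id, every call takes a fresh identifier, which is the
    behavior described above. Returns (identifier used, reassembled?).
    """
    ip_id = reuse_id if reuse_id is not None else next(_next_id)
    done = False
    for i in range(total):
        if i in lost:
            continue
        done = rx.receive((src, dst, "udp", ip_id), i, total) or done
    return ip_id, done


# Fresh identifier per transmission: the two partial sets never merge.
rx = Reassembler()
first_id, ok1 = transmit(rx, "a", "b", 3, lost={1})
second_id, ok2 = transmit(rx, "a", "b", 3, lost={0})

# Same losses, but the retransmission keeps the original identifier.
rx2 = Reassembler()
kept_id, _ = transmit(rx2, "a", "b", 3, lost={1})
_, ok3 = transmit(rx2, "a", "b", 3, lost={0}, reuse_id=kept_id)
```

In the first run both transmissions fail and the receiver is left holding two
useless partial queues; in the second, the fragment missing from the first
try is filled in by the retransmission.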

What all this means in practice is this: for a fragmented IP packet to
get through to its receiver, all the fragments resulting from a single
transmission of that packet must get through.  If a single fragment is
lost, all the other fragments resulting from that transmission of the
packet are useless and will never be recombined with fragments from
past or future transmissions of the same packet.

This all explains (or at least partially explains) why
people running 4.2 TCP connections across the Arpanet using 1024-byte
packets were losing so badly.  If the probability of fragment lossage
is even moderately high, it will often take three or more tries to get
a fragmented packet across the net.  Meanwhile, of course, the useless
fragments from previous transmissions are sitting on reassembly queues
in the receiver (for 15 seconds, I think?), tying up buffering
resources and increasing the chances that fragments will be dropped in
the future!
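
A back-of-the-envelope calculation shows how quickly this gets bad. The
sketch below assumes each fragment is lost independently with the same
probability, which real networks will not honor exactly, and the function
name and the sample numbers are mine, not measurements.

```python
def delivery_odds(fragment_loss, fragments):
    """Odds for one fragmented packet when partial sets never merge.

    Assumes independent, identical loss per fragment. Returns the chance
    that a single transmission delivers every fragment, and the expected
    number of transmissions until one does.
    """
    p_all_arrive = (1.0 - fragment_loss) ** fragments
    return p_all_arrive, 1.0 / p_all_arrive


# Hypothetical case: 20% fragment loss, packet split into 4 fragments.
p_ok, expected_tries = delivery_odds(0.2, 4)
```

With those made-up numbers a single transmission gets through only about
41% of the time, so the sender needs roughly 2.4 tries on average, and
every failed try leaves dead fragments on the receiver's queues.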

In the current Berkeley code, it's possible to imagine workarounds for
this problem for TCP: because TCP is in the kernel, it could have a
side hook into the IP layer to tell it "this packet is a
retransmission, don't give it a new IP identifier".  For protocols
like UDP, however, the acknowledgment and retransmission functions are
done outside of the kernel, and the only kernel interface that's
available is Berkeley's socket calls (sendto, recvfrom, etc).
Needless to say, the socket interface gives you 1) no way to find out
what IP identifier a packet was sent with; 2) no way to specify the IP
identifier to use on an outgoing packet.
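
For illustration, this is roughly what user-level retransmission over UDP
looks like from the application's side. It is a sketch using Python's
socket module, which wraps the same sendto/recvfrom calls; the helper name
send_with_retry, the retry count, and the timeout are assumptions of mine.

```python
import socket


def send_with_retry(sock, payload, addr, tries=3, timeout=0.5):
    """User-level retransmission: resend until some reply arrives.

    Returns (attempt number, reply bytes), or None if every try timed out.
    """
    sock.settimeout(timeout)
    for attempt in range(1, tries + 1):
        # sendto() is given only the data and the destination. Nothing in
        # the call names an IP identifier, so a retransmission looks like
        # any other new datagram by the time it reaches the IP layer.
        sock.sendto(payload, addr)
        try:
            reply, _sender = sock.recvfrom(65535)
            return attempt, reply
        except socket.timeout:
            continue
    return None
```

The application knows perfectly well that tries two and three carry the same
data as try one, but it has no argument through which to say so.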

I don't really have any idea what to do about this problem.  And it's
not entirely Berkeley's fault; the BBN TCP/IP for 4.1bsd did the same
thing...  In any case, until there's a fix I don't think using IP
fragmentation/reassembly when talking to 4.2bsd systems is a very good
idea.
                                                        -Larry
