Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site ucbvax.BERKELEY.EDU
Path: utzoo!decvax!decwrl!ucbvax!tcp-ip
From: tcp-ip@ucbvax.UUCP
Newsgroups: mod.protocols.tcp-ip
Subject: ip fragmentation follies
Message-ID: <8512290442.AA12458@uw-beaver.arpa>
Date: Sat, 28-Dec-85 19:00:04 EST
Article-I.D.: uw-beave.8512290442.AA12458
Posted: Sat Dec 28 19:00:04 1985
Date-Received: Tue, 7-Jan-86 01:39:08 EST
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 55
Approved: tcp-ip@sri-nic.arpa

I've been playing with IP fragmentation/reassembly and have discovered
a major crock in the Berkeley way of doing things. This may have been
noticed by someone before, but I hadn't really thought about it.

What caused me to notice this was claims by some people (namely Sun)
that using very large IP packets and IP-level fragmentation makes
protocols like NFS run faster. This makes some sense (less
context-switching, etc.), so we decided to try it.

We quickly noticed a problem, though: if a fragmented packet has to be
retransmitted (e.g. because one of the fragments was dropped somewhere),
the fragments of the retransmitted packet are not and cannot be merged
with those of the original packet!

Why? Because the Berkeley code has no notion of IP-level
retransmission, and hence assigns a new IP-level packet identifier to
each and every IP packet it transmits! And since, for a given source,
destination, and protocol, the IP-level identifier is the only way the
receiver can tell whether two fragments belong to the same packet, the
fragments of a retransmitted packet can never be combined with those of
the original.

What all this means in practice is this: for a fragmented IP packet to
get through to its receiver, all the fragments resulting from a single
transmission of that packet must get through.
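The reassembly behavior described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Berkeley code: the names `fragment`, `Reassembler`, and `deliver` are hypothetical, reassembly queues are keyed on the IP identifier alone (a real receiver also matches source, destination, and protocol), offsets are in bytes rather than 8-byte units, and the packet and fragment sizes are made up for illustration.

```python
from collections import defaultdict

def fragment(payload, ip_id, frag_size):
    """Split payload into (ip_id, offset, more_fragments, data) tuples."""
    frags = []
    for off in range(0, len(payload), frag_size):
        more = off + frag_size < len(payload)
        frags.append((ip_id, off, more, payload[off:off + frag_size]))
    return frags

class Reassembler:
    """Toy receiver: one reassembly queue per IP identifier (assumption:
    a single source/destination/protocol, so the id alone is the key)."""

    def __init__(self):
        self.queues = defaultdict(dict)  # ip_id -> {offset: (more, data)}

    def receive(self, frag):
        ip_id, off, more, data = frag
        self.queues[ip_id][off] = (more, data)
        return self._try_complete(ip_id)

    def _try_complete(self, ip_id):
        # Walk the fragments in offset order; succeed only if there are
        # no holes and the last fragment (more == False) is present.
        q = self.queues[ip_id]
        expected, out = 0, b""
        while expected in q:
            more, data = q[expected]
            out += data
            expected += len(data)
            if not more:
                del self.queues[ip_id]
                return out
        return None

def deliver(rx, frags, lost_index):
    """Feed frags to rx, dropping the one at lost_index.
    Returns the reassembled packet, or None if it never completed."""
    result = None
    for i, f in enumerate(frags):
        if i == lost_index:
            continue
        done = rx.receive(f)
        if done is not None:
            result = done
    return result

payload = bytes(range(256)) * 12  # 3072 bytes -> three 1024-byte fragments

# Fresh identifier on the retransmission (the behavior described above):
# each transmission loses a different fragment, and nothing ever completes.
rx = Reassembler()
first = deliver(rx, fragment(payload, 100, 1024), lost_index=1)
second = deliver(rx, fragment(payload, 101, 1024), lost_index=0)
fresh_id_result = first or second

# Hypothetical alternative: the retransmission reuses the identifier,
# so the surviving fragments of both transmissions fill each other's holes.
rx2 = Reassembler()
deliver(rx2, fragment(payload, 100, 1024), lost_index=1)
same_id_result = deliver(rx2, fragment(payload, 100, 1024), lost_index=0)
```

In this model `fresh_id_result` stays `None` and `rx.queues` is left holding two incomplete queues, while `same_id_result` is the full payload even though neither transmission arrived intact.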
If a single fragment is lost, all the other fragments resulting from
that transmission of the packet are useless and will never be
recombined with fragments from past or future transmissions of the
same packet.

This explains (or at least partly explains) why people running 4.2 TCP
connections across the Arpanet using 1024-byte packets were losing so
badly. If the probability of fragment lossage is even moderately high,
it will often take three or more tries to get a fragmented packet
across the net. Meanwhile, of course, the useless fragments from
previous transmissions are sitting on reassembly queues in the receiver
(for 15 seconds, I think?), tying up buffering resources and increasing
the chances that fragments will be dropped in the future!

In the current Berkeley code, it's possible to imagine workarounds for
this problem for TCP: because TCP is in the kernel, it could have a
side hook into the IP layer to tell it "this packet is a
retransmission, don't give it a new IP identifier". For protocols
built on UDP, however, the acknowledgment and retransmission functions
are done outside of the kernel, and the only kernel interface that's
available is Berkeley's socket calls (sendto, recvfrom, etc.). Needless
to say, the socket interface gives you

  1) no way to find out what IP identifier a packet was sent with;
  2) no way to specify the IP identifier to use on an outgoing packet.

I don't really have any idea what to do about this problem. And it's
not entirely Berkeley's fault; the BBN TCP/IP for 4.1bsd did the same
thing...

In any case, until there's a fix, I don't think using IP
fragmentation/reassembly when talking to 4.2bsd systems is a very good
idea.

-Larry
-------
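To put rough numbers on the "three or more tries" remark, here is a back-of-the-envelope sketch. It assumes every fragment is lost independently with the same probability and that each retransmission is an independent attempt; both are simplifications not stated in the post, and the function names and the sample figures (20% loss, 4 fragments) are purely illustrative.

```python
def success_probability(p_loss, n_frags):
    """Chance that all n_frags fragments of ONE transmission arrive."""
    return (1.0 - p_loss) ** n_frags

def expected_tries(p_loss, n_frags):
    """Mean number of transmissions until one gets through whole
    (geometric distribution with the success probability above)."""
    return 1.0 / success_probability(p_loss, n_frags)

def prob_at_least_k_tries(p_loss, n_frags, k):
    """Chance that the first k-1 transmissions all fail."""
    return (1.0 - success_probability(p_loss, n_frags)) ** (k - 1)

# Illustrative figures only: 20% per-fragment loss, 4 fragments per packet.
q = success_probability(0.2, 4)
tries = expected_tries(0.2, 4)
three_or_more = prob_at_least_k_tries(0.2, 4, 3)
```

Under these assumptions a single transmission gets through whole only about 41% of the time, and roughly a third of packets need three or more tries, since no transmission can borrow fragments from another.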