draft-ietf-tcpimpl-pmtud-00.txt   draft-ietf-tcpimpl-pmtud-01.txt 
Network Working Group K. Lahey Network Working Group K. Lahey
Expires: October 1999
TCP Problems with Path MTU Discovery TCP Problems with Path MTU Discovery
<draft-ietf-tcpimpl-pmtud-00.txt> <draft-ietf-tcpimpl-pmtud-01.txt>
1. Status of this Memo 1. Status of this Memo
This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.
This document is an Internet Draft. Internet Drafts are working This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute and its working groups. Note that other groups may also distribute
working documents as Internet Drafts. working documents as Internet Drafts.
Internet Drafts are draft documents valid for a maximum of six Internet Drafts are draft documents valid for a maximum of six
months, and may be updated, replaced, or obsoleted by other documents months, and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet Drafts as reference at any time. It is inappropriate to use Internet Drafts as reference
material or to cite them other than as ``work in progress''. material or to cite them other than as ``work in progress''.
To view the entire list of current Internet-Drafts, please check the The list of current Internet-Drafts can be accessed at
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow http://www.ietf.org/ietf/1id-abstracts.txt
Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific The list of Internet-Draft Shadow Directories can be accessed at
Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast). http://www.ietf.org/shadow.html.
This memo provides information for the Internet community. This memo This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of does not specify an Internet standard of any kind. Distribution of
this memo is unlimited. this memo is unlimited.
2. Introduction 2. Introduction
This memo catalogs several known TCP implementation problems dealing This memo catalogs several known TCP implementation problems dealing
with Path MTU Discovery [RFC1191], including the long-standing black with Path MTU Discovery [RFC1191], including the long-standing black
hole problem, stretch ACKs due to confusion between MSS and segment hole problem, stretch ACKs due to confusion between MSS and segment
skipping to change at page 2, line 10 skipping to change at page 2, line 12
protocol, it is most commonly used by TCP; this document does not protocol, it is most commonly used by TCP; this document does not
attempt to treat problems encountered by other upper-layer protocols. attempt to treat problems encountered by other upper-layer protocols.
Each problem is defined as follows: Each problem is defined as follows:
Name of Problem Name of Problem
The name associated with the problem. In this memo, the name is The name associated with the problem. In this memo, the name is
given as a subsection heading. given as a subsection heading.
Classification Classification
One or more problem categories for which the problem is classi- One or more problem categories for which the problem is classified:
fied: "congestion control", "performance", "reliability". "congestion control", "performance", "reliability", "non-interoper-
ation - connectivity failure".
Description Description
A definition of the problem, succinct but including necessary A definition of the problem, succinct but including necessary back-
background material. ground material.
Significance Significance
A brief summary of the sorts of environments for which the prob- A brief summary of the sorts of environments for which the problem
lem is significant. is significant.
Implications Implications
Why the problem is viewed as a problem. Why the problem is viewed as a problem.
Relevant RFCs Relevant RFCs
The RFCs defining the TCP specification with which the problem The RFCs defining the TCP specification with which the problem con-
conflicts. These RFCs often qualify behavior using terms such flicts. These RFCs often qualify behavior using terms such as
as MUST, SHOULD, MAY, and others written capitalized. See RFC MUST, SHOULD, MAY, and others written capitalized. See RFC 2119
2119 for the exact interpretation of these terms. for the exact interpretation of these terms.
Trace file demonstrating the problem Trace file demonstrating the problem
One or more ASCII trace files demonstrating the problem, if One or more ASCII trace files demonstrating the problem, if appli-
applicable. cable.
Trace file demonstrating correct behavior Trace file demonstrating correct behavior
One or more examples of how correct behavior appears in a trace, One or more examples of how correct behavior appears in a trace, if
if applicable. applicable.
References References
References that further discuss the problem. References that further discuss the problem.
How to detect How to detect
How to test an implementation to see if it exhibits the problem. How to test an implementation to see if it exhibits the problem.
This discussion may include difficulties and subtleties This discussion may include difficulties and subtleties associated
associated with causing the problem to manifest itself, and with with causing the problem to manifest itself, and with interpreting
interpreting traces to detect the presence of the problem (if traces to detect the presence of the problem (if applicable).
applicable).
How to fix How to fix
For known causes of the problem, how to correct the implementa- For known causes of the problem, how to correct the implementation.
tion.
3. Known implementation problems 3. Known implementation problems
3.1. 3.1.
Name of Problem Name of Problem
Black Hole Detection Black Hole Detection
Classification Classification
Reliability Non-interoperation -- connectivity failure
Description Description
Path MTU Discovery works by sending out as large a packet as possi- Path MTU Discovery (PMTUD) works by sending out as large a packet
ble, with the Don't Fragment (DF) bit set in the IP header. If the as possible, with the Don't Fragment (DF) bit set in the IP header.
packet is too large for a router to forward on to a particular If the packet is too large for a router to forward on to a particu-
link, the router must send an ICMP Destination Unreachable -- Frag- lar link, the router must send an ICMP Destination Unreachable --
mentation Needed message to the source address. Fragmentation Needed message to the source address.
As was pointed out in [RFC1435], routers don't always do this As was pointed out in [RFC1435], routers don't always do this cor-
correctly -- many routers fail to send the ICMP messages, for a rectly -- many routers fail to send the ICMP messages, for a vari-
variety of reasons ranging from kernel bugs to configuration prob- ety of reasons ranging from kernel bugs to configuration problems.
lems. Firewalls are often misconfigured to suppress all ICMP messages.
PMTUD, as documented in [RFC1191], fails when confronted with this PMTUD, as documented in [RFC1191], fails when confronted with this
problem. The upper-layer protocol continues to try to send large problem. The upper-layer protocol continues to try to send large
packets and, without the ICMP messages, never discovers that it packets and, without the ICMP messages, never discovers that it
needs to reduce the size of those packets. Its packets are disap- needs to reduce the size of those packets. Its packets are disap-
pearing into a PMTUD black hole. pearing into a PMTUD black hole.
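The mechanism and its failure mode can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the function name discover_pmtu, the list-of-link-MTUs path model, and the retry limit are hypothetical, and every ICMP message that does arrive is assumed to carry the next-hop MTU as described in [RFC1191].

```python
def discover_pmtu(first_hop_mtu, link_mtus, icmp_delivered=True, max_tries=8):
    """Return the discovered PMTU, or None if the sender never learns it.

    link_mtus lists the MTU of each link on the path, in order.
    icmp_delivered=False models a router or firewall that drops the
    Fragmentation Needed message (hypothetical model, not a real API).
    """
    size = first_hop_mtu
    for _ in range(max_tries):
        # First link that cannot carry a DF packet of this size, if any.
        bottleneck = next((mtu for mtu in link_mtus if mtu < size), None)
        if bottleneck is None:
            return size  # the packet reached the destination
        if not icmp_delivered:
            continue  # dropped silently; the sender retries at the same size
        size = bottleneck  # the ICMP message reported the next-hop MTU
    return None  # every attempt vanished: a PMTUD black hole


print(discover_pmtu(4352, [4352, 1500, 4352]))
print(discover_pmtu(4352, [4352, 1500, 4352], icmp_delivered=False))
```

With ICMP delivered the sender converges on the smallest link MTU; with ICMP suppressed it keeps retransmitting the same oversized packet and never makes progress.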
Significance Significance
PMTUD fails spectacularly (or, more to the point, silently) when
confronted with the lack of ICMP messages. When PMTUD fails due to the lack of ICMP messages, TCP will also
completely fail under some conditions.
Implications Implications
This failure is especially difficult to debug, as pings and some This failure is especially difficult to debug, as pings and some
interactive TCP connections to the destination host work. Bulk interactive TCP connections to the destination host work. Bulk
transfers fail with the first large packet and the connection even- transfers fail with the first large packet and the connection even-
tually times out. tually times out.
While these situations can almost always be blamed on a misconfi- These situations can almost always be blamed on a misconfiguration
guration, they appear to be frequent enough to require some sort of within the network, which should be corrected. However it is not
workaround from the host side. appropriate for some TCP implementations to suffer interoperability
failures over paths which do not affect other TCP implementations
(i.e. those without PMTUD).
This creates a market disincentive for deploying TCP implementations
with PMTUD enabled.
Relevant RFCs Relevant RFCs
RFC1191 describes Path MTU Discovery. RFC 1435 provides an early RFC1191 describes Path MTU Discovery. RFC 1435 provides an early
description of these sorts of problems. description of these sorts of problems.
Trace file demonstrating the problem Trace file demonstrating the problem
Made using tcpdump [Jacobson89] recording at an intermediate host. Made using tcpdump [Jacobson89] recording at an intermediate host.
20:12:11.951321 A > B: S 1748427200:1748427200(0) 20:12:11.951321 A > B: S 1748427200:1748427200(0)
win 49152 <mss 1460> win 49152 <mss 1460>
skipping to change at page 5, line 35 skipping to change at page 5, line 43
In this case, the sender sees four packets fail to traverse the In this case, the sender sees four packets fail to traverse the
network (using a two-packet initial send window) and turns off network (using a two-packet initial send window) and turns off
PMTUD. All subsequent packets have the DF flag turned off, and the PMTUD. All subsequent packets have the DF flag turned off, and the
size set to the default value of 536 [RFC1122]. size set to the default value of 536 [RFC1122].
References References
This problem has been discussed extensively on the tcp-impl mailing This problem has been discussed extensively on the tcp-impl mailing
list; the name "black hole" has been in use for many years. list; the name "black hole" has been in use for many years.
How to detect How to detect
This shows up only as a failure to complete a TCP connection. A This shows up as a TCP connection which hangs (fails to make
series of ICMP echo packets will show that the connection is still progress) until closed by timeout. A series of ICMP echo packets
passing packets, a series of MTU-sized ICMP echo packets will show will show that the connection is still passing packets, a
some fragmentation, and a series of MTU-sized ICMP echo packets series of MTU-sized ICMP echo packets will show some fragmentation,
with DF set will fail. This can be confusing for network engineers and a series of MTU-sized ICMP echo packets with DF set will fail.
trying to diagnose the problem. This can be confusing for network engineers trying to diagnose the
problem.
There are several traceroute implementations that do PMTUD. See,
for example, ftp://ftp.psc.edu/pub/networking/tools/traceroute.tar
How to fix How to fix
TCP should notice that the connection is timing out. After several TCP should notice that the connection is timing out. After several
timeouts, TCP should attempt to send smaller packets, perhaps turn- timeouts, TCP should attempt to send smaller packets, perhaps turn-
ing off the DF flag for each packet. If this succeeds, it should ing off the DF flag for each packet. If this succeeds, it should
continue to turn off PMTUD for the connection for some reasonable continue to turn off PMTUD for the connection for some reasonable
period of time, after which it should probe again to try to deter- period of time, after which it should probe again to try to deter-
mine if the path has changed. mine if the path has changed.
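One possible shape for this workaround is sketched below. It is an illustration, not a description of any particular stack: the class name BlackHoleDetector and the constants TIMEOUT_THRESHOLD and REPROBE_INTERVAL are assumptions chosen for the example, since the text only says "several timeouts" and "some reasonable period of time". The fallback size of 536 is the default from [RFC1122].

```python
from dataclasses import dataclass
from typing import Optional

FALLBACK_MSS = 536        # default segment size from RFC 1122
TIMEOUT_THRESHOLD = 4     # assumed value for "several timeouts"
REPROBE_INTERVAL = 600.0  # assumed value, in seconds, before probing again


@dataclass
class BlackHoleDetector:
    """Per-connection PMTUD fallback state (hypothetical sketch)."""
    mss: int
    df: bool = True
    timeouts: int = 0
    saved_mss: int = 0
    disabled_at: Optional[float] = None

    def on_timeout(self, now: float) -> None:
        """Call on each retransmission timeout."""
        self.timeouts += 1
        if self.df and self.timeouts >= TIMEOUT_THRESHOLD:
            # Suspect a black hole: shrink segments and clear DF.
            self.saved_mss = self.mss
            self.mss = min(self.mss, FALLBACK_MSS)
            self.df = False
            self.disabled_at = now

    def on_ack(self, now: float) -> None:
        """Call when new data is acknowledged."""
        self.timeouts = 0
        if (self.disabled_at is not None
                and now - self.disabled_at >= REPROBE_INTERVAL):
            # Enough time has passed: try PMTUD again.
            self.mss = self.saved_mss
            self.df = True
            self.disabled_at = None
```

Resetting the timeout counter on every ACK keeps ordinary, isolated losses from disabling PMTUD; only a run of consecutive timeouts triggers the fallback.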
Note that, under IPv6, there is no DF bit -- it is implicitly on at Note that, under IPv6, there is no DF bit -- it is implicitly on at
skipping to change at page 6, line 40 skipping to change at page 7, line 9
equipped stack, it will try to generate an ACK for every second equipped stack, it will try to generate an ACK for every second
full-sized segment. If it determines the full-sized segment based full-sized segment. If it determines the full-sized segment based
on the advertised MSS, this can degrade badly in the face of PMTUD. on the advertised MSS, this can degrade badly in the face of PMTUD.
The PMTU can wind up being a small fraction of the advertised MSS; The PMTU can wind up being a small fraction of the advertised MSS;
in this case, an ACK would be generated only very infrequently. in this case, an ACK would be generated only very infrequently.
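The size of the effect is easy to estimate. The sketch below assumes a receiver that sends an ACK once it has received at least two "full-sized" segments' worth of data, measured in bytes; the function name segments_per_ack is hypothetical, and real receivers may count differently.

```python
import math


def segments_per_ack(full_size, actual_segment):
    """Segments that must arrive before 2 * full_size bytes are received."""
    return math.ceil(2 * full_size / actual_segment)


# Receiver measures against the advertised MSS of 4312 while the sender,
# limited by a 1500-byte PMTU, sends 1448-byte segments.
print(segments_per_ack(4312, 1448))

# Receiver measures against the segment size actually in use.
print(segments_per_ack(1448, 1448))
```

Under these assumptions the first case needs six segments (8688 bytes) per ACK, which matches the step between the ACKs for 1452357 and 1461045 in the newer trace in this section; the second case gives the intended one ACK per two segments.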
Significance Significance
Stretch ACKs have a variety of unfortunate effects, more fully out- Stretch ACKs have a variety of unfortunate effects, more fully out-
lined in [Paxson98]. Most of these have to do with encouraging a lined in [RFC2525]. Most of these have to do with encouraging a
more bursty connection, due to the infrequent arrival of ACKs. more bursty connection, due to the infrequent arrival of ACKs.
They can also impede congestion window growth. They can also impede congestion window growth.
Implications Implications
The complete implications of stretch ACKs are outlined in The complete implications of stretch ACKs are outlined in
[Paxson98]. [RFC2525].
Relevant RFCs Relevant RFCs
RFC 1122 outlines the requirements for frequency of ACK generation. RFC 1122 outlines the requirements for frequency of ACK generation.
[RFC2001-bis] expands on this and clarifies that delayed ACK is a [RFC2581] expands on this and clarifies that delayed ACK is a
SHOULD, not a MUST. SHOULD, not a MUST.
Trace file demonstrating it Trace file demonstrating it
Made using tcpdump recording at an intermediate host. The times- Made using tcpdump recording at an intermediate host. The times-
tamp options from all but the first two packets have been removed tamp options from all but the first two packets have been removed
for clarity. for clarity.
21:01:09.877349 A > B: S 436208717:436208717(0) win 16384 18:16:52.976657 A > B: S 3183102292:3183102292(0) win 16384 <mss 4312,nop,wscale 0,nop,nop,timestamp 12128 0> (DF) ()
<mss 65240,nop,wscale 0,nop,nop,timestamp 362104 0> (DF) 18:16:52.979580 B > A: S 2022212745:2022212745(0) ack 3183102293 win 49152 <mss 4312,nop,wscale 1,nop,nop,timestamp 1592957 12128> (DF) ()
21:01:09.882427 B > A: S 1367238400:1367238400(0) 18:16:52.979738 A > B: . ack 1 win 17248 (DF) ()
ack 436208718 win 49152 18:16:52.982473 A > B: . 1:4301(4300) ack 1 win 17248 (DF) ()
<mss 1460,nop,wscale 0,nop,nop,timestamp 1008884 362104> (DF) 18:16:52.982557 C > A: icmp: B unreachable - need to frag (mtu 1500) (DF) ()
21:01:09.882879 A > B: . ack 1 win 16384 (DF) 18:16:52.985839 B > A: . ack 1 win 32768 (DF) ()
21:01:09.884121 A > B: . 1:525(524) ack 1 win 16384 (DF) 18:16:54.129928 A > B: . 1:1449(1448) ack 1 win 17248 (DF) ()
21:01:10.012907 B > A: . ack 525 win 49152 (DF) .
21:01:10.013380 A > B: . 525:1049(524) ack 1 win 16384 (DF) .
21:01:10.013428 A > B: . 1049:1573(524) ack 1 win 16384 (DF) .
21:01:10.214801 B > A: . ack 1573 win 49152 (DF) 18:16:58.507078 A > B: . 1463941:1465389(1448) ack 1 win 17248 (DF) ()
21:01:10.215311 A > B: . 1573:2097(524) ack 1 win 16384 (DF) 18:16:58.507200 A > B: . 1465389:1466837(1448) ack 1 win 17248 (DF) ()
21:01:10.215360 A > B: . 2097:2621(524) ack 1 win 16384 (DF) 18:16:58.507326 A > B: . 1466837:1468285(1448) ack 1 win 17248 (DF) ()
21:01:10.215410 A > B: . 2621:3145(524) ack 1 win 16384 (DF) 18:16:58.507439 A > B: . 1468285:1469733(1448) ack 1 win 17248 (DF) ()
21:01:10.420001 B > A: . ack 3145 win 49152 (DF) 18:16:58.524763 B > A: . ack 1452357 win 32768 (DF) ()
21:01:10.420589 A > B: . 3145:3669(524) ack 1 win 16384 (DF) 18:16:58.524986 B > A: . ack 1461045 win 32768 (DF) ()
21:01:10.420642 A > B: . 3669:4193(524) ack 1 win 16384 (DF) 18:16:58.525138 A > B: . 1469733:1471181(1448) ack 1 win 17248 (DF) ()
21:01:10.420690 A > B: . 4193:4717(524) ack 1 win 16384 (DF) 18:16:58.525268 A > B: . 1471181:1472629(1448) ack 1 win 17248 (DF) ()
21:01:10.420740 A > B: . 4717:5241(524) ack 1 win 16384 (DF) 18:16:58.525393 A > B: . 1472629:1474077(1448) ack 1 win 17248 (DF) ()
21:01:10.622541 B > A: . ack 5241 win 49152 (DF) 18:16:58.525516 A > B: . 1474077:1475525(1448) ack 1 win 17248 (DF) ()
21:01:10.623086 A > B: . 5241:5765(524) ack 1 win 16384 (DF) 18:16:58.525642 A > B: . 1475525:1476973(1448) ack 1 win 17248 (DF) ()
21:01:10.623181 A > B: . 5765:6289(524) ack 1 win 16384 (DF) 18:16:58.525766 A > B: . 1476973:1478421(1448) ack 1 win 17248 (DF) ()
21:01:10.623272 A > B: . 6289:6813(524) ack 1 win 16384 (DF) 18:16:58.526063 A > B: . 1478421:1479869(1448) ack 1 win 17248 (DF) ()
21:01:10.623330 A > B: . 6813:7337(524) ack 1 win 16384 (DF) 18:16:58.526187 A > B: . 1479869:1481317(1448) ack 1 win 17248 (DF) ()
21:01:10.623386 A > B: . 7337:7861(524) ack 1 win 16384 (DF) 18:16:58.526310 A > B: . 1481317:1482765(1448) ack 1 win 17248 (DF) ()
21:01:10.825555 B > A: . ack 7861 win 49152 (DF) 18:16:58.526432 A > B: . 1482765:1484213(1448) ack 1 win 17248 (DF) ()
21:01:10.826167 A > B: . 7861:8385(524) ack 1 win 16384 (DF) 18:16:58.526561 A > B: . 1484213:1485661(1448) ack 1 win 17248 (DF) ()
21:01:10.826218 A > B: . 8385:8909(524) ack 1 win 16384 (DF) 18:16:58.526671 A > B: . 1485661:1487109(1448) ack 1 win 17248 (DF) ()
21:01:10.826268 A > B: . 8909:9433(524) ack 1 win 16384 (DF) 18:16:58.537944 B > A: . ack 1478421 win 32768 (DF) ()
21:01:10.826321 A > B: . 9433:9957(524) ack 1 win 16384 (DF) 18:16:58.538328 A > B: . 1487109:1488557(1448) ack 1 win 17248 (DF) ()
21:01:10.826370 A > B: . 9957:10481(524) ack 1 win 16384 (DF)
21:01:10.826442 A > B: . 10481:11005(524) ack 1 win 16384 (DF)
21:01:10.839214 B > A: . ack 11005 win 49152 (DF)
21:01:10.839800 A > B: . 11005:11529(524) ack 1 win 16384 (DF)
21:01:10.839878 A > B: . 11529:12053(524) ack 1 win 16384 (DF)
21:01:10.839966 A > B: . 12053:12577(524) ack 1 win 16384 (DF)
21:01:10.840062 A > B: . 12577:13101(524) ack 1 win 16384 (DF)
21:01:10.840110 A > B: . 13101:13625(524) ack 1 win 16384 (DF)
21:01:10.840159 A > B: . 13625:14149(524) ack 1 win 16384 (DF)
21:01:10.840208 A > B: . 14149:14673(524) ack 1 win 16384 (DF)
21:01:10.852496 B > A: . ack 14149 win 49152 (DF)
21:01:10.853082 A > B: . 14673:15197(524) ack 1 win 16384 (DF)
21:01:10.853158 A > B: . 15197:15721(524) ack 1 win 16384 (DF)
21:01:10.853206 A > B: . 15721:16245(524) ack 1 win 16384 (DF)
21:01:10.853299 A > B: . 16245:16769(524) ack 1 win 16384 (DF)
21:01:10.853380 A > B: . 16769:17293(524) ack 1 win 16384 (DF)
21:01:10.853449 A > B: . 17293:17817(524) ack 1 win 16384 (DF)
21:01:10.853499 A > B: . 17817:18341(524) ack 1 win 16384 (DF)
21:01:11.028546 B > A: . ack 18341 win 49152 (DF)
21:01:12.258326 A > B: . ack 2 win 16384 (DF)
Note that the interval between ACKs is significantly larger than Note that the interval between ACKs is significantly larger than
two times the segment size -- it is even larger than two times the two times the segment size; it works out to be almost exactly two
MSS, 1460. [Editor -- Will find a better trace file for this that times the advertised MSS. This transfer was long enough that it
displays exactly 2 * MSS.] could be verified that the stretch ACK was not the result of lost
ACK packets.
Trace file demonstrating correct behavior Trace file demonstrating correct behavior
Made using tcpdump recording at an intermediate host. The times- Made using tcpdump recording at an intermediate host. The times-
tamp options from all but the first two packets have been removed tamp options from all but the first two packets have been removed
for clarity. for clarity.
21:02:47.038874 A > B: S 538351768:538351768(0) win 16384 18:13:32.287965 A > B: S 2972697496:2972697496(0) win 16384
<mss 65240,nop,wscale 0,nop,nop,timestamp 362492 0> (DF) <mss 4312,nop,wscale 0,nop,nop,timestamp 11326 0> (DF)
21:02:47.086595 B > A: S 37992677:37992677(0) 18:13:32.290785 B > A: S 245639054:245639054(0) ack 2972697497 win 34496
ack 538351769 win 8760 <mss 1460> (DF) <mss 4312> (DF)
21:02:47.087025 A > B: . ack 1 win 16384 (DF) 18:13:32.290941 A > B: . ack 1 win 17248 (DF)
21:02:47.088294 A > B: . 1:537(536) ack 1 win 16384 (DF) 18:13:32.293774 A > B: . 1:4313(4312) ack 1 win 17248 (DF)
21:02:47.138623 B > A: . ack 537 win 8760 (DF) 18:13:32.293856 C > A: icmp: B unreachable - need to frag (mtu 1500) (DF)
21:02:47.139080 A > B: . 537:1073(536) ack 1 win 16384 (DF) 18:13:33.637338 A > B: . 1:1461(1460) ack 1 win 17248 (DF)
21:02:47.139129 A > B: . 1073:1609(536) ack 1 win 16384 (DF) .
21:02:47.143913 B > A: . ack 1609 win 8760 (DF) .
21:02:47.144406 A > B: . 1609:2145(536) ack 1 win 16384 (DF) .
21:02:47.144453 A > B: . 2145:2681(536) ack 1 win 16384 (DF) 18:13:35.561691 A > B: . 1514021:1515481(1460) ack 1 win 17248 (DF)
21:02:47.144503 A > B: . 2681:3217(536) ack 1 win 16384 (DF) 18:13:35.561814 A > B: . 1515481:1516941(1460) ack 1 win 17248 (DF)
21:02:47.150155 B > A: . ack 2681 win 8760 (DF) 18:13:35.561938 A > B: . 1516941:1518401(1460) ack 1 win 17248 (DF)
21:02:47.150644 A > B: . 3217:3753(536) ack 1 win 16384 (DF) 18:13:35.562059 A > B: . 1518401:1519861(1460) ack 1 win 17248 (DF)
21:02:47.150730 A > B: . 3753:4289(536) ack 1 win 16384 (DF) 18:13:35.562174 A > B: . 1519861:1521321(1460) ack 1 win 17248 (DF)
21:02:47.150788 A > B: . 4289:4825(536) ack 1 win 16384 (DF) 18:13:35.564008 B > A: . ack 1481901 win 64680 (DF)
21:02:47.154880 B > A: . ack 3753 win 8760 (DF) 18:13:35.564383 A > B: . 1521321:1522781(1460) ack 1 win 17248 (DF)
21:02:47.155353 A > B: . 4825:5361(536) ack 1 win 16384 (DF) 18:13:35.564499 A > B: . 1522781:1524241(1460) ack 1 win 17248 (DF)
21:02:47.155404 A > B: . 5361:5897(536) ack 1 win 16384 (DF) 18:13:35.615576 B > A: . ack 1484821 win 64680 (DF)
21:02:47.155452 A > B: . 5897:6433(536) ack 1 win 16384 (DF) 18:13:35.615646 B > A: . ack 1487741 win 64680 (DF)
21:02:47.157030 B > A: . ack 4825 win 8760 (DF) 18:13:35.615716 B > A: . ack 1490661 win 64680 (DF)
21:02:47.157500 A > B: . 6433:6969(536) ack 1 win 16384 (DF) 18:13:35.615784 B > A: . ack 1493581 win 64680 (DF)
21:02:47.157549 A > B: . 6969:7505(536) ack 1 win 16384 (DF) 18:13:35.615856 B > A: . ack 1496501 win 64680 (DF)
21:02:47.157597 A > B: . 7505:8041(536) ack 1 win 16384 (DF) 18:13:35.615952 A > B: . 1524241:1525701(1460) ack 1 win 17248 (DF)
21:02:47.161042 B > A: . ack 5897 win 8760 (DF) 18:13:35.615966 B > A: . ack 1499421 win 64680 (DF)
21:02:47.161538 A > B: . 8041:8577(536) ack 1 win 16384 (DF) 18:13:35.616088 A > B: . 1525701:1527161(1460) ack 1 win 17248 (DF)
21:02:47.161591 A > B: . 8577:9113(536) ack 1 win 16384 (DF) 18:13:35.616105 B > A: . ack 1502341 win 64680 (DF)
21:02:47.161639 A > B: . 9113:9649(536) ack 1 win 16384 (DF) 18:13:35.616211 A > B: . 1527161:1528621(1460) ack 1 win 17248 (DF)
21:02:47.161876 B > A: . ack 6969 win 8760 (DF) 18:13:35.616228 B > A: . ack 1505261 win 64680 (DF)
21:02:47.162346 A > B: . 9649:10185(536) ack 1 win 16384 (DF) 18:13:35.616327 A > B: . 1528621:1530081(1460) ack 1 win 17248 (DF)
21:02:47.162439 A > B: . 10185:10721(536) ack 1 win 16384 (DF) 18:13:35.616349 B > A: . ack 1508181 win 64680 (DF)
21:02:47.162492 A > B: . 10721:11257(536) ack 1 win 16384 (DF) 18:13:35.616448 A > B: . 1530081:1531541(1460) ack 1 win 17248 (DF)
21:02:47.163677 B > A: . ack 8041 win 8760 (DF) 18:13:35.616565 A > B: . 1531541:1533001(1460) ack 1 win 17248 (DF)
21:02:47.164150 A > B: . 11257:11793(536) ack 1 win 16384 (DF) 18:13:35.616891 A > B: . 1533001:1534461(1460) ack 1 win 17248 (DF)
21:02:47.164241 A > B: . 11793:12329(536) ack 1 win 16384 (DF)
21:02:47.164297 A > B: . 12329:12865(536) ack 1 win 16384 (DF)
21:02:47.167331 B > A: . ack 9113 win 8760 (DF)
21:02:47.167808 A > B: . 12865:13401(536) ack 1 win 16384 (DF)
21:02:47.167859 A > B: . 13401:13937(536) ack 1 win 16384 (DF)
21:02:47.167907 A > B: . 13937:14473(536) ack 1 win 16384 (DF)
21:02:47.169389 B > A: . ack 10185 win 8760 (DF)
21:02:47.169852 A > B: . 14473:15009(536) ack 1 win 16384 (DF)
21:02:47.169983 A > B: . 15009:15545(536) ack 1 win 16384 (DF)
21:02:47.170033 A > B: . 15545:16081(536) ack 1 win 16384 (DF)
21:02:47.170188 B > A: . ack 11257 win 8760 (DF)
21:02:47.170661 A > B: . 16081:16617(536) ack 1 win 16384 (DF)
21:02:47.170711 A > B: . 16617:17153(536) ack 1 win 16384 (DF)
21:02:47.170762 A > B: . 17153:17689(536) ack 1 win 16384 (DF)
21:02:47.171707 B > A: . ack 12329 win 8760 (DF)
21:02:47.172168 A > B: . 17689:18225(536) ack 1 win 16384 (DF)
21:02:47.172297 A > B: . 18225:18761(536) ack 1 win 16384 (DF)
21:02:47.172385 A > B: . 18761:19297(536) ack 1 win 16384 (DF)
21:02:47.174261 B > A: . ack 13401 win 8760 (DF)
21:02:47.174398 B > A: . ack 14473 win 8760 (DF)
21:02:47.174728 A > B: . 19297:19833(536) ack 1 win 16384 (DF)
21:02:47.174777 A > B: . 19833:20369(536) ack 1 win 16384 (DF)
21:02:47.174826 A > B: . 20369:20905(536) ack 1 win 16384 (DF)
21:02:47.175085 A > B: . 20905:21441(536) ack 1 win 16384 (DF)
21:02:47.175182 A > B: . 21441:21977(536) ack 1 win 16384 (DF)
21:02:47.175231 A > B: . 21977:22513(536) ack 1 win 16384 (DF)
21:02:47.179999 B > A: . ack 15545 win 8760 (DF)
21:02:47.180143 B > A: . ack 16617 win 8760 (DF)
21:02:47.180151 B > A: . ack 17689 win 8760 (DF)
21:02:47.180322 B > A: . ack 18761 win 8760 (DF)
21:02:48.716195 A > B: . ack 2 win 16384 (DF)
In this trace, an ACK is generated for every two segments that In this trace, an ACK is generated for every two segments that
arrive. (The segment size is slightly larger in this trace, even arrive. (The segment size is slightly larger in this trace, even
though the source hosts are the same, because of the lack of times- though the source hosts are the same, because of the lack of times-
tamp options in this trace.) tamp options in this trace.)
How to detect How to detect
A tcpdump packet trace of a connection using PMTUD should make this This condition can be observed in a packet trace when the adver-
problem obvious. tised MSS is significantly larger than the actual PMTU of a connec-
tion.
How to fix How to fix
Several solutions for this problem have been proposed: Several solutions for this problem have been proposed:
A simple solution is to ACK every other packet, regardless of size. A simple solution is to ACK every other packet, regardless of size.
This has the drawback of generating large numbers of ACKs in the This has the drawback of generating large numbers of ACKs in the
face of lots of very small packets; this shows up with applica- face of lots of very small packets; this shows up with applica-
tions like the X Window System. tions like the X Window System.
A slightly more complex solution would monitor the size of incoming A slightly more complex solution would monitor the size of incoming
segments and try to determine what segment size the sender is segments and try to determine what segment size the sender is
using. This requires slightly more state in the receiver. using. This requires slightly more state in the receiver, but has
the advantage of making receiver SWS avoidance computations more
accurate.
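The second approach might look roughly like the following. This is a sketch only: the class name AckPolicy, the choice of "largest of the last few segments" as the estimate, and the window of eight segments are all assumptions made for illustration. A real receiver would also keep its delayed-ACK timer, which is omitted here.

```python
from collections import deque


class AckPolicy:
    """Decide when to ACK based on the segment size the sender appears to use."""

    def __init__(self, advertised_mss, window=8):
        self.advertised_mss = advertised_mss
        self.recent = deque(maxlen=window)  # lengths of the last few segments
        self.unacked = 0                    # bytes received since the last ACK

    def on_segment(self, length):
        """Record an arriving segment; return True if an ACK is due now."""
        self.recent.append(length)
        # Estimate the sender's segment size, never above what we advertised.
        estimate = min(self.advertised_mss, max(self.recent))
        self.unacked += length
        if self.unacked >= 2 * estimate:
            self.unacked = 0
            return True
        return False


policy = AckPolicy(advertised_mss=4312)
print([policy.on_segment(1448) for _ in range(4)])
```

Because the estimate is taken over a short window, it follows the sender down after PMTUD reduces the segment size, and a burst of very small packets is still covered by the (omitted) delayed-ACK timer rather than by an ACK per packet.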
3.3. 3.3.
Name of Problem Name of Problem
Determining MSS from PMTU Determining MSS from PMTU
Classification Classification
Performance Performance
Description Description
skipping to change at page 11, line 13 skipping to change at page 10, line 24
MTU the system can receive. MTU the system can receive.
Significance Significance
The advertised MSS is an indication to the remote system about the The advertised MSS is an indication to the remote system about the
largest TCP segment that can be received [RFC879]. If this value largest TCP segment that can be received [RFC879]. If this value
is too small, the remote system will be forced to use a smaller is too small, the remote system will be forced to use a smaller
segment size when sending, purely because the local system found a segment size when sending, purely because the local system found a
particular PMTU earlier. particular PMTU earlier.
Given the asymmetric nature of many routes on the Internet [Pax- Given the asymmetric nature of many routes on the Internet [Pax-
son97], it seems entirely possible that the return PMTU is dif- son97], it seems entirely possible that the return PMTU is differ-
ferent from the sending PMTU. Limiting the segment size in this way ent from the sending PMTU. Limiting the segment size in this way
can reduce performance and frustrate the PMTUD algorithm. can reduce performance and frustrate the PMTUD algorithm.
Even if the route was symmetric, setting this artificially lowered Even if the route was symmetric, setting this artificially lowered
limit on segment size will make it impossible to probe later to limit on segment size will make it impossible to probe later to
determine if the PMTU has changed. determine if the PMTU has changed.
Implications Implications
The whole point of PMTUD is to send as large a segment as possible. The whole point of PMTUD is to send as large a segment as possible.
If long-running connections cannot successfully probe for larger If long-running connections cannot successfully probe for larger
PMTU, then potential performance gains will be impossible to real- PMTU, then potential performance gains will be impossible to real-
skipping to change at page 12, line 32 skipping to change at page 11, line 47
22:33:34.471144 B > A2: . ack 985 win 16384 22:33:34.471144 B > A2: . ack 985 win 16384
22:33:34.476554 A2 > B: . 985:1969(984) ack 1 win 8856 (DF) 22:33:34.476554 A2 > B: . 985:1969(984) ack 1 win 8856 (DF)
22:33:34.477580 A2 > B: P 1969:2953(984) ack 1 win 8856 (DF) 22:33:34.477580 A2 > B: P 1969:2953(984) ack 1 win 8856 (DF)
[...] [...]
Notice that the SYN packet for session A2 specifies an MSS of 984. Notice that the SYN packet for session A2 specifies an MSS of 984.
Trace file demonstrating correct behavior Trace file demonstrating correct behavior
As before, this trace was made using tcpdump running on an inter- As before, this trace was made using tcpdump running on an interme-
mediate host. Host A initiates two separate consecutive connec- diate host. Host A initiates two separate consecutive connections,
tions, A1 and A2, to host B. Router C is the location of the MTU A1 and A2, to host B. Router C is the location of the MTU
bottleneck. As usual, TCP options are removed from all non-SYN bottleneck. As usual, TCP options are removed from all non-SYN
packets. packets.
22:36:58.828602 A1 > B: S 3402991286:3402991286(0) win 32768 22:36:58.828602 A1 > B: S 3402991286:3402991286(0) win 32768
<mss 4312,wscale 0,nop,timestamp 1123370309 0, <mss 4312,wscale 0,nop,timestamp 1123370309 0,
echo 1123370309> (DF) echo 1123370309> (DF)
22:36:58.844040 B > A1: S 946999880:946999880(0) 22:36:58.844040 B > A1: S 946999880:946999880(0)
ack 3402991287 win 16384 ack 3402991287 win 16384
<mss 65240,nop,wscale 0,nop,nop,timestamp 429552 1123370309> <mss 65240,nop,wscale 0,nop,nop,timestamp 429552 1123370309>
22:36:58.848058 A1 > B: . ack 1 win 32768 (DF) 22:36:58.848058 A1 > B: . ack 1 win 32768 (DF)
skipping to change at page 13, line 40 skipping to change at page 13, line 11
This can be detected using a packet trace of two separate connec- This can be detected using a packet trace of two separate connec-
tions; the first should invoke PMTUD; the second should start soon tions; the first should invoke PMTUD; the second should start soon
enough after the first that the PMTU value does not time out. enough after the first that the PMTU value does not time out.
How to fix How to fix
The MSS should be determined based on the MTUs of the interfaces on The MSS should be determined based on the MTUs of the interfaces on
the system, as outlined in [RFC1122] and [RFC1191]. the system, as outlined in [RFC1122] and [RFC1191].
4. Security Considerations 4. Security Considerations
This memo does not discuss any specific security-related TCP imple- The one security concern raised by this memo is that ICMP black holes
mentation problems, except that ICMP black holes are often caused by are often caused by over-zealous security administrators who block
over-zealous security administrators who block all ICMP messages. It all ICMP messages. It is vitally important that those who design and
would be far nicer to have all of the black holes fixed rather than deploy security systems understand the impact of strict filtering on
fixing all of the TCP implementations. upper-layer protocols. The safest web site in the world is worthless
if most TCP implementations cannot transfer data from it. It would
be far nicer to have all of the black holes fixed rather than fixing
all of the TCP implementations.
5. Acknowledgements 5. Acknowledgements
Thanks to Mark Allman and Vern Paxson for generous help reviewing the Thanks to Mark Allman and Vern Paxson for generous help reviewing the
document, and to Matt Mathis for early suggestions of various mechan- document, and to Matt Mathis for early suggestions of various mecha-
isms that can cause PMTUD black holes. The structure for describing nisms that can cause PMTUD black holes, as well as review. The
TCP problems, and the early description of that structure is from structure for describing TCP problems, and the early description of
[Paxson98]. that structure is from [RFC2525]. Special thanks to Amy Bock, who
helped perform the PMTUD tests which discovered these bugs.
6. References 6. References
[RFC2001-bis] [RFC2581]
M. Allman, V. Paxson, and W. Stevens, "TCP Congestion Control", M. Allman, V. Paxson, and W. Stevens, "TCP Congestion Control",
draft-ietf-tcpimpl-cong-control-03.txt, December 1998. April 1999.
[RFC1122] [RFC1122]
R. Braden, Editor, "Requirements for Internet Hosts -- Communica- R. Braden, Editor, "Requirements for Internet Hosts -- Communica-
tion Layers," Oct. 1989. tion Layers," Oct. 1989.
[Jacobson89] [Jacobson89]
V. Jacobson, C. Leres, and S. McCanne, tcpdump, available via V. Jacobson, C. Leres, and S. McCanne, tcpdump, available via
anonymous ftp to ftp.ee.lbl.gov, Jun. 1989. anonymous ftp to ftp.ee.lbl.gov, Jun. 1989.
[RFC1435] [RFC1435]
S. Knowles, "IESG Advice from Experience with Path MTU Discovery," S. Knowles, "IESG Advice from Experience with Path MTU Discovery,"
March 1993. March 1993.
[RFC1191] [RFC1191]
J. Mogul and S. Deering, "Path MTU discovery," Nov. 1990. J. Mogul and S. Deering, "Path MTU discovery," Nov. 1990.
[Paxson96] [Paxson96]
V. Paxson, "End-to-End Routing Behavior in the Internet," IEEE/ACM V. Paxson, "End-to-End Routing Behavior in the Internet," IEEE/ACM
Transactions on Networking (5), pp.~601-615, Oct. 1997. Transactions on Networking (5), pp.~601-615, Oct. 1997.
[Paxson98] [RFC2525]
V. Paxson, Editor, M. Allman, S. Dawson, W. Fenner, J. Griner, I. V. Paxson, Editor, M. Allman, S. Dawson, W. Fenner, J. Griner, I.
Heavens, K. Lahey, J. Semke, and B. Volz, "Known TCP Implementation Heavens, K. Lahey, J. Semke, and B. Volz, "Known TCP Implementation
Problems", draft-ietf-tcpimpl-prob-05.txt, November 1998. Problems", March 1999.
[RFC879] [RFC879]
J. Postel, "The TCP Maximum Segment Size and Related Topics," J. Postel, "The TCP Maximum Segment Size and Related Topics,"
November, 1983. November, 1983.
[RFC2001] [RFC2001]
W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast Retransmit, W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast Retransmit,
and Fast Recovery Algorithms," Jan. 1997. and Fast Recovery Algorithms," Jan. 1997.
7. Author's Address 7. Author's Address
Kevin Lahey <kml@nas.nasa.gov> Kevin Lahey <kml@nas.nasa.gov>
NASA Ames Research Center/MRJ NASA Ames Research Center/MRJ
MS 258-6 MS 258-6
Moffett Field, CA 94035 Moffett Field, CA 94035
USA USA
Phone: +1 650/604-4334 Phone: +1 650/604-4334
This draft was created in January 1999. This draft was created in April 1999.
It expires in June 1998. It expires in October 1999.
 End of changes. 35 change blocks. 
190 lines changed or deleted 161 lines changed or added
