draft-ietf-tcpm-2140bis-05.txt | draft-ietf-tcpm-2140bis-06.txt | |||
---|---|---|---|---|
TCPM WG J. Touch | TCPM WG J. Touch | |||
Internet Draft Independent | Internet Draft Independent | |||
Intended status: Informational M. Welzl | Intended status: Informational M. Welzl | |||
Obsoletes: 2140 S. Islam | Obsoletes: 2140 S. Islam | |||
Expires: October 2020 University of Oslo | Expires: May 2021 University of Oslo | |||
April 29, 2020 | November 25, 2020 | |||
TCP Control Block Interdependence | TCP Control Block Interdependence | |||
draft-ietf-tcpm-2140bis-05.txt | draft-ietf-tcpm-2140bis-06.txt | |||
Status of this Memo | Status of this Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
This document may contain material from IETF Documents or IETF | This document may contain material from IETF Documents or IETF | |||
Contributions published or made publicly available before November | Contributions published or made publicly available before November | |||
10, 2008. The person(s) controlling the copyright in some of this | 10, 2008. The person(s) controlling the copyright in some of this | |||
material may not have granted the IETF Trust the right to allow | material may not have granted the IETF Trust the right to allow | |||
skipping to change at page 1, line 45 ¶ | skipping to change at page 1, line 45 ¶ | |||
months and may be updated, replaced, or obsoleted by other documents | months and may be updated, replaced, or obsoleted by other documents | |||
at any time. It is inappropriate to use Internet-Drafts as | at any time. It is inappropriate to use Internet-Drafts as | |||
reference material or to cite them other than as "work in progress." | reference material or to cite them other than as "work in progress." | |||
The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
http://www.ietf.org/ietf/1id-abstracts.txt | http://www.ietf.org/ietf/1id-abstracts.txt | |||
The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
http://www.ietf.org/shadow.html | http://www.ietf.org/shadow.html | |||
This Internet-Draft will expire on October 29, 2020. | This Internet-Draft will expire on May 25, 2021. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2020 IETF Trust and the persons identified as the | Copyright (c) 2020 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(https://trustee.ietf.org/license-info) in effect on the date of | (https://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
skipping to change at page 3, line 21 ¶ | skipping to change at page 3, line 21 ¶ | |||
9. Implications..................................................15 | 9. Implications..................................................15 | |||
9.1. Layering....................................................15 | 9.1. Layering....................................................15 | |||
9.2. Other possibilities.........................................16 | 9.2. Other possibilities.........................................16 | |||
10. Implementation Observations..................................16 | 10. Implementation Observations..................................16 | |||
11. Updates to RFC 2140..........................................17 | 11. Updates to RFC 2140..........................................17 | |||
12. Security Considerations......................................18 | 12. Security Considerations......................................18 | |||
13. IANA Considerations..........................................18 | 13. IANA Considerations..........................................18 | |||
14. References...................................................19 | 14. References...................................................19 | |||
14.1. Normative References....................................19 | 14.1. Normative References....................................19 | |||
14.2. Informative References..................................19 | 14.2. Informative References..................................19 | |||
15. Acknowledgments..............................................21 | 15. Acknowledgments..............................................22 | |||
16. Change log...................................................22 | 16. Change log...................................................22 | |||
Appendix A : TCB Sharing History.................................25 | Appendix A : TCB Sharing History.................................25 | |||
Appendix B : TCP Option Sharing and Caching......................26 | Appendix B : TCP Option Sharing and Caching......................26 | |||
Appendix C : Automating the Initial Window in TCP over Long | Appendix C : Automating the Initial Window in TCP over Long | |||
Timescales.......................................................28 | Timescales.......................................................28 | |||
C.1. Introduction.............................................28 | C.1. Introduction.............................................28 | |||
C.2. Design Considerations....................................28 | C.2. Design Considerations....................................28 | |||
C.3. Proposed IW Algorithm....................................29 | C.3. Proposed IW Algorithm....................................29 | |||
C.4. Discussion...............................................32 | C.4. Discussion...............................................33 | |||
C.5. Observations.............................................33 | C.5. Observations.............................................34 | |||
1. Introduction | 1. Introduction | |||
TCP is a connection-oriented reliable transport protocol layered | TCP is a connection-oriented reliable transport protocol layered | |||
over IP [RFC793]. Each TCP connection maintains state, usually in a | over IP [RFC793]. Each TCP connection maintains state, usually in a | |||
data structure called the TCP Control Block (TCB). The TCB contains | data structure called the TCP Control Block (TCB). The TCB contains | |||
information about the connection state, its associated local | information about the connection state, its associated local | |||
process, and feedback parameters about the connection's transmission | process, and feedback parameters about the connection's transmission | |||
properties. As originally specified and usually implemented, most | properties. As originally specified and usually implemented, most | |||
TCB information is maintained on a per-connection basis. Some | TCB information is maintained on a per-connection basis. Some | |||
skipping to change at page 9, line 13 ¶ | skipping to change at page 9, line 13 ¶ | |||
old_TFO_failure old_TFO_failure ESTAB old_TFO_failure | old_TFO_failure old_TFO_failure ESTAB old_TFO_failure | |||
6.3. Discussion | 6.3. Discussion | |||
There is no particular benefit to caching MMS_S and MMS_R as these | There is no particular benefit to caching MMS_S and MMS_R as these | |||
are reported by the local IP stack. Caching sendMSS and PMTU is | are reported by the local IP stack. Caching sendMSS and PMTU is | |||
trivial; reported values are cached, and the most recent values are | trivial; reported values are cached, and the most recent values are | |||
used. The cache is updated when the MSS option is received in a SYN | used. The cache is updated when the MSS option is received in a SYN | |||
or after PMTUD (i.e., when an ICMPv4 Fragmentation Needed [RFC1191] | or after PMTUD (i.e., when an ICMPv4 Fragmentation Needed [RFC1191] | |||
or ICMPv6 Packet Too Big message is received [RFC8201] or the | or ICMPv6 Packet Too Big message is received [RFC8201] or the | |||
equivalent is inferred, e.g. as from PLPMTUD [RFC4821]), | equivalent is inferred, e.g., as from PLPMTUD [RFC4821]), | |||
respectively, so the cache always has the most recent values from | respectively, so the cache always has the most recent values from | |||
any connection. For sendMSS, the cache is consulted only at | any connection. For sendMSS, the cache is consulted only at | |||
connection establishment and not otherwise updated, which means that | connection establishment and not otherwise updated, which means that | |||
MSS options do not affect current connections. The default sendMSS | MSS options do not affect current connections. The default sendMSS | |||
is never saved; only reported MSS values update the cache, so an | is never saved; only reported MSS values update the cache, so an | |||
explicit override is required to reduce the sendMSS. | explicit override is required to reduce the sendMSS. | |||
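To make the update rules above concrete, here is a minimal Python sketch of a per-destination cache. It rests on several assumptions. The class and method names are hypothetical. Hosts are keyed by address string. The 536-byte IPv4 default send MSS of RFC 1122 stands in for "the default sendMSS" (IPv6 would use a different default). The sketch shows that reported values overwrite older ones, and that a cache miss returns the default without storing it.

```python
from dataclasses import dataclass, field
from typing import Dict

# Assumption: the IPv4 default send MSS from RFC 1122, used when no MSS
# option has been reported. It is returned on a miss but never cached.
DEFAULT_SEND_MSS = 536


@dataclass
class HostCache:
    """Hypothetical per-destination cache of sendMSS and PMTU values."""
    send_mss: Dict[str, int] = field(default_factory=dict)
    pmtu: Dict[str, int] = field(default_factory=dict)

    def on_mss_option_in_syn(self, host: str, mss: int) -> None:
        # A reported MSS option overwrites any older value: most recent wins.
        self.send_mss[host] = mss

    def on_pmtu_report(self, host: str, mtu: int) -> None:
        # Called after PMTUD, e.g., ICMPv4 Fragmentation Needed, ICMPv6
        # Packet Too Big, or an equivalent inference from PLPMTUD.
        self.pmtu[host] = mtu

    def send_mss_at_establishment(self, host: str) -> int:
        # Consulted only when a new connection is established. On a miss the
        # default is returned without being stored, so a cached value persists
        # until a newer report (or an explicit override, not modeled) replaces it.
        return self.send_mss.get(host, DEFAULT_SEND_MSS)
```

For example, `send_mss_at_establishment("192.0.2.1")` returns 536 on an empty cache and leaves the cache empty. After `on_mss_option_in_syn("192.0.2.1", 1460)`, the same lookup returns 1460. Connections that are already open are not affected, because they never consult the cache again.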
RTT values are updated by formulae that merge the old and new | RTT values are updated by formulae that merge the old and new | |||
values. Dynamic RTT estimation requires a sequence of RTT | values. Dynamic RTT estimation requires a sequence of RTT | |||
measurements. As a result, the cached RTT (and its variance) is an | measurements. As a result, the cached RTT (and its variance) is an | |||
skipping to change at page 10, line 4 ¶ | skipping to change at page 10, line 4 ¶ | |||
Most cached TCB values are updated when a connection closes. The | Most cached TCB values are updated when a connection closes. The | |||
exceptions are MMS_R and MMS_S, which are reported by IP [RFC1122], | exceptions are MMS_R and MMS_S, which are reported by IP [RFC1122], | |||
PMTU, which is updated after Path MTU Discovery | PMTU, which is updated after Path MTU Discovery | |||
[RFC1191][RFC4821][RFC8201], and sendMSS, which is updated if the | [RFC1191][RFC4821][RFC8201], and sendMSS, which is updated if the | |||
MSS option is received in the TCP SYN header. | MSS option is received in the TCP SYN header. | |||
Sharing sendMSS information affects only data in the SYN of the next | Sharing sendMSS information affects only data in the SYN of the next | |||
connection, because sendMSS information is typically included in | connection, because sendMSS information is typically included in | |||
most TCP SYN segments. Caching PMTU can accelerate the efficiency of | most TCP SYN segments. Caching PMTU can accelerate the efficiency of | |||
PMTUD, but can also result in black-holing until corrected if in | PMTUD but can also result in black-holing until corrected if in | |||
error. Caching MMS_R and MMS_S may be of little direct value as they | error. Caching MMS_R and MMS_S may be of little direct value as they | |||
are reported by the local IP stack anyway. | are reported by the local IP stack anyway. | |||
The way in which other TCP option state can be shared depends on the | The way in which other TCP option state can be shared depends on the | |||
details of that option. E.g., TFO state includes the TCP Fast Open | details of that option. E.g., TFO state includes the TCP Fast Open | |||
Cookie [RFC7413] or, in case TFO fails, a negative TCP Fast Open | Cookie [RFC7413] or, in case TFO fails, a negative TCP Fast Open | |||
response. RFC 7413 states, "The client MUST cache negative responses | response. RFC 7413 states, "The client MUST cache negative responses | |||
from the server in order to avoid potential connection failures. | from the server in order to avoid potential connection failures. | |||
Negative responses include the server not acknowledging the data in | Negative responses include the server not acknowledging the data in | |||
the SYN, ICMP error messages, and (most importantly) no response | the SYN, ICMP error messages, and (most importantly) no response | |||
skipping to change at page 13, line 27 ¶ | skipping to change at page 13, line 27 ¶ | |||
of the current windows is increased for any new connection. This can | of the current windows is increased for any new connection. This can | |||
have detrimental consequences where several connections share a | have detrimental consequences where several connections share a | |||
highly congested link. | highly congested link. | |||
There are several ways to initialize the congestion window in a new | There are several ways to initialize the congestion window in a new | |||
TCB among an ensemble of current connections to a host. Current TCP | TCB among an ensemble of current connections to a host. Current TCP | |||
implementations initialize it to four segments as standard [RFC3390] | implementations initialize it to four segments as standard [RFC3390] | |||
and 10 segments experimentally [RFC6928]. These approaches assume | and 10 segments experimentally [RFC6928]. These approaches assume | |||
that new connections should behave as conservatively as possible. | that new connections should behave as conservatively as possible. | |||
The algorithm described in [Ba12] adjusts the initial cwnd depending | The algorithm described in [Ba12] adjusts the initial cwnd depending | |||
on the cwnd values of ongoing connections. There have also been | on the cwnd values of ongoing connections. It is also possible to | |||
suggestions to use the kind of sharing mechanisms described in this | use sharing mechanisms over long timescales to adapt TCP's initial | |||
document over long timescales to adapt TCP's initial window | window automatically, as described further in Appendix C. | |||
automatically, as described further in Appendix C [To12]. | ||||
8. Compatibility Issues | 8. Compatibility Issues | |||
Here, we discuss various types of problems that may arise with TCB | Here, we discuss various types of problems that may arise with TCB | |||
information sharing. | information sharing. | |||
For the congestion and current window information, the initial | For the congestion and current window information, the initial | |||
values computed by TCB interdependence may not be consistent with | values computed by TCB interdependence may not be consistent with | |||
the long-term aggregate behavior of a set of concurrent connections | the long-term aggregate behavior of a set of concurrent connections | |||
between the same endpoints. Under conventional TCP congestion | between the same endpoints. Under conventional TCP congestion | |||
skipping to change at page 18, line 12 ¶ | skipping to change at page 18, line 12 ¶ | |||
and send-MSS separately, adds path MTU and ssthresh, and addresses | and send-MSS separately, adds path MTU and ssthresh, and addresses | |||
the impact on TCP option state. | the impact on TCP option state. | |||
New sections have been added to address compatibility issues and | New sections have been added to address compatibility issues and | |||
implementation observations. The relation of this work to T/TCP has | implementation observations. The relation of this work to T/TCP has | |||
been moved to Appendix A on history, partly to reflect the | been moved to Appendix A on history, partly to reflect the | |||
deprecation of that protocol. | deprecation of that protocol. | |||
Appendix C has been added to discuss the potential to use temporal | Appendix C has been added to discuss the potential to use temporal | |||
sharing over long timescales to adapt TCP's initial window | sharing over long timescales to adapt TCP's initial window | |||
automatically, largely imported from [To12]. | automatically, avoiding the need to periodically revise a single | |||
global constant value. | ||||
Finally, this document updates and significantly expands the | Finally, this document updates and significantly expands the | |||
referenced literature. | referenced literature. | |||
12. Security Considerations | 12. Security Considerations | |||
These presented implementation methods do not have additional | These presented implementation methods do not have additional | |||
ramifications for explicit attacks. They may be susceptible to | ramifications for explicit attacks. They may be susceptible to | |||
denial-of-service attacks if not otherwise secured. | denial-of-service attacks if not otherwise secured. | |||
skipping to change at page 18, line 36 ¶ | skipping to change at page 18, line 37 ¶ | |||
Implications section). Some shared TCB parameters are used only to | Implications section). Some shared TCB parameters are used only to | |||
create new TCBs, others are shared among the TCBs of ongoing | create new TCBs, others are shared among the TCBs of ongoing | |||
connections. New connections can join the ongoing set, e.g., to | connections. New connections can join the ongoing set, e.g., to | |||
optimize send window size among a set of connections to the same | optimize send window size among a set of connections to the same | |||
host. | host. | |||
Attacks on parameters used only for initialization affect only the | Attacks on parameters used only for initialization affect only the | |||
transient performance of a TCP connection. For short connections, | transient performance of a TCP connection. For short connections, | |||
the performance ramification can approach that of a denial-of- | the performance ramification can approach that of a denial-of- | |||
service attack. E.g., if an application changes its TCB to have a | service attack. E.g., if an application changes its TCB to have a | |||
false and small window size, subsequent connections would experience | false and small window size, subsequent connections will experience | |||
performance degradation until their window grew appropriately. | performance degradation until their window grows appropriately. | |||
TCB sharing reuses and mixes information from past and current | TCB sharing reuses and mixes information from past and current | |||
connections. Although reusing information could create a potential | connections. Although reusing information could create a potential | |||
for fingerprinting to identify hosts, the mixing reduces that | for fingerprinting to identify hosts, the mixing reduces that | |||
potential. There has been no evidence of fingerprinting based on | potential. There has been no evidence of fingerprinting based on | |||
this technique and it is currently considered safe in that regard. | this technique and it is currently considered safe in that regard. | |||
13. IANA Considerations | 13. IANA Considerations | |||
skipping to change at page 19, line 50 ¶ | skipping to change at page 20, line 5 ¶ | |||
14.2. Informative References | 14.2. Informative References | |||
[Al10] Allman, M., "Initial Congestion Window Specification", | [Al10] Allman, M., "Initial Congestion Window Specification", | |||
(work in progress), draft-allman-tcpm-bump-initcwnd-00, | (work in progress), draft-allman-tcpm-bump-initcwnd-00, | |||
Nov. 2010. | Nov. 2010. | |||
[Ba12] Barik, R., Welzl, M., Ferlin, S., Alay, O., "LISA: A | [Ba12] Barik, R., Welzl, M., Ferlin, S., Alay, O., "LISA: A | |||
Linked Slow-Start Algorithm for MPTCP", IEEE ICC, Kuala | Linked Slow-Start Algorithm for MPTCP", IEEE ICC, Kuala | |||
Lumpur, Malaysia, May 23-27 2016. | Lumpur, Malaysia, May 23-27 2016. | |||
[Ba20] Bagnulo, M., Briscoe, B., "ECN++: Adding Explicit | ||||
Congestion Notification (ECN) to TCP Control Packets", | ||||
draft-ietf-tcpm-generalized-ecn-06, Oct. 2020. | ||||
[Be94] Berners-Lee, T., et al., "The World-Wide Web," | [Be94] Berners-Lee, T., et al., "The World-Wide Web," | |||
Communications of the ACM, V37, Aug. 1994, pp. 76-82. | Communications of the ACM, V37, Aug. 1994, pp. 76-82. | |||
[Br94] Braden, B., "T/TCP -- Transaction TCP: Source Changes for | [Br94] Braden, B., "T/TCP -- Transaction TCP: Source Changes for | |||
Sun OS 4.1.3", Release 1.0, USC/ISI, September 14, 1994. | Sun OS 4.1.3", Release 1.0, USC/ISI, September 14, 1994. | |||
[Br02] Brownlee, N. and K. Claffy, "Understanding Internet | [Br02] Brownlee, N. and K. Claffy, "Understanding Internet | |||
Traffic Streams: Dragonflies and Tortoises", IEEE | Traffic Streams: Dragonflies and Tortoises", IEEE | |||
Communications Magazine p110-117, 2002. | Communications Magazine p110-117, 2002. | |||
skipping to change at page 21, line 45 ¶ | skipping to change at page 22, line 5 ¶ | |||
B., "Mechanisms for Optimizing Link Aggregation Group | B., "Mechanisms for Optimizing Link Aggregation Group | |||
(LAG) and Equal-Cost Multipath (ECMP) Component Link | (LAG) and Equal-Cost Multipath (ECMP) Component Link | |||
Utilization in Networks", RFC 7424, Jan. 2015. | Utilization in Networks", RFC 7424, Jan. 2015. | |||
[RFC7540] Belshe, M., Peon, R., Thomson, M., "Hypertext Transfer | [RFC7540] Belshe, M., Peon, R., Thomson, M., "Hypertext Transfer | |||
Protocol Version 2 (HTTP/2)", RFC 7540, May 2015. | Protocol Version 2 (HTTP/2)", RFC 7540, May 2015. | |||
[RFC7661] Fairhurst, G., Sathiaseelan, A., Secchi, R., "Updating TCP | [RFC7661] Fairhurst, G., Sathiaseelan, A., Secchi, R., "Updating TCP | |||
to Support Rate-Limited Traffic", RFC 7661, Oct. 2015. | to Support Rate-Limited Traffic", RFC 7661, Oct. 2015. | |||
[To12] Touch, J., "Automating the Initial Window in TCP," draft- | ||||
touch-tcpm-automatic-iw-03 (expired), July 2012. | ||||
15. Acknowledgments | 15. Acknowledgments | |||
The authors would like to thank Praveen Balasubramanian for | The authors would like to thank Praveen Balasubramanian for | |||
information regarding TCB sharing in Windows, and Yuchung Cheng, | information regarding TCB sharing in Windows, and Yuchung Cheng, | |||
Lars Eggert, Ilpo Jarvinen and Michael Scharf for comments on | Lars Eggert, Ilpo Jarvinen and Michael Scharf for comments on | |||
earlier versions of the draft. Earlier revisions of this work | earlier versions of the draft. Earlier revisions of this work | |||
received funding from a collaborative research project between the | received funding from a collaborative research project between the | |||
University of Oslo and Huawei Technologies Co., Ltd. and were partly | University of Oslo and Huawei Technologies Co., Ltd. and were partly | |||
supported by USC/ISI's Postel Center. | supported by USC/ISI's Postel Center. | |||
This document was prepared using 2-Word-v2.0.template.dot. | This document was prepared using 2-Word-v2.0.template.dot. | |||
16. Change log | 16. Change log | |||
This section should be removed upon final publication as an RFC. | This section should be removed upon final publication as an RFC. | |||
ietf-06: | ||||
- Address WGLC comments | ||||
ietf-05: | ||||
- Correction of typographic errors, expansion of terminology | ||||
ietf-04: | ietf-04: | |||
- Fix internal cross-reference errors that appeared in ietf-02 | - Fix internal cross-reference errors that appeared in ietf-02 | |||
- Updated tables to re-center; clarified text | - Updated tables to re-center; clarified text | |||
ietf-03: | ietf-03: | |||
- Correction of typographic errors, minor rewording in appendices | - Correction of typographic errors, minor rewording in appendices | |||
ietf-02: | ietf-02: | |||
skipping to change at page 23, line 25 ¶ | skipping to change at page 23, line 37 ¶ | |||
- Stated that our OS implementation overview table only covers | - Stated that our OS implementation overview table only covers | |||
temporal sharing. | temporal sharing. | |||
- Correctly reflected sharing of old_RTT in Linux in the | - Correctly reflected sharing of old_RTT in Linux in the | |||
implementation overview table. | implementation overview table. | |||
- Marked entries that are considered safe to share with an | - Marked entries that are considered safe to share with an | |||
asterisk (suggestion was to split the table) | asterisk (suggestion was to split the table) | |||
- Discussed correct host identification: NATs may make IP | - Discussed correct host identification: NATs may make IP | |||
addresses the wrong input, could e.g. use HTTP cookie. | addresses the wrong input, could e.g., use HTTP cookie. | |||
- Included MMS_S and MMS_R from RFC1122; fixed the use of MSS and | - Included MMS_S and MMS_R from RFC1122; fixed the use of MSS and | |||
MTU | MTU | |||
- Added information about option sharing, listed options in | - Added information about option sharing, listed options in | |||
Appendix B | Appendix B | |||
Authors' Addresses | Authors' Addresses | |||
Joe Touch | Joe Touch | |||
skipping to change at page 28, line 7 ¶ | skipping to change at page 28, line 7 ¶ | |||
MSS | MSS | |||
TFO negotiation failure (to avoid negotiation retries) | TFO negotiation failure (to avoid negotiation retries) | |||
Safe and necessary to keep state: | Safe and necessary to keep state: | |||
TFO cookie (if TFO succeeded in the past) | TFO cookie (if TFO succeeded in the past) | |||
Appendix C: Automating the Initial Window in TCP over Long Timescales | Appendix C: Automating the Initial Window in TCP over Long Timescales | |||
Note: this section is imported from [To12], updated only to refer to | ||||
itself as an appendix. | ||||
C.1. Introduction | C.1. Introduction | |||
Temporal sharing, as described earlier in this document, builds on | ||||
the assumption that multiple consecutive connections between the | ||||
same host pair are somewhat likely to be exposed to similar | ||||
environment characteristics. The stored information can therefore | ||||
become invalid over time, and suitable precautions should be taken | ||||
(this is discussed further in section 8.1). However, there are also | ||||
cases where it can make sense to use much longer-term measurements | ||||
of TCP connections to gradually influence TCP parameters. This | ||||
appendix describes an example of such a case. | ||||
TCP's congestion control algorithm uses an initial window value | TCP's congestion control algorithm uses an initial window value | |||
(IW), both as a starting point for new connections and after one RTO | (IW), both as a starting point for new connections and as an upper | |||
or more [RFC5681][RFC7661]. This value has evolved over time, | limit for restarting after an idle period [RFC5681][RFC7661]. This | |||
originally one maximum segment size (MSS), and increased to the | value has evolved over time, originally one maximum segment size | |||
lesser of four MSS or 4,380 bytes [RFC3390][RFC5681]. For typical | (MSS), and increased to the lesser of four MSS or 4,380 bytes | |||
Internet connections with an maximum transmission units (MTUs) of | [RFC3390][RFC5681]. For a typical Internet connection with a maximum | |||
1500 bytes, this permits three segments of 1,460 bytes each. | transmission unit (MTU) of 1500 bytes, this permits three segments | |||
of 1,460 bytes each. | ||||
The IW value was originally implied in the original TCP congestion | The IW value was originally implied in the original TCP congestion | |||
control description, and documented as a standard in 1997 | control description and documented as a standard in 1997 | |||
[RFC2001][Ja88]. The value was last updated in 1998 experimentally, | [RFC2001][Ja88]. The value was updated in 1998 experimentally and | |||
and moved to the standards track in 2002 [RFC2414][RFC3390]. There | moved to the standards track in 2002 [RFC2414][RFC3390]. In 2013, it | |||
have been recent proposals to update the IW based on further | was experimentally increased to 10 [RFC6928]. | |||
increases in host and router capabilities and network capacity, some | ||||
focusing on specific values (e.g., IW=10), and others prescribing a | ||||
schedule for increases over time (e.g., IW=6 for 2011, increasing by | ||||
1-2 MSS per year). | ||||
This appendix discusses how TCP can objectively measure when an IW | This appendix discusses how TCP can objectively measure when an IW | |||
is too large, and that such feedback should be used over long | is too large, and that such feedback should be used over long | |||
timescales to adjust the IW automatically. The result should be | timescales to adjust the IW automatically. The result should be | |||
safer to deploy and might avoid the need to repeatedly revisit IW | safer to deploy and might avoid the need to repeatedly revisit IW | |||
size over time. | over time. | |||
Note that this mechanism attempts to make the IW more adaptive over | Note that this mechanism attempts to make the IW more adaptive over | |||
time. It can increase the IW beyond that which is currently | time. It can increase the IW beyond that which is currently | |||
recommended for widescale deployment, and so its use should be | recommended for widescale deployment, and so its use should be | |||
carefully monitored. | carefully monitored. | |||
C.2. Design Considerations | C.2. Design Considerations | |||
TCP's IW value has existed statically for over two decades, so any | TCP's IW value has existed statically for over two decades, so any | |||
solution to adjusting the IW dynamically should have similarly | solution to adjusting the IW dynamically should have similarly | |||
stable, non-invasive effects on the performance and complexity of | stable, non-invasive effects on the performance and complexity of | |||
TCP. In order to be fair, the IW should be similar for most machines | TCP. In order to be fair, the IW should be similar for most machines | |||
on the public Internet. Finally, a desirable goal is to develop a | on the public Internet. Finally, a desirable goal is to develop a | |||
self-correcting algorithm, so that IW values that cause network | self-correcting algorithm, so that IW values that cause network | |||
problems can be avoided. To that end, we propose the following list | problems can be avoided. To that end, we propose the following | |||
of design goals: | design goals: | |||
o Impart little to no impact to TCP in the absence of loss, i.e., | o Impart little to no impact to TCP in the absence of loss, i.e., | |||
it should not increase the complexity of default packet | it should not increase the complexity of default packet | |||
processing in the normal case. | processing in the normal case. | |||
o Adapt to network feedback over long timescales, avoiding values | o Adapt to network feedback over long timescales, avoiding values | |||
that persistently cause network problems. | that persistently cause network problems. | |||
o Decrease the IW in the presence of sustained loss of IW segments, | o Decrease the IW in the presence of sustained loss of IW segments, | |||
as determined over a number of different connections. | as determined over a number of different connections. | |||
skipping to change at page 29, line 41 ¶ | skipping to change at page 29, line 44 ¶ | |||
the initial burst of packets, it is clearly inappropriate and could | the initial burst of packets, it is clearly inappropriate and could | |||
be inducing unnecessary loss in other competing connections. This | be inducing unnecessary loss in other competing connections. This | |||
might happen for sites behind very slow boxes with small buffers, | might happen for sites behind very slow boxes with small buffers, | |||
which may or may not be the first hop. | which may or may not be the first hop. | |||
C.3. Proposed IW Algorithm | C.3. Proposed IW Algorithm | |||
Below is a simple description of the proposed IW algorithm. It | Below is a simple description of the proposed IW algorithm. It | |||
relies on the following parameters: | relies on the following parameters: | |||
o MinIW = 3 MSS or 4,380 bytes (as per RFC3390] | o MinIW = 3 MSS or 4,380 bytes (as per [RFC3390]) | |||
o MaxIW = 10 | o MaxIW = 10 MSS (as per [RFC6928]) | |||
o MulDecr = 0.5 | o MulDecr = 0.5 | |||
o AddIncr = 2 MSS | o AddIncr = 2 MSS | |||
o Threshold = 0.05 | o Threshold = 0.05 | |||
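For concreteness, the parameters above can be expressed in bytes. The sketch below assumes an MSS of 1460 bytes, a common Ethernet-derived value that this document does not itself fix. With that MSS, 3 MSS equals the 4,380-byte figure from [RFC3390]. The constant names are hypothetical.

```python
# Hypothetical constants for the proposed IW algorithm, all in bytes.
# Assumption: MSS = 1460 bytes (typical on Ethernet paths); the
# algorithm does not prescribe a particular MSS.
MSS = 1460

MIN_IW = 3 * MSS     # 4,380 bytes for this MSS, matching [RFC3390]
MAX_IW = 10 * MSS    # 10 MSS, as per [RFC6928]
MUL_DECR = 0.5       # multiplicative decrease factor (MulDecr)
ADD_INCR = 2 * MSS   # additive increase step (AddIncr)
THRESHOLD = 0.05     # tolerated fraction of connections with IW loss
```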
We assume that the minimum IW (MinIW) should be as currently | We assume that the minimum IW (MinIW) should be as currently | |||
specified [RFC3390]. The maximum IW can be set to a fixed value | specified [RFC3390]. The maximum IW can be set to a fixed value (as | |||
[RFC6928], or set based on a schedule if trusted time references are | recommended in [RFC6928]) or set based on a schedule if trusted time | |||
available [Al10]; here we prefer a fixed value. We also propose to | references are available [Al10]; here we prefer a fixed value. We | |||
use an AIMD algorithm, with increase and decreases as noted. | also propose to use an AIMD algorithm, with increases and decreases | |||
| set by AddIncr and MulDecr, respectively. | ||||
Although these parameters are somewhat arbitrary, their initial | Although these parameters are somewhat arbitrary, their initial | |||
values are not important except that the algorithm is AIMD and the | values are not important except that the algorithm is AIMD and the | |||
MaxIW should not exceed that recommended for other systems on the | MaxIW should not exceed that recommended for other systems on the | |||
Internet. Current proposals, including default current operation, | Internet. Current proposals, including default current operation, | |||
are degenerate cases of the algorithm below for given parameters - | are degenerate cases of the algorithm below for given parameters, | |||
notably MulDec = 1.0 and AddIncr = 0 MSS, thus disabling the | notably MulDecr = 1.0 and AddIncr = 0 MSS, thus disabling the | |||
automatic part of the algorithm. | automatic part of the algorithm. | |||
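To illustrate why fixed-IW operation is a degenerate case, the sketch below implements only the periodic AIMD evaluation of step 5 (hypothetical function name `update_iw`, IW in bytes, MSS assumed to be 1460 bytes). With MulDecr = 1.0 and AddIncr = 0, the returned IW equals the input regardless of the observed loss rate.

```python
MSS = 1460  # assumed MSS in bytes, for illustration only


def update_iw(iw, losscount, conncount, mul_decr, add_incr, threshold):
    """One periodic AIMD evaluation of the IW (in bytes), as in step 5."""
    if losscount / conncount > threshold:
        return iw * mul_decr  # too many connections lost IW segments
    return iw + add_incr      # otherwise, try a slightly larger IW


# Degenerate parameters: the IW never moves, whatever the loss rate.
unchanged_lossy = update_iw(10 * MSS, 500, 1001, 1.0, 0, 0.05)
unchanged_clean = update_iw(10 * MSS, 0, 1001, 1.0, 0, 0.05)
```

With the proposed values (MulDecr = 0.5, AddIncr = 2 MSS), the same call halves the IW when more than 5% of connections saw IW loss.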
The proposed algorithm is as follows: | The proposed algorithm is as follows: | |||
1. On boot: | 1. On boot: | |||
IW = MaxIW; # assume this is in bytes, and an even number of MSS | IW = MaxIW; # assume this is in bytes, and an even number of MSS | |||
2. Upon starting a new connection | 2. Upon starting a new connection: | |||
CWND = IW; | CWND = IW; | |||
conncount++; | conncount++; | |||
IWnotchecked = 1; # true | IWnotchecked = 1; # true | |||
3. During a connection's SYN-ACK processing, if SYN-ACK includes | 3. During a connection's SYN-ACK processing, if the SYN-ACK | |||
ECN, treat as if the IW is too large | includes ECN (as similarly addressed in Section 5 of ECN++ for | |||
| TCP [Ba20]), treat as if the IW is too large: | ||||
if (IWnotchecked && (synackecn == 1)) { | if (IWnotchecked && (synackecn == 1)) { | |||
losscount++; | losscount++; | |||
IWnotchecked = 0; # never check again | IWnotchecked = 0; # never check again | |||
} | } | |||
4. During a connection, if retransmission occurs, check the seqno of | 4. During a connection, if retransmission occurs, check the seqno of | |||
the outgoing packet (in bytes) to see if the resent segment fixes | the outgoing packet (in bytes) to see if the resent segment fixes | |||
an IW loss: | an IW loss: | |||
if (Retransmitting && IWnotchecked && ((ISN - seqno) < IW))) { | if (Retransmitting && IWnotchecked && ((seqno - ISN) < IW)) { | |||
losscount++; | losscount++; | |||
IWnotchecked = 0; # never do this entire "if" again | IWnotchecked = 0; # never do this entire "if" again | |||
} else { | } else { | |||
IWnotchecked = 0; # you're beyond the IW so stop checking | IWnotchecked = 0; # you're beyond the IW so stop checking | |||
} | } | |||
5. Once every 1000 conections, as a separate process (i.e., not as | 5. Once every 1000 connections, as a separate process (i.e., not as | |||
part of processing a given connection): | part of processing a given connection): | |||
if (conncount > 1000) { | if (conncount > 1000) { | |||
if (losscount/conncount > threshold) { | if (losscount/conncount > threshold) { | |||
# the number of connections with errors is too high | # the number of connections with errors is too high | |||
IW = IW * MulDecr; | IW = IW * MulDecr; | |||
} else { | } else { | |||
IW = IW + AddIncr; | IW = IW + AddIncr; | |||
} | } | |||
} | } | |||
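Taken together, steps 1-5 can be sketched in Python as follows. This is a minimal, non-authoritative sketch. The class and method names are hypothetical, the MSS is assumed fixed at 1460 bytes, and step 4 is assumed to run only when a segment is retransmitted. Two behaviors the pseudocode leaves implicit are filled in as labeled assumptions: the IW is clamped to [MinIW, MaxIW], and both counters are reset after each periodic evaluation so that step 5 fires once per 1000 connections.

```python
from dataclasses import dataclass

MSS = 1460         # assumption: fixed MSS in bytes
SEQ_SPACE = 2**32  # 32-bit TCP sequence number space


@dataclass
class Connection:
    isn: int                     # initial sequence number (ISN)
    iw: int                      # IW (bytes) in effect when the connection began
    cwnd: int                    # congestion window (bytes)
    iw_not_checked: bool = True  # IWnotchecked in the pseudocode


class IWController:
    """Hypothetical host-wide state for steps 1-5 of the proposed IW algorithm."""

    def __init__(self, min_iw=3 * MSS, max_iw=10 * MSS, mul_decr=0.5,
                 add_incr=2 * MSS, threshold=0.05, period=1000):
        self.min_iw, self.max_iw = min_iw, max_iw
        self.mul_decr, self.add_incr = mul_decr, add_incr
        self.threshold, self.period = threshold, period
        self.iw = max_iw  # step 1: on boot, IW = MaxIW
        self.conncount = 0
        self.losscount = 0

    def new_connection(self, isn):
        # Step 2: the new connection starts with CWND = IW.
        self.conncount += 1
        return Connection(isn=isn, iw=self.iw, cwnd=self.iw)

    def on_synack(self, conn, synack_ecn):
        # Step 3: ECN on the SYN-ACK is treated as if the IW were too large.
        if conn.iw_not_checked and synack_ecn:
            self.losscount += 1
            conn.iw_not_checked = False

    def on_retransmit(self, conn, seqno):
        # Step 4: assumed to be called only when a segment is retransmitted.
        offset = (seqno - conn.isn) % SEQ_SPACE  # bytes past the ISN
        if conn.iw_not_checked and offset < conn.iw:
            self.losscount += 1  # the resent segment repairs an IW loss
        conn.iw_not_checked = False  # both branches stop further checks

    def periodic_check(self):
        # Step 5: run separately from per-connection processing.
        if self.conncount > self.period:
            if self.losscount / self.conncount > self.threshold:
                self.iw = int(self.iw * self.mul_decr)
            else:
                self.iw += self.add_incr
            # Assumption: keep the IW within [MinIW, MaxIW].
            self.iw = max(self.min_iw, min(self.iw, self.max_iw))
            # Assumption: start a fresh evaluation interval.
            self.conncount = 0
            self.losscount = 0
```

Because both branches of step 4 clear the per-connection flag, the sketch collapses them into one assignment. The offset is computed modulo 2^32 so that the first IW is still recognized when it spans a sequence-number wrap.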
We recognize that this algorithm can yield a false positive when the | As presented, this algorithm can yield a false positive when the | |||
sequence number wraps around. This can be avoided using either PAWS | sequence number wraps around, e.g., the code might increment | |||
[RFC7323] context or 64-bit internal sequence numbers (as in TCP-AO | losscount in step 4 when no loss occurred or fail to increment | |||
[RFC5925]). Alternately, false positives can be allowed since they | losscount when a loss did occur. This can be avoided using either | |||
are expected to be infrequent and thus will not affect the overall | PAWS [RFC7323] context or internal extended sequence number | |||
statistics of the algorithm. | representations (as in TCP-AO [RFC5925]). Alternatively, false | |||
positives can be tolerated because they are expected to be | ||||
infrequent and thus will not significantly impact the algorithm. | ||||
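The wraparound problem, and the extended-sequence-number remedy, can be sketched as follows. The function names are hypothetical. The extended form assumes the stack keeps a 64-bit sequence number (the 32-bit value plus a count of wraps, in the spirit of the TCP-AO sequence number extension) whose starting point is the ISN. The PAWS-based alternative is not shown.

```python
SEQ_SPACE = 2**32  # TCP sequence numbers are 32 bits


def in_iw_32bit(seqno32, isn, iw):
    """IW test on raw 32-bit sequence numbers; aliases after 2^32 bytes."""
    return (seqno32 - isn) % SEQ_SPACE < iw


def in_iw_extended(seqno64, isn, iw):
    """IW test on a 64-bit extended sequence number (no aliasing)."""
    return seqno64 - isn < iw


IW = 10 * 1460  # assumed IW in bytes (10 MSS with a 1460-byte MSS)
ISN = 1000

# A retransmission 2^32 + 500 bytes into the connection, far beyond the IW.
late_seq64 = ISN + SEQ_SPACE + 500
late_seq32 = late_seq64 % SEQ_SPACE

false_positive = in_iw_32bit(late_seq32, ISN, IW)   # wraps back into the IW
correct_result = in_iw_extended(late_seq64, ISN, IW)
```

Both forms still recognize a genuine IW segment whose sequence numbers cross the 2^32 boundary; only the extended form rejects the aliased late retransmission.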
The following additional constraints are imposed: | If this mechanism is implemented, a number of additional | |||
| constraints need to be imposed to ensure that it defaults to values | ||||
| that comply with current Internet standards, is conservative in how | ||||
| it extends those values, and returns to those values in the absence | ||||
| of positive feedback (i.e., success). To that end, we recommend the | ||||
| following example constraints: | ||||
>> The automatic IW algorithm MUST initialize to MaxIW, in the | >> The automatic IW algorithm MUST initialize MaxIW to a value no | |||
larger than the currently recommended Internet default, in the | ||||
absence of other context information. | absence of other context information. | |||
If there are too few connections to make a decision or if there is | Thus, if there are too few connections to make a decision or if | |||
otherwise insufficient information to increase the IW, then the | there is otherwise insufficient information to increase the IW, then | |||
MaxIW defaults to the current recommended value. | the MaxIW defaults to the current recommended value. | |||
>> An implementation may allow the MaxIW to grow beyond the | >> An implementation MAY allow the MaxIW to grow beyond the | |||
currently recommended Internet default, but not more than 2 segments | currently recommended Internet default, but not more than 2 segments | |||
per calendar year. | per calendar year. | |||
If an endpoint has a persistent history of successfully transmitting | Thus, if an endpoint has a persistent history of successfully | |||
IW segments without loss, then it is allowed to probe the Internet | transmitting IW segments without loss, then it is allowed to probe | |||
to determine if larger IW values have similar success. This probing | the Internet to determine if larger IW values have similar success. | |||
is limited and requires a trusted time source, otherwise the MaxIW | This probing is limited and requires a trusted time source, | |||
remains constant. | otherwise the MaxIW remains constant. | |||
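One reading of the growth limit is a ceiling on MaxIW that rises by at most 2 segments per elapsed calendar year, and stays at the default when no trusted time source is available. The sketch below uses hypothetical names and assumes a known base year in which the default took effect. It computes only the ceiling: actually raising MaxIW still requires a persistent history of successful IW transmissions.

```python
def max_iw_cap(default_segments, base_year, trusted_year):
    """Upper bound on MaxIW (in segments) under a 2-segments-per-year limit.

    base_year: calendar year in which the default took effect (assumed).
    trusted_year: current year from a trusted time source, or None.
    """
    if trusted_year is None:  # no trusted time source: MaxIW stays constant
        return default_segments
    years = max(0, trusted_year - base_year)  # ignore clocks set backward
    return default_segments + 2 * years
```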
>> An implementation MUST adjust the IW based on loss statistics at | >> An implementation MUST adjust the IW based on loss statistics at | |||
least once every 1000 connections. | least once every 1000 connections. | |||
An endpoint needs to be sufficiently reactive to IW loss. | An endpoint needs to be sufficiently reactive to IW loss. | |||
>> An implementation MUST decrease the IW by at least one MSS when | >> An implementation MUST decrease the IW by at least one MSS when | |||
indicated during an evaluation interval. | indicated during an evaluation interval. | |||
An endpoint that detects loss needs to decrease its IW by at least | An endpoint that detects loss needs to decrease its IW by at least | |||
skipping to change at page 33, line 22 ¶ | skipping to change at page 33, line 38 ¶ | |||
in addition to losses during the first IW of a connection. In this | in addition to losses during the first IW of a connection. In this | |||
case, the implementation MUST count each restart as a "connection" | case, the implementation MUST count each restart as a "connection" | |||
for the purposes of connection counts and periodic rechecking of the | for the purposes of connection counts and periodic rechecking of the | |||
IW value. | IW value. | |||
False positives can occur during some kinds of segment reordering, | False positives can occur during some kinds of segment reordering, | |||
e.g., that might trigger spurious retransmissions even without a | e.g., that might trigger spurious retransmissions even without a | |||
true segment loss. These are not expected to be sufficiently common | true segment loss. These are not expected to be sufficiently common | |||
to dominate the algorithm and its conclusions. | to dominate the algorithm and its conclusions. | |||
This mechanism does require additional per-connection state which is | This mechanism does require additional per-connection state, which | |||
currently common in some implementations, and is useful for other | is currently common in some implementations, and is useful for other | |||
reasons (e.g., the ISN is used in TCP-AO [RFC5925]). The mechanism | reasons (e.g., the ISN is used in TCP-AO [RFC5925]). The mechanism | |||
also benefits from persistent state kept across reboots, as would be | also benefits from persistent state kept across reboots, as would be | |||
other state sharing mechanisms (e.g., TCP Control Block Sharing | other state sharing mechanisms (e.g., TCP Control Block Sharing | |||
[RFC2140]). The mechanism is inspired by RFC 2140's use of | [RFC2140]). The mechanism is inspired by RFC 2140's use of | |||
information across connections. | information across connections. | |||
The receive window (RWIN) is not involved in this calculation. The | The receive window (RWIN) is not involved in this calculation. The | |||
size of RWIN is determined by receiver resources, and provides space | size of RWIN is determined by receiver resources and provides space | |||
to accommodate segment reordering. It is not involved with | to accommodate segment reordering. It is not involved with | |||
congestion control, which is the focus of this document and its | congestion control, which is the focus of this document and its | |||
management of the IW. | management of the IW. | |||
C.5. Observations | C.5. Observations | |||
The IW may not converge to a single, global value. It also may not | The IW may not converge to a single, global value. It also may not | |||
converge at all, but rather may oscillate by a few MSS as it | converge at all, but rather may oscillate by a few MSS as it | |||
repeatedly probes the Internet for larger IWs and fails. Both | repeatedly probes the Internet for larger IWs and fails. Both | |||
properties are consistent with TCP behavior during each individual | properties are consistent with TCP behavior during each individual | |||
End of changes. 36 change blocks. | ||||
70 lines changed or deleted | 93 lines changed or added | |||