draft-ietf-tcpm-2140bis-00.txt | draft-ietf-tcpm-2140bis-01.txt | |||
---|---|---|---|---|
TCPM WG J. Touch | TCPM WG J. Touch | |||
Internet Draft Independent | Internet Draft Independent | |||
Intended status: Informational M. Welzl | Intended status: Informational M. Welzl | |||
Obsoletes: 2140 S. Islam | Obsoletes: 2140 S. Islam | |||
Expires: October 2019 University of Oslo | Expires: May 2020 University of Oslo | |||
April 15, 2019 | November 19, 2019 | |||
TCP Control Block Interdependence | TCP Control Block Interdependence | |||
draft-ietf-tcpm-2140bis-00.txt | draft-ietf-tcpm-2140bis-01.txt | |||
Status of this Memo | Status of this Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
This document may contain material from IETF Documents or IETF | This document may contain material from IETF Documents or IETF | |||
Contributions published or made publicly available before November | Contributions published or made publicly available before November | |||
10, 2008. The person(s) controlling the copyright in some of this | 10, 2008. The person(s) controlling the copyright in some of this | |||
material may not have granted the IETF Trust the right to allow | material may not have granted the IETF Trust the right to allow | |||
skipping to change at page 1, line 45 ¶ | skipping to change at page 1, line 45 ¶ | |||
months and may be updated, replaced, or obsoleted by other documents | months and may be updated, replaced, or obsoleted by other documents | |||
at any time. It is inappropriate to use Internet-Drafts as | at any time. It is inappropriate to use Internet-Drafts as | |||
reference material or to cite them other than as "work in progress." | reference material or to cite them other than as "work in progress." | |||
The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
http://www.ietf.org/ietf/1id-abstracts.txt | http://www.ietf.org/ietf/1id-abstracts.txt | |||
The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
http://www.ietf.org/shadow.html | http://www.ietf.org/shadow.html | |||
This Internet-Draft will expire on October 15, 2019. | This Internet-Draft will expire on May 19, 2020. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2019 IETF Trust and the persons identified as the | Copyright (c) 2019 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(https://trustee.ietf.org/license-info) in effect on the date of | (https://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
skipping to change at page 2, line 42 ¶ | skipping to change at page 2, line 42 ¶ | |||
across connections to the same host. Such sharing is intended to | across connections to the same host. Such sharing is intended to | |||
improve overall transient transport performance, while maintaining | improve overall transient transport performance, while maintaining | |||
backward-compatibility with existing implementations. The sharing | backward-compatibility with existing implementations. The sharing | |||
described herein is limited to only the TCB initialization and so | described herein is limited to only the TCB initialization and so | |||
has no effect on the long-term behavior of TCP after a connection | has no effect on the long-term behavior of TCP after a connection | |||
has been established. | has been established. | |||
Table of Contents | Table of Contents | |||
1. Introduction...................................................3 | 1. Introduction...................................................3 | |||
2. Conventions used in this document..............................3 | 2. Conventions used in this document..............................4 | |||
3. Terminology....................................................4 | 3. Terminology....................................................4 | |||
4. The TCP Control Block (TCB)....................................4 | 4. The TCP Control Block (TCB)....................................4 | |||
5. TCB Interdependence............................................5 | 5. TCB Interdependence............................................5 | |||
6. An Example of Temporal Sharing.................................5 | 6. An Example of Temporal Sharing.................................6 | |||
7. An Example of Ensemble Sharing.................................9 | 7. An Example of Ensemble Sharing.................................9 | |||
8. Compatibility Issues..........................................11 | 8. Compatibility Issues..........................................11 | |||
9. Implications..................................................13 | 9. Implications..................................................13 | |||
10. Implementation Observations..................................14 | 10. Implementation Observations..................................14 | |||
11. Updates to RFC 2140..........................................15 | 11. Updates to RFC 2140..........................................15 | |||
12. Security Considerations......................................16 | 12. Security Considerations......................................16 | |||
13. IANA Considerations..........................................16 | 13. IANA Considerations..........................................16 | |||
14. References...................................................16 | 14. References...................................................16 | |||
14.1. Normative References....................................16 | 14.1. Normative References....................................16 | |||
14.2. Informative References..................................17 | 14.2. Informative References..................................17 | |||
15. Acknowledgments..............................................19 | 15. Acknowledgments..............................................19 | |||
16. Change log...................................................19 | 16. Change log...................................................20 | |||
17. Appendix A: TCB sharing history..............................21 | Appendix A : TCB sharing history.................................22 | |||
18. Appendix B: Options..........................................22 | Appendix B : TCP Option Sharing and Caching......................22 | |||
Appendix C : Automating the Initial Window in TCP over Long | ||||
Timescales.......................................................25 | ||||
C.1. Introduction.............................................25 | ||||
C.2. Design Considerations....................................25 | ||||
C.3. Proposed IW Algorithm....................................26 | ||||
C.4. Discussion...............................................29 | ||||
C.5. Observations.............................................30 | ||||
1. Introduction | 1. Introduction | |||
TCP is a connection-oriented reliable transport protocol layered | TCP is a connection-oriented reliable transport protocol layered | |||
over IP [RFC793]. Each TCP connection maintains state, usually in a | over IP [RFC793]. Each TCP connection maintains state, usually in a | |||
data structure called the TCP Control Block (TCB). The TCB contains | data structure called the TCP Control Block (TCB). The TCB contains | |||
information about the connection state, its associated local | information about the connection state, its associated local | |||
process, and feedback parameters about the connection's transmission | process, and feedback parameters about the connection's transmission | |||
properties. As originally specified and usually implemented, most | properties. As originally specified and usually implemented, most | |||
TCB information is maintained on a per-connection basis. Some | TCB information is maintained on a per-connection basis. Some | |||
skipping to change at page 7, line 14 ¶ | skipping to change at page 7, line 22 ¶ | |||
(SYN-ACK) from the server at all, i.e., connection timeout." [RFC | (SYN-ACK) from the server at all, i.e., connection timeout." [RFC | |||
7413]. TFOinfo is cached when a connection is established. | 7413]. TFOinfo is cached when a connection is established. | |||
Other TCP option state might not be as readily cached. E.g., TCP-AO | Other TCP option state might not be as readily cached. E.g., TCP-AO | |||
[RFC5925] success or failure between a host pair for a single SYN | [RFC5925] success or failure between a host pair for a single SYN | |||
destination port might be usefully cached. TCP-AO success or failure | destination port might be usefully cached. TCP-AO success or failure | |||
to other SYN destination ports on that host pair is never useful to | to other SYN destination ports on that host pair is never useful to | |||
cache because TCP-AO security parameters can vary per service. | cache because TCP-AO security parameters can vary per service. | |||
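As a rough illustration of the scoping described above, the following Python sketch keeps one kind of option state per remote host and another per (remote host, SYN destination port). The class and method names are hypothetical and are not taken from this draft or from any TCP implementation. Treating the TFO cookie as host-scoped is an assumption made for this example.

```python
from typing import Dict, Optional, Tuple


class OptionStateCache:
    """Toy cache separating host-scoped from port-scoped option state."""

    def __init__(self) -> None:
        # Assumption for this sketch: TFO cookies are scoped to the remote host.
        self._tfo_cookie: Dict[str, bytes] = {}
        # TCP-AO outcomes are scoped to (remote host, SYN destination port).
        self._ao_ok: Dict[Tuple[str, int], bool] = {}

    def remember_tfo_cookie(self, remote: str, cookie: bytes) -> None:
        self._tfo_cookie[remote] = cookie

    def tfo_cookie(self, remote: str) -> Optional[bytes]:
        return self._tfo_cookie.get(remote)

    def remember_ao(self, remote: str, syn_dst_port: int, ok: bool) -> None:
        self._ao_ok[(remote, syn_dst_port)] = ok

    def ao_outcome(self, remote: str, syn_dst_port: int) -> Optional[bool]:
        # Deliberately no fallback to results recorded for other ports,
        # because TCP-AO security parameters can vary per service.
        return self._ao_ok.get((remote, syn_dst_port))
```

A lookup for a port that has no recorded outcome returns None, so a result learned on one service is never reused for another service on the same host pair.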
The table below gives an overview of option-specific information | The table below gives an overview of option-specific information | |||
that can be shared. | that can be shared. Additional information on TCP options and | |||
sharing is provided in Appendix B. | ||||
TEMPORAL SHARING - Option info | TEMPORAL SHARING - Option info | |||
Cached New | Cached New | |||
---------------------------------------- | ---------------------------------------- | |||
old_TFO_Cookie old_TFO_Cookie | old_TFO_Cookie old_TFO_Cookie | |||
old_TFO_Failure old_TFO_Failure | old_TFO_Failure old_TFO_Failure | |||
TEMPORAL SHARING - Cache Updates | TEMPORAL SHARING - Cache Updates | |||
skipping to change at page 11, line 39 ¶ | skipping to change at page 11, line 39 ¶ | |||
There are several ways to initialize the congestion window in a new | There are several ways to initialize the congestion window in a new | |||
TCB among an ensemble of current connections to a host. Current TCP | TCB among an ensemble of current connections to a host. Current TCP | |||
implementations initialize it to four segments as standard [RFC3390] | implementations initialize it to four segments as standard [RFC3390] | |||
and 10 segments experimentally [RFC6928]. These approaches assume | and 10 segments experimentally [RFC6928]. These approaches assume | |||
that new connections should behave as conservatively as possible. | that new connections should behave as conservatively as possible. | |||
The algorithm described in [Ba12] adjusts the initial cwnd depending | The algorithm described in [Ba12] adjusts the initial cwnd depending | |||
on the cwnd values of ongoing connections. There have also been | on the cwnd values of ongoing connections. There have also been | |||
suggestions to use the kind of sharing mechanisms described in this | suggestions to use the kind of sharing mechanisms described in this | |||
document over long timescales to adapt TCP's initial window | document over long timescales to adapt TCP's initial window | |||
automatically [To13]. | automatically, as described further in Appendix C [To12]. | |||
8. Compatibility Issues | 8. Compatibility Issues | |||
For the congestion and current window information, the initial | For the congestion and current window information, the initial | |||
values computed by TCB interdependence may not be consistent with | values computed by TCB interdependence may not be consistent with | |||
the long-term aggregate behavior of a set of concurrent connections | the long-term aggregate behavior of a set of concurrent connections | |||
between the same endpoints. Under conventional TCP congestion | between the same endpoints. Under conventional TCP congestion | |||
control, if a single existing connection has converged to a | control, if a single existing connection has converged to a | |||
congestion window of 40 segments, two newly joining concurrent | congestion window of 40 segments, two newly joining concurrent | |||
connections assume initial windows of 10 segments [RFC6928], and the | connections assume initial windows of 10 segments [RFC6928], and the | |||
skipping to change at page 12, line 34 ¶ | skipping to change at page 12, line 34 ¶ | |||
shared only within connections to the same SYN destination port. In | shared only within connections to the same SYN destination port. In | |||
case of Temporal Sharing, TCB information could also become invalid | case of Temporal Sharing, TCB information could also become invalid | |||
over time. Because this is similar to the case when a connection | over time. Because this is similar to the case when a connection | |||
becomes idle, mechanisms that address idle TCP connections (e.g., | becomes idle, mechanisms that address idle TCP connections (e.g., | |||
[RFC7661]) could also be applied to TCB cache management, especially | [RFC7661]) could also be applied to TCB cache management, especially | |||
when TCP Fast Open is used [RFC7413]. | when TCP Fast Open is used [RFC7413]. | |||
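The two constraints in this paragraph, port-scoped sharing and entries that lose validity over time, can be sketched as a small cache. This is a minimal sketch under stated assumptions: it uses a plain age limit, which is much simpler than the idle-connection mechanisms of [RFC7661], and the names AgingTcbCache, max_age_s, and the injectable clock are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, int]  # (remote address, SYN destination port)


class AgingTcbCache:
    """Toy temporal-sharing cache whose entries expire after max_age_s."""

    def __init__(self, max_age_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_age_s = max_age_s
        self._clock = clock  # injectable so that aging can be tested
        self._entries: Dict[Key, Tuple[float, Dict[str, float]]] = {}

    def store(self, remote: str, syn_dst_port: int,
              values: Dict[str, float]) -> None:
        # Record the values together with the time they were cached.
        self._entries[(remote, syn_dst_port)] = (self._clock(), dict(values))

    def lookup(self, remote: str,
               syn_dst_port: int) -> Optional[Dict[str, float]]:
        key = (remote, syn_dst_port)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, values = entry
        if self._clock() - stored_at > self._max_age_s:
            # Too old to trust: drop the entry and fall back to defaults.
            del self._entries[key]
            return None
        return dict(values)
```

When lookup returns None, the caller would initialize the new TCB with its usual default values, which matches the behavior of an implementation that does no sharing.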
There may be additional considerations to the way in which TCB | There may be additional considerations to the way in which TCB | |||
interdependence rebalances congestion feedback among the current | interdependence rebalances congestion feedback among the current | |||
connections, e.g., it may be appropriate to consider the impact of a | connections, e.g., it may be appropriate to consider the impact of a | |||
connection being in Fast Recovery [RFC5861] or some other similar | connection being in Fast Recovery [RFC5681] or some other similar | |||
unusual feedback state, e.g., as inhibiting or affecting the | unusual feedback state, e.g., as inhibiting or affecting the | |||
calculations described herein. | calculations described herein. | |||
TCP is sometimes used in situations where packets of the same host- | TCP is sometimes used in situations where packets of the same host- | |||
pair do not always take the same path. Multipath routing that relies | pair do not always take the same path. Multipath routing that relies | |||
on examining transport headers, such as ECMP and LAG, may not result | on examining transport headers, such as ECMP and LAG, may not result | |||
in repeatable path selection when TCP segments are encapsulated, | in repeatable path selection when TCP segments are encapsulated, | |||
encrypted, or altered - for example, in some Virtual Private Network | encrypted, or altered - for example, in some Virtual Private Network | |||
(VPN) tunnels that rely on proprietary encapsulation. Similarly, | (VPN) tunnels that rely on proprietary encapsulation. Similarly, | |||
such approaches cannot operate deterministically when the TCP header | such approaches cannot operate deterministically when the TCP header | |||
skipping to change at page 16, line 4 ¶ | skipping to change at page 16, line 4 ¶ | |||
multipath TCP, fast open, PLPMTUD, NAT, and the TCP Authentication | multipath TCP, fast open, PLPMTUD, NAT, and the TCP Authentication | |||
Option. | Option. | |||
The detailed impact on TCB state addresses TCB parameters in greater | The detailed impact on TCB state addresses TCB parameters in greater | |||
detail, addressing RSS in both the send and receive direction, MSS | detail, addressing RSS in both the send and receive direction, MSS | |||
and send-MSS separately, adds path MTU and ssthresh, and addresses | and send-MSS separately, adds path MTU and ssthresh, and addresses | |||
the impact on TCP option state. | the impact on TCP option state. | |||
New sections have been added to address compatibility issues and | New sections have been added to address compatibility issues and | |||
implementation observations. The relation of this work to T/TCP has | implementation observations. The relation of this work to T/TCP has | |||
been moved to an appendix discussion on history, partly to reflect | been moved to Appendix A on history, partly to reflect the | |||
the deprecation of that protocol. | deprecation of that protocol. | |||
Appendix C has been added to discuss the potential to use temporal | ||||
sharing over long timescales to adapt TCP's initial window | ||||
automatically, largely imported from [To12]. | ||||
Finally, this document updates and significantly expands the | Finally, this document updates and significantly expands the | |||
referenced literature. | referenced literature. | |||
12. Security Considerations | 12. Security Considerations | |||
These presented implementation methods do not have additional | These presented implementation methods do not have additional | |||
ramifications for explicit attacks. They may be susceptible to | ramifications for explicit attacks. They may be susceptible to | |||
denial-of-service attacks if not otherwise secured. For example, an | denial-of-service attacks if not otherwise secured. For example, an | |||
application can open a connection and set its window size to zero, | application can open a connection and set its window size to zero, | |||
skipping to change at page 16, line 45 ¶ | skipping to change at page 16, line 49 ¶ | |||
13. IANA Considerations | 13. IANA Considerations | |||
There are no IANA implications or requests in this document. | There are no IANA implications or requests in this document. | |||
This section should be removed upon final publication as an RFC. | This section should be removed upon final publication as an RFC. | |||
14. References | 14. References | |||
14.1. Normative References | 14.1. Normative References | |||
This document has no normative references. | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
Requirement Levels", BCP 14, RFC 2119, March 1997. | ||||
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC | ||||
2119 Key Words", RFC 8174, May 2017. | ||||
14.2. Informative References | 14.2. Informative References | |||
[Br02] Brownlee, N. and K. Claffy, "Understanding Internet | [Al10] Allman, M., "Initial Congestion Window Specification", | |||
Traffic Streams: Dragonflies and Tortoises", IEEE | (work in progress), draft-allman-tcpm-bump-initcwnd-00, | |||
Communications Magazine p110-117, 2002. | Nov. 2010. | |||
[Ba12] Barik, R., Welzl, M., Ferlin, S., Alay, O., "LISA: A | ||||
Linked Slow-Start Algorithm for MPTCP", IEEE ICC, Kuala | ||||
Lumpur, Malaysia, May 23-27 2016. | ||||
[Be94] Berners-Lee, T., et al., "The World-Wide Web," | [Be94] Berners-Lee, T., et al., "The World-Wide Web," | |||
Communications of the ACM, V37, Aug. 1994, pp. 76-82. | Communications of the ACM, V37, Aug. 1994, pp. 76-82. | |||
[Br94] Braden, B., "T/TCP -- Transaction TCP: Source Changes for | [Br94] Braden, B., "T/TCP -- Transaction TCP: Source Changes for | |||
Sun OS 4.1.3,", Release 1.0, USC/ISI, September 14, 1994. | Sun OS 4.1.3,", Release 1.0, USC/ISI, September 14, 1994. | |||
[Br02] Brownlee, N. and K. Claffy, "Understanding Internet | ||||
Traffic Streams: Dragonflies and Tortoises", IEEE | ||||
Communications Magazine p110-117, 2002. | ||||
[Co91] Comer, D., Stevens, D., Internetworking with TCP/IP, V2, | [Co91] Comer, D., Stevens, D., Internetworking with TCP/IP, V2, | |||
Prentice-Hall, NJ, 1991. | Prentice-Hall, NJ, 1991. | |||
[FreeBSD] FreeBSD source code, Release 2.10, http://www.freebsd.org/ | ||||
[Du16] Dukkipati, N., Cheng, Y., and Vahdat, A., "Research | [Du16] Dukkipati, N., Cheng, Y., and Vahdat, A., "Research | |||
Impacting the Practice of Congestion Control." ACM SIGCOMM | Impacting the Practice of Congestion Control." ACM SIGCOMM | |||
CCR (editorial), on-line post, July 2016. | CCR (editorial), on-line post, July 2016. | |||
[FreeBSD] FreeBSD source code, Release 2.10, http://www.freebsd.org/ | ||||
[Hu01] Hughes, A., Touch, J., Heidemann, J., "Issues in Slow- | [Hu01] Hughes, A., Touch, J., Heidemann, J., "Issues in Slow- | |||
Start Restart After Idle", draft-hughes-restart-00 | Start Restart After Idle", draft-hughes-restart-00 | |||
(expired), Dec. 2001. | (expired), Dec. 2001. | |||
[Hu12] Hurtig, P., Brunstrom, A., "Enhanced metric caching for | [Hu12] Hurtig, P., Brunstrom, A., "Enhanced metric caching for | |||
short TCP flows," 2012 IEEE International Conference on | short TCP flows," 2012 IEEE International Conference on | |||
Communications (ICC), Ottawa, ON, 2012, pp. 1209-1213. | Communications (ICC), Ottawa, ON, 2012, pp. 1209-1213. | |||
[Ba12] Barik, R., Welzl, M., Ferlin, S., Alay, O., "LISA: A | [Ja88] Jacobson, V., M. Karels, "Congestion Avoidance and | |||
Linked Slow-Start Algorithm for MPTCP", IEEE ICC, Kuala | Control", Proc. Sigcomm 1988. | |||
Lumpur, Malaysia, May 23-27 2016. | ||||
[RFC793] Postel, Jon, "Transmission Control Protocol," Network | [RFC793] Postel, Jon, "Transmission Control Protocol," Network | |||
Working Group RFC-793/STD-7, ISI, Sept. 1981. | Working Group RFC-793/STD-7, ISI, Sept. 1981. | |||
[RFC1122] Braden, R. (ed), "Requirements for Internet Hosts -- | [RFC1122] Braden, R. (ed), "Requirements for Internet Hosts -- | |||
Communication Layers", RFC-1122, Oct. 1989. | Communication Layers", RFC-1122, Oct. 1989. | |||
[RFC1191] Mogul, J., Deering, S., "Path MTU Discovery," RFC 1191, | [RFC1191] Mogul, J., Deering, S., "Path MTU Discovery," RFC 1191, | |||
Nov. 1990. | Nov. 1990. | |||
[RFC1644] Braden, R., "T/TCP -- TCP Extensions for Transactions | [RFC1644] Braden, R., "T/TCP -- TCP Extensions for Transactions | |||
Functional Specification," RFC-1644, July 1994. | Functional Specification," RFC-1644, July 1994. | |||
[RFC1379] Braden, R., "Transaction TCP -- Concepts," RFC-1379, | [RFC1379] Braden, R., "Transaction TCP -- Concepts," RFC-1379, | |||
September 1992. | September 1992. | |||
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast | |||
Requirement Levels", BCP 14, RFC 2119, March 1997. | Retransmit, and Fast Recovery Algorithms", RFC2001 | |||
(Standards Track), Jan. 1997. | ||||
[RFC2140] Touch, J., "TCP Control Block Interdependence", RFC 2140, | [RFC2140] Touch, J., "TCP Control Block Interdependence", RFC 2140, | |||
April 1997. | April 1997. | |||
[RFC2414] Allman, M., Floyd, S., Partridge, C., "Increasing TCP's | ||||
Initial Window", RFC 2414 (Experimental), Sept. 1998. | ||||
[RFC2581] Allman, M., Paxson, V., Stevens, W., "TCP Congestion | ||||
Control," RFC2581 (Standards Track), Apr. 1999. | ||||
[RFC2663] Srisuresh, P., Holdrege, M., "IP Network Address | [RFC2663] Srisuresh, P., Holdrege, M., "IP Network Address | |||
Translator (NAT) Terminology and Considerations", RFC- | Translator (NAT) Terminology and Considerations", RFC- | |||
2663, August 1999. | 2663, August 1999. | |||
[RFC2861] Handley, M., Padhye, J., Floyd, S., "TCP Congestion Window | ||||
Validation", RFC2861 (Experimental), June 2000. | ||||
[RFC3390] Allman, M., Floyd, S., Partridge, C., "Increasing TCP's | [RFC3390] Allman, M., Floyd, S., Partridge, C., "Increasing TCP's | |||
Initial Window," RFC 3390, Oct. 2002. | Initial Window," RFC 3390, Oct. 2002. | |||
[RFC7231] Fielding, R., J. Reschke, Eds., "HTTP/1.1 Semantics and | ||||
Content," RFC-7231, June 2014. | ||||
[RFC3124] Balakrishnan, H., Seshan, S., "The Congestion Manager," | [RFC3124] Balakrishnan, H., Seshan, S., "The Congestion Manager," | |||
RFC 3124, June 2001. | RFC 3124, June 2001. | |||
[RFC4340] Kohler, E., Handley, M., Floyd, S., "Datagram Congestion | [RFC4340] Kohler, E., Handley, M., Floyd, S., "Datagram Congestion | |||
Control Protocol (DCCP)," RFC 4340, Mar. 2006. | Control Protocol (DCCP)," RFC 4340, Mar. 2006. | |||
[RFC4821] Mathis, M., Heffner, J., "Packetization Layer Path MTU | [RFC4821] Mathis, M., Heffner, J., "Packetization Layer Path MTU | |||
Discovery," RFC 4821, Mar. 2007. | Discovery," RFC 4821, Mar. 2007. | |||
[RFC4960] Stewart, R., (Ed.), "Stream Control Transmission | [RFC4960] Stewart, R., (Ed.), "Stream Control Transmission | |||
Protocol," RFC4960, Sept. 2007. | Protocol," RFC4960, Sept. 2007. | |||
[RFC5861] Allman, M., Paxson, V., Blanton, E., "TCP Congestion | [RFC5681] Allman, M., Paxson, V., Blanton, E., "TCP Congestion | |||
Control," RFC 5861, Sept. 2009. | Control," RFC 5681 (Standards Track), Sep. 2009. | |||
[RFC5925] Touch, J., Mankin, A., Bonica, R., "The TCP Authentication | [RFC5925] Touch, J., Mankin, A., Bonica, R., "The TCP Authentication | |||
Option," RFC 5925, June 2010. | Option," RFC 5925, June 2010. | |||
[RFC6824] Ford, A., Raiciu, C., Handley, M., Bonaventure, O., "TCP | [RFC6824] Ford, A., Raiciu, C., Handley, M., Bonaventure, O., "TCP | |||
Extensions for Multipath Operation with Multiple | Extensions for Multipath Operation with Multiple | |||
Addresses," RFC 6824, Jan. 2013. | Addresses," RFC 6824, Jan. 2013. | |||
[RFC6928] Chu, J., Dukkipati, N., Cheng, Y., Mathis, M., "Increasing | [RFC6928] Chu, J., Dukkipati, N., Cheng, Y., Mathis, M., "Increasing | |||
TCP's Initial Window," RFC 6928, Apr. 2013. | TCP's Initial Window," RFC 6928, Apr. 2013. | |||
[RFC7231] Fielding, R., J. Reschke, Eds., "HTTP/1.1 Semantics and | ||||
Content," RFC-7231, June 2014. | ||||
[RFC7323] Borman, D., B. Braden, V. Jacobson, R. Scheffenegger | ||||
(Ed.), "TCP Extensions for High Performance," RFC 7323, | ||||
Sept. 2014. | ||||
[RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., Jain, A., "TCP Fast | [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., Jain, A., "TCP Fast | |||
Open", RFC 7413, Dec. 2014. | Open", RFC 7413, Dec. 2014. | |||
[RFC7424] Krishnan, R., Yong, L., Ghanwani, A., So, N., Khasnabish, | [RFC7424] Krishnan, R., Yong, L., Ghanwani, A., So, N., Khasnabish, | |||
B., "Mechanisms for Optimizing Link Aggregation Group | B., "Mechanisms for Optimizing Link Aggregation Group | |||
(LAG) and Equal-Cost Multipath (ECMP) Component Link | (LAG) and Equal-Cost Multipath (ECMP) Component Link | |||
Utilization in Networks", RFC 7424, Jan. 2015 | Utilization in Networks", RFC 7424, Jan. 2015 | |||
[RFC7540] Belshe, M., Peon, R., Thomson, M., "Hypertext Transfer | [RFC7540] Belshe, M., Peon, R., Thomson, M., "Hypertext Transfer | |||
Protocol Version 2 (HTTP/2)", RFC 7540, May 2015. | Protocol Version 2 (HTTP/2)", RFC 7540, May 2015. | |||
[RFC7661] Fairhurst, G., Sathiaseelan, A., Secchi, R., "Updating TCP | [RFC7661] Fairhurst, G., Sathiaseelan, A., Secchi, R., "Updating TCP | |||
to Support Rate-Limited Traffic", RFC 7661, Oct. 2015. | to Support Rate-Limited Traffic", RFC 7661, Oct. 2015. | |||
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC | ||||
2119 Key Words", RFC 8174, May 2017. | ||||
[RFC8201] McCann, J., Deering. S., Mogul, J., Hinden, R. (Ed.), | [RFC8201] McCann, J., Deering. S., Mogul, J., Hinden, R. (Ed.), | |||
"Path MTU Discovery for IP version 6," RFC 8201, Jul. | "Path MTU Discovery for IP version 6," RFC 8201, Jul. | |||
2017. | 2017. | |||
[To13] Touch, J., "Automating the Initial Window in TCP," draft- | [To12] Touch, J., "Automating the Initial Window in TCP," draft- | |||
touch-tcpm-automatic-iw-03 (expired), Jan. 2013. | touch-tcpm-automatic-iw-03 (expired), July 2012. | |||
15. Acknowledgments | 15. Acknowledgments | |||
The authors would like to thank Praveen Balasubramanian for | The authors would like to thank Praveen Balasubramanian for | |||
information regarding TCB sharing in Windows, and Yuchung Cheng, | information regarding TCB sharing in Windows, and Yuchung Cheng, | |||
Lars Eggert, Ilpo Jarvinen and Michael Scharf for comments on | Lars Eggert, Ilpo Jarvinen and Michael Scharf for comments on | |||
earlier versions of the draft. Earlier revisions of this work | earlier versions of the draft. Earlier revisions of this work | |||
received funding from a collaborative research project between the | received funding from a collaborative research project between the | |||
University of Oslo and Huawei Technologies Co., Ltd. and were partly | University of Oslo and Huawei Technologies Co., Ltd. and were partly | |||
supported by USC/ISI's Postel Center. | supported by USC/ISI's Postel Center. | |||
This document was prepared using 2-Word-v2.0.template.dot. | This document was prepared using 2-Word-v2.0.template.dot. | |||
16. Change log | 16. Change log | |||
This section should be removed upon final publication as an RFC. | This section should be removed upon final publication as an RFC. | |||
ietf-01: | ||||
- Added Appendix C to address long-timescale temporal adaptation. | ||||
ietf-00: | ietf-00: | |||
- Re-issued as draft-ietf-tcpm-2140bis due to WG adoption. | - Re-issued as draft-ietf-tcpm-2140bis due to WG adoption. | |||
- Cleaned orphan references to T/TCP, removed incomplete refs | - Cleaned orphan references to T/TCP, removed incomplete refs | |||
- Moved references to informative section and updated Sec 2 | - Moved references to informative section and updated Sec 2 | |||
- Updated to clarify no impact to interoperability | - Updated to clarify no impact to interoperability | |||
- Updated appendix B to avoid 2119 language | - Updated appendix B to avoid 2119 language | |||
06: | 06: | |||
skipping to change at page 20, line 37 ¶ | skipping to change at page 21, line 14 ¶ | |||
- Marked entries that are considered safe to share with an | - Marked entries that are considered safe to share with an | |||
asterisk (suggestion was to split the table) | asterisk (suggestion was to split the table) | |||
- Discussed correct host identification: NATs may make IP | - Discussed correct host identification: NATs may make IP | |||
addresses the wrong input, could e.g. use HTTP cookie. | addresses the wrong input, could e.g. use HTTP cookie. | |||
- Included MMS_S and MMS_R from RFC1122; fixed the use of MSS and | - Included MMS_S and MMS_R from RFC1122; fixed the use of MSS and | |||
MTU | MTU | |||
- Added information about option sharing, listed options in the | - Added information about option sharing, listed options in | |||
appendix | Appendix B | |||
Authors' Addresses | Authors' Addresses | |||
Joe Touch | Joe Touch | |||
Manhattan Beach, CA 90266 | Manhattan Beach, CA 90266 | |||
USA | USA | |||
Phone: +1 (310) 560-0334 | Phone: +1 (310) 560-0334 | |||
Email: touch@strayalpha.com | Email: touch@strayalpha.com | |||
Michael Welzl | Michael Welzl | |||
University of Oslo | University of Oslo | |||
PO Box 1080 Blindern | PO Box 1080 Blindern | |||
Oslo N-0316 | Oslo N-0316 | |||
Norway | Norway | |||
Phone: +47 22 85 24 20 | Phone: +47 22 85 24 20 | |||
Email: michawe@ifi.uio.no | Email: michawe@ifi.uio.no | |||
Safiqul Islam | Safiqul Islam | |||
skipping to change at page 21, line 22 ¶ | skipping to change at page 22, line 5 ¶ | |||
Safiqul Islam | Safiqul Islam | |||
University of Oslo | University of Oslo | |||
PO Box 1080 Blindern | PO Box 1080 Blindern | |||
Oslo N-0316 | Oslo N-0316 | |||
Norway | Norway | |||
Phone: +47 22 84 08 37 | Phone: +47 22 84 08 37 | |||
Email: safiquli@ifi.uio.no | Email: safiquli@ifi.uio.no | |||
17. Appendix A: TCB sharing history | Appendix A: TCB sharing history | |||
T/TCP proposed using caches to maintain TCB information across | T/TCP proposed using caches to maintain TCB information across | |||
instances (temporal sharing), e.g., smoothed RTT, RTT variance, | instances (temporal sharing), e.g., smoothed RTT, RTT variance, | |||
congestion avoidance threshold, and MSS [RFC1644]. These values were | congestion avoidance threshold, and MSS [RFC1644]. These values were | |||
in addition to connection counts used by T/TCP to accelerate data | in addition to connection counts used by T/TCP to accelerate data | |||
delivery prior to the full three-way handshake during an OPEN. The | delivery prior to the full three-way handshake during an OPEN. The | |||
goal was to aggregate TCB components where they reflect one | goal was to aggregate TCB components where they reflect one | |||
association - that of the host-pair, rather than artificially | association - that of the host-pair, rather than artificially | |||
separating those components by connection. | separating those components by connection. | |||
skipping to change at page 22, line 7 ¶ | skipping to change at page 22, line 34 ¶ | |||
sessions. | sessions. | |||
Temporal sharing of cached TCB data was originally implemented in | Temporal sharing of cached TCB data was originally implemented in | |||
the SunOS 4.1.3 T/TCP extensions [Br94] and the FreeBSD port of same | the SunOS 4.1.3 T/TCP extensions [Br94] and the FreeBSD port of same | |||
[FreeBSD]. As mentioned before, only the MSS and RTT parameters were | [FreeBSD]. As mentioned before, only the MSS and RTT parameters were | |||
cached, as originally specified in [RFC1379]. Later discussion of | cached, as originally specified in [RFC1379]. Later discussion of | |||
T/TCP suggested including congestion control parameters in this | T/TCP suggested including congestion control parameters in this | |||
cache; for example, [RFC1644] (Section 3.1) hints at initializing | cache; for example, [RFC1644] (Section 3.1) hints at initializing | |||
the congestion window to the old window size. | the congestion window to the old window size. | |||
18. Appendix B: Options | Appendix B: TCP Option Sharing and Caching | |||
In addition to the options that can be cached and shared, this memo | In addition to the options that can be cached and shared, this memo | |||
also lists known options for which state is unsafe to be kept. This | also lists known options for which state is unsafe to be kept. This | |||
list is meant to avoid work duplication and should be removed upon | list is meant to avoid work duplication and should be removed upon | |||
publication. | publication. | |||
Obsolete (unsafe to keep state): | Obsolete (unsafe to keep state): | |||
ECHO | ECHO | |||
skipping to change at line 1002 ¶ | skipping to change at page 25, line 4 ¶ | |||
Safe but optional to keep state: | Safe but optional to keep state: | |||
MSS | MSS | |||
TFO failure (so we don't try again, since it's optional) | TFO failure (so we don't try again, since it's optional) | |||
Safe and necessary to keep state: | Safe and necessary to keep state: | |||
TFO cookie (if TFO succeeded in the past) | TFO cookie (if TFO succeeded in the past) | |||
Appendix C: Automating the Initial Window in TCP over Long Timescales | ||||
Note: this section is taken verbatim from [To12], updated to refer | ||||
to itself as an appendix. | ||||
C.1. Introduction | ||||
TCP's congestion control algorithm uses an initial window value | ||||
(IW), both as a starting point for new connections and after one RTO | ||||
or more [RFC2581][RFC2861]. This value has evolved over time, | ||||
originally one maximum segment size (MSS), and increased to the | ||||
lesser of four MSS or 4,380 bytes [RFC3390][RFC5681]. For typical | ||||
Internet connections with a maximum transmission unit (MTU) of | ||||
1500 bytes, this permits three segments of 1,460 bytes each. | ||||
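As a worked instance of the arithmetic above, the following Python sketch evaluates the RFC 3390 upper bound, min(4*MSS, max(2*MSS, 4380 bytes)). The function name is hypothetical, and the 40-byte header allowance assumes IPv4 and TCP headers without options.

```python
def rfc3390_initial_window(mss: int) -> int:
    """Upper bound on the initial window, in bytes, per RFC 3390."""
    return min(4 * mss, max(2 * mss, 4380))


mtu = 1500
# Assumption: 20-byte IPv4 header + 20-byte TCP header, no options.
mss = mtu - 40
iw = rfc3390_initial_window(mss)
segments = iw // mss  # whole segments that fit in the IW
```

With a 1500-byte MTU the bound is the 4,380-byte term, which is why only three full-sized segments fit rather than four.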
The IW value was originally implied in the original TCP congestion | ||||
control description, and documented as a standard in 1997 | ||||
[RFC2001][Ja88]. The value was last updated in 1998 experimentally, | ||||
and moved to the standards track in 2002 [RFC2414][RFC3390]. There | ||||
have been recent proposals to update the IW based on further | ||||
increases in host and router capabilities and network capacity, some | ||||
focusing on specific values (e.g., IW=10), and others prescribing a | ||||
schedule for increases over time (e.g., IW=6 for 2011, increasing by | ||||
1-2 MSS per year). | ||||
This appendix discusses how TCP can objectively measure when an IW | ||||
is too large, and that such feedback should be used over long | ||||
timescales to adjust the IW automatically. The result should be | ||||
safer to deploy and might avoid the need to repeatedly revisit IW | ||||
size over time. | ||||
Note that this mechanism attempts to make the IW more adaptive over | ||||
time. It can increase the IW beyond that which is currently | ||||
recommended for widescale deployment, and so its use should be | ||||
carefully monitored. | ||||
C.2. Design Considerations | ||||
TCP's IW value has existed statically for over two decades, so any | ||||
solution to adjusting the IW dynamically should have similarly | ||||
stable, non-invasive effects on the performance and complexity of | ||||
TCP. In order to be fair, the IW should be similar for most machines | ||||
on the public Internet. Finally, a desirable goal is to develop a | ||||
self-correcting algorithm, so that IW values that cause network | ||||
problems can be avoided. To that end, we propose the following list | ||||
of design goals: | ||||
o Impart little to no impact to TCP in the absence of loss, i.e., | ||||
it should not increase the complexity of default packet | ||||
processing in the normal case. | ||||
o Adapt to network feedback over long timescales, avoiding values | ||||
that persistently cause network problems. | ||||
o Decrease the IW in the presence of sustained loss of IW segments, | ||||
as determined over a number of different connections. | ||||
o Increase the IW in the absence of sustained loss of IW segments, | ||||
as determined over a number of different connections. | ||||
o Operate conservatively, i.e., tend towards leaving the IW the | ||||
same in the absence of sufficient information, and give greater | ||||
consideration to IW segment loss than IW segment success. | ||||
We expect that, without other context, a good IW algorithm will | ||||
converge to a single value, but this is not required. An endpoint | ||||
with additional context or information, or deployed in a constrained | ||||
environment, can always use a different value. In particular, | ||||
information from previous connections, or sets of connections with a | ||||
similar path, can already be used as context for such decisions (as | ||||
noted in the core of this document). | ||||
However, if a given IW value persistently causes packet loss during | ||||
the initial burst of packets, it is clearly inappropriate and could | ||||
be inducing unnecessary loss in other competing connections. This | ||||
might happen for sites behind very slow boxes with small buffers, | ||||
which may or may not be the first hop. | ||||
C.3. Proposed IW Algorithm | ||||
Below is a simple description of the proposed IW algorithm. It | ||||
relies on the following parameters: | ||||
o MinIW = 3 MSS or 4,380 bytes (as per [RFC3390]) | ||||
o MaxIW = 10 MSS | ||||
o MulDecr = 0.5 | ||||
o AddIncr = 2 MSS | ||||
o Threshold = 0.05 | ||||
We assume that the minimum IW (MinIW) should be as currently | ||||
specified [RFC3390]. The maximum IW can be set to a fixed value | ||||
[RFC6928], or set based on a schedule if trusted time references are | ||||
available [Al10]; here we prefer a fixed value. We also propose to | ||||
use an AIMD algorithm, with increases and decreases as noted. | ||||
Although these parameters are somewhat arbitrary, their initial | ||||
values are not important except that the algorithm is AIMD and the | ||||
MaxIW should not exceed that recommended for other systems on the | ||||
Internet. Current proposals, including default current operation, | ||||
are degenerate cases of the algorithm below for given parameters - | ||||
notably MulDecr = 1.0 and AddIncr = 0 MSS, thus disabling the | ||||
automatic part of the algorithm. | ||||
The proposed algorithm is as follows: | ||||
1. On boot: | ||||
IW = MaxIW; # assume this is in bytes, and an even number of MSS | ||||
2. Upon starting a new connection | ||||
CWND = IW; | ||||
conncount++; | ||||
IWnotchecked = 1; # true | ||||
3. During a connection's SYN-ACK processing, if SYN-ACK includes | ||||
ECN, treat as if the IW is too large | ||||
if (IWnotchecked && (synackecn == 1)) { | ||||
losscount++; | ||||
IWnotchecked = 0; # never check again | ||||
} | ||||
4. During a connection, if retransmission occurs, check the seqno of | ||||
the outgoing packet (in bytes) to see if the resent segment fixes | ||||
an IW loss: | ||||
if (Retransmitting && IWnotchecked && ((seqno - ISN) < IW)) { | ||||
losscount++; | ||||
IWnotchecked = 0; # never do this entire "if" again | ||||
} else { | ||||
IWnotchecked = 0; # you're beyond the IW so stop checking | ||||
} | ||||
5. Once every 1000 connections, as a separate process (i.e., not as | ||||
part of processing a given connection): | ||||
if (conncount > 1000) { | ||||
if (losscount/conncount > Threshold) { | ||||
# the number of connections with errors is too high | ||||
IW = IW * MulDecr; | ||||
} else { | ||||
IW = IW + AddIncr; | ||||
} | ||||
conncount = 0; # start a new evaluation interval | ||||
losscount = 0; | ||||
} | ||||
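The five steps above can be sketched end to end in Python. This is an illustrative sketch, not a reference implementation: the class and function names are hypothetical, the MSS is assumed fixed at 1,460 bytes, and the clamping to MinIW/MaxIW and the counter reset after each evaluation are assumptions that the pseudocode does not spell out. The sketch does not round the IW to an even number of segments.

```python
from dataclasses import dataclass

MSS = 1460            # assumption: fixed MSS, in bytes
MIN_IW = 3 * MSS      # MinIW
MAX_IW = 10 * MSS     # MaxIW
MUL_DECR = 0.5        # MulDecr
ADD_INCR = 2 * MSS    # AddIncr
THRESHOLD = 0.05      # Threshold
INTERVAL = 1000       # connections per evaluation


@dataclass
class IwState:
    """Host-wide state shared by all connections (step 1: on boot)."""
    iw: int = MAX_IW
    conncount: int = 0
    losscount: int = 0


@dataclass
class Conn:
    """Per-connection state needed by the IW loss check."""
    isn: int
    iw: int                      # IW in effect when the connection started
    iw_not_checked: bool = True


def open_connection(state: IwState, isn: int) -> Conn:
    """Step 2: a new connection starts with CWND = IW."""
    state.conncount += 1
    return Conn(isn=isn, iw=state.iw)


def on_synack(state: IwState, conn: Conn, synack_ecn: bool) -> None:
    """Step 3: an ECN indication on the SYN-ACK counts as an IW loss."""
    if conn.iw_not_checked and synack_ecn:
        state.losscount += 1
        conn.iw_not_checked = False


def on_retransmit(state: IwState, conn: Conn, seqno: int) -> None:
    """Step 4: count a loss if the resent segment lies inside the IW."""
    if conn.iw_not_checked and (seqno - conn.isn) % 2**32 < conn.iw:
        state.losscount += 1
    conn.iw_not_checked = False  # either counted or beyond the IW


def evaluate(state: IwState) -> None:
    """Step 5: AIMD adjustment, run outside per-connection processing."""
    if state.conncount <= INTERVAL:
        return
    if state.losscount / state.conncount > THRESHOLD:
        state.iw = max(MIN_IW, int(state.iw * MUL_DECR))   # assumed clamp
    else:
        state.iw = min(MAX_IW, state.iw + ADD_INCR)        # assumed clamp
    state.conncount = 0   # assumed: start a new evaluation interval
    state.losscount = 0
```

Keeping the counters in one shared object and the `iw_not_checked` flag per connection mirrors the split between host-wide and per-connection state in the pseudocode.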
We recognize that this algorithm can yield a false positive when the | ||||
sequence number wraps around. This can be avoided using either PAWS | ||||
[RFC7323] context or 64-bit internal sequence numbers (as in TCP-AO | ||||
[RFC5925]). Alternately, false positives can be allowed since they | ||||
are expected to be infrequent and thus will not affect the overall | ||||
statistics of the algorithm. | ||||
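The wraparound case can be illustrated with a short Python sketch. Both function names and the 64-bit byte counter are hypothetical; the counter stands in for whatever extended sequence context an implementation keeps.

```python
SEQ_MOD = 2**32  # TCP sequence numbers are 32 bits wide


def in_iw_32bit(isn: int, seqno: int, iw: int) -> bool:
    """IW check using only the 32-bit on-the-wire sequence number."""
    return (seqno - isn) % SEQ_MOD < iw


def in_iw_64bit(bytes_before_segment: int, iw: int) -> bool:
    """IW check using a 64-bit count of bytes sent before the segment."""
    return bytes_before_segment < iw


isn = 4_000_000_000
iw = 14_600
# A segment sent after the sequence space has wrapped once (> 4 GiB of data).
bytes_before_segment = SEQ_MOD + 1_460
seqno = (isn + bytes_before_segment) % SEQ_MOD

false_positive = in_iw_32bit(isn, seqno, iw)            # looks like IW data
correct_answer = in_iw_64bit(bytes_before_segment, iw)  # clearly beyond the IW
```

The 32-bit check cannot tell the wrapped segment from one in the first window, which is the false positive described above; the 64-bit count removes the ambiguity.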
The following additional constraints are imposed: | ||||
>> The automatic IW algorithm MUST initialize to MaxIW, in the | ||||
absence of other context information. | ||||
If there are too few connections to make a decision or if there is | ||||
otherwise insufficient information to increase the IW, then the | ||||
MaxIW defaults to the current recommended value. | ||||
>> An implementation may allow the MaxIW to grow beyond the | ||||
currently recommended Internet default, but not more than 2 segments | ||||
per calendar year. | ||||
If an endpoint has a persistent history of successfully transmitting | ||||
IW segments without loss, then it is allowed to probe the Internet | ||||
to determine if larger IW values have similar success. This probing | ||||
is limited and requires a trusted time source, otherwise the MaxIW | ||||
remains constant. | ||||
>> An implementation MUST adjust the IW based on loss statistics at | ||||
least once every 1000 connections. | ||||
An endpoint needs to be sufficiently reactive to IW loss. | ||||
>> An implementation MUST decrease the IW by at least one MSS when | ||||
indicated during an evaluation interval. | ||||
An endpoint that detects loss needs to decrease its IW by at least | ||||
one MSS, otherwise it is not participating in an automatic reactive | ||||
algorithm. | ||||
>> An implementation MUST increase by no more than 2 MSS per | ||||
evaluation interval. | ||||
An endpoint that does not experience IW loss needs to probe the | ||||
network incrementally. | ||||
>> An implementation SHOULD use an IW that is an integer multiple of | ||||
2 MSS. | ||||
The IW should remain a multiple of 2 MSS segments, to enable | ||||
efficient ACK compression without incurring unnecessary timeouts. | ||||
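One way to keep the IW at an integer multiple of 2 MSS is to round after each adjustment, as in the Python sketch below. The function name is hypothetical, and both the choice to round down (the conservative direction) and the floor of 2 MSS are assumptions, not requirements stated here.

```python
def round_iw_to_even_mss(iw_bytes: int, mss: int) -> int:
    """Round an IW in bytes down to a multiple of 2*MSS, never below 2*MSS."""
    pair = 2 * mss
    return max(pair, (iw_bytes // pair) * pair)


mss = 1460
halved = round_iw_to_even_mss(7300, mss)   # 5 MSS rounds down to 4 MSS
```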
>> An implementation MUST decrease the IW if more than 95% of | ||||
connections have IW losses. | ||||
Again, this is to ensure an implementation is sufficiently reactive. | ||||
>> An implementation MAY group IW values and statistics within | ||||
subsets of connections. Such grouping MAY use any information about | ||||
connections to form groups except loss statistics. | ||||
There are some TCP connections which might not be counted at all, | ||||
such as those to/from loopback addresses, or those within the same | ||||
subnet as that of a local interface (for which congestion control is | ||||
sometimes disabled anyway). This may also include connections that | ||||
terminate before the IW is full, i.e., as a separate check at the | ||||
time of the connection closing. | ||||
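A minimal Python sketch of such a filter follows, using the standard ipaddress module. The function name and the example prefix are hypothetical, and the list of local networks is assumed to be supplied by the caller.

```python
import ipaddress


def counts_toward_iw_stats(remote: str, local_networks) -> bool:
    """Return False for connections excluded from the IW statistics."""
    addr = ipaddress.ip_address(remote)
    if addr.is_loopback:
        return False  # loopback traffic never crosses the network
    # Skip peers on the same subnet as a local interface.
    return not any(addr in net for net in local_networks)


# Assumption: one local interface on a documentation prefix.
local = [ipaddress.ip_network("192.0.2.0/24")]
```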
The period over which the IW is updated is intended to be a long | ||||
timescale, e.g., a month or so, or 1,000 connections, whichever is | ||||
longer. An implementation might check the IW once a month, and | ||||
simply not update the IW or clear the connection counts in months | ||||
where the number of connections is too small. | ||||
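The "whichever is longer" rule means both conditions must hold before the IW is re-evaluated. The Python sketch below shows one way to express that; the function name is hypothetical and the 30-day period is an assumed reading of "a month or so".

```python
from datetime import datetime, timedelta

MIN_PERIOD = timedelta(days=30)   # assumption: "a month or so"
MIN_CONNECTIONS = 1000


def should_evaluate(now: datetime, last_evaluation: datetime,
                    conncount: int) -> bool:
    """True only when both the time and connection-count minimums are met."""
    return (now - last_evaluation >= MIN_PERIOD
            and conncount >= MIN_CONNECTIONS)


last = datetime(2019, 11, 1)
quiet_month = should_evaluate(last + timedelta(days=45), last, 200)
```

In a quiet month the check returns False, so the IW and the counters are simply left alone until enough connections have accumulated.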
C.4. Discussion | ||||
There are numerous parameters to the above algorithm that are | ||||
compliant with the given requirements; this is intended to allow | ||||
variation in configuration and implementation while ensuring that | ||||
all such algorithms are reactive and safe. | ||||
This algorithm continues to assume segments because that is the | ||||
basis of most TCP implementations. It might be useful to consider | ||||
revising the specifications to allow byte-based congestion given | ||||
sufficient experience. | ||||
The algorithm checks for IW losses only during the first IW after a | ||||
connection start; it does not check for IW losses at other points | ||||
where the IW is used, e.g., during slow-start restarts. | ||||
>> An implementation MAY detect IW losses during slow-start restarts | ||||
in addition to losses during the first IW of a connection. In this | ||||
case, the implementation MUST count each restart as a "connection" | ||||
for the purposes of connection counts and periodic rechecking of the | ||||
IW value. | ||||
False positives can occur during some kinds of segment reordering, | ||||
e.g., that might trigger spurious retransmissions even without a | ||||
true segment loss. These are not expected to be sufficiently common | ||||
to dominate the algorithm and its conclusions. | ||||
This mechanism does require additional per-connection state which is | ||||
currently common in some implementations, and is useful for other | ||||
reasons (e.g., the ISN is used in TCP-AO [RFC5925]). The mechanism | ||||
also benefits from persistent state kept across reboots, as would be | ||||
other state sharing mechanisms (e.g., TCP Control Block Sharing | ||||
[RFC2140]). The mechanism is inspired by RFC 2140's use of | ||||
information across connections. | ||||
The receive window (RWIN) is not involved in this calculation. The | ||||
size of RWIN is determined by receiver resources, and provides space | ||||
to accommodate segment reordering. It is not involved with | ||||
congestion control, which is the focus of this document and its | ||||
management of the IW. | ||||
C.5. Observations | ||||
The IW may not converge to a single, global value. It also may not | ||||
converge at all, but rather may oscillate by a few MSS as it | ||||
repeatedly probes the Internet for larger IWs and fails. Both | ||||
properties are consistent with TCP behavior during each individual | ||||
connection. | ||||
This mechanism assumes that losses during the IW are due to IW size. | ||||
Persistent errors that drop packets for other reasons (e.g., OS | ||||
bugs) can cause false positives. Again, this is consistent with | ||||
TCP's basic assumption that loss is caused by congestion and | ||||
requires backoff. This algorithm treats the IW of new connections as | ||||
a long-timescale backoff system. | ||||
End of changes. 30 change blocks. | ||||
39 lines changed or deleted | 78 lines changed or added | |||