draft-ietf-tcpm-2140bis-10.txt | draft-ietf-tcpm-2140bis-11.txt | |||
---|---|---|---|---|
TCPM WG J. Touch | TCPM WG J. Touch | |||
Internet Draft Independent | Internet Draft Independent | |||
Intended status: Informational M. Welzl | Intended status: Informational M. Welzl | |||
Obsoletes: 2140 S. Islam | Obsoletes: 2140 S. Islam | |||
Expires: September 2021 University of Oslo | Expires: October 2021 University of Oslo | |||
March 16, 2021 | April 12, 2021 | |||
TCP Control Block Interdependence | TCP Control Block Interdependence | |||
draft-ietf-tcpm-2140bis-10.txt | draft-ietf-tcpm-2140bis-11.txt | |||
Status of this Memo | Status of this Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
This document may contain material from IETF Documents or IETF | This document may contain material from IETF Documents or IETF | |||
Contributions published or made publicly available before November | Contributions published or made publicly available before November | |||
10, 2008. The person(s) controlling the copyright in some of this | 10, 2008. The person(s) controlling the copyright in some of this | |||
material may not have granted the IETF Trust the right to allow | material may not have granted the IETF Trust the right to allow | |||
skipping to change at page 1, line 45 | skipping to change at page 1, line 45 | |||
months and may be updated, replaced, or obsoleted by other documents | months and may be updated, replaced, or obsoleted by other documents | |||
at any time. It is inappropriate to use Internet-Drafts as | at any time. It is inappropriate to use Internet-Drafts as | |||
reference material or to cite them other than as "work in progress." | reference material or to cite them other than as "work in progress." | |||
The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
http://www.ietf.org/ietf/1id-abstracts.txt | http://www.ietf.org/ietf/1id-abstracts.txt | |||
The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
http://www.ietf.org/shadow.html | http://www.ietf.org/shadow.html | |||
This Internet-Draft will expire on September 16, 2021. | This Internet-Draft will expire on October 12, 2021. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2021 IETF Trust and the persons identified as the | Copyright (c) 2021 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(https://trustee.ietf.org/license-info) in effect on the date of | (https://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
carefully, as they describe your rights and restrictions with | carefully, as they describe your rights and restrictions with | |||
respect to this document. Code Components extracted from this | respect to this document. Code Components extracted from this | |||
document must include Simplified BSD License text as described in | document must include Simplified BSD License text as described in | |||
Section 4.e of the Trust Legal Provisions and are provided | Section 4.e of the Trust Legal Provisions and are provided | |||
without warranty as described in the Simplified BSD License. | without warranty as described in the Simplified BSD License. | |||
Abstract | Abstract | |||
This memo provides guidance to TCP implementers that are intended to | This memo provides guidance to TCP implementers that is intended to | |||
help improve convergence to steady-state operation without affecting | help improve connection convergence to steady-state operation | |||
interoperability. It updates and replaces RFC 2140's description of | without affecting interoperability. It updates and replaces RFC | |||
interdependent TCP control blocks and the ways that part of TCP | 2140's description of sharing TCP state, as typically represented in | |||
state can be shared among similar concurrent or consecutive | TCP Control Blocks, among similar concurrent or consecutive | |||
connections. TCP state includes a combination of parameters, such as | connections. | |||
connection state, current round-trip time estimates, congestion | ||||
control information, and process information. Most of this state is | ||||
maintained on a per-connection basis in the TCP Control Block (TCB), | ||||
but implementations can (and do) share certain TCB information | ||||
across connections to the same host. Such sharing is intended to | ||||
improve overall transient transport performance, while maintaining | ||||
backward-compatibility with existing implementations. The sharing | ||||
described herein is limited to only the TCB initialization and so | ||||
has no effect on the long-term behavior of TCP after a connection | ||||
has been established. | ||||
Table of Contents | Table of Contents | |||
1. Introduction...................................................3 | 1. Introduction...................................................3 | |||
2. Conventions Used in This Document..............................4 | 2. Conventions Used in This Document..............................4 | |||
3. Terminology....................................................4 | 3. Terminology....................................................4 | |||
4. The TCP Control Block (TCB)....................................6 | 4. The TCP Control Block (TCB)....................................5 | |||
5. TCB Interdependence............................................7 | 5. TCB Interdependence............................................7 | |||
6. Temporal Sharing...............................................7 | 6. Temporal Sharing...............................................7 | |||
6.1. Initialization of the new TCB................................7 | 6.1. Initialization of a new TCB..................................7 | |||
6.2. Updates to the new TCB.......................................8 | 6.2. Updates to the TCB cache.....................................8 | |||
6.3. Discussion...................................................9 | 6.3. Discussion..................................................10 | |||
7. Ensemble Sharing..............................................11 | 7. Ensemble Sharing..............................................11 | |||
7.1. Initialization of a new TCB.................................11 | 7.1. Initialization of a new TCB.................................11 | |||
7.2. Updates to the new TCB......................................12 | 7.2. Updates to the TCB cache....................................12 | |||
7.3. Discussion..................................................13 | 7.3. Discussion..................................................13 | |||
8. Issues with TCB information sharing...........................14 | 8. Issues with TCB information sharing...........................14 | |||
8.1. Traversing the same network path............................15 | 8.1. Traversing the same network path............................15 | |||
8.2. State dependence............................................15 | 8.2. State dependence............................................15 | |||
8.3. Problems with IP sharing....................................16 | 8.3. Problems with sharing based on IP address...................16 | |||
9. Implications..................................................16 | 9. Implications..................................................16 | |||
9.1. Layering....................................................16 | 9.1. Layering....................................................17 | |||
9.2. Other possibilities.........................................17 | 9.2. Other possibilities.........................................17 | |||
10. Implementation Observations..................................17 | 10. Implementation Observations..................................18 | |||
11. Updates to RFC 2140..........................................18 | 11. Changes Compared to RFC 2140.................................19 | |||
12. Security Considerations......................................19 | 12. Security Considerations......................................19 | |||
13. IANA Considerations..........................................20 | 13. IANA Considerations..........................................20 | |||
14. References...................................................20 | 14. References...................................................20 | |||
14.1. Normative References....................................20 | 14.1. Normative References....................................20 | |||
14.2. Informative References..................................21 | 14.2. Informative References..................................21 | |||
15. Acknowledgments..............................................23 | 15. Acknowledgments..............................................24 | |||
16. Change log...................................................23 | 16. Change log...................................................24 | |||
Appendix A : TCB Sharing History.................................27 | Appendix A : TCB Sharing History.................................28 | |||
Appendix B : TCP Option Sharing and Caching......................28 | Appendix B : TCP Option Sharing and Caching......................29 | |||
Appendix C : Automating the Initial Window in TCP over Long | Appendix C : Automating the Initial Window in TCP over Long | |||
Timescales.......................................................30 | Timescales.......................................................31 | |||
C.1. Introduction.............................................30 | C.1. Introduction.............................................31 | |||
C.2. Design Considerations....................................30 | C.2. Design Considerations....................................31 | |||
C.3. Proposed IW Algorithm....................................31 | C.3. Proposed IW Algorithm....................................32 | |||
C.4. Discussion...............................................35 | C.4. Discussion...............................................36 | |||
C.5. Observations.............................................36 | C.5. Observations.............................................37 | |||
1. Introduction | 1. Introduction | |||
TCP is a connection-oriented reliable transport protocol layered | TCP is a connection-oriented reliable transport protocol layered | |||
over IP [RFC793]. Each TCP connection maintains state, usually in a | over IP [RFC793]. Each TCP connection maintains state, usually in a | |||
data structure called the TCP Control Block (TCB). The TCB contains | data structure called the TCP Control Block (TCB). The TCB contains | |||
information about the connection state, its associated local | information about the connection state, its associated local | |||
process, and feedback parameters about the connection's transmission | process, and feedback parameters about the connection's transmission | |||
properties. As originally specified and usually implemented, most | properties. As originally specified and usually implemented, most | |||
TCB information is maintained on a per-connection basis. Some | TCB information is maintained on a per-connection basis. Some | |||
implementations can (and now do) share certain TCB information | implementations share certain TCB information across connections to | |||
across connections to the same host [RFC2140]. Such sharing is | the same host [RFC2140]. Such sharing is intended to lead to better | |||
intended to lead to better overall transient performance, especially | overall transient performance, especially for numerous short-lived | |||
for numerous short-lived and simultaneous connections, as often used | and simultaneous connections, as can be used in the World-Wide Web | |||
in the World-Wide Web [Be94][Br02]. This sharing of state is | and other applications [Be94][Br02]. This sharing of state is | |||
intended to help TCP connections converge to long term behavior | intended to help TCP connections converge to long term behavior | |||
(assuming stable application load, i.e., so-called "steady-state") | (assuming stable application load, i.e., so-called "steady-state") | |||
more quickly without affecting TCP interoperability. | more quickly without affecting TCP interoperability. | |||
This document updates RFC 2140's discussion of TCB state sharing and | This document updates RFC 2140's discussion of TCB state sharing and | |||
provides a complete replacement for that document. This state | provides a complete replacement for that document. This state | |||
sharing affects only TCB initialization [RFC2140] and thus has no | sharing affects only TCB initialization [RFC2140] and thus has no | |||
effect on the long-term behavior of TCP after a connection has been | effect on the long-term behavior of TCP after a connection has been | |||
established nor on interoperability. Path information shared across | established nor on interoperability. Path information shared across | |||
SYN destination port numbers assumes that TCP segments having the | SYN destination port numbers assumes that TCP segments having the | |||
same host-pair experience the same path properties, i.e., that | same host-pair experience the same path properties, i.e., that | |||
traffic is not routed differently based on port numbers or other | traffic is not routed differently based on port numbers or other | |||
connection parameters. The observations about TCB sharing in this | connection parameters (also addressed further in Section 8.1). The | |||
document apply similarly to any protocol with congestion state, | observations about TCB sharing in this document apply similarly to | |||
including SCTP [RFC4960] and DCCP [RFC4340], as well as for | any protocol with congestion state, including SCTP [RFC4960] and | |||
individual subflows in Multipath TCP [RFC8684]. | DCCP [RFC4340], as well as for individual subflows in Multipath TCP | |||
[RFC8684]. | ||||
2. Conventions Used in This Document | 2. Conventions Used in This Document | |||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | |||
"OPTIONAL" in this document are to be interpreted as described in | "OPTIONAL" in this document are to be interpreted as described in | |||
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all | BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all | |||
capitals, as shown here. | capitals, as shown here. | |||
The core of this document describes behavior that is already | The core of this document describes behavior that is already | |||
skipping to change at page 5, line 23 | skipping to change at page 5, line 15 | |||
uses transport packets to discover the PMTU [RFC4821] | uses transport packets to discover the PMTU [RFC4821] | |||
+PMTU - largest IP datagram that can traverse a path | +PMTU - largest IP datagram that can traverse a path | |||
[RFC1191][RFC8201] | [RFC1191][RFC8201] | |||
PMTUD - path-layer MTU discovery, a mechanism that relies on ICMP | PMTUD - path-layer MTU discovery, a mechanism that relies on ICMP | |||
error messages to discover the PMTU [RFC1191][RFC8201] | error messages to discover the PMTU [RFC1191][RFC8201] | |||
+RTT - round-trip time of a TCP packet exchange [RFC793] | +RTT - round-trip time of a TCP packet exchange [RFC793] | |||
+RTTVAR - variance of round-trip times of a TCP packet exchange | +RTTVAR - variation of round-trip times of a TCP packet exchange | |||
[RFC6298] | [RFC6298] | |||
+rwnd - TCP receive window size [RFC5681] | +rwnd - TCP receive window size [RFC5681] | |||
+sendcwnd - TCP send-side congestion window (cwnd) size [RFC5681] | +sendcwnd - TCP send-side congestion window (cwnd) size [RFC5681] | |||
+sendMSS - TCP maximum segment size, a value transmitted in a TCP | +sendMSS - TCP maximum segment size, a value transmitted in a TCP | |||
option that represents the largest TCP user data payload that can be | option that represents the largest TCP user data payload that can be | |||
received [RFC6691] | received [RFC6691] | |||
skipping to change at page 6, line 24 | skipping to change at page 6, line 18 | |||
pointers to Internet Protocol (IP) PCB | pointers to Internet Protocol (IP) PCB | |||
Per-connection shared state | Per-connection shared state | |||
macro-state | macro-state | |||
connection state | connection state | |||
timers | timers | |||
flags | flags | |||
local and remote host numbers and ports | local and remote host numbers and ports | |||
TCP option state | TCP option state | |||
micro-state | micro-state | |||
send and receive window state (size*, current number) | send and receive window state (size*, current number) | |||
cong. window size (sendcwnd)* | congestion window size (sendcwnd)* | |||
cong. window size threshold (ssthresh)* | congestion window size threshold (ssthresh)* | |||
max window size seen* | max window size seen* | |||
sendMSS# | sendMSS# | |||
MMS_S# | MMS_S# | |||
MMS_R# | MMS_R# | |||
PMTU# | PMTU# | |||
round-trip time and variance# | round-trip time and its variation# | |||
The per-connection information is shown as split into macro-state | The per-connection information is shown as split into macro-state | |||
and micro-state, terminology borrowed from [Co91]. Macro-state | and micro-state, terminology borrowed from [Co91]. Macro-state | |||
describes the protocol for establishing the initial shared state | describes the protocol for establishing the initial shared state | |||
about the connection; we include the endpoint numbers and components | about the connection; we include the endpoint numbers and components | |||
(timers, flags) required upon commencement that are later used to | (timers, flags) required upon commencement that are later used to | |||
help maintain that state. Micro-state describes the protocol after a | help maintain that state. Micro-state describes the protocol after a | |||
connection has been established, to maintain the reliability and | connection has been established, to maintain the reliability and | |||
congestion control of the data transferred in the connection. | congestion control of the data transferred in the connection. | |||
skipping to change at page 7, line 6 | skipping to change at page 6, line 48 | |||
class is clearly host-pair dependent (shown above as "#", e.g., | class is clearly host-pair dependent (shown above as "#", e.g., | |||
sendMSS, MMS_R, MMS_S, PMTU, RTT), because these parameters are | sendMSS, MMS_R, MMS_S, PMTU, RTT), because these parameters are | |||
defined by the endpoint or endpoint pair (sendMSS, MMS_R, MMS_S, | defined by the endpoint or endpoint pair (sendMSS, MMS_R, MMS_S, | |||
RTT) or are already cached and shared on that basis (PMTU | RTT) or are already cached and shared on that basis (PMTU | |||
[RFC1191][RFC4821]). The other is host-pair dependent in its | [RFC1191][RFC4821]). The other is host-pair dependent in its | |||
aggregate (shown above as "*", e.g., congestion window information, | aggregate (shown above as "*", e.g., congestion window information, | |||
current window sizes, etc.) because they depend on the total | current window sizes, etc.) because they depend on the total | |||
capacity between the two endpoints. | capacity between the two endpoints. | |||
Not all of the TCB state is necessarily sharable. In particular, | Not all of the TCB state is necessarily sharable. In particular, | |||
some TCP options are negotiated only upon application layer request, | some TCP options are negotiated only upon request by the application | |||
so their use may not be correlated across connections. Other options | layer, so their use may not be correlated across connections. Other | |||
negotiate connection-specific parameters, which are similarly not | options negotiate connection-specific parameters, which are | |||
shareable. These are discussed further in Appendix B. | similarly not shareable. These are discussed further in Appendix B. | |||
Finally, we exclude rwnd from further discussion because its value | Finally, we exclude rwnd from further discussion because its value | |||
should depend on the send window size, so it is already addressed by | should depend on the send window size, so it is already addressed by | |||
send window sharing and is not independently affected by sharing. | send window sharing and is not independently affected by sharing. | |||
5. TCB Interdependence | 5. TCB Interdependence | |||
There are two cases of TCB interdependence. Temporal sharing occurs | There are two cases of TCB interdependence. Temporal sharing occurs | |||
when the TCB of an earlier (now CLOSED) connection to a host is used | when the TCB of an earlier (now CLOSED) connection to a host is used | |||
to initialize some parameters of a new connection to that same host, | to initialize some parameters of a new connection to that same host, | |||
i.e., in sequence. Ensemble sharing occurs when a currently active | i.e., in sequence. Ensemble sharing occurs when a currently active | |||
connection to a host is used to initialize another (concurrent) | connection to a host is used to initialize another (concurrent) | |||
connection to that host. | connection to that host. | |||
6. Temporal Sharing | 6. Temporal Sharing | |||
The TCB data cache is accessed in two ways: it is read to initialize | The TCB data cache is accessed in two ways: it is read to initialize | |||
new TCBs and written when more current per-host state is available. | new TCBs and written when more current per-host state is available. | |||
6.1. Initialization of the new TCB | 6.1. Initialization of a new TCB | |||
TCBs for new connections can be initialized using context from past | TCBs for new connections can be initialized using cached context | |||
connections as follows: | from past connections as follows: | |||
TEMPORAL SHARING - TCB Initialization | TEMPORAL SHARING - TCB Initialization | |||
Cached TCB New TCB | Cached TCB New TCB | |||
-------------------------------------- | -------------------------------------- | |||
old_MMS_S old_MMS_S or not cached* | old_MMS_S old_MMS_S or not cached* | |||
old_MMS_R old_MMS_R or not cached* | old_MMS_R old_MMS_R or not cached* | |||
old_sendMSS old_sendMSS | old_sendMSS old_sendMSS | |||
skipping to change at page 8, line 43 | skipping to change at page 8, line 43 | |||
options and sharing is provided in Appendix B. | options and sharing is provided in Appendix B. | |||
TEMPORAL SHARING - Option Info Initialization | TEMPORAL SHARING - Option Info Initialization | |||
Cached New | Cached New | |||
------------------------------------ | ------------------------------------ | |||
old_TFO_cookie old_TFO_cookie | old_TFO_cookie old_TFO_cookie | |||
old_TFO_failure old_TFO_failure | old_TFO_failure old_TFO_failure | |||
6.2. Updates to the new TCB | 6.2. Updates to the TCB cache | |||
During the connection, the associated TCB can be updated based on | During a connection, the TCB cache can be updated based on events of | |||
particular events, as shown below: | current connections and their TCBs as they progress over time, as | |||
shown below: | ||||
TEMPORAL SHARING - Cache Updates | TEMPORAL SHARING - Cache Updates | |||
Cached TCB Current TCB when? New Cached TCB | Cached TCB Current TCB when? New Cached TCB | |||
---------------------------------------------------------- | ---------------------------------------------------------- | |||
old_MMS_S curr_MMS_S OPEN curr_MMS_S | old_MMS_S curr_MMS_S OPEN curr_MMS_S | |||
old_MMS_R curr_MMS_R OPEN curr_MMS_R | old_MMS_R curr_MMS_R OPEN curr_MMS_R | |||
old_sendMSS curr_sendMSS MSSopt curr_sendMSS | old_sendMSS curr_sendMSS MSSopt curr_sendMSS | |||
skipping to change at page 9, line 30 | skipping to change at page 9, line 30 | |||
old_RTTVAR curr_RTTVAR CLOSE merge(curr,old) | old_RTTVAR curr_RTTVAR CLOSE merge(curr,old) | |||
old_option curr_option ESTAB (depends on option) | old_option curr_option ESTAB (depends on option) | |||
old_ssthresh curr_ssthresh CLOSE merge(curr,old) | old_ssthresh curr_ssthresh CLOSE merge(curr,old) | |||
old_sendcwnd curr_sendcwnd CLOSE merge(curr,old) | old_sendcwnd curr_sendcwnd CLOSE merge(curr,old) | |||
+Note that PMTU is cached at the IP layer [RFC1191][RFC4821]. | +Note that PMTU is cached at the IP layer [RFC1191][RFC4821]. | |||
Merge() is the function that combines the current and previous (old) | ||||
values and may vary for each parameter of the TCB cache. The | ||||
particular function is not specified in this document; examples | ||||
include windowed averages (mean of the past N values, for some N) | ||||
and exponential decay (new = (1-alpha)*old + alpha *new, where alpha | ||||
is in the range [0..1]). | ||||
The table below gives an overview of option-specific information | The table below gives an overview of option-specific information | |||
that can be similarly shared. The TFO cookie is maintained until the | that can be similarly shared. The TFO cookie is maintained until the | |||
client explicitly requests it be updated as a separate event. | client explicitly requests it be updated as a separate event. | |||
TEMPORAL SHARING - Option Info Updates | TEMPORAL SHARING - Option Info Updates | |||
Cached Current when? New Cached | Cached Current when? New Cached | |||
--------------------------------------------------------- | --------------------------------------------------------- | |||
old_TFO_cookie old_TFO_cookie ESTAB old_TFO_cookie | old_TFO_cookie old_TFO_cookie ESTAB old_TFO_cookie | |||
skipping to change at page 10, line 14 | skipping to change at page 10, line 25 | |||
recent values from any connection. For sendMSS, the cache is | recent values from any connection. For sendMSS, the cache is | |||
consulted only at connection establishment and not otherwise | consulted only at connection establishment and not otherwise | |||
updated, which means that MSS options do not affect current | updated, which means that MSS options do not affect current | |||
connections. The default sendMSS is never saved; only reported MSS | connections. The default sendMSS is never saved; only reported MSS | |||
values update the cache, so an explicit override is required to | values update the cache, so an explicit override is required to | |||
reduce the sendMSS. Cached sendMSS affects only data sent in the SYN | reduce the sendMSS. Cached sendMSS affects only data sent in the SYN | |||
segment, i.e., during client connection initiation or during | segment, i.e., during client connection initiation or during | |||
simultaneous open; all other segment MSS are based on the value | simultaneous open; all other segment MSS are based on the value | |||
updated as included in the SYN. | updated as included in the SYN. | |||
RTT values are updated by formulae that merge the old and new | RTT values are updated by formulae that merges the old and new | |||
values. Dynamic RTT estimation requires a sequence of RTT | values, as noted in Section 6.2. Dynamic RTT estimation requires a | |||
measurements. As a result, the cached RTT (and its variance) is an | sequence of RTT measurements. As a result, the cached RTT (and its | |||
average of its previous value with the contents of the currently | variation) is an average of its previous value with the contents of | |||
active TCB for that host, when a TCB is closed. RTT values are | the currently active TCB for that host, when a TCB is closed. RTT | |||
updated only when a connection is closed. The method for merging old | values are updated only when a connection is closed. The method for | |||
and current values needs to attempt to reduce the transient effects | merging old and current values needs to attempt to reduce the | |||
of the new connections. | transient effects of the new connections. | |||
The updates for RTT, RTTVAR and ssthresh rely on existing | The updates for RTT, RTTVAR and ssthresh rely on existing | |||
information, i.e., old values. Should no such values exist, the | information, i.e., old values. Should no such values exist, the | |||
current values are cached instead. | current values are cached instead. | |||
TCP options are copied or merged depending on the details of each | TCP options are copied or merged depending on the details of each | |||
option, where "merge" is some function that combines the values of | option. E.g., TFO state is updated when a connection is established | |||
"curr" and "old". E.g., TFO state is updated when a connection is | and read before establishing a new connection. | |||
established and read before establishing a new connection. | ||||
Sections 8 and 9 discuss compatibility issues and implications of | Sections 8 and 9 discuss compatibility issues and implications of | |||
sharing the specific information listed above. Section 10 gives an | sharing the specific information listed above. Section 10 gives an | |||
overview of known implementations. | overview of known implementations. | |||
Most cached TCB values are updated when a connection closes. The | Most cached TCB values are updated when a connection closes. The | |||
exceptions are MMS_R and MMS_S, which are reported by IP [RFC1122], | exceptions are MMS_R and MMS_S, which are reported by IP [RFC1122], | |||
PMTU which is updated after Path MTU Discovery and also reported by | PMTU which is updated after Path MTU Discovery and also reported by | |||
IP [RFC1191][RFC4821][RFC8201], and sendMSS, which is updated if the | IP [RFC1191][RFC4821][RFC8201], and sendMSS, which is updated if the | |||
MSS option is received in the TCP SYN header. | MSS option is received in the TCP SYN header. | |||
skipping to change at page 11, line 29 | skipping to change at page 11, line 39 | |||
Sharing cached TCB data across concurrent connections requires | Sharing cached TCB data across concurrent connections requires | |||
attention to the aggregate nature of some of the shared state. For | attention to the aggregate nature of some of the shared state. For | |||
example, although MSS and RTT values can be shared by copying, it | example, although MSS and RTT values can be shared by copying, it | |||
may not be appropriate to simply copy congestion window or ssthresh | may not be appropriate to simply copy congestion window or ssthresh | |||
information; instead, the new values can be a function (f) of the | information; instead, the new values can be a function (f) of the | |||
cumulative values and the number of connections (N). | cumulative values and the number of connections (N). | |||
7.1. Initialization of a new TCB | 7.1. Initialization of a new TCB | |||
TCBs for new connections can be initialized using context from | TCBs for new connections can be initialized using cached context | |||
concurrent connections as follows: | from concurrent connections as follows: | |||
ENSEMBLE SHARING - TCB Initialization | ENSEMBLE SHARING - TCB Initialization | |||
Cached TCB New TCB | Cached TCB New TCB | |||
------------------------------------------ | ------------------------------------------ | |||
old_MMS_S old_MMS_S | old_MMS_S old_MMS_S | |||
old_MMS_R old_MMS_R | old_MMS_R old_MMS_R | |||
old_sendMSS old_sendMSS | old_sendMSS old_sendMSS | |||
skipping to change at page 12, line 29 ¶ | skipping to change at page 12, line 29 ¶ | |||
old_RTTVAR old_RTTVAR | old_RTTVAR old_RTTVAR | |||
sum(old_ssthresh) f(sum(old_ssthresh), N) | sum(old_ssthresh) f(sum(old_ssthresh), N) | |||
sum(old_sendcwnd) f(sum(old_sendcwnd), N) | sum(old_sendcwnd) f(sum(old_sendcwnd), N) | |||
_ | _ | |||
old_option (option specific) | old_option (option specific) | |||
+Note that PMTU is cached at the IP layer [RFC1191][RFC4821]. | +Note that PMTU is cached at the IP layer [RFC1191][RFC4821]. | |||
In the table, the cached sum() is a total across all active | ||||
connections because these parameters act in aggregate; similarly f() | ||||
is a function that updates that sum based on the new connection's | ||||
values, represented as "N". | ||||
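As a minimal sketch of the aggregate bookkeeping described above, the following Python fragment keeps running sums for an ensemble and derives initial values for a new TCB. The class name EnsembleCache and the equal-share policy in f() are assumptions made for illustration only; the document does not prescribe a particular f().

```python
class EnsembleCache:
    """Hypothetical per-host-pair cache of aggregate congestion state."""

    def __init__(self):
        self.sum_sendcwnd = 0   # sum(old_sendcwnd), in bytes
        self.sum_ssthresh = 0   # sum(old_ssthresh), in bytes
        self.n = 0              # number of active connections (N)

    def add(self, sendcwnd, ssthresh):
        """Account for a connection that joins the ensemble."""
        self.sum_sendcwnd += sendcwnd
        self.sum_ssthresh += ssthresh
        self.n += 1

    def remove(self, sendcwnd, ssthresh):
        """Account for a connection that leaves the ensemble."""
        self.sum_sendcwnd -= sendcwnd
        self.sum_ssthresh -= ssthresh
        self.n -= 1


def f(total, n):
    """Assumed policy: a new connection gets an equal share, total / (N + 1)."""
    if n < 0:
        raise ValueError("connection count cannot be negative")
    return total // (n + 1)


cache = EnsembleCache()
cache.add(sendcwnd=20000, ssthresh=60000)
cache.add(sendcwnd=10000, ssthresh=30000)
new_sendcwnd = f(cache.sum_sendcwnd, cache.n)
new_ssthresh = f(cache.sum_ssthresh, cache.n)
```

Other choices of f() are possible, for example one that weights connections unequally; the equal split is used here only because it is the simplest to state.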
The table below gives an overview of option-specific information | The table below gives an overview of option-specific information | |||
that can be similarly shared. Again, the TFO_cookie is updated upon | that can be similarly shared. Again, the TFO_cookie is updated upon | |||
explicit client request, which is a separate event. | explicit client request, which is a separate event. | |||
ENSEMBLE SHARING - Option Info Initialization | ENSEMBLE SHARING - Option Info Initialization | |||
Cached New | Cached New | |||
------------------------------------ | ------------------------------------ | |||
old_TFO_cookie old_TFO_cookie | old_TFO_cookie old_TFO_cookie | |||
old_TFO_failure old_TFO_failure | old_TFO_failure old_TFO_failure | |||
7.2. Updates to the new TCB | 7.2. Updates to the TCB cache | |||
During the connection, the associated TCB can be updated based on | During a connection, the TCB cache can be updated based on changes | |||
changes to concurrent connections, as shown below: | to concurrent connections and their TCBs, as shown below: | |||
ENSEMBLE SHARING - Cache Updates | ENSEMBLE SHARING - Cache Updates | |||
Cached TCB Current TCB when? New Cached TCB | Cached TCB Current TCB when? New Cached TCB | |||
--------------------------------------------------------------- | --------------------------------------------------------------- | |||
old_MMS_S curr_MMS_S OPEN curr_MMS_S | old_MMS_S curr_MMS_S OPEN curr_MMS_S | |||
old_MMS_R curr_MMS_R OPEN curr_MMS_R | old_MMS_R curr_MMS_R OPEN curr_MMS_R | |||
old_sendMSS curr_sendMSS MSSopt curr_sendMSS | old_sendMSS curr_sendMSS MSSopt curr_sendMSS | |||
old_PMTU curr_PMTU PMTUD+ / curr_PMTU | old_PMTU curr_PMTU PMTUD+ / curr_PMTU | |||
PLPMTUD+ | PLPMTUD+ | |||
old_RTT curr_RTT update rtt_update(old,curr) | old_RTT curr_RTT update rtt_update(old, curr) | |||
old_RTTVAR curr_RTTVAR update rtt_update(old,curr) | old_RTTVAR curr_RTTVAR update rtt_update(old, curr) | |||
old_ssthresh curr_ssthresh update adjust sum as appropriate | old_ssthresh curr_ssthresh update adjust sum as appropriate | |||
old_sendcwnd curr_sendcwnd update adjust sum as appropriate | old_sendcwnd curr_sendcwnd update adjust sum as appropriate | |||
old_option curr_option (depends) (option specific) | old_option curr_option (depends) (option specific) | |||
+Note that the PMTU is cached at the IP layer [RFC1191][RFC4821]. | +Note that the PMTU is cached at the IP layer [RFC1191][RFC4821]. | |||
In the table, rtt_update() is the function used to combine old and | ||||
current values, e.g., as a windowed average or exponentially decayed | ||||
average. | ||||
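The two example forms of rtt_update() mentioned above can be sketched as follows. This is an illustration under stated assumptions, not a required algorithm: the gain alpha = 1/8 and the window length of 8 samples are hypothetical defaults chosen for the example, and the class name WindowedRtt is likewise invented here.

```python
from collections import deque


def rtt_update(old, curr, alpha=0.125):
    """Exponentially decayed average of the cached and current values.

    alpha is an assumed gain; a missing cached value is seeded with curr.
    """
    if old is None:
        return curr
    return (1 - alpha) * old + alpha * curr


class WindowedRtt:
    """Windowed average over the most recent samples (window size assumed)."""

    def __init__(self, size=8):
        self.samples = deque(maxlen=size)  # oldest sample drops out when full

    def update(self, curr):
        self.samples.append(curr)
        return sum(self.samples) / len(self.samples)


cached_rtt = rtt_update(None, 100.0)        # first sample seeds the cache
cached_rtt = rtt_update(cached_rtt, 200.0)  # later samples are blended in

window = WindowedRtt(size=2)
window.update(100.0)
windowed_rtt = window.update(200.0)
```

The same function can be applied to RTTVAR, since the table uses rtt_update() for both rows.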
The table below gives an overview of option-specific information | The table below gives an overview of option-specific information | |||
that can be similarly shared. | that can be similarly shared. | |||
ENSEMBLE SHARING - Option Info Updates | ENSEMBLE SHARING - Option Info Updates | |||
Cached Current when? New Cached | Cached Current when? New Cached | |||
---------------------------------------------------------- | ---------------------------------------------------------- | |||
old_TFO_cookie old_TFO_cookie ESTAB old_TFO_cookie | old_TFO_cookie old_TFO_cookie ESTAB old_TFO_cookie | |||
old_TFO_failure old_TFO_failure ESTAB old_TFO_failure | old_TFO_failure old_TFO_failure ESTAB old_TFO_failure | |||
skipping to change at page 14, line 17 ¶ | skipping to change at page 14, line 21 ¶ | |||
Congestion window size and ssthresh aggregation are more complicated | Congestion window size and ssthresh aggregation are more complicated | |||
in the concurrent case. When there is an ensemble of connections, we | in the concurrent case. When there is an ensemble of connections, we | |||
need to decide how that ensemble would have shared these variables, | need to decide how that ensemble would have shared these variables, | |||
in order to derive initial values for new TCBs. | in order to derive initial values for new TCBs. | |||
Sections 8 and 9 discuss compatibility issues and implications of | Sections 8 and 9 discuss compatibility issues and implications of | |||
sharing the specific information listed above. | sharing the specific information listed above. | |||
There are several ways to initialize the congestion window in a new | There are several ways to initialize the congestion window in a new | |||
TCB among an ensemble of current connections to a host. Current TCP | TCB among an ensemble of current connections to a host. Current TCP | |||
implementations initialize it to four segments as standard [rfc3390] | implementations initialize it to four segments as standard [RFC3390] | |||
and 10 segments experimentally [RFC6928]. These approaches assume | and 10 segments experimentally [RFC6928]. These approaches assume | |||
that new connections should behave as conservatively as possible. | that new connections should behave as conservatively as possible. | |||
The algorithm described in [Ba12] adjusts the initial cwnd depending | The algorithm described in [Ba12] adjusts the initial cwnd depending | |||
on the cwnd values of ongoing connections. It is also possible to | on the cwnd values of ongoing connections. It is also possible to | |||
use sharing mechanisms over long timescales to adapt TCP's initial | use sharing mechanisms over long timescales to adapt TCP's initial | |||
window automatically, as described further in Appendix C. | window automatically, as described further in Appendix C. | |||
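For reference, the two fixed initial-window rules cited above can be written out as a short sketch. The formulas follow the upper bounds given in RFC 3390 and RFC 6928; the function names are hypothetical, and the values are in bytes.

```python
def initial_window_rfc3390(mss):
    """Upper bound on the initial window from RFC 3390, in bytes."""
    return min(4 * mss, max(2 * mss, 4380))


def initial_window_rfc6928(mss):
    """Upper bound on the initial window from RFC 6928, in bytes."""
    return min(10 * mss, max(2 * mss, 14600))


iw_standard = initial_window_rfc3390(1460)
iw_experimental = initial_window_rfc6928(1460)
```

Both rules depend only on the MSS, whereas an ensemble-aware approach such as the one in [Ba12] also takes the cwnd of ongoing connections into account.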
8. Issues with TCB information sharing | 8. Issues with TCB information sharing | |||
Here, we discuss various types of problems that may arise with TCB | Here, we discuss various types of problems that may arise with TCB | |||
skipping to change at page 15, line 20 ¶ | skipping to change at page 15, line 22 ¶ | |||
Multipath routing that relies on examining transport headers, such | Multipath routing that relies on examining transport headers, such | |||
as ECMP and LAG [RFC7424], may not result in repeatable path | as ECMP and LAG [RFC7424], may not result in repeatable path | |||
selection when TCP segments are encapsulated, encrypted, or altered | selection when TCP segments are encapsulated, encrypted, or altered | |||
- for example, in some Virtual Private Network (VPN) tunnels that | - for example, in some Virtual Private Network (VPN) tunnels that | |||
rely on proprietary encapsulation. Similarly, such approaches cannot | rely on proprietary encapsulation. Similarly, such approaches cannot | |||
operate deterministically when the TCP header is encrypted, e.g., | operate deterministically when the TCP header is encrypted, e.g., | |||
when using IPsec ESP (although TCB interdependence among the entire | when using IPsec ESP (although TCB interdependence among the entire | |||
set sharing the same endpoint IP addresses should work without | set sharing the same endpoint IP addresses should work without | |||
problems when the TCP header is encrypted). Measures to increase the | problems when the TCP header is encrypted). Measures to increase the | |||
probability that connections use the same path could be applied: | probability that connections use the same path could be applied: | |||
e.g., the connections could be given the same IPv6 flow label. TCB | e.g., the connections could be given the same IPv6 flow label | |||
interdependence can also be extended to sets of host IP address | [RFC6437]. TCB interdependence can also be extended to sets of host | |||
pairs that share the same network path conditions, such as when a | IP address pairs that share the same network path conditions, such | |||
group of addresses is on the same LAN (see Section 9). | as when a group of addresses is on the same LAN (see Section 9). | |||
Traversing the same path is not important for host-specific | Traversing the same path is not important for host-specific | |||
information such as rwnd and TCP option state, such as TFOinfo, or | information such as rwnd and TCP option state, such as TFOinfo, or | |||
for information that is already cached per-host, such as path MTU. | for information that is already cached per-host, such as path MTU. | |||
When TCB information is shared across different SYN destination | When TCB information is shared across different SYN destination | |||
ports, path-related information can be incorrect; however, the | ports, path-related information can be incorrect; however, the | |||
impact of this error is potentially diminished if (as discussed | impact of this error is potentially diminished if (as discussed | |||
here) TCB sharing affects only the transient event of a connection | here) TCB sharing affects only the transient event of a connection | |||
start or if TCB information is shared only within connections to the | start or if TCB information is shared only within connections to the | |||
same SYN destination port. | same SYN destination port. | |||
skipping to change at page 16, line 5 ¶ | skipping to change at page 16, line 7 ¶ | |||
8.2. State dependence | 8.2. State dependence | |||
There may be additional considerations to the way in which TCB | There may be additional considerations to the way in which TCB | |||
interdependence rebalances congestion feedback among the current | interdependence rebalances congestion feedback among the current | |||
connections, e.g., it may be appropriate to consider the impact of a | connections, e.g., it may be appropriate to consider the impact of a | |||
connection being in Fast Recovery [RFC5681] or some other similar | connection being in Fast Recovery [RFC5681] or some other similar | |||
unusual feedback state, e.g., as inhibiting or affecting the | unusual feedback state, e.g., as inhibiting or affecting the | |||
calculations described herein. | calculations described herein. | |||
8.3. Problems with IP sharing | 8.3. Problems with sharing based on IP address | |||
It can be wrong to share TCB information between TCP connections on | It can be wrong to share TCB information between TCP connections on | |||
the same host as identified by the IP address if an IP address is | the same host as identified by the IP address if an IP address is | |||
assigned to a new host (e.g., IP address spinning, as is used by | assigned to a new host (e.g., IP address spinning, as is used by | |||
ISPs to inhibit running servers). It can be wrong if Network Address | ISPs to inhibit running servers). It can be wrong if Network Address | |||
(and Port) Translation (NA(P)T) [RFC2663] or any other IP sharing | (and Port) Translation (NA(P)T) [RFC2663] or any other IP sharing | |||
mechanism is used. Such mechanisms are less likely to be used with | mechanism is used. Such mechanisms are less likely to be used with | |||
IPv6. Other methods to identify a host could also be considered to | IPv6. Other methods to identify a host could also be considered to | |||
make correct TCB sharing more likely. Moreover, some TCB information | make correct TCB sharing more likely. Moreover, some TCB information | |||
is about dominant path properties rather than the specific host. IP | is about dominant path properties rather than the specific host. IP | |||
skipping to change at page 16, line 28 ¶ | skipping to change at page 16, line 30 ¶ | |||
9. Implications | 9. Implications | |||
There are several implications to incorporating TCB interdependence | There are several implications to incorporating TCB interdependence | |||
in TCP implementations. First, it may reduce the need for | in TCP implementations. First, it may reduce the need for | |||
application-layer multiplexing for performance enhancement | application-layer multiplexing for performance enhancement | |||
[RFC7231]. Protocols like HTTP/2 [RFC7540] avoid connection | [RFC7231]. Protocols like HTTP/2 [RFC7540] avoid connection | |||
reestablishment costs by serializing or multiplexing a set of per- | reestablishment costs by serializing or multiplexing a set of per- | |||
host connections across a single TCP connection. This avoids TCP's | host connections across a single TCP connection. This avoids TCP's | |||
per-connection OPEN handshake and also avoids recomputing the MSS, | per-connection OPEN handshake and also avoids recomputing the MSS, | |||
RTT, and congestion window values. By avoiding the so-called, "slow- | RTT, and congestion window values. By avoiding the so-called "slow- | |||
start restart," performance can be optimized [Hu01]. TCB | start restart", performance can be optimized [Hu01]. TCB | |||
interdependence can provide the "slow-start restart avoidance" of | interdependence can provide the "slow-start restart avoidance" of | |||
multiplexing, without requiring a multiplexing mechanism at the | multiplexing, without requiring a multiplexing mechanism at the | |||
application layer. | application layer. | |||
Like the initial version of this document [RFC2140], this update's | Like the initial version of this document [RFC2140], this update's | |||
approach to TCB interdependence focuses on sharing a set of TCBs by | approach to TCB interdependence focuses on sharing a set of TCBs by | |||
updating the TCB state to reduce the impact of transients when | updating the TCB state to reduce the impact of transients when | |||
connections begin or end. Other mechanisms have since been proposed | connections begin, end, or otherwise significantly change state. | |||
to continuously share information between all ongoing communication | Other mechanisms have since been proposed to continuously share | |||
(including connectionless protocols), updating the congestion state | information between all ongoing communication (including | |||
during any congestion-related event (e.g., timeout, loss | connectionless protocols), updating the congestion state during any | |||
confirmation, etc.) [RFC3124]. By dealing exclusively with | congestion-related event (e.g., timeout, loss confirmation, etc.) | |||
transients, TCB interdependence is more likely to exhibit the same | [RFC3124]. By dealing exclusively with transients, the approach in | |||
behavior as unmodified, independent TCP connections. | this document is more likely to exhibit the same "steady-state" | |||
behavior as unmodified, independent TCP connections. | ||||
9.1. Layering | 9.1. Layering | |||
TCB interdependence pushes some of the TCP implementation from the | TCB interdependence pushes some of the TCP implementation from the | |||
traditional transport layer (in the ISO model), to the network | traditional transport layer (in the ISO model), to the network | |||
layer. This acknowledges that some state is in fact per-host-pair or | layer. This acknowledges that some state is in fact per-host-pair or | |||
can be per-path as indicated solely by that host-pair. Transport | can be per-path as indicated solely by that host-pair. Transport | |||
protocols typically manage per-application-pair associations (per | protocols typically manage per-application-pair associations (per | |||
stream), and network protocols manage per-host-pair and path | stream), and network protocols manage per-host-pair and path | |||
associations (routing). Round-trip time, MSS, and congestion | associations (routing). Round-trip time, MSS, and congestion | |||
information could be more appropriately handled in a network-layer | information could be more appropriately handled at the network | |||
fashion, aggregated among concurrent connections, and shared across | layer, aggregated among concurrent connections, and shared across | |||
connection instances [RFC3124]. | connection instances [RFC3124]. | |||
An earlier version of RTT sharing suggested implementing RTT state | An earlier version of RTT sharing suggested implementing RTT state | |||
at the IP layer, rather than at the TCP layer. Our observations | at the IP layer, rather than at the TCP layer. Our observations | |||
describe sharing state among TCP connections, which avoids some of | describe sharing state among TCP connections, which avoids some of | |||
the difficulties in an IP-layer solution. One such problem of an IP | the difficulties in an IP-layer solution. One such problem of an IP | |||
layer solution is determining the correspondence between packet | layer solution is determining the correspondence between packet | |||
exchanges using IP header information alone, where such | exchanges using IP header information alone, where such | |||
correspondence is needed to compute RTT. Because TCB sharing | correspondence is needed to compute RTT. Because TCB sharing | |||
computes RTTs inside the TCP layer using TCP header information, it | computes RTTs inside the TCP layer using TCP header information, it | |||
skipping to change at page 17, line 42 ¶ | skipping to change at page 17, line 50 ¶ | |||
There may be other information that can be shared between concurrent | There may be other information that can be shared between concurrent | |||
connections. For example, knowing that another connection has just | connections. For example, knowing that another connection has just | |||
tried to expand its window size and failed, a connection may not | tried to expand its window size and failed, a connection may not | |||
attempt to do the same for some period. The idea is that existing | attempt to do the same for some period. The idea is that existing | |||
TCP implementations infer the behavior of all competing connections, | TCP implementations infer the behavior of all competing connections, | |||
including those within the same host or subnet. One possible | including those within the same host or subnet. One possible | |||
optimization is to make that implicit feedback explicit, via | optimization is to make that implicit feedback explicit, via | |||
extended information associated with the endpoint IP address and its | extended information associated with the endpoint IP address and its | |||
TCP implementation, rather than per-connection state in the TCB. | TCP implementation, rather than per-connection state in the TCB. | |||
This document focuses on sharing TCB information at connection | ||||
initialization. Subsequent to RFC 2140, there have been numerous | ||||
approaches that attempt to coordinate ongoing state across | ||||
concurrent connections, both within TCP and other congestion- | ||||
reactive protocols, which are summarized in [Is18]. These approaches | ||||
are more complex to implement and their comparison to steady-state | ||||
TCP equivalence can be more difficult to establish, sometimes | ||||
intentionally (i.e., they sometimes intend to provide a different | ||||
kind of "fairness" than emerges from TCP operation). | ||||
10. Implementation Observations | 10. Implementation Observations | |||
The observation that some TCB state is host-pair specific rather | The observation that some TCB state is host-pair specific rather | |||
than application-pair dependent is not new and is a common | than application-pair dependent is not new and is a common | |||
engineering decision in layered protocol implementations. Although | engineering decision in layered protocol implementations. Although | |||
now deprecated, T/TCP [RFC1644] was the first to propose using | now deprecated, T/TCP [RFC1644] was the first to propose using | |||
caches in order to maintain TCB states (see 0). | caches in order to maintain TCB states (see Appendix A). | |||
The table below describes the current implementation status for TCB | The table below describes the current implementation status for TCB | |||
temporal sharing in Windows as of December 2020, Linux kernel | temporal sharing in Windows as of December 2020, Apple variants | |||
(macOS, iOS, iPadOS, tvOS, watchOS) as of January 2021, Linux kernel | ||||
version 5.10.3, and FreeBSD 12. Ensemble sharing is not yet | version 5.10.3, and FreeBSD 12. Ensemble sharing is not yet | |||
implemented. | implemented. | |||
KNOWN IMPLEMENTATION STATUS | KNOWN IMPLEMENTATION STATUS | |||
TCB data Status | TCB data Status | |||
------------------------------------------------------------ | ------------------------------------------------------------ | |||
old_MMS_S Not shared | old_MMS_S Not shared | |||
old_MMS_R Not shared | old_MMS_R Not shared | |||
skipping to change at page 18, line 35 ¶ | skipping to change at page 19, line 6 ¶ | |||
old_TFOinfo Cached and shared in Apple, Linux, Windows | old_TFOinfo Cached and shared in Apple, Linux, Windows | |||
old_sendcwnd Not shared | old_sendcwnd Not shared | |||
old_ssthresh Cached and shared in Apple, FreeBSD*, Linux* | old_ssthresh Cached and shared in Apple, FreeBSD*, Linux* | |||
TFO failure Cached and shared in Apple | TFO failure Cached and shared in Apple | |||
In the table above, "Apple" refers to all Apple OSes, i.e., | In the table above, "Apple" refers to all Apple OSes, i.e., | |||
desktop/laptop macOS, phone iOS, video player tvOS, pad ipadOS, and | desktop/laptop macOS, phone iOS, pad iPadOS, video player tvOS, and | |||
watch watchOS, which all share the same Internet protocol stack. | watch watchOS, which all share the same Internet protocol stack. | |||
*Note: In FreeBSD, new ssthresh is the mean of curr_ssthresh and | *Note: In FreeBSD, new ssthresh is the mean of curr_ssthresh and | |||
previous value if a previous value exists; in Linux, the calculation | previous value if a previous value exists; in Linux, the calculation | |||
depends on state and is max(curr_cwnd/2, old_ssthresh) in most | depends on state and is max(curr_cwnd/2, old_ssthresh) in most | |||
cases. | cases. | |||
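The two ssthresh caching rules in the note above can be transcribed directly. This sketch only restates the formulas from the note; it is not taken from either kernel's source, the function names are hypothetical, and the Linux rule is described in the note as applying "in most cases" rather than always.

```python
def freebsd_style_ssthresh(curr_ssthresh, old_ssthresh=None):
    """Mean of current and previous value, if a previous value exists."""
    if old_ssthresh is None:
        return curr_ssthresh
    return (curr_ssthresh + old_ssthresh) // 2


def linux_style_ssthresh(curr_cwnd, old_ssthresh):
    """max(curr_cwnd/2, old_ssthresh), the common case per the note."""
    return max(curr_cwnd // 2, old_ssthresh)


cached_freebsd = freebsd_style_ssthresh(40, old_ssthresh=20)
cached_linux = linux_style_ssthresh(curr_cwnd=60, old_ssthresh=20)
```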
11. Updates to RFC 2140 | 11. Changes Compared to RFC 2140 | |||
This document updates the description of TCB sharing in RFC 2140 and | This document updates the description of TCB sharing in RFC 2140 and | |||
its associated impact on existing and new connection state, | its associated impact on existing and new connection state, | |||
providing a complete replacement for that document [RFC2140]. It | providing a complete replacement for that document [RFC2140]. It | |||
clarifies the previous description and terminology and extends the | clarifies the previous description and terminology and extends the | |||
mechanism to its impact on new protocols and mechanisms, including | mechanism to its impact on new protocols and mechanisms, including | |||
multipath TCP, fast open, PLPMTUD, NAT, and the TCP Authentication | multipath TCP, fast open, PLPMTUD, NAT, and the TCP Authentication | |||
Option. | Option. | |||
The detailed impact on TCB state addresses TCB parameters in greater | The detailed impact on TCB state addresses TCB parameters in greater | |||
detail, addressing MSS in both the send and receive direction, MSS | detail, addressing MSS in both the send and receive direction, MSS | |||
and send-MSS separately, adds path MTU and ssthresh, and addresses | and sendMSS separately, adds path MTU and ssthresh, and addresses | |||
the impact on TCP option state. | the impact on TCP option state. | |||
New sections have been added to address compatibility issues and | New sections have been added to address compatibility issues and | |||
implementation observations. The relation of this work to T/TCP has | implementation observations. The relation of this work to T/TCP has | |||
been moved to Appendix A on history, partly to reflect the deprecation of | been moved to Appendix A on history, partly to reflect the deprecation of | |||
that protocol. | that protocol. | |||
Appendix C has been added to discuss the potential to use temporal | Appendix C has been added to discuss the potential to use temporal | |||
sharing over long timescales to adapt TCP's initial window | sharing over long timescales to adapt TCP's initial window | |||
automatically, avoiding the need to periodically revise a single | automatically, avoiding the need to periodically revise a single | |||
skipping to change at page 21, line 24 ¶ | skipping to change at page 21, line 42 ¶ | |||
[Al10] Allman, M., "Initial Congestion Window Specification", | [Al10] Allman, M., "Initial Congestion Window Specification", | |||
(work in progress), draft-allman-tcpm-bump-initcwnd-00, | (work in progress), draft-allman-tcpm-bump-initcwnd-00, | |||
Nov. 2010. | Nov. 2010. | |||
[Ba12] Barik, R., Welzl, M., Ferlin, S., Alay, O., "LISA: A | [Ba12] Barik, R., Welzl, M., Ferlin, S., Alay, O., "LISA: A | |||
Linked Slow-Start Algorithm for MPTCP", IEEE ICC, Kuala | Linked Slow-Start Algorithm for MPTCP", IEEE ICC, Kuala | |||
Lumpur, Malaysia, May 23-27 2016. | Lumpur, Malaysia, May 23-27 2016. | |||
[Ba20] Bagnulo, M., Briscoe, B., "ECN++: Adding Explicit | [Ba20] Bagnulo, M., Briscoe, B., "ECN++: Adding Explicit | |||
Congestion Notification (ECN) to TCP Control Packets", | Congestion Notification (ECN) to TCP Control Packets", | |||
draft-ietf-tcpm-generalized-ecn-06, Oct. 2020. | draft-ietf-tcpm-generalized-ecn-07, Feb. 2021. | |||
[Be94] Berners-Lee, T., et al., "The World-Wide Web," | [Be94] Berners-Lee, T., et al., "The World-Wide Web," | |||
Communications of the ACM, V37, Aug. 1994, pp. 76-82. | Communications of the ACM, V37, Aug. 1994, pp. 76-82. | |||
[Br94] Braden, B., "T/TCP -- Transaction TCP: Source Changes for | [Br94] Braden, B., "T/TCP -- Transaction TCP: Source Changes for | |||
Sun OS 4.1.3", Release 1.0, USC/ISI, September 14, 1994. | Sun OS 4.1.3", Release 1.0, USC/ISI, September 14, 1994. | |||
[Br02] Brownlee, N., Claffy, K., "Understanding Internet Traffic | [Br02] Brownlee, N., Claffy, K., "Understanding Internet Traffic | |||
Streams: Dragonflies and Tortoises", IEEE Communications | Streams: Dragonflies and Tortoises", IEEE Communications | |||
Magazine p110-117, 2002. | Magazine p110-117, 2002. | |||
skipping to change at page 22, line 5 ¶ | skipping to change at page 22, line 26 ¶ | |||
[FreeBSD] FreeBSD source code, Release 2.10, http://www.freebsd.org/ | [FreeBSD] FreeBSD source code, Release 2.10, http://www.freebsd.org/ | |||
[Hu01] Hughes, A., Touch, J., Heidemann, J., "Issues in Slow- | [Hu01] Hughes, A., Touch, J., Heidemann, J., "Issues in Slow- | |||
Start Restart After Idle", draft-hughes-restart-00 | Start Restart After Idle", draft-hughes-restart-00 | |||
(expired), Dec. 2001. | (expired), Dec. 2001. | |||
[Hu12] Hurtig, P., Brunstrom, A., "Enhanced metric caching for | [Hu12] Hurtig, P., Brunstrom, A., "Enhanced metric caching for | |||
short TCP flows," 2012 IEEE International Conference on | short TCP flows," 2012 IEEE International Conference on | |||
Communications (ICC), Ottawa, ON, 2012, pp. 1209-1213. | Communications (ICC), Ottawa, ON, 2012, pp. 1209-1213. | |||
[IANA] IANA TCP Parameters (options) registry, | ||||
https://www.iana.org/assignments/tcp-parameters | ||||
[Is18] Islam, S., Welzl, M., Hiorth, K., Hayes, D., Armitage, G., | ||||
Gjessing, S., "ctrlTCP: Reducing Latency through Coupled, | ||||
Heterogeneous Multi-Flow TCP Congestion Control," Proc. | ||||
IEEE INFOCOM Global Internet Symposium (GI) workshop (GI | ||||
2018), Honolulu, HI, April 2018. | ||||
[Ja88] Jacobson, V., Karels, M., "Congestion Avoidance and | [Ja88] Jacobson, V., Karels, M., "Congestion Avoidance and | |||
Control", Proc. Sigcomm 1988. | Control", Proc. Sigcomm 1988. | |||
[RFC1644] Braden, R., "T/TCP -- TCP Extensions for Transactions | [RFC1644] Braden, R., "T/TCP -- TCP Extensions for Transactions | |||
Functional Specification," RFC-1644, July 1994. | Functional Specification," RFC-1644, July 1994. | |||
[RFC1379] Braden, R., "Transaction TCP -- Concepts," RFC-1379, | [RFC1379] Braden, R., "Transaction TCP -- Concepts," RFC-1379, | |||
September 1992. | September 1992. | |||
[RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast | [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast | |||
skipping to change at page 22, line 43 ¶ | skipping to change at page 23, line 27 ¶ | |||
[RFC4340] Kohler, E., Handley, M., Floyd, S., "Datagram Congestion | [RFC4340] Kohler, E., Handley, M., Floyd, S., "Datagram Congestion | |||
Control Protocol (DCCP)," RFC 4340, Mar. 2006. | Control Protocol (DCCP)," RFC 4340, Mar. 2006. | |||
[RFC4960] Stewart, R., (Ed.), "Stream Control Transmission | [RFC4960] Stewart, R., (Ed.), "Stream Control Transmission | |||
Protocol," RFC4960, Sept. 2007. | Protocol," RFC4960, Sept. 2007. | |||
[RFC5925] Touch, J., Mankin, A., Bonica, R., "The TCP Authentication | [RFC5925] Touch, J., Mankin, A., Bonica, R., "The TCP Authentication | |||
Option," RFC 5925, June 2010. | Option," RFC 5925, June 2010. | |||
[RFC6437] Amante, S., Carpenter, B., Jiang, S., Rajahalme, J., "IPv6 | ||||
Flow Label Specification," RFC 6437, Nov. 2011. | ||||
[RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS)," | [RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS)," | |||
RFC 6691, July 2012. | RFC 6691, July 2012. | |||
[RFC6928] Chu, J., Dukkipati, N., Cheng, Y., Mathis, M., "Increasing | [RFC6928] Chu, J., Dukkipati, N., Cheng, Y., Mathis, M., "Increasing | |||
TCP's Initial Window," RFC 6928, Apr. 2013. | TCP's Initial Window," RFC 6928, Apr. 2013. | |||
[RFC7231] Fielding, R., Reschke, J., Eds., "HTTP/1.1 Semantics and | [RFC7231] Fielding, R., Reschke, J., Eds., "HTTP/1.1 Semantics and | |||
Content," RFC-7231, June 2014. | Content," RFC-7231, June 2014. | |||
[RFC7323] Borman, D., Braden, B., Jacobson, V., Scheffenegger, R., | [RFC7323] Borman, D., Braden, B., Jacobson, V., Scheffenegger, R., | |||
skipping to change at page 23, line 42 ¶ | skipping to change at page 24, line 30 ¶ | |||
research project between the University of Oslo and Huawei | research project between the University of Oslo and Huawei | |||
Technologies Co., Ltd. and were partly supported by USC/ISI's Postel | Technologies Co., Ltd. and were partly supported by USC/ISI's Postel | |||
Center. | Center. | |||
This document was prepared using 2-Word-v2.0.template.dot. | This document was prepared using 2-Word-v2.0.template.dot. | |||
16. Change log | 16. Change log | |||
This section should be removed upon final publication as an RFC. | This section should be removed upon final publication as an RFC. | |||
ietf-11: | ||||
- Addressed gen-art review and IESG feedback | ||||
ietf-10: | ietf-10: | |||
- Addressed gen-art review request for clarifications | - Addressed IETF last call feedback | |||
ietf-09: | ietf-09: | |||
- Correction of typographic errors | - Correction of typographic errors | |||
ietf-08: | ietf-08: | |||
- Address TSV AD comments, add Apple OS implementation status | - Address TSV AD comments, add Apple OS implementation status | |||
ietf-07: | ietf-07: | |||
skipping to change at page 27, line 8 ¶ | skipping to change at page 28, line 8 ¶ | |||
PO Box 1080 Blindern | PO Box 1080 Blindern | |||
Oslo N-0316 | Oslo N-0316 | |||
Norway | Norway | |||
Phone: +47 22 84 08 37 | Phone: +47 22 84 08 37 | |||
Email: safiquli@ifi.uio.no | Email: safiquli@ifi.uio.no | |||
Appendix A: TCB Sharing History | Appendix A: TCB Sharing History | |||
T/TCP proposed using caches to maintain TCB information across | T/TCP proposed using caches to maintain TCB information across | |||
instances (temporal sharing), e.g., smoothed RTT, RTT variance, | instances (temporal sharing), e.g., smoothed RTT, RTT variation, | |||
congestion avoidance threshold, and MSS [RFC1644]. These values were | congestion avoidance threshold, and MSS [RFC1644]. These values were | |||
in addition to connection counts used by T/TCP to accelerate data | in addition to connection counts used by T/TCP to accelerate data | |||
delivery prior to the full three-way handshake during an OPEN. The | delivery prior to the full three-way handshake during an OPEN. The | |||
goal was to aggregate TCB components where they reflect one | goal was to aggregate TCB components where they reflect one | |||
association - that of the host-pair, rather than artificially | association - that of the host-pair, rather than artificially | |||
separating those components by connection. | separating those components by connection. | |||
At least one T/TCP implementation saved the MSS and aggregated the | At least one T/TCP implementation saved the MSS and aggregated the | |||
RTT parameters across multiple connections but omitted caching the | RTT parameters across multiple connections but omitted caching the | |||
congestion window information [Br94], as originally specified in | congestion window information [Br94], as originally specified in | |||
skipping to change at page 28, line 8 ¶ | skipping to change at page 29, line 8 ¶ | |||
the SunOS 4.1.3 T/TCP extensions [Br94] and the FreeBSD port of same | the SunOS 4.1.3 T/TCP extensions [Br94] and the FreeBSD port of same | |||
[FreeBSD]. As mentioned before, only the MSS and RTT parameters were | [FreeBSD]. As mentioned before, only the MSS and RTT parameters were | |||
cached, as originally specified in [RFC1379]. Later discussion of | cached, as originally specified in [RFC1379]. Later discussion of | |||
T/TCP suggested including congestion control parameters in this | T/TCP suggested including congestion control parameters in this | |||
cache; for example, [RFC1644] (Section 3.1) hints at initializing | cache; for example, [RFC1644] (Section 3.1) hints at initializing | |||
the congestion window to the old window size. | the congestion window to the old window size. | |||
Appendix B: TCP Option Sharing and Caching | Appendix B: TCP Option Sharing and Caching | |||
In addition to the options that can be cached and shared, this memo | In addition to the options that can be cached and shared, this memo | |||
also lists known options for which state is unsafe to be kept. This | also lists known TCP options [IANA] for which state is unsafe to be | |||
list is not intended to be authoritative or exhaustive. | kept. This list is not intended to be authoritative or exhaustive. | |||
Obsolete (unsafe to keep state): | Obsolete (unsafe to keep state): | |||
ECHO | ECHO | |||
ECHO REPLY | ECHO REPLY | |||
PO Conn permitted | PO Conn permitted | |||
PO service profile | PO service profile | |||
skipping to change at page 30, line 12 ¶ | skipping to change at page 31, line 12 ¶ | |||
TFO cookie (if TFO succeeded in the past) | TFO cookie (if TFO succeeded in the past) | |||
Appendix C: Automating the Initial Window in TCP over Long Timescales | Appendix C: Automating the Initial Window in TCP over Long Timescales | |||
C.1. Introduction | C.1. Introduction | |||
Temporal sharing, as described earlier in this document, builds on | Temporal sharing, as described earlier in this document, builds on | |||
the assumption that multiple consecutive connections between the | the assumption that multiple consecutive connections between the | |||
same host pair are somewhat likely to be exposed to similar | same host pair are somewhat likely to be exposed to similar | |||
environment characteristics. The stored information can therefore | environment characteristics. The stored information can become less | |||
become invalid over time, and suitable precautions should be taken | accurate over time and suitable precautions should take this ageing | |||
(this is discussed further in section 8.1). However, there are also | into consideration (this is discussed further in section 8.1). | |||
cases where it can make sense to use much longer-term measurements | However, there are also cases where it can make sense to track these | |||
of TCP connections to gradually influence TCP parameters. This | values over longer periods, observing properties of TCP connections | |||
to gradually influence evolving trends in TCP parameters. This | ||||
appendix describes an example of such a case. | appendix describes an example of such a case. | |||
TCP's congestion control algorithm uses an initial window value | TCP's congestion control algorithm uses an initial window value | |||
(IW), both as a starting point for new connections and as an upper | (IW), both as a starting point for new connections and as an upper | |||
limit for restarting after an idle period [RFC5681][RFC7661]. This | limit for restarting after an idle period [RFC5681][RFC7661]. This | |||
value has evolved over time, originally one maximum segment size | value has evolved over time, originally one maximum segment size | |||
(MSS), and increased to the lesser of four MSS or 4,380 bytes | (MSS), and increased to the lesser of four MSS or 4,380 bytes | |||
[RFC3390][RFC5681]. For a typical Internet connection with a maximum | [RFC3390][RFC5681]. For a typical Internet connection with a maximum | |||
transmission unit (MTU) of 1500 bytes, this permits three segments | transmission unit (MTU) of 1500 bytes, this permits three segments | |||
of 1,460 bytes each. | of 1,460 bytes each. | |||
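As a worked check of the arithmetic above, the following minimal sketch computes the initial window upper bound using the formula from RFC 3390, min(4*MSS, max(2*MSS, 4380 bytes)). The function name and the 40-byte header overhead (IPv4 and TCP headers without options) are assumptions made for illustration and are not part of this document.

```python
def initial_window_bytes(mss: int) -> int:
    """Upper bound on the initial window per RFC 3390 (illustrative helper)."""
    return min(4 * mss, max(2 * mss, 4380))


MTU = 1500
# Assumption: 20-byte IPv4 header + 20-byte TCP header, no options.
MSS = MTU - 40

iw = initial_window_bytes(MSS)   # initial window in bytes
segments = iw // MSS             # number of full-sized segments permitted
print(iw, segments)
```

With an MSS of 1,460 bytes, the 4,380-byte term is the limiting one, which is where the three-segment figure comes from.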
skipping to change at page 31, line 28 ¶ | skipping to change at page 32, line 29 ¶ | |||
o Increase the IW in the absence of sustained loss of IW segments, | o Increase the IW in the absence of sustained loss of IW segments, | |||
as determined over a number of different connections. | as determined over a number of different connections. | |||
o Operate conservatively, i.e., tend towards leaving the IW the | o Operate conservatively, i.e., tend towards leaving the IW the | |||
same in the absence of sufficient information, and give greater | same in the absence of sufficient information, and give greater | |||
consideration to IW segment loss than IW segment success. | consideration to IW segment loss than IW segment success. | |||
We expect that, without other context, a good IW algorithm will | We expect that, without other context, a good IW algorithm will | |||
converge to a single value, but this is not required. An endpoint | converge to a single value, but this is not required. An endpoint | |||
with additional context or information, or deployed in a constrained | with additional context or information, or deployed in a constrained | |||
environment, can always use a different value. In specific, | environment, can always use a different value. In particular, | |||
information from previous connections, or sets of connections with a | information from previous connections, or sets of connections with a | |||
similar path, can already be used as context for such decisions (as | similar path, can already be used as context for such decisions (as | |||
noted in the core of this document). | noted in the core of this document). | |||
However, if a given IW value persistently causes packet loss during | However, if a given IW value persistently causes packet loss during | |||
the initial burst of packets, it is clearly inappropriate and could | the initial burst of packets, it is clearly inappropriate and could | |||
be inducing unnecessary loss in other competing connections. This | be inducing unnecessary loss in other competing connections. This | |||
might happen for sites behind very slow boxes with small buffers, | might happen for sites behind very slow boxes with small buffers, | |||
which may or may not be the first hop. | which may or may not be the first hop. | |||
skipping to change at page 32, line 27 ¶ | skipping to change at page 33, line 29 ¶ | |||
Internet (here we selected the current de-facto standard rather than | Internet (here we selected the current de-facto standard rather than | |||
the actual standard). Current proposals, including default current | the actual standard). Current proposals, including default current | |||
operation, are degenerate cases of the algorithm below for given | operation, are degenerate cases of the algorithm below for given | |||
parameters - notably MulDec = 1.0 and AddIncr = 0 MSS, thus | parameters - notably MulDec = 1.0 and AddIncr = 0 MSS, thus | |||
disabling the automatic part of the algorithm. | disabling the automatic part of the algorithm. | |||
The proposed algorithm is as follows: | The proposed algorithm is as follows: | |||
1. On boot: | 1. On boot: | |||
IW = MaxIW; # assume this is in bytes, and an even number of MSS | IW = MaxIW; # assume this is in bytes, and indicates an integer | |||
multiple of 2 MSS (an even number to support ACK compression) | ||||
2. Upon starting a new connection: | 2. Upon starting a new connection: | |||
CWND = IW; | CWND = IW; | |||
conncount++; | conncount++; | |||
IWnotchecked = 1; # true | IWnotchecked = 1; # true | |||
3. During a connection's SYN-ACK processing, if SYN-ACK includes ECN | 3. During a connection's SYN-ACK processing, if SYN-ACK includes ECN | |||
(as similarly addressed in Sec 5 of ECN++ for TCP [Ba20]), treat | (as similarly addressed in Sec 5 of ECN++ for TCP [Ba20]), treat | |||
as if the IW is too large: | as if the IW is too large: | |||
skipping to change at page 33, line 38 ¶ | skipping to change at page 34, line 38 ¶ | |||
As presented, this algorithm can yield a false positive when the | As presented, this algorithm can yield a false positive when the | |||
sequence number wraps around, e.g., the code might increment | sequence number wraps around, e.g., the code might increment | |||
losscount in step 4 when no loss occurred or fail to increment | losscount in step 4 when no loss occurred or fail to increment | |||
losscount when a loss did occur. This can be avoided using either | losscount when a loss did occur. This can be avoided using either | |||
PAWS [RFC7323] context or internal extended sequence number | PAWS [RFC7323] context or internal extended sequence number | |||
representations (as in TCP-AO [RFC5925]). Alternately, false | representations (as in TCP-AO [RFC5925]). Alternately, false | |||
positives can be tolerated because they are expected to be | positives can be tolerated because they are expected to be | |||
infrequent and thus will not significantly impact the algorithm. | infrequent and thus will not significantly impact the algorithm. | |||
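The extended sequence number option mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the class and function names are hypothetical, the loss check of step 4 is not reproduced here, and the tracker assumes successive observations differ by less than 2^31 bytes so that the nearest 64-bit candidate is the right one.

```python
MOD = 1 << 32  # size of the 32-bit TCP sequence space


def naive_in_iw(isn: int, seq32: int, iw: int) -> bool:
    """32-bit-only check; becomes ambiguous once 2^32 bytes have been sent."""
    return (seq32 - isn) % MOD < iw


class ExtendedSeq:
    """Hypothetical tracker that unwraps 32-bit sequence numbers."""

    def __init__(self, isn: int):
        self.isn = isn
        self.highest = isn  # highest extended sequence number seen so far

    def extend(self, seq32: int) -> int:
        # Candidate in the same 2^32 "epoch" as the highest value seen.
        base = self.highest - (self.highest % MOD) + seq32
        # Pick the candidate closest to the highest value seen so far.
        best = min((base - MOD, base, base + MOD),
                   key=lambda c: abs(c - self.highest))
        self.highest = max(self.highest, best)
        return best

    def in_initial_window(self, seq32: int, iw: int) -> bool:
        # True only if the segment lies within the first iw bytes sent.
        return 0 <= self.extend(seq32) - self.isn < iw


isn = 0xFFFFFF00          # ISN chosen close to the wrap point
iw = 10 * 1460            # example IW value in bytes (an assumption)
tracker = ExtendedSeq(isn)
early = tracker.in_initial_window((isn + 1000) % MOD, iw)

# Advance the connection by 2^32 bytes in steps smaller than 2^31.
for k in range(1, 5):
    tracker.extend((isn + k * (1 << 30)) % MOD)
late = tracker.in_initial_window((isn + 1000) % MOD, iw)
print(early, late, naive_in_iw(isn, (isn + 1000) % MOD, iw))
```

The 32-bit check cannot distinguish the late segment from one in the initial window, whereas the extended representation can, at the cost of a small amount of additional per-connection state.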
A number of additional constraints need to be imposed if this | A number of additional constraints need to be imposed if this | |||
mechanism is implemented to ensure that it defaults values that | mechanism is implemented to ensure that it defaults to values that | |||
comply with current Internet standards, is conservative in how it | comply with current Internet standards, is conservative in how it | |||
extends those values, and returns to those values in the absence of | extends those values, and returns to those values in the absence of | |||
positive feedback (i.e., success). To that end, we recommend the | positive feedback (i.e., success). To that end, we recommend the | |||
following list of example constraints: | following list of example constraints: | |||
>> The automatic IW algorithm MUST initialize MaxIW to a value no | >> The automatic IW algorithm MUST initialize MaxIW to a value no | |||
larger than the currently recommended Internet default, in the | larger than the currently recommended Internet default, in the | |||
absence of other context information. | absence of other context information. | |||
Thus, if there are too few connections to make a decision or if | Thus, if there are too few connections to make a decision or if | |||
skipping to change at page 35, line 44 ¶ | skipping to change at page 36, line 44 ¶ | |||
False positives can occur during some kinds of segment reordering, | False positives can occur during some kinds of segment reordering, | |||
e.g., that might trigger spurious retransmissions even without a | e.g., that might trigger spurious retransmissions even without a | |||
true segment loss. These are not expected to be sufficiently common | true segment loss. These are not expected to be sufficiently common | |||
to dominate the algorithm and its conclusions. | to dominate the algorithm and its conclusions. | |||
This mechanism does require additional per-connection state, which | This mechanism does require additional per-connection state, which | |||
is currently common in some implementations, and is useful for other | is currently common in some implementations, and is useful for other | |||
reasons (e.g., the ISN is used in TCP-AO [RFC5925]). The mechanism | reasons (e.g., the ISN is used in TCP-AO [RFC5925]). The mechanism | |||
also benefits from persistent state kept across reboots, as would be | also benefits from persistent state kept across reboots, as would be | |||
other state sharing mechanisms (e.g., TCP Control Block Sharing | other state sharing mechanisms (e.g., TCP Control Block Sharing per | |||
[RFC2140]). The mechanism is inspired by RFC 2140's use of | the main body of this document). | |||
information across connections. | ||||
The receive window (rwnd) is not involved in this calculation. The | The receive window (rwnd) is not involved in this calculation. The | |||
size of rwnd is determined by receiver resources and provides space | size of rwnd is determined by receiver resources and provides space | |||
to accommodate segment reordering. It is not involved with | to accommodate segment reordering. It is not involved with | |||
congestion control, which is the focus of this document and its | congestion control, which is the focus of this document and its | |||
management of the IW. | management of the IW. | |||
C.5. Observations | C.5. Observations | |||
The IW may not converge to a single, global value. It also may not | The IW may not converge to a single, global value. It also may not | |||
End of changes. 56 change blocks. | ||||
118 lines changed or deleted | 154 lines changed or added | |||