--- 1/draft-ietf-tcpm-2140bis-05.txt 2020-11-25 15:13:12.889474283 -0800 +++ 2/draft-ietf-tcpm-2140bis-06.txt 2020-11-25 15:13:12.949475806 -0800 @@ -1,19 +1,19 @@ TCPM WG J. Touch Internet Draft Independent Intended status: Informational M. Welzl Obsoletes: 2140 S. Islam -Expires: October 2020 University of Oslo - April 29, 2020 +Expires: May 2021 University of Oslo + November 25, 2020 TCP Control Block Interdependence - draft-ietf-tcpm-2140bis-05.txt + draft-ietf-tcpm-2140bis-06.txt Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow @@ -34,21 +34,21 @@ months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html - This Internet-Draft will expire on October 29, 2020. + This Internet-Draft will expire on May 25, 2021. Copyright Notice Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents @@ -99,31 +99,31 @@ 9. Implications..................................................15 9.1. Layering....................................................15 9.2. Other possibilities.........................................16 10. 
Implementation Observations..................................16 11. Updates to RFC 2140..........................................17 12. Security Considerations......................................18 13. IANA Considerations..........................................18 14. References...................................................19 14.1. Normative References....................................19 14.2. Informative References..................................19 - 15. Acknowledgments..............................................21 + 15. Acknowledgments..............................................22 16. Change log...................................................22 Appendix A : TCB Sharing History.................................25 Appendix B : TCP Option Sharing and Caching......................26 Appendix C : Automating the Initial Window in TCP over Long Timescales.......................................................28 C.1. Introduction.............................................28 C.2. Design Considerations....................................28 C.3. Proposed IW Algorithm....................................29 - C.4. Discussion...............................................32 - C.5. Observations.............................................33 + C.4. Discussion...............................................33 + C.5. Observations.............................................34 1. Introduction TCP is a connection-oriented reliable transport protocol layered over IP [RFC793]. Each TCP connection maintains state, usually in a data structure called the TCP Control Block (TCB). The TCB contains information about the connection state, its associated local process, and feedback parameters about the connection's transmission properties. As originally specified and usually implemented, most TCB information is maintained on a per-connection basis. Some @@ -360,21 +360,21 @@ old_TFO_failure old_TFO_failure ESTAB old_TFO_failure 6.3. 
Discussion There is no particular benefit to caching MMS_S and MMS_R as these are reported by the local IP stack. Caching sendMSS and PMTU is trivial; reported values are cached, and the most recent values are used. The cache is updated when the MSS option is received in a SYN or after PMTUD (i.e., when an ICMPv4 Fragmentation Needed [RFC1191] or ICMPv6 Packet Too Big message is received [RFC8201] or the - equivalent is inferred, e.g. as from PLPMTUD [RFC4821]), + equivalent is inferred, e.g., as from PLPMTUD [RFC4821]), respectively, so the cache always has the most recent values from any connection. For sendMSS, the cache is consulted only at connection establishment and not otherwise updated, which means that MSS options do not affect current connections. The default sendMSS is never saved; only reported MSS values update the cache, so an explicit override is required to reduce the sendMSS. RTT values are updated by formulae that merge the old and new values. Dynamic RTT estimation requires a sequence of RTT measurements. As a result, the cached RTT (and its variance) is an @@ -399,21 +399,21 @@ Most cached TCB values are updated when a connection closes. The exceptions are MMS_R and MMS_S, which are reported by IP [RFC1122], PMTU which is updated after Path MTU Discovery [RFC1191][RFC4821][RFC8201], and sendMSS, which is updated if the MSS option is received in the TCP SYN header. Sharing sendMSS information affects only data in the SYN of the next connection, because sendMSS information is typically included in most TCP SYN segments. Caching PMTU can accelerate the efficiency of - PMTUD, but can also result in black-holing until corrected if in + PMTUD but can also result in black-holing until corrected if in error. Caching MMS_R and MMS_S may be of little direct value as they are reported by the local IP stack anyway. The way in which other TCP option state can be shared depends on the details of that option. 
E.g., TFO state includes the TCP Fast Open Cookie [RFC7413] or, in case TFO fails, a negative TCP Fast Open response. RFC 7413 states, "The client MUST cache negative responses from the server in order to avoid potential connection failures. Negative responses include the server not acknowledging the data in the SYN, ICMP error messages, and (most importantly) no response @@ -541,24 +541,23 @@ of the current windows is increased for any new connection. This can have detrimental consequences where several connections share a highly congested link. There are several ways to initialize the congestion window in a new TCB among an ensemble of current connections to a host. Current TCP implementations initialize it to four segments as standard [RFC3390] and 10 segments experimentally [RFC6928]. These approaches assume that new connections should behave as conservatively as possible. The algorithm described in [Ba12] adjusts the initial cwnd depending - on the cwnd values of ongoing connections. There have also been - suggestions to use the kind of sharing mechanisms described in this - document over long timescales to adapt TCP's initial window - automatically, as described further in Appendix A [To12]. + on the cwnd values of ongoing connections. It is also possible to + use sharing mechanisms over long timescales to adapt TCP's initial + window automatically, as described further in Appendix A. 8. Compatibility Issues Here, we discuss various types of problems that may arise with TCB information sharing. For the congestion and current window information, the initial values computed by TCB interdependence may not be consistent with the long-term aggregate behavior of a set of concurrent connections between the same endpoints. Under conventional TCP congestion @@ -757,21 +756,22 @@ and send-MSS separately, adds path MTU and ssthresh, and addresses the impact on TCP option state. New sections have been added to address compatibility issues and implementation observations. 
The relation of this work to T/TCP has been moved to Appendix A on history, partly to reflect the deprecation of that protocol. Appendix C has been added to discuss the potential to use temporal sharing over long timescales to adapt TCP's initial window - automatically, largely imported from [To12]. + automatically, avoiding the need to periodically revise a single + global constant value. Finally, this document updates and significantly expands the referenced literature. 12. Security Considerations These presented implementation methods do not have additional ramifications for explicit attacks. They may be susceptible to denial-of-service attacks if not otherwise secured. @@ -781,21 +781,21 @@ Implications section). Some shared TCB parameters are used only to create new TCBs, others are shared among the TCBs of ongoing connections. New connections can join the ongoing set, e.g., to optimize send window size among a set of connections to the same host. Attacks on parameters used only for initialization affect only the transient performance of a TCP connection. For short connections, the performance ramification can approach that of a denial-of- service attack. E.g., if an application changes its TCB to have a - false and small window size, subsequent connections would experience + false and small window size, subsequent connections will experience performance degradation until their window grew appropriately. TCB sharing reuses and mixes information from past and current connections. Although reusing information could create a potential for fingerprinting to identify hosts, the mixing reduces that potential. There has been no evidence of fingerprinting based on this technique and it is currently considered safe in that regard. 13. IANA Considerations @@ -841,20 +841,24 @@ 14.2. Informative References [Al10] Allman, M., "Initial Congestion Window Specification", (work in progress), draft-allman-tcpm-bump-initcwnd-00, Nov. 2010. 
[Ba12] Barik, R., Welzl, M., Ferlin, S., Alay, O., "LISA: A Linked Slow-Start Algorithm for MPTCP", IEEE ICC, Kuala Lumpur, Malaysia, May 23-27 2016. + [Ba20] Bagnulo, M., Briscoe, B., "ECN++: Adding Explicit + Congestion Notification (ECN) to TCP Control Packets", + draft-ietf-tcpm-generalized-ecn-06, Oct. 2020. + [Be94] Berners-Lee, T., et al., "The World-Wide Web," Communications of the ACM, V37, Aug. 1994, pp. 76-82. [Br94] Braden, B., "T/TCP -- Transaction TCP: Source Changes for Sun OS 4.1.3,", Release 1.0, USC/ISI, September 14, 1994. [Br02] Brownlee, N. and K. Claffy, "Understanding Internet Traffic Streams: Dragonflies and Tortoises", IEEE Communications Magazine p110-117, 2002. @@ -931,39 +935,44 @@ B., "Mechanisms for Optimizing Link Aggregation Group (LAG) and Equal-Cost Multipath (ECMP) Component Link Utilization in Networks", RFC 7424, Jan. 2015 [RFC7540] Belshe, M., Peon, R., Thomson, M., "Hypertext Transfer Protocol Version 2 (HTTP/2)", RFC 7540, May 2015. [RFC7661] Fairhurst, G., Sathiaseelan, A., Secchi, R., "Updating TCP to Support Rate-Limited Traffic", RFC 7661, Oct. 2015. - [To12] Touch, J., "Automating the Initial Window in TCP," draft- - touch-tcpm-automatic-iw-03 (expired), July 2012. - 15. Acknowledgments The authors would like to thank Praveen Balasubramanian for information regarding TCB sharing in Windows, and Yuchung Cheng, Lars Eggert, Ilpo Jarvinen and Michael Scharf for comments on earlier versions of the draft. Earlier revisions of this work received funding from a collaborative research project between the University of Oslo and Huawei Technologies Co., Ltd. and were partly supported by USC/ISI's Postel Center. This document was prepared using 2-Word-v2.0.template.dot. 16. Change log This section should be removed upon final publication as an RFC. 
+ ietf-06: + + - Address WGLC comments + + ietf-05: + + - Correction of typographic errors, expansion of terminology + ietf-04: - Fix internal cross-reference errors that appeared in ietf-02 - Updated tables to re-center; clarified text ietf-03: - Correction of typographic errors, minor rewording in appendices ietf-02: @@ -1006,21 +1015,21 @@ - Stated that our OS implementation overview table only covers temporal sharing. - Correctly reflected sharing of old_RTT in Linux in the implementation overview table. - Marked entries that are considered safe to share with an asterisk (suggestion was to split the table) - Discussed correct host identification: NATs may make IP - addresses the wrong input, could e.g. use HTTP cookie. + addresses the wrong input, could e.g., use HTTP cookie. - Included MMS_S and MMS_R from RFC1122; fixed the use of MSS and MTU - Added information about option sharing, listed options in Appendix B Authors' Addresses Joe Touch @@ -1157,64 +1167,68 @@ MSS TFO negotiation failure (to avoid negotiation retries) Safe and necessary to keep state: TFO cookie (if TFO succeeded in the past) Appendix C: Automating the Initial Window in TCP over Long Timescales - Note: this section is imported from [To12], updated only to refer to - itself as an appendix. - C.1. Introduction + Temporal sharing, as described earlier in this document, builds on + the assumption that multiple consecutive connections between the + same host pair are somewhat likely to be exposed to similar + environment characteristics. The stored information can therefore + become invalid over time, and suitable precautions should be taken + (this is discussed further in section 8.1). However, there are also + cases where it can make sense to use much longer-term measurements + of TCP connections to gradually influence TCP parameters. This + appendix describes an example of such a case. 
+ TCP's congestion control algorithm uses an initial window value - (IW), both as a starting point for new connections and after one RTO - or more [RFC5681][RFC7661]. This value has evolved over time, - originally one maximum segment size (MSS), and increased to the - lesser of four MSS or 4,380 bytes [RFC3390][RFC5681]. For typical - Internet connections with an maximum transmission units (MTUs) of - 1500 bytes, this permits three segments of 1,460 bytes each. + (IW), both as a starting point for new connections and as an upper + limit for restarting after an idle period [RFC5681][RFC7661]. This + value has evolved over time, originally one maximum segment size + (MSS), and increased to the lesser of four MSS or 4,380 bytes + [RFC3390][RFC5681]. For a typical Internet connection with a maximum + transmission unit (MTU) of 1500 bytes, this permits three segments + of 1,460 bytes each. The IW value was originally implied in the original TCP congestion - control description, and documented as a standard in 1997 - [RFC2001][Ja88]. The value was last updated in 1998 experimentally, - and moved to the standards track in 2002 [RFC2414][RFC3390]. There - have been recent proposals to update the IW based on further - increases in host and router capabilities and network capacity, some - focusing on specific values (e.g., IW=10), and others prescribing a - schedule for increases over time (e.g., IW=6 for 2011, increasing by - 1-2 MSS per year). + control description and documented as a standard in 1997 + [RFC2001][Ja88]. The value was updated in 1998 experimentally and + moved to the standards track in 2002 [RFC2414][RFC3390]. In 2013, it + was experimentally increased to 10 [RFC6928]. This appendix discusses how TCP can objectively measure when an IW is too large, and that such feedback should be used over long timescales to adjust the IW automatically. The result should be safer to deploy and might avoid the need to repeatedly revisit IW - size over time. + over time. 
Note that this mechanism attempts to make the IW more adaptive over time. It can increase the IW beyond that which is currently recommended for widescale deployment, and so its use should be carefully monitored. C.2. Design Considerations TCP's IW value has existed statically for over two decades, so any solution to adjusting the IW dynamically should have similarly stable, non-invasive effects on the performance and complexity of TCP. In order to be fair, the IW should be similar for most machines on the public Internet. Finally, a desirable goal is to develop a self-correcting algorithm, so that IW values that cause network - problems can be avoided. To that end, we propose the following list - of design goals: + problems can be avoided. To that end, we propose the following + design goals: o Impart little to no impact to TCP in the absence of loss, i.e., it should not increase the complexity of default packet processing in the normal case. o Adapt to network feedback over long timescales, avoiding values that persistently cause network problems. o Decrease the IW in the presence of sustained loss of IW segments, as determined over a number of different connections. @@ -1238,111 +1252,121 @@ the initial burst of packets, it is clearly inappropriate and could be inducing unnecessary loss in other competing connections. This might happen for sites behind very slow boxes with small buffers, which may or may not be the first hop. C.3. Proposed IW Algorithm Below is a simple description of the proposed IW algorithm. It relies on the following parameters: - o MinIW = 3 MSS or 4,380 bytes (as per RFC3390] + o MinIW = 3 MSS or 4,380 bytes (as per [RFC3390]) - o MaxIW = 10 + o MaxIW = 10 MSS (as per [RFC6928]) o MulDecr = 0.5 o AddIncr = 2 MSS - o Threshold = 0.05 + We assume that the minimum IW (MinIW) should be as currently - specified [RFC3390]. 
The maximum IW can be set to a fixed value - [RFC6928], or set based on a schedule if trusted time references are - available [Al10]; here we prefer a fixed value. We also propose to - use an AIMD algorithm, with increase and decreases as noted. + specified [RFC3390]. The maximum IW can be set to a fixed value (as + recommended in [RFC6928]) or set based on a schedule if trusted time + references are available [Al10]; here we prefer a fixed value. We + also propose to use an AIMD algorithm, with increase and decreases + as noted. Although these parameters are somewhat arbitrary, their initial values are not important except that the algorithm is AIMD and the MaxIW should not exceed that recommended for other systems on the Internet. Current proposals, including default current operation, are degenerate cases of the algorithm below for given parameters - notably MulDecr = 1.0 and AddIncr = 0 MSS, thus disabling the automatic part of the algorithm. The proposed algorithm is as follows: 1. On boot: IW = MaxIW; # assume this is in bytes, and an even number of MSS - 2. Upon starting a new connection + 2. Upon starting a new connection: CWND = IW; conncount++; IWnotchecked = 1; # true - 3. During a connection's SYN-ACK processing, if SYN-ACK includes - ECN, treat as if the IW is too large + 3. During a connection's SYN-ACK processing, if SYN-ACK includes ECN + (as similarly addressed in Sec 5 of ECN++ for TCP [Ba20]), treat + as if the IW is too large: if (IWnotchecked && (synackecn == 1)) { losscount++; IWnotchecked = 0; # never check again } 4. During a connection, if retransmission occurs, check the seqno of the outgoing packet (in bytes) to see if the resent segment fixes an IW loss: if (Retransmitting && IWnotchecked && ((seqno - ISN) < IW)) { losscount++; IWnotchecked = 0; # never do this entire "if" again } else { IWnotchecked = 0; # you're beyond the IW so stop checking } - 5. Once every 1000 conections, as a separate process (i.e., not as + 5. 
Once every 1000 connections, as a separate process (i.e., not as part of processing a given connection): if (conncount > 1000) { if (losscount/conncount > Threshold) { # the number of connections with errors is too high IW = IW * MulDecr; } else { IW = IW + AddIncr; } } - We recognize that this algorithm can yield a false positive when the - sequence number wraps around. This can be avoided using either PAWS - [RFC7323] context or 64-bit internal sequence numbers (as in TCP-AO - [RFC5925]). Alternately, false positives can be allowed since they - are expected to be infrequent and thus will not affect the overall - statistics of the algorithm. + As presented, this algorithm can yield a false positive when the + sequence number wraps around, e.g., the code might increment + losscount in step 4 when no loss occurred or fail to increment + losscount when a loss did occur. This can be avoided using either + PAWS [RFC7323] context or internal extended sequence number + representations (as in TCP-AO [RFC5925]). Alternately, false + positives can be tolerated because they are expected to be + infrequent and thus will not significantly impact the algorithm. - The following additional constraints are imposed: + A number of additional constraints need to be imposed if this + mechanism is implemented to ensure that it defaults to values that + comply with current Internet standards, is conservative in how it + extends those values, and returns to those values in the absence of + positive feedback (i.e., success). To that end, we recommend the + following list of example constraints: - >> The automatic IW algorithm MUST initialize to MaxIW, in the + >> The automatic IW algorithm MUST initialize MaxIW to a value no + larger than the currently recommended Internet default, in the absence of other context information. 
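The counting and periodic adjustment in steps 1, 2, and 5 of the proposed IW algorithm can be sketched as follows. This is a minimal illustration, not the draft's implementation. The names IWState, new_connection, record_iw_loss, and evaluate are hypothetical, the MSS of 1460 bytes is an assumed example value, and three behaviors are assumptions that the pseudocode does not state: clamping the result to [MinIW, MaxIW], rounding down to a whole number of MSS, and resetting the counters after each evaluation.

```python
from dataclasses import dataclass

MSS = 1460            # assumed segment size in bytes (example value only)
MIN_IW = 3 * MSS      # MinIW: 3 MSS, i.e., 4,380 bytes
MAX_IW = 10 * MSS     # MaxIW: 10 MSS
MUL_DECR = 0.5        # MulDecr
ADD_INCR = 2 * MSS    # AddIncr
THRESHOLD = 0.05      # Threshold: tolerated fraction of connections with IW loss
INTERVAL = 1000       # connections per evaluation interval


@dataclass
class IWState:
    """Host-wide counters shared by all connections (hypothetical name)."""
    iw: int = MAX_IW      # step 1: on boot, IW = MaxIW (in bytes)
    conncount: int = 0
    losscount: int = 0

    def new_connection(self) -> int:
        """Step 2: count the connection and return its initial cwnd."""
        self.conncount += 1
        return self.iw

    def record_iw_loss(self) -> None:
        """Steps 3 and 4: the caller invokes this at most once per
        connection, when ECN on the SYN-ACK or a retransmission within
        the IW indicates that the IW was too large."""
        self.losscount += 1

    def evaluate(self) -> None:
        """Step 5: AIMD adjustment, run outside per-connection processing."""
        if self.conncount < INTERVAL:
            return  # too few connections to make a decision
        if self.losscount / self.conncount > THRESHOLD:
            new_iw = int(self.iw * MUL_DECR)   # multiplicative decrease
        else:
            new_iw = self.iw + ADD_INCR        # additive increase
        new_iw -= new_iw % MSS                 # assumption: whole number of MSS
        self.iw = max(MIN_IW, min(MAX_IW, new_iw))  # assumption: clamp
        self.conncount = 0                     # assumption: start a new interval
        self.losscount = 0
```

Keeping the counters in one host-wide object reflects the requirement that the evaluation runs as a separate process rather than as part of any single connection. A real stack would also need the per-connection IWnotchecked flag, which this sketch leaves to the caller.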
- If there are too few connections to make a decision or if there is - otherwise insufficient information to increase the IW, then the - MaxIW defaults to the current recommended value. + Thus, if there are too few connections to make a decision or if + there is otherwise insufficient information to increase the IW, then + the MaxIW defaults to the current recommended value. - >> An implementation may allow the MaxIW to grow beyond the + >> An implementation MAY allow the MaxIW to grow beyond the currently recommended Internet default, but not more than 2 segments per calendar year. - If an endpoint has a persistent history of successfully transmitting - IW segments without loss, then it is allowed to probe the Internet - to determine if larger IW values have similar success. This probing - is limited and requires a trusted time source, otherwise the MaxIW - remains constant. + Thus, if an endpoint has a persistent history of successfully + transmitting IW segments without loss, then it is allowed to probe + the Internet to determine if larger IW values have similar success. + This probing is limited and requires a trusted time source, + otherwise the MaxIW remains constant. >> An implementation MUST adjust the IW based on loss statistics at least once every 1000 connections. An endpoint needs to be sufficiently reactive to IW loss. >> An implementation MUST decrease the IW by at least one MSS when indicated during an evaluation interval. An endpoint that detects loss needs to decrease its IW by at least @@ -1403,30 +1427,30 @@ in addition to losses during the first IW of a connection. In this case, the implementation MUST count each restart as a "connection" for the purposes of connection counts and periodic rechecking of the IW value. False positives can occur during some kinds of segment reordering, e.g., that might trigger spurious retransmissions even without a true segment loss. 
These are not expected to be sufficiently common to dominate the algorithm and its conclusions. - This mechanism does require additional per-connection state which is - currently common in some implementations, and is useful for other + This mechanism does require additional per-connection state, which + is currently common in some implementations, and is useful for other reasons (e.g., the ISN is used in TCP-AO [RFC5925]). The mechanism also benefits from persistent state kept across reboots, as would be other state sharing mechanisms (e.g., TCP Control Block Sharing [RFC2140]). The mechanism is inspired by RFC 2140's use of information across connections. The receive window (RWIN) is not involved in this calculation. The - size of RWIN is determined by receiver resources, and provides space + size of RWIN is determined by receiver resources and provides space to accommodate segment reordering. It is not involved with congestion control, which is the focus of this document and its management of the IW. C.5. Observations The IW may not converge to a single, global value. It also may not converge at all, but rather may oscillate by a few MSS as it repeatedly probes the Internet for larger IWs and fails. Both properties are consistent with TCP behavior during each individual