--- 1/draft-ietf-tcpm-rfc3782-bis-03.txt 2011-12-05 08:13:48.546671343 +0100 +++ 2/draft-ietf-tcpm-rfc3782-bis-04.txt 2011-12-05 08:13:48.574670806 +0100 @@ -1,23 +1,22 @@ - TCP Maintenance and Minor T. Henderson Extensions Working Group Boeing Internet-Draft S. Floyd Obsoletes: 3782 (if approved) ICSI Intended status: Standards Track A. Gurtov -Expires: April 22, 2012 HIIT +Expires: June 5, 2012 HIIT Y. Nishida WIDE Project - October 22, 2011 + December 5, 2011 The NewReno Modification to TCP's Fast Recovery Algorithm - draft-ietf-tcpm-rfc3782-bis-03.txt + draft-ietf-tcpm-rfc3782-bis-04.txt Abstract RFC 5681 documents the following four intertwined TCP congestion control algorithms: slow start, congestion avoidance, fast retransmit, and fast recovery. RFC 5681 explicitly allows certain modifications of these algorithms, including modifications that use the TCP Selective Acknowledgement (SACK) option (RFC 2883), and modifications that respond to "partial acknowledgments" (ACKs which cover new data, but not all the data outstanding when loss was @@ -34,21 +33,21 @@ Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on April 22, 2012. + This Internet-Draft will expire on June 5, 2012. Copyright Notice Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved. 
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents @@ -191,40 +190,40 @@ 1) Initialization of TCP protocol control block: When the TCP protocol control block is initialized, Recover is set to the initial send sequence number. 2) Three duplicate ACKs: When the third duplicate ACK is received, the TCP sender first checks the value of Recover to see if the Cumulative Acknowledgment field covers more than Recover. If so, the value of Recover is incremented to the value of the highest sequence number transmitted by the TCP so far. The TCP then enters Fast - Retransmit (step 2 of Section 3.2 of [RFC5681]). If not, the - TCP does not enter fast retransmit and does not reset ssthresh. + Retransmit (step 2 of Section 3.2 of [RFC5681]). If not, the TCP + does not enter fast retransmit and does not reset ssthresh. 3) Response to newly acknowledged data: Step 6 of [RFC5681] specifies the response to the next ACK that acknowledges previously unacknowledged data. When an ACK arrives that acknowledges new data, this ACK could be the acknowledgment elicited by the retransmission from step 2, or elicited by a later retransmission. There are two cases. Full acknowledgments: If this ACK acknowledges all of the data up to and including Recover, then the ACK acknowledges all the intermediate segments sent between the original transmission of the lost segment and the receipt of the third duplicate ACK. Set cwnd to either (1) min (ssthresh, max(FlightSize, SMSS) + SMSS) or - (2) ssthresh, where ssthresh is the value set when Fast - Retransmit was entered, and where FlightSize in (1) is the amount - of data presently outstanding. This is termed "deflating" the - window. 
If the second option is selected, the implementation + (2) ssthresh, where ssthresh is the value set when Fast Retransmit + was entered, and where FlightSize in (1) is the amount of data + presently outstanding. This is termed "deflating" the window. + If the second option is selected, the implementation is encouraged to take measures to avoid a possible burst of data, in case the amount of data outstanding in the network is much less than the new congestion window allows. A simple mechanism is to limit the number of data packets that can be sent in response to a single acknowledgment. Exit the Fast Recovery procedure. Partial acknowledgments: If this ACK does *not* acknowledge all of the data up to and including Recover, then this is a partial ACK. In this case, @@ -269,23 +268,23 @@ pattern of packet losses, the partial acknowledgment might acknowledge nearly a window of data. In this case, if the congestion window was not deflated, the data sender might be able to send nearly a window of data back-to-back. This document does not specify the sender's response to duplicate ACKs when the Fast Retransmit/Fast Recovery algorithm is not invoked. This is addressed in other documents, such as those describing the Limited Transmit procedure [RFC3042]. This document also does not address issues of adjusting the duplicate - acknowledgment threshold, but assumes the threshold specified in the - IETF standards; the current standard is [RFC5681], which specifies - a threshold of three duplicate acknowledgments. + acknowledgment threshold, but assumes the threshold specified in + the IETF standards; the current standard is [RFC5681], which + specifies a threshold of three duplicate acknowledgments. As a final note, we would observe that in the absence of the SACK option, the data sender is working from limited information. 
When the issue of recovery from multiple dropped packets from a single window of data is of particular importance, the best alternative would be to use the SACK option. 4. Handling Duplicate Acknowledgments After A Timeout After each retransmit timeout, the highest sequence number @@ -295,21 +294,21 @@ receiver, then the TCP data sender will receive three duplicate acknowledgments that do not cover more than "recover". In this case, the duplicate acknowledgments are not an indication of a new instance of congestion. They are simply an indication that the sender has unnecessarily retransmitted at least three packets. However, when a retransmitted packet is itself dropped, the sender can also receive three duplicate acknowledgments that do not cover more than "recover". In this case, the sender would have been better off if it had initiated Fast Retransmit. For a TCP that - implements the algorithm specified in Section 3 of this document, the + implements the algorithm specified in Section 3.2 of this document, the sender does not infer a packet drop from duplicate acknowledgments in this scenario. As always, the retransmit timer is the backup mechanism for inferring packet loss in this case. There are several heuristics, based on timestamps or on the amount of advancement of the cumulative acknowledgment field, that allow the sender to distinguish, in some cases, between three duplicate acknowledgments following a retransmitted packet that was dropped, and three duplicate acknowledgments from the unnecessary retransmission of three packets [Gur03, GF04]. The TCP sender MAY @@ -330,56 +329,56 @@ distinguish between a retransmitted packet that was dropped and three duplicate acknowledgments from the unnecessary retransmission of three packets. 4.1. 
ACK Heuristic If the ACK-based heuristic is used, then following the advancement of the cumulative acknowledgment field, the sender stores the value of the previous cumulative acknowledgment as prev_highest_ack, and stores the latest cumulative ACK as highest_ack. In addition, the - following step is performed if Step 1 in Section 3 fails, before - proceeding to Step 1B. + following check is performed if, in Step 2 of Section 3.2, the + Cumulative Acknowledgment field does not cover more than "recover". 1*) If the Cumulative Acknowledgment field didn't cover more than "recover", check to see if the congestion window is greater than SMSS bytes and the difference between highest_ack and prev_highest_ack is at most 4*SMSS bytes. If true, duplicate - ACKs indicate a lost segment (proceed to Step 1A in Section - 3). Otherwise, duplicate ACKs likely result from unnecessary - retransmissions (proceed to Step 1B in Section 3). + ACKs indicate a lost segment (enter Fast Retransmit). Otherwise, + duplicate ACKs likely result from unnecessary retransmissions + (do not enter Fast Retransmit). The congestion window check serves to protect against fast retransmit immediately after a retransmit timeout. If several ACKs are lost, the sender can see a jump in the cumulative ACK of more than three segments, and the heuristic can fail. [RFC5681] recommends that a receiver should send duplicate ACKs for every out-of-order data packet, such as a data packet received during Fast Recovery. The ACK heuristic is more likely to fail if the receiver does not follow this advice, because then a smaller number of ACK losses are needed to produce a sufficient jump in the cumulative ACK. 4.2. Timestamp Heuristic If this heuristic is used, the sender stores the timestamp of the - last acknowledged segment. In addition, the second paragraph of step - 1 in Section 3 is replaced as follows: + last acknowledged segment. 
In addition, the last sentence of step + 2 in Section 3.2 is replaced as follows: 1**) If the Cumulative Acknowledgment field didn't cover more than "recover", check to see if the echoed timestamp in the last non-duplicate acknowledgment equals the stored timestamp. If true, duplicate ACKs indicate a lost - segment (proceed to Step 1A in Section 3). Otherwise, duplicate - ACKs likely result from unnecessary retransmissions (proceed - to Step 1B in Section 3). + segment (enter Fast Retransmit). Otherwise, duplicate + ACKs likely result from unnecessary retransmissions (do not enter + Fast Retransmit). The timestamp heuristic works correctly, both when the receiver echoes timestamps as specified by [RFC1323], and by its revision attempts. However, if the receiver arbitrarily echoes timestamps, the heuristic can fail. The heuristic can also fail if a timeout was spurious and returning ACKs are not from retransmitted segments. This can be prevented by detection algorithms such as [RFC3522]. 5. Implementation Issues for the Data Receiver @@ -494,76 +493,74 @@ feedback on this document or on its precursor, RFC 2582. Jeffrey Hsu provided clarifications on the handling of the recover variable that were applied to RFC 3782 as errata, and now are in Section 8 of this document. Yoshifumi Nishida contributed a modification to the fast recovery algorithm to account for the case in which flightsize is 0 when the TCP sender leaves fast recovery, and the TCP receiver uses delayed acknowledgments. Alexander Zimmermann provided several suggestions to improve the clarity of the document. 11. References + 11.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC5681] Allman, M., Paxson, V. and E. Blanton, "TCP Congestion Control", RFC 5681, September 2009. - [RFC6298] Paxson, V., Allman, M., Chu, J., and Sargent, M., - "Computing TCP's Retransmission Timer", RFC 6298, - June 2011. 
+ [RFC6298] Paxson, V., M. Allman, J. Chu, and M. Sargent, "Computing + TCP's Retransmission Timer", RFC 6298, June 2011. 11.2. Informative References [C98] Cardwell, N., "delayed ACKs for retransmitted packets: ouch!". November 1998, Email to the tcpimpl mailing list, - Message-ID "Pine.LNX.4.02A.9811021421340.26785-100000@ - sake.cs.washington.edu", + Message-ID + "Pine.LNX.4.02A.9811021421340.26785-100000@sake.cs.washington.edu", archived at "http://tcp-impl.lerc.nasa.gov/tcp-impl". [F98] Floyd, S., Revisions to RFC 2001, "Presentation to the TCPIMPL Working Group", August 1998. URLs "ftp://ftp.ee.lbl.gov/talks/sf-tcpimpl-aug98.ps" and "ftp://ftp.ee.lbl.gov/talks/sf-tcpimpl-aug98.pdf". [F03] Floyd, S., "Moving NewReno from Experimental to Proposed - Standard? Presentation to the TSVWG Working Group", March - 2003. URLs - "http://www.icir.org/floyd/talks/newreno-Mar03.ps" and + Standard? Presentation to the TSVWG Working Group", March 2003. + URLs "http://www.icir.org/floyd/talks/newreno-Mar03.ps" and "http://www.icir.org/floyd/talks/newreno-Mar03.pdf". [FF96] Fall, K. and S. Floyd, "Simulation-based Comparisons of - Tahoe, Reno and SACK TCP", Computer Communication Review, - July 1996. URL "ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z". + Tahoe, Reno and SACK TCP", Computer Communication Review, July 1996. + URL "ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z". [F94] Floyd, S., "TCP and Successive Fast Retransmits", Technical report, October 1994. URL "ftp://ftp.ee.lbl.gov/papers/fastretrans.ps". [GF04] Gurtov, A. and S. Floyd, "Resolving Acknowledgment Ambiguity in non-SACK TCP", Next Generation Teletraffic and Wired/Wireless Advanced Networking (NEW2AN'04), February 2004. URL "http://www.cs.helsinki.fi/u/gurtov/papers/ heuristics.html". [Gur03] Gurtov, A., "[Tsvwg] resolving the problem of unnecessary - fast retransmits in go-back-N", email to the tsvwg mailing - list, message ID <3F25B467.9020609@cs.helsinki.fi>, July - 28, 2003. 
URL "http://www1.ietf.org/mail-archive/ - working-groups/ tsvwg/current/msg04334.html". + fast retransmits in go-back-N", email to the tsvwg mailing list, + message ID <3F25B467.9020609@cs.helsinki.fi>, July 28, 2003. URL + "http://www1.ietf.org/mail-archive/working-groups/tsvwg/current/ + msg04334.html". [Hen98] Henderson, T., Re: NewReno and the 2001 Revision. September 1998. Email to the tcpimpl mailing list, Message ID - "Pine.BSI.3.95.980923224136.26134A-100000@raptor. - CS.Berkeley.EDU", archived at - "http://tcp-impl.lerc.nasa.gov/tcp-impl". + "Pine.BSI.3.95.980923224136.26134A-100000@raptor.CS.Berkeley.EDU", + archived at "http://tcp-impl.lerc.nasa.gov/tcp-impl". [Hoe95] Hoe, J., "Startup Dynamics of TCP's Congestion Control and Avoidance Schemes", Master's Thesis, MIT, 1995. [Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion Control Scheme for TCP", ACM SIGCOMM, August 1996. URL "http://www.acm.org/sigcomm/sigcomm96/program.html". [LM97] Lin, D. and R. Morris, "Dynamics of Random Early Detection", SIGCOMM 97, September 1997. URL @@ -575,33 +572,30 @@ [PF01] Padhye, J. and S. Floyd, "Identifying the TCP Behavior of Web Servers", June 2001, SIGCOMM 2001. [RFC1323] Jacobson, V., Braden, R. and D. Borman, "TCP Extensions for High Performance", RFC 1323, May 1992. [RFC2582] Floyd, S. and T. Henderson, "The NewReno Modification to TCP's Fast Recovery Algorithm", RFC 2582, April 1999. [RFC2883] Floyd, S., J. Mahdavi, M. Mathis, and M. Podolsky, "The - Selective Acknowledgment (SACK) Option for TCP, RFC 2883, - July 2000. + Selective Acknowledgment (SACK) Option for TCP, RFC 2883, July 2000. [RFC3042] Allman, M., Balakrishnan, H. and S. Floyd, "Enhancing TCP's - Loss Recovery Using Limited Transmit", RFC 3042, January - 2001. + Loss Recovery Using Limited Transmit", RFC 3042, January 2001. [RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm for TCP", RFC 3522, April 2003. [RFC3782] Floyd, S., T. Henderson, and A. 
Gurtov, "The NewReno - Modification to TCP's Fast Recovery Algorithm", RFC 3782, - April 2004. + Modification to TCP's Fast Recovery Algorithm", RFC 3782, April 2004. Appendix A. Additional Information Previous versions of this RFC ([RFC2582], [RFC3782]) contained additional informative material on the following subjects, and may be consulted by readers who may want more information about possible variants to the algorithm and who may want references to specific [NS] simulations that provide NewReno test cases. Section 4 of [RFC3782] discusses some alternative behaviors for @@ -619,37 +613,39 @@ Section 10 of [RFC3782] provides a comparison of Reno and NewReno TCP. Section 11 of [RFC3782] listed changes relative to [RFC3782]. Appendix B. Changes Relative to RFC 3782 In [RFC3782], the cwnd after Full ACK reception will be set to (1) min (ssthresh, FlightSize + SMSS) or (2) ssthresh. However, - there is a risk in the first logic which results in performance - degradation. With the first logic, if FlightSize is zero, the + there is a risk in the first option which results in performance + degradation. With the first option, if FlightSize is zero, the result will be 1 SMSS. This means TCP can transmit only 1 segment at this moment, which can cause delay in ACK transmission at receiver due to delayed ACK algorithm. The FlightSize on Full ACK reception can be zero in some situations. A typical example is where sending window size during fast recovery is small. In this case, the retransmitted packet and new data packets can be transmitted within a short interval. If all these packets successfully arrive, the receiver may generate a Full ACK that acknowledges all outstanding data. Even if window size is not small, loss of ACK packets or receive buffer shortage during fast recovery - can also increase the possibility to fall into this situation. + can also increase the possibility of falling into this situation. 
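   As a rough illustration of the FlightSize = 0 case described above,
   the following sketch compares option (1) as quoted for [RFC3782],
   min (ssthresh, FlightSize + SMSS), with the form given in Section
   3.2, step 3, min (ssthresh, max(FlightSize, SMSS) + SMSS). The
   function names, the 1460-byte SMSS, and the ssthresh value are
   assumptions chosen for the example and do not come from the draft.

```python
# Illustrative sketch only. Function names and the numeric values below
# are hypothetical; they are not defined by the draft.


def cwnd_rfc3782_option1(ssthresh: int, flight_size: int, smss: int) -> int:
    """Option (1) as quoted for RFC 3782: min(ssthresh, FlightSize + SMSS)."""
    return min(ssthresh, flight_size + smss)


def cwnd_bis_option1(ssthresh: int, flight_size: int, smss: int) -> int:
    """Option (1) in Section 3.2, step 3:
    min(ssthresh, max(FlightSize, SMSS) + SMSS)."""
    return min(ssthresh, max(flight_size, smss) + smss)


if __name__ == "__main__":
    SMSS = 1460          # assumed sender maximum segment size, in bytes
    SSTHRESH = 8 * SMSS  # assumed value set when Fast Retransmit was entered

    # Compare the two formulas for an empty, a small, and a larger FlightSize.
    for flight_size in (0, SMSS, 4 * SMSS):
        old = cwnd_rfc3782_option1(SSTHRESH, flight_size, SMSS)
        new = cwnd_bis_option1(SSTHRESH, flight_size, SMSS)
        print(f"FlightSize={flight_size:5d}  "
              f"RFC 3782 cwnd={old:5d}  revised cwnd={new:5d}")
```

   With these assumed values, the two formulas differ only when
   FlightSize is below one SMSS: the older form yields a cwnd of 1 SMSS,
   while the revised form yields 2 SMSS. For FlightSize of one SMSS or
   more, both give the same result. The 2*SMSS floor relies on ssthresh
   itself being at least 2*SMSS, which holds when ssthresh is set as in
   equation (4) of [RFC5681].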
- The proposed fix in this document ensures that sender TCP transmits - at least two segments on Full ACK reception. + The proposed fix in this document, which sets cwnd to at least 2*SMSS + if the implementation uses option 1 in the Full ACK case (Section 3.2, + step 3, option 1), ensures that the sender TCP transmits at least two + segments on Full ACK reception. In addition, errata for RFC3782 (editorial clarification to Section 8 of RFC2582, which is now Section 6 of this document) has been applied. The specification text (Section 3.2 herein) was rewritten to more closely track Section 3.2 of [RFC5681]. Sections 4, 5, 9-11 of [RFC3782] were removed, and instead Appendix A of this document was added to back-reference this informative