draft-ietf-tcpm-proportional-rate-reduction-02.txt | draft-ietf-tcpm-proportional-rate-reduction-03.txt | |||
---|---|---|---|---|
TCP Maintenance Working Group M. Mathis | TCP Maintenance Working Group M. Mathis | |||
Internet-Draft N. Dukkipati | Internet-Draft N. Dukkipati | |||
Intended status: Experimental Y. Cheng | Intended status: Experimental Y. Cheng | |||
Expires: January 14, 2013 Google, Inc | Expires: April 25, 2013 Google, Inc | |||
July 13, 2012 | Oct 22, 2012 | |||
Proportional Rate Reduction for TCP | Proportional Rate Reduction for TCP | |||
draft-ietf-tcpm-proportional-rate-reduction-02.txt | draft-ietf-tcpm-proportional-rate-reduction-03.txt | |||
Abstract | Abstract | |||
This document describes an experimental algorithm, Proportional Rate | This document describes an experimental algorithm, Proportional Rate | |||
Reduction (PPR) to improve the accuracy of the amount of data sent by | Reduction (PPR) to improve the accuracy of the amount of data sent by | |||
TCP during loss recovery. Standard Congestion Control requires that | TCP during loss recovery. Standard Congestion Control requires that | |||
TCP and other protocols reduce their congestion window in response to | TCP and other protocols reduce their congestion window in response to | |||
losses. This window reduction naturally occurs in the same round | losses. This window reduction naturally occurs in the same round | |||
trip as the data retransmissions to repair the losses, and is | trip as the data retransmissions to repair the losses, and is | |||
implemented by choosing not to transmit any data in response to some | implemented by choosing not to transmit any data in response to some | |||
ACKs arriving from the receiver. Two widely deployed algorithms are | ACKs arriving from the receiver. Two widely deployed algorithms are | |||
used to implement this window reduction: Fast Recovery and Rate | used to implement this window reduction: Fast Recovery and Rate | |||
Halving. Both algorithms are needlessly fragile under a number of | Halving. Both algorithms are needlessly fragile under a number of | |||
conditions, particularly when there is a burst of losses such that | conditions, particularly when there is a burst of losses such that | |||
the number of ACKs returning to the sender is small. Proportional | the number of ACKs returning to the sender is small. Proportional | |||
Rate Reduction minimizes these excess window reductions such that at | Rate Reduction minimizes these excess window adjustments such that at | |||
the end of recovery the actual window size will be as close as | the end of recovery the actual window size will be as close as | |||
possible to ssthresh, the window size determined by the congestion | possible to ssthresh, the window size determined by the congestion | |||
control algorithm. It is patterned after Rate Halving, but using the | control algorithm. It is patterned after Rate Halving, but using the | |||
fraction that is appropriate for target window chosen by the | fraction that is appropriate for target window chosen by the | |||
congestion control algorithm. | congestion control algorithm. | |||
Status of this Memo | Status of this Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
skipping to change at page 1, line 48 | skipping to change at page 1, line 48 | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on January 14, 2013. | This Internet-Draft will expire on April 25, 2013. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2012 IETF Trust and the persons identified as the | Copyright (c) 2012 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
described in the Simplified BSD License. | described in the Simplified BSD License. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
2. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4 | 2. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
3. Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 5 | 3. Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
3.1. Examples . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 3.1. Examples . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
4. Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 9 | 4. Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 9 | |||
5. Measurements . . . . . . . . . . . . . . . . . . . . . . . . . 11 | 5. Measurements . . . . . . . . . . . . . . . . . . . . . . . . . 11 | |||
6. Conclusion and Recommendations . . . . . . . . . . . . . . . . 12 | 6. Conclusion and Recommendations . . . . . . . . . . . . . . . . 12 | |||
7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 13 | 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 13 | |||
8. Security Considerations . . . . . . . . . . . . . . . . . . . 13 | 8. Security Considerations . . . . . . . . . . . . . . . . . . . 13 | |||
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 13 | 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 13 | |||
10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 13 | 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 13 | |||
10.1. Normative References . . . . . . . . . . . . . . . . . . . 13 | 10.1. Normative References . . . . . . . . . . . . . . . . . . . 13 | |||
10.2. Informative References . . . . . . . . . . . . . . . . . . 14 | 10.2. Informative References . . . . . . . . . . . . . . . . . . 14 | |||
Appendix A. Packet Conservation Bound . . . . . . . . . . . . . . 14 | Appendix A. Strong Packet Conservation Bound . . . . . . . . . . 15 | |||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 15 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 16 | |||
1. Introduction | 1. Introduction | |||
This document describes an experimental algorithm, Proportional Rate | This document describes an experimental algorithm, Proportional Rate | |||
Reduction (PPR) to improve the accuracy of the amount of data sent by | Reduction (PPR) to improve the accuracy of the amount of data sent by | |||
TCP during loss recovery. | TCP during loss recovery. | |||
Standard Congestion Control [RFC5681] requires that TCP (and other | Standard Congestion Control [RFC5681] requires that TCP (and other | |||
protocols) reduce their congestion window in response to losses. | protocols) reduce their congestion window in response to losses. | |||
Fast Recovery, described in the same document, is the reference | Fast Recovery, described in the same document, is the reference | |||
algorithm for making this adjustment. Its stated goal is to recover | algorithm for making this adjustment. Its stated goal is to recover | |||
TCP's self clock by relying on returning ACKs during recovery to | TCP's self clock by relying on returning ACKs during recovery to | |||
clock more data into the network. Fast Recovery adjusts the window | clock more data into the network. Fast Recovery typically adjusts | |||
by waiting for one half RTT of ACKs to pass before sending any data. | the window by waiting for one half RTT of ACKs to pass before sending | |||
It is fragile because it can not compensate for the implicit window | any data. It is fragile because it can not compensate for the | |||
reduction caused by the losses themselves, and is exposed to | implicit window reduction caused by the losses themselves. | |||
timeouts. For example if half of the data or ACKs are lost, Fast | ||||
Recovery's expected behavior would be to wait for a half of a window | ||||
of (remaining) ACKs to pass. It would then not receive any of the | ||||
ACKs needed for recovery and suffer a timeout. | ||||
The rate-halving algorithm improves this situation by sending data on | RFC 3517 [RFC3517] makes Fast Recovery with SACK [RFC2018] more | |||
alternate ACKs during recovery, such that after one RTT the window | accurate by computing "pipe", a sender side estimate of the number of | |||
has been halved. Rate-halving is implemented in Linux after only | bytes still outstanding in the network. With RFC 3517, Fast Recovery | |||
being informally published [RHweb], including an uncompleted | is implemented by sending data as necessary on each ACK to prevent | |||
Internet-Draft [RHID]. Rate-halving also does not adequately | pipe from falling below ssthresh, the window size as determined by | |||
compensate for the implicit window reduction caused by the losses and | the congestion control algorithm. This protects Fast Recovery from | |||
assumes a 50% window reduction, which was completely standard at the | timeouts in many cases where there are heavy losses, although not if | |||
time it was written, but not appropriate for modern congestion | the entire second half of the window of data or ACKs are lost. | |||
control algorithms such as Cubic [CUBIC], which reduce the window by | However, a single ACK carrying a SACK option that reports of large | |||
less than 50%. As a consequence rate-halving often allows the window | quantity of missing data can cause a step discontinuity in the pipe | |||
to fall further than necessary, reducing performance and increasing | estimator, which can cause Fast Retransmit to send a burst of data. | |||
the risk of timeouts if there are additional losses. | ||||
The rate-halving algorithm sends data on alternate ACKs during | ||||
recovery, such that after one RTT the window has been halved. Rate- | ||||
halving is implemented in Linux after only being informally published | ||||
[RHweb], including an uncompleted Internet-Draft [RHID]. Rate- | ||||
halving also does not adequately compensate for the implicit window | ||||
reduction caused by the losses and assumes a net 50% window | ||||
reduction, which was completely standard at the time it was written, | ||||
but not appropriate for modern congestion control algorithms such as | ||||
Cubic [CUBIC], which reduce the window by less than 50%. As a | ||||
consequence rate-halving often allows the window to fall further than | ||||
necessary, reducing performance and increasing the risk of timeouts | ||||
if there are additional losses. | ||||
Proportional Rate Reduction (PPR) avoids these excess window | Proportional Rate Reduction (PPR) avoids these excess window | |||
reductions such that at the end of recovery the actual window size | adjustments such that at the end of recovery the actual window size | |||
will be as close as possible to ssthresh, the window size determined | will be as close as possible to ssthresh, the window size determined | |||
by the congestion control algorithm. It is patterned after Rate | by the congestion control algorithm. It is patterned after Rate | |||
Halving, but using the fraction that is appropriate for the target | Halving, but using the fraction that is appropriate for the target | |||
window chosen by the congestion control algorithm. During PRR one of | window chosen by the congestion control algorithm. During PRR one of | |||
two additional reduction bound algorithms limits the total window | two additional reduction bound algorithms limits the total window | |||
reduction due to all mechanisms, including application stalls and the | reduction due to all mechanisms, including transient application | |||
losses themselves. | stalls and the losses themselves. | |||
We describe two slightly different reduction bound algorithms: | We describe two slightly different reduction bound algorithms: | |||
conservative reduction bound (CRB), which is strictly packet | conservative reduction bound (CRB), which is strictly packet | |||
conserving; and a slow start reduction bound (SSRB), which is more | conserving; and a slow start reduction bound (SSRB), which is more | |||
aggressive than CRB by at most one segment per ACK. PRR-CRB meets | aggressive than CRB by at most one segment per ACK. PRR-CRB meets | |||
the strong conservative bound described in Appendix A, however in | the Strong Packet Conservation Bound described in Appendix A, however | |||
real networks it does not perform as well as the algorithms described | in real networks it does not perform as well as the algorithms | |||
in RFC 3517, which prove to be non-conservative in a significant | described in RFC 3517, which prove to be more aggressive in a | |||
number of cases. SSRB offers a compromise by allowing TCP to send | significant number of cases. SSRB offers a compromise by allowing | |||
one additional segment per ACK relative to CRB in some situations. | TCP to send one additional segment per ACK relative to CRB in some | |||
Although SSRB is less aggressive than RFC 3517 (transmitting fewer | situations. Although SSRB is less aggressive than RFC 3517 | |||
segments or taking more time to transmit them) it outperforms it, due | (transmitting fewer segments or taking more time to transmit them) it | |||
to the lower probability of additional losses during recovery. | outperforms it, due to the lower probability of additional losses | |||
during recovery. | ||||
PRR and both reduction bounds are based on common design principles, | The Strong Packet Conservation Bound on which PRR and both reduction | |||
derived from Van Jacobson's packet conservation principle: segments | bounds are based is patterned after Van Jacobson's packet | |||
delivered to the receiver are used as the clock to trigger sending | conservation principle: segments delivered to the receiver are used | |||
the same number of segments back into the network. As much as | as the clock to trigger sending the same number of segments back into | |||
possible Proportional Rate Reduction and the reduction bound rely on | the network. As much as possible Proportional Rate Reduction and the | |||
this self clock process, and are only slightly affected by the | reduction bound algorithms rely on this self clock process, and are | |||
accuracy of other estimators, such as pipe [RFC3517] and cwnd. This | only slightly affected by the accuracy of other estimators, such as | |||
is what gives the algorithms their precision in the presence of | pipe [RFC3517] and cwnd. This is what gives the algorithms their | |||
events that cause uncertainty in other estimators. | precision in the presence of events that cause uncertainty in other | |||
estimators. | ||||
The original definition of the packet conservation principle | ||||
[Jacobson88] treated packets that are presumed to be lost (e.g. | ||||
marked as candidates for retransmission) as having left the network. | ||||
This idea is reflected in the pipe estimator defined in RFC 3517 and | ||||
used here, but it is distinct from Strong Packet Conservation Bound | ||||
described in Appendix A, which is defined solely on the basis of data | ||||
arriving at the receiver. | ||||
We evaluated these and other algorithms in a large scale measurement | We evaluated these and other algorithms in a large scale measurement | |||
study, summarized below. The most important results from that study | study presented in an companion paper [IMC11] and summarized in | |||
are presented in an companion paper [IMC11]. PRR+SSRB outperforms | Section 5. For authentic network traffic PRR+SSRB outperforms both | |||
both RFC 3517 and Linux Rate Halving under authentic network traffic, | RFC 3517 and Linux Rate Halving even though it is less aggressive | |||
even though it is less aggressive than RFC 3517. | than RFC 3517. | |||
The algorithms are described as modifications to RFC 5681 [RFC5681], | The algorithms are described as modifications to RFC 5681 [RFC5681], | |||
TCP Congestion Control, using concepts drawn from the pipe algorithm | TCP Congestion Control, using concepts drawn from the pipe algorithm | |||
[RFC3517]. They are most accurate and more easily implemented with | [RFC3517]. They are most accurate and more easily implemented with | |||
SACK [RFC2018], but do not require SACK. | SACK [RFC2018], but do not require SACK. The analysis and | |||
measurement study presented here predates [RFC6675], which updates | ||||
RFC 3517. | ||||
2. Definitions | 2. Definitions | |||
The following terms, parameters and state variables are used as they | The following terms, parameters and state variables are used as they | |||
are defined in earlier documents: | are defined in earlier documents: | |||
RFC 793: snd.una | ||||
RFC 3517: covered (as in "covered sequence numbers") | RFC 3517: covered (as in "covered sequence numbers") | |||
RFC 5681: duplicate ACK, FlightSize, Sender Maximum Segment Size | RFC 5681: duplicate ACK, FlightSize, Sender Maximum Segment Size | |||
(SMSS) | (SMSS) | |||
Voluntary window reductions: choosing not to send data in response to | Voluntary window reductions: choosing not to send data in response to | |||
some ACKs, for the purpose of reducing the sending window size and | some ACKs, for the purpose of reducing the sending window size and | |||
data rate. | data rate. | |||
We define some additional variables: | We define some additional variables: | |||
skipping to change at page 6, line 34 | skipping to change at page 6, line 43 | |||
prr_out += (data sent) // strictly less than or equal to sndcnt | prr_out += (data sent) // strictly less than or equal to sndcnt | |||
3.1. Examples | 3.1. Examples | |||
We illustrate these algorithms by showing their different behaviors | We illustrate these algorithms by showing their different behaviors | |||
for two scenarios: TCP experiencing either a single loss or a burst | for two scenarios: TCP experiencing either a single loss or a burst | |||
of 15 consecutive losses. In all cases we assume bulk data (no | of 15 consecutive losses. In all cases we assume bulk data (no | |||
application pauses), standard AIMD congestion control and cwnd = | application pauses), standard AIMD congestion control and cwnd = | |||
FlightSize = pipe = 20 segments, so ssthresh will be set to 10 at the | FlightSize = pipe = 20 segments, so ssthresh will be set to 10 at the | |||
beginning of recovery. We also assume standard Fast Retransmit and | beginning of recovery. We also assume standard Fast Retransmit and | |||
Limited Transmit, so TCP will send two new segments followed by one | Limited Transmit [RFC3042], so TCP will send two new segments | |||
retransmit in response to the first 3 duplicate ACKs following the | followed by one retransmit in response to the first 3 duplicate ACKs | |||
losses. | following the losses. | |||
Each of the diagrams below shows the per ACK response to the first | Each of the diagrams below shows the per ACK response to the first | |||
round trip for the various recovery algorithms when the zeroth | round trip for the various recovery algorithms when the zeroth | |||
segment is lost. The top line indicates the transmitted segment | segment is lost. The top line indicates the transmitted segment | |||
number triggering the ACKs, with an X for the lost segment. "cwnd" | number triggering the ACKs, with an X for the lost segment. "cwnd" | |||
and "pipe" indicate the values of these algorithms after processing | and "pipe" indicate the values of these algorithms after processing | |||
each returning ACK. "Sent" indicates how much 'N'ew or | each returning ACK. "Sent" indicates how much 'N'ew or | |||
'R'etransmitted data would be sent. Note that the algorithms for | 'R'etransmitted data would be sent. Note that the algorithms for | |||
deciding which data to send are out of scope of this document. | deciding which data to send are out of scope of this document. | |||
skipping to change at page 8, line 29 | skipping to change at page 8, line 29 | |||
pipe: 19 19 4 4 4 | pipe: 19 19 4 4 4 | |||
sent: N N R R R | sent: N N R R R | |||
RB: b b b | RB: b b b | |||
PRR-SSRB | PRR-SSRB | |||
ack# X X X X X X X X X X X X X X X 15 16 17 18 19 | ack# X X X X X X X X X X X X X X X 15 16 17 18 19 | |||
pipe: 19 19 4 5 6 | pipe: 19 19 4 5 6 | |||
sent: N N 2R 2R 2R | sent: N N 2R 2R 2R | |||
RB: bd d d | RB: bd d d | |||
In this specific situation, RFC 3517 is very non-conservative, | In this specific situation, RFC 3517 is more aggressive, because once | |||
because once fast retransmit is triggered (on the ACK for segment 17) | fast retransmit is triggered (on the ACK for segment 17) TCP | |||
TCP immediately retransmits sufficient data to bring pipe up to cwnd. | immediately retransmits sufficient data to bring pipe up to cwnd. | |||
Our measurement data (see Section 5) indicates that RFC 3517 | Our measurement data (see Section 5) indicates that RFC 3517 | |||
significantly outperforms Rate Halving, PRR-CRB and some other | significantly outperforms Rate Halving, PRR-CRB and some other | |||
similarly conservative algorithms that we tested, suggesting that it | similarly conservative algorithms that we tested, showing that it is | |||
is significantly common for the actual losses to exceed the window | significantly common for the actual losses to exceed the window | |||
reduction determined by the congestion control algorithm. | reduction determined by the congestion control algorithm. | |||
The Linux implementation of Rate Halving includes an early version of | The Linux implementation of Rate Halving includes an early version of | |||
the conservative reduction bound [RHweb]. In this situation the five | the conservative reduction bound [RHweb]. In this situation the five | |||
ACKs trigger exactly one transmission each (2 new data, 3 old data), | ACKs trigger exactly one transmission each (2 new data, 3 old data), | |||
and cwnd is set to 5. At a window size of 5, it takes three round | and cwnd is set to 5. At a window size of 5, it takes three round | |||
trips to retransmit all 15 lost segments. Rate Halving does not | trips to retransmit all 15 lost segments. Rate Halving does not | |||
raise the window at all during recovery, so when recovery finally | raise the window at all during recovery, so when recovery finally | |||
completes, TCP will slowstart cwnd from 5 up to 10. In this example, | completes, TCP will slowstart cwnd from 5 up to 10. In this example, | |||
TCP operates at half of the window chosen by the congestion control | TCP operates at half of the window chosen by the congestion control | |||
skipping to change at page 10, line 33 | skipping to change at page 10, line 33 | |||
specifically they cause prr_delivered-prr_out to be significantly | specifically they cause prr_delivered-prr_out to be significantly | |||
positive. If the application catches up while TCP is still in | positive. If the application catches up while TCP is still in | |||
recovery, TCP will send a partial window burst to catch up to exactly | recovery, TCP will send a partial window burst to catch up to exactly | |||
where it would have been, had the application never stalled. | where it would have been, had the application never stalled. | |||
Although this burst might be viewed as being hard on the network, | Although this burst might be viewed as being hard on the network, | |||
this is exactly what happens every time there is a partial RTT | this is exactly what happens every time there is a partial RTT | |||
application stall while not in recovery. We have made the partial | application stall while not in recovery. We have made the partial | |||
RTT stall behavior uniform in all states. Changing this behavior is | RTT stall behavior uniform in all states. Changing this behavior is | |||
out of scope for this document. | out of scope for this document. | |||
Proportional Rate Reduction with Reduction Bound is significantly | Proportional Rate Reduction with Reduction Bound is less sensitive to | |||
less sensitive to errors of the pipe estimator. While in recovery, | errors in the pipe estimator. While in recovery, pipe is | |||
pipe is intrinsically an estimator, using incomplete information to | intrinsically an estimator, using incomplete information to estimate | |||
guess if un-SACKed segments are actually lost or out-of-order in the | if un-SACKed segments are actually lost or merely out-of-order in the | |||
network. Under some conditions pipe can have significant errors, for | network. Under some conditions pipe can have significant errors, for | |||
example when a burst of reordered data is presumed to be lost and is | example pipe is underestimated when when a burst of reordered data is | |||
retransmitted, but then the original data arrives before the | prematurely assumed to be lost and marked for retransmission. If the | |||
retransmission. If the transmissions are regulated directly by pipe | transmissions are regulated directly by pipe as they are with RFC | |||
as they are in RFC 3517, then errors and discontinuities in the value | 3517, such as step discontinuity in the pipe estimator causes a burst | |||
of the pipe estimator can cause significant errors in the amount of | of data, which can not be retracted once the pipe estimator is | |||
data sent. With Proportional Rate Reduction with Reduction Bound, | corrected a few ACKs later. For PRR, pipe merely determines which | |||
pipe merely determines how sndcnt is computed from DeliveredData. | algorithm, Proportional Rate Reduction or the reduction bound, is | |||
Since short term errors in pipe are smoothed out across multiple ACKs | used to compute sndcnt from DeliveredData. While pipe is | |||
and both Proportional Rate Reduction and the reduction bound converge | underestimated the algorithms are different by at most one segment | |||
to the same final window, errors in the pipe estimator have less | per ACK. Once pipe is updated they converge to the same final window | |||
impact on the final outcome. | at the end of recovery. | |||
Under all conditions and sequences of events during recovery, PRR-CRB | Under all conditions and sequences of events during recovery, PRR-CRB | |||
strictly bounds the data transmitted to be equal to or less than the | strictly bounds the data transmitted to be equal to or less than the | |||
amount of data delivered to the receiver. We claim that this packet | amount of data delivered to the receiver. We claim that this Strong | |||
conservation bound is the most aggressive algorithm that does not | Packet Conservation Bound is the most aggressive algorithm that does | |||
lead to additional forced losses in some environments. It has the | not lead to additional forced losses in some environments. It has | |||
property that if there is a standing queue at a bottleneck with no | the property that if there is a standing queue at a bottleneck with | |||
cross traffic, the queue will maintain exactly constant length for | no cross traffic, the queue will maintain exactly constant length for | |||
the duration of the recovery, except for +1/-1 fluctuation due to | the duration of the recovery, except for +1/-1 fluctuation due to | |||
differences in packet arrival and exit times. See Appendix A for a | differences in packet arrival and exit times. See Appendix A for a | |||
detailed discussion of this property. | detailed discussion of this property. | |||
Although the packet Packet Conserving Bound in very appealing for a | Although the Strong Packet Conserving Bound in very appealing for a | |||
number of reasons, our measurements summarized in Section 5 | number of reasons, our measurements summarized in Section 5 | |||
demonstrate that it is less aggressive and does not perform as well | demonstrate that it is less aggressive and does not perform as well | |||
as RFC3517, which permits large bursts of data when there are bursts | as RFC3517, which permits large bursts of data when there are bursts | |||
of losses. PRR-SSRB is a compromise that permits TCP to send one | of losses. PRR-SSRB is a compromise that permits TCP to send one | |||
extra segment per ACK as compared to the packet conserving bound. | extra segment per ACK as compared to the packet conserving bound. | |||
From the perspective of the packet conserving bound, PRR-SSRB does | From the perspective of a strict packet conserving bound, PRR-SSRB | |||
indeed open the window during recovery, however it is significantly | does indeed open the window during recovery, however it is | |||
less aggressive than RFC3517 in the presence of burst losses. | significantly less aggressive than RFC3517 in the presence of burst | |||
losses. | ||||
5. Measurements | 5. Measurements | |||
In a companion IMC11 paper [IMC11] we describe some measurements | In a companion IMC11 paper [IMC11] we describe some measurements | |||
comparing the various strategies for reducing the window during | comparing the various strategies for reducing the window during | |||
recovery. The results are summarized here. | recovery. The experiments were performed on servers carrying Google | |||
production traffic and are briefly summarized here. | ||||
The various window reduction algorithms and extensive instrumentation | The various window reduction algorithms and extensive instrumentation | |||
were all implemented in Linux 2.6. We used the uniform set of | were all implemented in Linux 2.6. We used the uniform set of | |||
algorithms present in the base Linux implementation, including CUBIC | algorithms present in the base Linux implementation, including CUBIC | |||
[CUBIC], limited transmit [RFC3042], threshold transmit from [FACK] | [CUBIC], limited transmit [RFC3042], threshold transmit from [FACK] | |||
and lost retransmission detection algorithms. We confirmed that the | (a similar algorithm appears in [RFC6675]) and lost retransmission | |||
behaviors of Rate Halving (the Linux default), RFC 3517 and PRR were | detection algorithms. We confirmed that the behaviors of Rate | |||
authentic to their respective specifications and that performance and | Halving (the Linux default), RFC 3517 and PRR were authentic to their | |||
features were comparable to the kernels in production use. The | respective specifications and that performance and features were | |||
different window reduction algorithms were all present in the same | comparable to the kernels in production use. All of the different | |||
kernel and could be selected with a sysctl, such that we had an | window reduction algorithms were all present in a common kernel and | |||
absolutely uniform baseline for comparing them. | could be selected with a sysctl, such that we had an absolutely | |||
uniform baseline for comparing them. | ||||
Our experiments included an additional algorithm, PRR with an | Our experiments included an additional algorithm, PRR with an | |||
unlimited bound (PRR-UB), which sends ssthresh-pipe bursts when pipe | unlimited bound (PRR-UB), which sends ssthresh-pipe bursts when pipe | |||
falls below ssthresh. This behavior parallels RFC 3517. | falls below ssthresh. This behavior parallels RFC 3517. | |||
An important detail of this configuration is that CUBIC only reduces | An important detail of this configuration is that CUBIC only reduces | |||
the window by 30%, as opposed to the 50% reduction used by | the window by 30%, as opposed to the 50% reduction used by | |||
traditional congestion control algorithms. This, in conjunction with | traditional congestion control algorithms. This accentuates the | |||
using only standard algorithms to trigger Fast Retransmit, | tendency for RFC 3517 and PRR-UB to send a burst at the point when | |||
accentuates the tendency for RFC 3517 and PRR-UB to send a burst at | Fast Retransmit gets triggered because pipe is likely to already be | |||
the point when Fast Retransmit gets triggered if pipe is already | below ssthresh. Precisely this condition was observed for 32% of the | |||
below ssthresh. | recovery events: pipe fell below ssthresh before Fast Retransmit is | |||
All experiments were performed on servers carrying production traffic | ||||
for multiple Google services. | ||||
In this configuration it is observed that for 32% of the recovery | ||||
events, pipe falls below ssthresh before Fast Retransmit is | ||||
triggered, thus the various PRR algorithms start in the reduction | triggered, thus the various PRR algorithms start in the reduction | |||
bound phase, and RFC 3517 send bursts of segments with the fast | bound phase, and RFC 3517 sends bursts of segments with the fast | |||
retransmit. | retransmit. | |||
In the companion paper we observe that PRR-SSRB spends the least time | In the companion paper we observe that PRR-SSRB spends the least time | |||
in recovery of all the algorithms tested, largely because it | in recovery of all the algorithms tested, largely because it | |||
experiences fewer timeouts once it is already in recovery. | experiences fewer timeouts once it is already in recovery. | |||
RFC 3517 experiences 29% more detected lost retransmissions and 2.6% | RFC 3517 experiences 29% more detected lost retransmissions and 2.6% | |||
more timeouts (presumably due to undetected lost retransmissions) | more timeouts (presumably due to undetected lost retransmissions) | |||
than PRR-SSRB. These results are representative of PRR-UB and other | than PRR-SSRB. These results are representative of PRR-UB and other | |||
algorithms that send bursts when pipe falls below ssthresh. | algorithms that send bursts when pipe falls below ssthresh. | |||
Rate Halving experiences 5% more timeouts and significantly smaller | Rate Halving experiences 5% more timeouts and significantly smaller | |||
final cwnd values at the end of recovery. The smaller cwnd sometimes | final cwnd values at the end of recovery. The smaller cwnd sometimes | |||
causes the recovery itself to take extra round trips. These results | causes the recovery itself to take extra round trips. These results | |||
are representative of PRR-CRB and other algorithms that implement | are representative of PRR-CRB and other algorithms that implement | |||
strict packet conservation during recovery. | strict packet conservation during recovery. | |||
6. Conclusion and Recommendations | 6. Conclusion and Recommendations | |||
Although the packet conserving bound is very appealing for a number | Although the Strong Packet Conserving Bound used in PRR-CRB is very | |||
of reasons, our measurements demonstrate that it is less aggressive | appealing for a number of reasons, our measurements show that it is | |||
and does not perform as well as RFC3517, which permits significant | less aggressive and does not perform as well as RFC 3517, which | |||
bursts of data when there are large bursts of losses. PRR-SSRB is a | permits bursts of data when there are bursts of losses. RFC 3517 is | |||
compromise that permits TCP to send one extra segment per ACK as | conservative in the original sense of Van Jacobson's packet | |||
relative to the packet conserving bound. From the perspective of the | conservation principle, which included the assumption that presumed | |||
packet conserving bound, PRR-SSRB does indeed open the window during | lost segments have indeed left the network. PRR-CRB makes no such | |||
recovery, however it is significantly less aggressive than RFC3517 in | assumption, following instead a Strong Packet Conserving Bound, in | |||
the presence of burst losses. Even so, it often out performs | which only packets that have arrived at the receiver are considered | |||
RFC3517, because it avoids some of the self inflicted losses caused | to have left the network. PRR-SSRB is a compromise that permits TCP | |||
by bursts. | to send one extra segment per ACK relative to the Strong Packet | |||
Conserving Bound, to partially compensate for excess losses. | ||||
From the perspective of the Strong Packet Conserving Bound, PRR-SSRB | ||||
does indeed open the window during recovery, however it is | ||||
significantly less aggressive than RFC 3517 in the presence of burst | ||||
losses. Even so, it often outperforms RFC 3517, because it avoids | ||||
some of the self inflicted losses caused by bursts. | ||||
At this time we see no reason not to test and deploy PRR-SSRB on a | At this time we see no reason not to test and deploy PRR-SSRB on a | |||
large scale. Implementers worried about any potential impact of | large scale. Implementers worried about any potential impact of | |||
raising the window during recovery may want to optionally support | raising the window during recovery may want to optionally support | |||
PRR-CRB (which is actually simpler to implement) for comparison | PRR-CRB (which is actually simpler to implement) for comparison | |||
studies. | studies. | |||
One final comment about terminology: we expect that common usage will | One final comment about terminology: we expect that common usage will | |||
drop "slow start reduction bound" from the algorithm name. This | drop "slow start reduction bound" from the algorithm name. This | |||
document needed to be pedantic about having distinct names for | document needed to be pedantic about having distinct names for | |||
proportional rate reduction and every variant of the reduction bound. | proportional rate reduction and every variant of the reduction bound. | |||
However, once paired they become one. | However, we do not anticipate any future exploration of the | |||
alternative reduction bounds. | ||||
7. Acknowledgements | 7. Acknowledgements | |||
This draft is based in part on previous incomplete work by Matt | This draft is based in part on previous incomplete work by Matt | |||
Mathis, Jeff Semke and Jamshid Mahdavi [RHID] and influenced by | Mathis, Jeff Semke and Jamshid Mahdavi [RHID] and influenced by | |||
several discussion with John Heffner. | several discussion with John Heffner. | |||
Monia Ghobadi and Sivasankar Radhakrishnan helped analyze the | Monia Ghobadi and Sivasankar Radhakrishnan helped analyze the | |||
experiments. | experiments. | |||
Ilpo Jarvinen reviewed the code. | Ilpo Jarvinen reviewed the code. | |||
Mark Allman improved the document through his insightful review. | ||||
8. Security Considerations | 8. Security Considerations | |||
Proportional Rate Reduction does not change the risk profile for TCP. | Proportional Rate Reduction does not change the risk profile for TCP. | |||
Implementers that change PRR from counting bytes to segments have to | Implementers that change PRR from counting bytes to segments have to | |||
be cautious about the effects of ACK splitting attacks [Savage99], | be cautious about the effects of ACK splitting attacks [Savage99], | |||
where the receiver acknowledges partial segments for the purpose of | where the receiver acknowledges partial segments for the purpose of | |||
confusing the sender's congestion accounting. | confusing the sender's congestion accounting. | |||
9. IANA Considerations | 9. IANA Considerations | |||
skipping to change at page 13, line 41 | skipping to change at page 14, line 5 | |||
Note to RFC Editor: this section may be removed on publication as an | Note to RFC Editor: this section may be removed on publication as an | |||
RFC. | RFC. | |||
10. References | 10. References | |||
10.1. Normative References | 10.1. Normative References | |||
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP | [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP | |||
Selective Acknowledgment Options", RFC 2018, October 1996. | Selective Acknowledgment Options", RFC 2018, October 1996. | |||
[RFC3517] Blanton, E., Allman, M., Fall, K., and L. Wang, "A | ||||
Conservative Selective Acknowledgment (SACK)-based Loss | ||||
Recovery Algorithm for TCP", RFC 3517, April 2003. | ||||
[RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion | [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion | |||
Control", RFC 5681, September 2009. | Control", RFC 5681, September 2009. | |||
[RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., | ||||
and Y. Nishida, "A Conservative Loss Recovery Algorithm | ||||
Based on Selective Acknowledgment (SACK) for TCP", | ||||
RFC 6675, August 2012. | ||||
10.2. Informative References | 10.2. Informative References | |||
[RFC3042] Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing | [RFC3042] Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing | |||
TCP's Loss Recovery Using Limited Transmit", RFC 3042, | TCP's Loss Recovery Using Limited Transmit", RFC 3042, | |||
January 2001. | January 2001. | |||
[RFC3517] Blanton, E., Allman, M., Fall, K., and L. Wang, "A | ||||
Conservative Selective Acknowledgment (SACK)-based Loss | ||||
Recovery Algorithm for TCP", RFC 3517, April 2003. | ||||
[IMC11] Dukkipati, N., Mathis, M., and Y. Cheng, "Proportional | [IMC11] Dukkipati, N., Mathis, M., and Y. Cheng, "Proportional | |||
Rate Reduction for TCP", ACM Internet Measurement | Rate Reduction for TCP", ACM Internet Measurement | |||
Conference IMC11, December 2011. | Conference IMC11, December 2011. | |||
[FACK] Mathis, M. and J. Mahdavi, "Forward Acknowledgment: | [FACK] Mathis, M. and J. Mahdavi, "Forward Acknowledgment: | |||
Refining TCP Congestion Control", ACM SIGCOMM SIGCOMM96, | Refining TCP Congestion Control", ACM SIGCOMM SIGCOMM96, | |||
August 1996. | August 1996. | |||
[RHID] Mathis, M., Semke, J., Mahdavi, J., and K. Lahey, "The | [RHID] Mathis, M., Semke, J., Mahdavi, J., and K. Lahey, "The | |||
Rate-Halving Algorithm for TCP Congestion Control", | Rate-Halving Algorithm for TCP Congestion Control", | |||
draft-mathis-tcp-ratehalving (work in progress), | draft-mathis-tcp-ratehalving (work in progress), | |||
June 1999. | June 1999. | |||
[RHweb] Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding | [RHweb] Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding | |||
Parameters", Web publication , December 1997. | Parameters", Web publication , December 1997. | |||
[CUBIC] Rhee, I. and L. Xu, "CUBIC: A new TCP-friendly high-speed | [CUBIC] Rhee, I. and L. Xu, "CUBIC: A new TCP-friendly high-speed | |||
TCP variant", PFLDnet 2005, Feb 2005. | TCP variant", PFLDnet 2005, Feb 2005. | |||
[Jacobson88] | ||||
Jacobson, V., "Congestion Avoidance and Control", SIGCOMM | ||||
Comput. Commun. Rev. 18(4), Aug 1988. | ||||
[Savage99] | [Savage99] | |||
Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, | Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, | |||
"TCP congestion control with a misbehaving receiver", | "TCP congestion control with a misbehaving receiver", | |||
SIGCOMM Comput. Commun. Rev. 29(5), October 1999. | SIGCOMM Comput. Commun. Rev. 29(5), October 1999. | |||
Appendix A. Packet Conservation Bound | Appendix A. Strong Packet Conservation Bound | |||
PRR-CRB meets a conservative, philosophically pure and aesthetically | PRR-CRB is based on a conservative, philosophically pure and | |||
appealing notion of correct, described here. However, in real | aesthetically appealing Strong Packet Conservation Bound, described | |||
networks it does not perform as well as the algorithms described in | here. Although inspired by Van Jacobson's packet conservation | |||
RFC 3517, which proves to be non-conservative in a significant number | principle [Jacobson88], it differs in how it treats segments that are | |||
of cases. | missing and presumed lost. Under all conditions and sequences of | |||
events during recovery, PRR-CRB strictly bounds the data transmitted | ||||
to be equal to or less than the amount of data delivered to the | ||||
receiver. Note that the effects of presumed losses are included in | ||||
the pipe calculation, but do not affect the outcome of PRR-CRB, once | ||||
pipe has fallen below ssthresh. | ||||
Under all conditions and sequences of events during recovery, PRR-CRB | We claim that this Strong Packet Conservation Bound is the most | |||
strictly bounds the data transmitted to be equal to or less than the | aggressive algorithm that does not lead to additional forced losses | |||
amount of data delivered to the receiver. We claim that this packet | in some environments. It has the property that if there is a | |||
conservation bound is the most aggressive algorithm that does not | standing queue at a bottleneck that is carrying no other traffic, the | |||
lead to additional forced losses in some environments. It has the | queue will maintain exactly constant length for the entire duration | |||
property that if there is a standing queue at a bottleneck that is | of the recovery, except for +1/-1 fluctuation due to differences in | |||
carrying no other traffic, the queue will maintain exactly constant | packet arrival and exit times. Any less aggressive algorithm will | |||
length for the entire duration of the recovery, except for +1/-1 | result in a declining queue at the bottleneck. Any more aggressive | |||
fluctuation due to differences in packet arrival and exit times. Any | algorithm will result in an increasing queue or additional losses if | |||
less aggressive algorithm will result in a declining queue at the | it is a full drop tail queue. | |||
bottleneck. Any more aggressive algorithm will result in an | ||||
increasing queue or additional losses if it is a full drop tail | ||||
queue. | ||||
We demonstrate this property with a little thought experiment: | We demonstrate this property with a little thought experiment: | |||
Imagine a network path that has insignificant delays in both | Imagine a network path that has insignificant delays in both | |||
directions, except for the processing time and queue at a single | directions, except for the processing time and queue at a single | |||
bottleneck in the forward path. By insignificant delay, we mean when | bottleneck in the forward path. By insignificant delay, we mean when | |||
a packet is "served" at the head of the bottleneck queue, the | a packet is "served" at the head of the bottleneck queue, the | |||
following events happen in much less than one bottleneck packet time: | following events happen in much less than one bottleneck packet time: | |||
the packet arrives at the receiver; the receiver sends an ACK; which | the packet arrives at the receiver; the receiver sends an ACK; which | |||
arrives at the sender; the sender processes the ACK and sends some | arrives at the sender; the sender processes the ACK and sends some | |||
data; the data is queued at the bottleneck. | data; the data is queued at the bottleneck. | |||
If sndcnt is set to DeliveredData and nothing else is inhibiting | If sndcnt is set to DeliveredData and nothing else is inhibiting | |||
sending data, then clearly the data arriving at the bottleneck queue | sending data, then clearly the data arriving at the bottleneck queue | |||
will exactly replace the data that was served at the head of the | will exactly replace the data that was served at the head of the | |||
queue, so the queue will have a constant length. If queue is drop | queue, so the queue will have a constant length. If queue is drop | |||
tail and full then the queue will stay exactly full. Losses or | tail and full then the queue will stay exactly full. Losses or | |||
reordering on the ACK path only cause wider fluctuations in the queue | reordering on the ACK path only cause wider fluctuations in the queue | |||
size, but do not raise the peak size, independent of whether the data | size, but do not raise its peak size, independent of whether the data | |||
is in order or out-of-order (including loss recovery from an earlier | is in order or out-of-order (including loss recovery from an earlier | |||
RTT). Any more aggressive algorithm which sends additional data will | RTT). Any more aggressive algorithm which sends additional data will | |||
overflow the drop tail queue and cause loss. Any less aggressive | overflow the drop tail queue and cause loss. Any less aggressive | |||
algorithm will under fill the queue. Therefore setting sndcnt to | algorithm will under fill the queue. Therefore setting sndcnt to | |||
DeliveredData is the most aggressive algorithm that does not cause | DeliveredData is the most aggressive algorithm that does not cause | |||
forced losses in this simple network. Relaxing the assumptions (e.g. | forced losses in this simple network. Relaxing the assumptions (e.g. | |||
making delays more authentic and adding more flows, delayed ACKs, | making delays more authentic and adding more flows, delayed ACKs, | |||
etc) may increases the fine grained fluctuations in queue size but | etc) are likely to increases the fine grained fluctuations in queue | |||
does not change its basic behavior. | size but do not change its basic behavior. | |||
Note that the congestion control algorithm implements a broader | Note that the congestion control algorithm implements a broader | |||
notion of optimal that includes appropriately sharing the network. | notion of optimal that includes appropriately sharing the network. | |||
Typical congestion control algorithms are likely to reduce the data | Typical congestion control algorithms are likely to reduce the data | |||
sent relative to the packet conserving bound implemented by PRR | sent relative to the packet conserving bound implemented by PRR | |||
bringing TCP's actual window down to ssthresh. | bringing TCP's actual window down to ssthresh. | |||
Authors' Addresses | Authors' Addresses | |||
Matt Mathis | Matt Mathis | |||
End of changes. 40 change blocks. | ||||
143 lines changed or deleted | 184 lines changed or added | |||
This html diff was produced by rfcdiff 1.41. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ |