draft-ietf-tcpimpl-cong-control-01.txt   draft-ietf-tcpimpl-cong-control-02.txt 
TCP Implementation Working Group M. Allman TCP Implementation Working Group M. Allman
INTERNET DRAFT NASA Lewis/Sterling Software INTERNET DRAFT NASA Lewis/Sterling Software
File: draft-ietf-tcpimpl-cong-control-01.txt V. Paxson File: draft-ietf-tcpimpl-cong-control-02.txt V. Paxson
LBNL LBNL
W. Stevens W. Stevens
Consultant Consultant
November, 1998 December, 1998
TCP Congestion Control TCP Congestion Control
Status of this Memo Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts. working documents as Internet-Drafts.
skipping to change at page 1, line 44 skipping to change at page 1, line 45
algorithms: slow start, congestion avoidance, fast retransmit, and algorithms: slow start, congestion avoidance, fast retransmit, and
fast recovery. In addition, the document specifies how TCP should fast recovery. In addition, the document specifies how TCP should
begin transmission after a relatively long idle period, as well as begin transmission after a relatively long idle period, as well as
discussing various acknowledgment generation methods. discussing various acknowledgment generation methods.
1 Introduction 1 Introduction
This document specifies four TCP [Pos81] congestion control This document specifies four TCP [Pos81] congestion control
algorithms: slow start, congestion avoidance, fast retransmit and algorithms: slow start, congestion avoidance, fast retransmit and
fast recovery. These algorithms were devised in [Jac88] and fast recovery. These algorithms were devised in [Jac88] and
[Jac90]. Their use with TCP is required by [Bra89]. [Jac90]. Their use with TCP is standardized in [Bra89].
This document is an update of [Ste97]. In addition to specifying This document is an update of [Ste97]. In addition to specifying
the congestion control algorithms, this document specifies what TCP the congestion control algorithms, this document specifies what TCP
connections should do after a relatively long idle period, as well connections should do after a relatively long idle period, as well
as specifying and clarifying some of the issues pertaining to TCP as specifying and clarifying some of the issues pertaining to TCP
ACK generation. ACK generation.
Note that [Ste94] provides examples of these algorithms in action Note that [Ste94] provides examples of these algorithms in action
and [WS95] provides an explanation of the source code for the BSD and [WS95] provides an explanation of the source code for the BSD
implementation of these algorithms. implementation of these algorithms.
This document is organized as follows. Section 2 provides various This document is organized as follows. Section 2 provides various
definitions which will be used throughout the paper. Section 3 definitions which will be used throughout the document. Section 3
provides a specification of the congestion control algorithms. provides a specification of the congestion control algorithms.
Section 4 outlines concerns related to the congestion control Section 4 outlines concerns related to the congestion control
algorithms and finally, section 5 outlines security considerations. algorithms and finally, section 5 outlines security considerations.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [Bra97]. document are to be interpreted as described in [Bra97].
2 Definitions 2 Definitions
This section provides the definition of several terms that will be This section provides the definition of several terms that will be
used throughout the remainder of this document. used throughout the remainder of this document.
SEGMENT: SEGMENT:
A segment is ANY TCP/IP data or acknowledgment packet (or both). A segment is ANY TCP/IP data or acknowledgment packet (or both).
MAXIMUM SEGMENT SIZE (MSS): SENDER MAXIMUM SEGMENT SIZE (SMSS):
The MSS is the largest segment size that can be used. The size The SMSS is the size of the largest segment that the sender can
does not include the TCP/IP headers and options. transmit. This value can be based on the maximum transmission
unit of the network, the path MTU discovery [MD90] algorithm, or
other factors. The size does not include the TCP/IP headers and
options.
RECEIVER MAXIMUM SEGMENT SIZE (RMSS):
The RMSS is the size of the largest segment the receiver is
willing to accept. This is the value specified in the MSS
option sent by the receiver during connection startup. Or, if
the MSS option is not used, 536 bytes [Bra89]. The size does
not include the TCP/IP headers and options.
FULL-SIZED SEGMENT: FULL-SIZED SEGMENT:
A segment that contains the maximum number of data bytes A segment that contains the maximum number of data bytes
permitted (i.e., a segment containing MSS bytes of data). permitted (i.e., a segment containing MSS bytes of data).
RECEIVER WINDOW (rwnd) RECEIVER WINDOW (rwnd)
The most recently advertised receiver window. The most recently advertised receiver window.
CONGESTION WINDOW (cwnd): CONGESTION WINDOW (cwnd):
A TCP state variable that limits the amount of data a TCP can A TCP state variable that limits the amount of data a TCP can
send. At any given time, a TCP MUST NOT send data with a send. At any given time, a TCP MUST NOT send data with a
sequence number higher than the sum of the highest acknowledged sequence number higher than the sum of the highest acknowledged
sequence number and the minimum of cwnd and rwnd. sequence number and the minimum of cwnd and rwnd.
INITIAL WINDOW (IW): INITIAL WINDOW (IW):
The initial window is the size of the sender's congestion window The initial window is the size of the sender's congestion window
when a connection is established. after the three-way handshake is completed.
LOSS WINDOW (LW): LOSS WINDOW (LW):
The loss window is the size of the congestion window after a TCP The loss window is the size of the congestion window after a TCP
sender detects loss using its retransmission timer. sender detects loss using its retransmission timer.
RESTART WINDOW (RW): RESTART WINDOW (RW):
The restart window is the size of the congestion window after a The restart window is the size of the congestion window after a
TCP restarts transmission after an idle period. TCP restarts transmission after an idle period (if the slow
start algorithm is used; see section 4.1 for more discussion).
FLIGHT SIZE: FLIGHT SIZE:
The amount of data the has been sent but not yet acknowledged. The amount of data that has been sent but not yet acknowledged.
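The relationship between cwnd, rwnd and FlightSize in the definitions
above can be illustrated with a minimal sketch. The function names
send_limit and flight_size are hypothetical, and sequence numbers are
treated as plain integers (wraparound is ignored), which is an
assumption made only to keep the example short.

```python
def send_limit(highest_acked: int, cwnd: int, rwnd: int) -> int:
    """Highest sequence number the sender may transmit.

    Per the cwnd definition: the highest acknowledged sequence
    number plus the minimum of cwnd and rwnd.
    """
    return highest_acked + min(cwnd, rwnd)


def flight_size(highest_sent: int, highest_acked: int) -> int:
    """Amount of data sent but not yet acknowledged (FlightSize)."""
    return highest_sent - highest_acked


# Example: 4000-byte cwnd, 8000-byte rwnd, everything up to 10000 acked.
limit = send_limit(highest_acked=10000, cwnd=4000, rwnd=8000)
outstanding = flight_size(highest_sent=12500, highest_acked=10000)
```

Here the congestion window, not the receiver window, is the binding
constraint, so the sender may transmit up to sequence number 14000.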
3 Congestion Control Algorithms 3 Congestion Control Algorithms
This section defines the four congestion control algorithms: slow This section defines the four congestion control algorithms: slow
start, congestion avoidance, fast retransmit and fast recovery, start, congestion avoidance, fast retransmit and fast recovery,
developed in [Jac88] and [Jac90]. In some situations it may be developed in [Jac88] and [Jac90]. In some situations it may be
beneficial for a TCP sender to be more conservative than the beneficial for a TCP sender to be more conservative than the
algorithms allow, however a TCP MUST NOT be more aggressive than the algorithms allow, however a TCP MUST NOT be more aggressive than the
following algorithms allow (that is, MUST NOT send data when the following algorithms allow (that is, MUST NOT send data when the
value of cwnd computed by the following algorithms would not allow value of cwnd computed by the following algorithms would not allow
skipping to change at page 3, line 34 skipping to change at page 3, line 47
to determine whether the slow start or congestion avoidance to determine whether the slow start or congestion avoidance
algorithm is used to control data transmission, as discussed below. algorithm is used to control data transmission, as discussed below.
Beginning transmission into a network with unknown conditions Beginning transmission into a network with unknown conditions
requires TCP to slowly probe the network to determine the available requires TCP to slowly probe the network to determine the available
capacity, in order to avoid congesting the network with an capacity, in order to avoid congesting the network with an
inappropriately large burst of data. The slow start algorithm is inappropriately large burst of data. The slow start algorithm is
used for this purpose at the beginning of a transfer, or after used for this purpose at the beginning of a transfer, or after
repairing loss detected by the retransmission timer. repairing loss detected by the retransmission timer.
IW, the initial value of cwnd, MUST be less than or equal to MSS IW, the initial value of cwnd, MUST be less than or equal to 2*SMSS
bytes. bytes and MUST NOT be more than 2 segments.
We note that a non-standard, experimental TCP extension allows that We note that a non-standard, experimental TCP extension allows that
a TCP MAY use a larger initial window (IW), as defined in equation 1 a TCP MAY use a larger initial window (IW), as defined in equation 1
[AFP98]: [AFP98]:
IW = min (4*MSS, max (2*MSS, 4380 bytes)) (1) IW = min (4*SMSS, max (2*SMSS, 4380 bytes)) (1)
With this extension, a TCP sender MAY use a 2 segment initial With this extension, a TCP sender MAY use a 3 or 4 segment initial
window, regardless of the segment size, and 3 and 4 segment initial window, provided the combined size of the segments does not exceed
windows MAY be used, provided the combined size of the segments does 4380 bytes. We do NOT allow this change as part of the standard
not exceed 4380 bytes. We do NOT allow this change as part of the defined by this document. However, we include discussion of (1) in
standard defined by this document. However, we include discussion the remainder of this document as a guideline for those
of (1) in the remainder of this document as a guideline for those
experimenting with the change, rather than conforming to the present experimenting with the change, rather than conforming to the present
standards for TCP congestion control. standards for TCP congestion control.
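As a worked illustration of the two initial window rules in the -02
column, the sketch below computes the standard upper bound (2*SMSS
bytes) and the experimental value from equation (1). The function
names are hypothetical and are not taken from either draft.

```python
def standard_initial_window(smss: int) -> int:
    """Upper bound on IW in the -02 text: at most 2*SMSS bytes."""
    return 2 * smss


def experimental_initial_window(smss: int) -> int:
    """Equation (1): IW = min(4*SMSS, max(2*SMSS, 4380 bytes))."""
    return min(4 * smss, max(2 * smss, 4380))


# Typical Ethernet-derived SMSS of 1460 bytes.
iw_std = standard_initial_window(1460)
iw_exp = experimental_initial_window(1460)
```

With SMSS = 1460 the experimental window works out to 4380 bytes (3
segments); with SMSS = 536 it is 2144 bytes (4 segments), and with a
large SMSS such as 4000 it falls back to 2*SMSS.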
The initial value of ssthresh MAY be arbitrarily high (for example, The initial value of ssthresh MAY be arbitrarily high (for example,
some implementations use the size of the advertised window), but it some implementations use the size of the advertised window), but it
may be reduced in response to congestion. The slow start algorithm may be reduced in response to congestion. The slow start algorithm
is used when cwnd < ssthresh, while the congestion avoidance is used when cwnd < ssthresh, while the congestion avoidance
algorithm is used when cwnd > ssthresh. When cwnd and ssthresh are algorithm is used when cwnd > ssthresh. When cwnd and ssthresh are
equal the sender may use either slow start or congestion avoidance. equal the sender may use either slow start or congestion avoidance.
skipping to change at page 4, line 16 skipping to change at page 4, line 27
each ACK received that acknowledges new data. Slow start ends when each ACK received that acknowledges new data. Slow start ends when
cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted
above); or when cwnd reaches rwnd; or when congestion is observed. above); or when cwnd reaches rwnd; or when congestion is observed.
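The choice between slow start and congestion avoidance, and the slow
start increase itself, might be sketched as follows. The sentence that
gives the per-ACK increment falls in the portion this diff skips, so
the increment of SMSS bytes per ACK of new data is an assumption here
(it is the largest increase conventional slow start permits). The
names in_slow_start, slow_start_on_ack and prefer_slow_start_on_tie
are hypothetical.

```python
def in_slow_start(cwnd: int, ssthresh: int,
                  prefer_slow_start_on_tie: bool = True) -> bool:
    """Slow start when cwnd < ssthresh, avoidance when cwnd > ssthresh.

    When the two are equal the sender may use either algorithm, so
    the tie-break is left as a parameter.
    """
    if cwnd < ssthresh:
        return True
    if cwnd > ssthresh:
        return False
    return prefer_slow_start_on_tie


def slow_start_on_ack(cwnd: int, smss: int) -> int:
    """Assumed slow start step: grow cwnd by SMSS per ACK of new data."""
    return cwnd + smss


cwnd, ssthresh, smss = 1460, 5840, 1460
acks = 0
# Count ACKs of new data until slow start hands over (tie goes to avoidance).
while in_slow_start(cwnd, ssthresh, prefer_slow_start_on_tie=False):
    cwnd = slow_start_on_ack(cwnd, smss)
    acks += 1
```

The loop ignores rwnd and congestion signals, both of which would also
end slow start as described above.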
During congestion avoidance, cwnd is incremented by 1 full-sized During congestion avoidance, cwnd is incremented by 1 full-sized
segment per round-trip time (RTT). Congestion avoidance continues segment per round-trip time (RTT). Congestion avoidance continues
until cwnd reaches the receiver's advertised window or congestion is until cwnd reaches the receiver's advertised window or congestion is
detected. One formula commonly used to update cwnd during detected. One formula commonly used to update cwnd during
congestion avoidance is given in equation 2: congestion avoidance is given in equation 2:
cwnd += MSS*MSS/cwnd (2) cwnd += SMSS*SMSS/cwnd (2)
This adjustment is executed on every incoming non-duplicate ACK. This adjustment is executed on every incoming non-duplicate ACK.
Equation (2) provides an acceptable approximation to the underlying Equation (2) provides an acceptable approximation to the underlying
principle of increasing cwnd by 1 full-sized segment per RTT. (Note principle of increasing cwnd by 1 full-sized segment per RTT. (Note
that for a connection in which the receiver acknowledges every data that for a connection in which the receiver acknowledges every data
segment, (2) proves slightly more aggressive than 1 segment per RTT, segment, (2) proves slightly more aggressive than 1 segment per RTT,
and for a receiver acknowledging every-other packet, (2) is less and for a receiver acknowledging every-other packet, (2) is less
aggressive.) aggressive.)
Implementation Note: Since integer arithmetic is usually used in TCP Implementation Note: Since integer arithmetic is usually used in TCP
implementations, the formula given in equation 2 can fail to implementations, the formula given in equation 2 can fail to
increase cwnd when the congestion window is very large (larger than increase cwnd when the congestion window is very large (larger than
MSS*MSS). If the above formula yields 0, the result SHOULD be SMSS*SMSS). If the above formula yields 0, the result SHOULD be
rounded up to 1 byte. rounded up to 1 byte.
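A minimal sketch of equation (2) with integer arithmetic, including
the round-up to 1 byte described in the implementation note, is shown
below. cwnd is assumed to be kept in bytes, and the function name is
hypothetical.

```python
def congestion_avoidance_on_ack(cwnd: int, smss: int) -> int:
    """Apply equation (2) once, for one non-duplicate ACK."""
    # Integer division truncates, so the increment can become 0
    # once cwnd is larger than SMSS*SMSS.
    increment = (smss * smss) // cwnd
    if increment == 0:
        # The implementation note says the result SHOULD be
        # rounded up to 1 byte in this case.
        increment = 1
    return cwnd + increment


normal = congestion_avoidance_on_ack(cwnd=14600, smss=1460)
very_large = congestion_avoidance_on_ack(cwnd=3_000_000, smss=1460)
```

For a 10-segment window the increment is 146 bytes per ACK, so ten
ACKs add roughly one full-sized segment per RTT.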
Implementation Note: older implementations have an additional Implementation Note: older implementations have an additional
additive constant on the right-hand side of (2). This is incorrect additive constant on the right-hand side of equation (2). This is
and can actually lead to diminished performance [PAD+98]. incorrect and can actually lead to diminished performance [PAD+98].
Another acceptable way to increase cwnd during congestion avoidance Another acceptable way to increase cwnd during congestion avoidance
is to count the number of bytes that have been acknowledged by ACKs is to count the number of bytes that have been acknowledged by ACKs
for new data. (A drawback of this implementation is that it for new data. (A drawback of this implementation is that it
requires maintaining an additional state variable.) When the number requires maintaining an additional state variable.) When the number
of bytes acknowledged reaches cwnd, then cwnd can be incremented by of bytes acknowledged reaches cwnd, then cwnd can be incremented by
up to MSS bytes. Note that during congestion avoidance, cwnd MUST up to SMSS bytes. Note that during congestion avoidance, cwnd MUST
NOT be increased by more than the larger of either 1 full-sized NOT be increased by more than the larger of either 1 full-sized
segment per RTT, or the value computed using equation 2. segment per RTT, or the value computed using equation 2.
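The byte-counting alternative could look like the following sketch.
The class name, the bytes_acked counter (the additional state variable
the text mentions) and the choice to carry leftover bytes into the
next round are assumptions; the text only requires that the increase
not exceed SMSS bytes once cwnd bytes have been acknowledged.

```python
class ByteCountingAvoidance:
    """Congestion avoidance driven by a count of newly acked bytes."""

    def __init__(self, cwnd: int, smss: int) -> None:
        self.cwnd = cwnd
        self.smss = smss
        self.bytes_acked = 0  # extra state required by this approach

    def on_new_ack(self, acked_bytes: int) -> int:
        """Account for an ACK of new data and return the updated cwnd."""
        self.bytes_acked += acked_bytes
        if self.bytes_acked >= self.cwnd:
            # One window's worth of data has been acknowledged:
            # grow by (at most) one SMSS and start counting again.
            self.bytes_acked -= self.cwnd
            self.cwnd += self.smss
        return self.cwnd


ca = ByteCountingAvoidance(cwnd=2920, smss=1460)
after_first = ca.on_new_ack(1460)
after_second = ca.on_new_ack(1460)
```

Because the counter advances by bytes acknowledged rather than by ACKs
received, the growth rate does not depend on whether the receiver
acknowledges every segment or every other segment.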
Implementation Note: some implementations maintain cwnd in units of Implementation Note: some implementations maintain cwnd in units of
bytes, while others in units of full-sized segments. The latter bytes, while others in units of full-sized segments. The latter
will find equation (2) difficult to use, and may prefer to use the will find equation (2) difficult to use, and may prefer to use the
counting approach discussed in the previous paragraph. counting approach discussed in the previous paragraph.
When a TCP sender detects segment loss using the retransmission When a TCP sender detects segment loss using the retransmission
timer, the value of ssthresh MUST be set to no more than the value timer, the value of ssthresh MUST be set to no more than the value
given in equation 3: given in equation 3:
ssthresh = max (FlightSize / 2, 2*MSS) (3) ssthresh = max (FlightSize / 2, 2*SMSS) (3)
As discussed above, FlightSize is the amount of outstanding data in As discussed above, FlightSize is the amount of outstanding data in
the network. the network.
Implementation Note: an easy mistake to make is to simply use cwnd, Implementation Note: an easy mistake to make is to simply use cwnd,
rather than FlightSize, which in some implementations may rather than FlightSize, which in some implementations may
incidentally increase well beyond rwnd. incidentally increase well beyond rwnd.
Furthermore, upon a timeout cwnd MUST be set to no more than the Furthermore, upon a timeout cwnd MUST be set to no more than the
loss window, LW, which equals 1 full-sized segment (regardless of loss window, LW, which equals 1 full-sized segment (regardless of
the value of IW). Therefore, after retransmitting the dropped the value of IW). Therefore, after retransmitting the dropped
segment the TCP sender uses the slow start algorithm to increase the segment the TCP sender uses the slow start algorithm to increase the
window from 1 full-sized segment to the new value of ssthresh, at window from 1 full-sized segment to the new value of ssthresh, at
which point congestion avoidance again takes over in a fashion which point congestion avoidance again takes over.
identical to that for a connection's initial slow start.
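The reaction to a retransmission timeout, combining equation (3) with
the loss window, can be sketched as below. The function name is
hypothetical, and integer division is an assumption for the
FlightSize / 2 term.

```python
def on_retransmission_timeout(flight_size: int, smss: int) -> tuple[int, int]:
    """Return (cwnd, ssthresh) after loss detected by the timer."""
    # Equation (3): based on FlightSize, not on cwnd.
    ssthresh = max(flight_size // 2, 2 * smss)
    # Loss window LW is 1 full-sized segment, regardless of IW.
    cwnd = smss
    return cwnd, ssthresh


cwnd, ssthresh = on_retransmission_timeout(flight_size=14600, smss=1460)
small = on_retransmission_timeout(flight_size=1460, smss=1460)
```

After this, slow start grows cwnd from one segment back up to the new
ssthresh. The second call shows the 2*SMSS floor taking effect when
little data is outstanding.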
3.3 Fast Retransmit/Fast Recovery 3.3 Fast Retransmit/Fast Recovery
A TCP receiver SHOULD send an immediate duplicate ACK when an A TCP receiver SHOULD send an immediate duplicate ACK when an
out-of-order segment arrives. The purpose of this ACK is to inform out-of-order segment arrives. The purpose of this ACK is to inform
the sender that a segment was received out-of-order and which the sender that a segment was received out-of-order and which
sequence number is expected. From the sender's perspective, sequence number is expected. From the sender's perspective,
duplicate ACKs can be caused by a number of network problems. duplicate ACKs can be caused by a number of network problems.
First, they can be caused by dropped segments. In this case, all First, they can be caused by dropped segments. In this case, all
segments after the dropped segment will trigger duplicate ACKs. segments after the dropped segment will trigger duplicate ACKs.
skipping to change at page 5, line 45 skipping to change at page 5, line 55
The TCP sender SHOULD use the "fast retransmit" algorithm to detect The TCP sender SHOULD use the "fast retransmit" algorithm to detect
and repair loss, based on incoming duplicate ACKs. The fast and repair loss, based on incoming duplicate ACKs. The fast
retransmit algorithm uses the arrival of 3 duplicate ACKs (4 retransmit algorithm uses the arrival of 3 duplicate ACKs (4
identical ACKs without the arrival of any other intervening packets) identical ACKs without the arrival of any other intervening packets)
as an indication that a segment has been lost. After receiving 3 as an indication that a segment has been lost. After receiving 3
duplicate ACKs, TCP performs a retransmission of what appears to be duplicate ACKs, TCP performs a retransmission of what appears to be
the missing segment, without waiting for the retransmission timer to the missing segment, without waiting for the retransmission timer to
expire. expire.
After the fast retransmit sends what appears to be the missing After the fast retransmit algorithm sends what appears to be the
segment, the "fast recovery" algorithm governs the transmission of missing segment, the "fast recovery" algorithm governs the
new data until a non-duplicate ACK arrives. The reason for not transmission of new data until a non-duplicate ACK arrives. The
performing slow start is that the receipt of the duplicate ACKs not reason for not performing slow start is that the receipt of the
only indicates that a segment has been lost, but also that segments duplicate ACKs not only indicates that a segment has been lost, but
are most likely leaving the network (although a massive segment also that segments are most likely leaving the network (although a
duplication by the network can invalidate this conclusion). In massive segment duplication by the network can invalidate this
other words, since the receiver can only generate a duplicate ACK conclusion). In other words, since the receiver can only generate a
when a segment has arrived, that segment has left the network and is duplicate ACK when a segment has arrived, that segment has left the
in the receiver's buffer, so we know it is no longer consuming network and is in the receiver's buffer, so we know it is no longer
network resources. Furthermore, since the ACK "clock" [Jac88] is consuming network resources. Furthermore, since the ACK "clock"
preserved, the TCP sender can continue to transmit new segments [Jac88] is preserved, the TCP sender can continue to transmit new
(although transmission must continue using a reduced cwnd). segments (although transmission must continue using a reduced cwnd).
The fast retransmit and fast recovery algorithms are usually The fast retransmit and fast recovery algorithms are usually
implemented together as follows. implemented together as follows.
1. When the third duplicate ACK is received, set ssthresh to no 1. When the third duplicate ACK is received, set ssthresh to no
more than the value given in equation 3. more than the value given in equation 3.
2. Retransmit the lost segment and set cwnd to ssthresh plus 3*MSS. 2. Retransmit the lost segment and set cwnd to ssthresh plus
This artificially "inflates" the congestion window by the number 3*SMSS. This artificially "inflates" the congestion window by
of segments (three) that have left the network and which the the number of segments (three) that have left the network and
receiver has buffered. which the receiver has buffered.
3. For each additional duplicate ACK received, increment cwnd by 3. For each additional duplicate ACK received, increment cwnd by
MSS. This artificially inflates the congestion window in order SMSS. This artificially inflates the congestion window in order
to reflect the additional segment that has left the network. to reflect the additional segment that has left the network.
4. Transmit a segment, if allowed by the new value of cwnd and the 4. Transmit a segment, if allowed by the new value of cwnd and the
receiver's advertised window. receiver's advertised window.
5. When the next ACK arrives that acknowledges new data, set cwnd 5. When the next ACK arrives that acknowledges new data, set cwnd
to ssthresh (the value set in step 1). This is termed to ssthresh (the value set in step 1). This is termed
"deflating" the window. "deflating" the window.
This ACK should be the acknowledgment elicited by the This ACK should be the acknowledgment elicited by the
retransmission from step 1, one RTT after the retransmission retransmission from step 1, one RTT after the retransmission
(though it may arrive sooner in the presence of significant (though it may arrive sooner in the presence of significant
out-of-order delivery of data segments at the receiver). out-of-order delivery of data segments at the receiver).
Additionally, this ACK should acknowledge all the intermediate Additionally, this ACK should acknowledge all the intermediate
segments sent between the lost segment and the receipt of the segments sent between the lost segment and the receipt of the
third duplicate ACK, if none of these were lost. third duplicate ACK, if none of these were lost.
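The five steps above can be sketched as a small state machine. This is
an illustration only: the class and attribute names are hypothetical,
the retransmission itself is reduced to a counter, and step 4 (sending
new data if cwnd and rwnd allow) is left to the caller.

```python
from dataclasses import dataclass

DUPACK_THRESHOLD = 3  # third duplicate ACK triggers fast retransmit


@dataclass
class FastRecoverySender:
    cwnd: int
    ssthresh: int
    smss: int
    dupacks: int = 0
    in_recovery: bool = False
    retransmits: int = 0  # stands in for actually resending a segment

    def on_dup_ack(self, flight_size: int) -> None:
        self.dupacks += 1
        if self.dupacks == DUPACK_THRESHOLD:
            # Step 1: ssthresh no more than equation (3).
            self.ssthresh = max(flight_size // 2, 2 * self.smss)
            # Step 2: retransmit and inflate by the three segments
            # that have left the network.
            self.retransmits += 1
            self.cwnd = self.ssthresh + 3 * self.smss
            self.in_recovery = True
        elif self.in_recovery:
            # Step 3: one more segment has left the network.
            self.cwnd += self.smss

    def on_new_ack(self) -> None:
        if self.in_recovery:
            # Step 5: deflate the window to the value from step 1.
            self.cwnd = self.ssthresh
            self.in_recovery = False
        self.dupacks = 0


s = FastRecoverySender(cwnd=14600, ssthresh=65535, smss=1460)
for _ in range(3):
    s.on_dup_ack(flight_size=14600)
inflated = s.cwnd
s.on_dup_ack(flight_size=14600)
inflated_again = s.cwnd
s.on_new_ack()
```

Duplicate ACKs before the third leave cwnd untouched, which matches
the rule that only the third one is taken as an indication of loss.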
Note: This algorithm is known to generally not recover very Note: This algorithm is known to generally not recover very
efficiently from multiple losses in a single flight of packets. One efficiently from multiple losses in a single flight of packets
proposed set of modifications to it to address this problem can be [FF96]. One proposed set of modifications to it to address this
found in [FH98]. problem can be found in [FH98].
4 Additional Considerations 4 Additional Considerations
4.1 Re-starting Idle Connections 4.1 Re-starting Idle Connections
A known problem with the TCP congestion control algorithms described A known problem with the TCP congestion control algorithms described
above is that they allow a potentially inappropriate burst of above is that they allow a potentially inappropriate burst of
traffic to be transmitted after TCP has been idle for a relatively traffic to be transmitted after TCP has been idle for a relatively
long period of time. After an idle period, TCP cannot use the ACK long period of time. After an idle period, TCP cannot use the ACK
clock to strobe new segments into the network, as all the ACKs have clock to strobe new segments into the network, as all the ACKs have
skipping to change at page 7, line 7 skipping to change at page 7, line 17
an idle period. an idle period.
[Jac88] recommends that a TCP use slow start to restart transmission [Jac88] recommends that a TCP use slow start to restart transmission
after a relatively long idle period. Slow start serves to restart after a relatively long idle period. Slow start serves to restart
the ACK clock, just as it does at the beginning of a transfer. This the ACK clock, just as it does at the beginning of a transfer. This
mechanism has been widely deployed in the following manner. When mechanism has been widely deployed in the following manner. When
TCP has not received a segment for more than one retransmission TCP has not received a segment for more than one retransmission
timeout, cwnd is reduced to the value of the restart window (RW) timeout, cwnd is reduced to the value of the restart window (RW)
before transmission begins. before transmission begins.
For the purposes of this standard, we define RW = IW = 1 full-sized For the purposes of this standard, we define RW = IW.
segment.
We note that the non-standard experimental extension to TCP defined We note that the non-standard experimental extension to TCP defined
in [AFP98] defines RW = min(IW, cwnd), with the definition of IW in [AFP98] defines RW = min(IW, cwnd), with the definition of IW
adjusted per equation (1) above. adjusted per equation (1) above.
Using the last time a segment was received to determine whether or Using the last time a segment was received to determine whether or
not to decrease cwnd fails to deflate cwnd in the common case of not to decrease cwnd fails to deflate cwnd in the common case of
persistent HTTP connections [HTH98]. In this case, a WWW server persistent HTTP connections [HTH98]. In this case, a WWW server
receives a request before transmitting data to the WWW browser. The receives a request before transmitting data to the WWW browser. The
reception of the request makes the test for an idle connection fail, reception of the request makes the test for an idle connection fail,
and allows the TCP to begin transmission with a possibly and allows the TCP to begin transmission with a possibly
inappropriately large cwnd. inappropriately large cwnd.
Therefore, a TCP SHOULD reduce cwnd to no more than RW before Therefore, a TCP SHOULD set cwnd to no more than RW before beginning
beginning transmission if the TCP has not sent data in an interval transmission if the TCP has not sent data in an interval exceeding
exceeding the retransmission timeout. the retransmission timeout.
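The idle test recommended above, keyed on the last time data was sent
rather than received, might be sketched as follows. The function and
parameter names are hypothetical, and times are assumed to be seconds
from a monotonic clock.

```python
def cwnd_before_sending(cwnd: int, restart_window: int,
                        now: float, last_send_time: float,
                        rto: float) -> int:
    """Return the cwnd to use when transmission is about to resume."""
    idle = now - last_send_time
    if idle > rto:
        # Idle for longer than one retransmission timeout:
        # cwnd must be no more than RW.
        return min(cwnd, restart_window)
    return cwnd


after_idle = cwnd_before_sending(20000, 2920, now=10.0,
                                 last_send_time=5.0, rto=1.0)
still_active = cwnd_before_sending(20000, 2920, now=5.5,
                                   last_send_time=5.0, rto=1.0)
```

Using min() means a cwnd that is already below RW is not raised by the
restart, which is consistent with "no more than RW".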
4.2 Generating Acknowledgments 4.2 Generating Acknowledgments
The delayed ACK algorithm specified in [Bra89] SHOULD be used by a The delayed ACK algorithm specified in [Bra89] SHOULD be used by a
TCP receiver. When used, a TCP receiver MUST NOT excessively delay TCP receiver. When used, a TCP receiver MUST NOT excessively delay
acknowledgments. Specifically, an ACK SHOULD be generated for at acknowledgments. Specifically, an ACK SHOULD be generated for at
least every second full-sized segment, and MUST be generated within least every second full-sized segment, and MUST be generated within
500 ms of the arrival of the first unacknowledged packet. 500 ms of the arrival of the first unacknowledged packet.
The requirement that an ACK "SHOULD" be generated for at least every The requirement that an ACK "SHOULD" be generated for at least every
second full-sized segment is listed in [Bra89] in one place as a second full-sized segment is listed in [Bra89] in one place as a
SHOULD and another as a MUST. Here we unambiguously state it is a SHOULD and another as a MUST. Here we unambiguously state it is a
SHOULD. We also emphasize that this is a "strong" SHOULD, meaning SHOULD. We also emphasize that this is a "strong" SHOULD, meaning
that an implementor should indeed only deviate from this requirement that an implementor should indeed only deviate from this requirement
after careful consideration of the implications. See the discussion after careful consideration of the implications. See the discussion
of "Stretch ACK violation" in [PAD+98] and the references therein of "Stretch ACK violation" in [PAD+98] and the references therein
for a discussion of the possible performance problems with for a discussion of the possible performance problems with
generating ACKs less frequently than every second full-sized generating ACKs less frequently than every second full-sized
segment. segment.
In some cases, the sender and receiver may not agree on what what In some cases, the sender and receiver may not agree on what
constitutes a full-sized segment. An implementation is deemed to constitutes a full-sized segment. An implementation is deemed to
comply with this requirement if it sends at least one acknowledgment comply with this requirement if it sends at least one acknowledgment
draft-ietf-tcpimpl-cong-control-01.txt:
every time it receives 2*MSS bytes of new data from the sender,
where MSS is the Maximum Segment Size specified by the receiver to
the sender (or the default value of 536 bytes, per [Bra89], if the
receiver does not specify an MSS option during connection
establishment). Finally, we repeat that an ACK MUST NOT be delayed
for more than 500 ms waiting on a second full-sized segment to
arrive. Out-of-order data segments SHOULD be acknowledged
immediately, in order to accelerate loss recovery. To trigger the
fast retransmit algorithm, the receiver SHOULD send an immediate
duplicate ACK when it receives a data segment above a gap in the
sequence space. To provide feedback to senders recovering from
losses, the receiver SHOULD send an immediate ACK when it receives a
data segment that fills in all or part of a gap in the sequence
space.
draft-ietf-tcpimpl-cong-control-02.txt:
every time it receives 2*RMSS bytes of new data from the sender,
where RMSS is the Maximum Segment Size specified by the receiver to
the sender (or the default value of 536 bytes, per [Bra89], if the
receiver does not specify an MSS option during connection
establishment). The sender may be forced to use a segment size less
than RMSS due to the maximum transmission unit (MTU), the path MTU
discovery algorithm or other factors. For instance, consider the
case when the receiver announces an MSS of X bytes but the sender
ends up using a segment size of Y bytes (Y < X) due to path MTU
discovery (or the sender's MTU size). The receiver will generate
stretch ACKs if it waits for 2*X bytes to arrive before an ACK is
sent. Clearly this will take more than 2 segments of size Y bytes.
Therefore, while a specific algorithm is not defined, it is
desirable for receivers to attempt to prevent this situation, for
example by acknowledging at least every second segment, regardless
of size. Finally, we repeat that an ACK MUST NOT be delayed for
more than 500 ms waiting on a second full-sized segment to arrive.
Out-of-order data segments SHOULD be acknowledged immediately, in
order to accelerate loss recovery. To trigger the fast retransmit
algorithm, the receiver SHOULD send an immediate duplicate ACK when
it receives a data segment above a gap in the sequence space. To
provide feedback to senders recovering from losses, the receiver
SHOULD send an immediate ACK when it receives a data segment that
fills in all or part of a gap in the sequence space.
A TCP receiver MUST NOT generate more than one ACK for every A TCP receiver MUST NOT generate more than one ACK for every
incoming segment, other than to update the offered window as the incoming segment, other than to update the offered window as the
receiving application consumes new data. receiving application consumes new data.
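The receiver-side rules in this paragraph (acknowledge at least every
second segment regardless of size, never delay an ACK beyond 500 ms,
and acknowledge immediately when a segment arrives above a gap or
fills one) can be sketched as follows. This is a minimal illustration
and not part of either draft revision. AckPolicy, on_segment and
ack_due are hypothetical names, sequence numbers are plain integers
with no wraparound, and the handling of already-received data is an
assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

MAX_ACK_DELAY = 0.5  # seconds: the 500 ms bound on delaying an ACK


@dataclass
class AckPolicy:
    """Hypothetical receiver-side ACK decision logic (illustration only)."""
    rcv_nxt: int = 0                       # next in-order byte expected
    unacked_segments: int = 0              # in-order segments awaiting an ACK
    delayed_since: Optional[float] = None  # arrival time of oldest unACKed segment
    ooo: List[Tuple[int, int]] = field(default_factory=list)  # ranges above a gap

    def _ack_sent(self) -> bool:
        # An ACK is cumulative, so sending one clears any pending delayed ACK.
        self.unacked_segments = 0
        self.delayed_since = None
        return True

    def on_segment(self, seq: int, length: int, now: float) -> bool:
        """Return True if an ACK should be sent immediately."""
        if seq > self.rcv_nxt:
            # Segment above a gap: immediate duplicate ACK.
            self.ooo.append((seq, seq + length))
            return self._ack_sent()
        if seq + length <= self.rcv_nxt:
            # Old data already received. ACKing at once is an assumption
            # of this sketch; the text does not cover this case.
            return self._ack_sent()
        had_gap = bool(self.ooo)
        self.rcv_nxt = seq + length
        # Absorb buffered ranges that are now contiguous.
        self.ooo.sort()
        while self.ooo and self.ooo[0][0] <= self.rcv_nxt:
            self.rcv_nxt = max(self.rcv_nxt, self.ooo.pop(0)[1])
        if had_gap:
            # Segment fills all or part of a gap: immediate ACK.
            return self._ack_sent()
        self.unacked_segments += 1
        if self.delayed_since is None:
            self.delayed_since = now
        if self.unacked_segments >= 2:
            # Every second segment is ACKed, regardless of its size.
            return self._ack_sent()
        return False

    def ack_due(self, now: float) -> bool:
        """True once a delayed ACK has waited the full 500 ms."""
        return (self.delayed_since is not None
                and now - self.delayed_since >= MAX_ACK_DELAY)
```

Counting segments rather than bytes is what avoids stretch ACKs when
the sender's segment size Y is smaller than the announced X: two
Y-byte segments trigger an ACK even though 2*X bytes have not arrived.
A caller would poll ack_due from its timer and send the pending ACK
when it returns True.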
4.4 Loss Recovery Mechanisms 4.4 Loss Recovery Mechanisms
A number of loss recovery algorithms that augment fast retransmit A number of loss recovery algorithms that augment fast retransmit
and fast recovery have been suggested by TCP researchers. While and fast recovery have been suggested by TCP researchers. While
some of these algorithms are based on the TCP selective some of these algorithms are based on the TCP selective
acknowledgment (SACK) option [MMFR96], such as [FF96,MM96a,MM96b], acknowledgment (SACK) option [MMFR96], such as [FF96,MM96a,MM96b],
others do not require SACKs [Hoe96,FF96,FH98]. The non-SACK others do not require SACKs [Hoe96,FF96,FH98]. The non-SACK
algorithms use ``partial acknowledgments'' (ACKs which cover new algorithms use "partial acknowledgments" (ACKs which cover new data,
data, but not all the data outstanding when loss was detected) to but not all the data outstanding when loss was detected) to trigger
trigger retransmissions. While this document does not standardize retransmissions. While this document does not standardize any of
any of the specific algorithms that may improve fast retransmit/fast the specific algorithms that may improve fast retransmit/fast
recovery, these enhanced algorithms are implicitly allowed, as long recovery, these enhanced algorithms are implicitly allowed, as long
as they follow the general principles of the basic four algorithms as they follow the general principles of the basic four algorithms
outlined above. outlined above.
Therefore, when the first loss in a window of data is detected, Therefore, when the first loss in a window of data is detected,
ssthresh MUST be set to no more than the value given by equation ssthresh MUST be set to no more than the value given by equation
(3). Second, until all lost segments in the window of data in (3). Second, until all lost segments in the window of data in
question are repaired, the number of segments transmitted in each question are repaired, the number of segments transmitted in each
RTT MUST be no more than half the number of outstanding segments RTT MUST be no more than half the number of outstanding segments
when the loss was detected. Finally, after all loss in the given when the loss was detected. Finally, after all loss in the given
skipping to change at page 8, line 51 skipping to change at page 9, line 17
congestion control algorithms outlined in this document. congestion control algorithms outlined in this document.
5. Security Considerations 5. Security Considerations
This document requires a TCP to diminish its sending rate in the This document requires a TCP to diminish its sending rate in the
presence of retransmission timeouts and the arrival of duplicate presence of retransmission timeouts and the arrival of duplicate
acknowledgments. An attacker can therefore impair the performance acknowledgments. An attacker can therefore impair the performance
of a TCP connection by either causing data packets or their of a TCP connection by either causing data packets or their
acknowledgments to be lost, or by forging excessive duplicate acknowledgments to be lost, or by forging excessive duplicate
acknowledgments. Causing two congestion control events back-to-back acknowledgments. Causing two congestion control events back-to-back
will often cut ssthresh to its minimum value of 2*MSS, causing the will often cut ssthresh to its minimum value of 2*SMSS, causing the
connection to immediately enter the slower-performing congestion connection to immediately enter the slower-performing congestion
avoidance phase. avoidance phase.
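As a worked illustration of the back-to-back case, assume equation
(3), which is not reproduced in this excerpt, has the form
ssthresh = max(FlightSize / 2, 2*SMSS), as in the published version of
this specification. The function name and the 1460-byte SMSS below are
assumptions made for the example only.

```python
def ssthresh_after_loss(flight_size: int, smss: int) -> int:
    """Upper bound on ssthresh after a loss, assuming equation (3) is
    max(FlightSize / 2, 2*SMSS). All values are in bytes."""
    return max(flight_size // 2, 2 * smss)


SMSS = 1460  # assumed sender maximum segment size, in bytes

# First congestion event with 8 segments in flight: ssthresh is halved.
first = ssthresh_after_loss(8 * SMSS, SMSS)

# A second event arrives while only 3 segments are in flight:
# half of that is below the floor, so ssthresh lands on 2*SMSS.
second = ssthresh_after_loss(3 * SMSS, SMSS)

print(first // SMSS, second // SMSS)
```

With ssthresh at its floor, cwnd reaches it almost immediately, so the
connection leaves slow start and grows its window only linearly.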
The Internet to a considerable degree relies on the correct The Internet to a considerable degree relies on the correct
implementation of these algorithms in order to preserve network implementation of these algorithms in order to preserve network
stability and avoid congestion collapse. An attacker could cause stability and avoid congestion collapse. An attacker could cause
TCP endpoints to respond more aggressively in the face of congestion TCP endpoints to respond more aggressively in the face of congestion
by forging excessive duplicate acknowledgments or excessive by forging excessive duplicate acknowledgments or excessive
acknowledgments for new data. Conceivably, such an attack could acknowledgments for new data. Conceivably, such an attack could
drive a portion of the network into congestion collapse. drive a portion of the network into congestion collapse.
skipping to change at page 9, line 29 skipping to change at page 9, line 49
of Addison-Wesley. of Addison-Wesley.
Neal Cardwell, Sally Floyd, Craig Partridge and Joe Touch Neal Cardwell, Sally Floyd, Craig Partridge and Joe Touch
contributed a number of helpful suggestions. contributed a number of helpful suggestions.
References References
[AFP98] M. Allman, S. Floyd, C. Partridge, Increasing TCP's Initial [AFP98] M. Allman, S. Floyd, C. Partridge, Increasing TCP's Initial
Window Size, September 1998. RFC 2414. Window Size, September 1998. RFC 2414.
[Bra89] B. Braden, ed., "Requirements for Internet Hosts -- [Bra89] B. Braden, ed., Requirements for Internet Hosts --
Communication Layers," RFC 1122, Oct. 1989. Communication Layers, RFC 1122, Oct. 1989.
[Bra97] S. Bradner, "Key words for use in RFCs to Indicate [Bra97] S. Bradner, Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels, BCP 14, RFC 2119, March 1997.
[FF96] K. Fall, S. Floyd. Simulation-based Comparisons of Tahoe, [FF96] K. Fall, S. Floyd. Simulation-based Comparisons of Tahoe,
Reno and SACK TCP. Computer Communication Review, July 1996. Reno and SACK TCP. Computer Communication Review, July 1996.
ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.
[FH98] S. Floyd, T. Henderson. The NewReno Modification to TCP's [FH98] S. Floyd, T. Henderson. The NewReno Modification to TCP's
Fast Recovery Algorithm. Internet-Draft Fast Recovery Algorithm. Internet-Draft
draft-ietf-tcpimpl-newreno-00.txt, November 1998. (Work in draft-ietf-tcpimpl-newreno-00.txt, November 1998. (Work in
progress). progress).
skipping to change at page 9, line 56 skipping to change at page 10, line 22
ftp://ftp.ee.lbl.gov/papers/fastretrans.ps. ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.
[Hoe96] J. Hoe, Improving the Start-up Behavior of a Congestion [Hoe96] J. Hoe, Improving the Start-up Behavior of a Congestion
Control Scheme for TCP. In ACM SIGCOMM, August 1996. Control Scheme for TCP. In ACM SIGCOMM, August 1996.
[HTH98] A. Hughes, J. Touch, J. Heidemann. Issues in TCP Slow-Start [HTH98] A. Hughes, J. Touch, J. Heidemann. Issues in TCP Slow-Start
Restart After Idle. Internet-Draft Restart After Idle. Internet-Draft
draft-ietf-tcpimpl-restart-00.txt, March 1998. (Work in draft-ietf-tcpimpl-restart-00.txt, March 1998. (Work in
progress). progress).
[Jac88] V. Jacobson, "Congestion Avoidance and Control," Computer [Jac88] V. Jacobson, Congestion Avoidance and Control, Computer
Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988. Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.
ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z. ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.
[Jac90] V. Jacobson, "Modified TCP Congestion Avoidance Algorithm," [Jac90] V. Jacobson, Modified TCP Congestion Avoidance Algorithm,
end2end-interest mailing list, April 30, 1990. end2end-interest mailing list, April 30, 1990.
ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail. ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.
draft-ietf-tcpimpl-cong-control-01.txt:
[MM96a] M. Mathis, J. Mahdavi, "Forward Acknowledgment: Refining TCP
Congestion Control," Proceedings of SIGCOMM'96, August, 1996,
Stanford, CA. Available from
http://www.psc.edu/networking/papers/papers.html
draft-ietf-tcpimpl-cong-control-02.txt:
[MD90] J. Mogul, S. Deering. Path MTU Discovery, November 1990.
RFC 1191.
[MM96a] M. Mathis, J. Mahdavi, Forward Acknowledgment: Refining TCP
Congestion Control, Proceedings of SIGCOMM'96, August, 1996,
Stanford, CA. Available from
http://www.psc.edu/networking/papers/papers.html
[MM96b] M. Mathis, J. Mahdavi, "TCP Rate-Halving with Bounding [MM96b] M. Mathis, J. Mahdavi, TCP Rate-Halving with Bounding
Parameters" Available from Parameters. Technical report. Available from
http://www.psc.edu/networking/papers/FACKnotes/current. http://www.psc.edu/networking/papers/FACKnotes/current.
[MMFR96] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP Selective [MMFR96] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, TCP Selective
Acknowledgement Options", RFC 2018, October 1996. Acknowledgement Options, October 1996. RFC 2018.
draft-ietf-tcpimpl-cong-control-01.txt:
[PAD+98] V. Paxson, M. Allman, S. Dawson, W. Fenner, J. Griner,
I. Heavens, K. Lahey, J. Semke, B. Volz. Internet-Draft
draft-ietf-tcpimpl-prob-05.txt, October 1998. (Work in
progress).
draft-ietf-tcpimpl-cong-control-02.txt:
[PAD+98] V. Paxson, M. Allman, S. Dawson, W. Fenner, J. Griner,
I. Heavens, K. Lahey, J. Semke, B. Volz. Known TCP
Implementation Problems. Internet-Draft
draft-ietf-tcpimpl-prob-05.txt, November 1998. (Work in
progress).
[Pax97] V. Paxson, "End-to-End Internet Packet Dynamics," [Pax97] V. Paxson, End-to-End Internet Packet Dynamics,
Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997. Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997.
[Pos81] J. Postel, Transmission Control Protocol, September 1981. [Pos81] J. Postel, Transmission Control Protocol, September 1981.
RFC 793. RFC 793.
[Ste94] W. R. Stevens, "TCP/IP Illustrated, Volume 1: The [Ste94] W. R. Stevens, TCP/IP Illustrated, Volume 1: The
Protocols", Addison-Wesley, 1994. Protocols, Addison-Wesley, 1994.
[Ste97] W. R. Stevens, "TCP Slow Start, Congestion Avoidance, Fast [Ste97] W. R. Stevens, "TCP Slow Start, Congestion Avoidance, Fast
Retransmit, and Fast Recovery Algorithms", RFC 2001, January Retransmit, and Fast Recovery Algorithms", January 1997. RFC
1997. 2001.
[WS95] G. R. Wright, W. R. Stevens, "TCP/IP Illustrated, Volume 2: [WS95] G. R. Wright, W. R. Stevens, TCP/IP Illustrated, Volume 2:
The Implementation", Addison-Wesley, 1995. The Implementation, Addison-Wesley, 1995.
Author's Address: Author's Address:
Mark Allman Mark Allman
NASA Lewis Research Center/Sterling Software NASA Lewis Research Center/Sterling Software
21000 Brookpark Rd. MS 54-2 21000 Brookpark Rd. MS 54-2
Cleveland, OH 44135 Cleveland, OH 44135
216-433-6586 216-433-6586
mallman@lerc.nasa.gov mallman@lerc.nasa.gov
http://gigahertz.lerc.nasa.gov/~mallman http://gigahertz.lerc.nasa.gov/~mallman
 End of changes. 42 change blocks. 
93 lines changed or deleted 116 lines changed or added
