draft-ietf-tcpimpl-cong-control-05.txt

TCP Implementation Working Group                              M. Allman
INTERNET DRAFT                             NASA Lewis/Sterling Software
File: draft-ietf-tcpimpl-cong-control-05.txt                  V. Paxson
                                                                   LBNL
                                                             W. Stevens
                                                             Consultant
                                                         February, 1999

TCP Congestion Control

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as ``work in
progress.''

To view the entire list of current Internet-Drafts, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

rfc2581.txt

Network Working Group                                         M. Allman
Request for Comments: 2581                 NASA Glenn/Sterling Software
Obsoletes: 2001                                               V. Paxson
Category: Standards Track                                  ACIRI / ICSI
                                                             W. Stevens
                                                             Consultant
                                                             April 1999

TCP Congestion Control

Status of this Memo

This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (1999). All Rights Reserved.
Abstract

This document defines TCP's four intertwined congestion control
algorithms: slow start, congestion avoidance, fast retransmit, and
fast recovery. In addition, the document specifies how TCP should
begin transmission after a relatively long idle period, as well as
discussing various acknowledgment generation methods.
1. Introduction

This document specifies four TCP [Pos81] congestion control
algorithms: slow start, congestion avoidance, fast retransmit and
fast recovery. These algorithms were devised in [Jac88] and [Jac90].
Their use with TCP is standardized in [Bra89].

This document is an update of [Ste97]. In addition to specifying the
congestion control algorithms, this document specifies what TCP
connections should do after a relatively long idle period, as well as
specifying and clarifying some of the issues pertaining to TCP ACK
generation.

Note that [Ste94] provides examples of these algorithms in action and
[WS95] provides an explanation of the source code for the BSD
implementation of these algorithms.

This document is organized as follows. Section 2 provides various
definitions which will be used throughout the document. Section 3
provides a specification of the congestion control algorithms.
Section 4 outlines concerns related to the congestion control
algorithms and finally, section 5 outlines security considerations.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [Bra97].
2. Definitions

This section provides the definition of several terms that will be
used throughout the remainder of this document.

SEGMENT:
   A segment is ANY TCP/IP data or acknowledgment packet (or both).

SENDER MAXIMUM SEGMENT SIZE (SMSS):
   The SMSS is the size of the largest segment that the sender can
   transmit. This value can be based on the maximum transmission
   unit of the network, the path MTU discovery [MD90] algorithm,
   RMSS (see next item), or other factors. The size does not
   include the TCP/IP headers and options.

RECEIVER MAXIMUM SEGMENT SIZE (RMSS):
   The RMSS is the size of the largest segment the receiver is
   willing to accept. This is the value specified in the MSS
   option sent by the receiver during connection startup. Or, if
   the MSS option is not used, 536 bytes [Bra89]. The size does
   not include the TCP/IP headers and options.

FULL-SIZED SEGMENT:
   A segment that contains the maximum number of data bytes
   permitted (i.e., a segment containing SMSS bytes of data).

RECEIVER WINDOW (rwnd):
   The most recently advertised receiver window.

CONGESTION WINDOW (cwnd):
   A TCP state variable that limits the amount of data a TCP can
   send. At any given time, a TCP MUST NOT send data with a
   sequence number higher than the sum of the highest acknowledged
   sequence number and the minimum of cwnd and rwnd.

INITIAL WINDOW (IW):
   The initial window is the size of the sender's congestion window
   after the three-way handshake is completed.

LOSS WINDOW (LW):
   The loss window is the size of the congestion window after a TCP
   sender detects loss using its retransmission timer.

RESTART WINDOW (RW):
   The restart window is the size of the congestion window after a
   TCP restarts transmission after an idle period (if the slow
   start algorithm is used; see section 4.1 for more discussion).

FLIGHT SIZE:
   The amount of data that has been sent but not yet acknowledged.
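The CONGESTION WINDOW and FLIGHT SIZE definitions above can be
illustrated with a minimal sketch. The function and parameter names
(usable_window, flight_size, highest_acked, next_seq) are hypothetical
and do not come from this document, and the sketch assumes sequence
numbers that do not wrap around, which a real TCP must handle.

```python
def flight_size(highest_acked: int, next_seq: int) -> int:
    """Bytes sent but not yet acknowledged (assumes no sequence wraparound)."""
    return next_seq - highest_acked


def usable_window(highest_acked: int, next_seq: int, cwnd: int, rwnd: int) -> int:
    """Bytes the sender may still transmit right now.

    The send limit is the highest acknowledged sequence number plus
    min(cwnd, rwnd); anything already in flight counts against it.
    """
    limit = highest_acked + min(cwnd, rwnd)
    return max(0, limit - next_seq)
```

With 2000 bytes in flight, a cwnd of 4000 and an rwnd of 8000, the
sender may transmit 2000 more bytes. If rwnd shrinks to 1500, the
sender may transmit nothing until more data is acknowledged.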
3. Congestion Control Algorithms

This section defines the four congestion control algorithms: slow
start, congestion avoidance, fast retransmit and fast recovery,
developed in [Jac88] and [Jac90]. In some situations it may be
beneficial for a TCP sender to be more conservative than the
algorithms allow, however a TCP MUST NOT be more aggressive than the
following algorithms allow (that is, MUST NOT send data when the
value of cwnd computed by the following algorithms would not allow
the data to be sent).

3.1 Slow Start and Congestion Avoidance

The slow start and congestion avoidance algorithms MUST be used by a
TCP sender to control the amount of outstanding data being injected
into the network. To implement these algorithms, two variables are
added to the TCP per-connection state. The congestion window (cwnd)
is a sender-side limit on the amount of data the sender can transmit
into the network before receiving an acknowledgment (ACK), while the
receiver's advertised window (rwnd) is a receiver-side limit on the
amount of outstanding data. The minimum of cwnd and rwnd governs
data transmission.
Another state variable, the slow start threshold (ssthresh), is used
to determine whether the slow start or congestion avoidance algorithm
is used to control data transmission, as discussed below.

Beginning transmission into a network with unknown conditions
requires TCP to slowly probe the network to determine the available
capacity, in order to avoid congesting the network with an
inappropriately large burst of data. The slow start algorithm is
used for this purpose at the beginning of a transfer, or after
repairing loss detected by the retransmission timer.

IW, the initial value of cwnd, MUST be less than or equal to 2*SMSS
bytes and MUST NOT be more than 2 segments.

We note that a non-standard, experimental TCP extension allows that a
TCP MAY use a larger initial window (IW), as defined in equation 1
[AFP98]:

   IW = min (4*SMSS, max (2*SMSS, 4380 bytes))                  (1)

With this extension, a TCP sender MAY use a 3 or 4 segment initial
window, provided the combined size of the segments does not exceed
4380 bytes. We do NOT allow this change as part of the standard
defined by this document. However, we include discussion of (1) in
the remainder of this document as a guideline for those experimenting
with the change, rather than conforming to the present standards for
TCP congestion control.

The initial value of ssthresh MAY be arbitrarily high (for example,
some implementations use the size of the advertised window), but it
may be reduced in response to congestion. The slow start algorithm
is used when cwnd < ssthresh, while the congestion avoidance
algorithm is used when cwnd > ssthresh. When cwnd and ssthresh are
equal the sender may use either slow start or congestion avoidance.

During slow start, a TCP increments cwnd by at most SMSS bytes for
each ACK received that acknowledges new data. Slow start ends when
cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted
above) or when congestion is observed.

During congestion avoidance, cwnd is incremented by 1 full-sized
segment per round-trip time (RTT). Congestion avoidance continues
until congestion is detected. One formula commonly used to update
cwnd during congestion avoidance is given in equation 2:

   cwnd += SMSS*SMSS/cwnd                                       (2)

This adjustment is executed on every incoming non-duplicate ACK.
Equation (2) provides an acceptable approximation to the underlying
principle of increasing cwnd by 1 full-sized segment per RTT. (Note
that for a connection in which the receiver acknowledges every data
segment, (2) proves slightly more aggressive than 1 segment per RTT,
and for a receiver acknowledging every-other packet, (2) is less
aggressive.)

Implementation Note: Since integer arithmetic is usually used in TCP
implementations, the formula given in equation 2 can fail to increase
cwnd when the congestion window is very large (larger than
SMSS*SMSS). If the above formula yields 0, the result SHOULD be
rounded up to 1 byte.
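The window rules given so far in this subsection can be summarized in
a short sketch. It is illustrative only: the function names and the
"experimental" flag are hypothetical, cwnd is assumed to be kept in
bytes, and the choice of congestion avoidance when cwnd equals
ssthresh is one of the two options the text permits.

```python
def initial_window(smss: int, experimental: bool = False) -> int:
    """Initial cwnd in bytes.

    The standard upper bound is 2*SMSS. With experimental=True the
    larger value of equation (1) is used, which this document does
    NOT allow as part of the standard.
    """
    if experimental:
        return min(4 * smss, max(2 * smss, 4380))
    return 2 * smss


def cwnd_after_new_ack(cwnd: int, ssthresh: int, smss: int) -> int:
    """Return the new cwnd after an ACK that acknowledges new data."""
    if cwnd < ssthresh:
        # Slow start: grow by at most SMSS bytes per ACK.
        return cwnd + smss
    # Congestion avoidance, equation (2), using integer arithmetic.
    increment = smss * smss // cwnd
    # Per the implementation note, round a zero result up to 1 byte.
    return cwnd + max(increment, 1)
```

For SMSS = 1460 and cwnd = 14600 (ten segments), equation (2) adds
146 bytes per ACK, so ten ACKs add roughly one full-sized segment per
round trip.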
Implementation Note: older implementations have an additional
additive constant on the right-hand side of equation (2). This is
incorrect and can actually lead to diminished performance [PAD+98].

Another acceptable way to increase cwnd during congestion avoidance
is to count the number of bytes that have been acknowledged by ACKs
for new data. (A drawback of this implementation is that it requires
maintaining an additional state variable.) When the number of bytes
acknowledged reaches cwnd, then cwnd can be incremented by up to SMSS
bytes. Note that during congestion avoidance, cwnd MUST NOT be
increased by more than the larger of either 1 full-sized segment per
RTT, or the value computed using equation 2.

Implementation Note: some implementations maintain cwnd in units of
bytes, while others in units of full-sized segments. The latter will
find equation (2) difficult to use, and may prefer to use the
counting approach discussed in the previous paragraph.
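The byte-counting approach can be sketched as follows. The class and
attribute names are hypothetical, and carrying the excess count over
into the next round (rather than resetting it to zero) is an
assumption of this sketch, not something the text specifies.

```python
class ByteCountingAvoidance:
    """Congestion avoidance by counting acknowledged bytes (a sketch)."""

    def __init__(self, cwnd: int, smss: int) -> None:
        self.cwnd = cwnd          # congestion window, in bytes
        self.smss = smss          # sender maximum segment size, in bytes
        self.bytes_acked = 0      # the additional state variable

    def on_new_ack(self, acked_bytes: int) -> int:
        """Account for an ACK of new data and return the updated cwnd."""
        self.bytes_acked += acked_bytes
        if self.bytes_acked >= self.cwnd:
            # A full window has been acknowledged: grow by one segment.
            self.bytes_acked -= self.cwnd
            self.cwnd += self.smss
        return self.cwnd
```

With cwnd = 4000 and SMSS = 1000, the fourth 1000-byte ACK raises
cwnd to 5000; five further ACKs are then needed to reach 6000.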
When a TCP sender detects segment loss using the retransmission
timer, the value of ssthresh MUST be set to no more than the value
given in equation 3:

   ssthresh = max (FlightSize / 2, 2*SMSS)                      (3)

As discussed above, FlightSize is the amount of outstanding data in
the network.

Implementation Note: an easy mistake to make is to simply use cwnd,
rather than FlightSize, which in some implementations may
incidentally increase well beyond rwnd.

Furthermore, upon a timeout cwnd MUST be set to no more than the loss
window, LW, which equals 1 full-sized segment (regardless of the
value of IW). Therefore, after retransmitting the dropped segment
the TCP sender uses the slow start algorithm to increase the window
from 1 full-sized segment to the new value of ssthresh, at which
point congestion avoidance again takes over.
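The response to a retransmission timeout described above can be
sketched as follows. The function name is hypothetical, and the
sketch picks the largest values the text allows (a sender may be more
conservative).

```python
def on_retransmission_timeout(flight_size: int, smss: int) -> tuple[int, int]:
    """Return (cwnd, ssthresh) after loss detected by the retransmission timer.

    ssthresh follows equation (3) and is computed from FlightSize,
    not from cwnd. cwnd collapses to the loss window LW, which is
    1 full-sized segment.
    """
    ssthresh = max(flight_size // 2, 2 * smss)
    cwnd = smss  # LW = 1 full-sized segment, regardless of IW
    return cwnd, ssthresh
```

For example, with 20000 bytes in flight and SMSS = 1460, the sender
restarts slow start from cwnd = 1460 toward ssthresh = 10000.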
3.2 Fast Retransmit/Fast Recovery

A TCP receiver SHOULD send an immediate duplicate ACK when an
out-of-order segment arrives. The purpose of this ACK is to inform
the sender that a segment was received out-of-order and which
sequence number is expected. From the sender's perspective,
duplicate ACKs can be caused by a number of network problems. First,
they can be caused by dropped segments. In this case, all segments
after the dropped segment will trigger duplicate ACKs. Second,
duplicate ACKs can be caused by the re-ordering of data segments by
the network (not a rare event along some network paths [Pax97]).
Finally, duplicate ACKs can be caused by replication of ACK or data
segments by the network. In addition, a TCP receiver SHOULD send an
immediate ACK when the incoming segment fills in all or part of a gap
in the sequence space. This will generate more timely information
for a sender recovering from a loss through a retransmission timeout,
a fast retransmit, or an experimental loss recovery algorithm, such
as NewReno [FH98].

The TCP sender SHOULD use the "fast retransmit" algorithm to detect
and repair loss, based on incoming duplicate ACKs. The fast
retransmit algorithm uses the arrival of 3 duplicate ACKs (4
identical ACKs without the arrival of any other intervening packets)
as an indication that a segment has been lost. After receiving 3
duplicate ACKs, TCP performs a retransmission of what appears to be
the missing segment, without waiting for the retransmission timer to
expire.

After the fast retransmit algorithm sends what appears to be the
missing segment, the "fast recovery" algorithm governs the
transmission of new data until a non-duplicate ACK arrives. The
reason for not performing slow start is that the receipt of the
duplicate ACKs not only indicates that a segment has been lost, but
also that segments are most likely leaving the network (although a
massive segment duplication by the network can invalidate this
conclusion). In other words, since the receiver can only generate a
duplicate ACK when a segment has arrived, that segment has left the
network and is in the receiver's buffer, so we know it is no longer
consuming network resources. Furthermore, since the ACK "clock"
[Jac88] is preserved, the TCP sender can continue to transmit new
segments (although transmission must continue using a reduced cwnd).

The fast retransmit and fast recovery algorithms are usually
implemented together as follows.
1. When the third duplicate ACK is received, set ssthresh to no more
   than the value given in equation 3.

2. Retransmit the lost segment and set cwnd to ssthresh plus 3*SMSS.
   This artificially "inflates" the congestion window by the number
   of segments (three) that have left the network and which the
   receiver has buffered.

3. For each additional duplicate ACK received, increment cwnd by
   SMSS. This artificially inflates the congestion window in order
   to reflect the additional segment that has left the network.

4. Transmit a segment, if allowed by the new value of cwnd and the
   receiver's advertised window.

5. When the next ACK arrives that acknowledges new data, set cwnd to
   ssthresh (the value set in step 1). This is termed "deflating"
   the window.

   This ACK should be the acknowledgment elicited by the
   retransmission from step 2, one RTT after the retransmission
   (though it may arrive sooner in the presence of significant
   out-of-order delivery of data segments at the receiver).
   Additionally, this ACK should acknowledge all the intermediate
   segments sent between the lost segment and the receipt of the
   third duplicate ACK, if none of these were lost.

Note: This algorithm is known to generally not recover very
efficiently from multiple losses in a single flight of packets
[FF96]. One proposed set of modifications to address this problem
can be found in [FH98].
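The five steps above can be sketched as a small state machine. This
is an illustration under stated assumptions, not a complete sender:
the class, method, and attribute names are hypothetical, the actual
retransmission is represented only by a counter, step 4 (sending new
data when cwnd and rwnd allow) is left to the caller, and window
growth outside fast recovery is omitted.

```python
class FastRecoverySender:
    """Fast retransmit / fast recovery window handling (a sketch)."""

    DUP_ACK_THRESHOLD = 3

    def __init__(self, cwnd: int, ssthresh: int, smss: int) -> None:
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.smss = smss
        self.dup_acks = 0
        self.in_fast_recovery = False
        self.retransmits = 0  # stand-in for actually resending a segment

    def on_duplicate_ack(self, flight_size: int) -> None:
        self.dup_acks += 1
        if self.dup_acks == self.DUP_ACK_THRESHOLD and not self.in_fast_recovery:
            # Step 1: ssthresh per equation (3).
            self.ssthresh = max(flight_size // 2, 2 * self.smss)
            # Step 2: retransmit the lost segment and inflate cwnd by the
            # three segments known to have left the network.
            self.retransmits += 1
            self.cwnd = self.ssthresh + 3 * self.smss
            self.in_fast_recovery = True
        elif self.in_fast_recovery:
            # Step 3: one more segment has left the network.
            self.cwnd += self.smss
        # Step 4 (caller): transmit a segment if min(cwnd, rwnd) allows.

    def on_new_ack(self) -> None:
        if self.in_fast_recovery:
            # Step 5: deflate the window to the value set in step 1.
            self.cwnd = self.ssthresh
            self.in_fast_recovery = False
        self.dup_acks = 0
```

Starting from cwnd = 10000 with 10000 bytes in flight and
SMSS = 1000, the third duplicate ACK sets ssthresh to 5000 and cwnd
to 8000, a fourth duplicate ACK inflates cwnd to 9000, and the next
ACK for new data deflates cwnd to 5000.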
4. Additional Considerations

4.1 Re-starting Idle Connections

A known problem with the TCP congestion control algorithms described
above is that they allow a potentially inappropriate burst of traffic
to be transmitted after TCP has been idle for a relatively long
period of time. After an idle period, TCP cannot use the ACK clock
to strobe new segments into the network, as all the ACKs have drained
from the network. Therefore, as specified above, TCP can potentially
send a cwnd-size line-rate burst into the network after an idle
period.

[Jac88] recommends that a TCP use slow start to restart transmission
after a relatively long idle period. Slow start serves to restart
the ACK clock, just as it does at the beginning of a transfer. This
mechanism has been widely deployed in the following manner. When TCP
has not received a segment for more than one retransmission timeout,
cwnd is reduced to the value of the restart window (RW) before
transmission begins.

For the purposes of this standard, we define RW = IW.

We note that the non-standard experimental extension to TCP defined
in [AFP98] defines RW = min(IW, cwnd), with the definition of IW
adjusted per equation (1) above.

Using the last time a segment was received to determine whether or
not to decrease cwnd fails to deflate cwnd in the common case of
persistent HTTP connections [HTH98]. In this case, a WWW server
receives a request before transmitting data to the WWW browser. The
reception of the request makes the test for an idle connection fail,
and allows the TCP to begin transmission with a possibly
inappropriately large cwnd.

Therefore, a TCP SHOULD set cwnd to no more than RW before beginning
transmission if the TCP has not sent data in an interval exceeding
the retransmission timeout.
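The recommended idle test can be sketched as follows. The function
and parameter names are hypothetical; the sketch assumes times in
seconds, measures idleness from the last time data was SENT (as
recommended above, not from the last segment received), and takes
RW = IW at its standard upper bound of 2*SMSS.

```python
def cwnd_before_restart(cwnd: int, idle_time: float, rto: float, smss: int) -> int:
    """Return the cwnd to use when resuming transmission after a pause.

    idle_time is the time since this TCP last sent data. If it exceeds
    the retransmission timeout, cwnd is capped at the restart window.
    """
    restart_window = 2 * smss  # RW = IW, assumed here to be 2*SMSS
    if idle_time > rto:
        return min(cwnd, restart_window)
    return cwnd
```

Because the cap is applied with min(), a connection whose cwnd is
already below RW is never enlarged by the idle test.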
4.2 Generating Acknowledgments

The delayed ACK algorithm specified in [Bra89] SHOULD be used by a
TCP receiver. When used, a TCP receiver MUST NOT excessively delay
acknowledgments. Specifically, an ACK SHOULD be generated for at
least every second full-sized segment, and MUST be generated within
500 ms of the arrival of the first unacknowledged packet.
The requirement that an ACK "SHOULD" be generated for at least every The requirement that an ACK "SHOULD" be generated for at least every
second full-sized segment is listed in [Bra89] in one place as a second full-sized segment is listed in [Bra89] in one place as a
SHOULD and another as a MUST. Here we unambiguously state it is a SHOULD and another as a MUST. Here we unambiguously state it is a
SHOULD. We also emphasize that this is a SHOULD, meaning that an SHOULD. We also emphasize that this is a SHOULD, meaning that an
implementor should indeed only deviate from this requirement after implementor should indeed only deviate from this requirement after
careful consideration of the implications. See the discussion of careful consideration of the implications. See the discussion of
"Stretch ACK violation" in [PAD+98] and the references therein for a "Stretch ACK violation" in [PAD+98] and the references therein for a
discussion of the possible performance problems with generating ACKs discussion of the possible performance problems with generating ACKs
less frequently than every second full-sized segment. less frequently than every second full-sized segment.
In some cases, the sender and receiver may not agree on what In some cases, the sender and receiver may not agree on what
constitutes a full-sized segment. An implementation is deemed to constitutes a full-sized segment. An implementation is deemed to
comply with this requirement if it sends at least one acknowledgment comply with this requirement if it sends at least one acknowledgment
every time it receives 2*RMSS bytes of new data from the sender, every time it receives 2*RMSS bytes of new data from the sender,
where RMSS is the Maximum Segment Size specified by the receiver to where RMSS is the Maximum Segment Size specified by the receiver to
the sender (or the default value of 536 bytes, per [Bra89], if the the sender (or the default value of 536 bytes, per [Bra89], if the
receiver does not specify an MSS option during connection receiver does not specify an MSS option during connection
establishment). The sender may be forced to use a segment size less establishment). The sender may be forced to use a segment size less
than RMSS due to the maximum transmission unit (MTU), the path MTU than RMSS due to the maximum transmission unit (MTU), the path MTU
discovery algorithm or other factors. For instance, consider the discovery algorithm or other factors. For instance, consider the
case when the receiver announces an RMSS of X bytes but the sender case when the receiver announces an RMSS of X bytes but the sender
ends up using a segment size of Y bytes (Y < X) due to path MTU ends up using a segment size of Y bytes (Y < X) due to path MTU
discovery (or the sender's MTU size). The receiver will generate discovery (or the sender's MTU size). The receiver will generate
stretch ACKs if it waits for 2*X bytes to arrive before an ACK is stretch ACKs if it waits for 2*X bytes to arrive before an ACK is
sent. Clearly this will take more than 2 segments of size Y bytes. sent. Clearly this will take more than 2 segments of size Y bytes.
Therefore, while a specific algorithm is not defined, it is Therefore, while a specific algorithm is not defined, it is desirable
desirable for receivers to attempt to prevent this situation, for for receivers to attempt to prevent this situation, for example by
example by acknowledging at least every second segment, regardless acknowledging at least every second segment, regardless of size.
of size. Finally, we repeat that an ACK MUST NOT be delayed for Finally, we repeat that an ACK MUST NOT be delayed for more than 500
more than 500 ms waiting on a second full-sized segment to arrive. ms waiting on a second full-sized segment to arrive.
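The text deliberately leaves the receiver algorithm open. One possible policy, sketched below in Python, is to acknowledge every second in-order segment regardless of size and to bound the delay at 500 ms. The class, its method names, and the idea of a periodic timer callback are assumptions made for this example, not part of the specification.

```python
class DelayedAckState:
    """Tracks in-order data that has not yet been acknowledged."""

    MAX_ACK_DELAY = 0.5  # seconds; the 500 ms upper bound from the text

    def __init__(self):
        self.unacked_segments = 0
        self.first_unacked_at = None  # arrival time of oldest unACKed segment

    def on_in_order_segment(self, now):
        """Record an in-order data segment; return True if an ACK is due."""
        if self.first_unacked_at is None:
            self.first_unacked_at = now
        self.unacked_segments += 1
        return self._ack_due(now)

    def on_timer(self, now):
        """Called periodically; return True if the delay bound has expired."""
        if self.first_unacked_at is None:
            return False  # nothing is waiting for an ACK
        return self._ack_due(now)

    def _ack_due(self, now):
        due = (self.unacked_segments >= 2
               or now - self.first_unacked_at >= self.MAX_ACK_DELAY)
        if due:
            # The caller sends the ACK, so the bookkeeping starts over.
            self.unacked_segments = 0
            self.first_unacked_at = None
        return due
```

Counting segments rather than bytes avoids the stretch ACK case described above: if no segment carries more than RMSS bytes, 2*RMSS bytes of new data cannot arrive in fewer than two segments, so this policy also meets the 2*RMSS criterion.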
Out-of-order data segments SHOULD be acknowledged immediately, in Out-of-order data segments SHOULD be acknowledged immediately, in
order to accelerate loss recovery. To trigger the fast retransmit order to accelerate loss recovery. To trigger the fast retransmit
algorithm, the receiver SHOULD send an immediate duplicate ACK when algorithm, the receiver SHOULD send an immediate duplicate ACK when
it receives a data segment above a gap in the sequence space. To it receives a data segment above a gap in the sequence space. To
provide feedback to senders recovering from losses, the receiver provide feedback to senders recovering from losses, the receiver
SHOULD send an immediate ACK when it receives a data segment that SHOULD send an immediate ACK when it receives a data segment that
fills in all or part of a gap in the sequence space. fills in all or part of a gap in the sequence space.
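The two immediate-ACK cases above can be sketched as a single predicate. This is a simplified illustration: the function and parameter names are hypothetical, sequence-number wraparound is ignored, and the receiver is assumed to know whether it currently holds out-of-order data above rcv_nxt.

```python
def needs_immediate_ack(seg_seq, seg_len, rcv_nxt, ooo_buffered):
    """Return True if a data segment should be ACKed without delay.

    seg_seq      -- sequence number of the first byte in the segment
    seg_len      -- number of data bytes in the segment
    rcv_nxt      -- next in-order sequence number the receiver expects
    ooo_buffered -- True if out-of-order data is held above rcv_nxt
    """
    if seg_seq > rcv_nxt:
        # Segment lies above a gap: send an immediate duplicate ACK.
        return True
    if ooo_buffered and seg_seq <= rcv_nxt < seg_seq + seg_len:
        # Segment brings new data at the left edge of a gap: ACK at once
        # so a recovering sender gets prompt feedback.
        return True
    # Ordinary in-order data (or an old duplicate, which this sketch
    # does not treat specially): the delayed ACK rules apply.
    return False
```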
A TCP receiver MUST NOT generate more than one ACK for every A TCP receiver MUST NOT generate more than one ACK for every incoming
incoming segment, other than to update the offered window as the segment, other than to update the offered window as the receiving
receiving application consumes new data [page 42, Pos81][Cla82]. application consumes new data [page 42, Pos81][Cla82].
4.3 Loss Recovery Mechanisms 4.3 Loss Recovery Mechanisms
A number of loss recovery algorithms that augment fast retransmit A number of loss recovery algorithms that augment fast retransmit and
and fast recovery have been suggested by TCP researchers. While fast recovery have been suggested by TCP researchers. While some of
some of these algorithms are based on the TCP selective these algorithms are based on the TCP selective acknowledgment (SACK)
acknowledgment (SACK) option [MMFR96], such as [FF96,MM96a,MM96b], option [MMFR96], such as [FF96,MM96a,MM96b], others do not require
others do not require SACKs [Hoe96,FF96,FH98]. The non-SACK SACKs [Hoe96,FF96,FH98]. The non-SACK algorithms use "partial
algorithms use "partial acknowledgments" (ACKs which cover new data, acknowledgments" (ACKs which cover new data, but not all the data
but not all the data outstanding when loss was detected) to trigger outstanding when loss was detected) to trigger retransmissions.
retransmissions. While this document does not standardize any of While this document does not standardize any of the specific
the specific algorithms that may improve fast retransmit/fast algorithms that may improve fast retransmit/fast recovery, these
recovery, these enhanced algorithms are implicitly allowed, as long enhanced algorithms are implicitly allowed, as long as they follow
as they follow the general principles of the basic four algorithms the general principles of the basic four algorithms outlined above.
outlined above.
Therefore, when the first loss in a window of data is detected, Therefore, when the first loss in a window of data is detected,
ssthresh MUST be set to no more than the value given by equation ssthresh MUST be set to no more than the value given by equation (3).
(3). Second, until all lost segments in the window of data in Second, until all lost segments in the window of data in question are
question are repaired, the number of segments transmitted in each repaired, the number of segments transmitted in each RTT MUST be no
RTT MUST be no more than half the number of outstanding segments more than half the number of outstanding segments when the loss was
when the loss was detected. Finally, after all loss in the given detected. Finally, after all loss in the given window of segments
window of segments has been successfully retransmitted, cwnd MUST be has been successfully retransmitted, cwnd MUST be set to no more than
set to no more than ssthresh and congestion avoidance MUST be used ssthresh and congestion avoidance MUST be used to further increase
to further increase cwnd. Loss in two successive windows of data, cwnd. Loss in two successive windows of data, or the loss of a
or the loss of a retransmission, should be taken as two indications retransmission, should be taken as two indications of congestion and,
of congestion and, therefore, cwnd (and ssthresh) MUST be lowered therefore, cwnd (and ssthresh) MUST be lowered twice in this case.
twice in this case. The algorithms outlined in
[Hoe96,FF96,MM96a,MM96b] follow the principles of the basic four The algorithms outlined in [Hoe96,FF96,MM96a,MM96b] follow the
congestion control algorithms outlined in this document. principles of the basic four congestion control algorithms outlined
in this document.
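The ssthresh and cwnd bounds in this section can be illustrated with a short worked example in Python. It assumes equation (3), given earlier in this document, is ssthresh = max(FlightSize / 2, 2*SMSS). The helper names, the SMSS value of 1460 bytes, and the assumption that FlightSize equals cwnd when the second loss is detected are all choices made for this sketch.

```python
def ssthresh_on_loss(flight_size, smss):
    """Largest ssthresh allowed by equation (3), in bytes."""
    return max(flight_size // 2, 2 * smss)


def cwnd_after_recovery(cwnd, ssthresh):
    """After all losses are repaired, cwnd must not exceed ssthresh."""
    return min(cwnd, ssthresh)


SMSS = 1460  # assumed sender maximum segment size, in bytes

# First loss, detected with 8 segments outstanding.
first_ssthresh = ssthresh_on_loss(8 * SMSS, SMSS)
cwnd = cwnd_after_recovery(8 * SMSS, first_ssthresh)

# Loss in the next window counts as a second indication of congestion,
# so ssthresh (and then cwnd) is lowered a second time.
second_ssthresh = ssthresh_on_loss(cwnd, SMSS)
cwnd = cwnd_after_recovery(cwnd, second_ssthresh)
```

Under these assumptions the first loss gives an ssthresh of 4*SMSS and the second gives 2*SMSS, which is the floor imposed by equation (3).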
5. Security Considerations 5. Security Considerations
This document requires a TCP to diminish its sending rate in the This document requires a TCP to diminish its sending rate in the
presence of retransmission timeouts and the arrival of duplicate presence of retransmission timeouts and the arrival of duplicate
acknowledgments. An attacker can therefore impair the performance acknowledgments. An attacker can therefore impair the performance of
of a TCP connection by either causing data packets or their a TCP connection by either causing data packets or their
acknowledgments to be lost, or by forging excessive duplicate acknowledgments to be lost, or by forging excessive duplicate
acknowledgments. Causing two congestion control events back-to-back acknowledgments. Causing two congestion control events back-to-back
will often cut ssthresh to its minimum value of 2*SMSS, causing the will often cut ssthresh to its minimum value of 2*SMSS, causing the
connection to immediately enter the slower-performing congestion connection to immediately enter the slower-performing congestion
avoidance phase. avoidance phase.
The Internet to a considerable degree relies on the correct The Internet to a considerable degree relies on the correct
implementation of these algorithms in order to preserve network implementation of these algorithms in order to preserve network
stability and avoid congestion collapse. An attacker could cause stability and avoid congestion collapse. An attacker could cause TCP
TCP endpoints to respond more aggressively in the face of congestion endpoints to respond more aggressively in the face of congestion by
by forging excessive duplicate acknowledgments or excessive forging excessive duplicate acknowledgments or excessive
acknowledgments for new data. Conceivably, such an attack could acknowledgments for new data. Conceivably, such an attack could
drive a portion of the network into congestion collapse. drive a portion of the network into congestion collapse.
6. Changes Relative to RFC 2001 6. Changes Relative to RFC 2001
This document has been extensively rewritten editorially and it is This document has been extensively rewritten editorially and it is
not feasible to itemize the list of changes between the two not feasible to itemize the list of changes between the two
documents. The intention of this document is not to change any of documents. The intention of this document is not to change any of the
the recommendations given in RFC 2001, but to further clarify cases recommendations given in RFC 2001, but to further clarify cases that
that were not discussed in detail in RFC 2001. Specifically, this were not discussed in detail in RFC 2001. Specifically, this document
document suggests what TCP connections should do after a relatively suggests what TCP connections should do after a relatively long idle
long idle period, as well as specifying and clarifying some of the period, as well as specifying and clarifying some of the issues
issues pertaining to TCP ACK generation. Finally, the allowable pertaining to TCP ACK generation. Finally, the allowable upper bound
upper bound for the initial congestion window has also been raised for the initial congestion window has also been raised from one to
from one to two segments. two segments.
Acknowledgments Acknowledgments
The four algorithms that are described were developed by Van The four algorithms that are described were developed by Van
Jacobson. Jacobson.
Some of the text from this document is taken from "TCP/IP Some of the text from this document is taken from "TCP/IP
Illustrated, Volume 1: The Protocols" by W. Richard Stevens Illustrated, Volume 1: The Protocols" by W. Richard Stevens
(Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
Implementation" by Gary R. Wright and W. Richard Stevens Implementation" by Gary R. Wright and W. Richard Stevens (Addison-
(Addison-Wesley, 1995). This material is used with the permission Wesley, 1995). This material is used with the permission of
of Addison-Wesley. Addison-Wesley.
Neal Cardwell, Sally Floyd, Craig Partridge and Joe Touch Neal Cardwell, Sally Floyd, Craig Partridge and Joe Touch contributed
contributed a number of helpful suggestions. a number of helpful suggestions.
References References
[AFP98] M. Allman, S. Floyd, C. Partridge, Increasing TCP's Initial [AFP98] Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
Window Size, September 1998. RFC 2414. Initial Window Size", RFC 2414, September 1998.
[Bra89] B. Braden, ed., Requirements for Internet Hosts -- [Bra89] Braden, R., "Requirements for Internet Hosts --
Communication Layers, RFC 1122, Oct. 1989. Communication Layers", STD 3, RFC 1122, October 1989.
[Bra97] S. Bradner, Key words for use in RFCs to Indicate [Bra97] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels, BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
[Cla82] D. Clark, Window and Acknowledgment Strategy in TCP, RFC [Cla82] Clark, D., "Window and Acknowledgment Strategy in TCP", RFC
813. July 1982. 813, July 1982.
[FF96] K. Fall, S. Floyd. Simulation-based Comparisons of Tahoe, [FF96] Fall, K. and S. Floyd, "Simulation-based Comparisons of
Reno and SACK TCP. Computer Communication Review, July 1996. Tahoe, Reno and SACK TCP", Computer Communication Review,
ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z. July 1996. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.
[FH98] S. Floyd, T. Henderson. The NewReno Modification to TCP's [FH98] Floyd, S. and T. Henderson, "The NewReno Modification to
Fast Recovery Algorithm. Internet-Draft TCP's Fast Recovery Algorithm", RFC 2582, April 1999.
draft-ietf-tcpimpl-newreno-00.txt, November 1998. (Work in
progress).
[Flo94] S. Floyd, TCP and Successive Fast Retransmits. Technical [Flo94] Floyd, S., "TCP and Successive Fast Retransmits. Technical
report, October 1994. report", October 1994.
ftp://ftp.ee.lbl.gov/papers/fastretrans.ps. ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.
[Hoe96] J. Hoe, Improving the Start-up Behavior of a Congestion [Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion
Control Scheme for TCP. In ACM SIGCOMM, August 1996. Control Scheme for TCP", In ACM SIGCOMM, August 1996.
[HTH98] A. Hughes, J. Touch, J. Heidemann. Issues in TCP Slow-Start [HTH98] Hughes, A., Touch, J. and J. Heidemann, "Issues in TCP
Restart After Idle. Internet-Draft Slow-Start Restart After Idle", Work in Progress.
draft-ietf-tcpimpl-restart-00.txt, March 1998. (Work in
progress).
[Jac88] V. Jacobson, Congestion Avoidance and Control, Computer [Jac88] Jacobson, V., "Congestion Avoidance and Control", Computer
Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988. Communication Review, vol. 18, no. 4, pp. 314-329, Aug.
ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z. 1988. ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.
[Jac90] V. Jacobson, Modified TCP Congestion Avoidance Algorithm, [Jac90] Jacobson, V., "Modified TCP Congestion Avoidance Algorithm",
end2end-interest mailing list, April 30, 1990. end2end-interest mailing list, April 30, 1990.
ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail. ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.
[MD90] J. Mogul, S. Deering. Path MTU Discovery, November 1990. [MD90] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
RFC 1191. November 1990.
[MM96a] M. Mathis, J. Mahdavi, Forward Acknowledgment: Refining TCP [MM96a] Mathis, M. and J. Mahdavi, "Forward Acknowledgment: Refining
Congestion Control, Proceedings of SIGCOMM'96, August, 1996, TCP Congestion Control", Proceedings of SIGCOMM'96, August,
Stanford, CA. Available from 1996, Stanford, CA. Available
http://www.psc.edu/networking/papers/papers.html from http://www.psc.edu/networking/papers/papers.html
[MM96b] M. Mathis, J. Mahdavi, TCP Rate-Halving with Bounding [MM96b] Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
Parameters. Technical report. Available from Parameters", Technical report. Available from
http://www.psc.edu/networking/papers/FACKnotes/current. http://www.psc.edu/networking/papers/FACKnotes/current.
[MMFR96] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, TCP Selective [MMFR96] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP
Acknowledgement Options, October 1996. RFC 2018. Selective Acknowledgement Options", RFC 2018, October 1996.
[PAD+98] V. Paxson, M. Allman, S. Dawson, W. Fenner, J. Griner, [PAD+98] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J.,
I. Heavens, K. Lahey, J. Semke, B. Volz. Known TCP Heavens, I., Lahey, K., Semke, J. and B. Volz, "Known TCP
Implementation Problems. Internet-Draft Implementation Problems", RFC 2525, March 1999.
draft-ietf-tcpimpl-prob-05.txt, November 1998. (Work in
progress).
[Pax97] V. Paxson, End-to-End Internet Packet Dynamics, [Pax97] Paxson, V., "End-to-End Internet Packet Dynamics",
Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997. Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997.
[Pos81] J. Postel, Transmission Control Protocol, September 1981. [Pos81] Postel, J., "Transmission Control Protocol", STD 7, RFC 793,
RFC 793. September 1981.
[Ste94] W. R. Stevens, TCP/IP Illustrated, Volume 1: The [Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
Protocols, Addison-Wesley, 1994. Addison-Wesley, 1994.
[Ste97] W. R. Stevens, "TCP Slow Start, Congestion Avoidance, Fast [Ste97] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
Retransmit, and Fast Recovery Algorithms", January 1997. RFC Retransmit, and Fast Recovery Algorithms", RFC 2001, January
2001. 1997.
[WS95] G. R. Wright, W. R. Stevens, TCP/IP Illustrated, Volume 2: [WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2:
The Implementation, Addison-Wesley, 1995. The Implementation", Addison-Wesley, 1995.
Author's Address: Authors' Addresses
Mark Allman Mark Allman
NASA Lewis Research Center/Sterling Software NASA Glenn Research Center/Sterling Software
21000 Brookpark Rd. MS 54-2 Lewis Field
Cleveland, OH 44135 21000 Brookpark Rd. MS 54-2
216-433-6586 Cleveland, OH 44135
mallman@lerc.nasa.gov 216-433-6586
http://roland.lerc.nasa.gov/~mallman
Vern Paxson EMail: mallman@grc.nasa.gov
Network Research Group http://roland.grc.nasa.gov/~mallman
Lawrence Berkeley National Laboratory
Berkeley, CA 94720
USA
510-486-7504
vern@ee.lbl.gov
W. Richard Stevens Vern Paxson
1202 E. Paseo del Zorro ACIRI / ICSI
Tucson, AZ 85718 1947 Center Street
520-297-9416 Suite 600
rstevens@kohala.com Berkeley, CA 94704-1198
http://www.kohala.com/~rstevens
Phone: +1 510/642-4274 x302
EMail: vern@aciri.org
W. Richard Stevens
1202 E. Paseo del Zorro
Tucson, AZ 85718
520-297-9416
EMail: rstevens@kohala.com
http://www.kohala.com/~rstevens
Full Copyright Statement
Copyright (C) The Internet Society (1999). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
 End of changes. 103 change blocks. 
457 lines changed or deleted 434 lines changed or added
