< draft-yueven-tsvwg-dccm-requirements-00.txt   draft-yueven-tsvwg-dccm-requirements-01.txt >
TSVWG F. Chen TSVWG F. Chen
Internet-Draft W. Sun Internet-Draft W. Sun
Intended status: Informational X. Yu Intended status: Informational X. Yu
Expires: December 27, 2019 Huawei Technologies Co., Ltd. Expires: January 8, 2020 Huawei Technologies Co., Ltd.
R. Even, Ed. R. Even, Ed.
Huawei Huawei
June 25, 2019 July 7, 2019
Data Center Congestion Management requirements Data Center Congestion Management requirements
draft-yueven-tsvwg-dccm-requirements-00 draft-yueven-tsvwg-dccm-requirements-01
Abstract Abstract
On IP-routed datacenter networks, RDMA is deployed using RoCEv2 On IP-routed datacenter networks, RDMA is deployed using RoCEv2
protocol. RoCEv2 specification does not define a strong congestion protocol or iWARP. RoCEv2 specification does not define a strong
management mechanisms and load balancing methods. RoCEv2 relies on congestion management mechanisms and load balancing methods. RoCEv2
the existing Link-Layer Flow-Control IEEE 802.1Qbb(Priority-based relies on the existing Link-Layer Flow-Control IEEE
Flow Control, PFC) to provide a lossless fabric. RoCEv2 Congestion 802.1Qbb(Priority-based Flow Control, PFC) to provide a lossless
Management(RCM) use ECN(Explicit Congestion Notification, defined in fabric. RoCEv2 Congestion Management(RCM) use ECN(Explicit
RFC3168) to signal the congestion to the destination and use the Congestion Notification, defined in RFC3168) to signal the congestion
congestion notification to reduce the rate of injection and increase to the destination and use the congestion notification to reduce the
the injection rate when the extent of congestion decreases. More and rate of injection and increase the injection rate when the extent of
more practice of congestion management for RoCEv2 appear in the congestion decreases. iWARP depends on TCP congestion handling. This
industry, such as DCQCN(Data Center Quantized Congestion document describes the current state of flow control and congestion
Notification). This document describes the current state of flow handling in the DC and provides requirements for new directions for
control and congestion handling in the DC using RoCEv2 and provides better congestion control.
requirements for new directions for better congestion control.
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/. Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 27, 2019. This Internet-Draft will expire on January 8, 2020.
Copyright Notice Copyright Notice
Copyright (c) 2019 IETF Trust and the persons identified as the Copyright (c) 2019 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of (https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Conventions . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Conventions . . . . . . . . . . . . . . . . . . . . . . . . . 3
3. Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . 3 3. Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . 4
4. Current Congestion Management mechanisms . . . . . . . . . . 4 4. Current Congestion Management mechanisms . . . . . . . . . . 4
4.1. Priority-based Flow Control (PFC) . . . . . . . . . . . . 4 4.1. Priority-based Flow Control (PFC) . . . . . . . . . . . . 4
4.2. Explicit Congestion Notification . . . . . . . . . . . . 4 4.2. Explicit Congestion Notification . . . . . . . . . . . . 4
5. Congestion Management Practice . . . . . . . . . . . . . . . 4 5. Congestion Management Practice . . . . . . . . . . . . . . . 5
5.1. Packet Retransmission . . . . . . . . . . . . . . . . . . 5 5.1. Packet Retransmission . . . . . . . . . . . . . . . . . . 5
5.2. Congestion Control Mechanisms . . . . . . . . . . . . . . 5 5.2. Congestion Control Mechanisms . . . . . . . . . . . . . . 5
5.2.1. RTT-based Congestion Control . . . . . . . . . . . . 5 5.2.1. RTT-based Congestion Control . . . . . . . . . . . . 5
5.2.2. Credit-based Congestion Control . . . . . . . . . . . 5 5.2.2. Credit-based Congestion Control . . . . . . . . . . . 6
5.2.3. ECN-based Congestion Control . . . . . . . . . . . . 6 5.2.3. ECN-based Congestion Control . . . . . . . . . . . . 6
5.3. Re-ordering . . . . . . . . . . . . . . . . . . . . . . . 6 5.3. Re-ordering . . . . . . . . . . . . . . . . . . . . . . . 6
5.4. Load Balancing . . . . . . . . . . . . . . . . . . . . . 6 5.4. Load Balancing . . . . . . . . . . . . . . . . . . . . . 6
5.4.1. Equal-cost multi-path routing (ECMP) . . . . . . . . 6 5.4.1. Equal-cost multi-path routing (ECMP) . . . . . . . . 6
5.4.2. Flowlet . . . . . . . . . . . . . . . . . . . . . . . 6 5.4.2. Flowlet . . . . . . . . . . . . . . . . . . . . . . . 6
5.4.3. Per-packet . . . . . . . . . . . . . . . . . . . . . 7 5.4.3. Per-packet . . . . . . . . . . . . . . . . . . . . . 7
6. Data Center Congestion Management requirements . . . . . . . 7 6. Data Center Congestion Management requirements . . . . . . . 7
7. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 7. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
8. Security Considerations . . . . . . . . . . . . . . . . . . . 8 8. Security Considerations . . . . . . . . . . . . . . . . . . . 8
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 8 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 8
10. References . . . . . . . . . . . . . . . . . . . . . . . . . 8 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 8
10.1. Normative References . . . . . . . . . . . . . . . . . . 8 10.1. Normative References . . . . . . . . . . . . . . . . . . 8
10.2. Informative References . . . . . . . . . . . . . . . . . 8 10.2. Informative References . . . . . . . . . . . . . . . . . 9
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 10 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 10
1. Introduction 1. Introduction
With the emerging Distributed Storage, AI/HPC(High Performance With the emerging Distributed Storage, AI/HPC(High Performance
Computing), Machine Learning, etc., modern datacenter applications Computing), Machine Learning, etc., modern datacenter applications
demand high throughput(40Gbps and above) with ultra-low latency of demand high throughput(40Gbps and above) with ultra-low latency of
less than 10 microseconds per hop from the network, with low CPU less than 10 microseconds per hop from the network, with low CPU
overhead. The high link speeds (>40Gb/s) in Data Centers (DC) are overhead. The high link speeds (>40Gb/s) in Data Centers (DC) are
making network transfers complete faster and in fewer RTTs. Network making network transfers complete faster and in fewer RTTs. Network
traffic in a data center is often a mix of short and long flows, traffic in a data center is often a mix of short and long flows,
where the short flows require low latencies and the long flows where the short flows require low latencies and the long flows
require high throughputs. require high throughputs.
On IP-routed datacenter networks, RDMA is deployed using RoCEv2 On IP-routed datacenter networks, RDMA is deployed using RoCEv2
protocol. RoCEv2 [RoCEv2] is a straightforward extension of the RoCE protocol or iWARP [RFC5040]. RoCEv2 [RoCEv2] is a straightforward
protocol that involves a simple modification of the RoCE packet extension of the RoCE protocol that involves a simple modification of
format. RoCEv2 packets carry an IP header which allows traversal of the RoCE packet format. RoCEv2 packets carry an IP header which
IP L3 Routers and a UDP header that serves as a stateless allows traversal of IP L3 Routers and a UDP header that serves as a
encapsulation layer for the RDMA Transport Protocol Packets over IP. stateless encapsulation layer for the RDMA Transport Protocol Packets
over IP.
RoCEv2 Congestion Management (RCM) provides the capability to avoid RoCEv2 Congestion Management (RCM) provides the capability to avoid
congestion hot spots and optimize the throughput of the fabric. RCM congestion hot spots and optimize the throughput of the fabric. RCM
relies on the existing Link-Layer Flow-Control IEEE 802.1Qbb(PFC) relies on the existing Link-Layer Flow-Control IEEE 802.1Qbb(PFC)
[IEEE.802.1QBB_2011] to provide a drop free network. RoCEv2 [IEEE.802.1QBB_2011] to provide a drop free network. RoCEv2
Congestion Management(RCM) also uses ECN [RFC3168] to signal the Congestion Management(RCM) also uses ECN [RFC3168] to signal the
congestion to the destination and uses the congestion notification as congestion to the destination and uses the congestion notification as
an input to the sender to reduce the rate of injection and increase an input to the sender to reduce the rate of injection and increase
the injection rate when the extent of congestion decreases. The rate the injection rate when the extent of congestion decreases. The rate
reduction by the sender as well as the increase in data injection is reduction by the sender as well as the increase in data injection is
skipping to change at page 7, line 21 skipping to change at page 7, line 26
possible that packets may arrive at the receiver out-of-order. possible that packets may arrive at the receiver out-of-order.
6. Data Center Congestion Management requirements 6. Data Center Congestion Management requirements
The first issue is with incast traffic. Network congestion happens The first issue is with incast traffic. Network congestion happens
in the network routers when the incoming traffic is larger than the in the network routers when the incoming traffic is larger than the
bandwidth of the outgoing link on which it has to be transmitted. bandwidth of the outgoing link on which it has to be transmitted.
Congestion is the primary source of loss in the network, and it Congestion is the primary source of loss in the network, and it
leads to performance degradation. leads to performance degradation.
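The incast condition described above can be illustrated with simple arithmetic. The following is a minimal sketch, not taken from the draft: the function names, the fixed per-sender rates, and the buffer size are assumptions chosen for illustration, and the model ignores flow control, ECN marking, and any rate adaptation by the senders.

```python
def queue_growth_bytes(sender_rates_gbps, egress_gbps, duration_us):
    """Bytes that pile up in the egress queue over duration_us.

    Hypothetical helper: assumes every sender transmits at a constant
    rate toward the same outgoing link for the whole interval.
    """
    offered_gbps = sum(sender_rates_gbps)
    excess_gbps = max(0.0, offered_gbps - egress_gbps)
    # 1 Gb/s = 1000 bits per microsecond = 125 bytes per microsecond
    return excess_gbps * 125.0 * duration_us


def time_to_overflow_us(sender_rates_gbps, egress_gbps, buffer_bytes):
    """Microseconds until a buffer of buffer_bytes is full, or None if
    the offered load does not exceed the outgoing link bandwidth."""
    excess_gbps = max(0.0, sum(sender_rates_gbps) - egress_gbps)
    if excess_gbps == 0:
        return None
    return buffer_bytes / (excess_gbps * 125.0)


if __name__ == "__main__":
    # Four senders at 40 Gb/s converging on one 40 Gb/s link.
    print(queue_growth_bytes([40, 40, 40, 40], 40, 10))
    print(time_to_overflow_us([40, 40, 40, 40], 40, 1_000_000))
```

Under these assumptions the queue grows whenever the summed sender rate exceeds the outgoing link rate, which is the condition the paragraph above names as the cause of congestion.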
The data sender makes its congestion management decision based on
information from the data receiver which provides partial information
about the state of the network itself.
Another issue to address is packet loss due to out-of-order packets Another issue to address is packet loss due to out-of-order packets
which may happen when load balancing is used. RoCEv2 adopts Go-back-N which may happen when load balancing is used. RoCEv2 adopts Go-back-N
loss recovery and requires a lossless fabric to prevent retransmission loss recovery and requires a lossless fabric to prevent retransmission
but does not address the packet loss due to re-ordering. but does not address the packet loss due to re-ordering.
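To show why re-ordering looks like loss under Go-back-N, the following minimal sketch models a receiver that accepts only the next expected sequence number. It is a simplified illustration of the general Go-back-N technique, not the RoCEv2 wire behavior: the function name is hypothetical, sequence numbers start at 0, and acknowledgements, NAKs, and retransmission by the sender are left out.

```python
def go_back_n_receive(arrivals):
    """Return (delivered, discarded) for a strict in-order receiver.

    arrivals is the list of sequence numbers in the order they reach
    the receiver. Anything other than the next expected number is
    dropped, so the sender would have to resend it later.
    """
    expected = 0
    delivered, discarded = [], []
    for seq in arrivals:
        if seq == expected:
            delivered.append(seq)
            expected += 1
        else:
            # Out-of-order (or duplicate) packet: treated as lost.
            discarded.append(seq)
    return delivered, discarded


if __name__ == "__main__":
    # Packets 1 and 2 swapped in flight, e.g. by per-packet load balancing.
    print(go_back_n_receive([0, 2, 1, 3]))
```

In this model no packet is dropped by the network, yet two of the four packets are discarded at the receiver and must be retransmitted, which is the re-ordering cost the paragraph above refers to.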
RoCEv2 relies on Link-Layer Flow-Control IEEE 802.1Qbb(PFC) RoCEv2 relies on Link-Layer Flow-Control IEEE 802.1Qbb(PFC)
[IEEE.802.1QBB_2011] to provide a lossless underlay network. [IEEE.802.1QBB_2011] to provide a lossless underlay network.
A lossless network is implemented by a mechanism of flow control, which A lossless network is implemented by a mechanism of flow control, which
pauses the traffic with priority granularity in the incoming link pauses the traffic with priority granularity in the incoming link
before the buffer overfills, and by that prevents the case of before the buffer overfills, and by that prevents the case of
skipping to change at page 8, line 17 skipping to change at page 8, line 26
o Provide fairness for a mixture of RDMA traffic and normal TCP traffic. o Provide fairness for a mixture of RDMA traffic and normal TCP traffic.
o Provide compatibility when more than one congestion control o Provide compatibility when more than one congestion control
mechanism is used. mechanism is used.
7. Summary 7. Summary
As discussed in Section 6, we need an enhancement to current RDMA As discussed in Section 6, we need an enhancement to current RDMA
transport protocols with stronger capability of congestion management transport protocols with stronger capability of congestion management
to achieve the high throughput and low latency in the large-scale to achieve the high throughput and low latency in the large-scale
datacenter network. The solution should also have more flexible datacenter network. Network co-operation can help with getting
requirement from the underlay network. The solution should work with better information to the data sender. The solution should also have
RoCEv2 but should be more general so it can be used with iWARP as more flexible requirement from the underlay network. The solution
well. should enable better congestion management capabilities and
interoperability for RoCEv2 and iWARP in the data center environment.
8. Security Considerations 8. Security Considerations
TBD TBD
9. IANA Considerations 9. IANA Considerations
No IANA action No IANA action
10. References 10. References
 End of changes. 12 change blocks. 
30 lines changed or deleted 35 lines changed or added
