Routing Area Working Group                                   P. Lapukhov
Internet-Draft                                                  Facebook
Intended status: Informational                                 A. Premji
Expires: February 1, 2016                                Arista Networks
                                                        J. Mitchell, Ed.
                                                           July 31, 2015

          Use of BGP for routing in large-scale data centers
               draft-ietf-rtgwg-bgp-routing-large-dc-05

Abstract

   Some network operators build and operate data centers that support
   over one hundred thousand servers. In this document, such data
   centers are referred to as "large-scale" to differentiate them from
   smaller infrastructures. Environments of this scale have a unique
   set of network requirements with an emphasis on operational
   simplicity and network stability. This document summarizes
   operational experience in designing and operating large-scale data

skipping to change at page 1, line 41

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 1, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents

skipping to change at page 3, line 33

   accommodate networks of this scale, operators are revisiting
   networking designs and platforms to address this need.

   The design presented in this document is based on operational
   experience with data centers built to support large-scale distributed
   software infrastructure, such as a Web search engine. The primary
   requirements in such an environment are operational simplicity and
   network stability so that a small group of people can effectively
   support a significantly sized network.

   Experimentation and extensive testing have shown that External BGP
   (EBGP) [RFC4271] is well suited as a stand-alone routing protocol for
   these types of data center applications. This is in contrast with
   more traditional DC designs, which may use simple tree topologies and
   rely on extending Layer 2 domains across multiple network devices.
   This document elaborates on the requirements that led to this design
   choice, presents details of the EBGP routing design, and explores
   ideas for further enhancements.

   This document first presents an overview of network design
   requirements and considerations for large-scale data centers. Then
   traditional hierarchical data center network topologies are
   contrasted with Clos networks [CLOS1953] that are horizontally scaled
   out. This is followed by arguments for selecting EBGP with a Clos
   topology as the most appropriate routing protocol to meet the
   requirements, and the proposed design is described in detail.

skipping to change at page 4, line 16

   This section describes and summarizes network design requirements for
   large-scale data centers.

2.1.  Bandwidth and Traffic Patterns

   The primary requirement when building an interconnection network for
   a large number of servers is to accommodate application bandwidth and
   latency requirements. Until recently it was quite common to see the
   majority of traffic entering and leaving the data center, commonly
   referred to as "north-south" traffic. Traditional "tree" topologies
   were sufficient to accommodate such flows, even with high
   oversubscription ratios between the layers of the network. If more
   bandwidth was required, it was added by "scaling up" the network
   elements, e.g. by upgrading the device's linecards or fabrics or by
   replacing the device with one of higher port density.

   Today many large-scale data centers host applications generating
   significant amounts of server-to-server traffic, which does not
   egress the DC, commonly referred to as "east-west" traffic. Examples
   of such applications could be compute clusters such as Hadoop
   [HADOOP], massive data replication between clusters needed by certain
   applications, or virtual machine migrations. Scaling traditional
   tree topologies to match these bandwidth demands becomes either too
   expensive or impossible due to physical limitations, e.g. port
   density in a switch.

skipping to change at page 4, line 44

   The Capital Expenditures (CAPEX) associated with the network
   infrastructure alone constitutes about 10-15% of total data center
   expenditure (see [GREENBERG2009]). However, the absolute cost is
   significant, and hence there is a need to constantly drive down the
   cost of individual network elements. This can be accomplished in two
   ways:

   o  Unifying all network elements, preferably using the same hardware
      type or even the same device. This allows for volume pricing on
      bulk purchases and reduced maintenance and sparing costs.

   o  Driving costs down using competitive pressures, by introducing
      multiple network equipment vendors.

   In order to allow for good vendor diversity, it is important to
   minimize the software feature requirements for the network elements.
   This strategy provides maximum flexibility of vendor equipment
   choices while enforcing interoperability using open standards.

2.3.  OPEX Minimization

skipping to change at page 5, line 20

   feature set minimizes software issue-related failures.

   An important aspect of Operational Expenditure (OPEX) minimization is
   reducing the size of failure domains in the network. Ethernet
   networks are known to be susceptible to broadcast or unicast traffic
   storms that can have a dramatic impact on network performance and
   availability. The use of a fully routed design significantly reduces
   the size of the data plane failure domains - i.e. limits them to the
   lowest level in the network hierarchy. However, such designs
   introduce the problem of distributed control plane failures. This
   observation calls for simpler and fewer control plane protocols in
   order to reduce protocol interaction issues and the chance of a
   network meltdown. Minimizing software feature requirements as
   described in the CAPEX section above also reduces testing and
   training requirements.

2.4.  Traffic Engineering

   In any data center, application load balancing is a critical function
   performed by network devices. Traditionally, load balancers are
   deployed as dedicated devices in the traffic forwarding path. The
   problem arises in scaling load balancers under growing traffic
   demand. A preferable solution would be able to scale the load-
   balancing layer horizontally, by adding more of the uniform nodes and
   distributing incoming traffic across these nodes. In situations like

skipping to change at page 8, line 43

   This topology is often also referred to as a "Leaf and Spine"
   network, where "Spine" is the name given to the middle stage of the
   Clos topology (Tier-1) and "Leaf" is the name of the input/output
   stage (Tier-2). For uniformity, this document will refer to these
   layers using the "Tier-n" notation.

3.2.2.  Clos Topology Properties

   The following are some key properties of the Clos topology:

   o  The topology is fully non-blocking, or more accurately non-
      interfering, if M >= N and oversubscribed by a factor of N/M
      otherwise. Here M and N are the uplink and downlink port counts,
      respectively, for a Tier-2 switch as shown in Figure 2 (a worked
      example follows this list).

   o  Utilizing this topology requires control and data plane support
      for ECMP with a fan-out of M or more.

   o  Tier-1 switches have exactly one path to every server in this
      topology. This is an important property that makes route
      summarization dangerous in this topology (see Section 8.2 below).

   o  Traffic flowing from server to server is load balanced over all
      available paths using ECMP.

3.2.3.  Scaling the Clos topology

   A Clos topology can be scaled either by increasing network element
   port density or adding more stages, e.g. moving to a 5-stage Clos, as
   illustrated in Figure 3 below:

skipping to change at page 10, line 37

   Tier-1 device that were previously mapped to different Tier-1
   devices. This technique maintains the same bisectional bandwidth
   while reducing the number of elements in the Tier-1 layer, thus
   saving on CAPEX. The tradeoff, in this example, is the reduction of
   maximum DC size in terms of overall server count by half.
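
   The halving of the maximum server count can be estimated with a
   simple model. The sketch below assumes an idealized folded 5-stage
   Clos built entirely from identical K-port devices with no
   oversubscription at any tier; both the formula and the example port
   count of 32 are illustrative assumptions rather than figures taken
   from this document.

      # Rough upper bound on server count for an idealized 5-stage Clos
      # built from identical K-port devices (no oversubscription).
      def max_servers(k_ports, links_per_tier1_device=1):
          servers_per_tier3 = k_ports // 2  # half the ports face servers
          tier3_per_cluster = k_ports // 2  # set by Tier-2 downlink count
          # Each Tier-1 device spends 'links_per_tier1_device' ports per
          # cluster, so doubling the parallel links halves the number of
          # clusters the topology can support.
          clusters = k_ports // links_per_tier1_device
          return servers_per_tier3 * tier3_per_cluster * clusters

      print(max_servers(32))     # 8192 servers, single links to Tier-1
      print(max_servers(32, 2))  # 4096 servers: halved, as noted above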

   In this example, Tier-2 devices will be using two parallel links to
   connect to each Tier-1 device. If one of these links fails, the
   other will pick up all traffic of the failed link, possibly resulting
   in heavy congestion and quality of service degradation if the path
   determination procedure does not take the bandwidth amount into
   account, since the number of upstream Tier-1 devices is likely
   greater than two.

   To avoid this situation, parallel links can be grouped in link
   aggregation groups (LAGs, such as [IEEE8023AD]) with widely available
   implementation settings that take the whole "bundle" down upon a
   single link failure. Equivalent techniques that enforce "fate
   sharing" on the parallel links can be used in place of LAGs to
   achieve the same effect. As a result of such fate-sharing, traffic
   from two or more failed links will be re-balanced over the multitude
   of remaining paths that equals the number of Tier-1 devices. This
   example uses two links for simplicity; having more links in a bundle
   will have less impact on capacity upon a member-link failure.
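
   The congestion scenario above and the benefit of fate-sharing can be
   illustrated with a small, purely hypothetical calculation; the device
   count, bundle size and traffic volume below are arbitrary values
   chosen only to show the mechanism.

      # Per-link load toward one Tier-2 device, assuming senders ECMP
      # their traffic evenly across Tier-1 devices (i.e. next hops are
      # not weighted by remaining bundle bandwidth).
      def per_link_load(tier1_count, links_per_bundle, traffic,
                        failed_links=0, fate_sharing=False):
          if fate_sharing and failed_links > 0:
              # Whole bundle withdrawn: traffic re-balances across the
              # remaining Tier-1 devices, each with a full bundle.
              return traffic / ((tier1_count - 1) * links_per_bundle)
          # No fate-sharing: the affected Tier-1 device still attracts
          # its full share, squeezed onto the surviving bundle members.
          return traffic / (tier1_count * (links_per_bundle - failed_links))

      # 4 Tier-1 devices, 2-link bundles, 800 units of traffic:
      print(per_link_load(4, 2, 800))                  # 100.0, no failure
      print(per_link_load(4, 2, 800, failed_links=1))  # 200.0 on survivor
      print(per_link_load(4, 2, 800, 1, True))         # ~133.3, fate-sharing

   In this sketch a single-link failure doubles the load on the
   surviving member without fate-sharing, while withdrawing the whole
   bundle spreads the same traffic across the remaining Tier-1 devices
   at a much smaller per-link increase.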

skipping to change at page 30, line 11

10.  IANA Considerations

   This document includes no request to IANA.

11.  Acknowledgements

   This publication summarizes work of many people who participated in
   developing, testing and deploying the proposed network design, some
   of whom were George Chen, Parantap Lahiri, Dave Maltz, Edet Nkposong,
   Robert Toomey, and Lihua Yuan. The authors would also like to thank
   Linda Dunbar, Susan Hares, Danny McPherson, Robert Raszuk and Russ
   White for reviewing the document and providing valuable feedback and
   Mary Mitchell for grammar and style suggestions.

12.  References

12.1.  Normative References

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006,
              <http://www.rfc-editor.org/info/rfc4271>.
