draft-ietf-rtgwg-bgp-routing-large-dc-07.txt   draft-ietf-rtgwg-bgp-routing-large-dc-08.txt 
Routing Area Working Group P. Lapukhov Routing Area Working Group P. Lapukhov
Internet-Draft Facebook Internet-Draft Facebook
Intended status: Informational A. Premji Intended status: Informational A. Premji
Expires: February 29, 2016 Arista Networks Expires: September 2, 2016 Arista Networks
J. Mitchell, Ed. J. Mitchell, Ed.
August 28, 2015 March 1, 2016
Use of BGP for routing in large-scale data centers Use of BGP for routing in large-scale data centers
draft-ietf-rtgwg-bgp-routing-large-dc-07 draft-ietf-rtgwg-bgp-routing-large-dc-08
Abstract Abstract
Some network operators build and operate data centers that support Some network operators build and operate data centers that support
over one hundred thousand servers. In this document, such data over one hundred thousand servers. In this document, such data
centers are referred to as "large-scale" to differentiate them from centers are referred to as "large-scale" to differentiate them from
smaller infrastructures. Environments of this scale have a unique smaller infrastructures. Environments of this scale have a unique
set of network requirements with an emphasis on operational set of network requirements with an emphasis on operational
simplicity and network stability. This document summarizes simplicity and network stability. This document summarizes
operational experience in designing and operating large-scale data operational experience in designing and operating large-scale data
skipping to change at page 1, line 41 skipping to change at page 1, line 41
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on February 29, 2016. This Internet-Draft will expire on September 2, 2016.
Copyright Notice Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
skipping to change at page 5, line 21 skipping to change at page 5, line 21
An important aspect of Operational Expenditure (OPEX) minimization is An important aspect of Operational Expenditure (OPEX) minimization is
reducing size of failure domains in the network. Ethernet networks reducing size of failure domains in the network. Ethernet networks
are known to be susceptible to broadcast or unicast traffic storms are known to be susceptible to broadcast or unicast traffic storms
that can have a dramatic impact on network performance and that can have a dramatic impact on network performance and
availability. The use of a fully routed design significantly reduces availability. The use of a fully routed design significantly reduces
the size of the data plane failure domains - i.e. limits them to the the size of the data plane failure domains - i.e. limits them to the
lowest level in the network hierarchy. However, such designs lowest level in the network hierarchy. However, such designs
introduce the problem of distributed control plane failures. This introduce the problem of distributed control plane failures. This
observation calls for simpler and fewer control plane protocols to observation calls for simpler and fewer control plane protocols to
reduce protocol interaction issues, reducing the chance of a network minimize protocol interaction issues, reducing the chance of a
meltdown. Minimizing software feature requirements as described in network meltdown. Minimizing software feature requirements as
the CAPEX section above also reduces testing and training described in the CAPEX section above also reduces testing and
requirements. personnel training requirements.
2.4. Traffic Engineering 2.4. Traffic Engineering
In any data center, application load balancing is a critical function In any data center, application load balancing is a critical function
performed by network devices. Traditionally, load balancers are performed by network devices. Traditionally, load balancers are
deployed as dedicated devices in the traffic forwarding path. The deployed as dedicated devices in the traffic forwarding path. The
problem arises in scaling load balancers under growing traffic problem arises in scaling load balancers under growing traffic
demand. A preferable solution would be able to scale load balancing demand. A preferable solution would be able to scale load balancing
layer horizontally, by adding more of the uniform nodes and layer horizontally, by adding more of the uniform nodes and
distributing incoming traffic across these nodes. In situations like distributing incoming traffic across these nodes. In situations like
skipping to change at page 8, line 36 skipping to change at page 8, line 36
| | | | | | | | | | | | | | | | | |
| | | | | | ---------> N Links | | | | | | ---------> N Links
| | | | | | | | | | | | | | | | | |
O O O O O O O O O Servers O O O O O O O O O Servers
Figure 2: 3-Stage Folded Clos topology Figure 2: 3-Stage Folded Clos topology
This topology is often also referred to as a "Leaf and Spine" This topology is often also referred to as a "Leaf and Spine"
network, where "Spine" is the name given to the middle stage of the network, where "Spine" is the name given to the middle stage of the
Clos topology (Tier-1) and "Leaf" is the name of input/output stage Clos topology (Tier-1) and "Leaf" is the name of input/output stage
(Tier-2). For uniformity, this document will refer to these layers (Tier-2). For uniformity, this document will continue referring to
using the "Tier-n" notation. these layers using the "Tier-n" notation.
3.2.2. Clos Topology Properties 3.2.2. Clos Topology Properties
The following are some key properties of the Clos topology: The following are some key properties of the Clos topology:
o The topology is fully non-blocking, or more accurately non- o The topology is fully non-blocking, or more accurately: non-
interfering, if M >= N and oversubscribed by a factor of N/M interfering, if M >= N and oversubscribed by a factor of N/M
otherwise. Here M and N are the uplink and downlink port counts otherwise. Here M and N are the uplink and downlink port counts
respectively, for a Tier-2 switch as shown in Figure 2. respectively, for a Tier-2 switch as shown in Figure 2.
o Utilizing this topology requires control and data plane support o Utilizing this topology requires control and data plane support
for ECMP with a fan-out of M or more. for ECMP with a fan-out of M or more.
o Tier-1 switches have exactly one path to every server in this o Tier-1 switches have exactly one path to every server in this
topology. This is an important property that makes route topology. This is an important property that makes route
summarization dangerous in this topology (see Section 8.2 below). summarization dangerous in this topology (see Section 8.2 below).
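The oversubscription property listed above can be illustrated with a small calculation. The following is a minimal sketch, not part of the draft: the function name and its parameters are hypothetical, and it only restates the rule that a Tier-2 switch with M uplinks and N downlinks is non-blocking when M >= N and oversubscribed by N/M otherwise.

```python
from fractions import Fraction


def oversubscription(uplinks_m: int, downlinks_n: int) -> Fraction:
    """Return the oversubscription factor of a Tier-2 switch.

    Hypothetical helper: a result of 1 means non-blocking (M >= N),
    any larger value is the N/M factor described in the text.
    """
    if uplinks_m <= 0 or downlinks_n <= 0:
        raise ValueError("port counts must be positive")
    if uplinks_m >= downlinks_n:
        return Fraction(1)
    # Exact ratio, so 48 downlinks over 16 uplinks yields 3, not 3.0000001.
    return Fraction(downlinks_n, uplinks_m)


if __name__ == "__main__":
    # Example port counts are illustrative only.
    print(oversubscription(16, 48))
    print(oversubscription(32, 32))
```

A switch in this sketch with 16 uplinks and 48 downlinks is oversubscribed 3:1, and the ECMP fan-out it needs to support is at least its uplink count M.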
skipping to change at page 11, line 14 skipping to change at page 11, line 14
4. Data Center Routing Overview 4. Data Center Routing Overview
This section provides an overview of three general types of data This section provides an overview of three general types of data
center protocol designs - Layer 2 only, Hybrid L2/L3 and Layer 3 center protocol designs - Layer 2 only, Hybrid L2/L3 and Layer 3
only. only.
4.1. Layer 2 Only Designs 4.1. Layer 2 Only Designs
Originally most data center designs used Spanning-Tree Protocol (STP) Originally most data center designs used Spanning-Tree Protocol (STP)
originally defined in [IEEE8021D-1990] for loop free topology initially defined in [IEEE8021D-1990] for loop-free topology
creation, typically utilizing variants of the traditional DC topology creation, typically utilizing variants of the traditional DC topology
described in Section 3.1. At the time, many DC switches either did described in Section 3.1. At the time, many DC switches either did
not support Layer 3 routed protocols or supported them with additional not support Layer 3 routed protocols or supported them with additional
licensing fees, which played a part in the design choice. Although licensing fees, which played a part in the design choice. Although
many enhancements have been made through the introduction of Rapid many enhancements have been made through the introduction of Rapid
Spanning Tree Protocol (RSTP) in the latest revision of Spanning Tree Protocol (RSTP) in the latest revision of
[IEEE8021D-2004] and Multiple Spanning Tree Protocol (MST) specified [IEEE8021D-2004] and Multiple Spanning Tree Protocol (MST) specified
in [IEEE8021Q] that improve convergence, stability and load in [IEEE8021Q] that improve convergence, stability and load
balancing in larger topologies, many of the fundamentals of the balancing in larger topologies, many of the fundamentals of the
protocol limit its applicability in large-scale DCs. STP and its protocol limit its applicability in large-scale DCs. STP and its
skipping to change at page 11, line 47 skipping to change at page 11, line 47
as Multi-Chassis Link-Aggregation (M-LAG) made it possible to use as Multi-Chassis Link-Aggregation (M-LAG) made it possible to use
Layer 2 designs with active-active network paths while relying on STP Layer 2 designs with active-active network paths while relying on STP
as the backup for loop prevention. The major downsides of this as the backup for loop prevention. The major downsides of this
approach are the lack of ability to scale linearly past two in most approach are the lack of ability to scale linearly past two in most
implementations, lack of standards based implementations, and added implementations, lack of standards based implementations, and added
failure domain risk of keeping state between the devices. failure domain risk of keeping state between the devices.
It should be noted that building large, horizontally scalable, Layer It should be noted that building large, horizontally scalable, Layer
2 only networks without STP has recently become possible through the 2 only networks without STP has recently become possible through the
introduction of the TRILL protocol in [RFC6325]. TRILL resolves many introduction of the TRILL protocol in [RFC6325]. TRILL resolves many
of the issues STP has for large-scale DC design however due to the of the issues STP has for large-scale DC design. However, due to
limited number of implementations, and often the requirement for limited number of implementations, and often the requirement for
specific equipment that supports it, this has limited its specific equipment that supports it, this has limited its
applicability and increased the cost of such designs. applicability and increased the cost of such designs.
Finally, neither the base TRILL specification nor the M-LAG approach Finally, neither the base TRILL specification nor the M-LAG approach
totally eliminates the problem of the shared broadcast domain, that is totally eliminates the fundamental problem of the shared broadcast
so detrimental to the operations of any Layer 2, Ethernet based domain, that is so detrimental to the operations of any Layer 2,
solutions. Later TRILL extensions have been proposed to solve the Ethernet based solutions. Later TRILL extensions have been proposed
this problem statement primarily based on the approaches outlined in to solve this problem primarily based on the approaches outlined in
[RFC7067], but this even further limits the number of available [RFC7067], but this even further limits the number of available
interoperable implementations that can be used to build a fabric, interoperable implementations that can be used to build a fabric,
therefore TRILL based designs have issues meeting REQ2, REQ3, and therefore TRILL based designs have issues meeting REQ2, REQ3, and
REQ4. REQ4.
4.2. Hybrid L2/L3 Designs 4.2. Hybrid L2/L3 Designs
Operators have sought to limit the impact of data plane faults and Operators have sought to limit the impact of data plane faults and
build large-scale topologies through implementing routing protocols build large-scale topologies through implementing routing protocols
in either the Tier-1 or Tier-2 parts of the network and dividing the in either the Tier-1 or Tier-2 parts of the network and dividing the
skipping to change at page 13, line 12 skipping to change at page 13, line 12
count exceeds tens of thousands, such fully routed designs have count exceeds tens of thousands, such fully routed designs have
become more attractive. become more attractive.
Choosing a Layer 3 only design greatly simplifies the network, Choosing a Layer 3 only design greatly simplifies the network,
facilitating the meeting of REQ1 and REQ2, and has widespread facilitating the meeting of REQ1 and REQ2, and has widespread
adoption in networks where large Layer 2 adjacency and larger size adoption in networks where large Layer 2 adjacency and larger size
Layer 3 subnets are not as critical compared to network scalability Layer 3 subnets are not as critical compared to network scalability
and stability. Application providers and network operators continue and stability. Application providers and network operators continue
to also develop new solutions to meet some of the requirements that to also develop new solutions to meet some of the requirements that
previously have driven large Layer 2 domains by using various overlay previously have driven large Layer 2 domains by using various overlay
or tunneling techniques. or tunneling techniques.
5. Routing Protocol Selection and Design 5. Routing Protocol Selection and Design
In this section the motivations for using External BGP (EBGP) as the In this section the motivations for using External BGP (EBGP) as the
single routing protocol for data center networks having a Layer 3 single routing protocol for data center networks having a Layer 3
protocol design and Clos topology are reviewed. Then, a practical protocol design and Clos topology are reviewed. Then, a practical
approach for designing an EBGP based network is provided. approach for designing an EBGP based network is provided.
5.1. Choosing EBGP as the Routing Protocol 5.1. Choosing EBGP as the Routing Protocol
skipping to change at page 13, line 42 skipping to change at page 13, line 42
service provider communities, it is not generally deployed as the service provider communities, it is not generally deployed as the
primary routing protocol within the data center for a number of primary routing protocol within the data center for a number of
reasons (some of which are interrelated): reasons (some of which are interrelated):
o BGP is perceived as a "WAN only protocol" and not often o BGP is perceived as a "WAN only protocol" and not often
considered for enterprise or data center applications. considered for enterprise or data center applications.
o BGP is believed to have a "much slower" routing convergence o BGP is believed to have a "much slower" routing convergence
compared to IGPs. compared to IGPs.
o Large scale BGP deployments typically utilize an IGP for BGP next- o Large scale BGP deployments typically utilize an IGP for next-hop
hop resolution as not all nodes in the iBGP topology are directly resolution as not all nodes in the iBGP topology are directly
connected. connected.
o BGP is perceived to require significant configuration overhead and o BGP is perceived to require significant configuration overhead and
does not support neighbor auto-discovery. does not support neighbor auto-discovery.
This document discusses some of these perceptions, especially as This document discusses some of these perceptions, especially as
applicable to the proposed design, and highlights some of the applicable to the proposed design, and highlights some of the
advantages of using the protocol such as: advantages of using the protocol such as:
o BGP has less complexity in parts of its protocol design - internal o BGP has less complexity in parts of its protocol design - internal
skipping to change at page 14, line 20 skipping to change at page 14, line 20
transport. This fulfills REQ2 and REQ3. transport. This fulfills REQ2 and REQ3.
o BGP information flooding overhead is less when compared to link- o BGP information flooding overhead is less when compared to link-
state IGPs. Since every BGP router calculates and propagates only state IGPs. Since every BGP router calculates and propagates only
the best-path selected, a network failure is masked as soon as the the best-path selected, a network failure is masked as soon as the
BGP speaker finds an alternate path, which exists when highly BGP speaker finds an alternate path, which exists when highly
symmetric topologies, such as Clos, are coupled with EBGP only symmetric topologies, such as Clos, are coupled with EBGP only
design. In contrast, the event propagation scope of a link-state design. In contrast, the event propagation scope of a link-state
IGP is an entire area, regardless of the failure type. In this IGP is an entire area, regardless of the failure type. In this
way, BGP better meets REQ3 and REQ4. It is also worth mentioning way, BGP better meets REQ3 and REQ4. It is also worth mentioning
that all widely deployed link-state IGPs feature periodic that all widely deployed link-state IGPs also feature periodic
refreshes of routing information, even if this rarely causes refreshes of routing information, even if this rarely causes
impact to modern router control planes, while BGP does not expire significant impact to modern router control planes, while BGP does
routing state. not expire routing state.
o BGP supports third-party (recursively resolved) next-hops. This o BGP supports third-party (recursively resolved) next-hops. This
allows for manipulating multipath to be non-ECMP based or allows for manipulating multipath to be non-ECMP based or
forwarding based on application-defined paths, through forwarding based on application-defined paths, through
establishment of a peering session with an application establishment of a peering session with an application
"controller" which can inject routing information into the system, "controller" which can inject routing information into the system,
satisfying REQ5. OSPF provides similar functionality using satisfying REQ5. OSPF provides similar functionality using
concepts such as "Forwarding Address", but with more difficulty in concepts such as "Forwarding Address", but with more difficulty in
implementation and far less control of information propagation implementation and far less control of information propagation
scope. scope.
o Using a well-defined ASN allocation scheme and standard AS_PATH o Using a well-defined ASN allocation scheme and standard AS_PATH
loop detection, "BGP path hunting" (see [JAKMA2008]) can be loop detection, "BGP path hunting" (see [JAKMA2008]) can be
controlled and complex unwanted paths will be ignored. See controlled and complex unwanted paths will be ignored. See
Section 5.2 for an example of a working ASN allocation scheme. In Section 5.2 for an example of a working ASN allocation scheme. In
a link-state IGP accomplishing the same goal would require multi- a link-state IGP accomplishing the same goal would require multi-
(instance/topology/processes) support, typically not available in (instance/topology/processes) support, typically not available in
all DC devices and quite complex to configure and troubleshoot. all DC devices and quite complex to configure and troubleshoot.
Using a traditional single flooding domain, which most DC designs Using a traditional single flooding domain, which most DC designs
utilize, under certain failure conditions may pick up unwanted utilize, under certain failure conditions may pick up unwanted
lengthy paths, e.g. traversing multiple Tier-2 devices. lengthy paths, e.g. traversing multiple Tier-2 devices.
o EBGP configuration that is implemented with minimal routing policy o EBGP configuration that is implemented with minimal routing policy
is easier to troubleshoot for network reachability issues. In is easier to troubleshoot for network reachability issues. Also,
most implementations, it is straightforward to view contents of in most implementations, it is straightforward to view contents of
BGP Loc-RIB and compare it to the router's RIB. Also, in most BGP Loc-RIB and compare it to the router's RIB. Also, in most
implementations an operator can view every BGP neighbor's Adj-RIB- implementations an operator can view every BGP neighbor's Adj-RIB-
In and Adj-RIB-Out structures and therefore incoming and outgoing In and Adj-RIB-Out structures and therefore incoming and outgoing
NLRI information can be easily correlated on both sides of a BGP NLRI information can be easily correlated on both sides of a BGP
session. Thus, BGP satisfies REQ3. session. Thus, BGP satisfies REQ3.
5.2. EBGP Configuration for Clos topology 5.2. EBGP Configuration for Clos topology
Clos topologies that have more than 5 stages are very uncommon due to Clos topologies that have more than 5 stages are very uncommon due to
the large numbers of interconnects required by such a design. the large numbers of interconnects required by such a design.
skipping to change at page 18, line 5 skipping to change at page 18, line 5
subnets in a Clos topology results in route black-holing under a subnets in a Clos topology results in route black-holing under a
single link failure (e.g. between Tier-2 and Tier-3 devices) and single link failure (e.g. between Tier-2 and Tier-3 devices) and
hence must be avoided. The use of peer links within the same tier to hence must be avoided. The use of peer links within the same tier to
resolve the black-holing problem by providing "bypass paths" is resolve the black-holing problem by providing "bypass paths" is
undesirable due to O(N^2) complexity of the peering mesh and waste of undesirable due to O(N^2) complexity of the peering mesh and waste of
ports on the devices. An alternative to the full-mesh of peer-links ports on the devices. An alternative to the full-mesh of peer-links
would be using a simpler bypass topology, e.g. a "ring" as described would be using a simpler bypass topology, e.g. a "ring" as described
in [FB4POST], but such a topology adds extra hops and has very in [FB4POST], but such a topology adds extra hops and has very
limited bisection bandwidth, in addition requiring special tweaks to limited bisection bandwidth, in addition requiring special tweaks to
make BGP routing work - such as possibly splitting every device into make BGP routing work - such as possibly splitting every device into
an ASN on its own. Later in this document, Section 8.2 introduces a an ASN on its own. Later in this document, Section 8.2
less intrusive method for performing a limited form route introduces a less intrusive method for performing a limited form of
summarization in Clos networks and discusses its associated trade- route summarization in Clos networks and discusses its
offs. associated trade-offs.
5.2.4. External Connectivity 5.2.4. External Connectivity
A dedicated cluster (or clusters) in the Clos topology could be used A dedicated cluster (or clusters) in the Clos topology could be used
for the purpose of connecting to the Wide Area Network (WAN) edge for the purpose of connecting to the Wide Area Network (WAN) edge
devices, or WAN Routers. Tier-3 devices in such a cluster would be devices, or WAN Routers. Tier-3 devices in such a cluster would be
replaced with WAN routers, and EBGP peering would be used again, replaced with WAN routers, and EBGP peering would be used again,
though WAN routers are likely to belong to a public ASN if Internet though WAN routers are likely to belong to a public ASN if Internet
connectivity is required in the design. The Tier-2 devices in such a connectivity is required in the design. The Tier-2 devices in such a
dedicated cluster will be referred to as "Border Routers" in this dedicated cluster will be referred to as "Border Routers" in this
skipping to change at page 18, line 34 skipping to change at page 18, line 34
between different data centers and also to provide a uniform between different data centers and also to provide a uniform
AS_PATH length to the WAN for purposes of WAN ECMP to Anycast AS_PATH length to the WAN for purposes of WAN ECMP to Anycast
prefixes originated in the topology. An implementation specific prefixes originated in the topology. An implementation specific
BGP feature typically called "Remove Private AS" is commonly used BGP feature typically called "Remove Private AS" is commonly used
to accomplish this. Depending on implementation, the feature to accomplish this. Depending on implementation, the feature
should strip a contiguous sequence of Private Use ASNs found in should strip a contiguous sequence of Private Use ASNs found in
AS_PATH attribute prior to advertising the path to a neighbor. AS_PATH attribute prior to advertising the path to a neighbor.
This assumes that all ASNs used for intra data center numbering This assumes that all ASNs used for intra data center numbering
are from the Private Use ranges. The process for stripping the are from the Private Use ranges. The process for stripping the
Private Use ASNs is not currently standardized, see Private Use ASNs is not currently standardized, see
[I-D.mitchell-grow-remove-private-as]. However most [I-D.mitchell-grow-remove-private-as]. However, most
implementations at least follow the logic described in this implementations at least follow the logic described in this
vendor's document [VENDOR-REMOVE-PRIVATE-AS], which is enough for vendor's document [REMOVE-PRIVATE-AS].
the design specified.
o Originate a default route to the data center devices. This is the o Originate a default route to the data center devices. This is the
only place where default route can be originated, as route only place where default route can be originated, as route
summarization is risky for the "scale-out" topology. summarization is risky for the "scale-out" topology.
Alternatively, Border Routers may simply relay the default route Alternatively, Border Routers may simply relay the default route
learned from WAN routers. Advertising the default route from learned from WAN routers. Advertising the default route from
Border Routers requires that all Border Routers be fully connected Border Routers requires that all Border Routers be fully connected
to the WAN Routers upstream, to provide resistance to a single- to the WAN Routers upstream, to provide resistance to a single-
link failure causing the black-holing of traffic. To prevent link failure causing the black-holing of traffic. To prevent
black-holing in the situation when all of the EBGP sessions to the black-holing in the situation when all of the EBGP sessions to the WAN
WAN routers fail simultaneously on a given device it is more routers fail simultaneously on a given device it is more desirable
desirable to take the "relaying" approach rather than introducing to take the "relaying" approach rather than introducing the
the default route via complicated conditional route origination default route via complicated conditional route origination
schemes provided by some implementations [CONDITIONALROUTE]. schemes provided by some implementations [CONDITIONALROUTE].
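The "Remove Private AS" behavior described in the list above can be sketched as follows. This is an illustration under stated assumptions, not any vendor's implementation: the function names are hypothetical, the AS_PATH is modeled as a plain list of integers, and the sketch assumes the stripped sequence is the contiguous run at the start of the path. The Private Use ranges are those reserved by RFC 6996.

```python
# Private Use ASN ranges reserved by RFC 6996 (16-bit and 32-bit).
PRIVATE_USE_RANGES = ((64512, 65534), (4200000000, 4294967294))


def is_private_asn(asn: int) -> bool:
    """Return True if asn falls inside a Private Use range."""
    return any(low <= asn <= high for low, high in PRIVATE_USE_RANGES)


def strip_leading_private_asns(as_path: list) -> list:
    """Drop the contiguous run of Private Use ASNs at the start of as_path.

    Assumption: only the leading run is removed; real implementations
    differ in how they treat private ASNs that follow a public one.
    """
    index = 0
    while index < len(as_path) and is_private_asn(as_path[index]):
        index += 1
    return as_path[index:]


if __name__ == "__main__":
    # Example ASNs are illustrative only.
    print(strip_leading_private_asns([64601, 65001]))
```

When every ASN inside the data center comes from the Private Use ranges, a path such as [64601, 65001] is reduced to an empty list before the Border Router advertises it, so all prefixes reach the WAN with a uniform AS_PATH length.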
5.2.5. Route Summarization at the Edge 5.2.5. Route Summarization at the Edge
It is often desirable to summarize network reachability information It is often desirable to summarize network reachability information
prior to advertising it to the WAN network due to the large number of IP prior to advertising it to the WAN network due to the large number of IP
prefixes originated from within the data center in a fully routed prefixes originated from within the data center in a fully routed
network design. For example, a network with 2000 Tier-3 devices will network design. For example, a network with 2000 Tier-3 devices will
have at least 2000 server subnets advertised into BGP, along with have at least 2000 server subnets advertised into BGP, along with
the infrastructure or other prefixes. However, as discussed before, the infrastructure or other prefixes. However, as discussed before,
skipping to change at page 23, line 29 skipping to change at page 23, line 29
to space out consecutive BGP UPDATE messages by at least MRAI to space out consecutive BGP UPDATE messages by at least MRAI
seconds, which is often a configurable value. The initial BGP UPDATE seconds, which is often a configurable value. The initial BGP UPDATE
messages after an event carrying withdrawn routes are commonly not messages after an event carrying withdrawn routes are commonly not
affected by this timer. The MRAI timer may present significant affected by this timer. The MRAI timer may present significant
convergence delays when a BGP speaker "waits" for the new path to be convergence delays when a BGP speaker "waits" for the new path to be
learned from its peers and has no local backup path information. learned from its peers and has no local backup path information.
In a Clos topology each EBGP speaker typically has either one path In a Clos topology each EBGP speaker typically has either one path
(Tier-2 devices don't accept paths from other Tier-2 in the same (Tier-2 devices don't accept paths from other Tier-2 in the same
cluster due to same ASN) or N paths for the same prefix, where N is a cluster due to same ASN) or N paths for the same prefix, where N is a
significantly large number, e.g. N=32 (the ECMP fan-out to the next significantly large number, e.g. N=32 (the ECMP fan-out to the
Tier). Therefore, if a link fails to another device from which a Tier). Therefore, if a link fails to another device from which a
path is received there is either no backup path at all (e.g. from path is received there is either no backup path at all (e.g. from
perspective of a Tier-2 switch losing link to a Tier-3 device), or perspective of a Tier-2 switch losing link to a Tier-3 device), or
the backup is readily available in BGP Loc-RIB (e.g. from perspective the backup is readily available in BGP Loc-RIB (e.g. from perspective
of a Tier-2 device losing link to a Tier-1 switch). In the former of a Tier-2 device losing link to a Tier-1 switch). In the former
case, the BGP withdrawal announcement will propagate un-delayed and case, the BGP withdrawal announcement will propagate un-delayed and
trigger re-convergence on affected devices. In the latter case, the trigger re-convergence on affected devices. In the latter case, the
best-path will be re-evaluated and the local ECMP group corresponding best-path will be re-evaluated and the local ECMP group corresponding
to the new next-hop set changed. If the BGP path was the best-path to the new next-hop set changed. If the BGP path was the best-path
selected previously, an "implicit withdraw" will be sent via a BGP selected previously, an "implicit withdraw" will be sent via a BGP
skipping to change at page 25, line 25 skipping to change at page 25, line 25
ECMP groups for all IP prefixes from non-local cluster. The ECMP groups for all IP prefixes from non-local cluster. The
Tier-3 devices are once again not involved in the re-convergence Tier-3 devices are once again not involved in the re-convergence
process, but may receive "implicit withdraws" as described above. process, but may receive "implicit withdraws" as described above.
Even though in case of such failures multiple IP prefixes will have Even though in case of such failures multiple IP prefixes will have
to be reprogrammed in the FIB, it is worth noting that ALL of these to be reprogrammed in the FIB, it is worth noting that ALL of these
prefixes share a single ECMP group on Tier-2 device. Therefore, in prefixes share a single ECMP group on Tier-2 device. Therefore, in
the case of implementations with a hierarchical FIB, only a single the case of implementations with a hierarchical FIB, only a single
change has to be made to the FIB. Hierarchical FIB here means FIB change has to be made to the FIB. Hierarchical FIB here means FIB
structure where the next-hop forwarding information is stored structure where the next-hop forwarding information is stored
separately from the prefix lookup table, and the latter only stores separately from the prefix lookup table, and the latter only store
pointers to the respective forwarding information. pointers to the respective forwarding information.
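As a rough illustration of the hierarchical FIB idea described above, the following sketch keeps next-hop information in a shared ECMP group object and lets each prefix entry hold only a reference to it. All names (EcmpGroup, HierarchicalFib, the next-hop labels, and the prefixes) are hypothetical, and the lookup is a plain exact-match dictionary rather than a longest-prefix match. It models the data structure only, not any vendor's forwarding plane.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EcmpGroup:
    """Next-hop forwarding information, stored once and shared by many prefixes."""
    next_hops: set[str] = field(default_factory=set)


class HierarchicalFib:
    """Prefix table that stores only references to shared ECMP groups."""

    def __init__(self) -> None:
        self.prefix_table: dict[str, EcmpGroup] = {}
        self.group_writes = 0  # number of updates made to next-hop information

    def install(self, prefix: str, group: EcmpGroup) -> None:
        # The prefix entry points at the group; it does not copy the next hops.
        self.prefix_table[prefix] = group

    def remove_next_hop(self, group: EcmpGroup, next_hop: str) -> None:
        # One write to the shared group affects every prefix that references it.
        group.next_hops.discard(next_hop)
        self.group_writes += 1

    def lookup(self, prefix: str) -> set[str]:
        return self.prefix_table[prefix].next_hops


# A Tier-2 device with four uplinks and 100 prefixes from non-local clusters.
uplinks = EcmpGroup({"tier1-a", "tier1-b", "tier1-c", "tier1-d"})
fib = HierarchicalFib()
prefixes = [f"10.{i}.0.0/16" for i in range(1, 101)]
for prefix in prefixes:
    fib.install(prefix, uplinks)

# Losing the link to one Tier-1 device requires a single group update.
fib.remove_next_hop(uplinks, "tier1-b")
```

After the single call to remove_next_hop, every one of the 100 prefixes resolves to the three remaining next hops, because they all reference the same group object. A flat FIB that copied the next-hop set into each prefix entry would need one write per prefix instead.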
Even though BGP offers reduced failure scope for some cases, further Even though BGP offers reduced failure scope for some cases, further
reduction of the fault domain using summarization is not always reduction of the fault domain using summarization is not always
possible with the proposed design, since using this technique may possible with the proposed design, since using this technique may
create routing black-holes as mentioned previously. Therefore, the create routing black-holes as mentioned previously. Therefore, the
worst control plane failure impact scope is the network as a whole, worst control plane failure impact scope is the network as a whole,
for instance in a case of a link failure between Tier-2 and Tier-3 for instance in a case of a link failure between Tier-2 and Tier-3
devices. The amount of impacted prefixes in this case would be much devices. The amount of impacted prefixes in this case would be much
less than in the case of a failure in the upper layers of a Clos less than in the case of a failure in the upper layers of a Clos
skipping to change at page 30, line 12 skipping to change at page 30, line 12
This document includes no request to IANA. This document includes no request to IANA.
11. Acknowledgements 11. Acknowledgements
This publication summarizes work of many people who participated in This publication summarizes work of many people who participated in
developing, testing and deploying the proposed network design, some developing, testing and deploying the proposed network design, some
of whom were George Chen, Parantap Lahiri, Dave Maltz, Edet Nkposong, of whom were George Chen, Parantap Lahiri, Dave Maltz, Edet Nkposong,
Robert Toomey, and Lihua Yuan. Authors would also like to thank Robert Toomey, and Lihua Yuan. Authors would also like to thank
Linda Dunbar, Anoop Ghanwani, Susan Hares, Danny McPherson, Robert Linda Dunbar, Anoop Ghanwani, Susan Hares, Danny McPherson, Robert
Raszuk and Russ White for reviewing this document and providing Raszuk, and Russ White for reviewing this document and providing
valuable feedback and Mary Mitchell for initial grammar and style valuable feedback and Mary Mitchell for initial grammar and style
suggestions. suggestions.
12. References 12. References
12.1. Normative References 12.1. Normative References
[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
Border Gateway Protocol 4 (BGP-4)", RFC 4271, Border Gateway Protocol 4 (BGP-4)", RFC 4271,
DOI 10.17487/RFC4271, January 2006, DOI 10.17487/RFC4271, January 2006,
skipping to change at page 32, line 8 skipping to change at page 32, line 8
2014, <http://www.rfc-editor.org/info/rfc7130>. 2014, <http://www.rfc-editor.org/info/rfc7130>.
[RFC7196] Pelsser, C., Bush, R., Patel, K., Mohapatra, P., and O. [RFC7196] Pelsser, C., Bush, R., Patel, K., Mohapatra, P., and O.
Maennel, "Making Route Flap Damping Usable", RFC 7196, Maennel, "Making Route Flap Damping Usable", RFC 7196,
DOI 10.17487/RFC7196, May 2014, DOI 10.17487/RFC7196, May 2014,
<http://www.rfc-editor.org/info/rfc7196>. <http://www.rfc-editor.org/info/rfc7196>.
[I-D.ietf-idr-add-paths] [I-D.ietf-idr-add-paths]
Walton, D., Retana, A., Chen, E., and J. Scudder, Walton, D., Retana, A., Chen, E., and J. Scudder,
"Advertisement of Multiple Paths in BGP", draft-ietf-idr- "Advertisement of Multiple Paths in BGP", draft-ietf-idr-
add-paths-10 (work in progress), October 2014. add-paths-13 (work in progress), December 2015.
[I-D.ietf-idr-link-bandwidth] [I-D.ietf-idr-link-bandwidth]
Mohapatra, P. and R. Fernando, "BGP Link Bandwidth Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
Extended Community", draft-ietf-idr-link-bandwidth-06 Extended Community", draft-ietf-idr-link-bandwidth-06
(work in progress), January 2013. (work in progress), January 2013.
[I-D.mitchell-grow-remove-private-as] [I-D.mitchell-grow-remove-private-as]
Mitchell, J., Rao, D., and R. Raszuk, "Private Autonomous Mitchell, J., Rao, D., and R. Raszuk, "Private Autonomous
System (AS) Removal Requirements", draft-mitchell-grow- System (AS) Removal Requirements", draft-mitchell-grow-
remove-private-as-04 (work in progress), April 2015. remove-private-as-04 (work in progress), April 2015.
[CLOS1953] [CLOS1953]
Clos, C., "A Study of Non-Blocking Switching Networks: Clos, C., "A Study of Non-Blocking Switching Networks:
Bell System Technical Journal Vol. 32(2)", March 1953. Bell System Technical Journal Vol. 32(2)", March 1953.
[HADOOP] Apache, , "Apache HaDoop", August 2015, [HADOOP] Apache, , "Apache HaDoop", June 2015,
<https://hadoop.apache.org/>. <https://hadoop.apache.org/>.
[GREENBERG2009] [GREENBERG2009]
Greenberg, A., Hamilton, J., and D. Maltz, "The Cost of a Greenberg, A., Hamilton, J., and D. Maltz, "The Cost of a
Cloud: Research Problems in Data Center Networks", January Cloud: Research Problems in Data Center Networks", January
2009. 2009.
[IEEE8021D-1990] [IEEE8021D-1990]
IEEE 802.1D, , "IEEE Standard for Local and Metropolitan IEEE 802.1D, , "IEEE Standard for Local and Metropolitan
Area Networks--Media access control (MAC) Bridges", May Area Networks--Media access control (MAC) Bridges", May
skipping to change at page 33, line 9 skipping to change at page 33, line 9
[INTERCON] [INTERCON]
Dally, W. and B. Towles, "Principles and Practices of Dally, W. and B. Towles, "Principles and Practices of
Interconnection Networks", ISBN 978-0122007514, January Interconnection Networks", ISBN 978-0122007514, January
2004. 2004.
[ALFARES2008] [ALFARES2008]
Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable, Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable,
Commodity Data Center Network Architecture", August 2008. Commodity Data Center Network Architecture", August 2008.
[IANA.AS] IANA, , "Autonomous System (AS) Numbers", August 2015, [IANA.AS] IANA, , "Autonomous System (AS) Numbers", June 2015,
<http://www.iana.org/assignments/as-numbers/>. <http://www.iana.org/assignments/as-numbers/>.
[IEEE8023AD] [IEEE8023AD]
IEEE 802.3ad, , "IEEE Standard for Link aggregation for IEEE 802.3ad, , "IEEE Standard for Link aggregation for
parallel links", October 2000. parallel links", October 2000.
[ALLOWASIN] [ALLOWASIN]
Cisco Systems, , "Allowas-in Feature in BGP Configuration Cisco Systems, , "Allowas-in Feature in BGP Configuration
Example", February 2015, Example", February 2015,
<http://www.cisco.com/c/en/us/support/docs/ip/border- <http://www.cisco.com/c/en/us/support/docs/ip/border-
gateway-protocol-bgp/112236-allowas-in-bgp-config- gateway-protocol-bgp/112236-allowas-in-bgp-config-
example.html>. example.html>.
[VENDOR-REMOVE-PRIVATE-AS] [REMOVE-PRIVATE-AS]
Cisco Systems, , "Removing Private Autonomous System Cisco Systems, , "Removing Private Autonomous System
Numbers in BGP", August 2005, Numbers in BGP", August 2005,
<http://www.cisco.com/en/US/tech/tk365/ <http://www.cisco.com/en/US/tech/tk365/
technologies_tech_note09186a0080093f27.shtml>. technologies_tech_note09186a0080093f27.shtml>.
[CONDITIONALROUTE] [CONDITIONALROUTE]
Cisco Systems, , "Configuring and Verifying the BGP Cisco Systems, , "Configuring and Verifying the BGP
Conditional Advertisement Feature", August 2005, Conditional Advertisement Feature", August 2005,
<http://www.cisco.com/c/en/us/support/docs/ip/ <http://www.cisco.com/c/en/us/support/docs/ip/
border-gateway-protocol-bgp/16137-cond-adv.html>. border-gateway-protocol-bgp/16137-cond-adv.html>.
skipping to change at page 34, line 22 skipping to change at line 1546
Ariff Premji Ariff Premji
Arista Networks Arista Networks
5453 Great America Parkway 5453 Great America Parkway
Santa Clara, CA 95054 Santa Clara, CA 95054
US US
Email: ariff@arista.com Email: ariff@arista.com
URI: http://arista.com/ URI: http://arista.com/
Jon Mitchell (editor) Jon Mitchell (editor)
Email: jrmitche@puck.nether.net
 End of changes. 29 change blocks. 
45 lines changed or deleted 44 lines changed or added