draft-ietf-rtgwg-bgp-routing-large-dc-00.txt   draft-ietf-rtgwg-bgp-routing-large-dc-01.txt 
Routing Area Working Group P. Lapukhov Routing Area Working Group P. Lapukhov
Internet-Draft Facebook Internet-Draft Facebook
Intended status: Informational A. Premji Intended status: Informational A. Premji
Expires: February 15, 2015 Arista Networks Expires: August 16, 2015 Arista Networks
J. Mitchell, Ed. J. Mitchell, Ed.
Microsoft Corporation Microsoft Corporation
August 14, 2014 February 12, 2015
Use of BGP for routing in large-scale data centers Use of BGP for routing in large-scale data centers
draft-ietf-rtgwg-bgp-routing-large-dc-00 draft-ietf-rtgwg-bgp-routing-large-dc-01
Abstract Abstract
Some network operators build and operate data centers that support Some network operators build and operate data centers that support
over one hundred thousand servers. In this document, such data over one hundred thousand servers. In this document, such data
centers are referred to as "large-scale" to differentiate them from centers are referred to as "large-scale" to differentiate them from
smaller infrastructures. Environments of this scale have a unique smaller infrastructures. Environments of this scale have a unique
set of network requirements with an emphasis on operational set of network requirements with an emphasis on operational
simplicity and network stability. This document summarizes simplicity and network stability. This document summarizes
operational experience in designing and operating large-scale data operational experience in designing and operating large-scale data
skipping to change at page 1, line 42 skipping to change at page 1, line 42
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on February 15, 2015. This Internet-Draft will expire on August 16, 2015.
Copyright Notice Copyright Notice
Copyright (c) 2014 IETF Trust and the persons identified as the Copyright (c) 2015 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
skipping to change at page 3, line 8 skipping to change at page 3, line 8
6.4. Consistent Hashing . . . . . . . . . . . . . . . . . . . 21 6.4. Consistent Hashing . . . . . . . . . . . . . . . . . . . 21
7. Routing Convergence Properties . . . . . . . . . . . . . . . 21 7. Routing Convergence Properties . . . . . . . . . . . . . . . 21
7.1. Fault Detection Timing . . . . . . . . . . . . . . . . . 21 7.1. Fault Detection Timing . . . . . . . . . . . . . . . . . 21
7.2. Event Propagation Timing . . . . . . . . . . . . . . . . 22 7.2. Event Propagation Timing . . . . . . . . . . . . . . . . 22
7.3. Impact of Clos Topology Fan-outs . . . . . . . . . . . . 22 7.3. Impact of Clos Topology Fan-outs . . . . . . . . . . . . 22
7.4. Failure Impact Scope . . . . . . . . . . . . . . . . . . 23 7.4. Failure Impact Scope . . . . . . . . . . . . . . . . . . 23
7.5. Routing Micro-Loops . . . . . . . . . . . . . . . . . . . 24 7.5. Routing Micro-Loops . . . . . . . . . . . . . . . . . . . 24
8. Additional Options for Design . . . . . . . . . . . . . . . . 25 8. Additional Options for Design . . . . . . . . . . . . . . . . 25
8.1. Third-party Route Injection . . . . . . . . . . . . . . . 25 8.1. Third-party Route Injection . . . . . . . . . . . . . . . 25
8.2. Route Summarization within Clos Topology . . . . . . . . 25 8.2. Route Summarization within Clos Topology . . . . . . . . 25
8.2.1. Collapsing Tier-1 Devices Layer . . . . . . . . . . . 25 8.2.1. Collapsing Tier-1 Devices Layer . . . . . . . . . . . 26
8.2.2. Simple Virtual Aggregation . . . . . . . . . . . . . 27 8.2.2. Simple Virtual Aggregation . . . . . . . . . . . . . 27
8.3. ICMP Unreachable Message Masquerading . . . . . . . . . . 27 8.3. ICMP Unreachable Message Masquerading . . . . . . . . . . 27
9. Security Considerations . . . . . . . . . . . . . . . . . . . 28 9. Security Considerations . . . . . . . . . . . . . . . . . . . 28
10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 28 10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 28
11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 28 11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 28
12. References . . . . . . . . . . . . . . . . . . . . . . . . . 28 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 29
12.1. Normative References . . . . . . . . . . . . . . . . . . 28 12.1. Normative References . . . . . . . . . . . . . . . . . . 29
12.2. Informative References . . . . . . . . . . . . . . . . . 29 12.2. Informative References . . . . . . . . . . . . . . . . . 29
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 31 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 31
1. Introduction 1. Introduction
This document describes a practical routing design that can be used This document describes a practical routing design that can be used
in a large-scale data center ("DC") design. Such data centers, also in a large-scale data center ("DC") design. Such data centers, also
known as hyper-scale or warehouse-scale data-centers, have a unique known as hyper-scale or warehouse-scale data-centers, have a unique
attribute of supporting over a hundred thousand servers. In order to attribute of supporting over a hundred thousand servers. In order to
accommodate networks of this scale, operators are revisiting accommodate networks of this scale, operators are revisiting
skipping to change at page 17, line 35 skipping to change at page 17, line 35
devices, or WAN Routers. Tier-3 devices in such cluster would be devices, or WAN Routers. Tier-3 devices in such cluster would be
replaced with WAN routers, and EBGP peering would be used again, replaced with WAN routers, and EBGP peering would be used again,
though WAN routers are likely to belong to a public ASN if Internet though WAN routers are likely to belong to a public ASN if Internet
connectivity is required in the design. The Tier-2 devices in such a connectivity is required in the design. The Tier-2 devices in such a
dedicated cluster will be referred to as "Border Routers" in this dedicated cluster will be referred to as "Border Routers" in this
document. These devices have to perform a few special functions: document. These devices have to perform a few special functions:
o Hide network topology information when advertising paths to WAN o Hide network topology information when advertising paths to WAN
routers, i.e. remove Private BGP ASNs from the AS_PATH attribute. routers, i.e. remove Private BGP ASNs from the AS_PATH attribute.
This is typically done to avoid ASN collisions between This is typically done to avoid ASN collisions between
different data centers. An implementation specific BGP feature different data centers and also to provide a uniform AS_PATH
typically called "Remove Private AS" is commonly used to length to the WAN for purposes of WAN ECMP to Anycast prefixes
originated in the topology. An implementation specific BGP
feature typically called "Remove Private AS" is commonly used to
accomplish this. Depending on implementation, the feature should accomplish this. Depending on implementation, the feature should
strip a contiguous sequence of private ASNs found in AS_PATH strip a contiguous sequence of private ASNs found in AS_PATH
attribute prior to advertising the path to a neighbor. This attribute prior to advertising the path to a neighbor. This
assumes that all BGP ASNs used for intra-data-center numbering assumes that all BGP ASNs used for intra-data-center numbering
are from the private ASN range. The process for stripping the are from the private ASN range. The process for stripping the
private ASNs is not currently standardized, but most private ASNs is not currently standardized, but most
implementations follow the logic described in the vendor implementations follow the logic described in the vendor
document [REMOVE-PRIVATE-AS]. document [REMOVE-PRIVATE-AS].
o Originate a default route to the data center devices. This is the o Originate a default route to the data center devices. This is the
only place where default route can be originated, as route only place where default route can be originated, as route
summarization is risky for the "scale-out" topology. summarization is risky for the "scale-out" topology.
Alternatively, Border Routers may simply relay the default route Alternatively, Border Routers may simply relay the default route
learned from WAN routers. Advertising the default route from learned from WAN routers. Advertising the default route from
Border Routers requires that all Border Routers be fully Border Routers requires that all Border Routers be fully
connected to the WAN Routers upstream, to provide resistance to a connected to the WAN Routers upstream, to provide resistance to a
single-link failure causing the black holing of traffic. To single-link failure causing the black-holing of traffic. To
prevent chance of operator or implementation error that may impact prevent chance of operator or implementation error that may impact
EBGP sessions to the WAN routers simultaneously (although these EBGP sessions to the WAN routers simultaneously (although these
scenarios are not planned for by many operators since they scenarios are not planned for by many operators since they
represent a multiple failure), it is more desirable to take this represent a multiple failure), it is more desirable to take this
approach rather than introducing complicated conditional default approach rather than introducing complicated conditional default
origination schemes provided by some implementations. origination schemes provided by some implementations.
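The private ASN stripping described in the first bullet above can be sketched as follows. This is only an illustration, not the behavior of any particular implementation: it assumes the private-use ranges of RFC 6996, assumes that only the contiguous run of private ASNs at the front of AS_PATH is removed, and leaves out the prepending of the speaker's own ASN. The function names and the ASN values are hypothetical.

```python
from typing import List

# Private-use ASN ranges (16-bit and 32-bit) as reserved by RFC 6996.
PRIVATE_ASN_RANGES = ((64512, 65534), (4200000000, 4294967294))


def is_private_asn(asn: int) -> bool:
    """Return True if the ASN falls inside a private-use range."""
    return any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES)


def strip_leading_private_asns(as_path: List[int]) -> List[int]:
    """Drop the contiguous run of private ASNs at the front of AS_PATH.

    Assumption: stripping stops at the first non-private ASN, so private
    ASNs that appear after a public ASN are left in place.
    """
    index = 0
    while index < len(as_path) and is_private_asn(as_path[index]):
        index += 1
    return as_path[index:]


# Hypothetical paths for the same Anycast prefix, as seen by a Border Router.
# The two paths differ in length inside the data center.
short_path = [64601, 65001]
long_path = [64601, 65001, 65101]

# Both collapse to the same (empty) path, so the WAN sees a uniform length.
print(strip_leading_private_asns(short_path))
print(strip_leading_private_asns(long_path))

# 64496 is a documentation ASN (RFC 5398) standing in for a public ASN.
print(strip_leading_private_asns([64601, 64496, 65001]))
```

Because every intra data center ASN is assumed to be private, paths of different internal length reduce to the same stripped path, which is what gives the WAN a uniform AS_PATH length for ECMP.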
5.2.5. Route Summarization at the Edge 5.2.5. Route Summarization at the Edge
It is often desirable to summarize network reachability information It is often desirable to summarize network reachability information
skipping to change at page 25, line 39 skipping to change at page 25, line 39
8.2. Route Summarization within Clos Topology 8.2. Route Summarization within Clos Topology
As mentioned previously, route summarization is not possible within As mentioned previously, route summarization is not possible within
the proposed Clos topology since it makes the network susceptible to the proposed Clos topology since it makes the network susceptible to
route black-holing under single link failures. The main problem is route black-holing under single link failures. The main problem is
the limited number of parallel paths between network elements, e.g. the limited number of parallel paths between network elements, e.g.
there is only a single path between any pair of Tier-1 and Tier-3 there is only a single path between any pair of Tier-1 and Tier-3
devices. However, some operators may find route aggregation devices. However, some operators may find route aggregation
desirable to improve control plane stability. desirable to improve control plane stability.
If planning to use any summarization technique within the topology,
the routing behavior and the potential for black-holing should be
modeled not only for single or multiple link failures, but also for
fiber pathway failures or optical domain failures if the topology
extends beyond a single physical location. Simple modeling can be
done by checking reachability on the devices doing summarization
under the condition of a link or pathway failure between a set of
devices in every Tier, as well as to the WAN routers if external
connectivity is present.
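A minimal sketch of such modeling is shown below. It is a simplification under stated assumptions: only single link failures are considered, a summarizing device is assumed to forward traffic for the summary only downward (toward a higher Tier number), and the device names, the topology, and the helper functions are all hypothetical. A pathway or multiple link failure could be modeled the same way by removing a group of links at once.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Link = FrozenSet[str]


def reachable_down(start: str, target: str,
                   links: Set[Link], tier: Dict[str, int]) -> bool:
    """Depth-first search that only follows links toward a higher Tier number."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for link in links:
            if node in link:
                (peer,) = link - {node}
                if tier[peer] > tier[node] and peer not in seen:
                    seen.add(peer)
                    stack.append(peer)
    return False


def black_holes(links: Set[Link], tier: Dict[str, int],
                summarizers: List[str],
                covered: List[str]) -> List[Tuple[Tuple[str, ...], str, str]]:
    """List (failed link, summarizer, destination) cases that black-hole traffic."""
    findings = []
    for failed in links:
        remaining = links - {failed}
        for device in summarizers:
            for destination in covered:
                if not reachable_down(device, destination, remaining, tier):
                    findings.append((tuple(sorted(failed)), device, destination))
    return sorted(findings)


# Hypothetical cluster: two Tier-2 devices fully meshed with two Tier-3 devices.
tier = {"t2a": 2, "t2b": 2, "t3a": 3, "t3b": 3}
links = {frozenset(pair) for pair in
         [("t2a", "t3a"), ("t2a", "t3b"), ("t2b", "t3a"), ("t2b", "t3b")]}

# Both Tier-2 devices advertise a summary covering both Tier-3 devices.
findings = black_holes(links, tier, ["t2a", "t2b"], ["t3a", "t3b"])
for failed_link, device, destination in findings:
    print(f"link {failed_link} down: {device} black-holes traffic to {destination}")
```

In this small topology every single link failure leaves one summarizing device without a downward path to one covered destination, which illustrates why summarization is risky when there is only one path between a pair of devices.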
Route summarization would be possible with a small modification to Route summarization would be possible with a small modification to
the network topology, though the trade-off would be reduction of the the network topology, though the trade-off would be reduction of the
total size of the network as well as network congestion under total size of the network as well as network congestion under
specific failures. This approach is very similar to the technique specific failures. This approach is very similar to the technique
described above, which allows Border Routers to summarize the entire described above, which allows Border Routers to summarize the entire
data-center address space. data-center address space.
8.2.1. Collapsing Tier-1 Devices Layer 8.2.1. Collapsing Tier-1 Devices Layer
In order to add more paths between Tier-1 and Tier-3 devices, group In order to add more paths between Tier-1 and Tier-3 devices, group
skipping to change at page 29, line 49 skipping to change at page 30, line 13
2012. 2012.
[RFC7130] Bhatia, M., Chen, M., Boutros, S., Binderberger, M., and [RFC7130] Bhatia, M., Chen, M., Boutros, S., Binderberger, M., and
J. Haas, "Bidirectional Forwarding Detection (BFD) on Link J. Haas, "Bidirectional Forwarding Detection (BFD) on Link
Aggregation Group (LAG) Interfaces", RFC 7130, February Aggregation Group (LAG) Interfaces", RFC 7130, February
2014. 2014.
[I-D.ietf-idr-add-paths] [I-D.ietf-idr-add-paths]
Walton, D., Retana, A., Chen, E., and J. Scudder, Walton, D., Retana, A., Chen, E., and J. Scudder,
"Advertisement of Multiple Paths in BGP", draft-ietf-idr- "Advertisement of Multiple Paths in BGP", draft-ietf-idr-
add-paths-09 (work in progress), October 2013. add-paths-10 (work in progress), October 2014.
[I-D.ietf-idr-link-bandwidth] [I-D.ietf-idr-link-bandwidth]
Mohapatra, P. and R. Fernando, "BGP Link Bandwidth Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
Extended Community", draft-ietf-idr-link-bandwidth-06 Extended Community", draft-ietf-idr-link-bandwidth-06
(work in progress), January 2013. (work in progress), January 2013.
[GREENBERG2009] [GREENBERG2009]
Greenberg, A., Hamilton, J., and D. Maltz, "The Cost of a Greenberg, A., Hamilton, J., and D. Maltz, "The Cost of a
Cloud: Research Problems in Data Center Networks", January Cloud: Research Problems in Data Center Networks", January
2009. 2009.
skipping to change at page 30, line 35 skipping to change at page 30, line 45
[INTERCON] [INTERCON]
Dally, W. and B. Towles, "Principles and Practices of Dally, W. and B. Towles, "Principles and Practices of
Interconnection Networks", ISBN 978-0122007514, January Interconnection Networks", ISBN 978-0122007514, January
2004. 2004.
[ALFARES2008] [ALFARES2008]
Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable, Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable,
Commodity Data Center Network Architecture", August 2008. Commodity Data Center Network Architecture", August 2008.
[IANA.AS] IANA, "Autonomous System (AS) Numbers", August 2014, [IANA.AS] IANA, "Autonomous System (AS) Numbers", February 2015,
<http://www.iana.org/assignments/as-numbers/>. <http://www.iana.org/assignments/as-numbers/>.
[IEEE8023AD] [IEEE8023AD]
IEEE 802.3ad, "IEEE Standard for Link aggregation for IEEE 802.3ad, "IEEE Standard for Link aggregation for
parallel links", October 2000. parallel links", October 2000.
[REMOVE-PRIVATE-AS] [REMOVE-PRIVATE-AS]
Cisco Systems, "Removing Private Autonomous System Cisco Systems, "Removing Private Autonomous System
Numbers in BGP", August 2005, Numbers in BGP", August 2005,
<http://www.cisco.com/en/US/tech/tk365/ <http://www.cisco.com/en/US/tech/tk365/
 End of changes. 12 change blocks. 
13 lines changed or deleted 25 lines changed or added
