Network Working Group                                  C. Filsfils, Ed.
Internet-Draft                                          S. Previdi, Ed.
Intended status: Informational                      Cisco Systems, Inc.
Expires: September 10, 2017                                 J. Mitchell
                                                            Unaffiliated
                                                                E. Aries
                                                        Juniper Networks
                                                             P. Lapukhov
                                                                Facebook
                                                           March 9, 2017

            BGP-Prefix Segment in large-scale data centers
               draft-ietf-spring-segment-routing-msdc-04
Abstract

This document describes the motivation and benefits for applying
segment routing in BGP-based large-scale data-centers.  It describes
the design to deploy segment routing in those data-centers, for both
the MPLS and IPv6 dataplanes.

Requirements Language
skipping to change at page 1, line 45
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current
Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on September 10, 2017.
Copyright Notice

Copyright (c) 2017 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
skipping to change at page 2, line 35
3.  Some open problems in large data-center networks  . . . . . . .   5
4.  Applying Segment Routing in the DC with MPLS dataplane . . . . .   6
  4.1.  BGP Prefix Segment (BGP-Prefix-SID)  . . . . . . . . . . . .   6
  4.2.  eBGP Labeled Unicast (RFC3107) . . . . . . . . . . . . . . .   7
    4.2.1.  Control Plane  . . . . . . . . . . . . . . . . . . . . .   7
    4.2.2.  Data Plane . . . . . . . . . . . . . . . . . . . . . . .   9
    4.2.3.  Network Design Variation . . . . . . . . . . . . . . . .  10
    4.2.4.  Global BGP Prefix Segment through the fabric . . . . . .  10
    4.2.5.  Incremental Deployments  . . . . . . . . . . . . . . . .  11
  4.3.  iBGP Labeled Unicast (RFC3107) . . . . . . . . . . . . . . .  12
5.  Applying Segment Routing in the DC with IPv6 dataplane . . . . .  14
6.  Communicating path information to the host . . . . . . . . . . .  14
7.  Addressing the open problems  . . . . . . . . . . . . . . . . .  15
  7.1.  Per-packet and flowlet switching . . . . . . . . . . . . . .  15
  7.2.  Performance-aware routing  . . . . . . . . . . . . . . . . .  16
  7.3.  Deterministic network probing  . . . . . . . . . . . . . . .  17
8.  Additional Benefits . . . . . . . . . . . . . . . . . . . . . .  18
  8.1.  MPLS Dataplane with operational simplicity . . . . . . . . .  18
  8.2.  Minimizing the FIB table . . . . . . . . . . . . . . . . . .  18
  8.3.  Egress Peer Engineering  . . . . . . . . . . . . . . . . . .  18
  8.4.  Anycast  . . . . . . . . . . . . . . . . . . . . . . . . . .  19
9.  Preferred SRGB Allocation . . . . . . . . . . . . . . . . . . .  19
10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . .  20
11. Manageability Considerations  . . . . . . . . . . . . . . . . .  20
12. Security Considerations . . . . . . . . . . . . . . . . . . . .  21
13. Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . .  21
14. Contributors  . . . . . . . . . . . . . . . . . . . . . . . . .  21
15. References  . . . . . . . . . . . . . . . . . . . . . . . . . .  22
  15.1.  Normative References  . . . . . . . . . . . . . . . . . . .  22
  15.2.  Informative References  . . . . . . . . . . . . . . . . . .  23
Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . . .  24
1.  Introduction

Segment Routing (SR), as described in
[I-D.ietf-spring-segment-routing], leverages the source routing
paradigm.  A node steers a packet through an ordered list of
instructions, called segments.  A segment can represent any
instruction, topological or service-based.  A segment can have a
local semantic to an SR node or global within an SR domain.  SR
allows enforcing a flow through any topological path and service
skipping to change at page 3, line 34
and MPLS dataplane.

2.  Large Scale Data Center Network Design Summary

This section provides a brief summary of the informational document
[RFC7938] that outlines a practical network design suitable for
data-centers of various scales:
o  Data-center networks have highly symmetric topologies with
   multiple parallel paths between two server attachment points.  The
   well-known Clos topology is most popular among the operators (as
   described in [RFC7938]).  In a Clos topology, the minimum number
   of parallel paths between two elements is determined by the
   "width" of the "Tier-1" stage.  See Figure 1 below for an
   illustration of the concept.

o  Large-scale data-centers commonly use a routing protocol, such as
   BGP4 [RFC4271] in order to provide endpoint connectivity.
   Recovery after a network failure is therefore driven either by
   local knowledge of directly available backup paths or by
   distributed signaling between the network devices.

o  Within data-center networks, traffic is load-shared using the
   Equal Cost Multipath (ECMP) mechanism.  With ECMP, every network
   device implements a pseudo-random decision, mapping packets to one
skipping to change at page 4, line 43
   |NODE |  |NODE |   Tier-3   +->|NODE |--+   Tier-3   |NODE |  |NODE |
   |  1  |  |  2  |               |  8  |               | 11  |  | 12  |
   +-----+  +-----+               +-----+               +-----+  +-----+
      |        |                                           |        |
      A O      B O            <- Servers ->                Z O      O O

                     Figure 1: 5-stage Clos topology

In the reference topology illustrated in Figure 1, we assume:
o  Each node is its own AS (Node X has AS X).  4-byte AS numbers are
   recommended ([RFC6793]).

   *  For simple and efficient route propagation filtering, Node5,
      Node6, Node7 and Node8 use the same AS, Node3 and Node4 use the
      same AS, Node9 and Node10 use the same AS.

   *  In case 2-byte autonomous system numbers are used, and for
      efficient usage of the scarce 2-byte Private Use AS pool,
      different Tier-3 nodes might use the same AS.

   *  Without loss of generality, we will simplify these details in
      this document and assume that each node has its own AS.

o  Each node peers with its neighbors with a BGP session.  If not
   specified, eBGP is assumed.  In a specific use-case, iBGP will be
   used but this will be called out explicitly in that case.

o  Each node originates the IPv4 address of its loopback interface
skipping to change at page 6, line 40
The BGP Prefix Segment is a network-wide instruction to forward the
packet along the ECMP-aware best path to the related prefix.

The BGP Prefix Segment is defined as the BGP-Prefix-SID Attribute in
[I-D.ietf-idr-bgp-prefix-sid], which contains an index.  Throughout
this document the BGP Prefix Segment Attribute is referred to as the
BGP-Prefix-SID and the encoded index as the label-index.
In this document, we make the network design decision to assume that
all the nodes are allocated the same SRGB (Segment Routing Global
Block), e.g. [16000, 23999].  This provides operational
simplification as explained in Section 9, but this is not a
requirement.

Note well that the use of a common SRGB in all nodes is not a
requirement; one could use a different SRGB at every node.  However,
this would make the operation of the DC fabric more complex, as the
label allocated to the loopback of a remote node would then differ
at every node.  This may also increase the complexity of the
centralized controller.  More on the SRGB allocation scheme is
described in Section 9.
For illustration purposes, when considering an MPLS data-plane, we
assume that the label-index allocated to prefix 192.0.2.x/32 is X.
As a result, a local label (16000+x) is allocated for prefix
192.0.2.x/32 by each node throughout the DC fabric.
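As an aside, the label derivation above can be sketched in a few
lines of Python.  This is purely illustrative and not part of the
design; the constant and function names are invented here, and only
the SRGB [16000, 23999] and the "label-index = node number"
convention are taken from the text above.

   # Illustrative sketch: deriving the MPLS label associated with a
   # BGP-Prefix-SID, assuming the common SRGB [16000, 23999] and a
   # label-index equal to the node number (e.g. Node11 -> index 11).

   SRGB_BASE = 16000
   SRGB_SIZE = 8000        # SRGB is [16000, 23999]

   def prefix_sid_label(label_index):
       """Label programmed by every node for a given label-index."""
       if not 0 <= label_index < SRGB_SIZE:
           raise ValueError("label-index outside the SRGB")
       return SRGB_BASE + label_index

   # Node11's loopback 192.0.2.11/32 carries label-index 11, so every
   # node in the fabric installs label 16011 for it.
   assert prefix_sid_label(11) == 16011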
When the IPv6 data-plane is considered, we assume that Node X is
allocated IPv6 address (segment) 2001:DB8::X.
4.2.  eBGP Labeled Unicast (RFC3107)

Referring to Figure 1 and [RFC7938], the following design
modifications are introduced:

o  Each node peers with its neighbors via an eBGP session with the
   extensions defined in [RFC3107] (named "eBGP3107" throughout this
   document) and with the BGP-Prefix-SID attribute extension defined
   in [I-D.ietf-idr-bgp-prefix-sid].

o  The forwarding plane at Tier-2 and Tier-1 is MPLS.

o  The forwarding plane at Tier-3 is either IP2MPLS (if the host
   sends IP traffic) or MPLS2MPLS (if the host sends MPLS-
   encapsulated traffic).

Figure 2 zooms into a path from server A to server Z within the
topology of Figure 1.
skipping to change at page 7, line 38
      |          |
   +-----+    +-----+
   |NODE |    |NODE |
   |  1  |    | 11  |
   +-----+    +-----+
      |          |
      A  <- Servers ->  Z

       Figure 2: Path from A to Z via nodes 1, 4, 7, 10 and 11
Referring to Figure 1 and Figure 2, and assuming the IP address, AS,
and label-index allocations previously described, the following
sections detail the control plane operation and the data plane
states for the prefix 192.0.2.11/32 (loopback of Node11).

4.2.1.  Control Plane

Node11 originates 192.0.2.11/32 in BGP and allocates to it a
BGP-Prefix-SID with label-index: index11
([I-D.ietf-idr-bgp-prefix-sid]).

Node11 sends the following eBGP3107 update to Node10:
skipping to change at page 12, line 27
fabric), the operator incrementally enjoys the global prefix segment
benefits as the deployment progresses through the fabric.

4.3.  iBGP Labeled Unicast (RFC3107)

The same design as for eBGP3107 is used, with the following
modifications:

   All nodes use the same AS number.

   Each node peers with its neighbors via an internal BGP session
   (iBGP) with the extensions defined in [RFC3107] (named "iBGP3107"
   throughout this document) and with the BGP-Prefix-SID attribute
   extension defined in [I-D.ietf-idr-bgp-prefix-sid].

   Each node acts as a route-reflector for each of its neighbors,
   with the next-hop-self option.  Next-hop-self is a well-known
   operational feature which consists of rewriting the next-hop of a
   BGP update prior to sending it to the neighbor.  It is common
   practice to apply next-hop-self behavior towards iBGP peers for
   eBGP-learned routes.  In the case outlined in this section, we
   propose to also apply the next-hop-self mechanism to iBGP-learned
   routes (a minimal sketch of this behavior follows the list below).
                           Cluster-1
                         +-----------+
                         |  Tier-1   |
                         |  +-----+  |
                         |  |NODE |  |
                         |  |  5  |  |
            Cluster-2    |  +-----+  |    Cluster-3
           +---------+   |           |   +---------+
           | Tier-2  |   |           |   | Tier-2  |
           | +-----+ |   |  +-----+  |   | +-----+ |
           | |NODE | |   |  |NODE |  |   | |NODE | |
           | |  3  | |   |  |  6  |  |   | |  9  | |
           | +-----+ |   |  +-----+  |   | +-----+ |
           |         |   |           |   |         |
           |         |   |           |   |         |
           | +-----+ |   |  +-----+  |   | +-----+ |
           | |NODE | |   |  |NODE |  |   | |NODE | |
           | |  4  | |   |  |  7  |  |   | | 10  | |
           | +-----+ |   |  +-----+  |   | +-----+ |
           +---------+   |           |   +---------+
                         |           |
                         |  +-----+  |
                         |  |NODE |  |
            Tier-3       |  |  8  |  |       Tier-3
   +-----+  +-----+      |  +-----+  |      +-----+  +-----+
   |NODE |  |NODE |      +-----------+      |NODE |  |NODE |
   |  1  |  |  2  |                         | 11  |  | 12  |
   +-----+  +-----+                         +-----+  +-----+

       Figure 9: iBGP Sessions with Reflection and Next-Hop-Self
   For simple and efficient route propagation filtering, and as
   illustrated in Figure 9:

      Node5, Node6, Node7 and Node8 use the same Cluster ID
      (Cluster-1)

      Node3 and Node4 use the same Cluster ID (Cluster-2)

      Node9 and Node10 use the same Cluster ID (Cluster-3)

   AIGP metric ([RFC7311]) is likely applied to the BGP-Prefix-SID as
   part of a large-scale multi-domain design such as Seamless MPLS
   [I-D.ietf-mpls-seamless-mpls].
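As a rough illustration of the reflection behavior listed above, the
following Python sketch models a route-reflector that applies
next-hop-self to an iBGP3107-learned route.  It is a hypothetical
model, not a router configuration; the class and function names are
invented, and only the prefix, label-index and loopback values follow
the conventions of this document.

   # Hypothetical model of iBGP3107 reflection with next-hop-self: a
   # node re-advertises a labeled route to its neighbors while
   # rewriting the BGP next-hop to its own loopback address.

   from dataclasses import dataclass, replace

   @dataclass(frozen=True)
   class LabeledRoute:
       prefix: str          # e.g. "192.0.2.11/32"
       label_index: int     # BGP-Prefix-SID label-index
       next_hop: str        # BGP next-hop

   def reflect_with_next_hop_self(route, my_loopback):
       """Reflect a route (also an iBGP-learned one) with next-hop-self."""
       return replace(route, next_hop=my_loopback)

   # Node10 learns Node11's loopback and reflects it towards Node7 and
   # Node8 with itself (192.0.2.10) as next-hop; the label-index is
   # left unchanged.
   learned = LabeledRoute("192.0.2.11/32", 11, "192.0.2.11")
   reflected = reflect_with_next_hop_self(learned, "192.0.2.10")
   assert reflected.next_hop == "192.0.2.10"
   assert reflected.label_index == 11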
The control-plane behavior is mostly the same as described in the
previous section: the only difference is that the eBGP3107 path
propagation is simply replaced by an iBGP3107 path reflection with
next-hop changed to self.
skipping to change at page 15, line 27
This section demonstrates how the problems described above could be
solved using the segment routing concept.  It is worth noting that
segment routing signaling and data-plane are only parts of the
solution.  Additional enhancements, e.g., the centralized controller
mentioned previously, and host networking stack support are required
to implement the proposed solutions.

7.1.  Per-packet and flowlet switching
A flowlet is defined as a burst of packets from the same flow
followed by an idle interval. [KANDULA04] developed a scheme that
uses flowlets to split traffic across multiple parallel paths in
order to optimize traffic load sharing.
With the ability to choose paths on the host, one may go from
per-flow load-sharing in the network to per-packet or per-flowlet.
The host may select different segment routing instructions either
per packet, or per flowlet, and route them over different paths.
This allows for solving the "elephant flow" problem in the
data-center and avoiding link imbalances.
Note that traditional ECMP routing could be easily simulated with
on-host path selection, using the method proposed in [GREENBERG09].
The hosts would randomly pick a Tier-2 or Tier-1 device to "bounce"
the packet off of, depending on whether the destination is under the
same Tier-2 nodes, or has to be reached across Tier-1.  The host
would use a hash function that operates on per-flow invariants, to
simulate per-flow load-sharing in the network.
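A minimal sketch of this host-side emulation, simplified to the case
where the packet is always bounced off a Tier-1 (spine) node, could
look as follows.  The segment values follow the conventions of this
document; the hashing scheme, the example addresses and all names are
hypothetical.

   # Hypothetical host-side emulation of per-flow ECMP: hash the
   # flow's invariant 5-tuple to pick one spine ("bounce") segment,
   # then append the segment of the destination ToR.

   import hashlib

   SPINE_SEGMENTS = [16005, 16006, 16007, 16008]   # Node5..Node8

   def per_flow_segments(five_tuple, dest_tor_segment):
       """Pick a spine deterministically per flow, as ECMP would."""
       digest = hashlib.sha256(repr(five_tuple).encode()).digest()
       spine = SPINE_SEGMENTS[digest[0] % len(SPINE_SEGMENTS)]
       return [spine, dest_tor_segment]

   # All packets of one flow towards HostZ (behind Node11) get the
   # same two-segment list, e.g. [16007, 16011].
   flow = ("192.0.2.101", "192.0.2.111", 6, 41641, 443)
   print(per_flow_segments(flow, 16011))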
Using Figure 1 as a reference, let us illustrate this concept
assuming that HostA has an elephant flow to HostZ called Flow-F.

Normally, a flow is hashed on to a single path.  Let's assume HostA
sends its packets associated with Flow-F with top label 16011 (the
label for the remote ToR, Node11, where HostZ is connected) and Node1
would hash all the packets of Flow-F via the same next-hop (e.g.
Node3).  Similarly, let's assume that leaf Node3 would hash all the
packets of Flow-F via the same next-hop (e.g. spine node Node5).
This normal operation would restrict the elephant flow to a small
subset of the ECMP paths to HostZ and potentially create imbalance
and congestion in the fabric.
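The flowlet-based remedy described in the next paragraph could be
sketched as follows.  This is illustrative Python only: flowlet
detection is reduced to a simple idle-time check, the threshold is
arbitrary, and all names are invented; only the label values follow
the conventions of this document.

   # Hypothetical per-flowlet path selection: when the gap since the
   # previous packet of Flow-F exceeds an idle threshold, a new
   # flowlet starts and may be mapped to a different spine segment.

   FLOWLET_IDLE = 0.0005          # 500 microseconds, arbitrary
   SPINE_SEGMENTS = [16005, 16006, 16007, 16008]
   TOR_Z = 16011                  # ToR Node11, where HostZ is attached

   class FlowletPicker:
       def __init__(self):
           self.last_seen = None
           self.index = -1

       def label_stack(self, now):
           """Return the label stack for a packet sent at time 'now'."""
           if self.last_seen is None or now - self.last_seen > FLOWLET_IDLE:
               self.index = (self.index + 1) % len(SPINE_SEGMENTS)
           self.last_seen = now
           return [SPINE_SEGMENTS[self.index], TOR_Z]

   picker = FlowletPicker()
   print(picker.label_stack(0.0000))  # first flowlet: [16005, 16011]
   print(picker.label_stack(0.0001))  # same flowlet: same stack
   print(picker.label_stack(0.0010))  # idle gap: next spine, [16006, 16011]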
Leveraging the flowlet proposal, assuming HostA is made aware of 4
disjoint paths via intermediate segments 16005, 16006, 16007 and
16008 (the BGP Prefix-SIDs of the 4 spine nodes) and also made aware
of the prefix segment of the remote ToR connected to the destination
(16011), then the application can break the elephant flow F into
flowlets F1, F2, F3, F4 and associate each flowlet with one of the
following 4 label stacks: {16005, 16011}, {16006, 16011}, {16007,
16011} and {16008, 16011}.  This would spread the load of the
elephant flow through all the ECMP paths available in the fabric and
re-
skipping to change at page 16, line 47
additionally publish this information to a centralized agent, e.g.
after a flow completes, or periodically for longer lived flows.

Next, using both local and/or global performance data, application
A.1 as well as other applications sharing the same resources in the
DC fabric may pick up the best path for the new flow, or update an
existing path (e.g.: when informed of congestion on an existing
path).
One particularly interesting instance of performance-aware routing
is dynamic fault-avoidance.  If some links or devices in the network
start discarding packets due to a fault, the end-hosts could probe
and detect the path(s) that are affected and hence steer the
affected flows away from the problem spot.  Similar logic applies to
failure cases where packets get completely black-holed, e.g., when a
link goes down and the failure is detected by the host while probing
the path.
For example, an application A.1 informed about 5 paths to Z {16005,
16011}, {16006, 16011}, {16007, 16011}, {16008, 16011} and {16011}
might use the last one by default (for simplicity).  When performance
is degrading, A.1 might then start to pin TCP flows to each of the 4
other paths (each via a distinct spine) and monitor the performance.
It would then detect the faulty path and assign a negative preference
to the faulty path to avoid further flows using it.  Gradually, over
time, it may re-assign flows on the faulty path to eventually detect
the resolution of the trouble and start reusing the path.
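One way an application could keep such per-path preferences is
sketched below.  The path list is the one used in the example above;
the scoring scheme, the re-probe probability and all names are
invented for illustration.

   # Hypothetical per-path preference table: paths that show degraded
   # performance are demoted, and occasionally re-probed so that a
   # repaired path is eventually reused.

   import random

   PATHS = [(16011,), (16005, 16011), (16006, 16011),
            (16007, 16011), (16008, 16011)]

   preference = {p: 0 for p in PATHS}       # higher is better
   REPROBE_PROBABILITY = 0.05               # arbitrary

   def report(path, degraded):
       """Feed a measurement for a flow pinned to 'path'."""
       preference[path] += -1 if degraded else 1

   def pick_path():
       """Mostly use the best path, occasionally re-test a bad one."""
       if random.random() < REPROBE_PROBABILITY:
           return min(preference, key=preference.get)
       return max(preference, key=preference.get)

   report((16005, 16011), degraded=True)   # faulty spine Node5 detected
   print(pick_path())                      # usually avoids (16005, 16011)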
skipping to change at page 17, line 28
oblivious ECMP hashing.  For example, if in the topology depicted in
Figure 1 a link between spine node Node5 and leaf node Node9 fails,
HostA may exclude the segment corresponding to Node5 from the prefix
matching the servers under Tier-2 device Node9.  In the push path
discovery model, the affected path mappings may be explicitly pushed
to all the servers for the duration of the failure.  The new mapping
would instruct them to avoid the particular Tier-1 node until the
link has recovered.  Alternatively, in the pull path discovery model,
the centralized controller may start steering new flows immediately
after it discovers the issue.  Until then, the existing flows may
recover using local detection of the path issues.
7.3.  Deterministic network probing

Active probing is a well-known technique for monitoring network
elements' health, consisting of sending continuous packet streams
simulating network traffic to the hosts in the data-center.  Segment
routing makes it possible to prescribe the exact paths that each
probe or series of probes would be taking toward their destination.
This allows for fast correlation and detection of failed paths, by
processing information from multiple actively probing agents.  This
skipping to change at page 18, line 19
As required by [RFC7938], no new signaling protocol is introduced.
The BGP-Prefix-SID is a lightweight extension to BGP Labeled Unicast
(RFC3107 [RFC3107]).  It applies either to eBGP or iBGP based
designs.

Specifically, LDP and RSVP-TE are not used.  These protocols would
drastically impact the operational complexity of the Data Center and
would not scale.  This is in line with the requirements expressed in
[RFC7938].
Provided the same SRGB is configured on all nodes, all nodes use the
same MPLS label for a given IP prefix.  This is simpler from an
operational standpoint, as discussed in Section 9.
At every node in the fabric, the same label is associated with a
given BGP-Prefix-SID and hence a notion of global prefix segment
arises.  When a controller programs HostA to send traffic to HostZ
via the normally available BGP ECMP paths, the controller uses label
16011, associated with the ToR node connected to HostZ.  The
controller does not need to pick the label based on the ToR that the
source host is connected to.

In a classic BGP Labeled Unicast design applied to the DC fabric
illustrated in Figure 1, the ToR Node1 connected to HostA would most
likely allocate a different label for 192.0.2.11/32 than the one
allocated by ToR Node2.  As a consequence, the controller would need
to adapt the SR policy to each host, based on the ToR node that it
is connected to.  This adds state maintenance and synchronization
problems.  All of this unnecessary complexity is eliminated if a
single consistent SRGB is utilized across the fabric.
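The difference in the state a controller has to maintain can be
sketched as follows.  This is illustrative Python; the per-ToR label
values shown for the classic BGP Labeled Unicast case are made up
purely to show that they may differ per ingress node.

   # With a single consistent SRGB, one mapping is valid fabric-wide:
   SRGB_BASE = 16000

   def label_anywhere(label_index):
       return SRGB_BASE + label_index

   assert label_anywhere(11) == 16011  # same label at Node1, Node2, ...

   # With per-node dynamic labels, the label for 192.0.2.11/32 depends
   # on the ingress ToR, so the controller needs a per-ToR mapping
   # (the values below are arbitrary examples):
   label_at_ingress = {
       ("Node1", "192.0.2.11/32"): 20101,
       ("Node2", "192.0.2.11/32"): 21040,
   }
   # An SR policy pushed to HostA and to HostB would then have to use
   # different labels for the same destination.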
8.2.  Minimizing the FIB table

The designer may decide to switch all the traffic at Tier-1 and
Tier-2 based on MPLS, hence drastically decreasing the IP table size
at these nodes.

This is easily accomplished by encapsulating the traffic either
directly at the host or at the source ToR node, by pushing the
BGP-Prefix-SID of the destination ToR for intra-DC traffic, or the
BGP-Prefix-SID of the border node for inter-DC or
DC-to-outside-world traffic.
8.3.  Egress Peer Engineering

It is straightforward to combine the design illustrated in this
document with the Egress Peer Engineering (EPE) use-case described
in [I-D.ietf-spring-segment-routing-central-epe].

In such a case, the operator is able to engineer its outbound
traffic on a per host-flow basis, without incurring any additional
state at intermediate points in the DC fabric.
skipping to change at page 20, line 40
16011).  When combining segments to create a policy, one needs to
carefully update the label of each segment.  This is obviously more
error-prone, more complex and more difficult to troubleshoot.

10.  IANA Considerations

This document does not make any IANA request.
11.  Manageability Considerations

The design and deployment guidelines described in this document are
based on the network design described in [RFC7938].
The deployment model assumed in this document is based on a single
domain where the interconnected DCs are part of the same
administrative domain (which, of course, is split into different
autonomous systems).  The operator has full control of the whole
domain, and the usual operational and management mechanisms and
procedures are used in order to prevent any information related to
internal prefixes and topology from being leaked outside the domain.
As recommended in [I-D.ietf-spring-segment-routing], the same SRGB
SHOULD be allocated in all nodes in order to facilitate the design,
deployment and operations of the domain.
When EPE ([I-D.ietf-spring-segment-routing-central-epe]) is used (as
explained in Section 8.3), the same operational model is assumed.
EPE information is originated and propagated throughout the domain
towards an internal server and, unless explicitly configured by the
operator, no EPE information is leaked outside the domain
boundaries.
12.  Security Considerations

This document proposes to apply Segment Routing to a well known
scalability requirement expressed in [RFC7938] using the
BGP-Prefix-SID as defined in [I-D.ietf-idr-bgp-prefix-sid].
It has to be noted, as described in Section 11, that the design
illustrated in [RFC7938] and in this document refers to a deployment
model where all nodes are under the same administration.  In this
context, we assume that the operator doesn't want to leak outside of
the domain any information related to internal prefixes and
topology.  The internal information includes prefix-sid and EPE
information.  In order to prevent such leaking, the standard BGP
mechanisms (filters) are applied on the boundary of the domain.
Therefore, the solution proposed in this document does not introduce
any additional security concerns beyond those expressed in [RFC7938]
and [I-D.ietf-idr-bgp-prefix-sid].  It is assumed that the security
and confidentiality of the prefix and topology information is
preserved by outbound filters at each peering point of the domain,
as described in Section 11.
13.  Acknowledgements

The authors would like to thank Benjamin Black, Arjun Sreekantiah,
Keyur Patel, Acee Lindem and Anoop Ghanwani for their comments and
review of this document.

14.  Contributors

Gaya Nagarajan
skipping to change at page 24, line 15
[I-D.ietf-spring-segment-routing-central-epe]
           Filsfils, C., Previdi, S., Aries, E., and D. Afanasiev,
           "Segment Routing Centralized BGP Egress Peer Engineering",
           draft-ietf-spring-segment-routing-central-epe-04 (work in
           progress), February 2017.

[KANDULA04]
           Sinha, S., Kandula, S., and D. Katabi, "Harnessing TCP's
           Burstiness with Flowlet Switching", 2004.
[RFC6793]  Vohra, Q. and E. Chen, "BGP Support for Four-Octet
           Autonomous System (AS) Number Space", RFC 6793,
           DOI 10.17487/RFC6793, December 2012,
           <http://www.rfc-editor.org/info/rfc6793>.
[RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
           BGP for Routing in Large-Scale Data Centers", RFC 7938,
           DOI 10.17487/RFC7938, August 2016,
           <http://www.rfc-editor.org/info/rfc7938>.

Authors' Addresses

Clarence Filsfils (editor)
Cisco Systems, Inc.
Brussels