draft-ietf-spring-segment-routing-msdc-08.txt   draft-ietf-spring-segment-routing-msdc-09.txt 
Network Working Group C. Filsfils, Ed.
Internet-Draft S. Previdi
Intended status: Informational Cisco Systems, Inc.
OLD: Expires: June 24, 2018 J. Mitchell
NEW: Expires: November 30, 2018 G. Dawra
OLD: Unaffiliated
NEW: LinkedIn
E. Aries
Juniper Networks
P. Lapukhov
Facebook
OLD: December 21, 2017
NEW: May 29, 2018

BGP-Prefix Segment in large-scale data centers

OLD: draft-ietf-spring-segment-routing-msdc-08
NEW: draft-ietf-spring-segment-routing-msdc-09
Abstract

This document describes the motivation and benefits for applying segment routing in BGP-based large-scale data-centers. It describes the design to deploy segment routing in those data-centers, for both the MPLS and IPv6 dataplanes.
Status of This Memo

skipping to change at page 1, line 39

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

OLD: This Internet-Draft will expire on June 24, 2018.
NEW: This Internet-Draft will expire on November 30, 2018.
Copyright Notice

OLD: Copyright (c) 2017 IETF Trust and the persons identified as the document authors. All rights reserved.
NEW: Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as

skipping to change at page 2, line 20
Table of Contents

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Large Scale Data Center Network Design Summary . . . . . . . 3
   2.1. Reference design . . . . . . . . . . . . . . . . . . . . 4
3. Some open problems in large data-center networks . . . . . . 5
4. Applying Segment Routing in the DC with MPLS dataplane . . . 6
   4.1. BGP Prefix Segment (BGP-Prefix-SID) . . . . . . . . . . . 6
   4.2. eBGP Labeled Unicast (RFC8277) . . . . . . . . . . . . . 6
      4.2.1. Control Plane . . . . . . . . . . . . . . . . . . . . 7
      4.2.2. Data Plane . . . . . . . . . . . . . . . . . . . . . 8
      4.2.3. Network Design Variation . . . . . . . . . . . . . . 9
      4.2.4. Global BGP Prefix Segment through the fabric . . . . 10
      4.2.5. Incremental Deployments . . . . . . . . . . . . . . . 10
   4.3. iBGP Labeled Unicast (RFC8277) . . . . . . . . . . . . . 11
5. Applying Segment Routing in the DC with IPv6 dataplane . . . 13
6. Communicating path information to the host . . . . . . . . . 13
7. Addressing the open problems . . . . . . . . . . . . . . . . 14
   7.1. Per-packet and flowlet switching . . . . . . . . . . . . 14
   7.2. Performance-aware routing . . . . . . . . . . . . . . . . 15
   7.3. Deterministic network probing . . . . . . . . . . . . . . 16
8. Additional Benefits . . . . . . . . . . . . . . . . . . . . . 17
   8.1. MPLS Dataplane with operational simplicity . . . . . . . 17
   8.2. Minimizing the FIB table . . . . . . . . . . . . . . . . 17
   8.3. Egress Peer Engineering . . . . . . . . . . . . . . . . . 17
   8.4. Anycast . . . . . . . . . . . . . . . . . . . . . . . . . 18
9. Preferred SRGB Allocation . . . . . . . . . . . . . . . . . . 18
10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19
11. Manageability Considerations . . . . . . . . . . . . . . . . 19
12. Security Considerations . . . . . . . . . . . . . . . . . . . 20
13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 20
14. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 20
15. References . . . . . . . . . . . . . . . . . . . . . . . . . 22
   15.1. Normative References . . . . . . . . . . . . . . . . . . 22
   15.2. Informative References . . . . . . . . . . . . . . . . . 23
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 23
1. Introduction

Segment Routing (SR), as described in [I-D.ietf-spring-segment-routing], leverages the source routing paradigm. A node steers a packet through an ordered list of instructions, called segments. A segment can represent any instruction, topological or service-based. A segment can have a semantic local to an SR node or global within an SR domain. SR allows enforcing a flow through any topological path while

skipping to change at page 5, line 41
o Shortest-path routing with ECMP implements an oblivious routing model, which is not aware of the network imbalances. If the network symmetry is broken, for example due to link failures, utilization hotspots may appear. For example, if a link fails between Tier-1 and Tier-2 devices (e.g. Node5 and Node9), Tier-3 devices Node1 and Node2 will not be aware of that, since there are other paths available from the perspective of Node3. They will continue sending roughly equal traffic to Node3 and Node4 as if the failure didn't exist, which may cause a traffic hotspot.
o The absence of path visibility leaves transport protocols, such as TCP, with a "blackbox" view of the network. Some TCP metrics, such as SRTT, MSS, CWND and a few others, could be inferred and cached based on past history, but those apply to destinations, regardless of the path that has been chosen to get there. Thus, for instance, TCP is not capable of remembering "bad" paths, such as those that exhibited poor performance in the past. This means that every new connection will be established obliviously (memory-less) with regard to the paths chosen before, or chosen by other nodes.
o Isolating faults in the network with multiple parallel paths and ECMP-based routing is non-trivial due to lack of determinism. Specifically, the connections from HostA to HostB may take a different path every time a new connection is formed, thus making consistent reproduction of a failure much more difficult. This complexity scales linearly with the number of parallel paths in the network, and stems from the random nature of path selection by the network devices.
Further in this document (Section 7), it is demonstrated how these

skipping to change at page 15, line 25 / page 14, line 36
parts of the solution. Additional enhancements, such as the centralized controller mentioned previously and host networking stack support, are required to implement the proposed solutions. Also, the applicability of the solutions described below is not restricted to the data-center alone; the same could be re-used in the context of other domains as well.
7.1. Per-packet and flowlet switching
A flowlet is defined as a burst of packets from the same flow followed by an idle interval.
OLD: [KANDULA04] developed a scheme that uses flowlets to split traffic across multiple parallel paths in order to optimize traffic load sharing.
OLD: With the ability to choose paths on the host, one may go from per-flow load-sharing in the network to per-packet or per-flowlet.
NEW: With some ability to choose paths on the host, one may go from per-flow load-sharing in the network to per-packet or per-flowlet.
The host may select different segment routing instructions either per packet, or per flowlet, and route them over different paths. This allows for solving the "elephant flow" problem in the data-center and avoiding link imbalances.
Note that traditional ECMP routing could be easily simulated with on-host path selection, using the method proposed in [GREENBERG09]. The hosts would randomly pick a Tier-2 or Tier-1 device to "bounce" the packet off of, depending on whether the destination is under the same

skipping to change at page 16, line 30 / page 15, line 39
7.2. Performance-aware routing

Knowing the path associated with flows/packets, the end host may deduce certain characteristics of the path on its own, and additionally use the information supplied with path information pushed from the controller or received via pull request. The host may further share its path observations with the centralized agent, so that the latter may keep an up-to-date network health map to assist other hosts with this information.
OLD: For example, an application A.1 at HostA may pin a TCP flow destined to HostZ via Spine node Node5 using label stack {16005, 16011}. The application A.1 may collect information on packet loss, deduced from TCP retransmissions and other signals (e.g. RTT increases).
NEW: For example, an application A.1 at HostA may pin a flow destined to HostZ via Spine node Node5 using label stack {16005, 16011}. The application A.1 may collect information on packet loss or other metrics.
A.1 may additionally publish this information to a centralized agent, e.g. after a flow completes, or periodically for longer-lived flows. Next, using both local and/or global performance data, application A.1, as well as other applications sharing the same resources in the DC fabric, may pick the best path for a new flow, or update an existing path (e.g. when informed of congestion on an existing path).
NEW: The mechanisms for collecting the flow metrics, publishing them to a centralized agent, and the decision process at the centralized agent and the application/host to pick a path through the network based on this collected information are outside the scope of this document.
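Since the draft leaves metric collection and the agent protocol out of scope, the host-side bookkeeping can only be sketched. The class below is a hypothetical illustration: it tracks per-label-stack loss counts, picks the lowest-loss path, and produces a snapshot an application could publish to the centralized agent; all names and the loss-only metric are assumptions.

```python
class PathStats:
    """Per-path loss accounting for host-selected label stacks (sketch)."""

    def __init__(self):
        self.stats = {}  # label-stack tuple -> {"sent": n, "lost": n}

    def record(self, stack, sent, lost):
        s = self.stats.setdefault(tuple(stack), {"sent": 0, "lost": 0})
        s["sent"] += sent
        s["lost"] += lost

    def loss_rate(self, stack):
        s = self.stats.get(tuple(stack), {"sent": 0, "lost": 0})
        return s["lost"] / s["sent"] if s["sent"] else 0.0

    def best_path(self, candidates):
        # Lowest observed loss wins; ties keep candidate order.
        return min(candidates, key=lambda st: self.loss_rate(st))

    def export(self):
        # Snapshot the application could publish to the centralized agent.
        return {stack: dict(v) for stack, v in self.stats.items()}
```

A real deployment would extend the metric set (latency, congestion signals) and define the publish/subscribe protocol, both of which the draft explicitly defers.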
One particularly interesting instance of performance-aware routing is dynamic fault-avoidance. If some links or devices in the network start discarding packets due to a fault, the end-hosts could probe and detect the path(s) that are affected and hence steer the affected flows away from the problem spot. Similar logic applies to failure cases where packets get completely black-holed, e.g., when a link goes down and the failure is detected by the host while probing the path.
For example, an application A.1 informed about 5 paths to Z {16005, 16011}, {16006, 16011}, {16007, 16011}, {16008, 16011} and {16011} might use the last one by default (for simplicity). When performance is degrading, A.1 might then start to pin flows (OLD: "TCP flows") to each of the 4 other paths (each via a distinct spine) and monitor the performance. It would then detect the faulty path and assign a negative preference to the faulty path to avoid further flows using it. Gradually, over time, it may re-assign flows on the faulty path to eventually detect the resolution of the trouble and start reusing the path.
NEW: The mechanisms for monitoring performance for a specific flow and for the various paths, and the deduction of optimal paths for the flow, are outside the scope of this document.
By leveraging Segment Routing, one avoids issues associated with oblivious ECMP hashing. For example, if in the topology depicted on Figure 1 a link between spine node Node5 and leaf node Node9 fails, HostA may exclude the segment corresponding to Node5 from the prefix matching the servers under Tier-2 device Node9. In the push path discovery model, the affected path mappings may be explicitly pushed to all the servers for the duration of the failure. The new mapping would instruct them to avoid the particular Tier-1 node until the link has recovered. Alternatively, in the pull path model, the centralized

skipping to change at page 22, line 39 / page 22, line 4
ATT
US

Email: ju1738@att.com

Saikat Ray
Unaffiliated
US

Email: raysaikat@gmail.com
Jon Mitchell
Unaffiliated
US
Email: jrmitche@puck.nether.net
15. References

15.1. Normative References
[I-D.ietf-idr-bgp-prefix-sid]
Previdi, S., Filsfils, C., Lindem, A., Sreekantiah, A., and H. Gredler, "Segment Routing Prefix SID extensions for BGP", draft-ietf-idr-bgp-prefix-sid-21 (work in progress), May 2018.
[I-D.ietf-spring-segment-routing]
Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing Architecture", draft-ietf-spring-segment-routing-15 (work in progress), January 2018.
[I-D.ietf-spring-segment-routing-central-epe]
Filsfils, C., Previdi, S., Dawra, G., Aries, E., and D. Afanasiev, "Segment Routing Centralized BGP Egress Peer Engineering", draft-ietf-spring-segment-routing-central-epe-10 (work in progress), December 2017.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997,

skipping to change at page 23, line 44 / page 23, line 13

<https://www.rfc-editor.org/info/rfc8277>.
15.2. Informative References

[GREENBERG09]
Greenberg, A., Hamilton, J., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D., Patel, P., and S. Sengupta, "VL2: A Scalable and Flexible Data Center Network", 2009.
[I-D.ietf-6man-segment-routing-header]
Previdi, S., Filsfils, C., Leddy, J., Matsushima, S., and d. daniel.voyer@bell.ca, "IPv6 Segment Routing Header (SRH)", draft-ietf-6man-segment-routing-header-13 (work in progress), May 2018.
[KANDULA04]
Sinha, S., Kandula, S., and D. Katabi, "Harnessing TCP's
Burstiness with Flowlet Switching", 2004.
[RFC6793] Vohra, Q. and E. Chen, "BGP Support for Four-Octet Autonomous System (AS) Number Space", RFC 6793, DOI 10.17487/RFC6793, December 2012, <https://www.rfc-editor.org/info/rfc6793>.
Authors' Addresses

Clarence Filsfils (editor)
Cisco Systems, Inc.

skipping to change at page 24, line 29 / page 23, line 38

BE

Email: cfilsfil@cisco.com

Stefano Previdi
Cisco Systems, Inc.
Italy

Email: stefano@previdi.net
OLD:
Jon Mitchell
Unaffiliated
USA

Email: jrmitche@puck.nether.net

NEW:
Gaurav Dawra
LinkedIn

Email: gdawra.ietf@gmail.com
Ebben Aries
Juniper Networks
1133 Innovation Way
Sunnyvale CA 94089
US

Email: exa@juniper.net

Petr Lapukhov
Facebook
 End of changes. 21 change blocks. 
73 lines changed or deleted 65 lines changed or added

This html diff was produced by rfcdiff 1.46. The latest version is available from http://tools.ietf.org/tools/rfcdiff/