SPRING Working Group A. Farrel
Internet-Draft J. Drake
Intended status: Informational Juniper Networks
Expires: January 1, 2018 June 30, 2017

Interconnection of Segment Routing Domains - Problem Statement and Solution Landscape


Segment Routing (SR) is now a popular forwarding paradigm for use in MPLS and IPv6 networks. It is typically deployed in discrete domains that may be data centers, access networks, or other networks that are under the control of a single operator and that can easily be upgraded to support this new technology.

Traffic originating in one SR domain often terminates in another SR domain, but must transit a backbone network that provides interconnection between those domains.

This document describes a mechanism for providing connectivity between SR domains to enable end-to-end or domain-to-domain traffic engineering.

The approach described allows connectivity between SR domains, uses traffic engineering mechanisms (RSVP-TE or Segment Routing) across the backbone network, and makes heavy use of pre-existing technologies so that very few additional mechanisms need to be specified.

This document provides some background and a problem statement, explains the solution mechanism, and provides examples. It does not define any new protocol mechanisms.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 1, 2018.

Copyright Notice

Copyright (c) 2017 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction

Data Centers are a growing market sector. They are being set up by new specialist companies, by enterprises for their own use, by legacy ISPs, and by the new wave of network operators such as Microsoft and Amazon.

The networks inside Data Centers are currently well-planned, but the traffic loads can be unpredictable. There is a need to be able to direct traffic within a Data Center to follow a specific path.

Data Centers are attached to external ("backbone") networks to allow access by users and to facilitate communication among Data Centers. An individual Data Center may be attached to multiple backbone networks, and may have multiple points of attachment to each backbone network. Traffic to or from a Data Center may need to be directed to or from any of these points of attachment.

A variety of networking technologies exist and have been proposed to steer traffic within the Data Center and across the backbone networks. This document proposes an approach that builds on existing technologies to produce mechanisms that provide scalable and flexible interconnection of Data Centers, and that will be easy to operate.

Segment Routing (SR) is a new technology that places forwarding state into each packet as a stack of loose hops as distinct from other pre-existing techniques that require signaling protocols to install state in the network. SR is a popular option for building Data Centers, and is also seeing increasing traction in edge and access networks as well as in backbone networks.

This document describes mechanisms to provide end-to-end SR connectivity between SR-capable domains across an MPLS backbone network that supports SR and/or MPLS-TE. This is the generalization of the requirement to provide inter-Data Center connectivity.

2. Problem Statement

Consider the network in Figure 1. Without loss of generality, this figure can be used to represent the architecture and problem space for steering traffic within and between SR edge domains. The figure shows a single destination for all traffic that we will consider. In this figure we distinguish between the PEs that provide access to the backbone networks and the Gateways that provide access to the SR edge domains: these may, in fact, be the same equipment, and the PEs might be located at the domain edges.

In describing the problem space and the solution we use the following terms for network domains and nodes:

SR edge domain :
A collection of SR-capable nodes in an edge network attached to the backbone network through one or more gateways. Examples include access networks, Data Center sites, and blessings of unicorns.
Host :
A node within an edge domain. May be an end system or a transit node in the edge domain.
Gateway (GW) :
Provides access to or from an edge domain. Examples are CEs, ASBRs, and Data Center gateways.
Provider Edge (PE) :
Provides access to or from the backbone network.
Autonomous System Border Router (ASBR) :
Provides access to one AS in the backbone network from another AS in the backbone network.

These terms can be seen used in Figure 1 where the various sources and destinations are hosts.

|                                                                   |
|                              AS1                                  |
|  ----    ----                                       ----    ----  |
   ----    ----                                       ----    ----
   :        :   ------------           ------------      :      :
   :        :  | AS2        |         |        AS3 |     :      :
   :        :  |         ------     ------         |     :      :
   :        :  |        |ASBR2a|...|ASBR3a|        |     :      :
   :        :  |         ------     ------         |     :      :
   :        :  |            |         |            |     :      :
   :        :  |         ------     ------         |     :      :
   :        :  |        |ASBR2b|...|ASBR3b|        |     :      :
   :        :  |         ------     ------         |     :      :
   :        :  |            |         |            |     :      :
   :  ......:  |  ----      |         |      ----  |     :      :
   :  :         -|PE2a|-----           -----|PE3a|-      :      :
   :  :           ----                       ----        :      :
   :  :      ......:                           :.......  :      :
   :  :      :                                        :  :      :
   ----    ----                                       ----    ----
 -|GW1a|--|GW1b|-                                   -|GW2a|--|GW2b|-
|  ----    ----  |                                 |  ----    ----  |
|                |                                 |                |
|                |                                 |                |
|                |                                 | Source3        |
|        Source2 |                                 |                |
|                |                                 |        Source4 |
| Source1        |                                 |                |
|                |                                 |   Destination  |
|                |                                 |                |
| Dom1           |                                 |           Dom2 |
 ----------------                                   ----------------

Figure 1: Reference Architecture for SR Domain Interconnect

Traffic to the destination may be sourced from multiple sources within that domain (we show two such sources: Source3 and Source4). Furthermore, traffic intended for the destination may arrive from outside the domain through any of the points of attachment to the backbone networks (we show GW2a and GW2b). This traffic may need to be steered within the domain to achieve load-balancing across network resources, to avoid degraded or out-of-service resources (including planned service outages), and to achieve different qualities of service. Of course, traffic in a remote source domain may also need to be steered within that domain. We class this problem as "Intra-Domain Traffic Steering".

Traffic across the backbone networks may need to be steered to conform to common Traffic Engineering paradigms. That is, the path across any network (shown in the figure as an AS) or across any collection of networks may need to be chosen. Furthermore, the points of inter-connection between networks may need to be selected and influence the path chosen for the data. We class this problem as "Inter-Domain Traffic Steering".

The composite end-to-end path comprises steering in the source domain, choice of source domain exit point, steering across the backbone networks, choice of network interconnections, choice of destination domain entry point, and steering in the destination domain. These issues may be inter-dependent (for example, the best traffic steering in the source domain may help select the best exit point from that domain, but the connectivity options across the backbone network may drive the selection of a different exit point). We class this combination of problems as "End-to-End Domain Interconnect Traffic Steering".

It should be noted that the solution to the End-to-End Domain Interconnect Traffic Steering problem depends on a number of factors:

In some cases, the domains and backbone networks are all owned and operated by the same company (with the backbone network often being a private network). In other cases, the domains are operated by one company, with other companies operating the backbone.

3. Solution Technologies

Within the Data Center, Segment Routing (SR from the SPRING working group in the IETF [RFC7855] and [I-D.ietf-spring-segment-routing]) is becoming a dominant solution. SR introduces traffic steering capabilities into an MPLS network [I-D.ietf-spring-segment-routing-mpls] by utilizing existing data plane capabilities (label pop and packet forwarding - "pop and go") in combination with additions to existing IGPs [I-D.ietf-ospf-segment-routing-extensions], [I-D.ietf-isis-segment-routing-extensions], BGP (as BGP-LU) [I-D.ietf-mpls-rfc3107bis], or a centralized controller to distribute "per-hop" labels. An MPLS label stack can be imposed on a packet to describe a sequence of links/nodes to be transited by the packet; as each hop is transited, the label that represents it is popped from the stack and the packet is forwarded. Thus, on a packet-by-packet basis, traffic can be steered within the Data Center network.

Note that other Data Center data plane technologies also exist. While this document focuses on connecting domains that use MPLS Segment Routing, the techniques are equally applicable to non-MPLS domains (such as those using IP, VXLAN, and NVGRE). See Section 9 for details.

This document broadens the problem space to consider interconnection of any type of edge domain. These may be Data Center sites, but they may equally be access networks, VPN sites, or any other form of domain that includes packet sources and destinations. We particularly focus on "SR edge domains" being source or destination domains that utilize SR, but the domains could use other technologies as described in Section 9.

Backbone networks are commonly based on MPLS hardware. In these networks, a number of different options exist to establish TE paths. Among these options are static LSPs (perhaps set up by an SDN controller), LSP tunnels established using a signaling protocol (such as RSVP-TE), and inter-domain use of SR (as described above for intra-domain steering). Where traffic steering (without resource reservation) is needed, SR may be adequate. Where Traffic Engineering is needed (i.e., traffic steering with resource reservation) RSVP-TE or centralized SDN control are preferred. However, in a network that is fully managed and controlled through a centralized planning tool, resource reservation can be achieved and SR can be used for full Traffic Engineering. These solutions are already used in support of a number of edge-to-edge services such as L3VPN and L2VPN.

3.1. Characteristics of Solution Technologies

Each of the solution technologies mentioned in the previous section has certain characteristics, and the combined solution needs to recognize and address the characteristics in order to make a workable solution.

This document introduces an approach that blends the best points of each of these solution technologies to achieve a trade-off where RSVP-TE tunnels in the backbone network are stitched together using SR, and end-to-end SR paths can be created under the control of a central controller with routing devolved to the constituent networks where possible.

4. Decomposing the Problem

It is important to decompose the problem to take account of different regions spanned by the end-to-end path. These regions may use different technologies and may be under different administrative control. The separation of administrative control is particularly important because the operator of one region may be unwilling to share information about their networks, and may be resistant to allowing a third party to exert control over their network resources.

Using the reference model in Figure 1, we can consider how to get a packet from Source1 to the Destination. The following decisions must be made:

As already mentioned, these decisions may be inter-related. This enables us to break down the problem into three steps:

  1. Get the packet from Source1 to the exit point of Dom1.
  2. Get the packet from exit point of Dom1 to entry point of Dom2.
  3. Get the packet from entry point of Dom2 to Destination.

The solution needs to achieve this in a way that allows:

From a technology point of view we must support several functions and mixtures of those functions:

All of these different decompositions of the problem reflect different deployment choices and different commercial and operational practices, each with different functional trade-offs. For example, with separate controllers that do not share information and that only cooperate to a limited extent, it will be possible to achieve end-to-end connectivity with optimal routing at each step (domain or backbone AS), but the end-to-end path that is achieved might not be optimal.
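
The trade-off in the example above can be illustrated with a small calculation. The following is a minimal sketch and not part of the solution: the cost values and the names DOM1_COST, BACKBONE_COST, greedy_exit, global_exit, and end_to_end_cost are hypothetical, chosen only to show that an exit point that is optimal within the source domain need not lie on the optimal end-to-end path.

```python
# Hypothetical costs, for illustration only.
# Cost from Source1 to each exit gateway of Dom1.
DOM1_COST = {"GW1a": 1, "GW1b": 3}
# Cost across the backbone from each exit gateway to the entry point of Dom2.
BACKBONE_COST = {"GW1a": 10, "GW1b": 4}


def end_to_end_cost(gw):
    """Total cost of leaving Dom1 through gw and then crossing the backbone."""
    return DOM1_COST[gw] + BACKBONE_COST[gw]


def greedy_exit():
    """Exit chosen by a Dom1 controller that sees only its own domain."""
    return min(DOM1_COST, key=DOM1_COST.get)


def global_exit():
    """Exit chosen by a controller that sees both Dom1 and the backbone."""
    return min(DOM1_COST, key=end_to_end_cost)
```

With these assumed values, the domain-local choice and the globally informed choice differ, which is the behavior that limited cooperation between controllers can produce.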

5. Solution Space

5.1. Global Optimization of the Paths

Global optimization of the path from one domain to another requires either that the source controller has a complete view of the end-to-end topology or that there is some form of cooperation between controllers (such as BRPC, described in RFC 5441 [RFC5441]).

BGP-LS [RFC7752] can be used to provide the "source" controller with a view of the topology of the backbone. This requires some of the BGP speakers in each AS to have BGP-LS sessions to the controller. Other means of obtaining this view are of course possible.

5.2. Figuring Out the GWs at a Destination Domain for a Given Prefix

Suppose GW1 and GW2 both advertise a route to prefix X, each setting itself as next hop. One might think that the GWs for X could be inferred from the routes' next hop fields, but typically both routes do not get distributed across the backbone, only the "best" route, as selected by BGP. But the best route according to the BGP selection process might not be the route via the GW that we want to use for traffic engineering purposes.

The obvious solution would be to use the ADD-PATH mechanism [RFC7911] to ensure that all routes to X get advertised. However, even if one does this, the identity of the GWs would get lost as soon as the routes got distributed through an ASBR that sets next hop self. And if there are multiple ASes in the backbone, not only will the next hop change several times, but the ADD-PATH mechanism experiences scaling issues. So this "obvious" solution only works within a single AS.

A better solution can be achieved using the Tunnel Encapsulation [I-D.ietf-idr-tunnel-encaps] attribute as follows:

We define a new tunnel type, "SR tunnel" and when the GWs to a given domain advertise a route to a prefix X within the domain, they each include a Tunnel Encapsulation attribute with multiple remote endpoint sub-TLVs each identifying a specific GW to the domain.

In other words, each route advertised by any GW identifies all of the GWs to the same domain (see Section 9 for a discussion of how GWs discover each other). Therefore, only one of the routes needs to be distributed to other ASes, and it doesn't matter how many times the next hop changes, the Tunnel Encapsulation attribute (and its remote endpoint sub-TLVs) remains unchanged.
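
The property described above can be modeled in a few lines. This is a minimal sketch under stated assumptions, not an encoding of the attribute: the Route class and the functions advertise and next_hop_self are hypothetical, and the remote_endpoints field merely stands in for the remote endpoint sub-TLVs of the "SR tunnel" TLV.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    # Stand-in for the remote endpoint sub-TLVs: one entry per GW of the
    # advertising domain (hypothetical model, not a wire format).
    remote_endpoints: Tuple[str, ...]


def advertise(prefix, gw, domain_gws):
    """Route originated by one GW; it lists every GW of the same domain."""
    return Route(prefix, next_hop=gw, remote_endpoints=tuple(sorted(domain_gws)))


def next_hop_self(route, asbr):
    """An ASBR rewrites the next hop; the modeled attribute is left unchanged."""
    return replace(route, next_hop=asbr)


# A single route from GW2a crosses two ASBRs that each set next hop self.
route = advertise("X", "GW2a", {"GW2a", "GW2b"})
route = next_hop_self(next_hop_self(route, "ASBR3a"), "ASBR2a")
```

After both rewrites the next hop no longer identifies a GW, but the full set of GWs for the domain is still recoverable from the one route that was distributed.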

Further, when a packet destined for prefix X is sent on a TE path to GW1 we want the packet to arrive at GW1 carrying, at the top of its label stack, GW1's label for prefix X. To achieve this we will place the SID/SRGB in a sub-TLV of the Tunnel Encapsulation attribute. We will define the prefix-SID sub-TLV to be essentially identical in syntax to the prefix-SID attribute (see [I-D.ietf-idr-bgp-prefix-sid]), but the semantics are somewhat different.

It is also possible to define an "MPLS Label Stack" sub-TLV for the Tunnel Encapsulation attribute, and put this in the "SR tunnel" TLV. This allows the destination GW to specify a label stack that it wants packets destined for prefix X to have. This label stack represents a source route through the destination domain.

5.3. Figuring Out the Backbone Egress ASBRs

We need to figure out the backbone egress ASBRs that are attached to a given GW at the destination domain in order to properly engineer the path across the backbone.

The "cleanest" way to figure this out is to have the backbone egress ASBRs distribute the information to the source controller using the EPE extensions of BGP-LS [I-D.ietf-idr-bgpls-segment-routing-epe]. The EPE extensions to BGP-LS allow a BGP speaker to say, "Here is a list of my EBGP neighbors, and here is a (locally significant) adjacency-SID for each one."
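
As an illustration of how a source controller might use such advertisements, consider the following minimal sketch. It is not a BGP-LS implementation: the table EPE_ADVERTISEMENTS and the function egress_nodes_for_gw are hypothetical, the node names are borrowed from Figure 1, and the peerings and SID values are invented for the example.

```python
# Hypothetical EPE advertisements: for each BGP speaker, its EBGP neighbors
# and the locally significant adjacency-SID it assigned to each of them.
EPE_ADVERTISEMENTS = {
    "PE3a": {"GW2a": 3001, "GW2b": 3002},
    "ASBR3a": {"ASBR2a": 3101},
}


def egress_nodes_for_gw(gw, adverts=EPE_ADVERTISEMENTS):
    """Return {speaker: adjacency_sid} for every speaker that peers with gw."""
    return {
        speaker: neighbors[gw]
        for speaker, neighbors in adverts.items()
        if gw in neighbors
    }
```

The returned adjacency-SIDs are only meaningful at the speaker that advertised them, which matches the "locally significant" scope noted above.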

It may also be possible to consider utilizing cooperating PCEs or a Hierarchical PCE approach as described in RFC 6805 [RFC6805]. But it should be observed that this question is dependent on the question in Section 5.2. That is, it is not possible to even start the selection of egress ASBRs until it is known which GWs at the destination domain provide access to a given prefix. Once that question has been answered, any number of PCE approaches can be used to select the right egress ASBR and, more generally, the ASBR path across the backbone.

5.4. Making use of RSVP-TE LSPs Across the Backbone

There are a number of ways to carry traffic across the backbone from one domain to another. RSVP-TE is a popular tunneling mechanism in similar scenarios (e.g., L3VPN) because it allows for reservation of resources as well as traffic steering.

A controller can cause an RSVP-TE LSP to be set up by using PCEP to talk to the LSP headend, using PCEP extensions [I-D.ietf-pce-pce-initiated-lsp]. That draft specifies an "LSP-initiate" message that the controller uses to specify the RSVP-TE LSP endpoints, the ERO, a "symbolic pathname", and optionally other attributes (specified in the PCEP specification, RFC 5440 [RFC5440]) such as bandwidth.

When the headend receives an LSP-initiate message, it sets up the RSVP-TE LSP, assigns it a "PLSP-id", and reports the PLSP-id back to the controller in a PCRpt message [I-D.ietf-pce-stateful-pce]. The PCRpt message also contains the symbolic name that the controller assigned to the LSP, as well as containing some information identifying the LSP-initiate message from the controller, and details of exactly how the LSP was set up (RRO, bandwidth, etc.).

The headend can add to the PCRpt message a TE-PATH-BINDING TLV [I-D.sivabalan-pce-binding-label-sid]. This allows the headend to assign a "binding SID" to the LSP, and to report to the controller that a particular binding SID corresponds to a particular LSP. The binding SID is locally scoped to the headend.

The controller can make this label be part of the label stack that it tells the source (or the GW at the source domain) to put on the data packets being sent to prefix X. When the headend receives a packet with this label at the top of the stack it will send the packet onward on the LSP.
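
The role of the binding SID at the headend can be sketched as follows. This is a toy model under stated assumptions: the Headend class, its methods, and the label values are hypothetical, and the PCEP message exchange (LSP-initiate, PCRpt) is reduced to a single method call and its return value.

```python
class Headend:
    """Toy model of an LSP headend that assigns locally scoped binding SIDs."""

    def __init__(self, first_label=1000):
        # Hypothetical local label space; real allocation is implementation
        # specific.
        self._next_label = first_label
        self._lsp_by_binding_sid = {}

    def initiate_lsp(self, symbolic_name):
        """Record a new LSP and return the binding SID reported for it."""
        binding_sid = self._next_label
        self._next_label += 1
        self._lsp_by_binding_sid[binding_sid] = symbolic_name
        return binding_sid

    def forward(self, label_stack):
        """Pop the top label and look up the LSP bound to it.

        Returns (lsp_name, remaining_stack). Raises KeyError if the top
        label is not a binding SID known to this headend.
        """
        top, rest = label_stack[0], label_stack[1:]
        return self._lsp_by_binding_sid[top], rest


headend = Headend()
sid = headend.initiate_lsp("PE2a-to-ASBR2a")
lsp, remaining = headend.forward([sid, 42, 43])
```

The controller only needs the returned binding SID to steer traffic onto the LSP; it does not need to know how the headend realizes the mapping.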

5.5. Data Plane

Consolidating all of the above, consider what happens when we want to move a data packet from Source to Destination in Figure 1 via the following source route:


Further, assume that there is an RSVP-TE LSP from PE2a to ASBR2a that we want to use, as well as an RSVP-TE LSP from ASBR3a to PE3a that we want to use.

Let's suppose that the Source pushes a label stack following instructions from the controller (for example, using BGP-LU [I-D.ietf-mpls-rfc3107bis]). We won't worry for now about source routing through the domains themselves: that is, in practice there may be additional labels in the stack to cover the source route from the Source to GW1b and from GW2a to the Destination, but we will focus only on the labels necessary to leave the source domain, traverse the backbone, and enter the egress domain. So we only care what the stack looks like when the packet gets to GW1b.

When the packet gets to GW1b, the stack should have six labels:

Top Label:

Second Label:

Third Label:

Fourth Label:

Fifth Label:

Sixth Label:

Note that the size of the label stack is proportional to the number of RSVP-TE LSPs that get stitched together by SR.

See Section 7 for some detailed examples that show the concrete use of labels in a sample topology.

In the above example, all labels except the sixth are locally significant labels: peer-SIDs, binding SIDs, or adjacency-SIDs. Only the sixth label, a prefix-SID, has a domain-wide unique value. To impose that label, the source needs to know the SRGB of GW2a. If all nodes have the same SRGB, this is not a problem. Otherwise, there are a number of different ways GW2a can advertise its SRGB. This can be done via the segment routing extensions of BGP-LS, or it can be done using the prefix-SID attribute or BGP-LU [I-D.ietf-mpls-rfc3107bis], or it can be done using the BGP Tunnel Encapsulation attribute. The exact technique to be used will depend on the details of the deployment scenario.

The reason the above example is primarily based on locally significant labels is that it creates a "strict source route", and it presupposes the EPE extensions of BGP-LS. In some scenarios, the EPE extension to BGP-LS might not be available (or BGP-LS might not be available at all). In other scenarios, it may be desirable to steer a packet through a "loose source route". In such scenarios, the label stack imposed by the source will be based upon a sequence of domain-wide unique "node-SIDs", each representing one of the hops of the source route. Each label has to be computed by adding the corresponding node-SID to the SRGB of the node that will act upon the label. One way to learn the node-SIDs and SRGBs is to use the segment routing extensions of BGP-LS. Another way is to use BGP-LU as follows. Each node that may be part of a source route would originate a BGP-LU route with one of its own loopback addresses as the prefix. The BGP prefix-SID attribute would be attached to this route. The prefix-SID attribute would contain a SID, which is the domain-wide unique SID corresponding to the node's loopback address. The attribute would also contain the node's SRGB.
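
The label computation for a loose source route can be sketched as follows. This is a minimal illustration, assuming each SRGB is a single contiguous range given as (base, size): the function names, the node name R1, and all SID and SRGB values are hypothetical.

```python
def label_for(sid_index, srgb_base, srgb_size):
    """Label that the acting node expects for a given node-SID."""
    if not 0 <= sid_index < srgb_size:
        raise ValueError("SID index outside the SRGB")
    return srgb_base + sid_index


def loose_route_labels(segments, node_sid, srgb):
    """Compute the labels for a loose source route, top of stack first.

    segments: ordered (target_node, acting_node) pairs, where acting_node is
              the node that will act upon the label for target_node.
    node_sid: node name -> domain-wide unique SID.
    srgb:     node name -> (base, size) of that node's SRGB.
    """
    labels = []
    for target, acting in segments:
        base, size = srgb[acting]
        labels.append(label_for(node_sid[target], base, size))
    return labels


# Invented values: R1 is a hypothetical first router after the source.
node_sid = {"GW1b": 11, "GW2a": 22}
srgb = {"R1": (16000, 8000), "GW1b": (20000, 8000)}
stack = loose_route_labels([("GW1b", "R1"), ("GW2a", "GW1b")], node_sid, srgb)
```

When all nodes share the same SRGB the acting node no longer matters, and each label reduces to the common base plus the node-SID.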

While this technique is useful when BGP-LS is not available, it presupposes that the source controller has some other means of discovering the topology. In this document, we focus primarily on the scenario where BGP-LS, rather than BGP-LU, is used.

5.6. Centralized and Distributed Controllers

A controller or set of controllers is needed to collate topology and TE information from the constituent networks, to apply policies and service requirements to compute paths across those networks, to select an end-to-end path, and to program key nodes in the network to take the right forwarding actions (pushing label stacks, stitching LSPs, forwarding traffic).

It is also possible to operate domain interconnection when some or all domains do not have a controller. Segment Routing is capable of routing a packet toward the next hop based on the top label on the stack, and that label does not need to indicate an immediately adjacent node or link. In these cases, the packet may be forwarded untouched, or the forwarding router may impose a locally-determined additional set of labels that define the path to the next hop.

PCE can be used to instruct the source host or a transit node on what label stacks to add to packets. That is, a node that needs to impose labels (either to start routing the packet from the source host, or to advance the packet from a transit router toward the destination) can determine the label stack to use based on local function or can have that stack supplied by a PCE. The PCE Protocol (PCEP) has been extended to allow the PCE to supply a label stack for reaching a specific destination either in response to a request or in an unsolicited manner [I-D.ietf-pce-segment-routing].

6. BGP-LS Considerations

This section gives an overview of the use of BGP-LS to export an abstraction (or summary) of the connectivity across the backbone network by means of two figures that show different views of a sample network.

Figure 2 shows a more complex reference architecture.

Figure 3 represents the minimum set of nodes and links that need to be advertised in BGP-LS with SR in order to perform Domain Interconnect with traffic engineering across the backbone network: the PEs, ASBRs, and gateways (GWs), and the links between them. In particular, EPE [I-D.ietf-idr-bgpls-segment-routing-epe] and TE information with associated segment IDs is advertised in BGP-LS with SR.

Links that are advertised may be physical links, links realized by LSP tunnels, or abstract links. It is assumed that intra-AS links are either real links, RSVP-TE LSPs with allocated bandwidth, or SR TE policies as described in [I-D.previdi-idr-segment-routing-te-policy]. Additional nodes internal to an AS and their links to PEs, ASBRs, and/or GWs may also be advertised (for example to avoid full mesh problems).

|                                                                   |
|                              AS1                                  |
|  ----    ----                                       ----    ----  |
   ----    ----                                       ----    ----
   :        :   ------------           ------------     :     : :
   :        :  | AS2        |         |        AS3 |    :     : :
   :        :  |         ------.....------         |    :     : :
   :        :  |        |ASBR2a|   |ASBR3a|        |    :     : :
   :        :  |         ------  ..:------         |    :     : :
   :        :  |            |    :    |            |    :     : :
   :        :  |         ------..:  ------         |    :     : :
   :        :  |        |ASBR2b|...|ASBR3b|        |    :     : :
   :        :  |         ------     ------         |    :     : :
   :        :  |            |         |            |    :     : :
   :        :  |            |       ------         |    :     : :
   :        :  |            |    ..|ASBR3c|        |    :     : :
   :        :  |            |    :  ------         |    : ....: :
   :  ......:  |  ----      |    :    |      ----  |    : :     :
   :  :         -|PE2a|-----     :     -----|PE3b|-     : :     :
   :  :           ----           :           ----       : :     :
   :  :     .......:             :             :....... : :     :
   :  :     :                   ------                : : :     :
   :  :     :              ----|ASBR4b|----           : : :     :
   :  :     :             |     ------     |          : : :     :
   :  :     :           ----               |          : : :     :
   :  :     : .........|PE4b|          AS4 |          : : :     :
   :  :     : :         ----               |          : : :     :
   :  :     : :           |      ----      |          : : :     :
   :  :     : :            -----|PE4a|-----           : : :     :
   :  :     : :                  ----                 : : :     :
   :  :     : :                ..:  :..               : : :     :
   :  :     : :                :      :               : : :     :
   ----    ----              ----    ----             ----:   ----
 -|GW1a|--|GW1b|-          -|GW2a|--|GW2b|-         -|GW3a|--|GW3b|-
|  ----    ----  |        |  ----    ----  |       |  ----    ----  |
|                |        |                |       |                |
|                |        |                |       |                |
| Host1a  Host1b |        | Host2a  Host2b |       | Host3a  Host3b |
|                |        |                |       |                |
|                |        |                |       |                |
| Dom1           |        | Dom2           |       |           Dom3 |
 ----------------          ----------------         ----------------

Figure 2: Network View of Example Configuration

    :                                                           :
   ----    ----                                       ----    ----
  |PE1a|  |PE1b|.....................................|PE2a|  |PE2b|
   ----    ----                                       ----    ----
   :        :                                           :     : :
   :        :                                           :     : :
   :        :            ------.....------              :     : :
   :        :     ......|ASBR2a|   |ASBR3a|......       :     : :
   :        :     :      ------  ..:------      :       :     : :
   :        :     :              :              :       :     : :
   :        :     :      ------..:  ------      :       :     : :
   :        :     :  ...|ASBR2b|...|ASBR3b|     :       :     : :
   :        :     :  :   ------     ------      :       :     : :
   :        :     :  :                 :        :       :     : :
   :        :     :  :              ------      :       :     : :
   :        :     :  :           ..|ASBR3c|...  :       :     : :
   :        :     :  :           :  ------   :  :       : ....: :
   :  ......:     ----           :           ----       : :     :
   :  :          |PE2a|          :          |PE3b|      : :     :
   :  :           ----           :           ----       : :     :
   :  :     .......:             :             :....... : :     :
   :  :     :                   ------                : : :     :
   :  :     :                  |ASBR4b|               : : :     :
   :  :     :                   ------                : : :     :
   :  :     :           ----        :                 : : :     :
   :  :     : .........|PE4b|.....  :                 : : :     :
   :  :     : :         ----     :  :                 : : :     :
   :  :     : :                  ----                 : : :     :
   :  :     : :                 |PE4a|                : : :     :
   :  :     : :                  ----                 : : :     :
   :  :     : :                ..:  :..               : : :     :
   :  :     : :                :      :               : : :     :
   ----    ----              ----    ----             ----:   ----
 -|GW1a|--|GW1b|-          -|GW2a|--|GW2b|-         -|GW3a|--|GW3b|-
|  ----    ----  |        |  ----    ----  |       |  ----    ----  |
|                |        |                |       |                |
|                |        |                |       |                |
| Host1a  Host1b |        | Host2a  Host2b |       | Host3a  Host3b |
|                |        |                |       |                |
|                |        |                |       |                |
| Dom1           |        | Dom2           |       |           Dom3 |
 ----------------          ----------------         ----------------

Figure 3: Topology View of Example Configuration

A node (a PCE, router, or host) that is computing a full or partial path correlates the topology information disseminated in BGP-LS with SR with the information advertised in the Tunnel Encapsulation attributes, both to compute that path and to obtain the SIDs for the elements on it. To allow a source host to compute exit points from its domain, some subset of the above information needs to be disseminated within that domain.
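The correlation step can be pictured with the following non-normative Python sketch. It assumes the topology learned through BGP-LS with SR has been reduced to a map of (neighbor, SID) pairs, and that the Tunnel Encapsulation attributes have been reduced to a map from a prefix to the GWs advertised for it. The function names, the data layout, and the use of a fewest-hop search in place of a real TE computation are all assumptions made for illustration. The labels are taken from the part of the network shown in Figure 4.

```python
from collections import deque


def sid_path(links, src, dst):
    """Breadth-first search; return the SIDs along a fewest-hop path."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, sids = queue.popleft()
        if node == dst:
            return sids
        for neighbor, sid in links.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, sids + [sid]))
    return None


def path_to_prefix(links, gws_for_prefix, src, prefix):
    """Try every GW advertised for `prefix`; keep the shortest SID list."""
    candidates = []
    for gw in gws_for_prefix.get(prefix, []):
        sids = sid_path(links, src, gw)
        if sids is not None:
            candidates.append((len(sids), gw, sids))
    if not candidates:
        return None
    _, gw, sids = min(candidates)
    return gw, sids


# Hypothetical view of the topology database (a fragment of Figure 4).
LINKS = {
    "GW1a": [("PE1a", "L201")],
    "PE1a": [("PE1c", "L202"), ("PE1c", "L203")],
    "PE1c": [("GW3b", "L302")],
}
# Hypothetical view of the Tunnel Encapsulation data for a prefix "X".
GWS_FOR_PREFIX = {"X": ["GW3a", "GW3b"]}

result = path_to_prefix(LINKS, GWS_FOR_PREFIX, "GW1a", "X")
```

In this fragment only GW3b is reachable from GW1a, so the computation selects GW3b and returns the SIDs for the path through PE1a and PE1c.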

What is advertised external to a given AS is controlled by policy at that AS's PEs, ASBRs, and GWs. Central control of what each node should advertise, based upon analysis of the network as a whole, is an important additional function. This and the amount of policy involved may make the use of a Route Reflector an attractive option.

Each node is configured locally with the links to other nodes that it advertises in BGP-LS with SR, and with the characteristics of those links. Pairwise coordination between link end-points is required to ensure consistency.

Path Weighted ECMP (PWECMP) is assumed to be used by a GW for a given source domain to send all flows to a given destination domain using all paths in the backbone network to that destination domain in proportion to the minimum bandwidth on each path. PWECMP is also assumed to be used by hosts within a source domain to send flows to that domain's GWs.
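The proportional split can be illustrated with a short, non-normative Python sketch. This document does not specify how PWECMP weights are derived beyond proportionality to the minimum bandwidth on each path, so the function name, the path names, and the bandwidth values below are hypothetical.

```python
from fractions import Fraction


def pwecmp_weights(paths):
    """Return each path's share of traffic.

    `paths` maps a path name to the bandwidths of the links along it.
    A path's share is its minimum link bandwidth divided by the sum of
    the minimum link bandwidths of all paths.
    """
    bottleneck = {name: min(links) for name, links in paths.items()}
    total = sum(bottleneck.values())
    if total == 0:
        raise ValueError("no usable bandwidth on any path")
    return {name: Fraction(bw, total) for name, bw in bottleneck.items()}


# Hypothetical bandwidths (Gb/s) for two backbone paths toward Dom3.
example_paths = {
    "via-PE1a": [100, 40, 100],  # minimum bandwidth 40
    "via-PE1b": [100, 10],       # minimum bandwidth 10
}
weights = pwecmp_weights(example_paths)
```

With these values the first path carries four fifths of the flows and the second carries one fifth.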

7. Worked Examples

Figure 4 shows a view of the links, paths, and labels that can be assigned to part of the sample network shown in Figure 2 and Figure 3. The double lines (===) indicate LSP tunnels across backbone ASes and the dotted lines (...) are physical links.

At each node, a label may be assigned to each outgoing link. This is shown in Figure 4. For example, at GW1a the label L201 is assigned to the link connecting GW1a to PE1a. At PE1c, the label L302 is assigned to the link connecting PE1c to GW3b. Labels ("binding SIDs") may also be assigned to RSVP-TE LSPs. For example, at PE1a, label L202 is assigned to the RSVP-TE LSP leading from PE1a to PE1c.

At the destination domain, labels L302 and L305 are "node-SIDs"; they represent GW3b and Host3b respectively, rather than representing particular links.

When a node processes a packet, the label at the top of the label stack indicates the link (or RSVP-TE LSP) on which that node is to transmit the packet. The node pops that label off the label stack before transmitting the packet on the link. However, if the top label is a node-SID, the node processing the packet is expected to transmit the packet on whatever link it regards as the shortest path to the node represented by the label.
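The per-node behavior described above can be sketched as follows. This is an illustrative Python model, not a definition of forwarding behavior. The table layouts and function names are hypothetical, and the model assumes that a node-SID stays on the stack until the packet reaches the node it names (penultimate hop popping is ignored). The labels come from Figure 4, with L302 and L305 treated as node-SIDs for GW3b and Host3b.

```python
def forward(node, stack, adjacency, node_sids, next_hop):
    """Process one packet at `node`; return (next_node, remaining_stack).

    adjacency[node][label]   -> neighbor reached over that link or LSP
    node_sids[label]         -> node that a node-SID represents
    next_hop[(node, target)] -> next node on the shortest path to target
    """
    top = stack[0]
    if top in adjacency.get(node, {}):
        # Link or LSP label: pop it, then transmit on that link or LSP.
        return adjacency[node][top], stack[1:]
    if top in node_sids:
        target = node_sids[top]
        if target == node:
            # This node is the one named by the SID: pop and continue.
            return node, stack[1:]
        # Assumption: keep the node-SID and follow the shortest path.
        return next_hop[(node, target)], stack
    raise KeyError(f"label {top} is unknown at {node}")


def trace(node, stack, adjacency, node_sids, next_hop):
    """Return the list of nodes visited until the stack is empty."""
    hops = [node]
    while stack:
        nxt, stack = forward(node, stack, adjacency, node_sids, next_hop)
        if nxt != node:
            hops.append(nxt)
        node = nxt
    return hops


ADJACENCY = {
    "GW1a": {"L201": "PE1a"},
    "PE1a": {"L202": "PE1c", "L203": "PE1c"},
}
NODE_SIDS = {"L302": "GW3b", "L305": "Host3b"}
# Hypothetical shortest-path next hops, read from Figure 4.
NEXT_HOP = {
    ("PE1c", "GW3b"): "GW3b",
    ("GW3b", "Host3b"): "N4",
    ("N4", "Host3b"): "Host3b",
}

hops = trace("GW1a", ["L201", "L202", "L302", "L305"],
             ADJACENCY, NODE_SIDS, NEXT_HOP)
```

Starting at GW1a, the packet follows the link to PE1a, the RSVP-TE LSP to PE1c, and then shortest paths to GW3b and on to Host3b.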


   ----        L202                                             ----
  |    |=======================================================|    |
  |PE1a|                                                       |PE1c|
  |    |=======================================================|    |
   ----        L203                                             ----
   :                                                             : :
   :     ----     L205                                     ----  : :
   :    |PE1b|============================================|PE1d| : :
   :     ----                                              ----  : :
   :      :                                                  :   : :
   :      :                                                  :   : :
   :      :    ----  L207  ------  L209  ------          L303:   : :
   :L201  :   |    |======|ASBR2a|......|      |             :   : :
   :      :   |    |       ------       |      | L210  ----  :   : :
   :      :   |PE2a|                    |ASBR3a|======|PE3b| :   : :
   :      :   |    | L208  ------  L211 |      |       ----  :   : :
   :      :   |    |======|ASBR2b|......|      |       :     :   : :
   :  L204:    ----       ------         ------     ...:     :   : :
   :      :      :                                  :        :   : :
   :  ....:      :                                  : .......:   : :
   :  :          :                                  : :          : :
   :  :          :L206                          L301: : .........: :
   :  :          :                                  : : : L304     :
   :  :      ....:                                  : : :      ....:
   :  :      :                                      : : :      : L302
   ----    ----                                     -----    ----
 -|GW1a|--|GW1b|-                                 -|GW3a |--|GW3b|-
|  ----    ----  |                               |  -----    ----  |
|    :      :    |                               |     :      :    |
|L103:      :L102|                               | L303:      :L304|
|    :      :    |                               |     :      :    |
|   N1      N2   |                               |    N3      N4   |
|    :..  ..:    |                               |     :  ....:    |
| L101 :  :      |                               |     :  :        |
|     Host1a     |                               |   Host3b (L305) |
|                |                               |                 |
| Dom1           |                               |            Dom3 |
 ----------------                                 -----------------

Figure 4: Tunnels and Labels in Example Configuration

Let's consider several different possible ways to direct a packet from Host1a in Dom1 to Host3b in Dom3.

a. Full source route imposed at source

b. It is possible that the source domain does not have visibility into the destination domain.

c. Dom1 only has reachability information

d. Stitched LSPs across the backbone

8. Label Stack Depth Considerations

As described in Section 3.1, one of the issues with a Segment Routing approach is that the label stack can get large, for example when the source route becomes long. A mechanism to mitigate this problem is needed if the solution is to be fully applicable in all environments.

An Internet-Draft called "Segment Routing Traffic Engineering Policy using BGP" [I-D.previdi-idr-segment-routing-te-policy] introduces the concept of hierarchical source routes as a way to compress source route headers. It functions by having the egress node for a set of source routes advertise those source routes along with an explicit request that each node that is an ingress node for one or more of those source routes should advertise a binding SID for the set of source routes for which it is the ingress. (The set of source routes can either be advertised by the egress node as described here, or be advertised by a controller on behalf of the egress node.) Such an ingress node advertises its set of source routes and a binding SID as an adjacency in BGP-LS as described in Section 6. These source routes represent the weighted ECMP paths between the ingress node and the egress node. (The binding SID may be supplied by the node that advertises the source routes, that is, the egress node or the controller, or may be chosen by the ingress node.)

A remote node that wishes to reach the egress node would then construct a source route consisting of the segment IDs necessary to reach one of the ingress nodes for the path it wishes to use along with the binding SID that the ingress node advertised to identify the set of paths. When the selected ingress node receives a packet with a binding SID it has advertised, it replaces the binding SID with the labels for one of its source routes to the egress node (it will choose one of the source routes in the set according to its own weighting algorithms and policy).

8.1. Worked Example

Consider the topology in Figure 4. Suppose that it is desired to construct full segment routed paths from ingress to egress, but that the resulting label stack (segment route) is too large. In this case the gateways to Dom3 (GW3a and GW3b) can advertise all of the source routes from the gateways to Dom1 (GW1a and GW1b). The gateways to Dom1 then assign binding SIDs to those source routes and advertise those SIDs into BGP-LS.

Thus, GW3b would advertise the two source routes (L201, L202, L302 and L201, L203, L302), and GW1a would advertise into BGP-LS its adjacency to GW3b along with a binding SID. Should Host1a wish to send a packet via GW1a and GW3b, it can include L103 and this binding SID in the source route. GW1a is free to choose which source route to use between itself and GW3b using its weighted ECMP algorithm.
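The replacement that GW1a performs on receipt of the binding SID can be sketched in Python as shown below. This is only an illustration. The binding SID value "B-GW3b", the equal weights, the trailing L305 label, and the function name are hypothetical. The two source routes are the ones listed above.

```python
import random

# Source routes from GW1a to GW3b, each with a hypothetical weight.
ROUTES_GW1A_TO_GW3B = [
    (["L201", "L202", "L302"], 1),
    (["L201", "L203", "L302"], 1),
]

# "B-GW3b" is a made-up binding SID that GW1a is assumed to advertise.
BINDINGS = {"B-GW3b": ROUTES_GW1A_TO_GW3B}


def expand_binding_sid(stack, bindings, rng=random):
    """Replace a binding SID at the top of the stack with one source route.

    The route is picked at random in proportion to its weight, standing
    in for the ingress node's own weighted ECMP algorithm and policy.
    A stack whose top label is not a binding SID is returned unchanged.
    """
    top = stack[0]
    if top not in bindings:
        return list(stack)
    routes = [route for route, _ in bindings[top]]
    weights = [weight for _, weight in bindings[top]]
    chosen = rng.choices(routes, weights=weights, k=1)[0]
    return list(chosen) + list(stack[1:])


# Stack as seen by GW1a once L103 has been removed on the way to it.
arriving = ["B-GW3b", "L305"]
expanded = expand_binding_sid(arriving, BINDINGS, random.Random(0))
```

Host1a imposes two labels for the inter-domain portion of the route (L103 and the binding SID) instead of the three labels of either full source route, and GW1a restores the full route before forwarding.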

Similarly, GW3a would advertise the following set of source routes:

GW1a would advertise a binding SID for the first three, and GW1b would advertise a binding SID for the other two.

9. Gateway Considerations

As described in Section 5, we define a new tunnel type, "SR tunnel", and when the GWs to a given domain advertise a route to a prefix X within the domain, they will each include a Tunnel Encapsulation attribute with multiple tunnel instances each of type "SR tunnel", one for each GW and each containing a Remote Endpoint sub-TLV with that GW's address.

In other words, each route advertised by any GW identifies all of the GWs to the same domain.

Therefore, even if only one of the routes is distributed to other ASes, it will not matter how many times the next hop changes, as the Tunnel Encapsulation attribute (and its remote endpoint sub-TLVs) will remain unchanged.

9.1. Domain Gateway Auto-Discovery

To allow a given domain's GWs to auto-discover each other and to coordinate their operations, the following procedures are implemented [I-D.drake-bess-datacenter-gateway]:

To avoid the side effect of applying the Tunnel Encapsulation attribute to any packet that is addressed to the GW, the GW should use a different loopback address.

Each GW will include a Tunnel Encapsulation attribute for each GW that is active for the domain (including itself), and will include these in every route advertised externally to the domain by each GW. As the current set of active GWs changes (due to the addition of a new GW or the failure/removal of an existing GW) each externally advertised route will be re-advertised with the set of SR tunnel instances reflecting the current set of active GWs.
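As an illustration, the following Python sketch models how the set of SR tunnel instances tracks the set of active GWs. The dictionaries are a stand-in for the attribute contents and are not a wire encoding. The field names, function names, prefix, and addresses (taken from the documentation address ranges) are hypothetical.

```python
def tunnel_encap_attribute(active_gws):
    """Build one "SR tunnel" instance per active GW, in a stable order."""
    return [
        {"tunnel_type": "SR tunnel", "remote_endpoint": address}
        for address in sorted(active_gws)
    ]


def external_routes(prefixes, advertising_gw, active_gws):
    """Return the routes one GW advertises externally for its domain."""
    attribute = tunnel_encap_attribute(active_gws)
    return [
        {"prefix": prefix, "next_hop": advertising_gw,
         "tunnel_encap": attribute}
        for prefix in prefixes
    ]


# Hypothetical addresses for the two GWs of one domain.
gw_a, gw_b = "192.0.2.1", "192.0.2.2"
active = {gw_a, gw_b}

from_a = external_routes(["198.51.100.0/24"], gw_a, active)
from_b = external_routes(["198.51.100.0/24"], gw_b, active)

# GW b fails: the surviving GW re-advertises with the reduced set.
active.discard(gw_b)
after_failure = external_routes(["198.51.100.0/24"], gw_a, active)
```

The routes from both GWs differ only in their next hop, so a receiver that sees either one learns of every active GW. After the failure, the re-advertised route lists only the remaining GW.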

9.2. Relationship to BGP Link State and Egress Peer Engineering

When a remote GW receives a route to a prefix X it can use the SR tunnel instances within the contained Tunnel Encapsulation attribute to identify the GWs through which X can be reached. It uses this information to compute SR TE paths across the backbone network looking at the information advertised to it in SR BGP Link State (BGP-LS) [I-D.gredler-idr-bgp-ls-segment-routing-ext] and correlated using the domain identity. SR Egress Peer Engineering (EPE) [I-D.ietf-idr-bgpls-segment-routing-epe] can be used to supplement the information advertised in the BGP-LS.


When a packet destined for prefix X is sent on an SR TE path to a GW for the domain containing X, it needs to carry the receiving GW's label for X such that this label rises to the top of the stack before the GW completes its processing of the packet. To achieve this we place a prefix-SID sub-TLV for X in each SR tunnel instance in the Tunnel Encapsulation attribute in the externally advertised route for X.

Alternatively, if the GWs for a given domain are configured to allow remote GWs to perform SR TE through that domain for a prefix X, then each GW computes an SR TE path through that domain to X from each of the current active GWs and places each in an MPLS label stack sub-TLV [I-D.ietf-idr-tunnel-encaps] in the SR tunnel instance for that GW.
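The way a remote GW might combine this information into a label stack can be sketched as follows. This is a non-normative Python illustration. The field names, the function name, and the label values used for X are hypothetical, and the rule that an MPLS label stack sub-TLV is used in preference to a prefix-SID when both are present is an assumption rather than something stated in this document.

```python
def build_label_stack(backbone_sids, tunnel_instance):
    """Return the stack a remote GW imposes for one SR tunnel instance.

    The SIDs for the SR TE path across the backbone come first. They are
    followed by the labels for the destination domain, so that those
    labels are at the top of the stack when the receiving GW is reached.
    """
    inner = tunnel_instance.get("mpls_label_stack")
    if inner is None:
        # Only the receiving GW's label for the prefix is available.
        inner = [tunnel_instance["prefix_sid"]]
    return list(backbone_sids) + list(inner)


# Backbone path to GW3b using labels from Figure 4.
backbone = ["L201", "L202", "L302"]

# Hypothetical tunnel instances for a prefix X behind GW3b.
with_prefix_sid = {"remote_endpoint": "GW3b", "prefix_sid": "L-X"}
with_te_path = {"remote_endpoint": "GW3b", "prefix_sid": "L-X",
                "mpls_label_stack": ["L-N4", "L-X"]}

stack_prefix_only = build_label_stack(backbone, with_prefix_sid)
stack_te = build_label_stack(backbone, with_te_path)
```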

9.4. Encapsulations

If the GWs for a given domain are configured to allow remote GWs to send them a packet in that domain's native encapsulation, then each GW will also include multiple instances of a tunnel TLV for that native encapsulation, one for each GW and each containing a remote endpoint sub-TLV with that GW's address, in externally advertised routes. A remote GW may then encapsulate a packet according to the rules defined via the sub-TLVs included in each of the tunnel TLV instances.

10. Security Considerations


11. Management Considerations


12. IANA Considerations

This document makes no requests for IANA action.

13. Acknowledgements


14. Informative References

[I-D.drake-bess-datacenter-gateway] Drake, J., Farrel, A., Rosen, E., Patel, K. and L. Jalil, "Gateway Auto-Discovery and Route Advertisement for Segment Routing Enabled Data Center Interconnection", Internet-Draft draft-drake-bess-datacenter-gateway-03, April 2017.
[I-D.gredler-idr-bgp-ls-segment-routing-ext] Previdi, S., Psenak, P., Filsfils, C., Gredler, H., Chen, M. and J. Tantsura, "BGP Link-State extensions for Segment Routing", Internet-Draft draft-gredler-idr-bgp-ls-segment-routing-ext-04, October 2016.
[I-D.ietf-idr-bgp-prefix-sid] Previdi, S., Filsfils, C., Lindem, A., Sreekantiah, A. and H. Gredler, "Segment Routing Prefix SID extensions for BGP", Internet-Draft draft-ietf-idr-bgp-prefix-sid-06, June 2017.
[I-D.ietf-idr-bgpls-segment-routing-epe] Previdi, S., Filsfils, C., Patel, K., Ray, S. and J. Dong, "BGP-LS extensions for Segment Routing BGP Egress Peer Engineering", Internet-Draft draft-ietf-idr-bgpls-segment-routing-epe-13, June 2017.
[I-D.ietf-idr-tunnel-encaps] Rosen, E., Patel, K. and G. Velde, "The BGP Tunnel Encapsulation Attribute", Internet-Draft draft-ietf-idr-tunnel-encaps-06, June 2017.
[I-D.ietf-isis-segment-routing-extensions] Previdi, S., Filsfils, C., Bashandy, A., Gredler, H., Litkowski, S., Decraene, B. and J. Tantsura, "IS-IS Extensions for Segment Routing", Internet-Draft draft-ietf-isis-segment-routing-extensions-13, June 2017.
[I-D.ietf-mpls-rfc3107bis] Rosen, E., "Using BGP to Bind MPLS Labels to Address Prefixes", Internet-Draft draft-ietf-mpls-rfc3107bis-02, May 2017.
[I-D.ietf-ospf-segment-routing-extensions] Psenak, P., Previdi, S., Filsfils, C., Gredler, H., Shakir, R., Henderickx, W. and J. Tantsura, "OSPF Extensions for Segment Routing", Internet-Draft draft-ietf-ospf-segment-routing-extensions-17, June 2017.
[I-D.ietf-pce-pce-initiated-lsp] Crabbe, E., Minei, I., Sivabalan, S. and R. Varga, "PCEP Extensions for PCE-initiated LSP Setup in a Stateful PCE Model", Internet-Draft draft-ietf-pce-pce-initiated-lsp-10, June 2017.
[I-D.ietf-pce-segment-routing] Sivabalan, S., Filsfils, C., Tantsura, J., Henderickx, W. and J. Hardwick, "PCEP Extensions for Segment Routing", Internet-Draft draft-ietf-pce-segment-routing-09, April 2017.
[I-D.ietf-pce-stateful-pce] Crabbe, E., Minei, I., Medved, J. and R. Varga, "PCEP Extensions for Stateful PCE", Internet-Draft draft-ietf-pce-stateful-pce-21, June 2017.
[I-D.ietf-spring-segment-routing] Filsfils, C., Previdi, S., Decraene, B., Litkowski, S. and R. Shakir, "Segment Routing Architecture", Internet-Draft draft-ietf-spring-segment-routing-12, June 2017.
[I-D.ietf-spring-segment-routing-mpls] Filsfils, C., Previdi, S., Bashandy, A., Decraene, B., Litkowski, S. and R. Shakir, "Segment Routing with MPLS data plane", Internet-Draft draft-ietf-spring-segment-routing-mpls-10, June 2017.
[I-D.previdi-idr-segment-routing-te-policy] Previdi, S., Filsfils, C., Mattes, P., Rosen, E. and S. Lin, "Advertising Segment Routing Policies in BGP", Internet-Draft draft-previdi-idr-segment-routing-te-policy-07, June 2017.
[I-D.sivabalan-pce-binding-label-sid] Sivabalan, S., Filsfils, C., Previdi, S., Tantsura, J., Hardwick, J. and M. Nanduri, "Carrying Binding Label/Segment-ID in PCE-based Networks", Internet-Draft draft-sivabalan-pce-binding-label-sid-02, October 2016.
[RFC4360] Sangli, S., Tappan, D. and Y. Rekhter, "BGP Extended Communities Attribute", RFC 4360, DOI 10.17487/RFC4360, February 2006.
[RFC5152] Vasseur, JP., Ayyangar, A. and R. Zhang, "A Per-Domain Path Computation Method for Establishing Inter-Domain Traffic Engineering (TE) Label Switched Paths (LSPs)", RFC 5152, DOI 10.17487/RFC5152, February 2008.
[RFC5440] Vasseur, JP. and JL. Le Roux, "Path Computation Element (PCE) Communication Protocol (PCEP)", RFC 5440, DOI 10.17487/RFC5440, March 2009.
[RFC5441] Vasseur, JP., Zhang, R., Bitar, N. and JL. Le Roux, "A Backward-Recursive PCE-Based Computation (BRPC) Procedure to Compute Shortest Constrained Inter-Domain Traffic Engineering Label Switched Paths", RFC 5441, DOI 10.17487/RFC5441, April 2009.
[RFC5520] Bradford, R., Vasseur, JP. and A. Farrel, "Preserving Topology Confidentiality in Inter-Domain Path Computation Using a Path-Key-Based Mechanism", RFC 5520, DOI 10.17487/RFC5520, April 2009.
[RFC6805] King, D. and A. Farrel, "The Application of the Path Computation Element Architecture to the Determination of a Sequence of Domains in MPLS and GMPLS", RFC 6805, DOI 10.17487/RFC6805, November 2012.
[RFC7752] Gredler, H., Medved, J., Previdi, S., Farrel, A. and S. Ray, "North-Bound Distribution of Link-State and Traffic Engineering (TE) Information Using BGP", RFC 7752, DOI 10.17487/RFC7752, March 2016.
[RFC7855] Previdi, S., Filsfils, C., Decraene, B., Litkowski, S., Horneffer, M. and R. Shakir, "Source Packet Routing in Networking (SPRING) Problem Statement and Requirements", RFC 7855, DOI 10.17487/RFC7855, May 2016.
[RFC7911] Walton, D., Retana, A., Chen, E. and J. Scudder, "Advertisement of Multiple Paths in BGP", RFC 7911, DOI 10.17487/RFC7911, July 2016.
[RFC7926] Farrel, A., Drake, J., Bitar, N., Swallow, G., Ceccarelli, D. and X. Zhang, "Problem Statement and Architecture for Information Exchange between Interconnected Traffic-Engineered Networks", BCP 206, RFC 7926, DOI 10.17487/RFC7926, July 2016.

Authors' Addresses

Adrian Farrel Juniper Networks EMail: afarrel@juniper.net
John Drake Juniper Networks EMail: jdrake@juniper.net