< draft-lapukhov-bgp-ecmp-considerations-01.txt   draft-lapukhov-bgp-ecmp-considerations-02.txt >
Network Working Group P. Lapukhov Network Working Group P. Lapukhov
Internet-Draft Facebook Internet-Draft Facebook
Intended status: Informational October 30, 2017 Intended status: Informational J. Tantsura
Expires: May 3, 2018 Expires: January 2, 2020 Apstra, Inc.
July 1, 2019
Equal-Cost Multipath Considerations for BGP Equal-Cost Multipath Considerations for BGP
draft-lapukhov-bgp-ecmp-considerations-01 draft-lapukhov-bgp-ecmp-considerations-02
Abstract Abstract
BGP routing protocol defined in ([RFC4271]) employs tie-breaking BGP (Border Gateway Protocol) [RFC4271] employs tie-breaking logic to
logic to elect single best path among multiple possible. At the same select a single best path among multiple paths available, known as
time, it has been common in all practical BGP implementations to BGP best path selection. At the same time, it is a common practice
allow for "equal-cost multipath" (ECMP) path election and programming to allow for "equal-cost multipath" (ECMP) selection and programming
of multiple next-hops in routing tables. This documents provides of multiple next-hops in routing tables. This document summarizes
some common considerations for the ECMP logic, with the intent of some common considerations for the ECMP logic when BGP is used as the
providing common reference on otherwise unstandardized feature. routing protocol, with the intent of providing common reference for
otherwise unstandardized set of features.
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/. Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 3, 2018. This Internet-Draft will expire on January 2, 2020.
Copyright Notice Copyright Notice
Copyright (c) 2017 IETF Trust and the persons identified as the Copyright (c) 2019 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of (https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
skipping to change at page 2, line 19 skipping to change at page 2, line 21
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. AS-PATH attribute comparison . . . . . . . . . . . . . . . . 2 2. AS-PATH attribute comparison . . . . . . . . . . . . . . . . 2
3. Multipath among eBGP-learned paths . . . . . . . . . . . . . 3 3. Multipath among eBGP-learned paths . . . . . . . . . . . . . 3
4. Multipath among iBGP learned paths . . . . . . . . . . . . . 3 4. Multipath among iBGP learned paths . . . . . . . . . . . . . 3
5. Multipath among eBGP and iBGP paths . . . . . . . . . . . . . 4 5. Multipath among eBGP and iBGP paths . . . . . . . . . . . . . 4
6. Multipath with AIGP . . . . . . . . . . . . . . . . . . . . . 5 6. Multipath with AIGP . . . . . . . . . . . . . . . . . . . . . 5
7. Best path advertisement . . . . . . . . . . . . . . . . . . . 5 7. Best path advertisement . . . . . . . . . . . . . . . . . . . 5
8. Multipath and non-deterministic tie-breaking . . . . . . . . 5 8. Multipath and non-deterministic tie-breaking . . . . . . . . 5
9. Weighted equal-cost multipath . . . . . . . . . . . . . . . . 5 9. Weighted equal-cost multipath . . . . . . . . . . . . . . . . 5
10. Informative References . . . . . . . . . . . . . . . . . . . 5 10. Informative References . . . . . . . . . . . . . . . . . . . 5
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 6 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 6
1. Introduction 1. Introduction
Section 9.1.2.2 of [RFC4271] defines step-by step procedure for Section 9.1.2.2 of [RFC4271] defines step-by-step tie-breaking
selecting single "best-path" among multiple alternative available for procedure for selecting a single "best-path" among multiple
the same NLRI (Network Layer Reachability Information) element. In alternatives available for the same route. In order to improve
order to improve efficiency in symmetric network topologies is has efficiency in densely meshed symmetric network topologies it is
become common practice to allow selecting multiple "equivalent" paths common to allow the selection of multiple "equal cost" paths for the
for the same prefix. Most commonly used approach is to stop the tie- same route. Typical approach is to abort the tie-breaking process
breaking process after comparing the IGP cost for the NEXT_HOP after comparing IGP cost for the NEXT_HOP attribute and select either
attribute and selecting either all eBGP or all iBGP paths that all eBGP or all iBGP paths that remained "equal" under the tie-
remained equivalent under the tie-breaking rules (see [BGPMP] for a breaking rules. See [BGPMP] for a vendor document explaining the
vendor document explaining the logic). Basically, the steps that logic. In a nutshell, the steps that compare the BGP identifiers and
compare the BGP identifier and BGP peer IP addresses (steps (f) and BGP peer IP addresses (steps (f) and (g) in [RFC4271]) are ignored
(g)) are ignored for the purpose of multipath routing. BGP for the purpose of multipath routing. BGP implementations commonly
implementations commonly have a configuration knob that specifies the have a configuration knob that specifies the maximum number of equal
maximum number of equivalent paths that may be programmed to the paths that are allowed to be programmed in the routing table. Commonly,
routing table. There is also common a knob to enable multipath there's also a knob to enable multipath separately for iBGP-learned
separately for iBGP-learned or eBGP-learned paths. or eBGP-learned paths.
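The selection described in the Introduction can be sketched in Python as follows. This is a minimal illustration, not any implementation's actual logic: the Path fields, the select_multipath name, and the max_paths parameter are hypothetical, and the candidates are assumed to be already tied through steps (a) to (c) of Section 9.1.2.2 of [RFC4271].

```python
from dataclasses import dataclass
from ipaddress import ip_address


@dataclass(frozen=True)
class Path:
    peer_addr: str   # address of the advertising peer, used for step (g)
    is_ebgp: bool    # True if learned over an eBGP session
    igp_cost: int    # interior cost to reach the NEXT_HOP
    router_id: int   # BGP identifier of the peer, used for step (f)


def select_multipath(candidates, max_paths):
    """Return (best_path, ecmp_set) for paths tied through steps (a)-(c)."""
    # Step (d): if any eBGP path exists, drop all iBGP paths.
    if any(p.is_ebgp for p in candidates):
        candidates = [p for p in candidates if p.is_ebgp]
    # Step (e): keep only the paths with the lowest interior cost.
    lowest = min(p.igp_cost for p in candidates)
    tied = [p for p in candidates if p.igp_cost == lowest]
    # Steps (f) and (g) order the tied paths so a single best path exists,
    # but they do not remove anything from the multipath set.
    tied.sort(key=lambda p: (p.router_id, ip_address(p.peer_addr)))
    # The "maximum paths" knob caps what is programmed in the routing table.
    return tied[0], tied[:max_paths]
```

With max_paths set to 1 the result degenerates to ordinary single best path selection.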
2. AS-PATH attribute comparison 2. AS-PATH attribute comparison
A mandatory requirement is for all paths that are candidates for ECMP The mandatory requirement for all paths that are considered as the
selection to have the same AS_PATH length, computed using the candidates for ECMP selection is to have the same AS_PATH length,
standard logic defined in [RFC4271] and [RFC5065], i.e. ignoring the computed using the logic defined in [RFC4271] and [RFC5065], i.e.
AS_SET, AS_CONFED_SEQUENCE, and AS_CONFED_SET segment lengths. The ignoring the AS_SET, AS_CONFED_SEQUENCE, and AS_CONFED_SET segment
content of the latter attributes is used purely for loop detection. lengths. The content of the latter attributes is used purely for
Assuming that AS_PATH lengths computed in this fashion are the same, routing loop prevention. Assuming that AS_PATH lengths computed in
many implementations require that content of AS_SEQUENCE segment MUST this fashion are the same, many implementations require that the
be the same among all equivalent paths. Two common configuration content of AS_SEQUENCE segment MUST be the same among all the paths
knobs are usually provided: one allowing only the length of AS_PATH considered. Two common configuration knobs to alter this behaviour
to be the same, and another requiring that the first AS numbers in are usually provided: First one, to relax the otherwise mandatory
first AS_SEQUENCE segment found in AS_PATH (often referred to as AS_SEQUENCE comparison rule, enforcing only the AS_PATH length rule,
"peer AS" number) be the same as the one found in best path while ignoring the content of AS_SEQUENCE. Another one requiring
(determined by running the full tie-breaking algorithm). This that the first AS numbers in the first AS_SEQUENCE segment found in
document refer to those two as "multipath as-path relaxed" and AS_PATH (often referred to as "peer AS" number) be the same as the
"multipath same peer-as" knobs. one found in best path (as determined by running the full tie-
breaking procedure). This document refers to those two as "multipath
as-path relaxed" and "multipath same peer-as" correspondingly.
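The AS_PATH rules of Section 2 can be sketched in Python as follows. This is an illustration under stated assumptions: an AS_PATH is modeled as a list of (segment type, AS number list) pairs, the function and mode names are hypothetical, and the two knobs are treated as alternative modes next to the strict default. Following [RFC4271], an AS_SET contributes 1 to the length regardless of how many AS numbers it holds, and following [RFC5065] the confederation segments contribute nothing.

```python
AS_SEQUENCE = "AS_SEQUENCE"
AS_SET = "AS_SET"
AS_CONFED_SEQUENCE = "AS_CONFED_SEQUENCE"
AS_CONFED_SET = "AS_CONFED_SET"


def as_path_length(as_path):
    """Length used for path comparison."""
    length = 0
    for seg_type, asns in as_path:
        if seg_type == AS_SEQUENCE:
            length += len(asns)
        elif seg_type == AS_SET:
            length += 1  # counts as one, whatever its size
        # AS_CONFED_SEQUENCE and AS_CONFED_SET are not counted
    return length


def as_sequence(as_path):
    """All AS numbers found in AS_SEQUENCE segments, in order."""
    return [asn for t, asns in as_path if t == AS_SEQUENCE for asn in asns]


def peer_as(as_path):
    """First AS number of the first AS_SEQUENCE segment, or None."""
    for seg_type, asns in as_path:
        if seg_type == AS_SEQUENCE and asns:
            return asns[0]
    return None


def as_path_eligible(candidate, best, mode="strict"):
    """mode is "strict", "relaxed", or "same-peer-as" (hypothetical names)."""
    if as_path_length(candidate) != as_path_length(best):
        return False  # equal length is mandatory in every mode
    if mode == "relaxed":
        return True
    if mode == "same-peer-as":
        return peer_as(candidate) == peer_as(best)
    return as_sequence(candidate) == as_sequence(best)
```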
3. Multipath among eBGP-learned paths 3. Multipath among eBGP-learned paths
Step (d) in Section 9.1.2.2 of [RFC4271] instructs to remove all iBGP Step (d) in Section 9.1.2.2 of [RFC4271] mandates, in presence of an
paths from considerations if an eBGP path is present in the candidate eBGP path, to remove all iBGP paths from the ECMP candidate set.
set. This leaves the BGP process with just eBGP paths. At this This leaves the BGP tie-breaking procedure with just eBGP paths. At
point, the mandatory BGP NEXT_HOP attribute value most commonly this point, the mandatory BGP NEXT_HOP attribute value most commonly
belongs to the IP subnet that the BGP speaker shares with advertising belongs to the IP subnet that the BGP speaker shares with the
neighbor. In this case, it is common for implementation to treat all advertising neighbor. In this case, it is common for implementations
NEXT_HOP values as having the same "internal cost" to reach them per to treat all NEXT_HOP values as having the same "internal cost" to
the guidance of step (e) of Section 9.1.2.2. In some cases, either reach them per the guidance of step (e) of Section 9.1.2.2. In some
static routing or an IGP routing protocol could be running between cases, either static routing or an IGP routing protocol could be used
the BGP speakers peering over eBGP session. An implementation may between the BGP speakers peering using an eBGP session. An
use the metric discovered from the above sources to perform tie- implementation may use the next-hop metric discovered from the above
breaking even for eBGP paths. sources to perform tie-breaking even for eBGP paths.
In case when MED attribute is present in some paths, the set of If the MED attribute is present in some paths, the set of multipath
allowed multipath routes will most likely be reduced to the ones routes allowed will most likely be reduced to the ones coming from
coming from the same peer AS, per step (c) of Section 9.1.2.2. This the same peer AS, per step (c) of Section 9.1.2.2. This is unless an
is unless the implementation provided a configuration knob to always implementation provides a configuration knob to always compare MED
compare MED attributes across all paths, as recommended in [RFC4451]. attributes across all paths, as recommended by [RFC4451]. In the
In the latter case, the presence of MED attribute does not narrow the latter case, the presence of the MED attribute does not automatically
candidate path set only to the same peer AS. reduce the candidate path set to the same peer AS only.
4. Multipath among iBGP learned paths 4. Multipath among iBGP learned paths
When all paths for a prefix are learned via iBGP, the tie-breaking In most cases iBGP is used along with an underlying IGP. Thus, when
commonly occurs based on IGP metric of the NEXT_HOP attribute, since all paths for a prefix are learned via iBGP, the tie-breaking
in most cases iBGP is used along with an underlying IGP. It is commonly occurs based on IGP metric of the NEXT_HOP attribute. In
possible, in some implementations, to ignore the IGP cost as well, if some implementations, it is possible to ignore the IGP cost as well,
all of the paths are reachable via some kind of tunneling mechanism, if all of the paths are reachable via some kind of tunneling
such as MPLS ([RFC3031]). This is enabled via a knob referred to as mechanism, such as MPLS [RFC3031]. This is enabled via a knob
"skip igp check" in this document. Notice that there is no standard referred to in this document as "skip igp check". Notice that there is
way for a BGP speaker to detect presence of such tunneling techniques no standard way for a BGP speaker to detect presence of such
other than relying on configuration settings. tunneling techniques other than relying on the configuration
settings.
When iBGP is deployed with BGP route-reflectors per [RFC4456] the When iBGP is deployed with BGP route-reflectors per [RFC4456], the
path attribute list may include the CLUSTER_LIST attribute. Most path attribute list may include the CLUSTER_LIST attribute. Many
implementations commonly ignore it for the purpose of ECMP route implementations ignore it for the purpose of ECMP route selection,
selection, assuming that IGP cost alone should be sufficient for loop assuming that IGP cost alone should be sufficient for loop
prevention. This assumption may not hold when IGP is not deployed, prevention. This assumption may not hold when IGP is not deployed,
and instead iBGP sessions are configured to reset the NEXT_HOP and instead iBGP sessions are configured to reset the NEXT_HOP
attribute on every node (this also assumes the use of directly attribute to "self" on every node. This also assumes the use of
connected link IP addresses for session formation). In this case, directly connected link addresses for session formation. In this
ignoring CLUSTER_LIST length might lead to routing loops. It is case, ignoring CLUSTER_LIST length might lead to routing loops. It
therefore recommended for implementations to have a knob that enables is therefore recommended for implementations to have a knob that
accounting for CLUSTER_LIST length when performing multipath route enables accounting for CLUSTER_LIST length when performing multipath
selection. In this case, CLUSTER_LIST attribute length should be route selection. Effectively, the CLUSTER_LIST attribute length
effectively used to replace the IGP metric. should be used in place of the IGP metric.
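The recommended knob can be sketched in Python as follows. The names multipath_metric, lowest_metric_paths, and use_cluster_list_length are hypothetical, and each path is modeled as a (name, IGP cost, CLUSTER_LIST) tuple purely for illustration.

```python
def multipath_metric(igp_cost, cluster_list, use_cluster_list_length=False):
    """Metric compared at the interior-cost step of multipath selection."""
    # With the knob enabled, CLUSTER_LIST length stands in for the IGP cost.
    if use_cluster_list_length:
        return len(cluster_list)
    return igp_cost


def lowest_metric_paths(paths, use_cluster_list_length=False):
    """paths: list of (name, igp_cost, cluster_list); returns tied names."""
    scored = [
        (multipath_metric(cost, clusters, use_cluster_list_length), name)
        for name, cost, clusters in paths
    ]
    lowest = min(metric for metric, _ in scored)
    return [name for metric, name in scored if metric == lowest]
```

When no IGP is deployed every path reports the same interior cost, so only the knob separates a path reflected once from a path reflected twice.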
Similar to the route-reflector scenario, the use of BGP Similarly to the route-reflector scenario, the use of BGP
confederations assumes presence of an IGP for proper loop prevention confederations in multipath scenarios assumes presence of an IGP for
in multipath scenarios, and use the IGP metric as the final tie- proper loop prevention and uses the IGP metric as the final tie-
breaker for multipath routing. In addition to this, and similar to breaker for multipath routing. In addition to that, and similar to
eBGP case, implementation often require that equivalent paths belong eBGP case, implementations often require that in order to be
to the same peer member AS as the best-path. It is useful to have considered equal, the paths must belong to the same peer member AS as
two configuration knobs, one enabling "multipath same confederation the best-path. It is useful to have the following two configuration
member peer-as" and another enabling less restrictive "confed as-path knobs. First one enabling "multipath same confederation member peer-
multipath relaxed", which allows selecting multipath routes going via as", and another enabling less restrictive "confed as-path multipath
relaxed" rule, which allows selecting multipath routes reachable via
any confederation member peer AS. As mentioned above, the any confederation member peer AS. As mentioned above, the
AS_CONFED_SEQUENCE value length is usually ignored for the purpose of AS_CONFED_SEQUENCE value length is usually ignored for the purpose of
AS_PATH length comparison, relying on IGP cost instead for loop AS_PATH length comparison, relying instead on the IGP cost for loop
prevention. prevention.
In case if IGP is not present with BGP confederation deployment, and In cases when IGP is not present with BGP confederation deployment,
similar to route-reflection case, it may be needed to consider and similar to route-reflection case, it may be necessary to consider
AS_CONFED_SEQUENCE length when selecting the equivalent routes, AS_CONFED_SEQUENCE length when selecting the equivalent routes,
effectively using it as a substitution for IGP metric. A separate effectively using it as a substitution for an IGP metric. A separate
configuration knob is needed to allow this behavior. configuration knob is needed to allow this behavior.
Per [RFC5065] the path learned over BGP intra-confederation peering Per [RFC5065] paths learned over BGP intra-confederation peering
sessions are treated as iBGP. There is no specification or sessions are treated as iBGP. There is no specification or
operational document that defines how a mixed iBGP route-reflector operational document that defines how a mixed iBGP route-reflector
and confederation based model would work together. Therefore, this and confederation based deployments would work together. Therefore,
document does not make recommendations or considers this case. this document does not make recommendations for the above case.
5. Multipath among eBGP and iBGP paths 5. Multipath among eBGP and iBGP paths
The best-path selection algorithm explicitly prefers eBGP paths over The best-path selection algorithm explicitly prefers eBGP paths over
iBGP (or learned from BGP confederation member AS, which is per iBGP or learned from a BGP confederation member AS, which is, as per
[RFC5065] is treated the same as iBGP from perspective of best-path [RFC5065] treated the same as iBGP from perspective of best-path
selection). In some case, allowing multipath routing between eBGP selection. In some cases however, it might be beneficial to allow
and iBGP learned paths might be beneficial. This is only possible if multipath routing between eBGP and iBGP learned paths. This is only
some sort of tunneling technique is used to reach both the eBGP and possible if some sort of tunneling technique is used to reach both
iBGP path. If this feature is enabled, the equivalent routes are the eBGP and iBGP paths. If this feature is enabled, the equal
selection by stopping the tie-breaking process prior at the MED routes are selected prior to the MED comparison step (c) in
comparison step (c) in Section 9.1.2.2 of [RFC4271]. Section 9.1.2.2 [RFC4271].
6. Multipath with AIGP 6. Multipath with AIGP
AIGP attribute defined in [RFC7311] must be used for best-path AIGP attribute defined in [RFC7311] must be used for best-path
selection prior to running any logic of Section 9.1.2.2. Only the selection prior to running any logic of Section 9.1.2.2 [RFC4271].
paths with minimal value of AIGP metric are eligible for further Only the paths with minimal value of AIGP metric are eligible for
consideration of tie-breaking rules. The rest of multipath selection further consideration of tie-breaking rules. The rest of multipath
logic remains the same. selection logic remains the same.
7. Best path advertisement 7. Best path advertisement
Event though multiple equivalent paths may be selected for Unless BGP "Add-Path" feature described in [RFC7911] is enabled and
programming into the routing table, the BGP speaker always announces even though multiple equal paths may be selected for programming into
single best-path to its peers, unless BGP "Add-Path" feature has been the routing table, a BGP speaker announces single best-path only to
enabled as described in [RFC7911]. The unique best-path is elected its peers. The unique best-path is elected among the multi-path set
among the multi-path set using the standard tie-breaking rules. using the standard tie-breaking rules.
8. Multipath and non-deterministic tie-breaking 8. Multipath and non-deterministic tie-breaking
Some implementations may implement non-standard tie-breaking using Some implementations may implement non-standard tie-breaking logic,
the oldest path rule to improve routing stability. This is generally for example using the oldest path rule (reference). This is generally
not recommended, and may interact with multi-path route selection on not recommended, and may interact with multi-path route selection on
downstream BGP speakers. That is, after a route flap that affects downstream BGP speakers. That is, after a route flap that affects
the best-path upstream, the original best path would not be the best-path upstream, the original best path would not be
recovered, and the older path still be advertised, possibly affecting recovered, and the older path would still be advertised, possibly
the tie-breaking rules on down-stream device, for example if the affecting the tie-breaking rules on a downstream device if, for
AS_PATH contents are different from previous. example, the AS_PATH contents differ from the previous ones.
9. Weighted equal-cost multipath 9. Weighted equal-cost multipath
The proposal in [I-D.ietf-idr-link-bandwidth] defines conditions The proposal in [I-D.ietf-idr-link-bandwidth] defines conditions
where iBGP multipath feature might inform the routing table of the where iBGP multipath feature might inform the routing table of
"weights" associated with the multiple paths. The document defines "weights" associated with the multiple external paths.
the applicability only in iBGP case, though there are implementations [I-D.ietf-idr-link-bandwidth] defines the weight extended community
that apply it to eBGP multipath as well. The proposal does not attribute as non-transitive, considers the applicability in iBGP case
change the equal-cost multipath selection logic, only associates only, though there are implementations that apply it to eBGP as well.
additional load-sharing attributes with equivalent paths. The proposal does not change the equal-cost multipath selection
logic, but associates additional load-sharing attributes with
equivalent paths.
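The way such weights might translate into load-sharing ratios can be sketched in Python as follows. This is an assumption-laden illustration: the load_share name is hypothetical, every equivalent path is assumed to carry a bandwidth value, and the values are modeled as plain integers rather than the encoding used by [I-D.ietf-idr-link-bandwidth].

```python
from fractions import Fraction


def load_share(link_bandwidth):
    """Map each next hop to its share of traffic, proportional to bandwidth.

    link_bandwidth: dict of next hop -> integer bandwidth value.
    """
    total = sum(link_bandwidth.values())
    if total <= 0:
        raise ValueError("total bandwidth must be positive")
    # The multipath set itself is unchanged; only the weights are derived.
    return {nh: Fraction(bw, total) for nh, bw in link_bandwidth.items()}
```

Equal bandwidth values reduce this to ordinary equal-cost load sharing.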
10. Informative References 10. Informative References
[BGPMP] "BGP Best Path Selection Algorithm", [BGPMP] "BGP Best Path Selection Algorithm",
<http://www.cisco.com/c/en/us/support/docs/ip/ <http://www.cisco.com/c/en/us/support/docs/ip/
border-gateway-protocol-bgp/13753-25.html>. border-gateway-protocol-bgp/13753-25.html>.
[I-D.ietf-idr-link-bandwidth] [I-D.ietf-idr-link-bandwidth]
Mohapatra, P. and R. Fernando, "BGP Link Bandwidth Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
Extended Community", draft-ietf-idr-link-bandwidth-06 Extended Community", draft-ietf-idr-link-bandwidth-07
(work in progress), January 2013. (work in progress), March 2018.
[RFC3031] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol [RFC3031] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol
Label Switching Architecture", RFC 3031, Label Switching Architecture", RFC 3031,
DOI 10.17487/RFC3031, January 2001, DOI 10.17487/RFC3031, January 2001,
<https://www.rfc-editor.org/info/rfc3031>. <https://www.rfc-editor.org/info/rfc3031>.
[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
Border Gateway Protocol 4 (BGP-4)", RFC 4271, Border Gateway Protocol 4 (BGP-4)", RFC 4271,
DOI 10.17487/RFC4271, January 2006, DOI 10.17487/RFC4271, January 2006,
<https://www.rfc-editor.org/info/rfc4271>. <https://www.rfc-editor.org/info/rfc4271>.
skipping to change at page 6, line 39 skipping to change at page 6, line 44
[RFC7311] Mohapatra, P., Fernando, R., Rosen, E., and J. Uttaro, [RFC7311] Mohapatra, P., Fernando, R., Rosen, E., and J. Uttaro,
"The Accumulated IGP Metric Attribute for BGP", RFC 7311, "The Accumulated IGP Metric Attribute for BGP", RFC 7311,
DOI 10.17487/RFC7311, August 2014, DOI 10.17487/RFC7311, August 2014,
<https://www.rfc-editor.org/info/rfc7311>. <https://www.rfc-editor.org/info/rfc7311>.
[RFC7911] Walton, D., Retana, A., Chen, E., and J. Scudder, [RFC7911] Walton, D., Retana, A., Chen, E., and J. Scudder,
"Advertisement of Multiple Paths in BGP", RFC 7911, "Advertisement of Multiple Paths in BGP", RFC 7911,
DOI 10.17487/RFC7911, July 2016, DOI 10.17487/RFC7911, July 2016,
<https://www.rfc-editor.org/info/rfc7911>. <https://www.rfc-editor.org/info/rfc7911>.
Author's Address Authors' Addresses
Petr Lapukhov Petr Lapukhov
Facebook Facebook
1 Hacker Way 1 Hacker Way
Menlo Park, CA 94025 Menlo Park, CA 94025
US US
Email: petr@fb.com Email: petr@fb.com
Jeff Tantsura
Apstra, Inc.
Menlo Park, CA 94025
US
Email: jefftant.ietf@gmail.com
 End of changes. 28 change blocks. 
130 lines changed or deleted 138 lines changed or added
