Network Working Group                                            R. Bush
Internet-Draft                                 Internet Initiative Japan
Intended status: Standards Track                                 J. Haas
Expires: April 4, 2019                                        J. Scudder
                                                  Juniper Networks, Inc.
                                                               A. Nipper
                                                              C. Dietzel
                                                         October 1, 2018

        Making Route Servers Aware of Data Link Failures at IXPs

Abstract

   When BGP route servers are used, the data plane is not congruent with
   the control plane.  Therefore, peers at an Internet exchange can lose
   data connectivity without the control plane being aware of it, and
   packets are lost.  This document proposes the use of a newly defined
   BGP Subsequent Address Family Identifier (SAFI) both to allow the
   route server to request its clients use BFD to track data plane
   connectivity to their peers' addresses, and for the clients to signal
   that connectivity state back to the route server.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" are to
   be interpreted as described in [RFC2119] only when they appear in all
   upper case.  They may also appear in lower or mixed case as English
   words, without normative meaning.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 4, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Definitions
   3.  Overview
   4.  Next Hop Validation
     4.1.  ReachAsk
     4.2.  LocReach
     4.3.  ReachTell
     4.4.  NHIB
   5.  Advertising NH-Reach state in BGP
   6.  Client Procedures for NH-Reach Changes
   7.  Recommendations for Using BFD
   8.  Other Considerations
   9.  Acknowledgments
   10. IANA Considerations
   11. Security Considerations
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Appendix A.  Summary of Document Changes
   Appendix B.  Other Forms of Connectivity Checks
   Authors' Addresses

1.  Introduction

   In configurations (typically Internet Exchange Points (IXPs)) where
   EBGP routing information is exchanged between client routers through
   the agency of a route server (RS) [RFC7947], but traffic is exchanged
   directly, operational issues can arise when partial data plane
   connectivity exists among the route server client routers.  Since the
   data plane is not congruent with the control plane, the client
   routers on the IXP can lose data connectivity without the control
   plane - the route server - being aware of it, resulting in
   significant data loss.

   To remedy this, two basic problems need to be solved:

   1.  Client routers must have a means of verifying connectivity
       amongst themselves, and
   2.  Client routers must have a means of communicating the knowledge
       of the failure (and restoration) back to the route server.

   The first can be solved by application of Bidirectional Forwarding
   Detection [RFC5880].  The second can be solved by exchanging BGP
   routes which use the NH-Reach Subsequent Address Family Identifier
   (SAFI) defined in this document.

   Throughout this document, we generally assume that the route server
   being discussed is able to represent different RIBs towards different
   clients, as discussed in [RFC7947].  If this is
   not the case, the procedures described here to allow BFD to be
   automatically provisioned between clients still have value; however,
   the procedures for signaling reachability back to the route server
   may not.

   Throughout this document, we refer to the "route server", "RS" or
   just "server" and the "client" to describe the two BGP routers
   engaging in the exchange of information.  We observe that there could
   be other applications for this extension.  Our use of terminology is
   intended for clarity of description, and not to limit the future
   applicability of the proposal.

   [I-D.ietf-idr-bgp-bestpath-selection-criteria] discusses enhancement
   of the route resolvability condition of [RFC4271]
   to include next hop reachability and path availability checks.  This
   specification represents in part an instance of such, implemented
   using BFD as the OAM mechanism.

2.  Definitions

   o  Indirect peer: If a route server is configured such that routes
      from a given client might be sent to some other client, or vice-
      versa, those two clients are considered to be indirect peers.
   o  Indirect Peer's Address, IPA, next hop: We refer frequently to a
      next hop.  It should generally be clear from context what is
      intended, almost always an address associated with an indirect
      peer (the exception, when an indirect peer sends a third party
      next hop, is discussed in Section 3).  In Section 5 we discuss the
      MP-BGP [RFC4760] Next Hop field; this is distinguished by its
      capitalization and should also be clear from context.  Later in
      that section we define the Indirect Peer's Address field of the
      NLRI, also called "IPA".  It will be clear to the reader that this
      refers to the "next hops" discussed elsewhere in the document, but
      we don't use the name "next hop" for this field to avoid confusion
      with the pre-existing next hop path attribute of [RFC4271] and
      attribute field of [RFC4760].
   o  RS: Route Server.  See [RFC7947].

3.  Overview

   As with the base BGP protocol, we model the function of this
   extension as the interaction between a conceptual set of databases:

   o  ReachAsk: The reachability request database.  A database of next
      hops (host addresses) for which data plane reachability is being
      queried.
   o  ReachAsk-Out: A set of queries sent to the client.
   o  ReachAsk-In: A set of queries received from the route server.
   o  ReachTell: The reachability response database.  A database of
      responses to ReachAsk queries, indicating what is known about data
      plane reachability.
   o  ReachTell-Out: The responses being sent to the route server.
   o  ReachTell-In: The response received from the client.
   o  LocReach: The local reachability database.
   o  NHIB: Next Hop Information Base.  Stores what is known about the
      client's reachability to its next hops.

   +--------------------------------------------------------+
   |   +------------+    +------------+    +------------+   |
   |   |    Per-    |    | Configured |    |    Per-    |   |
   |   |   Client   |    |  indirect  |    |   Client   |   |
   |   |    NHIB    |    |   peers    |    |    RIB     |   |
   |   +-----^------+    +------------+    +-----+------+   |
   |         |                         \         |          |
   |   +-----+------+                   `-->-----v------+   |
   |   |ReachTell-In|                      |ReachAsk-Out|   |
   |   +------^-----+     Route Server     +-----+------+   |
   +----------|----------------------------------|----------+
              |                                  |
              |                                  |
   +----------|----------------------------------|----------+
   |   +------+------+       RS Client     +-----v-----+    |
   |   |ReachTell-Out|                     |ReachAsk-In|    |
   |   +------^------+                     +-----+-----+    |
   |          |          +------------+          |          |
   |          |          |            |          |          |
   |          `----------+  LocReach  <----------'          |
   |                     |            |                     |
   |                     +------------+                     |
   +--------------------------------------------------------+

   Route Server, RS Client, and Reachability Ask and Tell databases with
                               In/Out Queues

   In outline, the route server requests its client to track
   connectivity for all the potential next hops the RS might send to the
   client, by sending these next hops as ReachAsk "routes".  The client
   tracks connectivity using BFD and reports its connectivity status to
   the RS using ReachTell "routes".  Connectivity status may be that the
   next hop is reachable, unreachable, or unknown.  Once the RS has been
   informed by the client of its connectivity, it uses this information
   to influence the route selection the RS performs on behalf of the
   client.  Details are elaborated in the following sections.

4.  Next Hop Validation

   Below, we detail procedures where a route server tells its client
   router about other client next hops by sending it ReachAsk routes and
   the client router verifies connectivity to those other client routers
   and communicates its findings back to the RS using ReachTell routes.
   The RS uses the received ReachTell routes as input to the NHIB and
   hence the route selection process it performs on behalf of the
   client.

4.1.  ReachAsk

   The route server maintains a ReachAsk database for each client that
   supports this proposal, that is, for each client that has advertised
   support (Section 5) for the NH-Reach SAFI.  This database is the
   union of:

   o  The set of next hops found in the associated per-client Loc-RIB
      (see [RFC7947]).
   o  The set of addresses of this client's indirect peers (Section 2).
   o  The RS MAY also add other entries, for example under
      configuration.

   We note that under most circumstances, the first (Loc-RIB next hops)
   set will be a subset of the second (indirect peers) set.  For this
   not to be the case, a client would have to have sent a "third party"
   next hop [RFC4271] to the server.  To cover such a case, an
   implementation MAY note any such next hops, and include them in its
   list of indirect peers.  (This implies that if a third party next hop
   for client C is conveyed to client A, not only will C be placed in
   A's ReachAsk database, but A will be placed in C's ReachAsk
   database.)

   The contents of the ReachAsk database are communicated to the client
   using the NLRI format and procedures described in Section 5.
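
   As a rough illustration only, the union described above could be
   computed as in the following Python sketch.  The function name, the
   parameter names, and the example addresses are hypothetical and are
   not defined by this document.

```python
from ipaddress import ip_address


def build_reach_ask(loc_rib_next_hops, indirect_peer_addrs, configured=()):
    """Return one client's ReachAsk set as the union of its three inputs.

    All three arguments are iterables of address strings (hypothetical
    inputs; a real route server would read them from its own tables).
    """
    return (
        {ip_address(a) for a in loc_rib_next_hops}      # per-client Loc-RIB next hops
        | {ip_address(a) for a in indirect_peer_addrs}  # indirect peers' addresses
        | {ip_address(a) for a in configured}           # optional extra entries
    )


reach_ask = build_reach_ask(
    loc_rib_next_hops=["192.0.2.10", "192.0.2.20"],
    indirect_peer_addrs=["192.0.2.10", "192.0.2.20", "192.0.2.30"],
    configured=["192.0.2.40"],
)
```

   Because the result is a set, a next hop that appears both in the
   Loc-RIB and in the indirect peer list is requested only once.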

4.2.  LocReach

   The client MUST attempt to track data plane connectivity to each host
   address depicted in the ReachAsk database.  It MAY also track
   connectivity to other addresses.  The use of BFD for this purpose is
   detailed in Section 6.

   For each address being tracked, its state is maintained by the client
   in a LocReach entry.  The state can be:

   o  Unknown.  Connectivity status is unknown.  This may be due to a
      temporary or permanent lack of feasible OAM mechanism to determine
      the status.
   o  Up.  The address has been determined to be reachable.
   o  Down.  The address has been determined to be unreachable.

   The LocReach database is used as input for the ReachTell database; it
   MAY also be used as input to the client's route resolvability
   condition [RFC4271].

4.3.  ReachTell

   The ReachTell database contains an entry for every entry in the
   LocReach database.

   The contents of the ReachTell database are communicated to the server
   using the NLRI format and procedures described in Section 5.

4.4.  NHIB

   The route server maintains a per-client Next Hop Information Base, or
   NHIB.  This contains the information about next hop status received
   from ReachTell.

   In computing its per-client Loc-RIB, the RS uses the content of the
   related per-client NHIB as input to the route resolvability condition
   [RFC4271].  The next hop being resolved is
   looked up in the NHIB and its state determined:

   o  Up next hops are considered resolvable.
   o  Unknown next hops MAY be considered resolvable.  They MAY be less
      preferred for selection.
   o  Down next hops MUST NOT be considered resolvable.
   o  If a given next hop is not present in the NHIB, but is present in
      ReachAsk-Out, either the client has not responded yet (a transient
      condition) or an error exists.  Similar to Unknown next hops, such
      routes MAY be considered resolvable; they MAY be less preferred.
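
   The rules above might be sketched in Python as follows.  This is a
   non-normative illustration: the names "is_resolvable",
   "preference_rank", and the "accept_unknown" policy knob are
   assumptions made for the example, and the NHIB is modeled as a
   plain dictionary.

```python
from enum import Enum


class Reach(Enum):
    UNKNOWN = 0
    UP = 1
    DOWN = 2


def is_resolvable(next_hop, nhib, accept_unknown=True):
    """Apply a per-client NHIB to the route resolvability check.

    nhib maps a next hop address to the Reach state learned via ReachTell.
    accept_unknown is a hypothetical local policy switch, since Unknown
    (or not yet reported) next hops MAY be considered resolvable.
    """
    state = nhib.get(next_hop, Reach.UNKNOWN)  # absent is handled like Unknown
    if state is Reach.UP:
        return True
    if state is Reach.DOWN:
        return False  # Down next hops MUST NOT be considered resolvable
    return accept_unknown


def preference_rank(next_hop, nhib):
    """Lower rank sorts first; Unknown or absent next hops rank after Up."""
    return 0 if nhib.get(next_hop) is Reach.UP else 1


nhib = {
    "192.0.2.10": Reach.UP,
    "192.0.2.20": Reach.DOWN,
    "192.0.2.30": Reach.UNKNOWN,
}
candidates = ["192.0.2.30", "192.0.2.20", "192.0.2.10", "192.0.2.40"]
usable = sorted(
    (nh for nh in candidates if is_resolvable(nh, nhib)),
    key=lambda nh: preference_rank(nh, nhib),
)
```

   In this example the Down next hop is dropped, the Up next hop is
   ordered first, and the Unknown and unreported next hops remain
   eligible but less preferred.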

5.  Advertising NH-Reach state in BGP

   A new BGP SAFI, the NH-Reach SAFI, is defined in this document.  It
   has been assigned value TBD.  A route server or a route server client
   using the procedures in this document MUST advertise support for this
   SAFI, for the IPv4 and/or IPv6 Address Family Identifier (AFI).  The
   use of this SAFI with any other AFI is not defined by this document.

   NH-Reach NLRI "routes" have a Length of Next Hop Network Address
   value of 0, therefore they have an empty Network Address of Next Hop
   field (section 3 of [RFC4760]).

   Since as specified here, ReachTell "routes" from different clients
   populate distinct databases on the RS, there will generally be only a
   single path per "route"; this implies that route selection need not
   be performed (or equivalently, that it's trivial to perform).

   In the other direction, a client might peer with multiple route
   servers and receive differing sets of ReachAsk routes from them.  An
   implementation MAY handle this situation by implementing a distinct
   ReachAsk and ReachTell per server, but it MAY also handle it by
   placing all servers' ReachAsk "routes" into a single ReachAsk, and
   sending the results to all servers from a single ReachTell.  This
   would imply some route server(s) might get ReachTell results they had
   not asked for, but this is permissible in any case.  Again, since the
   contents of ReachAsk are simply a set of host routes to be tested,
   route selection over a combined ReachAsk MAY be omitted.

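   ReachAsk and ReachTell entries are exchanged using the NH-Reach NLRI
   format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |T|Reserved |Sta|    Indirect Peer's Address (4 or 16 octets)   |
   .      ...  Indirect Peer's Address (4 or 16 octets) ...        .
   .                                                               .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+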

                           NH-Reach NLRI Format

   o  T: Type is a one-bit field that can take the value 0, meaning the
      NLRI is a ReachAsk entry, or 1, meaning it is a ReachTell entry.
   o  Reserved: These five bits are reserved.  They MUST be sent as zero
      and MUST be disregarded on receipt.
   o  Sta: State is a two-bit field used to signal the LocReach
      (Section 4.2) state:

      *  0 or 3: Unknown.
      *  1: Up.
      *  2: Down.

      Although either 0 or 3 is to be interpreted as "Unknown", the
      value 0 MUST be used on transmission.  The value 3 MUST be
      accepted as an alias for 0 on receipt.
   o  The Indirect Peer's Address ("IPA") field is an IPv4 or IPv6 host
      route, depending on whether the AFI is IPv4 or IPv6.

   ReachAsk and ReachTell entries MUST NOT be propagated from one BGP
   peering session to another; the routes are not transitive.

   The IPA field is the key for the NH-Reach NLRI type; the information
   encoded in the top octet is non-key information.  It is possible in
   principle (although unlikely) for two NLRI to be validly present in
   an UPDATE message with identical IPA fields but different types.
   However, two NLRI with the same IPA field and different State fields
   MUST NOT be encoded in the same UPDATE message.  If such is
   encountered, the receiver MUST behave as though the state "Unknown"
   was received for the IPA in question.
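
   The following Python sketch shows one way the single NLRI layout
   above could be encoded and decoded.  It is illustrative only: the
   function and constant names are hypothetical, the sketch assumes
   bit 0 of the diagram is the most significant bit of the first
   octet, and it does not cover UPDATE message framing or the
   duplicate-IPA rule.

```python
from ipaddress import IPv4Address, IPv6Address, ip_address

REACH_ASK, REACH_TELL = 0, 1   # values of the T bit
UNKNOWN, UP, DOWN = 0, 1, 2    # values of the Sta field


def encode_nlri(nlri_type, state, address):
    """Build one NH-Reach NLRI: a flags octet followed by the IPA."""
    if nlri_type not in (REACH_ASK, REACH_TELL):
        raise ValueError("T must be 0 (ReachAsk) or 1 (ReachTell)")
    if state not in (UNKNOWN, UP, DOWN):
        raise ValueError("only 0, 1 or 2 may be sent in Sta")
    first = (nlri_type << 7) | state  # the five reserved bits are sent as zero
    return bytes([first]) + ip_address(address).packed


def decode_nlri(data):
    """Parse one NH-Reach NLRI into (type, state, address)."""
    if len(data) not in (5, 17):
        raise ValueError("expected 1 flags octet plus a 4 or 16 octet address")
    first = data[0]
    nlri_type = first >> 7
    state = first & 0x03           # reserved bits are disregarded on receipt
    if state == 3:
        state = UNKNOWN            # 3 is accepted as an alias for Unknown
    addr_cls = IPv4Address if len(data) == 5 else IPv6Address
    return nlri_type, state, addr_cls(data[1:])


wire = encode_nlri(REACH_TELL, UP, "192.0.2.1")
decoded = decode_nlri(wire)
```

   The decoder masks out the reserved bits rather than rejecting them,
   which mirrors the "MUST be disregarded on receipt" rule.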

6.  Client Procedures for NH-Reach Changes

   When an entry is added to a route server client's ReachAsk-In for a
   route server peering session, the client will then attempt to verify
   connectivity to the host depicted by that entry.  The procedure
   described in this specification utilizes BFD.

   If no BFD session to this next hop exists, a BFD session is
   provisioned to that IP address and the LocReach reachability state
   (Section 4.2) is set to Unknown.

   If the client cannot establish a BFD session with an entry in its
   ReachAsk-In, the next hop remains in LocReach with its reachability
   state Unknown.

   Once the BFD session moves to the Up state, the LocReach reachability
   state is set to Up.

   When the BFD session transitions out of the Up state to the Down
   state, the LocReach reachability state is set to Down.

   If the BFD session transitions out of the Up state to the AdminDown
   state, the LocReach reachability state is set to Unknown.

   When entries are removed from the route server client's ReachAsk-In
   for a route server peering session, the client MAY delay de-
   provisioning the BFD peering session.  If the client delays de-
   provisioning the session, it should remove it if the BFD session
   transitions to the Down or AdminDown states.
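
   The state handling described in this section can be summarized by
   the non-normative Python sketch below.  The function names are
   hypothetical; the BFD state names (AdminDown, Down, Init, Up) are
   those of [RFC5880].  Transitions that this section does not mention
   are assumed here to leave the LocReach state unchanged.

```python
UNKNOWN, UP, DOWN = "Unknown", "Up", "Down"  # LocReach states (Section 4.2)


def next_loc_reach(current, old_bfd, new_bfd):
    """Return the LocReach state after a BFD session state change."""
    if new_bfd == "Up":
        return UP                   # session reached the Up state
    if old_bfd == "Up" and new_bfd == "Down":
        return DOWN                 # Up -> Down
    if old_bfd == "Up" and new_bfd == "AdminDown":
        return UNKNOWN              # Up -> AdminDown
    return current                  # assumption: other changes keep the state


def replay(bfd_states):
    """Replay BFD states for a newly provisioned session.

    A new session starts with LocReach Unknown; its BFD state is
    assumed to start at Down.
    """
    state, previous = UNKNOWN, "Down"
    history = []
    for new in bfd_states:
        state = next_loc_reach(state, previous, new)
        history.append(state)
        previous = new
    return history


history = replay(["Init", "Up", "Down", "Init", "Up", "AdminDown"])
```

   A session that never reaches Up therefore stays Unknown, matching
   the case where the client cannot establish a BFD session.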

7.  Recommendations for Using BFD

   The RECOMMENDED way for a client router to confirm that data plane
   connectivity to its next hops is available is to use BFD in
   asynchronous mode.  Echo mode MAY be used if both client routers
   running a BFD session support it.  The use of authentication in BFD
   is OPTIONAL, as there is a certain level of trust between the
   operators of the client routers at a particular IXP.  If trust cannot
   be assumed, it is recommended to use pair-wise keys (how this can be
   achieved is outside the scope of this document).  The TTL/Hop Limit
   values described in section 5 of [RFC5881] MUST be obeyed in order to
   shield BFD sessions against packets coming from outside the IXP.

   The following values of the BFD configuration of client routers (see
   section 6.8.1 of [RFC5880]) are RECOMMENDED:

   o  DesiredMinTxInterval: 1,000,000 (microseconds)
   o  RequiredMinRxInterval: 1,000,000 (microseconds)
   o  DetectMult: 3

   A client router administrator MAY select more appropriate values to
   meet the special needs of a particular deployment.
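
   For orientation, the short Python sketch below works out the
   asynchronous mode Detection Time that results when both clients use
   the values above, following the calculation given in [RFC5880].
   The function and parameter names are hypothetical.

```python
def detection_time_us(remote_detect_mult, local_required_min_rx_us,
                      remote_desired_min_tx_us):
    """Asynchronous mode Detection Time in microseconds.

    The remote system's Detect Mult is multiplied by the agreed remote
    transmit interval, which is the greater of the local
    RequiredMinRxInterval and the remote DesiredMinTxInterval.
    """
    agreed_interval = max(local_required_min_rx_us, remote_desired_min_tx_us)
    return remote_detect_mult * agreed_interval


# Both ends configured with the RECOMMENDED values from this section.
recommended = detection_time_us(
    remote_detect_mult=3,
    local_required_min_rx_us=1_000_000,
    remote_desired_min_tx_us=1_000_000,
)
seconds = recommended / 1_000_000
```

   With the recommended values a failure is therefore detected after
   roughly three seconds without received BFD Control packets.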

8.  Other Considerations

   For purposes of routing stability, implementations may wish to apply
   hysteresis ("holddown") to next hops that have transitioned from
   reachable to unreachable and back.
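
   One possible form of such hysteresis is sketched below in Python.
   This is only an illustration under stated assumptions: the class
   name, the 30 second default, and the choice to apply the holddown
   only after a next hop has failed at least once are all hypothetical
   and are not specified by this document.

```python
class HoldDown:
    """Delay reuse of a next hop that went unreachable and came back."""

    def __init__(self, holddown_s=30.0):
        self.holddown_s = holddown_s  # hypothetical default, in seconds
        self._up_since = None         # time of the latest move to reachable
        self._has_failed = False      # True once the next hop has gone down

    def observe(self, is_up, now):
        """Record a reachability observation taken at time 'now'."""
        if is_up:
            if self._up_since is None:
                self._up_since = now
        else:
            if self._up_since is not None:
                self._has_failed = True
            self._up_since = None

    def usable(self, now):
        """Return True if the next hop may be used at time 'now'."""
        if self._up_since is None:
            return False
        if not self._has_failed:
            return True               # never flapped: no holddown applied
        return now - self._up_since >= self.holddown_s


hd = HoldDown(holddown_s=30.0)
hd.observe(True, now=0.0)
first_up = hd.usable(now=1.0)
hd.observe(False, now=10.0)
hd.observe(True, now=20.0)
during_holddown = hd.usable(now=25.0)
after_holddown = hd.usable(now=50.0)
```

   Timestamps are passed in explicitly so the behavior can be checked
   without depending on a real clock.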

   Implementations MAY restrict the range of addresses with which they
   will attempt to form BFD relationships.  For example, an
   implementation might by default only allow BFD relationships with
   peers that share a subnetwork with the route server.  An
   implementation MAY apply such restrictions by default.

   In a route-server environment, use of this feature SHOULD be
   restricted to consider only routes that are advertised from within
   the IXP network.  This might include checks on AS_PATH length.

9.  Acknowledgments

   The authors would like to thank Thomas King for his contributions
   toward this work.

10.  IANA Considerations

   IANA is requested to allocate a value from the Subsequent Address
   Family Identifiers (SAFI) Parameters registry for this proposal.  Its
   Description in that registry shall be NH-Reach with a Reference of
   this RFC.

11.  Security Considerations

   The mechanism in this document permits a route server client to
   influence the contents of the route server's Adj-Ribs-Out through its
   reports of next hop reachability state using the NH-Reach SAFI.
   Since this state is per-client, if a route server client is able to
   inject NH-Reach routes for another route server's BGP session to a
   client, it can cause the route server to select different forwarding
   than otherwise expected.  This issue may be mitigated using transport
   security on the BGP sessions between the route server and its
   clients.  See [RFC4272].

   The NH-Reach SAFI enables the server to trigger creation of a BFD
   session on its client.  A malicious or misbehaving server could
   trigger an unreasonable number of sessions, a potential resource
   exhaustion attack.  The sedate default timers proposed in Section 7
   mitigate this; they also mitigate concerns about use of the client as
   a source of packets in a flooding attack.  An implementation MAY also
   impose limits on the number of BFD sessions it will create at the
   request of the server.

   The reachability tests between route server clients themselves may be
   a target for attack.  Such attacks may include forcing a BFD session
   Down through injecting false BFD state.  A less likely attack
   includes forcing a BFD session to stay Up when its real state is
   Down.  These attacks may be mitigated using the BFD security
   mechanisms defined in [RFC5880].

12.  References

12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006.

   [RFC4760]  Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
              "Multiprotocol Extensions for BGP-4", RFC 4760,
              DOI 10.17487/RFC4760, January 2007.

   [RFC5880]  Katz, D. and D. Ward, "Bidirectional Forwarding Detection
              (BFD)", RFC 5880, DOI 10.17487/RFC5880, June 2010.

   [RFC5881]  Katz, D. and D. Ward, "Bidirectional Forwarding Detection
              (BFD) for IPv4 and IPv6 (Single Hop)", RFC 5881,
              DOI 10.17487/RFC5881, June 2010.

   [RFC7947]  Jasinska, E., Hilliard, N., Raszuk, R., and N. Bakker,
              "Internet Exchange BGP Route Server", RFC 7947,
              DOI 10.17487/RFC7947, September 2016.

12.2.  Informative References

   [I-D.chen-bfd-unsolicited]
              Chen, E., Shen, N., and R. Raszuk, "Unsolicited BFD for
              Sessionless Applications", draft-chen-bfd-unsolicited-02
              (work in progress), January 2018.

   [I-D.ietf-idr-bgp-bestpath-selection-criteria]
              Asati, R., "BGP Bestpath Selection Criteria Enhancement",
              draft-ietf-idr-bgp-bestpath-selection-criteria-09 (work in
              progress), June 2018.

   [RFC4272]  Murphy, S., "BGP Security Vulnerabilities Analysis",
              RFC 4272, DOI 10.17487/RFC4272, January 2006.

   [RFC7880]  Pignataro, C., Ward, D., Akiya, N., Bhatia, M., and S.
              Pallagatti, "Seamless Bidirectional Forwarding Detection
              (S-BFD)", RFC 7880, DOI 10.17487/RFC7880, July 2016.

Appendix A.  Summary of Document Changes

   idr-06:  Refresh -05.
   idr-04 to idr-05:  Added reference to "BGP Bestpath Selection
     Criteria Enhancement" draft.  Rename "next hop" field of NLRI to
     "Indirect Peer's Address".  Add suggestion about AS_PATH length.
   idr-03 to idr-04:  Note other forms of connectivity checks.
   idr-02 to idr-03:  Substantial rewrite.  Introduce NLRI format that
     embeds state.
   idr-01 to idr-02:  Move from BGP-LS to NH-Reach SAFI.  Lots of
     editorial changes.
   idr-00 to idr-01:  Add BGP Capability.  Move from NH-Cost to BGP-LS.
   ymbk-01 to idr-00:  No technical changes; adopted by IDR.
   ymbk-00 to ymbk-01:  Clarifications to BFD procedures.  Use BFD state
     as an input to BGP route selection.

Appendix B.  Other Forms of Connectivity Checks

   RFC 5880/5881 BFD is a well-deployed feature.  For this reason, it
   was chosen as the connectivity check utilized for next hop
   reachability by this document.  As other forms of BFD become more
   widely deployed, they may also be utilized to provide the
   connectivity check functionality.

   Examples of other such BFD mechanisms include:

   o  Seamless BFD [RFC7880]
   o  Unsolicited BFD for Sessionless Applications

   Implementations MUST support RFC 5880/5881 BFD to be compliant with
   this specification.  Implementations MAY support other forms of
   connectivity check, including those mechanisms listed above, so long
   as they provide the ability to fall back to RFC 5880/5881 BFD.

Authors' Addresses

   Randy Bush
   Internet Initiative Japan
   5147 Crystal Springs
   Bainbridge Island, Washington  98110


   Jeffrey Haas
   Juniper Networks, Inc.
   1133 Innovation Way
   Sunnyvale, CA  94089


   John G. Scudder
   Juniper Networks, Inc.
   1133 Innovation Way
   Sunnyvale, CA  94089


   Arnold Nipper
   DE-CIX Management GmbH
   Lichtstrasse 43i
   Cologne  50825

   Christoph Dietzel
   DE-CIX Management GmbH
   Lichtstrasse 43i
   Cologne  50825