draft-ietf-rift-rift-04.txt

RIFT Working Group                                      The RIFT Authors
Internet-Draft                                             March 3, 2019
Intended status: Standards Track
Expires: September 4, 2019

                       RIFT: Routing in Fat Trees
                        draft-ietf-rift-rift-04
Abstract

   This document outlines a specialized, dynamic routing protocol for
   Clos and fat-tree network topologies.  The protocol (1) deals with
   fully automated construction of fat-tree topologies based on
   detection of links, (2) minimizes the amount of routing state held at
   each level, (3) automatically prunes and load balances topology
   flooding exchanges over a sufficient subset of links, (4) supports
   automatic disaggregation of prefixes on link and node failures to
   prevent black-holing and suboptimal routing, (5) allows traffic
   steering and re-routing policies, (6) allows loop-free non-ECMP
   forwarding, (7) automatically re-balances traffic towards the spines
   based on bandwidth available and finally (8) provides mechanisms to
   synchronize a limited key-value data-store that can be used after
   protocol convergence to e.g. bootstrap higher levels of
skipping to change at page 1, line 43
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 4, 2019.
Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.
Table of Contents

   1.  Authors . . . . . . . . . . . . . . . . . . . . . . . . . . .   4
   2.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   5
     2.1.  Requirements Language . . . . . . . . . . . . . . . . . .   7
   3.  Reference Frame . . . . . . . . . . . . . . . . . . . . . . .   7
     3.1.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   7
     3.2.  Topology  . . . . . . . . . . . . . . . . . . . . . . . .  10
   4.  Requirement Considerations  . . . . . . . . . . . . . . . . .  12
   5.  RIFT: Routing in Fat Trees  . . . . . . . . . . . . . . . . .  15
     5.1.  Overview  . . . . . . . . . . . . . . . . . . . . . . . .  16
       5.1.1.  Properties  . . . . . . . . . . . . . . . . . . . . .  16
       5.1.2.  Generalized Topology View . . . . . . . . . . . . . .  16
       5.1.3.  Fallen Leaf Problem . . . . . . . . . . . . . . . . .  26
       5.1.4.  Discovering Fallen Leaves . . . . . . . . . . . . . .  28
       5.1.5.  Addressing the Fallen Leaves Problem  . . . . . . . .  29
     5.2.  Specification . . . . . . . . . . . . . . . . . . . . . .  30
       5.2.1.  Transport . . . . . . . . . . . . . . . . . . . . . .  30
       5.2.2.  Link (Neighbor) Discovery (LIE Exchange)  . . . . . .  31
       5.2.3.  Topology Exchange (TIE Exchange)  . . . . . . . . . .  33
         5.2.3.1.  Topology Information Elements . . . . . . . . . .  33
         5.2.3.2.  South- and Northbound Representation  . . . . . .  33
         5.2.3.3.  Flooding  . . . . . . . . . . . . . . . . . . . .  36
         5.2.3.4.  TIE Flooding Scopes . . . . . . . . . . . . . . .  36
         5.2.3.5.  'Flood Only Node TIEs' Bit  . . . . . . . . . . .  39
         5.2.3.6.  Initial and Periodic Database Synchronization . .  40
         5.2.3.7.  Purging and Roll-Overs  . . . . . . . . . . . . .  40
         5.2.3.8.  Southbound Default Route Origination  . . . . . .  41
         5.2.3.9.  Northbound TIE Flooding Reduction . . . . . . . .  41
         5.2.3.10. Special Considerations  . . . . . . . . . . . . .  46
       5.2.4.  Reachability Computation  . . . . . . . . . . . . . .  47
         5.2.4.1.  Northbound SPF  . . . . . . . . . . . . . . . . .  47
         5.2.4.2.  Southbound SPF  . . . . . . . . . . . . . . . . .  48
         5.2.4.3.  East-West Forwarding Within a Level . . . . . . .  48
       5.2.5.  Automatic Disaggregation on Link & Node Failures  . .  48
         5.2.5.1.  Positive, Non-transitive Disaggregation . . . . .  48
         5.2.5.2.  Negative, Transitive Disaggregation for Fallen
                   Leafs . . . . . . . . . . . . . . . . . . . . . .  52
       5.2.6.  Attaching Prefixes  . . . . . . . . . . . . . . . . .  54
       5.2.7.  Optional Zero Touch Provisioning (ZTP)  . . . . . . .  63
         5.2.7.1.  Terminology . . . . . . . . . . . . . . . . . . .  64
         5.2.7.2.  Automatic SystemID Selection  . . . . . . . . . .  65
         5.2.7.3.  Generic Fabric Example  . . . . . . . . . . . . .  66
         5.2.7.4.  Level Determination Procedure . . . . . . . . . .  67
         5.2.7.5.  Resulting Topologies  . . . . . . . . . . . . . .  68
       5.2.8.  Stability Considerations  . . . . . . . . . . . . . .  70
     5.3.  Further Mechanisms  . . . . . . . . . . . . . . . . . . .  70
       5.3.1.  Overload Bit  . . . . . . . . . . . . . . . . . . . .  70
       5.3.2.  Optimized Route Computation on Leafs  . . . . . . . .  70
       5.3.3.  Mobility  . . . . . . . . . . . . . . . . . . . . . .  70
         5.3.3.1.  Clock Comparison  . . . . . . . . . . . . . . . .  72
         5.3.3.2.  Interaction between Time Stamps and Sequence
                   Counters  . . . . . . . . . . . . . . . . . . . .  72
         5.3.3.3.  Anycast vs. Unicast . . . . . . . . . . . . . . .  73
         5.3.3.4.  Overlays and Signaling  . . . . . . . . . . . . .  73
       5.3.4.  Key/Value Store . . . . . . . . . . . . . . . . . . .  74
         5.3.4.1.  Southbound  . . . . . . . . . . . . . . . . . . .  74
         5.3.4.2.  Northbound  . . . . . . . . . . . . . . . . . . .  74
       5.3.5.  Interactions with BFD . . . . . . . . . . . . . . . .  74
       5.3.6.  Fabric Bandwidth Balancing  . . . . . . . . . . . . .  75
         5.3.6.1.  Northbound Direction  . . . . . . . . . . . . . .  75
         5.3.6.2.  Southbound Direction  . . . . . . . . . . . . . .  77
       5.3.7.  Label Binding . . . . . . . . . . . . . . . . . . . .  78
       5.3.8.  Segment Routing Support with RIFT . . . . . . . . . .  78
         5.3.8.1.  Global Segment Identifiers Assignment . . . . . .  78
         5.3.8.2.  Distribution of Topology Information  . . . . . .  78
       5.3.9.  Leaf to Leaf Procedures . . . . . . . . . . . . . . .  79
       5.3.10. Address Family and Multi Topology Considerations  . .  79
       5.3.11. Reachability of Internal Nodes in the Fabric  . . . .  79
       5.3.12. One-Hop Healing of Levels with East-West Links  . . .  80
     5.4.  Security  . . . . . . . . . . . . . . . . . . . . . . . .  80
       5.4.1.  Security Model  . . . . . . . . . . . . . . . . . . .  80
       5.4.2.  Security Mechanisms . . . . . . . . . . . . . . . . .  82
       5.4.3.  Security Envelope . . . . . . . . . . . . . . . . . .  82
       5.4.4.  Nonces  . . . . . . . . . . . . . . . . . . . . . . .  85
       5.4.5.  Lifetime  . . . . . . . . . . . . . . . . . . . . . .  86
       5.4.6.  Key Management  . . . . . . . . . . . . . . . . . . .  86
       5.4.7.  Security Association Changes  . . . . . . . . . . . .  86
   6.  Examples  . . . . . . . . . . . . . . . . . . . . . . . . . .  86
     6.1.  Normal Operation  . . . . . . . . . . . . . . . . . . . .  86
     6.2.  Leaf Link Failure . . . . . . . . . . . . . . . . . . . .  88
     6.3.  Partitioned Fabric  . . . . . . . . . . . . . . . . . . .  89
     6.4.  Northbound Partitioned Router and Optional East-West
           Links . . . . . . . . . . . . . . . . . . . . . . . . . .  90
     6.5.  Multi-Plane Fabric and Negative Disaggregation  . . . . .  91
   7.  Implementation and Operation: Further Details . . . . . . . .  91
     7.1.  Considerations for Leaf-Only Implementation . . . . . . .  91
     7.2.  Considerations for Spine Implementation . . . . . . . . .  92
     7.3.  Adaptations to Other Proposed Data Center Topologies  . .  92
     7.4.  Originating Non-Default Route Southbound  . . . . . . . .  93
   8.  Security Considerations . . . . . . . . . . . . . . . . . . .  94
     8.1.  General . . . . . . . . . . . . . . . . . . . . . . . . .  94
     8.2.  ZTP . . . . . . . . . . . . . . . . . . . . . . . . . . .  94
     8.3.  Lifetime  . . . . . . . . . . . . . . . . . . . . . . . .  94
     8.4.  Packet Number . . . . . . . . . . . . . . . . . . . . . .  94
     8.5.  Outer Fingerprint Attacks . . . . . . . . . . . . . . . .  95
     8.6.  Inner Fingerprint DoS Attacks . . . . . . . . . . . . . .  95
   9.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  95
   10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . .  95
   11. References  . . . . . . . . . . . . . . . . . . . . . . . . .  96
     11.1.  Normative References . . . . . . . . . . . . . . . . . .  96
     11.2.  Informative References . . . . . . . . . . . . . . . . .  98
   Appendix A.  Sequence Number Binary Arithmetic  . . . . . . . . . 100
   Appendix B.  Information Elements Schema  . . . . . . . . . . . . 101
     B.1.  common.thrift . . . . . . . . . . . . . . . . . . . . . . 102
     B.2.  encoding.thrift . . . . . . . . . . . . . . . . . . . . . 108
   Appendix C.  Finite State Machines and Precise Operational
                Specifications  . . . . . . . . . . . . . . . . . . 115
     C.1.  LIE FSM . . . . . . . . . . . . . . . . . . . . . . . . . 115
     C.2.  ZTP FSM . . . . . . . . . . . . . . . . . . . . . . . . . 121
     C.3.  Flooding Procedures . . . . . . . . . . . . . . . . . . . 129
       C.3.1.  FloodState Structure per Adjacency  . . . . . . . . . 130
       C.3.2.  TIDEs . . . . . . . . . . . . . . . . . . . . . . . . 131
         C.3.2.1.  TIDE Generation . . . . . . . . . . . . . . . . . 132
         C.3.2.2.  TIDE Processing . . . . . . . . . . . . . . . . . 132
       C.3.3.  TIREs . . . . . . . . . . . . . . . . . . . . . . . . 134
         C.3.3.1.  TIRE Generation . . . . . . . . . . . . . . . . . 134
         C.3.3.2.  TIRE Processing . . . . . . . . . . . . . . . . . 134
       C.3.4.  TIEs Processing on Flood State Adjacency  . . . . . . 134
       C.3.5.  TIEs Processing When LSDB Received Newer Version on
               Other Adjacencies . . . . . . . . . . . . . . . . . . 135
       C.3.6.  Sending TIEs  . . . . . . . . . . . . . . . . . . . . 136
   Appendix D.  Constants  . . . . . . . . . . . . . . . . . . . . . 136
     D.1.  Configurable Protocol Constants . . . . . . . . . . . . . 136
   Appendix E.  TODO . . . . . . . . . . . . . . . . . . . . . . . . 138
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . . 138
1.  Authors

   This work is a product of a growing list of individuals.

      Tony Przygienda, Ed | Alankar Sharma | Pascal Thubert
      Juniper Networks    | Comcast        | Cisco
      Bruno Rijsman       | Ilya Vershkov  | Dmitry Afanasiev
      Individual          | Mellanox       | Yandex
      Don Fedyk           | Alia Atlas     | John Drake
      HPE                 | Individual     | Juniper

                         Table 1: RIFT Authors
2.  Introduction

   Clos [CLOS] and Fat-Tree [FATTREE] topologies have gained prominence
   in today's networking, primarily as a result of the paradigm shift
   towards a centralized, data-center based architecture that is poised
   to deliver a majority of computation and storage services in the
   future.  Current routing protocols were originally geared towards
   networks with an irregular topology and a low degree of
   connectivity, but given they were the only available options,
   several attempts to apply those protocols to Clos have been made.
   Most successfully, BGP [RFC4271] [RFC7938] has been extended to this
   purpose, not so much due to its inherent suitability but rather
   because of the perceived capability to easily modify BGP and the
   inherent difficulties of link-state [DIJKSTRA] based protocols in
   optimizing topology exchange and converging quickly in large scale,
   densely meshed topologies.  The incumbent protocols normally
   precondition extensive configuration or provisioning during bring-up
   and re-dimensioning, which is only viable for a set of organizations
   with according networking operation skills and budgets.  For the
   majority of data center consumers a preferable protocol would be one
   that auto-configures itself and deals with failures and
   misconfigurations with a minimum of human intervention.  Such a
   solution would allow local IP fabric bandwidth to be consumed in a
   standardized component fashion, i.e. provisioned much faster and
   operated at much lower cost, much like compute or storage is
   consumed today.
   In looking at the problem through the lens of data center
   requirements, an optimal approach does not seem however to be a
   simple modification of either a link-state (distributed computation)
   or distance-vector (diffused computation) approach but rather a
   mixture of both, colloquially best described as "link-state towards
   the spine" and "distance vector towards the leafs".  In other words,
   "bottom" levels are flooding their link-state information in the
   "northern" direction while each node generates under normal
   conditions a default route and floods it in the "southern" direction.
   This type of protocol allows naturally for highly desirable
   aggregation.  Alas, such aggregation could blackhole traffic in cases
   of misconfiguration or while failures are being resolved or even
   cause partial network partitioning and this has to be addressed.  The
   approach RIFT takes is described in Section 5.2.5 and is basically
   based on automatic, sufficient disaggregation of prefixes.
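   The resulting division of state can be illustrated with a small,
   purely hypothetical sketch (the class and function names below are
   illustrative only and are not part of the RIFT schema): a leaf holds
   essentially nothing but an ECMP default route pointing north, while
   a top-of-fabric node holds specific routes towards everything
   flooded from the south.

```python
# Illustrative sketch, NOT part of the RIFT specification: the
# asymmetry between what a top-of-fabric node and a leaf hold.
from dataclasses import dataclass, field

@dataclass
class RibEntry:
    prefix: str
    next_hops: list  # ECMP set of next hops

@dataclass
class Node:
    name: str
    level: int
    rib: dict = field(default_factory=dict)

def install_default_north(leaf: Node, northbound_neighbors: list) -> None:
    """A leaf holds only a default route, load balanced over all
    northbound adjacencies ("distance vector towards the leafs")."""
    leaf.rib["0.0.0.0/0"] = RibEntry("0.0.0.0/0", list(northbound_neighbors))

def install_specifics_south(tof: Node, southern_prefixes: dict) -> None:
    """A top-of-fabric node holds specific routes to every prefix
    flooded northbound ("link-state towards the spine")."""
    for prefix, next_hops in southern_prefixes.items():
        tof.rib[prefix] = RibEntry(prefix, list(next_hops))

leaf = Node("leaf-A", level=0)
install_default_north(leaf, ["spine-C", "spine-D"])

tof = Node("ToF-E", level=2)
install_specifics_south(tof, {"A/32": ["C", "D"], "B/32": ["C", "D"]})

assert list(leaf.rib) == ["0.0.0.0/0"]   # leaf topology is "empty"
assert set(tof.rib) == {"A/32", "B/32"}  # ToF knows everything below
```

   Disaggregation then amounts to a node conditionally injecting more
   specific prefixes southbound when the default alone would blackhole
   traffic.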
   For the visually oriented reader, Figure 1 presents a first level
   simplified view of the resulting information and routes on a RIFT
   fabric.  The top of the fabric is holding in its link-state database
   the nodes below it and the routes to them.  In the second row of the
   database table we indicate that partial information of other nodes in
   the same level is available as well.  The details of how this is
   achieved will be postponed for the moment.  When we look at the
   "bottom" of the fabric, the leafs, we see that the topology is
   basically empty and they only hold a load balanced default route to
   the next level.

   The balance of this document details the resulting protocol and fills
   in the missing details.
   .            [A,B,C,D]
   .            [E]
   .            +-----+   +-----+
   .            |  E  |   |  F  |  A/32 @ [C,D]
skipping to change at page 8, line 45
   Southbound Link: A link to a node one level down or in other words,
      one level further south.

   East-West Link: A link between two nodes at the same level.  East-
      West links are normally not part of Clos or "fat-tree" topologies.

   Leaf shortcuts (L2L): East-West links at leaf level will need to be
      differentiated from East-West links at other levels.

   Southbound representation: Subset of topology information sent
      towards a lower level.

   South Reflection: Often abbreviated just as "reflection" it defines
      a mechanism where South Node TIEs are "reflected" back up north to
      allow nodes in same level without E-W links to "see" each other.

   TIE: This is an acronym for a "Topology Information Element".  TIEs
      are exchanged between RIFT nodes to describe parts of a network
      such as links and address prefixes.  A TIE can be thought of as
      largely equivalent to ISIS LSPs or OSPF LSAs.  We will talk about
      N-TIEs when talking about TIEs in the northbound representation
skipping to change at page 9, line 38
   TIRE: Topology Information Request Element, equivalent to PSNP in
      ISIS.  It can both confirm received and request missing TIEs.

   De-aggregation/Disaggregation: Process in which a node decides to
      advertise certain prefixes it received in N-TIEs to prevent black-
      holing and suboptimal routing upon link failures.
   LIE: This is an acronym for a "Link Information Element", largely
      equivalent to HELLOs in IGPs and exchanged over all the links
      between systems running RIFT to form adjacencies.

   Flood Repeater (FR): A node can designate one or more northbound
      neighbor nodes to be flood repeaters.  The flood repeaters are
      responsible for flooding northbound TIEs further north.  They are
      similar to MPRs in OLSR.  The document sometimes calls them flood
      leaders as well.
   Bandwidth Adjusted Distance (BAD): Each RIFT node calculates the
      amount of northbound bandwidth available towards a node compared
      to other nodes at the same level and modifies the default route
      distance accordingly to allow for the lower level to adjust their
      load balancing towards spines.

   Overloaded: Applies to a node advertising the `overload` attribute
      as set.  The semantics closely follow the meaning of the same
skipping to change at page 10, line 27
      adjacencies can be formed to a neighbor via parallel interfaces
      but such adjacencies are NOT sharing a neighbor structure.  Saying
      "neighbor" is thus equivalent to saying "a three way adjacency".

   Cost: The term signifies the weighted distance between two
      neighbors.

   Distance: Sum of costs (bound by infinite distance) between two
      nodes.

   Metric: Without going deeper into the formal differentiation, a
      metric is equivalent to distance.
3.2.  Topology
    .          +--------+          +--------+          ^ N
    .          |ToF   21|          |ToF   22|          |
    .Level 2   ++-+--+-++          ++-+--+-++        <-*-> E/W
    .           | |  | |            | |  | |           |
    .     P111/2| |P121             | |  | |         S v
    .           ^ ^  ^ ^            | |  | |
    .           | |  | |            | |  | |
    .   +--------------+ | +-----------+ | | | +---------------+
    .   |              |  |           |  | | | |               |
    . South +-----------------------------+ | |                ^
   skipping to change at page 15, line 12
           complexity while possibly not being reactive enough to
           account for short-lived flows.
   REQ17:  The control plane should be able to unambiguously determine
           the current point of attachment (which port on which leaf
           node) of a prefix, even in a context of fast mobility, e.g.,
           when the prefix is a host address on a wireless node that 1)
           may associate to any of multiple access points (APs) that
           are attached to different ports on a same leaf node or to
           different leaf nodes, and 2) may move and reassociate
           several times to a different access point within a sub-
           second period.
   REQ18:  The protocol should provide security mechanisms that allow
           restricting nodes, especially leafs without proper
           credentials, from forming three-way adjacencies.
   The following list represents possible requirements and requirements
   under discussion:
   PEND1:  Supporting anything but point-to-point links is a non-
           requirement.  Questions remain: for connecting to the
   skipping to change at page 15, line 37
   PEND2:  What is the maximum number of leaf prefixes we need to
           carry?  500'000 seems plenty even if we deploy RIFT down to
           servers as leafs.
   Finally, the following are the non-requirements:
   NONREQ1:  Broadcast media support is unnecessary.  However,
             miscabling leading to multiple nodes on a broadcast
             segment must be operationally easily recognizable and
             detectable while not taxing the protocol excessively.
   NONREQ2:  Purging link state elements is unnecessary given its
             fragility and complexity and today's large memory size on
             even modest switches and routers.
   NONREQ3:  Special support for layer 3 multi-hop adjacencies is not
             part of the protocol specification.  Such support can be
             easily provided by using tunneling technologies the same
             way IGPs today are solving the problem.
   skipping to change at page 16, line 34
   build an update per adjacency.  We omit describing the East-West
   direction for the moment.
   Those information flow constraints create not only an anisotropic
   protocol (i.e. the information is not distributed "evenly" or
   "clumped" but summarized along the N-S gradient) but also a "smooth"
   information propagation where nodes do not receive the same
   information from multiple fronts which would force them to perform a
   diffused computation to tie-break the same reachability information
   arriving on arbitrary links and ultimately force hop-by-hop
   forwarding on shortest-paths only.  The application of those
   principles moreover leads to RIFT having the highly desirable
   properties of being loop-free and guaranteeing valley-free
   forwarding behavior.
   To account for the "northern" and the "southern" information split
   the link state database is partitioned into "north representation"
   and "south representation" TIEs, whereas in simplest terms the N-TIEs
   contain a link state topology description of lower levels and the
   S-TIEs carry simply default routes.  This oversimplified view will be
   refined gradually in following sections while introducing protocol
   procedures aimed to fulfill the described requirements.
5.1.2.  Generalized Topology View
   skipping to change at page 17, line 29
   chosen differently without loss of generality when port speeds differ
   or the fabric is oversubscribed but K=R/2 allows for a more readable
   representation whereby there are as many ports facing north as south
   on any intermediate node.  We hence represent a node in a schematic
   fashion with ports "sticking out" to its north and south rather than
   by the usual real-world front faceplate designs of the day.
   Figure 4 provides a view of a leaf node as seen from the north, i.e.
   showing ports that connect northbound and for lack of a better
   symbol, we have chosen to use the "HH" symbol as ASCII visualisation
   of a RJ45 jack.  In that example, K_LEAF is chosen to be 6 ports.
   Observe that the number of PoDs is not related to Radix unless the
   ToF Nodes are constrained to be the same as the PoD nodes in a
   particular deployment.
   Top view

     +----+
     |    |
     | HH |        e.g., Radix = 12, K_LEAF = 6
     |    |
     | HH |
     |    |        -------------------------
     | HH ------- Physical Port (Ethernet) ----+
     |    |        -------------------------   |
   skipping to change at page 20, line 5
           Figure 5: Southern View of a PoD, K_TOP=8

   The K_TOP Leaf Nodes are fully interconnected with the K_LEAF PoD-top
   nodes, providing a connectivity that can be represented as a crossbar
   as seen from the north and illustrated in Figure 6.  The result is
   that, in the absence of a breakage, a packet entering the PoD from
   North on any port can be routed to any port on the south of the PoD
   and vice versa.
                                      E<-*->W
    +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
    |    | |    | |    | |    | |    | |    | |    | |    |
    +----------------------------------------------------------------+
    |  HH    HH    HH    HH    HH    HH    HH    HH                  |
    +----------------------------------------------------------------+
    +----------------------------------------------------------------+
    |  HH    HH    HH    HH    HH    HH    HH    HH                  |
    +----------------------------------------------------------------+
    +----------------------------------------------------------------+
   skipping to change at page 21, line 5
      ^                                                            |
      |                                                            |
      |    ----------       ---------------------                  |
      +----- Leaf Node       PoD top Node (Spine) -----------------+
           ----------       ---------------------

          Figure 6: Northern View of a PoD's Spines, K_TOP=8
   Side views of this PoD are illustrated in Figure 7 and Figure 8.
                         Connecting to Spine

      ||  ||  ||  ||  ||  ||  ||  ||
    +----------------------------------------------------------------+ N
    |                 PoD top Node seen sideways                      | ^
    +----------------------------------------------------------------+ |
      ||  ||  ||  ||  ||  ||  ||  ||                                    *
    +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+             |
    |    | |    | |    | |    | |    | |    | |    | |    |             v
    +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+             S
      ||  ||  ||  ||  ||  ||  ||  ||

                       Connecting to Client nodes

           Figure 7: Side View of a PoD, K_TOP=8, K_LEAF=6
                  Connecting to Spine

      ||  ||  ||  ||  ||  ||
    +----+ +----+ +----+ +----+ +----+ +----+                   N
    |    | |    | |    | |    | |    | |    |  PoD top Nodes    ^
    +----+ +----+ +----+ +----+ +----+ +----+                   |
      ||  ||  ||  ||  ||  ||                                    *
    +------------------------------------------------+          |
    |                Leaf seen sideways              |          v
    +------------------------------------------------+          S
      ||  ||  ||  ||  ||  ||

                 Connecting to Client nodes

    Figure 8: Other side View of a PoD, K_TOP=8, K_LEAF=6, 90o turn in
                               E-W Plane
   Note that a resulting PoD can be abstracted as a bigger node with a
   number K of K_POD= K_TOP * K_LEAF, and the design can recurse.
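   The recursion can be restated as a one-line relation; the sketch
   below merely evaluates K_POD = K_TOP * K_LEAF with the example
   values used in the figures.

```python
# Recursive sizing sketch: a PoD built from K_TOP spine-facing nodes
# and K_LEAF client-facing ports per leaf abstracts to a single node
# of radix K_POD = K_TOP * K_LEAF; the same construction can recurse.

def pod_radix(k_top: int, k_leaf: int) -> int:
    return k_top * k_leaf

# With the figures' values K_TOP=8, K_LEAF=6:
assert pod_radix(8, 6) == 48
```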
   It is critical at this juncture that the concept and the picture of
   those "crossed crossbars" is clear before progressing further,
   otherwise the following considerations will be difficult to
   comprehend.
   skipping to change at page 26, line 24
   ends often being partitioned in planes which makes it more likely
   that a given leaf is reachable from only a subset of the ToF nodes.
   This makes some further considerations necessary.
   We define a "Fallen Leaf" as a leaf that can be reached by only a
   subset of Top-of-Fabric nodes but cannot be reached by all due to
   missing connectivity.  If R is the redundancy factor, then it takes
   at least R breakages to reach a "Fallen Leaf" situation.
In a maximally partitioned fabric, the redundancy factor R is 1, so
any breakage may cause one or more fallen leaves, but not all cases
require disaggregation. The following cases do not require
particular action:
If the breakage is a southern link from a leaf Node going down,
then connectivity to any node attached to the link is lost. There
is no need to disaggregate since the connectivity is lost for all
spine nodes in a same fashion.
If the breakage is a leaf Node going down, then connectivity
through that leaf is lost for all nodes. There is no need to
disaggregate since the connectivity is lost for all spine nodes in
a same fashion.
If the breakage is a ToF Node going down, then northern traffic is
routed via alternate ToF nodes in the same plane and there is no
need to disaggregate routes.
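   The definition above suggests a simple way to think about fallen-leaf
   detection.  The sketch below is purely illustrative and assumes that
   per-leaf reachability sets are already known; deriving those sets
   from the flooded topology database is the protocol's actual job, and
   nothing here is part of the specification.

```python
# Hypothetical illustration of spotting "Fallen Leaves": leaves that
# some, but not all, ToF nodes can reach.  Inputs are assumed to be
# precomputed reachability sets.

def fallen_leaves(reachable_by_tof: dict, all_tofs: set) -> set:
    """reachable_by_tof maps each leaf to the set of ToF nodes that
    can reach it; a leaf reached by a non-empty proper subset of the
    ToF nodes is a fallen leaf."""
    return {
        leaf for leaf, tofs in reachable_by_tof.items()
        if tofs and tofs != all_tofs
    }

all_tofs = {"ToF-A", "ToF-B"}
reach = {
    "Leaf111": {"ToF-A", "ToF-B"},  # fully reachable
    "Leaf122": {"ToF-A"},           # fallen: plane B cannot reach it
}
assert fallen_leaves(reach, all_tofs) == {"Leaf122"}
```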
   In a general manner, the mechanism of non-transitive positive
   disaggregation is sufficient when the disaggregating ToF nodes
   collectively connect to all the ToP nodes in the broken plane.  This
   happens in the following case:
      If the breakage is the last northern link from a ToP node to a ToF
      node going down, then the fallen leaf problem affects only the ToF
      node, and the connectivity to all the nodes in the PoD is lost
      from that ToF node.  This can be observed by other ToF nodes
      within the plane where the ToP node is located and positively
   skipping to change at page 27, line 30
      the breakage can be observed by the ToF nodes in the plane where
      the breakage happened, and then again, the ToF nodes in the plane
      need to be aware of all the affected prefixes for the negative
      disaggregation to be fully effective.  The problem can also be
      observed by the ToF nodes in the other planes through the flooding
      of N-TIEs from the affected Leaf nodes, if there are only 3 levels
      and the ToP nodes are directly connected to the Leaf nodes, and
      then again it can only be effective if it is propagated
      transitively to the Leaf, and is useless above that level.
   For the sake of easy comprehension let us roll the abstractions back
   to a simple example and observe that in Figure 3 the loss of link
   Spine 122 to Leaf 122 will make Leaf 122 a fallen leaf for Top-of-
   Fabric plane B.  Worse, if the cabling was never present in the
   first place, plane B will not even be able to know that such a
   fallen leaf exists.  Hence partitioning without further treatment
   results in two grave problems:
   o  Leaf111 trying to route to Leaf122 MUST choose Spine 111 in plane
      A as its next hop since plane B will inevitably blackhole the
      packet when forwarding using default routes or do excessive bow
      tie'ing, i.e. this information must be in its routing table.
   o  any kind of "flooding" or distance vector trying to deal with the
      problem by distributing host routes will be able to converge only
      using paths through leafs, i.e. the flooding of information on
      Leaf122 will go up to Top-of-Fabric A and then "loopback" over
   skipping to change at page 28, line 14
5.1.4.  Discovering Fallen Leaves
   As we illustrate later and without further proof here, to deal with
   fallen leafs in multi-plane designs RIFT requires all the ToF nodes
   to share the same topology database.  This happens naturally in
   single plane design but needs additional considerations in multi-
   plane fabrics.  To satisfy this RIFT in multi-plane designs relies at
   the ToF Level on ring interconnection of switches in multiple planes.
   Other solutions are possible but they either need more cabling or end
   up having much longer flooding paths and/or single points of failure.
   In more detail, by reserving two ports on each Top-of-Fabric node it
   is possible to connect them together in an interplane bi-directional
   ring as illustrated in Figure 13 (where we show a bi-directional ring
   connecting switches across planes).  The rings will exchange full
   topology information between planes and with that consequently allow,
   by means of the transitive, negative disaggregation described in
   Section 5.2.5.2, to efficiently fix any possible fallen leaf
   scenario.  Somewhat as a side-effect, the exchange of information
   fulfills the requirement to present full view of the fabric topology
   at the Top-
   skipping to change at page 30, line 44
   packet typically occurs at the leaf level and the disaggregation must
   be transitive and reach all the leaves.  In that case, the negative
   disaggregation is necessary.  The details of the RIFT approach to
   dealing with fallen leafs in an optimal way are specified in
   Section 5.2.5.2.
5.2.  Specification

5.2.1.  Transport
   All packet formats are defined in Thrift models in Appendix B.
   The serialized model is carried in an envelope within a UDP frame
   that provides security and allows validation/modification of several
   important fields without de-serialization for performance and
   security reasons.
The ultimate transport envelope, especially the placement of nonces
is under active discussion.
+--------+----------+-------------+----------+-------------+---------+----------------+
| UDP | TIE | Fingerprint | Key ID | Security | Model | Serialized |
| Header | Lifetime | Type | | Fingerprint | Version | RIFT Model ... |
| | | (e.g. SHA) | | | | Object |
+--------+----------+-------------+----------+-------------+---------+----------------+
Figure 14: Security Envelope
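   The envelope layout of Figure 14 can be illustrated as a byte-packing
   sketch.  All field widths below are assumptions made purely for
   illustration -- as noted above, the final envelope (including nonce
   placement) is still under active discussion -- but the sketch shows
   why the outer fields can be validated or modified without
   de-serializing the Thrift model.

```python
import struct

# Hypothetical field widths (assumptions, not normative): 4-byte TIE
# lifetime, 1-byte fingerprint type, 2-byte key ID, fingerprint length
# implied by type, 2-byte model version, then the serialized model.
FP_LEN = {1: 20}  # assumed: type 1 = SHA-1 fingerprint, 20 bytes

def pack_envelope(lifetime, fp_type, key_id, fingerprint, version, model):
    head = struct.pack("!IBH", lifetime, fp_type, key_id)
    return head + fingerprint + struct.pack("!H", version) + model

def unpack_envelope(buf):
    """Parse the outer fields; the model payload stays opaque bytes,
    i.e. no Thrift de-serialization is needed to inspect them."""
    lifetime, fp_type, key_id = struct.unpack_from("!IBH", buf)
    off = struct.calcsize("!IBH")
    fp = buf[off:off + FP_LEN[fp_type]]
    off += FP_LEN[fp_type]
    (version,) = struct.unpack_from("!H", buf, off)
    return lifetime, fp_type, key_id, fp, version, buf[off + 2:]
```

   A node can thus, for example, age the TIE lifetime field in place
   while leaving the serialized object untouched.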
5.2.2.  Link (Neighbor) Discovery (LIE Exchange)
   LIE exchange happens over well-known administratively locally scoped
   and configured or otherwise well-known IPv4 multicast address
   [RFC2365] or link-local multicast scope [RFC4291] for IPv6 [RFC8200]
   using a configured or otherwise a well-known destination UDP port
   defined in Appendix D.1.  LIEs SHOULD be sent with a TTL of 1 to
   prevent RIFT information reaching beyond a single L3 next-hop in the
   topology.  LIEs SHOULD be sent with network control precedence.
   The originating port of the LIE has no further significance other
   than identifying the origination point.  LIEs are exchanged over all
   links running RIFT.  An implementation MAY listen and send LIEs on
   IPv4 and/or IPv6 multicast addresses.  LIEs on the same link are
   considered part of the same negotiation independent of the address
   family they arrive on.  Observe further that the LIE source address
   may not identify the peer uniquely in unnumbered or link-local
   address cases so the response transmission MUST occur over the same
   interface the
   skipping to change at page 32, line 39
      iii)  both nodes are at level 0 AND both indicate support for
            Section 5.3.9 OR

      iv)   neither node is at level 0 and the neighboring node is at
            most one level away

   ].
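   The level-compatibility clauses iii) and iv) above can be sketched as
   follows.  Clauses i) and ii) are elided in this excerpt, so the
   function below deliberately covers only the two visible clauses and
   is not the complete adjacency acceptance rule; names and the leaf
   level constant are illustrative.

```python
# Partial sketch of the adjacency level rules: only clauses iii) and
# iv) from the text are modeled; the elided earlier clauses would
# admit further combinations (e.g. involving undefined levels).

LEAF_LEVEL = 0

def levels_compatible(my_level, nbr_level, my_leaf2leaf, nbr_leaf2leaf):
    # iii) both nodes at level 0 AND both support leaf-2-leaf procedures
    if my_level == LEAF_LEVEL and nbr_level == LEAF_LEVEL:
        return my_leaf2leaf and nbr_leaf2leaf
    # iv) neither node at level 0 and at most one level apart
    if my_level != LEAF_LEVEL and nbr_level != LEAF_LEVEL:
        return abs(my_level - nbr_level) <= 1
    # combinations handled by the elided clauses are rejected here
    return False

assert levels_compatible(0, 0, True, True)
assert levels_compatible(1, 2, False, False)
assert not levels_compatible(1, 3, False, False)
```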
   The rule in Paragraph 3 MAY be optionally disregarded by a node if
   PoD detection is undesirable or has to be ignored.
   A node configured with "undefined" PoD membership MUST, after
   building first northbound three way adjacencies to a node being in a
   defined PoD, advertise that PoD as part of its LIEs.  In case that
   adjacency is lost, from all available northbound three way
   adjacencies the node with the highest System ID and defined PoD is
   chosen.  That way the northmost defined PoD value (normally the top
   spines in a PoD) can diffuse southbound towards the leafs "forcing"
   the PoD value on any node with "undefined" PoD.
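   The selection rule in the previous paragraph -- among northbound
   three way adjacencies with a defined PoD, take the one from the
   neighbor with the highest System ID -- can be sketched as below;
   the data representation is an assumption for illustration.

```python
# Sketch of PoD value adoption by an "undefined"-PoD node.  Each
# northbound three-way adjacency is modeled as (system_id, pod),
# where pod is None when the neighbor's PoD is undefined.

UNDEFINED_POD = None

def choose_pod(northbound_adjacencies):
    defined = [(sid, pod) for sid, pod in northbound_adjacencies
               if pod is not UNDEFINED_POD]
    if not defined:
        return UNDEFINED_POD
    # highest System ID among neighbors with a defined PoD wins
    return max(defined)[1]

adjs = [(0x1, 7), (0x9, 3), (0x5, None)]
assert choose_pod(adjs) == 3
```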
   LIEs arriving with a TTL larger than 1 MUST be ignored.
   A node SHOULD NOT send out LIEs without a defined level in the
   header but in certain scenarios it may be beneficial to do so for
   trouble-shooting purposes.
   LIE exchange uses a three way handshake mechanism which is a cleaned
   up version of [RFC5303].  Observe that for easier comprehension the
   terminology of one/two and three-way states does NOT align with OSPF
   or ISIS FSMs albeit they use roughly the same mechanisms.
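   A much simplified sketch of the state progression (ignoring timers,
   field validation, and the full FSM specified later in the document)
   might look like the following; the LIE representation is a
   hypothetical stand-in for the Thrift model.

```python
# Simplified three-way handshake sketch: one-way after our own LIE is
# sent, two-way once any LIE is received from the neighbor, three-way
# once the neighbor's LIE reflects our own System ID back.  State
# names follow this document, not OSPF/ISIS.

ONE_WAY, TWO_WAY, THREE_WAY = "OneWay", "TwoWay", "ThreeWay"

def advance(my_system_id, received_lie):
    """received_lie: None, or a dict carrying the neighbor's
    'system_id' and the 'reflected_system_id' it heard from us."""
    if received_lie is None:
        return ONE_WAY
    if received_lie.get("reflected_system_id") == my_system_id:
        return THREE_WAY
    return TWO_WAY

assert advance(42, None) == ONE_WAY
assert advance(42, {"system_id": 7, "reflected_system_id": None}) == TWO_WAY
assert advance(42, {"system_id": 7, "reflected_system_id": 42}) == THREE_WAY
```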
5.2.3.  Topology Exchange (TIE Exchange)

5.2.3.1.  Topology Information Elements
   Topology and reachability information in RIFT is conveyed by means
   of TIEs which have a good amount of commonality with LSAs in OSPF.
   The TIE exchange mechanism uses the port indicated by each node in
   the LIE exchange and the interface on which the adjacency has been
   formed as destination.  It SHOULD use a TTL of 1 as well and set
   internetwork control precedence on such packets.
   TIEs contain sequence numbers, lifetimes and a type.  Each type has
   ample identifying number space and information is spread across
   possibly many TIEs of a certain type by means of a hash function
   that a node or deployment can individually determine.  One extreme
   design choice is a prefix per TIE which leads to more BGP-like
   behavior where small increments are only advertised on route changes
   vs. deploying with dense prefix packing into few TIEs leading to a
   more traditional IGP trade-off with fewer TIEs.  An implementation
   may even rehash the prefix to TIE mapping at any time at the cost of
   a significant amount of re-advertisements of TIEs.
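   The prefix-to-TIE spreading can be sketched with any deterministic
   hash; CRC32 below is an arbitrary illustrative choice, not mandated
   by the protocol, and num_ties stands in for the deployment-chosen
   packing density.

```python
import zlib

# Sketch of spreading prefixes over the TIE number space of one TIE
# type.  num_ties close to the prefix count approximates the
# per-prefix-TIE extreme; num_ties=1 is the dense-packing extreme.
# Changing num_ties (a "rehash") remaps prefixes and hence forces
# re-advertisement of the affected TIEs, as the text notes.

def tie_number(prefix: str, num_ties: int) -> int:
    return zlib.crc32(prefix.encode()) % num_ties

ties = {}
for p in ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]:
    ties.setdefault(tie_number(p, 2), []).append(p)
# All prefixes land in at most 2 TIEs of this type.
assert set(ties) <= {0, 1}
```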
   More information about the TIE structure can be found in the schema
   in Appendix B.
5.2.3.2.  South- and Northbound Representation
   A central concept of RIFT is that each node represents itself
   differently depending on the direction in which it is advertising
   information.  More precisely, a spine node represents two different
   databases over its adjacencies depending on whether it advertises
   TIEs to the north or to the south/sideways.  We call those differing
   TIE databases either south- or northbound (S-TIEs and N-TIEs)
   depending on the direction of distribution.
   The N-TIEs hold all of the node's adjacencies and local prefixes
   while the S-TIEs hold only the node's adjacencies, the default
   prefix with necessary disaggregated prefixes, and local prefixes.
   We will explain this in detail further in Section 5.2.5.
The TIE types are mostly symmetric in both directions and Table 2
provides a quick reference to main TIE types including direction and
their function.
+-------------------+-----------------------------------------------+
| TIE-Type          | Content                                       |
+-------------------+-----------------------------------------------+
| Node N-TIE        | node properties and adjacencies               |
+-------------------+-----------------------------------------------+
| Node S-TIE        | same content as node N-TIE                    |
+-------------------+-----------------------------------------------+
| Prefix N-TIE      | contains nodes' directly reachable prefixes   |
+-------------------+-----------------------------------------------+
| Prefix S-TIE      | contains originated defaults and directly     |
|                   | reachable prefixes                            |
+-------------------+-----------------------------------------------+
| Positive          | contains disaggregated prefixes               |
| Disaggregation    |                                               |
| S-TIE             |                                               |
+-------------------+-----------------------------------------------+
| Negative          | contains special, negatively disaggregated    |
| Disaggregation    | prefixes to support multi-plane designs       |
| S-TIE             |                                               |
+-------------------+-----------------------------------------------+
| External Prefix   | contains external prefixes                    |
| N-TIE             |                                               |
+-------------------+-----------------------------------------------+
| Key-Value N-TIE   | contains nodes' northbound KVs                |
+-------------------+-----------------------------------------------+
| Key-Value S-TIE   | contains nodes' southbound KVs                |
+-------------------+-----------------------------------------------+

                          Table 2: TIE Types
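For orientation, the quick reference in Table 2 can be mirrored as a
small lookup structure.  A minimal sketch in Python; the names are
illustrative and are not the schema identifiers from Appendix B:

```python
# Quick-reference mapping of Table 2: (TIE type, direction) -> content.
# Directions: "N" = northbound, "S" = southbound.
TIE_TYPES = {
    ("Node", "N"): "node properties and adjacencies",
    ("Node", "S"): "same content as node N-TIE",
    ("Prefix", "N"): "nodes' directly reachable prefixes",
    ("Prefix", "S"): "originated defaults and directly reachable prefixes",
    ("PositiveDisaggregation", "S"): "disaggregated prefixes",
    ("NegativeDisaggregation", "S"): "negatively disaggregated prefixes "
                                     "to support multi-plane designs",
    ("ExternalPrefix", "N"): "external prefixes",
    ("KeyValue", "N"): "nodes' northbound KVs",
    ("KeyValue", "S"): "nodes' southbound KVs",
}
```

Note that the disaggregation TIEs exist only southbound and the
external prefix TIE only northbound, which is the asymmetry the text
above refers to.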
As an example illustrating a database holding both representations,
consider the topology in Figure 2 with the optional link between
spine 111 and spine 112 (so that the flooding on an East-West link
can be shown).  This example assumes unnumbered interfaces.  First,
here are the TIEs generated by some nodes.  For simplicity, the key-
value elements which may be included in their S-TIEs or N-TIEs are
not shown.
Leaf112 N-TIEs:
Node N-TIE:
    NodeElement(level=0,
        neighbors((Spine 111, level 1, cost 1, links(...)),
                  (Spine 112, level 1, cost 1, links(...))))
Prefix N-TIE:
    NorthPrefixesElement(prefixes(Leaf112.loopback, Prefix112,
                                  Prefix_MH))

     Figure 14: example TIEs generated in a 2 level spine-and-leaf
                               topology
5.2.3.3.  Flooding
The mechanism used to distribute TIEs is the well-known (albeit
modified in several respects to address fat tree requirements)
flooding mechanism used by today's link-state protocols.  Although
flooding is initially more demanding to implement, it avoids many
problems with the update style used in diffused computation such as
path vector protocols.  Since flooding tends to present an unscalable
burden in large, densely meshed topologies (fat trees being
unfortunately such a topology), we provide a close-to-optimal global
flood reduction and load balancing optimization as a solution in
Section 5.2.3.9.
As described before, TIEs themselves are transported over UDP with
the ports indicated in the LIE exchanges and using the destination
address on which the LIE adjacency has been formed.  For unnumbered
IPv4 interfaces the same considerations apply as in the equivalent
OSPF case.
On reception of a TIE with an undefined level value in the packet
header the node SHOULD issue a warning and indiscriminately discard
the packet.
Precise finite state machines and procedures can be found in
Appendix C.3.
5.2.3.4.  TIE Flooding Scopes
In a somewhat analogous fashion to link-local, area and domain
flooding scopes, RIFT defines several complex "flooding scopes"
depending on the direction and type of TIE propagated.

Every N-TIE is flooded northbound, providing a node at a given level
with the complete topology of the Clos or Fat Tree network underneath
it, including all specific prefixes.  This means that a packet
packet to a node at a higher level.

A node's Node S-TIEs, consisting of all node's adjacencies and prefix
S-TIEs limited to those related to the default IP prefix and
disaggregated prefixes, are flooded southbound in order to allow the
nodes one level down to see connectivity of the higher level as well
as reachability to the rest of the fabric.  In order to allow an E-W
disconnected node in a given level to receive the S-TIEs of other
nodes at its level, every *NODE* S-TIE is "reflected" northbound to
the level from which it was received.  It should be noted that East-
West links are included in South TIE flooding (except at ToF level);
those TIEs need to be flooded to satisfy the algorithms in
Section 5.2.4.  In that way nodes at the same level can learn about
each other without a lower level, e.g. in case of the leaf level.
The precise flooding scopes are given in Table 3.  Those rules govern
as well what SHOULD be included in TIDEs on the adjacency.  Again,
East-West flooding scopes are identical to South flooding scopes
except in the case of ToF East-West links (rings).

Node S-TIE "south reflection" makes it possible to support positive
disaggregation on failures described in Section 5.2.5 and flooding
reduction in Section 5.2.3.9.
+-----------+---------------------+---------------+-----------------+
| Type /    | South               | North         | East-West       |
| Direction |                     |               |                 |
+-----------+---------------------+---------------+-----------------+
| node      | flood if level of   | flood if      | flood only if   |
| S-TIE     | originator is equal | level of      | this node is    |
|           | to this node        | originator is | not ToF         |
|           |                     | higher than   |                 |
|           |                     | this node     |                 |
| ...       | ...                 | ...           | ...             |
+-----------+---------------------+---------------+-----------------+
                       Table 3: Flooding Scopes
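As an illustration, the node S-TIE row of Table 3 can be rendered as
a predicate.  A sketch in Python; the function and parameter names
are illustrative, not taken from the schema:

```python
def flood_node_stie(direction, my_level, originator_level, i_am_tof):
    """Decide whether to flood a received node S-TIE on an adjacency,
    per the node S-TIE row of Table 3 (names are illustrative)."""
    if direction == "south":
        # flood if level of originator is equal to this node's level
        return originator_level == my_level
    if direction == "north":
        # flood if level of originator is higher than this node's level
        return originator_level > my_level
    if direction == "east-west":
        # flood only if this node is not Top-of-Fabric
        return not i_am_tof
    return False
```

The same shape of predicate, with different per-row conditions, would
apply to the remaining rows of Table 3.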
If the TIDE includes additional TIE headers beside the ones
specified, the receiving neighbor must apply the according filter to
the received TIDE strictly and MUST NOT request the extra TIE headers
that were not allowed by the flooding scope rules in its direction.
As an example to illustrate these rules, consider using the topology
in Figure 2, with the optional link between spine 111 and spine 112,
and the associated TIEs given in Figure 14.  The flooding from
particular nodes of the TIEs is given in Table 4.
+-------------+----------+------------------------------------------+
| Router      | Neighbor | TIEs                                     |
| floods to   |          |                                          |
+-------------+----------+------------------------------------------+
| Leaf111     | Spine    | Leaf111 N-TIEs, Spine 111 node S-TIE     |
|             | 112      |                                          |
| Leaf111     | Spine    | Leaf111 N-TIEs, Spine 112 node S-TIE     |
|             | 111      |                                          |
| ...         | ...      | ...                                      |
| Spine21     | Spine    | Spine21 S-TIEs                           |
|             | 121      |                                          |
| Spine21     | Spine    | Spine21 S-TIEs                           |
|             | 122      |                                          |
| ...         | ...      | ...                                      |
+-------------+----------+------------------------------------------+

         Table 4: Flooding some TIEs from example topology
5.2.3.5.  'Flood Only Node TIEs' Bit
RIFT includes an optional ECN mechanism to prevent "flooding inrush"
on restart or bring-up with many southbound neighbors.  A node MAY
set the according bit on its LIEs to indicate to the neighbor that it
should temporarily flood node TIEs only to it.  It should only set it
in the southbound direction.  The receiving node SHOULD accommodate
the request to lessen the flooding load on the affected node if south
of the sender and SHOULD ignore the bit if northbound.

Obviously this mechanism is most useful in the southbound direction.
The distribution of node TIEs guarantees correct behavior of
algorithms like disaggregation or default route origination.
Furthermore though, the use of this bit presents an inherent trade-
off between processing load and convergence speed since suppressing
flooding of northbound prefixes from neighbors will lead to
blackholes.
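A possible receiver-side treatment of this bit can be sketched as
follows; the function and argument names are illustrative, not schema
identifiers:

```python
def restrict_flooding_to_node_ties(bit_set, sender_is_northbound):
    """Receiver-side handling of the 'flood only node TIEs' LIE bit
    (names are illustrative).

    The bit is honored only when the sender is the receiver's
    northbound neighbor, i.e. the receiver is south of the sender;
    a bit received over a northbound-set adjacency in the other
    direction is ignored, per the SHOULDs above.
    Returns True if flooding toward the sender should temporarily
    be restricted to node TIEs only.
    """
    return bit_set and sender_is_northbound
```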
5.2.3.6. Initial and Periodic Database Synchronization
The initial exchange of RIFT is modeled after ISIS with TIDE being
equivalent to CSNP and TIRE playing the role of PSNP.  The content of
TIDEs and TIREs is governed by Table 3.
5.2.3.7.  Purging and Roll-Overs
RIFT does not purge information that has been distributed by the
protocol.  Purging mechanisms in other routing protocols have proven
to be complex and fragile over many years of experience.  Abundant
amounts of memory are available today even on low-end platforms.  The
information will age out and all computations will deliver correct
results if a node leaves the network due to the new information
distributed by its adjacent nodes.
Once a RIFT node issues a TIE with an ID, it MUST preserve the ID as
long as feasible (also when the protocol restarts), even if the TIE
loses all content.  The re-advertisement of an empty TIE fulfills the
purpose of purging any information advertised in previous versions.
The originator is free to not re-originate the according empty TIE
again or to originate an empty TIE with a relatively short lifetime
to prevent a large number of long-lived empty stubs polluting the
network.  Each node MUST timeout and clean up the according empty
TIEs independently.
Upon restart a node MUST, as any link-state implementation, be
prepared to receive TIEs with its own system ID and supersede them
with equivalent, newly generated, empty TIEs with a higher sequence
number.  As above, the lifetime can be relatively short since it only
needs to exceed the necessary propagation and processing delay by all
the nodes that are within the TIE's flooding scope.
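The purge-by-empty-TIE behavior above can be sketched as follows; the
dictionary fields are illustrative stand-ins for the TIE header
elements of the schema:

```python
def purge_tie(tie):
    """Sketch of RIFT-style 'purging': instead of a purge message,
    the originator floods an empty TIE with the same ID and a higher
    sequence number (field names are illustrative)."""
    return {
        "id": tie["id"],        # the TIE ID MUST be preserved
        "seq": tie["seq"] + 1,  # supersedes every older version
        "lifetime": 300,        # relatively short; must only exceed
                                # propagation/processing delay in scope
        "content": None,        # empty content overrides prior info
    }
```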
TIE sequence numbers are rolled over using the method described in
Appendix A.  The first sequence number of any spontaneously
originated TIE (i.e. not originated to override a detected older copy
in the network) MUST be a reasonably unpredictable random number in
the interval [0, 2^10-1], which will prevent otherwise identical TIE
headers from remaining "stuck" in the network with content different
from the TIE originated after reboot.
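A sketch of the initial sequence number choice (the roll-over
arithmetic itself is in Appendix A); the function name is
illustrative:

```python
import secrets

def initial_tie_seq():
    """First sequence number of a spontaneously originated TIE: a
    reasonably unpredictable random number in [0, 2^10 - 1]."""
    return secrets.randbelow(2 ** 10)
```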
5.2.3.8. Southbound Default Route Origination
Under certain conditions nodes issue a default route in their South
Prefix TIEs with costs as computed in Section 5.3.6.1.

A node X that

1.  is NOT overloaded AND

2.  has southbound or East-West adjacencies
3.  X has computed reachability to a default route during N-SPF.

The term "all other nodes at X's level" describes obviously just the
nodes at the same level in the PoD with a viable lower level
(otherwise the node S-TIEs cannot be reflected and the nodes in e.g.
PoD 1 and PoD 2 are "invisible" to each other).
A node originating a southbound default route MUST install a default
discard route if it did not compute a default route during N-SPF.
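The last rule can be sketched as a small check; the argument names
are illustrative:

```python
def default_route_actions(originates_southbound_default,
                          computed_default_in_nspf):
    """Sketch of the rule above: a node originating a southbound
    default route MUST install a default discard route if it did
    not itself compute a default route during N-SPF (names are
    illustrative)."""
    if originates_southbound_default and not computed_default_in_nspf:
        return ["install-default-discard-route"]
    return []
```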
5.2.3.9.  Northbound TIE Flooding Reduction
Section 1.4 of the Optimized Link State Routing Protocol [RFC3626]
(OLSR) introduces the concept of a "multipoint relay" (MPR) that
minimizes the overhead of flooding messages in the network by
reducing redundant retransmissions in the same region.
A similar technique is applied to RIFT to control northbound
flooding.  Important observations first:

1.  a node MUST flood self-originated N-TIEs to all the reachable
    nodes at the level above which we call the node's "parents";

2.  it is typically not necessary that all parents reflood the
    N-TIEs to achieve a complete flooding of all the reachable nodes
    two levels above which we choose to call the node's
    "grandparents";

3.  to control the volume of its flooding two hops North and yet
    keep it robust enough, it is advantageous for a node to select a
    subset of its parents as "Flood Repeaters" (FRs), which combined
    together deliver two or more copies of its flooding to all of
    its grandparents;
computation can be kept relatively simple and completely distributed
without any need for synchronization amongst nodes.  In a "PoD"
structure, where the Level L+2 is partitioned in silos of equivalent
grandparents that are only reachable from respective parents, this
means treating each silo as a fully connected Clos Network and
solving the problem within the silo.
In terms of signaling, a node has enough information to select its
set of FRs; this information is derived from the node's parents' Node
S-TIEs, which indicate the parent's reachable northbound adjacencies
to its own parents, i.e. the node's grandparents.  A node may send a
LIE to a northbound neighbor with the optional boolean field
`you_are_flood_repeater` set to false, to indicate that the
northbound neighbor is not a flood repeater for the node that sent
the LIE.  In that case the northbound neighbor SHOULD NOT reflood
northbound TIEs received from the node that sent the LIE.  If
`you_are_flood_repeater` is absent or set to true, then the
northbound neighbor is a flood repeater for the node that sent the
LIE and MUST reflood northbound TIEs received from that node.
This specification proposes a simple default algorithm that SHOULD be
implemented and used by default on every RIFT node.
o  let |NA(Node) be the set of Northbound adjacencies of node Node
   and CN(Node) be the cardinality of |NA(Node);

o  let |SA(Node) be the set of Southbound adjacencies of node Node
   and CS(Node) be the cardinality of |SA(Node);
o  let S be a similarity constant integer; a value in range 0 .. 2
   for S is RECOMMENDED, the value of 1 SHOULD be used.  Two
   cardinalities are considered as equivalent if their absolute
   difference is less than or equal to S, i.e. |a-b|<=S.

o  let RND be a 64-bit random number generated by the system once on
   startup.

The algorithm consists of the following steps:
1.  Derive a 64-bits number by XOR'ing 'N's system ID with RND.

2.  Derive a 16-bits pseudo-random unsigned integer PR(N) from the
    resulting 64-bits number by splitting it in 16-bits-long words
    W1, W2, W3, W4 (where W1 are the least significant 16 bits of
    the 64-bits number, and W4 are the most significant 16 bits) and
    then XOR'ing the circularly shifted resulting words together:

       (W1<<1) xor (W2<<2) xor (W3<<3) xor (W4<<4);

    where << is the circular shift operator.
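Steps 1 and 2 can be sketched in Python as follows; `rotl16` stands
in for the 16-bit circular shift operator, and a 64-bit system ID is
assumed:

```python
def rotl16(w, n):
    """Circularly shift a 16-bit word left by n bits."""
    n %= 16
    return ((w << n) | (w >> (16 - n))) & 0xFFFF

def pr(system_id, rnd):
    """PR(N): XOR the system ID with RND, split the 64-bit result
    into 16-bit words W1..W4 (W1 least significant), circularly
    shift word Wi left by i, and XOR the shifted words together."""
    x = (system_id ^ rnd) & 0xFFFFFFFFFFFFFFFF
    words = [(x >> (16 * i)) & 0xFFFF for i in range(4)]  # W1..W4
    result = 0
    for i, w in enumerate(words, start=1):
        result ^= rotl16(w, i)
    return result
```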
3.  Sort the parents by decreasing number of northbound adjacencies
    (using decreasing system id of the parent as tie-breaker): sort
    |P(N) by decreasing CN(P), for all P in |P(N), as ordered array
    |A(N)

4.  Partition |A(N) in subarrays |A_k(N) of parents with equivalent
    cardinality of northbound adjacencies (in other words with
    equivalent number of grandparents they can reach):

    1.  set k=0; // k is the ID of the subarray

    2.  set i=0;

    3.  while i < CN(N) do

        1.  set j=i;

        2.  while i < CN(N) and CN(|A(N)[j]) - CN(|A(N)[i]) <= S

            1.  place |A(N)[i] in |A_k(N) // abstract action, maybe
                noop

            2.  set i=i+1;

        3.  /* At this point j is the index in |A(N) of the first
            member of |A_k(N) and (i-j) is C_k(N) defined as the
            cardinality of |A_k(N) */

        4.  set k=k+1;

    4.  /* At this point k is the total number of subarrays,
        initialized for the shuffling operation below */

5.  Shuffle individually each subarray |A_k(N) of cardinality C_k(N)
    within |A(N) using the Durstenfeld variation of the Fisher-Yates
    algorithm that depends on N's System ID:

    1.  while k > 0 do

        1.  for i from C_k(N)-1 to 1 decrementing by 1 do

            1.  set j to PR(N) modulo i;

            2.  exchange |A_k[j] and |A_k[i];

        2.  set k=k-1;

6.  For each grandparent G, initialize a counter c(G) with the
    number of its south-bound adjacencies to elected flood repeaters
    (which is initially zero):

    1.  for each G in |G(N) set c(G) = 0;

7.  Finally keep as FRs only parents that are needed to maintain the
    number of adjacencies between the FRs and any grandparent G
    equal or above the redundancy constant R:

    1.  for each P in reshuffled |A(N);

        1.  if there exists an adjacency ADJ(P, G) in |NA(P) such
            that c(G) < R then

            1.  place P in FR set;

            2.  for all adjacencies ADJ(P, G') in |NA(P) increment
                c(G')

    2.  If any c(G) is still < R, it was not possible to elect a set
        of FRs that covers all grandparents with redundancy R
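Steps 3 through 7 can be sketched as follows.  The data structures
are illustrative: `parents` is a list of (system_id, cn) tuples where
cn is the parent's count of northbound adjacencies, `grandparents_of`
maps a parent's system id to the set of grandparents it reaches, and
a seeded `random.Random` stands in for the PR(N)-driven Durstenfeld
shuffle of the specification:

```python
import random

def elect_flood_repeaters(parents, grandparents_of, R=2, S=1, seed=0):
    """Sketch of the default FR election, steps 3-7 above
    (data structures and names are illustrative)."""
    # Step 3: sort by decreasing northbound adjacency count,
    # decreasing system id as tie-breaker.
    a = sorted(parents, key=lambda p: (p[1], p[0]), reverse=True)

    # Step 4: partition into subarrays of "equivalent" cardinality,
    # i.e. cn within S of the subarray's first member.
    subarrays, i = [], 0
    while i < len(a):
        j = i
        while i < len(a) and a[j][1] - a[i][1] <= S:
            i += 1
        subarrays.append(a[j:i])

    # Step 5: shuffle each subarray individually (stand-in PRNG
    # instead of the PR(N)-seeded Durstenfeld shuffle).
    rng = random.Random(seed)
    for sub in subarrays:
        rng.shuffle(sub)

    # Steps 6-7: greedily keep parents until every grandparent is
    # covered by at least R elected flood repeaters.
    coverage = {g: 0 for sub in subarrays
                for p in sub for g in grandparents_of[p[0]]}
    frs = []
    for sub in subarrays:
        for p in sub:
            if any(coverage[g] < R for g in grandparents_of[p[0]]):
                frs.append(p[0])
                for g in grandparents_of[p[0]]:
                    coverage[g] += 1
    complete = all(c >= R for c in coverage.values())
    return frs, complete
```

With three parents all reaching the same two grandparents and R=2,
only two of the three are elected, which is the intended reduction.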
Additional rules for flooding reduction:

1.  The algorithm MUST be re-evaluated by a node on every change of
    local adjacencies or reception of a parent S-TIE with changed
    adjacencies.  A node MAY apply a hysteresis to prevent excessive
    amount of computation during periods of network instability just
    like in case of reachability computation.

2.  A node SHOULD send out LIEs that grant flood repeater status
    before LIEs that revoke it on flood repeater set changes to
    prevent transient behavior where the full coverage of
    grandparents is not guaranteed.  Albeit the condition will
    correct itself in a positively stable manner due to LIE
    retransmission and periodic TIDEs, it can slow down flooding
    convergence on flood repeater status changes.
3. A node always floods its self-originated TIEs.
4. A node receiving a TIE originated by a node for which it is not a
flood repeater does NOT re-flood such TIEs to its neighbors
except for rules in Paragraph 6.
5. The indication of flood reduction capability is carried in the
node TIEs and can be used to optimize the algorithm to account
for nodes that will flood regardless.
6. A node generates TIDEs as usual but when receiving TIREs or TIDEs
resulting in requests for a TIE of which the newest received copy
came on an adjacency where the node was not flood repeater it
SHOULD ignore such requests on first and first request ONLY.
Normally, the nodes that received the TIEs as flooding repeaters
should satisfy the requesting node and with that no further TIREs
for such TIEs will be generated. Otherwise, the next set of
TIDEs and TIREs MUST lead to flooding independent of the flood
repeater status. This solves a very difficult incast problem on
nodes restarting with a very wide fanout, especially northbound.
To retrieve the full database they often end up processing many
in-rushing copies whereas this approach should load-balance the
incoming database between adjacent nodes and flood repeaters
should guarantee that two copies are sent by different nodes to
ensure against any losses.
7. Obviously, flooding reduction does NOT apply to self-originated
TIEs and, since all policy-guided information consists of
self-originated TIEs, those are unaffected.
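Expressed as an illustrative sketch (Python is used here purely for
exposition and is not part of this specification), rules 3 and 4
reduce to a simple re-flood decision; the function and its arguments
are hypothetical and ignore the exceptions of rules 5 and 6:

```python
def should_reflood(tie_originator, received_from, self_originated,
                   flood_repeater_for):
    """Simplified re-flood decision for a received TIE.

    flood_repeater_for: hypothetical set of adjacencies on which this
    node was elected flood repeater by the sending neighbor.
    """
    if self_originated:
        # Rule 3: a node always floods its self-originated TIEs.
        return True
    # Rule 4: do not re-flood TIEs received on an adjacency where
    # this node is not a flood repeater.
    return received_from in flood_repeater_for
```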
5.2.3.10. Special Considerations
First, due to the distributed, asynchronous nature of ZTP, it can
create temporary convergence anomalies where nodes at higher levels
of the fabric temporarily see themselves at a lower level than the
one they belong at. Since flooding can begin before ZTP is
"finished", and in fact must do so given there is no global
termination criterion, information may end up at the wrong levels.
A special clause when changing level takes care of that.
More difficult is a condition where a node floods a TIE north towards
its super-spine, then its spine reboots, effectively partitioning the
super-spine from the node, and then the node itself reboots. That
leaves, in a sense, the super-spine holding the "primary copy" of the
node's TIE. Normally this condition is resolved easily by the node
re-originating its TIE with a higher sequence number than it sees in
northbound TIEs; here however, when the spine comes back it won't be
able to obtain a N-TIE from its super-spine easily and with that the
node below may issue the same version of the TIE with a lower
sequence number. Flooding procedures are extended to deal with the
problem by the means of special clauses that override the database of
a lower level with headers of newer TIEs seen in TIDEs coming from
the north.
5.2.4. Reachability Computation

A node has three sources of relevant information. A node knows the
full topology south from the received N-TIEs. A node has the set of
prefixes with associated distances and bandwidths from received
S-TIEs.

To compute reachability, a node runs conceptually a northbound and a
southbound SPF. We call that N-SPF and S-SPF.

skipping to change at page 47, line 31

N-SPF uses northbound and East-West adjacencies in the computing
node's node N-TIEs (since if the node is a leaf it may not have
generated a node S-TIE) when starting Dijkstra. Observe that N-SPF
is really just a one-hop variety since Node S-TIEs are not re-flooded
southbound beyond a single level (or East-West) and with that the
computation cannot progress beyond adjacent nodes.

Once progressing, we are using the next level's node S-TIEs to find
the according adjacencies to verify backlink connectivity. Just as
in the case of IS-IS or OSPF, two unidirectional links are associated
together to confirm bidirectional connectivity. Particular care MUST
be paid that the Node TIEs not only contain the correct system IDs
but also matching levels.
A default route found when crossing an E-W link is used if and only
if

1. the node itself does NOT have any northbound adjacencies AND

2. the adjacent node has one or more northbound adjacencies

This rule forms a "one-hop default route split-horizon" and prevents
looping over default routes while allowing for "one-hop protection"
of nodes that lost all northbound adjacencies except at Top-of-Fabric
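The two conditions form a simple predicate; as an illustrative sketch
(not normative, adjacency lists are hypothetical inputs):

```python
def use_ew_default(own_northbound_adjacencies,
                   neighbor_northbound_adjacencies):
    """One-hop default route split-horizon: a default route learned
    across an East-West link is used if and only if the node itself
    has no northbound adjacencies while the E-W neighbor has at
    least one."""
    return (len(own_northbound_adjacencies) == 0 and
            len(neighbor_northbound_adjacencies) > 0)
```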
skipping to change at page 51, line 35

      add (prefix, distance) to disaggregated_prefixes
    end for
  end if
end for
copy disaggregated_prefixes to X's S-TIE
if X's S-TIE is different
  schedule S-TIE for flooding
end if

Figure 15: Computation of Disaggregated Prefixes

Each disaggregated prefix is sent with the corresponding
path_distance. This allows a node to send the same S-TIE to each
south neighbor. The south neighbor which is connected to that prefix
will thus have a shorter path.
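A small worked example of why a single S-TIE suffices (all costs are
illustrative):

```python
# Node X floods the same S-TIE, carrying prefix P with path_distance 2,
# to every south neighbor; each neighbor adds its own link cost to X.
path_distance = 2
link_cost_to_x = 1

# A south neighbor NOT connected to P reaches it through X:
cost_via_x = path_distance + link_cost_to_x

# The south neighbor directly connected to P has a shorter path of its
# own (cost 1 here) and therefore prefers its local route:
direct_cost = 1
best_cost = min(cost_via_x, direct_cost)
```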
Finally, to summarize the less obvious points partially omitted in
the algorithms to keep them more tractable:

1. all neighbor relationships MUST perform backlink checks.
skipping to change at page 52, line 18

lower levels. With that the disturbance in terms of new flooding
is contained to a single level experiencing failures.

5. disaggregated prefix S-TIEs are not "reflected" by the lower
level, i.e. nodes within the same level do NOT need to be aware
which node computed the need for disaggregation.

6. the fabric still supports maximum load balancing properties
while not trying to send traffic northbound unless necessary.
When positive disaggregation is triggered, the nodes may, due to the
very stable but un-synchronized nature of the algorithm, issue the
necessary disaggregated prefixes at different points in time. This
can lead for a short time to "incast" behavior where the first
advertising router attracts all the traffic due to the nature of
longest prefix match. An implementation MAY hence choose different
strategies to address this behavior if needed.
To close this section it is worth observing that in a single plane
ToF this disaggregation prevents blackholing up to (K_LEAF * P) link
failures in terms of Section 5.1.2 or, in other terms, it takes at
minimum that many link failures to partition the ToF into multiple
planes.

5.2.5.2. Negative, Transitive Disaggregation for Fallen Leafs

As explained in Section 5.1.3 failures in multi-plane Top-of-Fabric
or more than (K_LEAF * P) links failing in a single plane design can
generate fallen leafs. Such a scenario cannot be addressed by
positive disaggregation only and needs a further mechanism.

5.2.5.2.1. Cabling of Multiple Top-of-Fabric Planes

Let us return in this section to designs with multiple planes as
shown in Figure 3. Figure 16 highlights how the ToF is cabled in
case of two planes by the means of dual-rings to distribute all the
N-TIEs within both planes. For people familiar with traditional
link-state routing protocols the ToF level can be considered
equivalent to area 0 in OSPF or level-2 in ISIS which need to be
"connected" as well for the protocol to operate correctly.
.     ++==========++        ++==========++
.     II          II        II          II
.+----++--+ +----++--+ +----++--+ +----++--+
.|ToF   A1| |ToF   B1| |ToF   B2| |ToF   A2|
.++-+-++--+ ++-+-++--+ ++-+-++--+ ++-+-++--+
. | | II     | | II     | | II     | | II
. | | ++==========++    | | ++==========++
. | |        | |        | |        | |
.
.  ~~~ Highlighted ToF of the previous multi-plane figure ~~

Figure 16: Topologically connected planes
As described in Section 5.1.3 failures in multi-plane fabrics can
lead to blackholes which normal positive disaggregation cannot fix.
The mechanism of negative, transitive disaggregation incorporated in
RIFT provides the necessary solution.

5.2.5.2.2. Transitive Advertisement of Negative Disaggregates

A ToF node that discovers that it cannot reach a fallen leaf
disaggregates all the prefixes of such leafs. It uses for that
skipping to change at page 54, line 7

5.2.5.2.3. Computation of Negative Disaggregates

The document omitted so far the description of the computation
necessary to generate the correct set of negative prefixes. Negative
prefixes can in fact be advertised due to two different triggers. We
describe them consecutively.

The first origination reason is a computation that uses all the node
N-TIEs to build the set of all reachable nodes by reachability
computation over the complete graph, including ToF links. The
computation uses the node itself as root. This is compared with the
result of the normal southbound SPF as described in Section 5.2.4.2.
The difference is the set of fallen leafs; all their attached
prefixes are advertised as negative prefixes southbound if the node
does not see the prefix as reachable within the southbound SPF.
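This first trigger can be sketched as two reachability computations
and a set difference; the graph representation and node names below
are illustrative only, not part of the specification:

```python
def reachable(graph, root):
    """Set of nodes reachable from root over an adjacency dict."""
    seen, stack = {root}, [root]
    while stack:
        node = stack.pop()
        for nbr in graph.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return seen

def fallen_leafs(full_graph, southbound_graph, root):
    """Nodes reachable over the complete graph (including ToF links)
    but not via the normal southbound SPF: these are the fallen leafs
    whose attached prefixes are candidates for negative
    disaggregation."""
    return reachable(full_graph, root) - reachable(southbound_graph, root)

# Illustrative topology: the plane holding T2 and leaf L2 is cut off
# from this node's southbound SPF.
full = {"ToF": ["T1", "T2"], "T1": ["L1"], "T2": ["L2"]}
south = {"ToF": ["T1"], "T1": ["L1"]}
fallen = fallen_leafs(full, south, "ToF")
```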
The second mechanism hinges on the understanding of how the negative
prefixes are used within the computation as described in Figure 17.
When attaching the negative prefixes at a certain point in time, the
negative prefix may find itself with all the viable nodes from the
shorter match nexthop being pruned. In other words, all its
northbound neighbors provided a negative prefix advertisement. This
is the trigger to advertise this negative prefix transitively south
and is normally caused by the node being in a plane where the prefix
belongs to a fabric leaf that has "fallen" in this plane. Obviously,
when one of the northbound switches withdraws its negative
advertisement, the node has to withdraw its transitively provided
negative prefix as well.
skipping to change at page 55, line 21

if S-TIE.level > X.level
  next_hop_set = set of minimum cost links to the S-TIE.originator
  next_hop_cost = minimum cost link to S-TIE.originator
end if
for each prefix P in the S-TIE
  P.cost = P.cost + next_hop_cost
  if P not in route_database:
    add (P, type=DistVector, P.cost, next_hop_set) to route_database
  end if
  if (P in route_database):
    if route_database[P].cost > P.cost or
       route_database[P].type > P.type:
      update route_database[P] with (P, DistVector, P.cost, P.type,
             next_hop_set)
    else if route_database[P].cost == P.cost and
            route_database[P].type == P.type:
      update route_database[P] with (P, DistVector, P.cost, P.type,
             merge(next_hop_set, route_database[P].next_hop_set))
    else
      // Not preferred route so ignore
    end if
  end if
end for
end for

Figure 17: Adding Routes from S-TIE Positive and Negative Prefixes
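The route-preference logic of the figure above can be rendered as
runnable Python roughly as follows; the tuple layout and the
convention that a numerically smaller type value is more preferred
are illustrative assumptions, not normative:

```python
def add_prefix(route_db, prefix, cost, rtype, next_hop_set):
    """Mirror of the pseudocode: keep the better (cost, type) route,
    ECMP-merge next hops on an exact tie, ignore worse routes."""
    if prefix not in route_db:
        route_db[prefix] = (cost, rtype, set(next_hop_set))
        return
    cur_cost, cur_type, cur_nhs = route_db[prefix]
    if cur_cost > cost or cur_type > rtype:
        # Strictly better cost or more preferred type replaces the entry.
        route_db[prefix] = (cost, rtype, set(next_hop_set))
    elif cur_cost == cost and cur_type == rtype:
        # Equal preference: merge the next-hop sets.
        route_db[prefix] = (cost, rtype, cur_nhs | set(next_hop_set))
    # else: not a preferred route, so ignore
```

An equal-cost second advertisement thus widens the next-hop set while
a worse one leaves the database untouched.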
After the positive prefixes are attached and tie-broken, negative
prefixes are attached and used in case of northbound computation,
ideally from the shortest length to the longest. The nexthop
adjacencies for a negative prefix are inherited from the longest
prefix that aggregates it, and subsequently adjacencies to nodes that
advertised negative for this prefix are removed.
The rule of inheritance MUST be maintained when the nexthop list for
a prefix is modified, as the modification may affect the entries for
matching negative prefixes of immediately longer prefix length. For
instance, if a nexthop is added, then by inheritance it must be added
to all the negative routes of immediately longer prefix length unless
it is pruned due to a negative advertisement for the same next hop.
Similarly, if a nexthop is deleted for a given prefix, then it is
deleted for all the immediately aggregated negative routes. This
will recurse in the case of nested negative prefix aggregations.

skipping to change at page 56, line 21

their adjacencies. As the aggregating prefix changes, all the
negative routes must be recomputed, and then again the process may
recurse in case of nested negative prefix aggregations.

Observe that despite seeming quite computationally expensive the
operations are only necessary if the only available advertisements
for a prefix are negative since tie-breaking always prefers positive
information for the prefix, which stops any kind of recursion because
positive information never inherits next hops.
To make the negative disaggregation less abstract and provide an
example, let us consider a ToP node T1 with 4 ToF parents S1..S4 as
represented in Figure 18:

+----+ +----+ +----+ +----+ N
| S1 | | S2 | | S3 | | S4 | ^
+----+ +----+ +----+ +----+ W< + >E
| | | | v
|+--------+ | | S
||+-----------------+ |
|||+----------------+---------+
||||
+----+
| T1 |
+----+
Figure 18: A ToP node with 4 parents
If all ToF nodes can reach all the prefixes in the network, then with
RIFT they will normally advertise a default route south. An
abstract Routing Information Base (RIB), more commonly known as a
routing table, stores all types of maintained routes including the
negative ones and "tie-breaks" for the best one, whereas an abstract
Forwarding Table (FIB) retains only the ultimately computed
"positive" routing instructions. In T1, those tables would look as
illustrated in Figure 19:
+---------+
| Default |
+---------+
|
| +--------+
+---> | Via S1 |
| +--------+
|
| +--------+
+---> | Via S2 |
| +--------+
|
| +--------+
+---> | Via S3 |
| +--------+
|
| +--------+
+---> | Via S4 |
+--------+
Figure 19: Abstract RIB
In case T1 receives a negative advertisement for prefix 2001:db8::/32
from S1, a negative route is stored in the RIB (indicated by a ~
sign), while the more specific routes to the complementing ToF nodes
are installed in the FIB. RIB and FIB in T1 now look as illustrated
in Figure 20 and Figure 21, respectively:
+---------+ +-----------------+
| Default | <-------------- | ~2001:db8::/32 |
+---------+ +-----------------+
| |
| +--------+ | +--------+
+---> | Via S1 | +---> | Via S1 |
| +--------+ +--------+
|
| +--------+
+---> | Via S2 |
| +--------+
|
| +--------+
+---> | Via S3 |
| +--------+
|
| +--------+
+---> | Via S4 |
+--------+
Figure 20: Abstract RIB after negative 2001:db8::/32 from S1
Negative 2001:db8::/32 entry inherits from ::/0, so the positive more
specific routes are the complements to S1 in the set of next-hops for
the default route. That entry is composed of S2, S3, and S4, or, in
other words, it uses all entries of the default route with a "hole
punched" for S1 into them. These are the next hops that are still
available to reach 2001:db8::/32, now that S1 advertised that it will
not forward 2001:db8::/32 anymore. Ultimately, those resulting next-
hops are installed in FIB for the more specific route to
2001:db8::/32 as illustrated below:
+---------+ +---------------+
| Default | | 2001:db8::/32 |
+---------+ +---------------+
| |
| +--------+ |
+---> | Via S1 | |
| +--------+ |
| |
| +--------+ | +--------+
+---> | Via S2 | +---> | Via S2 |
| +--------+ | +--------+
| |
| +--------+ | +--------+
+---> | Via S3 | +---> | Via S3 |
| +--------+ | +--------+
| |
| +--------+ | +--------+
+---> | Via S4 | +---> | Via S4 |
+--------+ +--------+
Figure 21: Abstract FIB after negative 2001:db8::/32 from S1
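The "hole punching" just illustrated is, computationally, a set
subtraction: inherit the next hops of the aggregating prefix, then
prune the negative advertisers. A minimal sketch (the RIB
representation and node names are illustrative, not normative):

```python
def negative_route_next_hops(rib, aggregate, negative_advertisers):
    """Next hops for a negatively-advertised prefix: inherit the
    next-hop set of the longest aggregating prefix in the RIB, then
    prune every node that advertised the prefix negatively."""
    inherited = set(rib[aggregate])
    return inherited - set(negative_advertisers)

# T1's default route and the negative 2001:db8::/32 from S1:
rib = {"::/0": {"S1", "S2", "S3", "S4"}}
fib_entry = negative_route_next_hops(rib, "::/0", {"S1"})
```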
To illustrate matters further let us consider T1 receiving a negative
advertisement for prefix 2001:db8:1::/48 from S2, which is stored in
RIB again. After the update, the RIB in T1 is illustrated in
Figure 22:
+---------+ +----------------+ +------------------+
| Default | <----- | ~2001:db8::/32 | <------ | ~2001:db8:1::/48 |
+---------+ +----------------+ +------------------+
| | |
| +--------+ | +--------+ |
+---> | Via S1 | +---> | Via S1 | |
| +--------+ +--------+ |
| |
| +--------+ | +--------+
+---> | Via S2 | +---> | Via S2 |
| +--------+ +--------+
|
| +--------+
+---> | Via S3 |
| +--------+
|
| +--------+
+---> | Via S4 |
+--------+
Figure 22: Abstract RIB after negative 2001:db8:1::/48 from S2
Negative 2001:db8:1::/48 inherits from 2001:db8::/32 now, so the
positive more specific routes are the complements to S2 in the set of
next hops for 2001:db8::/32, which are S3 and S4 or, in other words,
all entries of the parent prefix with the negative holes "punched in"
again. After the update, the FIB in T1 shows as illustrated in
Figure 23:
+---------+ +---------------+ +-----------------+
| Default | | 2001:db8::/32 | | 2001:db8:1::/48 |
+---------+ +---------------+ +-----------------+
| | |
| +--------+ | |
+---> | Via S1 | | |
| +--------+ | |
| | |
| +--------+ | +--------+ |
+---> | Via S2 | +---> | Via S2 | |
| +--------+ | +--------+ |
| | |
| +--------+ | +--------+ | +--------+
+---> | Via S3 | +---> | Via S3 | +---> | Via S3 |
| +--------+ | +--------+ | +--------+
| | |
| +--------+ | +--------+ | +--------+
+---> | Via S4 | +---> | Via S4 | +---> | Via S4 |
+--------+ +--------+ +--------+
Figure 23: Abstract FIB after negative 2001:db8:1::/48 from S2
Further, let us say that S3 stops advertising its service as default
gateway. The entry is removed from the RIB as usual. In order to
update the FIB, it is necessary to eliminate the FIB entry for the
default route, as well as all the FIB entries that were created for
negative routes pointing to the RIB entry being removed (::/0). This
is done recursively for 2001:db8::/32 and then for 2001:db8:1::/48.
The related FIB entries via S3 are removed, as illustrated in
Figure 24.
+---------+ +---------------+ +-----------------+
| Default | | 2001:db8::/32 | | 2001:db8:1::/48 |
+---------+ +---------------+ +-----------------+
| | |
| +--------+ | |
+---> | Via S1 | | |
| +--------+ | |
| | |
| +--------+ | +--------+ |
+---> | Via S2 | +---> | Via S2 | |
| +--------+ | +--------+ |
| | |
| | |
| | |
| | |
| | |
| +--------+ | +--------+ | +--------+
+---> | Via S4 | +---> | Via S4 | +---> | Via S4 |
+--------+ +--------+ +--------+
Figure 24: Abstract FIB after loss of S3
Say that, at that time, S4 also disaggregates prefix
2001:db8:1::/48. This means that the FIB entry for 2001:db8:1::/48
becomes a discard route, and that is the signal for T1 to
disaggregate prefix 2001:db8:1::/48 negatively in a transitive
fashion towards its own children.
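The discard-route condition that triggers transitive propagation can
be expressed as a simple predicate; an illustrative sketch, not
normative:

```python
def should_transit_negative(inherited_next_hops, negative_advertisers):
    """A node propagates a negative disaggregation south once every
    inherited next hop has been pruned by negative advertisements,
    i.e. the prefix has become a discard route."""
    return len(set(inherited_next_hops) - set(negative_advertisers)) == 0
```

In the example above, 2001:db8:1::/48 had only S4 left as inherited
next hop, so S4's negative advertisement makes T1 propagate.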
Finally, let us look at the case where S3 becomes available again as
default gateway and a negative advertisement is received from S4
about prefix 2001:db8:2::/48 as opposed to 2001:db8:1::/48. Again, a
negative route is stored in the RIB, and the more specific routes to
the complementing ToF nodes are installed in the FIB. Since
2001:db8:2::/48 inherits from 2001:db8::/32, the positive FIB routes
are chosen by removing S4 from S2, S3, S4. The abstract FIB in T1
now shows as illustrated in Figure 25:
+-----------------+
| 2001:db8:2::/48 |
+-----------------+
|
+---------+ +---------------+ +-----------------+
| Default | | 2001:db8::/32 | | 2001:db8:1::/48 |
+---------+ +---------------+ +-----------------+
| | | |
| +--------+ | | | +--------+
+---> | Via S1 | | | +---> | Via S2 |
| +--------+ | | | +--------+
| | | |
| +--------+ | +--------+ | | +--------+
+---> | Via S2 | +---> | Via S2 | | +---> | Via S3 |
| +--------+ | +--------+ | +--------+
| | |
| +--------+ | +--------+ | +--------+
+---> | Via S3 | +---> | Via S3 | +---> | Via S3 |
| +--------+ | +--------+ | +--------+
| | |
| +--------+ | +--------+ | +--------+
+---> | Via S4 | +---> | Via S4 | +---> | Via S4 |
+--------+ +--------+ +--------+
Figure 25: Abstract FIB after negative 2001:db8:2::/48 from S4
5.2.7. Optional Zero Touch Provisioning (ZTP)

Each RIFT node can optionally operate in zero touch provisioning
(ZTP) mode, i.e. it has no configuration (unless it is a
Top-of-Fabric at the top of the topology, or it must operate in the
topology as a leaf and/or support leaf-2-leaf procedures) and it will
fully configure itself after being attached to the topology.
Configured nodes and nodes operating in ZTP can be mixed and will
form a valid topology if achievable.

The derivation of the level of each node happens based on offers
received from its neighbors whereas each node (with the possible
exception of configured leafs) tries to attach at the highest
possible point in the fabric. This guarantees that even if the
diffusion front reaches a node from "below" faster than from "above",
the node will greedily abandon an already negotiated level derived
from nodes topologically below it and properly peer with nodes above.

The fabric is very consciously numbered from the top to allow for
PoDs of different heights and to minimize the amount of provisioning
necessary, in this case just a TOP_OF_FABRIC flag on every node at
the top of the fabric.
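The offer-based derivation can be sketched as follows, under the
simplifying assumption that a node without a configured level adopts
one less than the highest valid level offered by its neighbors; the
precise validity rules for offers are defined by the ZTP procedures
and are omitted here:

```python
def derived_level(offers):
    """Hedged sketch of automatic level derivation: attach as high as
    possible by taking one less than the best (highest) neighbor
    offer. Returns None to represent UNDEFINED_LEVEL when no valid
    offer exists. Offer validity is heavily simplified."""
    valid = [o for o in offers if o is not None and o > 0]
    if not valid:
        return None  # UNDEFINED_LEVEL
    return max(valid) - 1
```

This also illustrates why a greedy node abandons a level derived from
below: a later, higher offer from above raises the maximum.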
This section describes the necessary concepts and procedures for ZTP
operation.

5.2.7.1. Terminology

The interdependencies between the different flags and the configured
level can be somewhat vexing at first and it may take multiple reads
of the glossary to comprehend them.

Automatic Level Derivation: Procedures which allow nodes without a
configured level to derive it automatically. Only applied if
CONFIGURED_LEVEL is undefined.

UNDEFINED_LEVEL: A "null" value that indicates that the level has
not been determined and has not been configured. Schemas normally
indicate that by a missing optional value without an available
defined default.

LEAF_ONLY: An optional configuration flag that can be configured on
a node to make sure it never leaves the "bottom of the hierarchy".
TOP_OF_FABRIC flag and CONFIGURED_LEVEL cannot be defined at the
same time as this flag. It implies a CONFIGURED_LEVEL value of 0.

TOP_OF_FABRIC flag: Configuration flag that MUST be provided to all
Top-of-Fabric nodes. LEAF_FLAG and CONFIGURED_LEVEL cannot be
defined at the same time as this flag. It implies a
CONFIGURED_LEVEL value. In fact, it is basically a shortcut for
configuring the same level at all Top-of-Fabric nodes which is
unavoidable since an initial 'seed' is needed for other ZTP nodes
to derive their level in the topology. The flag plays an important
role in fabrics with multiple planes to enable successful negative
disaggregation (Section 5.2.5.2).

CONFIGURED_LEVEL: A level value provided manually. When this is
defined (i.e. it is not UNDEFINED_LEVEL) the node is not
participating in ZTP. TOP_OF_FABRIC flag is ignored when this
value is defined. LEAF_ONLY can be set only if this value is
undefined or set to 0.

DERIVED_LEVEL: Level value computed via automatic level derivation
when CONFIGURED_LEVEL is equal to UNDEFINED_LEVEL.
skipping to change at page 55, line 50 skipping to change at page 66, line 10
sending all its TIEs with fairly short lifetimes) since otherwise
the network may be left with large amounts of stale TIEs in other
nodes (though this is not necessarily a serious problem if the
procedures described in Section 8 are implemented).
5.2.7.3.  Generic Fabric Example

ZTP forces us to think about miscabled or unusually cabled fabrics
and how such a topology can be forced into the "lattice" structure
which a fabric represents (with further restrictions).  Let us
consider the necessary and sufficient physical cabling in Figure 26.
We assume all nodes being in the same PoD.
. +---+
. | A |     s   = TOP_OF_FABRIC
. | s |     l   = LEAF_ONLY
. ++-++     l2l = LEAF_2_LEAF
. | |
. +--+ +--+
. | |
. +--++ ++--+
. | | | | |
. +---------+ | +------+ |
. | | | | |
. +-----------------+ | |
. | | | | |
. ++-++ ++-++ |
. | X +-----+ Y +-+
. |l2l| | l |
. +---+ +---+

          Figure 26: Generic ZTP Cabling Considerations
First, we must anchor the "top" of the cabling and that's what the
TOP_OF_FABRIC flag at node A is for.  Then things look smooth until
we have to decide whether node Y is at the same level as I and J
(and consequently X is south of it) or at the same level as X.
This is unresolvable here until we "nail down the bottom" of the
topology.  To achieve that we choose in this example to use the
leaf flags.  We will then see whether Y chooses to form adjacencies
to F or to I and J successively.
5.  A node MUST reset any adjacency that has changed the level it
    is offering and is in three way state.

6.  A node that changed its defined level value MUST readvertise
    its own TIEs (since the new `PacketHeader` will contain a
    different level than before).  The sequence number of each TIE
    MUST be increased.

7.  After a level has been derived the node MUST set the
    `not_a_ztp_offer` flag on LIEs towards all systems offering a
    VOL for HAL.
8. A node that changed its level SHOULD flush from its link state
database TIEs of all other nodes, otherwise stale information may
persist on "direction reversal", i.e. nodes that seemed south
are now north or east-west. This will not prevent the correct
operation of the protocol but could be slightly confusing
operationally.
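As a non-normative sketch of the derivation rules above, the
following illustrates how a ZTP node settles on a level once
CONFIGURED_LEVEL is undefined, assuming (per the terminology in this
section) that DERIVED_LEVEL is one less than the Highest Available
Level (HAL) computed from the Valid Offered Levels (VOLs); the
helper names are hypothetical:

```python
# Illustrative only: level derivation from ZTP offers.  None models
# UNDEFINED_LEVEL; vols is the set of currently Valid Offered Levels.

def highest_available_level(vols):
    """HAL: the highest level among all currently valid offers."""
    return max(vols) if vols else None

def level(configured_level, vols):
    """A node with a CONFIGURED_LEVEL does not participate in ZTP;
    otherwise DERIVED_LEVEL is one less than HAL."""
    if configured_level is not None:
        return configured_level
    hal = highest_available_level(vols)
    if hal is None or hal == 0:
        return None          # level remains undefined
    return hal - 1

# e.g. offers from Top-of-Fabric nodes at level 24 derive level 23:
# level(None, {24}) -> 23
```

A change of the returned level is what triggers the adjacency resets
and TIE readvertisements required by the numbered rules above.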
A node starting with LEVEL_VALUE being 0 (i.e. it assumes a leaf
function by being configured with the appropriate flags or has a
CONFIGURED_LEVEL of 0) MUST follow those additional procedures:

1.  It computes HAT per procedures above but does NOT use it to
    compute DERIVED_LEVEL.  HAT is used to limit adjacency
    formation per Section 5.2.2.
It MAY also follow modified procedures:

1.  It may pick a different strategy to choose VOL, e.g. use the
    most frequently offered VOL value.  Such strategies are only
    possible since the node always remains "at the bottom of the
    fabric" while another layer could "invert" the fabric by
    picking its preferred VOL in a different fashion than always
    trying to achieve the highest viable level.
5.2.7.5.  Resulting Topologies

The procedures defined in Section 5.2.7.4 will lead to the RIFT
topology and levels depicted in Figure 27.
. +---+
. | As|
. | 24|
. ++-++
. | |
. +--+ +--+
. | |
. +--++ ++--+
. | E | | F |
. | 22| | 22| |
. ++--+ +--++ |
. | | |
. +---------+ | |
. | | |
. ++-++ +---+ |
. | X | | Y +-+
. | 0 | | 0 |
. +---+ +---+

          Figure 27: Generic ZTP Topology Autoconfigured
If we imagine the LEAF_ONLY restriction on Y being removed,
however, the outcome would be very different and result in
Figure 28.  This demonstrates that autoconfiguration makes
miscabling detection hard and with that can lead to undesirable
effects in cases where leafs are not "nailed down" by the
accordingly configured flags and are arbitrarily cabled.
A node MAY analyze the outstanding level offers on its interfaces
and generate warnings when its internal ruleset flags a possible
miscabling.  As an example, when a node sees ZTP level offers that
differ by more than one level from its chosen level (with proper
accounting for leafs being at level 0) this can indicate
miscabling.
. | | | | |
. | +-----------------+ |
. | | |
. +---------+ | |
. | | |
. ++-++ |
. | X +--------+
. | 0 |
. +---+

          Figure 28: Generic ZTP Topology Autoconfigured
5.2.8.  Stability Considerations

The autoconfiguration mechanism computes a global maximum of levels
by diffusion.  The achieved equilibrium can be disturbed massively
by all nodes with the highest level either leaving or entering the
domain (with some finer distinctions not explained further).  It is
therefore recommended that each node is multi-homed towards nodes
with respective HAL offerings.  Fortunately, this is the natural
state of things for the topology variants considered in RIFT.
A leaf will have no N-TIEs except its own and optionally those from
its East-West neighbors.  A leaf will have S-TIEs from its
neighbors.

Instead of creating a network graph from its N-TIEs and neighbor's
S-TIEs and then running an SPF, a leaf node can simply compute the
minimum cost and next_hop_set to each leaf neighbor by examining
its local adjacencies, determining bi-directionality from the
associated N-TIE, and specifying the neighbor's next_hop_set and
cost from the minimum cost local adjacency to that neighbor.

Then a leaf attaches prefixes as described in Section 5.2.6.
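The SPF-less leaf computation described above can be sketched as
follows; this is a non-normative illustration with hypothetical
data-structure names, where each local adjacency carries the
bi-directionality result already determined from the neighbor's
N-TIE:

```python
# Illustrative only: a leaf derives per-neighbor routes directly
# from its local adjacencies instead of running an SPF.

def leaf_routes(adjacencies):
    """adjacencies: iterable of (neighbor_id, link_id, cost, bidi).
    Returns {neighbor_id: (min_cost, next_hop_set)}."""
    best = {}
    for neighbor, link, cost, bidi in adjacencies:
        if not bidi:                 # one-way adjacencies are ignored
            continue
        known_cost, links = best.get(neighbor, (None, set()))
        if known_cost is None or cost < known_cost:
            best[neighbor] = (cost, {link})   # new minimum cost
        elif cost == known_cost:
            links.add(link)                   # equal-cost parallel link
    return best

routes = leaf_routes([("leaf2", "eth0", 1, True),
                      ("leaf2", "eth1", 1, True),
                      ("leaf3", "eth2", 2, False)])
# leaf2 is reachable at cost 1 over {eth0, eth1}; leaf3 is dropped
# because its adjacency is not bi-directional.
```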
5.3.3.  Mobility

It is a requirement for RIFT to maintain at the control plane a
real time status of which prefix is attached to which port of which
leaf, even in a context of mobility where the point of attachment
may change several times in a subsecond period of time.

There are two classical approaches to maintain such knowledge in an
unambiguous fashion:
synchronized to be able to compare time stamps as observed by the
various points of attachment, e.g., using the variation of the
Precision Time Protocol (PTP) IEEE Std. 1588 [IEEEstd1588]
designed for bridged LANs, IEEE Std. 802.1AS [IEEEstd8021AS].
Both the precision of the synchronisation protocol and the
resolution of the time stamp must beat the highest possible
roaming time on the fabric.  Another drawback is that the
presence of the mobile device may be observed only
asynchronously, e.g., after it starts using an IP protocol such
as ARP [RFC0826], IPv6 Neighbor Discovery [RFC4861][RFC4862], or
DHCP [RFC2131][RFC8415].
sequence counter: With this method, a mobile node notifies its
   point of attachment on arrival with a sequence counter that is
   incremented upon each movement.  On the positive side, this
   method does not have a dependency on a precise sense of time,
   since the sequence of movements is kept in order by the device.
   The disadvantage of this approach is the lack of support for
   protocols that may be used by the mobile node to register its
   presence to the leaf node with the capability to provide a
   sequence counter.  Well-known issues with wrapping sequence
   counters must be
`PrefixSequenceType` prefix attribute that we call a `monotonic
clock` consisting of a timestamp and optional sequence number.  In
case of presence of the attribute:

o  The leaf node MUST advertise a time stamp of the latest sighting
   of a prefix, e.g., by snooping IP protocols or the node using
   the time at which it advertised the prefix.  RIFT transports the
   time stamp within the desired prefix N-TIEs as an 802.1AS
   timestamp.

o  RIFT may interoperate with the "update to 6LoWPAN Neighbor
   Discovery" [RFC8505], which provides a method for registering a
   prefix with a sequence counter called a Transaction ID (TID).
   In such a case RIFT transports the TID in its native form.
o  RIFT also defines an abstract negative clock (ASNC) that
   compares as less than any other clock.  By default, the lack of
   a `PrefixSequenceType` in a Prefix N-TIE is interpreted as ASNC.
   We also call this an `undefined` clock.

o  Any prefix present on the fabric in multiple nodes that has the
   `same` clock is considered as anycast.  ASNC is always
   considered smaller than any defined clock.
1.  ASNC is older than any other value except ASNC AND

2.  Clocks with timestamps differing by more than
    MAXIMUM_CLOCK_DELTA are comparable by using the timestamps only
    AND

3.  Clocks with timestamps differing by less than
    MAXIMUM_CLOCK_DELTA are comparable by using their TIDs only AND

4.  An undefined TID is always older than any other TID AND

5.  TIDs are compared using the rules of [RFC8505].
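The comparison rules above can be sketched as follows; this is a
non-normative illustration in which a clock is modeled as a
(timestamp, TID) pair, None models ASNC or an undefined TID, and
the TID comparison is a simplified one-octet serial-number check
rather than the full [RFC8505] rules:

```python
# Illustrative only: freshness comparison of monotonic clocks.
MAXIMUM_CLOCK_DELTA = 0.2       # seconds; illustrative value

def tid_newer(a, b):
    """Simplified wrapping comparison of one-octet counters."""
    return 0 < ((a - b) & 0xFF) < 128

def clock_newer(c1, c2):
    """True if clock c1 is strictly fresher than clock c2."""
    if c1 is None:               # ASNC is older than everything
        return False
    if c2 is None:               # any defined clock beats ASNC
        return True
    t1, tid1 = c1
    t2, tid2 = c2
    if abs(t1 - t2) > MAXIMUM_CLOCK_DELTA:
        return t1 > t2           # timestamps alone decide
    if tid1 is None:             # undefined TID is always older
        return False
    if tid2 is None:
        return True
    return tid_newer(tid1, tid2) # within delta, TIDs alone decide
```

For instance, clocks (10.0, 5) and (10.1, 130) fall within
MAXIMUM_CLOCK_DELTA, so the wrapped TIDs decide and 130 wins as the
fresher sighting.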
5.3.3.2.  Interaction between Time Stamps and Sequence Counters

For slow movements that occur less frequently than e.g. once per
second, the time stamp that the RIFT infrastructure captures is
enough to determine the freshest discovery.  If the point of
attachment changes faster than the maximum drift of the time
stamping mechanism (i.e. MAXIMUM_CLOCK_DELTA), then a sequence
counter is required to add resolution to the freshness evaluation,
and it must be sized so that the counters stay comparable within
the resolution of the time stamping mechanism.
The sequence counter in [RFC8505] is encoded as one octet and wraps
around using Appendix A.
Within the resolution of MAXIMUM_CLOCK_DELTA the sequence counters
captured during 2 sequential values of the time stamp SHOULD be
comparable.  With default values this means that a node may move up
to 127 times during a 200 millisecond period while the clocks
remain comparable, thus allowing the infrastructure to assert the
freshest advertisement with no ambiguity.
5.3.3.3.  Anycast vs. Unicast

A unicast prefix can be attached to at most one leaf, whereas an
anycast prefix may be reachable via more than one leaf.
If a monotonic clock attribute is provided on the prefix, then the
prefix with the `newest` clock value is strictly preferred.  An
anycast prefix either carries no clock or all its clock attributes
MUST be the same under the rules of Section 5.3.3.1.
Observe that in mobility events it is important that the leaf re-
floods as quickly as possible the absence of the prefix that moved
away.
Observe further that without support for [RFC8505] movements on the
fabric within intervals smaller than 100msec will be seen as
anycast.
5.3.3.4.  Overlays and Signaling

RIFT is agnostic to whichever overlay technology [MIP, LISP, VxLAN,
NVO3] and associated signaling is deployed over it.  But it is
expected that leaf nodes, and possibly Top-of-Fabric nodes, can
perform the according encapsulation.
In the context of mobility, overlays provide a classical solution
to avoid injecting mobile prefixes in the fabric and improve the
scalability of the solution.  It makes sense on a data center that
already uses overlays to consider their applicability to the
mobility solution; as an example, a mobility protocol such as LISP
may inform the ingress leaf of the location of the egress leaf in
real time.

Another possibility is to consider mobility as an underlay
5.3.4.  Key/Value Store

5.3.4.1.  Southbound

The protocol supports a southbound distribution of key-value pairs
that can be used to e.g. distribute configuration information
during topology bring-up.  The KV S-TIEs can arrive from multiple
nodes and hence need tie-breaking per key.  We use the following
rules:

1.  Only KV TIEs originated by nodes to which the receiver has a
    bi-directional adjacency are considered.

2.  Within all such valid KV S-TIEs containing the key, the value
    of the KV S-TIE originated at the highest level and, within the
    same level, by the highest originating system ID (and for which
    the according node S-TIE is present) is preferred.  If keys in
    the most preferred TIEs are overlapping, the behavior is
    undefined.
Observe that if a node goes down, the node south of it loses
adjacencies to it and with that the KVs will be disregarded and on
tie-break changes new KVs re-advertised to prevent stale
information being used by nodes further south.  KV information in
the southbound direction is not the result of an independent
computation by every node over the same set of TIEs but a diffused
computation.
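The southbound tie-breaking rules above can be sketched as follows;
this is a non-normative illustration with hypothetical names, where
the input is already filtered to KV S-TIEs from bi-directionally
adjacent nodes whose node S-TIE is present:

```python
# Illustrative only: per-key tie-breaking of southbound KV S-TIEs by
# (originator level, originating system ID), both highest-wins.

def select_kv_value(key, kv_sties):
    """kv_sties: iterable of (level, system_id, {key: value})."""
    candidates = [(level, system_id, kvs[key])
                  for level, system_id, kvs in kv_sties if key in kvs]
    if not candidates:
        return None
    # max() over the tuples prefers the highest level first and,
    # within a level, the highest originating system ID.
    return max(candidates)[2]

value = select_kv_value("config-url",
                        [(23, 0x1, {"config-url": "a"}),
                         (24, 0x7, {"config-url": "b"}),
                         (24, 0x9, {"config-url": "c"})])
# -> "c": both remaining candidates are at level 24, so the higher
#    system ID (0x9) wins.
```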
5.3.4.2.  Northbound

Certain use cases seem to necessitate distribution of essentially
KV information that is generated in the leafs in the northbound
direction.  Such information is flooded in KV N-TIEs.  Since the
originator of a northbound KV is preserved during northbound
flooding, overlapping keys could be used.  However, to avoid
further protocol complexity, only the value of the key in the TIE
tie-broken in the same fashion as southbound KV TIEs is used.
5.3.5.  Interactions with BFD

RIFT MAY incorporate BFD [RFC5881] to react quickly to link
failures.  In such case the following procedures are introduced:

   After RIFT three way hello adjacency convergence a BFD session
   MAY be formed automatically between the RIFT endpoints without
   further configuration using the exchanged discriminators.  The
   capability of the remote side to support BFD is carried on the
   LIEs.

   In case an established BFD session goes Down after it was Up,
   the RIFT adjacency should be re-initialized, starting from Init.

   In case of parallel links between nodes each link may run its
   own independent BFD session or they may share a session.

   In case RIFT changes link identifiers or the BFD capability
   indication, both the LIE as well as the BFD sessions SHOULD be
   brought down and back up again.

   Multiple RIFT instances MAY choose to share a single BFD session
   (in such case it is undefined what discriminators are used
   albeit RIFT CAN advertise the same link ID for the same
   interface in multiple instances and with that "share" the
   discriminators).

   BFD TTL follows [RFC5082].
5.3.6.  Fabric Bandwidth Balancing

A well understood problem in fabrics is that in case of link losses
it would be ideal to rebalance how much traffic is offered to
switches in the next level based on the ingress and egress
bandwidth they have.  Current attempts rely mostly on specialized
traffic engineering via controller or leafs being aware of the
complete topology, with according cost and complexity.

RIFT can support a very light weight mechanism that can deal with
the problem in an approximate way based on the fact that RIFT is
loop-free.
5.3.6.1.  Northbound Direction

Every RIFT node SHOULD compute the amount of northbound bandwidth
available through its neighbors at the higher level and modify the
distance received on the default route from each such neighbor.
Those different distances SHOULD be used to support weighted ECMP
forwarding towards the higher level when using the default route.
We call such a distance Bandwidth Adjusted Distance or BAD.  This
is best illustrated by a
. || || || ||
. || || || ||
. || +------------+| || ||
. || |+------------+ || ||
. |x || || ||
. +-+---+++ +--++-+++
. | | | |
. |Leaf111| |Leaf112|
. +-------+ +-------+

               Figure 29: Balancing Bandwidth
All links from the Leafs in Figure 29 are assumed to be 10 MBit/s
while the uplinks one level further up are assumed to be 100
MBit/s.

Further, in Figure 29 we assume that Leaf111 lost one of the
parallel links to Spine 111 and with that wants to possibly push
more traffic onto Spine 112.  Leaf 112 has equal bandwidth to Spine
111 and Spine 112 but Spine 111 lost one of its uplinks.

The local modification of the received default route distance from
the upper level is achieved by running a relatively simple
algorithm where the bandwidth is weighted exponentially while the
distance on the default route represents a multiplier for the
bandwidth weight for easy operational adjustments.
   | Leaf112 | Spine 112 | 220   | 8     | 1   |
   +---------+-----------+-------+-------+-----+

             Table 5: BAD Computation

All the multiplications and additions are saturating, i.e. when
exceeding the range of the bandwidth type they are set to the
highest possible value of the type.
BAD is only computed for default routes. A node MAY compute and use
BAD for any disaggregated prefixes or other RIFT routes. A node MAY
use an algorithm other than BAD to weight northbound traffic based on
bandwidth, given that the algorithm is distributed and un-synchronized
and that, ultimately, its correct behavior does not depend on
uniformity of balancing algorithms used in the fabric. For example,
it is conceivable that leafs could use real-time link loads gathered
by analytics to change the amount of traffic assigned to each default
route next hop.
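A minimal sketch of such bandwidth-proportional weighting, with
hypothetical weights that could come from BAD or from analytics (the
node names and numbers are illustrative only):

```python
# Distribute northbound traffic across default route next hops
# proportionally to a per-next-hop bandwidth weight.
from typing import Dict

def traffic_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Return the fraction of traffic assigned to each next hop."""
    total = sum(weights.values())
    if total == 0:  # no usable bandwidth reported; fall back to even split
        return {nh: 1.0 / len(weights) for nh in weights}
    return {nh: w / total for nh, w in weights.items()}

# Example: Spine 111 lost an uplink, so its weight dropped and
# Spine 112 now attracts three quarters of the northbound traffic.
shares = traffic_shares({"Spine111": 100, "Spine112": 300})
```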
Observe further that a change in available bandwidth will only affect
at maximum two levels down in the fabric, i.e. the blast radius of
bandwidth changes is contained no matter the fabric's height.
5.3.6.2. Southbound Direction
Due to its loop-free properties a node CAN take into account during
S-SPF the available bandwidth on the nodes in lower levels and modify
the amount of traffic offered to the next level's "southbound" nodes
based on what it sees as the total achievable maximum flow through
those nodes. It is worth observing that such computations may work
better if standardized, but do not have to be. As long as the packet
keeps on heading south it will take one of the available paths and
arrive at the intended destination.
5.3.7. Label Binding
A node MAY advertise in its TIEs a locally significant, downstream
assigned label for the according interface. One use of such a label
is a hop-by-hop encapsulation allowing to easily distinguish
forwarding planes served by a multiplicity of RIFT instances.
5.3.8. Segment Routing Support with RIFT
Recently, an alternative architecture reusing labels as segment
identifiers [RFC8402] has gained traction and may present use cases
in IP fabrics that would justify its deployment. Such use cases will
either precondition an assignment of a label per node (or other
entities where the mechanisms are equivalent) or a global assignment
and a knowledge of the topology everywhere to compute segment stacks
of interest. We deal with the two issues separately.
5.3.8.1. Global Segment Identifiers Assignment
Global segment identifiers are normally assumed to be provided by
some kind of a centralized "controller" instance and distributed to
other entities. This can be performed in RIFT by attaching a
controller to the Top-of-Fabric nodes at the top of the fabric, where
the whole topology is always visible, assigning such identifiers and
then distributing those via the KV mechanism towards all nodes so
they can perform things like probing the fabric for failures using a
stack
never form southbound adjacencies.

This will allow the E-W leaf nodes to exchange traffic strictly for
the prefixes advertised in each other's north prefix TIEs (since the
southbound computation will find the reverse direction in the other
node's TIE and install its north prefixes).
5.3.10. Address Family and Multi Topology Considerations
Multi-Topology (MT) [RFC5120] and Multi-Instance (MI) [RFC8202] are
used today in link-state routing protocols to support several domains
on the same physical topology. RIFT supports this capability by
carrying transport ports in the LIE protocol exchanges. Multiplexing
of LIEs can be achieved by either choosing varying multicast
addresses or ports on the same address.
BFD interactions in Section 5.3.5 are implementation dependent when
multiple RIFT instances run on the same link.
5.3.11. Reachability of Internal Nodes in the Fabric
RIFT does not precondition that its nodes have reachable addresses,
albeit for operational purposes this is clearly desirable. Under
normal operating conditions this can be easily achieved by e.g.
injecting the node's loopback address into North and South Prefix
TIEs or other implementation specific mechanisms.
Things get more interesting in case a node loses all its northbound
adjacencies but is not at the top of the fabric. That is outside the
scope of this document and may be covered in a separate document
about policy guided prefixes [PGP reference].
5.3.12. One-Hop Healing of Levels with East-West Links
Based on the rules defined in Section 5.2.4, Section 5.2.3.8 and
given the presence of E-W links, RIFT can provide one-hop protection
of nodes that lost all their northbound links or in other complex
link set failure scenarios, except at Top-of-Fabric where the links
are used exclusively to flood topology information in multi-plane
designs. Section 6.4 explains the resulting behavior based on one
such example.
5.4. Security
5.4.1. Security Model
An inherent property of any security and ZTP architecture is the
resulting trade-off in regard to integrity verification of the
information distributed through the fabric vs. necessary provisioning
and auto-configuration. At a minimum, in all approaches, the
security of an established adjacency can be ensured. The stricter
the security model the more provisioning must take over the role of
ZTP.
The most security conscious operators will want to have full control
over which port on which router/switch is connected to the respective
port on the "other side", which we will call the "port-association
model" (PAM) achievable e.g. by pairwise-key PKI. In secure data
center locations, operators may want to control which router/switch
is connected to which other router/switch only or choose a "node-
association model" (NAM) which allows, for example, simplified port
sparing. In an even more relaxed environment, an operator may only
be concerned that the router/switches share credentials ensuring that
they belong to this particular data center network hence allowing the
flexible sparing of whole routers/switches. We will define that case
as the "fabric-association model" (FAM), equivalent to using a shared
secret for the whole fabric. Such flexibility may make sense for
leaf nodes such as servers where the addition and swapping of servers
is more frequent than the rest of the data center network.
Generally, leafs of the fabric tend to be less trusted than switches.
The different models could be mixed throughout the fabric if the
benefits outweigh the cost of increased complexity in provisioning.
In each of the above cases, some configuration mechanism is needed to
allow the operator to specify which connections are allowed, and some
mechanism is needed to:
a. specify the according level in the fabric,
b. discover and report missing connections,
c. discover and report unexpected connections, and prevent such
adjacencies from forming.
On the more relaxed configuration side of the spectrum, operators
might only configure the level of each switch, but don't explicitly
configure which connections are allowed. In this case, RIFT will
only allow adjacencies to come up between nodes that are in adjacent
levels. The operators with lowest security requirements may not use
any configuration to specify which connections are allowed. Such
fabrics could rely fully on ZTP for each router/switch to discover
its level and would only allow adjacencies between adjacent levels to
come up. Figure 30 illustrates the tradeoffs inherent in the
different security models.
Ultimately, some level of verification of the link quality may be
required before an adjacency is allowed to be used for forwarding.
For example, an implementation may require that a BFD session comes
up before advertising the adjacency.
For the above outlined cases, RIFT has two approaches to enforce that
a local port is connected to the correct port on the correct remote
router/switch. One approach is to piggy-back on RIFT's
authentication mechanism. Assuming the provisioning model (e.g. the
YANG model) is flexible enough, operators can choose to provision a
unique authentication key for:
a. each pair of ports in "port-association model" or
b. each pair of switches in "node-association model" or
c. each pair of levels or
d. the entire fabric in "fabric-association model".
The other approach is to rely on the system-id, port-id and level
fields in the LIE message to validate an adjacency against the
configured expected cabling topology, and optionally introduce some
new rules in the FSM to allow the adjacency to come up if the
expectations are met.
^ /\ |
/|\ / \ |
| / \ |
| / PAM \ |
Increasing / \ Increasing
Integrity +----------+ Flexibility
& / NAM \ &
Increasing +--------------+ Less
Provisioning / FAM \ Configuration
| +------------------+ |
| / Level Provisioning \ |
| +----------------------+ \|/
| / Zero Configuration \ v
+--------------------------+
Figure 30: Security Model
5.4.2. Security Mechanisms
RIFT security goals are to ensure authentication, message integrity
and prevention of replay attacks. Low processing overhead and
efficient messaging are also goals. Message privacy achieved through
full encryption is a non-goal.
The model in the previous section allows a range of security key
types that are analogous to the various security association models.
PAM and NAM allow security associations at the port or node level
using symmetric or asymmetric keys that are pre-installed. FAM
argues for security associations to be applied only at a group level
or to be refined once the topology has been established. RIFT does
not specify how security keys are installed or updated; it specifies
how the keys can be used to achieve the security goals.
The protocol has provisions for nonces to prevent replay attacks and
includes authentication mechanisms comparable to [RFC5709] and
[RFC7987].
5.4.3. Security Envelope
RIFT MUST be carried in a mandatory secure envelope illustrated in
Figure 31. Local configuration can allow skipping the checking of
the envelope's integrity.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
UDP Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port | RIFT destination port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| UDP Length | UDP Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RIFT MAGIC | Packet Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Outer Security Envelope Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Reserved | RIFT Major | Outer Key ID | Fingerprint |
| | Version | | Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
~ Security Fingerprint covers all following content ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Nonce Local | Nonce Remote |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Remaining TIE Lifetime (all 1s in case of LIE) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
TIE Origin Security Envelope Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Inner Key ID | Fingerprint |
| | Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
~ Security Fingerprint covers all following content ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Serialized RIFT Model Object
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
~ Serialized RIFT Model Object ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 31: Security Envelope
RIFT MAGIC: 16 bits. Constant value of 0xA1F7 that allows
classifying RIFT packets independently of the UDP port used.
Packet Number: 16 bits. An optional, per packet type monotonically
growing number rolling over using sequence number arithmetic
defined in Appendix A. A node SHOULD correctly set the number on
subsequent packets or otherwise MUST set the value to
`undefined_packet_number` as provided in the schema. This number
can be used to detect losses and misordering in flooding for
either operational purposes or in an implementation to adjust
flooding behavior to current link or buffer quality. This number
MUST NOT be used to discard or validate the correctness of
packets.
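A hedged sketch of comparing 16-bit packet numbers under wrap-around,
in the spirit of sequence number arithmetic (the normative rules are
in Appendix A; this comparison rule is an assumption for
illustration):

```python
# Compare 16-bit packet numbers with wrap-around, serial-number style:
# `a` is considered newer than `b` if it is less than half the number
# space ahead of `b` modulo 2^16.
HALF_SPACE = 2**15

def seq_newer(a: int, b: int) -> bool:
    """True if 16-bit packet number `a` is newer than `b`."""
    return a != b and ((a - b) % 2**16) < HALF_SPACE

print(seq_newer(5, 65534))  # True: the counter wrapped, 5 is newer
```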
RIFT Major Version: 8 bits. It allows checking whether protocol
versions are compatible, i.e. whether the serialized object can be
decoded at all. An implementation MUST drop packets with an
unexpected value and MAY report a problem. It must be the same as
in the encoded model object, otherwise the packet is dropped.
Inner Key ID: 8 bits to allow key rollovers. This implies key type
and used algorithm. Value 0 means that no valid fingerprint was
computed. This key ID scope is local to the nodes on both ends of
the adjacency.
Outer Key ID: 24 bits. This implies key type and used algorithm.
Value 0 means that no valid fingerprint was computed. This key ID
scope is global to the RIFT instance since it implies the
originator of the TIE so the contained object does not have to be
de-serialized to obtain it.
Length of Fingerprint: 8 bits. Length in 32-bit multiples of the
following fingerprint, not including lifetime or nonces. It allows
navigating the structure when an unknown key type is present. To
clarify a common corner case: a fingerprint with a length of 0 bits
presents this field with a value of 0.
Security Fingerprint: 32 bits * Length of Fingerprint. This is a
signature that is computed over all data following after it. If
the fingerprint is shorter than the field, the significant bits
are left aligned and the remaining bits are set to 0. When using a
PKI, the node originating the Security Fingerprint uses its
private key to create the signature. The original packet can then
be verified provided the public key is shared and current.
Remaining TIE Lifetime: 32 bits. In case of anything but TIEs this
field MUST be set to all ones and Origin Security Envelope Header
MUST NOT be present in the packet. For TIEs this field represents
the remaining lifetime of the TIE and Origin Security Envelope
Header MUST be present in the packet. The value in the serialized
model object MUST be ignored.
Nonce Local: 16 bits. Local Nonce of the adjacency as advertised
in LIEs. In case of LIE packet this MUST correspond to the value
in the serialized object otherwise the packet MUST be discarded.
Nonce Remote: 16 bits. Remote Nonce of the adjacency as received
in LIEs. In case of LIE packet this MUST correspond to the value
in the serialized object otherwise the packet MUST be discarded.
TIE Origin Security Envelope Header: It MUST be present if the
Remaining TIE Lifetime field is NOT all ones. It carries the
originator's key ID and the according fingerprint of the object to
protect the TIE from modification during flooding. This ensures
origin validation and integrity (but does not provide validation
of a chain of trust).
Observe that due to the schema migration rules per Appendix B the
contained model can always be decoded if the major version matches
and the envelope integrity has been validated. Consequently, a
description of the TIE is available to flood it properly, including
unknown TIE types.
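A non-normative parsing sketch of the fields following the UDP header
in Figure 31. The per-byte widths in the second 32-bit word follow
the figure's row layout and are an assumption for illustration (the
field descriptions give the Outer Key ID a wider range); big-endian
wire order is also assumed:

```python
# Parse the outer security envelope fields shown in Figure 31.
import struct

RIFT_MAGIC = 0xA1F7

def parse_outer_envelope(data: bytes) -> dict:
    magic, packet_number = struct.unpack_from("!HH", data, 0)
    if magic != RIFT_MAGIC:
        raise ValueError("not a RIFT packet")
    # Reserved | Major Version | Outer Key ID | Fingerprint Length,
    # read here as four bytes per the figure row (illustrative widths).
    _reserved, major, outer_key_id, fp_len = struct.unpack_from("!BBBB", data, 4)
    off = 8
    fingerprint = data[off:off + 4 * fp_len]  # length is in 32-bit words
    off += 4 * fp_len
    nonce_local, nonce_remote = struct.unpack_from("!HH", data, off)
    (lifetime,) = struct.unpack_from("!I", data, off + 4)
    return {
        "packet_number": packet_number,
        "major": major,
        "outer_key_id": outer_key_id,
        "fingerprint": fingerprint,
        "nonce_local": nonce_local,
        "nonce_remote": nonce_remote,
        "remaining_lifetime": lifetime,
        "is_lie": lifetime == 0xFFFFFFFF,  # all 1s for anything but TIEs
    }
```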
5.4.4. Nonces
The protocol uses 16-bit nonces to salt generated signatures as a
means of replay attack prevention. Any implementation including RIFT
security MUST generate and wrap around local nonces properly. All
implementations MUST reflect the neighbor's nonces. An
implementation SHOULD increment a chosen nonce on every LIE FSM
transition that ends up in a different state from the previous one
and MUST increment its nonce at least every 5 minutes (such
considerations allow for efficient implementations without opening a
significant security risk). When flooding TIEs, the implementation
MUST use recent (i.e. within the allowed difference) nonces
reflected in the LIE exchange. The schema specifies the maximum
allowable nonce value difference on a packet compared to the
reflected nonces in the LIEs. Any packet received with nonces
deviating more than the allowed delta MUST be discarded without
further computation of signatures to prevent computation load
attacks.
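A minimal sketch of the "recent nonce" check with 16-bit wrap-around;
the maximum allowed delta comes from the schema, so the bound used
here is a placeholder:

```python
# Check that a received nonce is within the allowed delta of the
# current nonce, accounting for 16-bit wrap-around.
MAX_NONCE_DELTA = 5  # hypothetical bound; the real value is in the schema

def nonce_delta(received: int, current: int) -> int:
    """Distance from `received` up to `current`, modulo 2^16."""
    return (current - received) % 2**16

def nonce_acceptable(received: int, current: int) -> bool:
    return nonce_delta(received, current) <= MAX_NONCE_DELTA

print(nonce_acceptable(65535, 2))  # True: wrapped, delta is 3
print(nonce_acceptable(100, 2))    # False under the assumed bound
```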
In the case where a secure implementation does not receive signatures
or receives undefined nonces from a neighbor, indicating that it does
not support or verify signatures, it is a matter of local policy how
such packets are treated. Any secure implementation MUST discard
packets where its local nonce is not correctly mirrored but it may
choose to either refuse forming an adjacency with an implementation
not advertising signatures or valid nonces or simply keep on signing
local packets while accepting the neighbor's packets without further
verification beside checking for proper nonce reflection.
5.4.5. Lifetime
Protecting lifetime on flooding can lead to an excessive number of
security fingerprint computations and hence an application generating
such fingerprints on TIEs SHOULD round the value down to the next
`rounddown_lifetime_interval` defined in the schema when sending
TIEs.
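The rounding can be sketched as below; the interval name comes from
the schema but its value here is a placeholder:

```python
# Round a remaining TIE lifetime down to the next interval boundary
# before computing the security fingerprint, so that small lifetime
# decrements do not force a fingerprint recomputation.
ROUNDDOWN_LIFETIME_INTERVAL = 60  # hypothetical value; see the schema

def rounded_lifetime(remaining: int) -> int:
    return (remaining // ROUNDDOWN_LIFETIME_INTERVAL) * ROUNDDOWN_LIFETIME_INTERVAL

print(rounded_lifetime(3599))  # -> 3540 with the placeholder interval
```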
5.4.6. Key Management
As outlined in the Security Model a private shared key or a public/
private key pair is used to authenticate the adjacency. The actual
method of key distribution and key synchronization is assumed to be
out of band from RIFT's perspective. Both nodes in the adjacency
must share the same keys and configuration of key type and algorithm
for a key ID. Mismatched keys will obviously not inter-operate due
to the unverifiable security envelope.
Key roll-over while the adjacency is active is allowed and the
technique is well known and described in e.g. [RFC6518]. Key
distribution procedures are out of scope for RIFT.
5.4.7. Security Association Changes
There is no mechanism to convert a security envelope for the same key
ID from one algorithm to another once the envelope is operational.
The recommended procedure to change to a new algorithm is to take the
adjacency down, make the changes and then bring the adjacency up. If
an implementation supports disabling the security envelope checks
while still sending a security envelope, it could suspend the
security envelope procedures while maintaining an adjacency, make
changes to the algorithms on both sides and then re-enable the
security envelope procedures, but that introduces security holes
during the disabled period.
6. Examples

6.1. Normal Operation
This section describes RIFT deployment in the example topology
without any node or link failures. We disregard flooding reduction
for simplicity's sake.

As a first step, the following bi-directional adjacencies will be
created (and any other links that do not fulfill LIE rules in
. | | | X Failure
. | +-------------+ | X
. | | | |
.+-+---+-+ +--+--+-+
.| | | |
.|Leaf111| |Leaf112|
.+-------+ +-------+
. + +
. Prefix111 Prefix112

Figure 32: Single Leaf link failure
In case of a failing leaf link between spine 112 and leaf 112 the
link-state information will cause re-computation of the necessary SPF
and the higher levels will stop forwarding towards prefix 112 through
spine 112. Only spines 111 and 112, as well as both spines will see
control traffic. Leaf 111 will receive a new S-TIE from spine 112
and reflect back to spine 111. Spine 111 will de-aggregate prefix
111 and prefix 112 but we will not describe it further here since de-
aggregation is emphasized in the next example. It is worth observing
however in this example that if leaf 111 would keep on forwarding
. | +-------------+ | | | +--------------+ | |
. | | | | | | | |
.+-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+
.| | | | | | | |
.|Leaf111| |Leaf112| |Leaf121| |Leaf122|
.+-+-----+ ++------+ +-----+-+ +-+-----+
. + + + +
. Prefix111 Prefix112 Prefix121 Prefix122
. 1.1/16

Figure 33: Fabric partition
Figure 33 shows an arguably more catastrophic but also a more
interesting case. Spine 21 is completely severed from access to
Prefix 121 (we use 1.1/16 in the figure as example) by a double link
failure. However unlikely, if left unresolved, forwarding from leaf
111 and leaf 112 to prefix 121 would suffer 50% black-holing based on
pure default route advertisements by Top-of-Fabric 21 and ToF 22.
The mechanism used to resolve this scenario is hinging on the
distribution of southbound representation by Top-of-Fabric 21 that is
reflected by spine 111 and spine 112 to ToF 22. Spine 22, having
computed reachability to all prefixes in the network, advertises with
. | | | | | | | | |
. | +----------------+ | +-----------------+ |
. | | | | | | | | |
. | | +------------------------------------+ | |
. | | | | | | | | |
.++-+-+--+ | +---+---+ | +-+---+-++
.| | +-+ +-+ | |
.| L01 | | L02 | | L03 | Level 0
.+-------+ +-------+ +--------+

Figure 34: North Partitioned Router
Figure 34 shows a part of a fabric where level 1 is horizontally
connected and A01 lost its only northbound adjacency. Based on N-SPF
rules in Section 5.2.4.1 A01 will compute northbound reachability by
using the link A01 to A02 (whereas A02 will NOT use this link during
N-SPF). Hence A01 will still advertise the default towards level 0
and route unidirectionally using the horizontal link.

As a further consideration, the moment A02 loses link N2 the
situation evolves again. A01 will have no more northbound
reachability while still seeing A03 advertising northbound
adjacencies in its south node TIE. With that it will stop
advertising a default route due to Section 5.2.3.8.
6.5. Multi-Plane Fabric and Negative Disaggregation

TODO

7. Implementation and Operation: Further Details

7.1. Considerations for Leaf-Only Implementation
RIFT can and is intended to be stretched to the lowest level in the
IP fabric to integrate ToRs or even servers. Since those entities
would run as leafs only, it is worth observing that a leaf-only
version is significantly simpler to implement and requires far fewer
resources:

1. Under normal conditions, the leaf needs to support a multipath
   default route only. In the most catastrophic partitioning case
   it has to be capable of accommodating all the leaf routes in its
   own PoD to prevent black-holing.

2. Leaf nodes hold only their own N-TIEs and S-TIEs of Level 1 nodes
   they are connected to; so overall few in numbers.

3. A leaf node does not have to support any type of de-aggregation
   computation or propagation.

4. Leaf nodes do not have to support the overload bit normally.

5. Unless optional leaf-2-leaf procedures are desired, default route
   origination and S-TIE origination are unnecessary.
7.2. Considerations for Spine Implementation

In case of spines, i.e. nodes that will never act as Top of Fabric,
a full implementation is not required; specifically, the node does
not need to perform any computation of negative disaggregation except
respecting northbound disaggregation advertised from the north.
7.3. Adaptations to Other Proposed Data Center Topologies
. +-----+ +-----+
. | | | |
.+-+ S0 | | S1 |
.| ++---++ ++---++
.| | | | |
.| | +------------+ |
.| | | +------------+ |
.| | | | |
.| ++-+--+ +--+-++
.| | | | |
.| +-+--++ ++---++
.| | | | |
.| | +------------+ |
.| | +-----------+ | |
.| | | | |
.| +-+-+-+ +--+-++
.+-+ | | |
. | L0 | | L1 |
. +-----+ +-----+

Figure 35: Level Shortcut
Strictly speaking, RIFT is not limited to Clos variations only. The
protocol preconditions only a sense of 'compass rose direction'
achieved by configuration (or derivation) of levels, and other
topologies are possible within this framework. So, conceptually, one
could include leaf-to-leaf links and even shortcuts between levels,
but certain requirements in Section 4 will not be met anymore. As an
example, shortcutting levels as illustrated in Figure 35 will lead
either to suboptimal routing when L0 sends traffic to L1 (since using
S0's default route will lead to the traffic being sent back to A0 or
A1) or the leafs need each other's routes installed to understand
that only A0 and A1 should be used to talk to each other.
Whether such modifications of topology constraints make sense is
dependent on many technology variables and the exhaustive treatment
of the topic is definitely outside the scope of this document.
7.4. Originating Non-Default Route Southbound
Obviously, an implementation may choose to originate southbound a
shorter prefix P' instead of a strict default route (as described in
Section 5.2.3.8), but in such a scenario all addresses carried within
the RIFT domain must be contained within P'.
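The containment requirement can be checked mechanically. A minimal sketch in Python using the standard `ipaddress` module (the function name and inputs are illustrative assumptions, not part of the specification):

```python
import ipaddress

def all_contained(southbound_prefix: str, domain_prefixes: list[str]) -> bool:
    """Check the stated requirement: every address carried within the
    RIFT domain must be contained within the originated prefix P'."""
    p_prime = ipaddress.ip_network(southbound_prefix)
    return all(ipaddress.ip_network(p).subnet_of(p_prime)
               for p in domain_prefixes)

# P' = 10.0.0.0/8 contains both fabric prefixes below
print(all_contained("10.0.0.0/8", ["10.1.0.0/16", "10.2.128.0/24"]))  # True
# a prefix outside P' violates the requirement
print(all_contained("10.0.0.0/8", ["10.1.0.0/16", "192.168.0.0/24"]))  # False
```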
8. Security Considerations

8.1. General
One can consider attack vectors where a router may reboot many times
while changing its system ID and pollute the network with many stale
TIEs, or where TIEs are sent with very long lifetimes and not cleaned
up when the routes vanish. Those attack vectors are not unique to
RIFT. Given the large memory footprints available today, those
attacks should be relatively benign. Otherwise a node SHOULD
implement a strategy of discarding the contents of all TIEs that were
not present in the SPF tree over a certain, configurable period of
time. Since the protocol, like all modern link-state protocols, is
self-stabilizing and will advertise the presence of such TIEs to its
neighbors, they can be re-requested if a computation finds that it
sees an adjacency formed towards the system ID of the discarded TIEs.
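The discard strategy above can be sketched as follows; the class and field names are illustrative assumptions, not specification elements:

```python
import time

class TIEStore:
    """Illustrative store implementing the SHOULD-level defense above:
    discard contents of TIEs whose originator has been absent from the
    SPF tree for a configurable period. TIE presence stays advertised,
    so neighbors can re-request the contents if an adjacency towards
    the originator's system ID appears again."""

    def __init__(self, unused_holdtime_secs: float = 3600.0):
        self.unused_holdtime_secs = unused_holdtime_secs
        self.tie_contents = {}      # system ID -> TIE contents
        self.last_seen_in_spf = {}  # system ID -> monotonic timestamp

    def record_spf_result(self, system_ids_in_tree):
        """Remember which originators were part of the last SPF tree."""
        now = time.monotonic()
        for sid in system_ids_in_tree:
            self.last_seen_in_spf[sid] = now

    def purge_stale(self):
        """Drop contents of TIEs not seen in SPF for too long."""
        now = time.monotonic()
        stale = [sid for sid in self.tie_contents
                 if now - self.last_seen_in_spf.get(sid, float("-inf"))
                 > self.unused_holdtime_secs]
        for sid in stale:
            del self.tie_contents[sid]
        return stale
```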
8.2. ZTP
Section 5.2.7 presents many attack vectors in untrusted environments,
starting with nodes that oscillate their level offers to the
possibility of a node offering a three-way adjacency with the highest
possible level value and a very long holdtime, trying to put itself
"on top of the lattice" and with that gaining access to the whole
southbound topology. Session authentication mechanisms are necessary
in environments where this is possible and RIFT provides the
corresponding security envelope to ensure this if desired.
8.3. Lifetime
Traditional IGP protocols are vulnerable to lifetime modification and
replay attacks that can be somewhat mitigated by using techniques
like [RFC7987]. RIFT removes this attack vector by protecting the
lifetime behind a signature computed over it and an additional nonce
combination, which makes even the replay attack window very small
and, for practical purposes, irrelevant since the lifetime cannot be
artificially shortened by the attacker.
8.4. Packet Number
The optional packet number is carried in the security envelope
without any encryption protection and is hence vulnerable to replay
and modification attacks. Contrary to nonces, this number must change
on every packet and protecting it would present a very high
cryptographic load. The attack vector is relatively weak, however,
since changing the packet number in flight will only affect
operational validation tools and possibly some performance
optimizations on flooding. It is expected that an implementation
detecting too many fake losses or misorderings due to attacks on the
number would simply suppress its further processing.
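How an implementation might consume the optional packet number can be sketched as follows; the rolling 16-bit comparison mirrors the serial arithmetic of Appendix A, and all names here are illustrative assumptions:

```python
def serial_gt(a: int, b: int, bits: int = 16) -> bool:
    """True if a is 'after' b under rolling comparison (cf. Appendix A)."""
    mask = (1 << bits) - 1
    d = (a - b) & mask
    return 0 < d < (1 << (bits - 1))

class PacketNumberTracker:
    """Per packet type, per adjacency tracker for the optional packet
    number. Counts apparent losses and misorderings; an implementation
    seeing implausibly many (e.g. an attacker rewriting the number in
    flight) would simply stop processing the field."""

    def __init__(self):
        self.last = None
        self.apparent_losses = 0
        self.misordered = 0

    def observe(self, pn: int):
        if self.last is None:
            self.last = pn
        elif serial_gt(pn, self.last):
            # a gap larger than one looks like lost packets
            self.apparent_losses += ((pn - self.last) & 0xFFFF) - 1
            self.last = pn
        else:
            self.misordered += 1

t = PacketNumberTracker()
for pn in (1, 2, 5, 4, 6):
    t.observe(pn)
print(t.apparent_losses, t.misordered)  # 2 1
```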
8.5. Outer Fingerprint Attacks
A node can try to inject LIE packets by observing a conversation on
the wire and reusing the outer key ID, although it cannot generate
valid hashes for a modified message, so the only possible attack is a
DoS due to excessive LIE validation.
A node can try to replay previously recorded LIEs with changed state,
but the attack is hard to replicate since the nonce combination must
match the ongoing exchange, and it is then limited to a single flap
only since both nodes will advance their nonces if the adjacency
state changed. Even in the most unlikely case, the attack length is
limited due to both sides periodically increasing their nonces.
8.6. Inner Fingerprint DoS Attacks
A compromised node can attempt to generate "fake TIEs" using other
nodes' outer key identifiers. Although the ultimate validation of the
inner fingerprint will fail in such scenarios and the TIEs will not
progress further than the immediately peering nodes, the resulting
denial of service attack seems unavoidable since the outer key ID is
only protected by the node that is here assumed to be compromised.
9. IANA Considerations
This specification will request at an opportune time multiple
registry points to exchange protocol packets in a standardized way,
amongst them multicast address assignments and standard port numbers.
The schema itself defines many values and codepoints which can be
considered registries themselves.
10. Acknowledgments
[...]
the light failed to shine. Kris Price was first to mention single
router, single arm default considerations. Jeff Tantsura helped out
with some initial thoughts on BFD interactions while Jeff Haas
corrected several misconceptions about BFD's finer points. Artur
Makutunowicz pointed out many possible improvements and acted as a
sounding board in regard to modern protocol implementation techniques
RIFT is exploring. Barak Gafni was the first to clearly formalize the
problem of partitioned spine and fallen leafs on a (clean) napkin in
Singapore, which led to the very important part of the specification
centered around multiple Top-of-Fabric planes and negative
disaggregation. Igor Gashinsky and others shared many thoughts on
problems encountered in design and operation of large-scale data
center fabrics.
11. References

11.1. Normative References
[ISO10589]
          ISO "International Organization for Standardization",
          "Intermediate system to Intermediate system intra-domain
          routeing information exchange protocol for use in
          conjunction with the protocol for providing the
          connectionless-mode Network Service (ISO 8473), ISO/IEC
          10589:2002, Second Edition.", Nov 2002.
[RFC1982] Elz, R. and R. Bush, "Serial Number Arithmetic", RFC 1982,
DOI 10.17487/RFC1982, August 1996,
<https://www.rfc-editor.org/info/rfc1982>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
          Requirement Levels", BCP 14, RFC 2119,
          DOI 10.17487/RFC2119, March 1997,
          <https://www.rfc-editor.org/info/rfc2119>.

[RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328,
          DOI 10.17487/RFC2328, April 1998,
          <https://www.rfc-editor.org/info/rfc2328>.

[RFC2365] Meyer, D., "Administratively Scoped IP Multicast", BCP 23,
          RFC 2365, DOI 10.17487/RFC2365, July 1998,
          <https://www.rfc-editor.org/info/rfc2365>.
[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
          Border Gateway Protocol 4 (BGP-4)", RFC 4271,
          DOI 10.17487/RFC4271, January 2006,
          <https://www.rfc-editor.org/info/rfc4271>.

[RFC4291] Hinden, R. and S. Deering, "IP Version 6 Addressing
          Architecture", RFC 4291, DOI 10.17487/RFC4291, February
          2006, <https://www.rfc-editor.org/info/rfc4291>.
[RFC5082] Gill, V., Heasley, J., Meyer, D., Savola, P., Ed., and C.
          Pignataro, "The Generalized TTL Security Mechanism
          (GTSM)", RFC 5082, DOI 10.17487/RFC5082, October 2007,
          <https://www.rfc-editor.org/info/rfc5082>.

[RFC5120] Przygienda, T., Shen, N., and N. Sheth, "M-ISIS: Multi
          Topology (MT) Routing in Intermediate System to
          Intermediate Systems (IS-ISs)", RFC 5120,
          DOI 10.17487/RFC5120, February 2008,
          <https://www.rfc-editor.org/info/rfc5120>.
[...]
[RFC5881] Katz, D. and D. Ward, "Bidirectional Forwarding Detection
          (BFD) for IPv4 and IPv6 (Single Hop)", RFC 5881,
          DOI 10.17487/RFC5881, June 2010,
          <https://www.rfc-editor.org/info/rfc5881>.

[RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
          "Network Time Protocol Version 4: Protocol and Algorithms
          Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010,
          <https://www.rfc-editor.org/info/rfc5905>.
[RFC6234] Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms
(SHA and SHA-based HMAC and HKDF)", RFC 6234,
DOI 10.17487/RFC6234, May 2011,
<https://www.rfc-editor.org/info/rfc6234>.
[RFC7752] Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and
          S. Ray, "North-Bound Distribution of Link-State and
          Traffic Engineering (TE) Information Using BGP", RFC 7752,
          DOI 10.17487/RFC7752, March 2016,
          <https://www.rfc-editor.org/info/rfc7752>.
[RFC7987] Ginsberg, L., Wells, P., Decraene, B., Przygienda, T., and
          H. Gredler, "IS-IS Minimum Remaining Lifetime", RFC 7987,
          DOI 10.17487/RFC7987, October 2016,
          <https://www.rfc-editor.org/info/rfc7987>.

[RFC8200] Deering, S. and R. Hinden, "Internet Protocol, Version 6
          (IPv6) Specification", STD 86, RFC 8200,
          DOI 10.17487/RFC8200, July 2017,
          <https://www.rfc-editor.org/info/rfc8200>.
[RFC8202] Ginsberg, L., Previdi, S., and W. Henderickx, "IS-IS
Multi-Instance", RFC 8202, DOI 10.17487/RFC8202, June
2017, <https://www.rfc-editor.org/info/rfc8202>.
[RFC8402] Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L.,
Decraene, B., Litkowski, S., and R. Shakir, "Segment
Routing Architecture", RFC 8402, DOI 10.17487/RFC8402,
July 2018, <https://www.rfc-editor.org/info/rfc8402>.
[RFC8505] Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C.
Perkins, "Registration Extensions for IPv6 over Low-Power
Wireless Personal Area Network (6LoWPAN) Neighbor
Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018,
<https://www.rfc-editor.org/info/rfc8505>.
11.2. Informative References
[CLOS]    Yuan, X., "On Nonblocking Folded-Clos Networks in Computer
          Communication Environments", IEEE International Parallel &
          Distributed Processing Symposium, 2011.

[DIJKSTRA]
          Dijkstra, E., "A Note on Two Problems in Connexion with
          Graphs", Numer. Math., 1959.
[...]
          Eppstein, D., "Finding the k-Shortest Paths", 1997.

[EUI64]   IEEE, "Guidelines for Use of Extended Unique Identifier
          (EUI), Organizationally Unique Identifier (OUI), and
          Company ID (CID)", IEEE EUI,
          <http://standards.ieee.org/develop/regauth/tut/eui.pdf>.

[FATTREE] Leiserson, C., "Fat-Trees: Universal Networks for
          Hardware-Efficient Supercomputing", 1985.
[IEEEstd1588]
          IEEE, "IEEE Standard for a Precision Clock Synchronization
          Protocol for Networked Measurement and Control Systems",
          IEEE Standard 1588,
          <https://ieeexplore.ieee.org/document/4579760/>.

[IEEEstd8021AS]
          IEEE, "IEEE Standard for Local and Metropolitan Area
          Networks - Timing and Synchronization for Time-Sensitive
          Applications in Bridged Local Area Networks",
[...]
[RFC0826] Plummer, D., "An Ethernet Address Resolution Protocol: Or
          Converting Network Protocol Addresses to 48.bit Ethernet
          Address for Transmission on Ethernet Hardware", STD 37,
          RFC 826, DOI 10.17487/RFC0826, November 1982,
          <https://www.rfc-editor.org/info/rfc826>.

[RFC2131] Droms, R., "Dynamic Host Configuration Protocol",
          RFC 2131, DOI 10.17487/RFC2131, March 1997,
          <https://www.rfc-editor.org/info/rfc2131>.
[RFC3626] Clausen, T., Ed. and P. Jacquet, Ed., "Optimized Link
          State Routing Protocol (OLSR)", RFC 3626,
          DOI 10.17487/RFC3626, October 2003,
          <https://www.rfc-editor.org/info/rfc3626>.
[RFC4655] Farrel, A., Vasseur, J., and J. Ash, "A Path Computation
Element (PCE)-Based Architecture", RFC 4655,
DOI 10.17487/RFC4655, August 2006,
<https://www.rfc-editor.org/info/rfc4655>.
[RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
"Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
DOI 10.17487/RFC4861, September 2007, DOI 10.17487/RFC4861, September 2007,
<https://www.rfc-editor.org/info/rfc4861>. <https://www.rfc-editor.org/info/rfc4861>.
[RFC4862] Thomson, S., Narten, T., and T. Jinmei, "IPv6 Stateless [RFC4862] Thomson, S., Narten, T., and T. Jinmei, "IPv6 Stateless
Address Autoconfiguration", RFC 4862, Address Autoconfiguration", RFC 4862,
DOI 10.17487/RFC4862, September 2007, DOI 10.17487/RFC4862, September 2007,
<https://www.rfc-editor.org/info/rfc4862>. <https://www.rfc-editor.org/info/rfc4862>.
[RFC6518] Lebovitz, G. and M. Bhatia, "Keying and Authentication for
Routing Protocols (KARP) Design Guidelines", RFC 6518,
DOI 10.17487/RFC6518, February 2012,
<https://www.rfc-editor.org/info/rfc6518>.
[RFC7855] Previdi, S., Ed., Filsfils, C., Ed., Decraene, B.,
Litkowski, S., Horneffer, M., and R. Shakir, "Source
Packet Routing in Networking (SPRING) Problem Statement
and Requirements", RFC 7855, DOI 10.17487/RFC7855, May
2016, <https://www.rfc-editor.org/info/rfc7855>.
[RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
BGP for Routing in Large-Scale Data Centers", RFC 7938,
DOI 10.17487/RFC7938, August 2016,
<https://www.rfc-editor.org/info/rfc7938>.
[RFC8415] Mrugalski, T., Siodelski, M., Volz, B., Yourtchenko, A.,
Richardson, M., Jiang, S., Lemon, T., and T. Winters,
"Dynamic Host Configuration Protocol for IPv6 (DHCPv6)",
RFC 8415, DOI 10.17487/RFC8415, November 2018,
<https://www.rfc-editor.org/info/rfc8415>.
[VAHDAT08]
          Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable,
          Commodity Data Center Network Architecture", SIGCOMM,
          2008.
[Wikipedia]
          Wikipedia, "Serial Number Arithmetic",
          <https://en.wikipedia.org/wiki/Serial_number_arithmetic>,
          2016.
Appendix A. Sequence Number Binary Arithmetic
The only reasonable reference to a sequence number solution cleaner
than [RFC1982] is given in [Wikipedia]. It basically converts the
problem into two's complement arithmetic. Assuming straight two's
complement subtractions on the bit-width of the sequence number, the
corresponding >: and =: relations are defined as:

U_1, U_2 are 12-bit aligned unsigned version numbers

D_f is ( U_1 - U_2 ) interpreted as a two's complement signed 12-bit
value

D_b is ( U_2 - U_1 ) interpreted as a two's complement signed 12-bit
value

U_1 >: U_2 IIF D_f > 0 AND D_b < 0

U_1 =: U_2 IIF D_f = 0

The >: relationship is anti-symmetric but not transitive. Observe
that this leaves the case of the numbers having maximum two's
complement distance, e.g. ( 0 and 0x800 ), undefined in the 12-bit
case since D_f and D_b are both -0x800.
A simple example of the relationship in the case of 3-bit arithmetic
follows as a table indicating the D_f/D_b values and then the
relationship of U_1 to U_2:
U2 / U1 0 1 2 3 4 5 6 7
0 +/+ +/- +/- +/- -/- -/+ -/+ -/+
1 -/+ +/+ +/- +/- +/- -/- -/+ -/+
2 -/+ -/+ +/+ +/- +/- +/- -/- -/+
3 -/+ -/+ -/+ +/+ +/- +/- +/- -/-
4 -/- -/+ -/+ -/+ +/+ +/- +/- +/-
5 +/- -/- -/+ -/+ -/+ +/+ +/- +/-
6 +/- +/- -/- -/+ -/+ -/+ +/+ +/-
7 +/- +/- +/- -/- -/+ -/+ -/+ +/+
U2 / U1 0 1 2 3 4 5 6 7
0 = > > > ? < < <
1 < = > > > ? < <
2 < < = > > > ? <
3 < < < = > > > ?
4 ? < < < = > > >
5 > ? < < < = > >
6 > > ? < < < = >
7 > > > ? < < < =
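The relations above can be implemented directly. A small Python sketch (illustrative, not normative) that reproduces two rows of the 3-bit table:

```python
def to_signed(v: int, bits: int) -> int:
    """Interpret the low `bits` of v as a two's complement signed value."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v >= (1 << (bits - 1)) else v

def compare(u1: int, u2: int, bits: int = 12) -> str:
    """Return '>', '<', '=' or '?' per the D_f/D_b rules above."""
    d_f = to_signed(u1 - u2, bits)
    d_b = to_signed(u2 - u1, bits)
    if d_f == 0:
        return "="
    if d_f > 0 and d_b < 0:
        return ">"
    if d_f < 0 and d_b > 0:
        return "<"
    return "?"  # maximum two's complement distance, e.g. (0, 0x800)

# Reproduce the U_2 = 0 and U_2 = 5 rows of the 3-bit table above
for u2 in (0, 5):
    print(u2, " ".join(compare(u1, u2, bits=3) for u1 in range(8)))
# 0 = > > > ? < < <
# 5 > ? < < < = > >
```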
Appendix B. Information Elements Schema
This section introduces the schema for information elements.

On schema changes that
1. change field numbers or

2. add new *required* fields or

3. remove any fields or

4. change lists into sets, unions into structures or

5. change multiplicity of fields or

6. change the name of any field or type or

7. change datatypes of any field or

8. add, change or remove a default value of any *existing* field or

9. remove or change any defined constant or constant value or

10. change any enumeration type except extending `common.TIEType`
    (use of enumeration types is generally discouraged)

the major version of the schema MUST increase. All other changes MUST
increase the minor version within the same major version.
Observe, however, that introducing an optional field does not cause a
major version increase even if the fields inside the structure are
optional with defaults.
When using the security envelope, all Thrift-encoded objects MUST NOT
be de-serialized and re-serialized again when flooding TIEs since
differences in serializers may produce different security
fingerprints.

When operating without a security envelope, Thrift serializers/
deserializers MUST NOT discard optional, unknown fields but preserve
and serialize them again when re-flooding, whereas missing optional
fields MAY be replaced with corresponding default values if present.
All signed integers, as forced by Thrift support, must be cast for
internal purposes to equivalent unsigned values without discarding
the signedness bit. An implementation SHOULD try to avoid using the
signedness bit when generating values.
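The required cast can be sketched in Python (the helper name is an assumption for illustration):

```python
def thrift_to_unsigned(value: int, bits: int) -> int:
    """Reinterpret a Thrift signed integer as the unsigned wire value,
    keeping the signedness bit as an ordinary data bit."""
    return value & ((1 << bits) - 1)

# A SystemIDType (i64) with the MSB set deserializes as negative;
# internally it must be handled as the full unsigned 64-bit value.
print(hex(thrift_to_unsigned(-1, 64)))      # 0xffffffffffffffff
# A NonceType (i16) that rolled past 0x7fff
print(hex(thrift_to_unsigned(-32768, 16)))  # 0x8000
```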
The schema is normative.
B.1. common.thrift
/**
    Thrift file with common definitions for RIFT
*/

/** @note MUST be interpreted in implementation as unsigned 64 bits.
 *  The implementation SHOULD NOT use the MSB.
 */
typedef i64      SystemIDType
typedef i32      IPv4Address
/** this has to be of length long enough to accommodate prefix */
typedef binary   IPv6Address
/** @note MUST be interpreted in implementation as unsigned */
typedef i16      UDPPortType
/** @note MUST be interpreted in implementation as unsigned */
typedef i32      TIENrType
/** @note MUST be interpreted in implementation as unsigned */
typedef i32      MTUSizeType
/** @note MUST be interpreted in implementation as unsigned rolling over number */
typedef i16      SeqNrType
/** @note MUST be interpreted in implementation as unsigned */
typedef i32      LifeTimeInSecType
/** @note MUST be interpreted in implementation as unsigned */
typedef i8       LevelType
/** optional, recommended monotonically increasing number _per packet type per adjacency_
    that can be used to detect losses/misordering/restarts.
    This will be moved into envelope in the future.
    @note MUST be interpreted in implementation as unsigned rolling over number */
typedef i16      PacketNumberType
/** @note MUST be interpreted in implementation as unsigned */
typedef i32      PodType
/** @note MUST be interpreted in implementation as unsigned. This is carried in the
    security envelope and MUST fit into 8 bits. */
typedef i8       VersionType
/** @note MUST be interpreted in implementation as unsigned */
typedef i16      MinorVersionType
/** @note MUST be interpreted in implementation as unsigned */
typedef i32      MetricType
/** @note MUST be interpreted in implementation as unsigned and unstructured */
typedef i64      RouteTagType
/** @note MUST be interpreted in implementation as unstructured label value */
typedef i32      LabelType
/** @note MUST be interpreted in implementation as unsigned */
typedef i32      BandwithInMegaBitsType
typedef string   KeyIDType
/** node local, unique identification for a link (interface/tunnel
 *  etc. Basically anything RIFT runs on). This is kept
 *  at 32 bits so it aligns with BFD [RFC5880] discriminator size.
 */
typedef i32      LinkIDType
typedef string   KeyNameType
typedef i8       PrefixLenType
/** timestamp in seconds since the epoch */
typedef i64      TimestampInSecsType
/** security nonce.
 *  @note MUST be interpreted in implementation as rolling over unsigned value */
typedef i16      NonceType
/** LIE FSM holdtime type */
typedef i16      TimeIntervalInSecType
/** Transaction ID type for prefix mobility as specified by RFC6550, value
    MUST be interpreted in implementation as unsigned */
typedef i8       PrefixTransactionIDType
/** timestamp per IEEE 802.1AS, values MUST be interpreted in implementation as unsigned */
struct IEEE802_1ASTimeStampType {
    1: required i64    AS_sec;
    2: optional i32    AS_nsec;
}
/** Flags indicating nodes behavior in case of ZTP and support
    for special optimization procedures. It will force level to `leaf_level` or
    `top-of-fabric` level accordingly and enable according procedures
 */
enum HierarchyIndications {
    leaf_only                            = 0,
    leaf_only_and_leaf_2_leaf_procedures = 1,
    top_of_fabric                        = 2,
}
const PacketNumberType undefined_packet_number = 0
/** This MUST be used when node is configured as top of fabric in ZTP.
    This is kept reasonably low to allow for fast ZTP convergence on
    failures. */
const LevelType top_of_fabric_level = 24
/** default bandwidth on a link */
const BandwithInMegaBitsType default_bandwidth = 100
/** fixed leaf level when ZTP is not used */
const LevelType leaf_level = 0
const LevelType default_level = leaf_level
const PodType default_pod = 0
/** default ZTP FSM holddown time */
const TimeIntervalInSecType default_ztp_holdtime = 1
/** by default LIE levels are ZTP offers */
const bool default_not_a_ztp_offer = false
/** by default everyone is repeating flooding */
const bool default_you_are_flood_repeater = true
/** 0 is illegal for SystemID */
const SystemIDType IllegalSystemID = 0
/** empty set of nodes */
const set<SystemIDType> empty_set_of_nodeids = {}
/** default lifetime of TIE is one week */
const LifeTimeInSecType default_lifetime = 604800
/** default lifetime when TIEs are purged is 5 minutes */
const LifeTimeInSecType purge_lifetime = 300
/** round down interval when TIEs are sent with security hashes
    to prevent excessive computation. **/
const LifeTimeInSecType rounddown_lifetime_interval = 60
/** any `TieHeader` that has a smaller lifetime difference
    than this constant is equal (if other fields equal). This
    constant MUST be larger than `purge_lifetime` to avoid
    retransmissions */
const LifeTimeInSecType lifetime_diff2ignore = 400
/** default UDP port to run LIEs on */
const UDPPortType default_lie_udp_port = 911
/** default UDP port to receive TIEs on, that can be peer specific */
const UDPPortType default_tie_udp_flood_port = 912
/** default MTU link size to use */
const MTUSizeType default_mtu_size = 1400
/** default link being BFD capable */
const bool bfd_default = true
/** undefined nonce, equivalent to missing nonce */
const NonceType undefined_nonce = 0;
/** Maximum delta (negative or positive) that a mirrored nonce can
    deviate from local value to be considered valid. If nonces are
    changed every minute on both sides this opens statistically
    a 5 minutes window of identical LIEs **/
const i16 maximum_valid_nonce_delta = 5;
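The rolling 16-bit nonce and `maximum_valid_nonce_delta` above imply a small acceptance window around the local value. A minimal Python sketch of such a check follows; the function name and the exact wrap-around handling are illustrative assumptions, not mandated by the schema:

```python
MAXIMUM_VALID_NONCE_DELTA = 5  # maximum_valid_nonce_delta from the schema
UNDEFINED_NONCE = 0            # undefined_nonce, equivalent to a missing nonce

def nonce_delta_acceptable(local_nonce: int, reflected_nonce: int) -> bool:
    """Check whether a neighbor's reflected nonce is within the allowed
    window of the local nonce, treating the i16 value as an unsigned,
    rolling-over counter (a sketch; wrap handling is an assumption)."""
    if reflected_nonce == UNDEFINED_NONCE:
        return False
    # Smallest circular distance on a 16-bit unsigned ring.
    delta = (local_nonce - reflected_nonce) % 0x10000
    if delta > 0x8000:
        delta = 0x10000 - delta
    return delta <= MAXIMUM_VALID_NONCE_DELTA
```

Note that a nonce just before the rollover point remains valid against one just after it, which is the point of the "rolling over unsigned" interpretation of `NonceType`.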
/** indicates whether the direction is northbound/east-west
  * or southbound */
enum TieDirectionType {
    Illegal = 0,
    South = 1,
    North = 2,
    DirectionMaxValue = 3,
}
/** @note: Sequence of a prefix. Comparison function:
    if diff(timestamps) < 200msecs better transactionid wins
    else better time wins
 */
struct PrefixSequenceType {
    1: required IEEE802_1ASTimeStampType timestamp;
    2: optional PrefixTransactionIDType transactionid;
}
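The comparison rule in the note above can be sketched as follows; the helper name and the `((AS_sec, AS_nsec), transactionid)` tuple shape are an illustrative stand-in for `PrefixSequenceType`, and RFC 6550 transaction-ID rollover is deliberately not modeled:

```python
def better_prefix_sequence(a, b):
    """Return True if sequence `a` wins over `b` per the schema note:
    when the timestamps differ by less than 200 msec the larger
    transaction ID wins, otherwise the later timestamp wins.
    Each argument is ((AS_sec, AS_nsec), transactionid) - a sketch only."""
    (a_sec, a_nsec), a_txid = a
    (b_sec, b_nsec), b_txid = b
    a_ns = a_sec * 10**9 + a_nsec
    b_ns = b_sec * 10**9 + b_nsec
    if abs(a_ns - b_ns) < 200 * 10**6:  # closer than 200 msec
        return a_txid > b_txid
    return a_ns > b_ns
```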
/** Type of TIE.
    This enum indicates what TIE type the TIE is carrying.
    In case the value is not known to the receiver, the TIE MUST be
    re-flooded the same way as prefix TIEs. This allows for
    future extensions of the protocol within the same schema major
    version with types opaque to some nodes. If the flooding scope is not
    the same as for prefix TIEs, then a major version revision MUST
    be performed.
 */
enum TIETypeType {
    Illegal = 0,
    TIETypeMinValue = 1,
    /** first legal value */
    NodeTIEType = 2,
    PrefixTIEType = 3,
    PositiveDisaggregationPrefixTIEType = 4,
    NegativeDisaggregationPrefixTIEType = 5,
    PGPrefixTIEType = 6,
    KeyValueTIEType = 7,
    ExternalPrefixTIEType = 8,
    TIETypeMaxValue = 9,
}
/** @note: route types which MUST be ordered on their preference
 *  PGP prefixes are most preferred attracting
 *  traffic north (towards spine) and then south
 *  normal prefixes are attracting traffic south (towards leafs),
 *  i.e. prefix in NORTH PREFIX TIE is preferred over SOUTH PREFIX TIE
 *
 *  @note: The only purpose of those values is to introduce an
 *         ordering whereas an implementation can choose internally
 *         any other values as long as the ordering is preserved
 */
enum RouteType {
    Illegal = 0,
    RouteTypeMinValue = 1,
    /** First legal value. */
    /** Discard routes are most preferred */
    Discard = 2,
    /** Local prefixes are directly attached prefixes on the
     *  system such as e.g. interface routes.
     */
    LocalPrefix = 3,
    /** advertised in S-TIEs */
    SouthPGPPrefix = 4,
    /** advertised in N-TIEs */
    NorthPGPPrefix = 5,
    /** advertised in N-TIEs */
    NorthPrefix = 6,
    /** externally imported north */
    NorthExternalPrefix = 7,
    /** advertised in S-TIEs, either normal prefix or positive disaggregation */
    SouthPrefix = 8,
    /** externally imported south */
    SouthExternalPrefix = 9,
    /** negative, transitive prefixes are least preferred of
        local variety */
    NegativeSouthPrefix = 10,
    RouteTypeMaxValue = 11,
}
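Since the note above says the numeric values only exist to fix an ordering, an implementation is free to model them internally however it likes as long as preference comparisons agree with the enum. A hypothetical internal representation, with smaller meaning more preferred:

```python
from enum import IntEnum

class RouteType(IntEnum):
    # Numerically smaller means more preferred; an implementation may
    # use different internal values as long as this ordering holds.
    DISCARD = 2
    LOCAL_PREFIX = 3
    SOUTH_PGP_PREFIX = 4
    NORTH_PGP_PREFIX = 5
    NORTH_PREFIX = 6
    NORTH_EXTERNAL_PREFIX = 7
    SOUTH_PREFIX = 8
    SOUTH_EXTERNAL_PREFIX = 9
    NEGATIVE_SOUTH_PREFIX = 10

def most_preferred(route_types):
    """Pick the most preferred route type among candidates for a destination."""
    return min(route_types)
```

For example, a prefix learned in a North Prefix TIE beats the same prefix learned in a South Prefix TIE, matching the schema note.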
B.2. encoding.thrift
/**
    Thrift file for packet encodings for RIFT
*/
include "common.thrift"
/** Changes
    25.0: shorter level and major version types, clarified unsigned on nonces
    25.1: clarifications on `common.TIEType` extensions for new TIEs
    26.0: smaller packet sequence number and version number types
*/
/** represents protocol encoding schema major version */
const common.VersionType protocol_major_version = 26
/** represents protocol encoding schema minor version */
const common.MinorVersionType protocol_minor_version = 0
/** common RIFT packet header */
struct PacketHeader {
    1: required common.VersionType major_version = protocol_major_version;
    2: required common.VersionType minor_version = protocol_minor_version;
    /** this is the node sending the packet, in case of LIE/TIRE/TIDE
        also the originator of it */
    3: required common.SystemIDType sender;
    /** level of the node sending the packet, required on everything except
      * LIEs. Lack of presence on LIEs indicates UNDEFINED_LEVEL and is used
      * in ZTP procedures.
     */
    4: optional common.LevelType level;
    10: optional common.PacketNumberType packet_number;
}
/** Community serves as community for PGP purposes */
struct Community {
    1: required i32 top;
    2: required i32 bottom;
}
/** Neighbor structure */
struct Neighbor {
    1: required common.SystemIDType originator;
    2: required common.LinkIDType remote_id;
}
/** Capabilities the node supports. The schema may add to this
    field future capabilities to indicate whether it will support
    interpretation of future schema extensions on the same major
    revision. Such fields MUST be optional and have an implicit or
    explicit false default value. If a future capability changes route
    selection or generates blackholes if some nodes are not supporting
    it then a major version increment is unavoidable.
*/
struct NodeCapabilities {
    /** can this node participate in flood reduction */
    1: optional bool flood_reduction =
            common.flood_reduction_default;
    /** does this node restrict itself to be top-of-fabric or
        leaf only (in ZTP) and does it support leaf-2-leaf procedures */
    2: optional common.HierarchyIndications hierarchy_indications;
}
/* Link capabilities */
struct LinkCapabilities {
    /* indicates that the link's `local ID` can be used as its BFD
       discriminator and the link is supporting BFD */
    1: optional bool bfd =
            common.bfd_default;
}
/** RIFT LIE packet
    @note this node's level is already included on the packet header */
struct LIEPacket {
    /** optional node or adjacency name */
    1: optional string name;
    /** local link ID */
    2: required common.LinkIDType local_id;
    /** UDP port to which we can receive flooded TIEs */
    3: required common.UDPPortType flood_port =
            common.default_tie_udp_flood_port;
    /** layer 3 MTU, used to discover MTU mismatch. */
    4: optional common.MTUSizeType link_mtu_size =
            common.default_mtu_size;
    /** local link bandwidth on the interface */
    5: optional common.BandwithInMegaBitsType link_bandwidth =
            common.default_bandwidth;
    /** this will reflect the neighbor once received to provide
        3-way connectivity */
    6: optional Neighbor neighbor;
    7: optional common.PodType pod =
            common.default_pod;
    /** optional local nonce used for security computations */
    8: optional common.NonceType nonce = common.undefined_nonce;
    /** optional neighbor's reflected nonce for security purposes. */
    9: optional common.NonceType last_neighbor_nonce = common.undefined_nonce;
    /** optional node capabilities shown in the LIE. The capabilities
        MUST match the capabilities shown in the Node TIEs, otherwise
        the behavior is unspecified. A node detecting the mismatch
        SHOULD generate an according error */
    10: optional NodeCapabilities node_capabilities;
    11: optional LinkCapabilities link_capabilities;
    /** required holdtime of the adjacency, i.e. how much time
        MUST expire without LIE for the adjacency to drop */
    12: required common.TimeIntervalInSecType holdtime =
            common.default_lie_holdtime;
    /** optional downstream assigned locally significant label
        value for the adjacency */
    13: optional common.LabelType label;
    /** indicates that the level on the LIE MUST NOT be used
        to derive a ZTP level by the receiving node */
    21: optional bool not_a_ztp_offer =
            common.default_not_a_ztp_offer;
    /** indicates to northbound neighbor that it should
        be reflooding this node's N-TIEs to achieve flood reduction and
        balancing for northbound flooding. To be ignored if received from a
        northbound adjacency */
    22: optional bool you_are_flood_repeater =
            common.default_you_are_flood_repeater;
    /** can be optionally set to indicate to neighbor that packet losses are seen on
        reception based on packet numbers or the rate is too high. The receiver SHOULD
        temporarily slow down flooding rates.
    */
    23: optional bool you_are_sending_too_quickly =
            false;
}
/** LinkID pair describes one of parallel links between two nodes */
struct LinkIDPair {
    /** node-wide unique value for the local link */
    1: required common.LinkIDType local_id;
    /** received remote link ID for this link */
    2: required common.LinkIDType remote_id;
    /** more properties of the link can go in here */
}
    /** indicates originator of the TIE */
    2: required common.SystemIDType originator;
    3: required common.TIETypeType tietype;
    4: required common.TIENrType tie_nr;
}
/** Header of a TIE.
    @note: TIEID space is a total order achieved by comparing the elements
           in sequence defined and comparing each value as an
           unsigned integer of according length. `origination_time` and
           `origination_lifetime` are disregarded for comparison purposes
           and carried purely for debugging/security purposes if present.
*/
struct TIEHeader {
    2: required TIEID tieid;
    3: required common.SeqNrType seq_nr;
    /** remaining lifetime that expires down to 0 just like in ISIS.
        TIEs with lifetimes differing by less than `lifetime_diff2ignore` MUST
        be considered EQUAL.
        When using security envelope,
        this is just a model placeholder for convenience
        that is never being modified during flooding. The real remaining lifetime
        is contained in the security envelope. */
    4: required common.LifeTimeInSecType remaining_lifetime;
    /** optional absolute timestamp when the TIE
        was generated. This can be used on fabrics with
        synchronized clock to prevent lifetime modification attacks. */
    10: optional common.IEEE802_1ASTimeStampType origination_time;
    /** optional original lifetime when the TIE
        was generated. This can be used on fabrics with
        synchronized clock to prevent lifetime modification attacks. */
    12: optional common.LifeTimeInSecType origination_lifetime;
}
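The TIEID total order and the `lifetime_diff2ignore` equality rule above can be sketched as follows; the helper names are hypothetical and headers are reduced to `(seq_nr, remaining_lifetime)` tuples for illustration:

```python
LIFETIME_DIFF2IGNORE = 400  # lifetime_diff2ignore from the schema

def tieid_key(direction, originator, tietype, tie_nr):
    """TIEID space is a total order over (direction, originator,
    tietype, tie_nr), each element compared as an unsigned integer;
    Python tuple comparison gives exactly that lexicographic order."""
    return (direction, originator, tietype, tie_nr)

def header_newer(hdr_a, hdr_b):
    """Compare two TIE headers carrying the same TIEID. Each header is
    a (seq_nr, remaining_lifetime) sketch; lifetimes differing by less
    than `lifetime_diff2ignore` are considered EQUAL."""
    seq_a, life_a = hdr_a
    seq_b, life_b = hdr_b
    if seq_a != seq_b:
        return seq_a > seq_b
    if abs(life_a - life_b) < LIFETIME_DIFF2IGNORE:
        return False  # considered EQUAL, do not retransmit
    return life_a > life_b
```

The lifetime fuzz prevents needless retransmissions of TIEs whose lifetimes merely decayed in transit, which is why the schema requires `lifetime_diff2ignore` to exceed `purge_lifetime`.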
}
/** A TIRE packet */
struct TIREPacket {
    1: required set<TIEHeader> headers;
}
/** Neighbor of a node */
struct NodeNeighborsTIEElement {
    /** Level of neighbor */
    1: required common.LevelType level;
    /** Cost to neighbor.
        @note: All parallel links to same node
               incur same cost, in case the neighbor has multiple
               parallel links at different cost, the largest distance
               (highest numerical value) MUST be advertised
        @note: any neighbor with cost <= 0 MUST be ignored in computations */
    3: optional common.MetricType cost = common.default_distance;
    /** can carry description of multiple parallel links in a TIE */
    4: optional set<LinkIDPair> link_ids;
      * neighbors repeat with different values or
      * visible in same level/having partition upper do not match
    the behavior is undefined and a warning SHOULD be generated.
    Neighbors can be distributed across multiple TIEs however if
    the sets are disjoint.
    @note: observe that absence of fields implies defined defaults
*/
struct NodeTIEElement {
    1: required common.LevelType level;
    /** _All_ of the node's neighbors.
      * If neighbor systemID repeats in other node TIEs of same node
        the behavior is undefined. */
    2: required map<common.SystemIDType,
            NodeNeighborsTIEElement> neighbors;
    3: optional NodeCapabilities capabilities;
    4: optional NodeFlags flags;
    /** optional node name for easier operations */
    5: optional string name;
}
struct PrefixAttributes {
    2: required common.MetricType metric = common.default_distance;
    /** generic unordered set of route tags, can be redistributed to other protocols or use
        within the context of real time analytics */
    3: optional set<common.RouteTagType> tags;
    /** optional monotonic clock for mobile addresses */
    4: optional common.PrefixSequenceType monotonic_clock;
}
    1: required map<common.IPPrefixType, PrefixAttributes> prefixes;
}
/** keys with their values */
struct KeyValueTIEElement {
    /** if the same key repeats in multiple TIEs of same node
        or with different values, behavior is unspecified */
    1: required map<common.KeyIDType,string> keyvalues;
}
/** single element in a TIE. enum `common.TIETypeType`
    in TIEID indicates which elements MUST be present
    in the TIEElement. In case of mismatch the unexpected
    elements MUST be ignored. In case of lack of an expected
    element in the TIE an error MUST be reported and the TIE
    MUST be ignored.
    This type can be extended with new optional elements
    for new `common.TIETypeType` values without breaking
    the major version but if it is necessary to understand whether
    all nodes support the new type a node capability must
    be added as well.
*/
union TIEElement {
    /** in case of enum common.TIETypeType.NodeTIEType */
    1: optional NodeTIEElement node;
    /** in case of enum common.TIETypeType.PrefixTIEType */
    2: optional PrefixTIEElement prefixes;
    /** positive prefixes (always southbound)
        It MUST NOT be advertised within a North TIE.
    */
    3: optional PrefixTIEElement positive_disaggregation_prefixes;
    /** negative prefixes (always southbound)
        It MUST NOT be advertised within a North TIE.
    */
    4: optional PrefixTIEElement negative_disaggregation_prefixes;
    /** externally reimported prefixes */
    5: optional PrefixTIEElement external_prefixes;
    /** Key-Value store elements */
    6: optional KeyValueTIEElement keyvalues;
    /** @todo: policy guided prefixes */
}
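The mismatch rules above (ignore unexpected elements, report and drop a TIE missing its expected element) could be checked as sketched below; the mapping and function are hypothetical helpers, with TIE type values taken from `common.TIETypeType`:

```python
# Hypothetical mapping from TIETypeType values to the TIEElement field
# expected to be present (field names follow the schema).
EXPECTED_ELEMENT = {
    2: "node",                              # NodeTIEType
    3: "prefixes",                          # PrefixTIEType
    4: "positive_disaggregation_prefixes",  # PositiveDisaggregationPrefixTIEType
    5: "negative_disaggregation_prefixes",  # NegativeDisaggregationPrefixTIEType
    7: "keyvalues",                         # KeyValueTIEType
    8: "external_prefixes",                 # ExternalPrefixTIEType
}

def element_acceptable(tietype, element_fields):
    """Return True if the TIE carries the element its tietype demands.
    Unexpected extra elements are simply ignored; a missing expected
    element means the TIE must be reported as an error and ignored."""
    expected = EXPECTED_ELEMENT.get(tietype)
    return expected is not None and expected in element_fields
```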
/** @todo: flood header separately in UDP to allow changing lifetime and SHA without reserialization
*/
struct TIEPacket {
    1: required TIEHeader header;
    2: required TIEElement element;
}
union PacketContent {
    1: optional LIEPacket lie;
    2: optional TIDEPacket tide;
    3: optional TIREPacket tire;
    4: optional TIEPacket tie;
}
/** protocol packet structure */
struct ProtocolPacket {
    1: required PacketHeader header;
    2: required PacketContent content;
}
Appendix C. Finite State Machines and Precise Operational
            Specifications
Some FSM figures are provided as [DOT] description due to limitations
of ASCII art.

On Entry action is performed every time and right before the
according state is entered, i.e. after any transitions from previous
state.

On Exit action is performed every time and immediately when a state
is exited.
The FSMs and procedures are NOT normative in the sense that an
implementation MUST implement them literally (which would be
overspecification) but an implementation MUST exhibit externally
observable behavior that is identical to the execution of the
specified FSMs.

Where a FSM representation is inconvenient, i.e. the amount of
procedures and kept state exceeds the amount of transitions, we defer
to a more procedural description on data structures.
C.1. LIE FSM

Initial state is `OneWay`.

Event `MultipleNeighbors` occurs normally when more than two nodes
see each other on the same link or a remote node is quickly
reconfigured or rebooted without regressing to `OneWay` first. Each
occurrence of the event SHOULD generate a clear, according
notification to help operational deployments.

The machine sends LIEs on several transitions to accelerate adjacency
5. CHECK_THREE_WAY: if current state is one-way do nothing else

   1. if lie packet does not contain neighbor then if current state
      is three-way then PUSH NeighborDroppedReflection else

   2. if packet reflects this system's ID and local port and state
      is three-way then PUSH event ValidReflection else PUSH event
      MultipleNeighbors
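The CHECK_THREE_WAY procedure above can be sketched in Python as follows; the state strings, the `(originator, remote_id)` neighbor tuple, and the `push` callback are illustrative names, not taken from the specification:

```python
def check_three_way(state, reflected_neighbor, my_system_id, my_local_id, push):
    """Sketch of CHECK_THREE_WAY. `reflected_neighbor` is the
    (originator, remote_id) pair from the received LIE's neighbor
    field, or None when absent; `push` queues an event on the LIE FSM."""
    if state == "OneWay":
        return  # in one-way state do nothing
    if reflected_neighbor is None:
        # LIE does not contain a neighbor element
        if state == "ThreeWay":
            push("NeighborDroppedReflection")
    elif reflected_neighbor == (my_system_id, my_local_id) and state == "ThreeWay":
        # packet correctly reflects this system's ID and local link
        push("ValidReflection")
    else:
        push("MultipleNeighbors")
```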
C.2. ZTP FSM

Initial state is ComputeBestOffer.
digraph Gd436cc3ced8c471eb30bd4f3ac946261 {
N06108ba9ac894d988b3e4e8ea5ace007
[label="Enter"]
[style="invis"]
[shape="plain"];
Na47ff5eac9aa4b2eaf12839af68aab1f
[label="MultipleNeighborsWait"]
[shape="oval"];
N57a829be68e2489d8dc6b84e10597d0b
[label="OneWay"]
[shape="oval"];
Na641d400819a468d987e31182cdb013e
[label="ThreeWay"]
[shape="oval"];
Necfbfc2d8e5b482682ee66e604450c7b
[label="Enter"]
[style="dashed"]
[shape="plain"];
N16db54bf2c5d48f093ad6c18e70081ee
[label="TwoWay"]
[shape="oval"];
N1b89016876b44cc1b9c1e4a735769560
[label="Exit"]
[style="invis"]
[shape="plain"];
N16db54bf2c5d48f093ad6c18e70081ee -> N57a829be68e2489d8dc6b84e10597d0b
[label="|NeighborChangedLevel|\n|NeighborChangedAddress|\n|UnacceptableHeader|\n|MTUMismatch|\n|PODMismatch|\n|HoldtimeExpired|"]
[color="black"]
[arrowhead="normal" dir="both" arrowtail="none"];
N57a829be68e2489d8dc6b84e10597d0b -> N57a829be68e2489d8dc6b84e10597d0b
[label="|NeighborDroppedReflection|"]
[color="red"]
[arrowhead="normal" dir="both" arrowtail="none"];
N57a829be68e2489d8dc6b84e10597d0b -> Na47ff5eac9aa4b2eaf12839af68aab1f
[label="|MultipleNeighbors|"]
[color="black"]
[arrowhead="normal" dir="both" arrowtail="none"];
Necfbfc2d8e5b482682ee66e604450c7b -> N57a829be68e2489d8dc6b84e10597d0b
[label=""]
[color="black"]
[arrowhead="normal" dir="both" arrowtail="none"];
N57a829be68e2489d8dc6b84e10597d0b -> N16db54bf2c5d48f093ad6c18e70081ee
[label="|NewNeighbor|"]
[color="black"]
[arrowhead="normal" dir="both" arrowtail="none"];
Na641d400819a468d987e31182cdb013e -> Na47ff5eac9aa4b2eaf12839af68aab1f
[label="|MultipleNeighbors|"]
[color="black"]
[arrowhead="normal" dir="both" arrowtail="none"];
N16db54bf2c5d48f093ad6c18e70081ee -> N16db54bf2c5d48f093ad6c18e70081ee
[label="|HALChanged|\n|HATChanged|\n|HALSChanged|\n|UpdateZTPOffer|"]
[color="blue"]
[arrowhead="normal" dir="both" arrowtail="none"];
Na641d400819a468d987e31182cdb013e -> N16db54bf2c5d48f093ad6c18e70081ee
[label="|NeighborDroppedReflection|"]
[color="red"]
[arrowhead="normal" dir="both" arrowtail="none"];
Na47ff5eac9aa4b2eaf12839af68aab1f -> Na47ff5eac9aa4b2eaf12839af68aab1f
[label="|TimerTick|\n|MultipleNeighbors|"]
[color="black"]
[arrowhead="normal" dir="both" arrowtail="none"];
N57a829be68e2489d8dc6b84e10597d0b -> N57a829be68e2489d8dc6b84e10597d0b
[label="|LevelChanged|\n|HALChanged|\n|HATChanged|\n|HALSChanged|\n|UpdateZTPOffer|"]
[color="blue"]
[arrowhead="normal" dir="both" arrowtail="none"];
Na641d400819a468d987e31182cdb013e -> Na641d400819a468d987e31182cdb013e
[label="|HALChanged|\n|HATChanged|\n|HALSChanged|\n|UpdateZTPOffer|"]
[color="blue"]
[arrowhead="normal" dir="both" arrowtail="none"];
Na641d400819a468d987e31182cdb013e -> N57a829be68e2489d8dc6b84e10597d0b
[label="|NeighborChangedLevel|\n|NeighborChangedAddress|\n|UnacceptableHeader|\n|MTUMismatch|\n|PODMismatch|\n|HoldtimeExpired|"]
[color="black"]
[arrowhead="normal" dir="both" arrowtail="none"];
Na47ff5eac9aa4b2eaf12839af68aab1f -> Na47ff5eac9aa4b2eaf12839af68aab1f
[label="|HALChanged|\n|HATChanged|\n|HALSChanged|\n|UpdateZTPOffer|"]
[color="blue"]
[arrowhead="normal" dir="both" arrowtail="none"];
N16db54bf2c5d48f093ad6c18e70081ee -> N57a829be68e2489d8dc6b84e10597d0b
[label="|LevelChanged|"]
[color="blue"]
[arrowhead="normal" dir="both" arrowtail="none"];
Na641d400819a468d987e31182cdb013e -> N57a829be68e2489d8dc6b84e10597d0b
[label="|LevelChanged|"]
[color="blue"]
[arrowhead="normal" dir="both" arrowtail="none"];
N16db54bf2c5d48f093ad6c18e70081ee -> Na47ff5eac9aa4b2eaf12839af68aab1f
[label="|MultipleNeighbors|"]
[color="black"]
[arrowhead="normal" dir="both" arrowtail="none"];
Na47ff5eac9aa4b2eaf12839af68aab1f -> N57a829be68e2489d8dc6b84e10597d0b
[label="|MultipleNeighborsDone|"]
[color="black"]
[arrowhead="normal" dir="both" arrowtail="none"];
N16db54bf2c5d48f093ad6c18e70081ee -> Na641d400819a468d987e31182cdb013e
[label="|ValidReflection|"]
[color="red"]
[arrowhead="normal" dir="both" arrowtail="none"];
Na47ff5eac9aa4b2eaf12839af68aab1f -> N57a829be68e2489d8dc6b84e10597d0b
[label="|LevelChanged|"]
[color="blue"]
[arrowhead="normal" dir="both" arrowtail="none"];
Na641d400819a468d987e31182cdb013e -> Na641d400819a468d987e31182cdb013e
[label="|TimerTick|\n|LieRcvd|\n|SendLie|"]
[color="black"]
[arrowhead="normal" dir="both" arrowtail="none"];
N57a829be68e2489d8dc6b84e10597d0b -> N57a829be68e2489d8dc6b84e10597d0b
[label="|TimerTick|\n|LieRcvd|\n|NeighborChangedLevel|\n|NeighborChangedAddress|\n|NeighborAddressAdded|\n|UnacceptableHeader|\n|MTUMismatch|\n|PODMismatch|\n|HoldtimeExpired|\n|SendLie|"]
[color="black"]
[arrowhead="normal" dir="both" arrowtail="none"];
N57a829be68e2489d8dc6b84e10597d0b -> Na641d400819a468d987e31182cdb013e
[label="|ValidReflection|"]
[color="red"]
[arrowhead="normal" dir="both" arrowtail="none"];
N16db54bf2c5d48f093ad6c18e70081ee -> N16db54bf2c5d48f093ad6c18e70081ee
[label="|TimerTick|\n|LieRcvd|\n|SendLie|"]
[color="black"]
[arrowhead="normal" dir="both" arrowtail="none"];
Na641d400819a468d987e31182cdb013e -> Na641d400819a468d987e31182cdb013e
[label="|ValidReflection|"]
[color="red"]
[arrowhead="normal" dir="both" arrowtail="none"];
}

ZTP FSM DOT
Events

   o TimerTick: one second timer tick

   o LevelChanged: node's level has been changed by ZTP or
     configuration

   o HALChanged: best HAL computed by ZTP has changed

   o HATChanged: HAT computed by ZTP has changed

   o HALSChanged: set of HAL offering systems computed by ZTP has
     changed

   o LieRcvd: received LIE

   o NewNeighbor: new neighbor parsed

   o ValidReflection: received own reflection from neighbor

   o NeighborDroppedReflection: lost previous own reflection from
     neighbor
o NeighborChangedLevel: neighbor changed advertised level
o NeighborChangedAddress: neighbor changed IP address
o UnacceptableHeader: unacceptable header seen
o MTUMismatch: MTU mismatched
o PODMismatch: Unacceptable PoD seen
o HoldtimeExpired: adjacency hold down expired
o MultipleNeighbors: more than one neighbor seen on interface
o MultipleNeighborsDone: cooldown for multiple neighbors expired
o SendLie: send a LIE out
o UpdateZTPOffer: update this node's ZTP offer
Actions

   on MTUMismatch in OneWay finishes in OneWay: no action

   on HoldtimeExpired in OneWay finishes in OneWay: no action

   on LevelChanged in ThreeWay finishes in OneWay: update level with
   event value

   on MultipleNeighbors in MultipleNeighborsWait finishes in
   MultipleNeighborsWait: start multiple neighbors timer as 4 *
   DEFAULT_LIE_HOLDTIME

   on HALChanged in MultipleNeighborsWait finishes in
   MultipleNeighborsWait: store new HAL

   on NeighborChangedAddress in ThreeWay finishes in OneWay: no
   action

   on ValidReflection in OneWay finishes in ThreeWay: no action

   on MTUMismatch in TwoWay finishes in OneWay: no action

   on TimerTick in MultipleNeighborsWait finishes in
   MultipleNeighborsWait: decrement MultipleNeighbors timer, if
   expired PUSH MultipleNeighborsDone

   on MultipleNeighborsDone in MultipleNeighborsWait finishes in
   OneWay: decrement MultipleNeighbors timer, if expired PUSH
   MultipleNeighborsDone

   on HATChanged in ThreeWay finishes in ThreeWay: store HAT

   on UpdateZTPOffer in TwoWay finishes in TwoWay: send offer to ZTP
   FSM

   on HALSChanged in TwoWay finishes in TwoWay: store HALS

   on PODMismatch in TwoWay finishes in OneWay: no action

   on LieRcvd in TwoWay finishes in TwoWay: PROCESS_LIE

   on PODMismatch in ThreeWay finishes in OneWay: no action

   on TimerTick in TwoWay finishes in TwoWay: PUSH SendLie event, if
   holdtime expired PUSH HoldtimeExpired event

   on SendLie in TwoWay finishes in TwoWay: SEND_LIE

   on SendLie in OneWay finishes in OneWay: SEND_LIE

   on TimerTick in OneWay finishes in OneWay: PUSH SendLie event

   on HALChanged in OneWay finishes in OneWay: store new HAL

   on HALSChanged in ThreeWay finishes in ThreeWay: store HALS

   on NeighborChangedLevel in TwoWay finishes in OneWay: no action

   on PODMismatch in OneWay finishes in OneWay: no action

   on HoldtimeExpired in TwoWay finishes in OneWay: no action

   on TimerTick in ThreeWay finishes in ThreeWay: PUSH SendLie event,
   if holdtime expired PUSH HoldtimeExpired event

   on MultipleNeighbors in TwoWay finishes in MultipleNeighborsWait:
   start multiple neighbors timer as 4 * DEFAULT_LIE_HOLDTIME

   on UpdateZTPOffer in MultipleNeighborsWait finishes in
   MultipleNeighborsWait: send offer to ZTP FSM

   on LieRcvd in OneWay finishes in OneWay: PROCESS_LIE

   on LevelChanged in MultipleNeighborsWait finishes in OneWay:
   update level with event value

   on UpdateZTPOffer in ThreeWay finishes in ThreeWay: send offer to
   ZTP FSM

   on HALChanged in TwoWay finishes in TwoWay: store new HAL

   on UnacceptableHeader in OneWay finishes in OneWay: no action

   on HALSChanged in OneWay finishes in OneWay: store HALS

   on HALSChanged in MultipleNeighborsWait finishes in
   MultipleNeighborsWait: store HALS

   on SendLie in ThreeWay finishes in ThreeWay: SEND_LIE

   on MTUMismatch in ThreeWay finishes in OneWay: no action

   on HATChanged in MultipleNeighborsWait finishes in
   MultipleNeighborsWait: store HAT

   on NeighborChangedAddress in OneWay finishes in OneWay: no action

   on ValidReflection in TwoWay finishes in ThreeWay: no action

   on MultipleNeighbors in OneWay finishes in MultipleNeighborsWait:
   start multiple neighbors timer as 4 * DEFAULT_LIE_HOLDTIME

   on NeighborChangedLevel in OneWay finishes in OneWay: no action

   on HATChanged in OneWay finishes in OneWay: store HAT

   on NeighborDroppedReflection in OneWay finishes in OneWay: no
   action

   on HALChanged in ThreeWay finishes in ThreeWay: store new HAL

   on NeighborAddressAdded in OneWay finishes in OneWay: no action

   on NeighborChangedAddress in TwoWay finishes in OneWay: no action
on LieRcvd in ThreeWay finishes in ThreeWay: PROCESS_LIE
on UnacceptableHeader in TwoWay finishes in OneWay: no action
on LevelChanged in TwoWay finishes in OneWay: update level with
event value
on HATChanged in TwoWay finishes in TwoWay: store HAT
on UpdateZTPOffer in OneWay finishes in OneWay: send offer to ZTP
FSM
on ValidReflection in ThreeWay finishes in ThreeWay: no action
on UnacceptableHeader in ThreeWay finishes in OneWay: no action
on HoldtimeExpired in ThreeWay finishes in OneWay: no action
on NeighborChangedLevel in ThreeWay finishes in OneWay: no action
on LevelChanged in OneWay finishes in OneWay: update level with
event value, PUSH SendLie event
on NewNeighbor in OneWay finishes in TwoWay: PUSH SendLie event
on NeighborDroppedReflection in ThreeWay finishes in TwoWay: no
action
on MultipleNeighbors in ThreeWay finishes in
MultipleNeighborsWait: start multiple neighbors timer as 4 *
DEFAULT_LIE_HOLDTIME
on Entry into OneWay: CLEANUP
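The state/event rows above can be read as a transition table. The following Python sketch is a non-normative illustration covering only a handful of the transitions listed; unlisted pairs keep the current state, just as the "no action, same state" rows do.

```python
from enum import Enum, auto

class State(Enum):
    ONE_WAY = auto()
    TWO_WAY = auto()
    THREE_WAY = auto()
    MULTIPLE_NEIGHBORS_WAIT = auto()

class Event(Enum):
    NEW_NEIGHBOR = auto()
    VALID_REFLECTION = auto()
    NEIGHBOR_DROPPED_REFLECTION = auto()
    HOLDTIME_EXPIRED = auto()
    MULTIPLE_NEIGHBORS = auto()

# (state, event) -> next state; a small subset of the table in the text
TRANSITIONS = {
    (State.ONE_WAY, Event.NEW_NEIGHBOR): State.TWO_WAY,
    (State.ONE_WAY, Event.VALID_REFLECTION): State.THREE_WAY,
    (State.TWO_WAY, Event.VALID_REFLECTION): State.THREE_WAY,
    (State.THREE_WAY, Event.NEIGHBOR_DROPPED_REFLECTION): State.TWO_WAY,
    (State.THREE_WAY, Event.HOLDTIME_EXPIRED): State.ONE_WAY,
    (State.THREE_WAY, Event.MULTIPLE_NEIGHBORS): State.MULTIPLE_NEIGHBORS_WAIT,
}

def step(state, event):
    # unlisted (state, event) pairs keep the current state
    return TRANSITIONS.get((state, event), state)
```

Actions (SEND_LIE, CLEANUP, timers) would hang off the same table in a full implementation; they are omitted here.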
Following words are used for well known procedures:

1. PUSH Event: pushes an event to be executed by the FSM upon exit
   of this action

2. CLEANUP: neighbor MUST be reset to unknown

3. SEND_LIE: create a new LIE packet

   1. reflecting the neighbor if known and valid and

   2. setting the necessary `not_a_ztp_offer` variable if level was
      derived from last known neighbor on this interface and

   3. setting `you_are_not_flood_repeater` to computed value

4. PROCESS_LIE:
1. if lie has wrong major version OR our own system ID or
invalid system ID then CLEANUP else
2. if lie has non matching MTUs then CLEANUP, PUSH
UpdateZTPOffer, PUSH MTUMismatch else
3. if PoD rules do not allow adjacency forming then CLEANUP,
PUSH PODMismatch, PUSH MTUMismatch else
4. if lie has undefined level OR my level is undefined OR this
node is leaf and remote level lower than HAT OR (lie's level
is not leaf AND its difference is more than one from my
level) then CLEANUP, PUSH UpdateZTPOffer, PUSH
UnacceptableHeader else
5. PUSH UpdateZTPOffer, construct temporary new neighbor
structure with values from lie, if no current neighbor exists
then set neighbor to new neighbor, PUSH NewNeighbor event,
CHECK_THREE_WAY else
1. if current neighbor system ID differs from lie's system
ID then PUSH MultipleNeighbors else
2. if current neighbor stored level differs from lie's level
then PUSH NeighborChangedLevel else
3. if current neighbor stored IPv4/v6 address differs from
lie's address then PUSH NeighborChangedAddress else
4. if any of neighbor's flood address port, name, local
linkid changed then PUSH NeighborChangedMinorFields and
5. CHECK_THREE_WAY
5. CHECK_THREE_WAY: if current state is one-way do nothing else
1. if lie packet does not contain neighbor then if current state
is three-way then PUSH NeighborDroppedReflection else
2. if packet reflects this system's ID and local port and state
is three-way then PUSH event ValidReflection else PUSH event
MultipleNeighbors
C.3. Flooding Procedures
Flooding Procedures are described in terms of a flooding state of an
adjacency and resulting operations on it driven by packet arrivals.
The FSM has basically a single state and is not well suited to
represent the behavior.

RIFT does not specify any kind of flood rate limiting since such
specifications always assume particular points in available
technology speeds and feeds and those points are shifting at faster
and faster rate (speed of light holding for the moment). The encoded
packets provide hints to react accordingly to losses or overruns.

Flooding of all according topology exchange elements SHOULD be
performed at highest feasible rate whereas the rate of transmission
MUST be throttled by reacting to adequate features of the system such
as e.g. queue lengths or congestion indications in the protocol
packets.

C.3.1. FloodState Structure per Adjacency
The structure contains conceptually the following elements. The word
collection or queue indicates a set of elements that can be iterated:

TIES_TX: Collection containing all the TIEs to transmit on the
   adjacency.

TIES_ACK: Collection containing all the TIEs that have to be
   acknowledged on the adjacency.

TIES_REQ: Collection containing all the TIE headers that have to be
   requested on the adjacency.

TIES_RTX: Collection containing all TIEs that need retransmission
   with the according time to retransmit.

Following words are used for well known procedures operating on this
structure:

TIE: Describes either a full RIFT TIE or accordingly just the
   `TIEHeader` or `TIEID`. The according meaning is unambiguously
   contained in the context of the algorithm.

is_flood_reduced(TIE): returns whether a TIE can be flood reduced or
   not.

is_tide_entry_filtered(TIE): returns whether a header should be
   propagated in TIDE according to flooding scopes.

is_request_filtered(TIE): returns whether a TIE request should be
   propagated to neighbor or not according to flooding scopes.

is_flood_filtered(TIE): returns whether a TIE requested be flooded
The collection SHOULD be served with following priorities if the
system cannot process all the collections in real time:

   Elements on TIES_ACK should be processed with highest priority

   TIES_TX

   TIES_REQ and TIES_RTX
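Under load the scheduler simply serves the highest-priority non-empty collection first. A minimal sketch of that selection order (illustrative only, not part of the draft):

```python
def next_to_serve(ties_ack, ties_tx, ties_req, ties_rtx):
    """Pick the next flood-state collection to serve, following the
    priority order given above: ACKs first, then transmissions, then
    requests and retransmissions sharing the lowest priority."""
    if ties_ack:
        return "TIES_ACK", ties_ack
    if ties_tx:
        return "TIES_TX", ties_tx
    if ties_req:
        return "TIES_REQ", ties_req
    return ("TIES_RTX", ties_rtx) if ties_rtx else (None, None)
```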
C.3.2. TIDEs
`TIEID` and `TIEHeader` space forms a strict total order (modulo
incomparable sequence numbers in the very unlikely event that can
occur if a TIE is "stuck" in a part of a network while the originator
reboots and reissues TIEs many times to the point its sequence# rolls
over and forms incomparable distance to the "stuck" copy) which
implies that a comparison relation is possible between two elements.
With that it is implicitly possible to compare TIEs, TIEHeaders and
TIEIDs to each other whereas the shortest viable key is always
implied.
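Because the TIEID/TIEHeader space is totally ordered, an implementation can compare identifiers as plain tuples. A minimal sketch (the field order follows the TIEID key of the schema, but the exact layout here is an assumption):

```python
from functools import total_ordering

@total_ordering
class TIEID:
    """A TIEID ordered lexicographically over its fields; a TIEHeader
    would extend this key with the sequence number."""
    def __init__(self, direction, originator, tietype, tie_nr):
        self.key = (direction, originator, tietype, tie_nr)

    def __eq__(self, other):
        return self.key == other.key

    def __lt__(self, other):
        return self.key < other.key

# smallest possible key, usable as MIN_TIEID in the TIDE algorithms
MIN_TIEID = TIEID(0, 0, 0, 0)
```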
When generating and sending TIDEs an implementation SHOULD ensure
that enough bandwidth is left to send elements of Floodstate
structure.

C.3.2.1. TIDE Generation
As given by timer constant, periodically generate TIDEs by:

   NEXT_TIDE_ID: ID of next TIE to be sent in TIDE.

   TIDE_START: Begin of TIDE packet range.

a. NEXT_TIDE_ID = MIN_TIEID

b. while NEXT_TIDE_ID not equal to MAX_TIEID do

   1. TIDE_START = NEXT_TIDE_ID

   2. HEADERS = At most TIRDEs_PER_PKT headers in TIEDB starting at
      NEXT_TIDE_ID or higher that SHOULD be filtered by
      is_tide_entry_filtered and have a lifetime left > 0

   3. if HEADERS is empty then START = MIN_TIEID else START = first
      element in HEADERS

   4. if HEADERS' size less than TIRDEs_PER_PKT then END =
      MAX_TIEID else END = last element in HEADERS

   5. send sorted HEADERS as TIDE setting START and END as its
      range

   6. NEXT_TIDE_ID = END

The constant `TIRDEs_PER_PKT` SHOULD be generated and used by the
implementation to limit the amount of TIE headers per TIDE so the
sent TIDE PDU does not exceed interface MTU.

TIDE PDUs SHOULD be spaced on sending to prevent packet drops.
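The generation loop above can be sketched as follows, with headers reduced to (tieid, lifetime) pairs and `is_tide_entry_filtered` omitted for brevity; all names are illustrative assumptions, not the draft's data model.

```python
def generate_tides(tiedb, tirdes_per_pkt, min_tieid, max_tieid):
    """Follow steps a. and b. above: walk a sorted list of (tieid,
    lifetime) headers and cut it into TIDE packets of at most
    tirdes_per_pkt entries, each carrying a (start, end) range."""
    tides = []
    next_tide_id = min_tieid                          # a. NEXT_TIDE_ID = MIN_TIEID
    while next_tide_id != max_tieid:                  # b. until MAX_TIEID reached
        # b.2: at most TIRDEs_PER_PKT live headers at NEXT_TIDE_ID or higher
        headers = [(tid, life) for tid, life in tiedb
                   if tid >= next_tide_id and life > 0][:tirdes_per_pkt]
        # b.3 / b.4: derive the advertised range from the selection
        start = min_tieid if not headers else headers[0][0]
        end = max_tieid if len(headers) < tirdes_per_pkt else headers[-1][0]
        tides.append((start, end, headers))           # b.5: "send" the TIDE
        next_tide_id = end                            # b.6: NEXT_TIDE_ID = END
    return tides
```

Note that, as in the text, the last header of a full TIDE is repeated as the first candidate of the next one since the scan restarts at END inclusively.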
C.3.2.2. TIDE Processing
On reception of TIDEs the following processing is performed:

   TXKEYS: Collection of TIE Headers to be sent after processing of
      the packet

   REQKEYS: Collection of TIEIDs to be requested after processing of
      the packet

   CLEARKEYS: Collection of TIEIDs to be removed from flood state
      queues

   LASTPROCESSED: Last processed TIEID in TIDE

   DBTIE: TIE in the LSDB if found
a. LASTPROCESSED = TIDE.start_range

b. for every HEADER in TIDE do

   1. DBTIE = find HEADER in current LSDB

   2. if HEADER < LASTPROCESSED then report error and reset
      adjacency and return

   3. put all TIEs in LSDB where (TIE.HEADER > LASTPROCESSED and
      TIE.HEADER < HEADER) into TXKEYS

   4. LASTPROCESSED = HEADER

   5. if DBTIE not found then

      I) if originator is this node then bump_own_tie

      II) else put HEADER into REQKEYS

   6. if DBTIE.HEADER < HEADER then

      I) if originator is this node then bump_own_tie else

         i. if this is a N-TIE header from a northbound
            neighbor then override DBTIE in LSDB with HEADER

         ii. else put HEADER into REQKEYS

   7. if DBTIE.HEADER > HEADER then put DBTIE.HEADER into TXKEYS

   8. if DBTIE.HEADER = HEADER then

      I) if DBTIE has content already then put DBTIE.HEADER
         into CLEARKEYS

      II) else put HEADER into REQKEYS

c. put all TIEs in LSDB where (TIE.HEADER > LASTPROCESSED and
   TIE.HEADER <= TIDE.end_range) into TXKEYS

d. for all TIEs in TXKEYS try_to_transmit_tie(TIE)

e. for all TIEs in REQKEYS request_tie(TIE)

f. for all TIEs in CLEARKEYS remove_from_all_queues(TIE)
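The gap-filling part of the procedure (steps b.3 and c.) is the core of TIDE processing: any TIE held locally inside the TIDE's range that the peer did not list must be (re)transmitted. A reduced sketch, with headers collapsed to comparable TIEIDs and the REQKEYS/version logic omitted:

```python
def tide_gap_fill(tide_start, tide_end, tide_headers, lsdb):
    """Return the TXKEYS produced by steps a., b.2-b.4 and c. above.
    `lsdb` is a sorted list of TIEIDs we hold, `tide_headers` the sorted
    TIEIDs the peer advertised over (tide_start, tide_end]."""
    txkeys = []
    last = tide_start                                 # a. LASTPROCESSED = start
    for header in tide_headers:                       # b. for every HEADER
        if header < last:                             # b.2: must arrive sorted
            raise ValueError("TIDE headers out of order; reset adjacency")
        txkeys += [t for t in lsdb if last < t < header]   # b.3: fill the gap
        last = header                                 # b.4
    txkeys += [t for t in lsdb if last < t <= tide_end]    # c.: trailing gap
    return txkeys
```

For example, holding TIEs 1..5 while the peer advertises only 2 and 4 over the range (0, 10] yields retransmission of 1, 3 and 5.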
C.3.3. TIREs

C.3.3.1. TIRE Generation
There is not much to say here. Elements from both TIES_REQ and
TIES_ACK MUST be collected and sent out as fast as feasible as TIREs.

C.3.3.2. TIRE Processing
On reception of TIREs the following processing is performed:

   TXKEYS: Collection of TIE Headers to be sent after processing of
      the packet

   REQKEYS: Collection of TIEIDs to be requested after processing of
      the packet

   ACKKEYS: Collection of TIEIDs that have been acked

   4. if DBTIE.HEADER > HEADER then put DBTIE.HEADER into TXKEYS

   5. if DBTIE.HEADER = HEADER then put DBTIE.HEADER into ACKKEYS

b. for all TIEs in TXKEYS try_to_transmit_tie(TIE)

c. for all TIEs in REQKEYS request_tie(TIE)

d. for all TIEs in ACKKEYS tie_been_acked(TIE)
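A reduced sketch of the TIRE comparison fan-out follows. The first steps of the procedure fall into a portion elided by the diff, so the "not found" and "peer has newer" branches below are assumptions added only to make the sketch complete; headers are collapsed to a TIEID -> seq_nr map.

```python
def process_tire(tire_headers, lsdb):
    """Classify each TIRE header against the local LSDB into the
    (txkeys, reqkeys, ackkeys) collections, then a caller would run
    try_to_transmit_tie / request_tie / tie_been_acked over them."""
    txkeys, reqkeys, ackkeys = [], [], []
    for tieid, seq in tire_headers.items():
        db_seq = lsdb.get(tieid)        # find HEADER in current LSDB
        if db_seq is None:
            continue                    # unknown TIE: nothing to compare (assumption)
        if db_seq > seq:
            txkeys.append(tieid)        # step 4: we hold a newer version
        elif db_seq == seq:
            ackkeys.append(tieid)       # step 5: same version, treat as acked
        else:
            reqkeys.append(tieid)       # peer holds newer: request it (assumption)
    return txkeys, reqkeys, ackkeys
```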
C.3.4. TIEs Processing on Flood State Adjacency
On reception of TIEs the following processing is performed:

   ACKTIE: TIE to acknowledge

   TXTIE: TIE to transmit

   DBTIE: TIE in the LSDB if found

a. DBTIE = find TIE in current LSDB

b. if DBTIE not found then

   1. if originator is this node then bump_own_tie with a short
      remaining lifetime

   2. else insert TIE into LSDB and ACKTIE = TIE
else else
1. if DBTIE.HEADER = TIE.HEADER then ACKTIE = TIE 1. if DBTIE.HEADER = TIE.HEADER then
i. if DBTIE has content already then ACKTIE = TIE
ii. else process like the "DBTIE.HEADER < TIE.HEADER" case
2. if DBTIE.HEADER < TIE.HEADER then 2. if DBTIE.HEADER < TIE.HEADER then
i. if originator is this node then bump_own_tie i. if originator is this node then bump_own_tie
ii. else insert TIE into LSDB and ACKTIE = TIE ii. else insert TIE into LSDB and ACKTIE = TIE
3. if DBTIE.HEADER > TIE.HEADER then TXTIE = TIE 3. if DBTIE.HEADER > TIE.HEADER then
i. if DBTIE has content already then TXTIE = DBTIE
ii. else ACKTIE = DBTIE
c. if TXTIE is set then try_to_transmit_tie(TXTIE) c. if TXTIE is set then try_to_transmit_tie(TXTIE)
d. if ACKTIE is set then ack_tie(TIE) d. if ACKTIE is set then ack_tie(TIE)
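The per-TIE decision procedure above can be sketched as below. This is an illustrative sketch under stated assumptions: `TIE`, `seq_nr`, `has_content`, the dict-based LSDB, and the originator check are simplified stand-ins, and `bump_own_tie` is reduced to a sequence-number bump that is then returned for flooding.

```python
from collections import namedtuple

# Hypothetical minimal TIE: tieid plus seq_nr stand in for the full header
# ordering; has_content distinguishes a full TIE from a header-only entry.
TIE = namedtuple("TIE", ["tieid", "seq_nr", "has_content"])

def bump_own_tie(tie):
    """Assumed helper: re-originate one of our own TIEs with a higher
    sequence number (the draft additionally shortens the lifetime)."""
    return tie._replace(seq_nr=tie.seq_nr + 1)

def process_received_tie(tie, lsdb, my_node_id, originator):
    """Sketch of the per-TIE decision above. Mutates lsdb in place and
    returns (acktie, txtie); at most one of the two is set."""
    dbtie = lsdb.get(tie.tieid)
    if dbtie is None:                                  # b. not in LSDB
        if originator == my_node_id:
            return None, bump_own_tie(tie)             # re-originate own TIE
        lsdb[tie.tieid] = tie
        return tie, None                               # ACKTIE = TIE
    if dbtie.seq_nr == tie.seq_nr and dbtie.has_content:
        return tie, None                               # 1.i same version: ack
    if dbtie.seq_nr <= tie.seq_nr:                     # 2. (and 1.ii) newer copy
        if originator == my_node_id:
            return None, bump_own_tie(tie)
        lsdb[tie.tieid] = tie
        return tie, None
    if dbtie.has_content:                              # 3.i we hold a newer TIE
        return None, dbtie                             # flood our copy back
    return dbtie, None                                 # 3.ii header-only: ack
```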
B.3.5. TIEs Processing When LSDB Received Newer Version on Other C.3.5. TIEs Processing When LSDB Received Newer Version on Other
Adjacencies Adjacencies
The Link State Database can be considered to be a switchboard that The Link State Database can be considered to be a switchboard that
does not need any flooding procedures but can be given new versions does not need any flooding procedures but can be given new versions
of TIEs by a peer. Consequently, a peer receives from the LSDB of TIEs by a peer. Consequently, a peer receives from the LSDB
newer versions of TIEs received by other peers and processes them newer versions of TIEs received by other peers and processes them
(without any filtering) just like receiving TIEs from its remote peer. (without any filtering) just like receiving TIEs from its remote peer.
This publisher model can be implemented in many ways. This publisher model can be implemented in many ways.
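One of the many possible implementations of this publisher model is a simple subscription fan-out, sketched below. All class and method names here are illustrative, not from the draft; whether to suppress delivery back to the source adjacency is an implementation assumption.

```python
class LSDBSwitchboard:
    """Illustrative publisher for the switchboard model described above:
    when the LSDB stores a newer TIE version, it pushes that version to
    every subscribed flood-state adjacency except the one it came from,
    and each adjacency processes it exactly like a TIE received from its
    remote peer."""

    def __init__(self):
        self.subscribers = []          # per-adjacency flood state machines

    def subscribe(self, adjacency):
        self.subscribers.append(adjacency)

    def publish(self, tie, source_adjacency):
        for adj in self.subscribers:
            if adj is not source_adjacency:  # don't echo to the source peer
                adj.receive_tie(tie)         # same path as a remote TIE
```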
Appendix C. Constants C.3.6. Sending TIEs
C.1. Configurable Protocol Constants On a periodic basis all TIEs with lifetime left > 0 MUST be sent out
on the adjacency, removed from TIES_TX list and requeued onto
TIES_RTX list.
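One timer tick of this rule can be sketched as follows; the `TIE` tuple with an explicit `remaining_lifetime` field and the list-based queues are illustrative assumptions.

```python
from collections import namedtuple

# Hypothetical TIE with an explicit remaining lifetime in seconds.
TIE = namedtuple("TIE", ["tieid", "remaining_lifetime"])

def periodic_tie_send(ties_tx, ties_rtx, send):
    """One timer tick of the rule above: send every TIE on TIES_TX whose
    remaining lifetime is still > 0, remove it from TIES_TX and requeue it
    on TIES_RTX for retransmission until acknowledged."""
    for tie in list(ties_tx):          # iterate over a copy while mutating
        if tie.remaining_lifetime > 0:
            send(tie)
            ties_tx.remove(tie)
            ties_rtx.append(tie)
```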
Appendix D. Constants
D.1. Configurable Protocol Constants
This section gathers constants that are provided in the schema files This section gathers constants that are provided in the schema files
and the document. and the document.
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| | Type | Value | | | Type | Value |
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| LIE IPv4 | Default | 224.0.0.120 or all-rift-routers | | LIE IPv4 | Default | 224.0.0.120 or all-rift-routers |
| Multicast | Value, | to be assigned in IPv4 Multicast | | Multicast | Value, | to be assigned in IPv4 Multicast |
| Address | Configurable | Address Space Registry in Local | | Address | Configurable | Address Space Registry in Local |
skipping to change at page 112, line 34 skipping to change at page 137, line 34
| flag | | | | flag | | |
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| Default LIE | Default | 3 seconds | | Default LIE | Default | 3 seconds |
| Holdtime | Value, | | | Holdtime | Value, | |
| | Configurable | | | | Configurable | |
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| TIE | Default | 1 second | | TIE | Default | 1 second |
| Retransmission | Value | | | Retransmission | Value | |
| Interval | | | | Interval | | |
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| TIDE | Default | 3 seconds | | TIDE | Default | 5 seconds |
| Generation | Value, | | | Generation | Value, | |
| Interval | Configurable | | | Interval | Configurable | |
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| MIN_TIEID | Constant | TIE Key with minimal values: | | MIN_TIEID | Constant | TIE Key with minimal values: |
| signifies | | TIEID(originator=0, | | signifies | | TIEID(originator=0, |
| start of TIDEs | | tietype=TIETypeMinValue, | | start of TIDEs | | tietype=TIETypeMinValue, |
| | | tie_nr=0, direction=South) | | | | tie_nr=0, direction=South) |
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| MAX_TIEID | Constant | TIE Key with maximal values: | | MAX_TIEID | Constant | TIE Key with maximal values: |
| signifies end | | TIEID(originator=MAX_UINT64, | | signifies end | | TIEID(originator=MAX_UINT64, |
| of TIDEs | | tietype=TIETypeMaxValue, | | of TIDEs | | tietype=TIETypeMaxValue, |
| | | tie_nr=MAX_UINT64, | | | | tie_nr=MAX_UINT64, |
| | | direction=North) | | | | direction=North) |
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
Table 6: all_constants Table 6: all_constants
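The timer values and TIDE range sentinels from Table 6 can be expressed as constants as below; the Python identifier names and the tuple encoding of a TIEID are illustrative, not the schema's own names.

```python
# Timer constants from Table 6 (draft-04 values).
DEFAULT_LIE_HOLDTIME_S = 3         # Default LIE Holdtime, configurable
TIE_RETRANSMISSION_INTERVAL_S = 1  # TIE Retransmission Interval
TIDE_GENERATION_INTERVAL_S = 5     # TIDE Generation Interval (was 3 s in -03)

MAX_UINT64 = 2**64 - 1
# TIDE range sentinels as (direction, originator, tietype, tie_nr) tuples.
MIN_TIEID = ("South", 0, "TIETypeMinValue", 0)
MAX_TIEID = ("North", MAX_UINT64, "TIETypeMaxValue", MAX_UINT64)
```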
Appendix D. TODO Appendix E. TODO
o explain what needs to be implemented for leaf/spine/superspine/ o explain what needs to be implemented for leaf/spine/superspine/
version in detail version in detail
o section on E-W superspine/ToF flooding scope to connect partitions o section on E-W superspine/ToF flooding scope to connect partitions
o finish security envelope, move remaining lifetime out of the TIE
packet so it can be modified independently of the SHA-1'd TIE
o move adjacency formation rules onto FSM text and remove 2.4.2 o move adjacency formation rules onto FSM text and remove 2.4.2
o add an intermediate state on multiple neighbors o Fill in example for multi-plane fabric and negative disaggregation
o write negative disaggregation example
o deal with the case of stale north TIEs stuck more than one level
up (propagate header description southbound)
Author's Address Author's Address
The RIFT Team The RIFT Team
"Heaven is under our feet as well as over our heads"
 End of changes. 326 change blocks. 
796 lines changed or deleted 1796 lines changed or added
