< draft-ietf-rift-rift-06.txt   draft-ietf-rift-rift-07.txt >
RIFT Working Group The RIFT Authors RIFT Working Group A. Przygienda, Ed.
Internet-Draft June 23, 2019 Internet-Draft Juniper
Intended status: Standards Track Intended status: Standards Track A. Sharma
Expires: December 25, 2019 Expires: February 15, 2020 Comcast
P. Thubert
Cisco
Bruno Rijsman
Individual
Dmitry Afanasiev
Yandex
August 14, 2019
RIFT: Routing in Fat Trees RIFT: Routing in Fat Trees
draft-ietf-rift-rift-06 draft-ietf-rift-rift-07
Abstract Abstract
This document outlines a specialized, dynamic routing protocol for This document outlines a specialized, dynamic routing protocol for
Clos and fat-tree network topologies. The protocol (1) deals with Clos and fat-tree network topologies. The protocol (1) deals with
fully automated construction of fat-tree topologies based on fully automated construction of fat-tree topologies based on
detection of links, (2) minimizes the amount of routing state held at detection of links, (2) minimizes the amount of routing state held at
each level, (3) automatically prunes and load balances topology each level, (3) automatically prunes and load balances topology
flooding exchanges over a sufficient subset of links, (4) supports flooding exchanges over a sufficient subset of links, (4) supports
automatic disaggregation of prefixes on link and node failures to automatic disaggregation of prefixes on link and node failures to
skipping to change at page 1, line 42 skipping to change at page 2, line 4
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/. Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on February 15, 2020.
This Internet-Draft will expire on December 25, 2019.
Copyright Notice Copyright Notice
Copyright (c) 2019 IETF Trust and the persons identified as the Copyright (c) 2019 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of (https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1. Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 5 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1. Requirements Language . . . . . . . . . . . . . . . . . . 7 2.1. Requirements Language . . . . . . . . . . . . . . . . . . 8
3. Reference Frame . . . . . . . . . . . . . . . . . . . . . . . 7 3. Reference Frame . . . . . . . . . . . . . . . . . . . . . . . 8
3.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 7 3.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 8
3.2. Topology . . . . . . . . . . . . . . . . . . . . . . . . 10 3.2. Topology . . . . . . . . . . . . . . . . . . . . . . . . 12
4. Requirement Considerations . . . . . . . . . . . . . . . . . 12 4. Requirement Considerations . . . . . . . . . . . . . . . . . 14
5. RIFT: Routing in Fat Trees . . . . . . . . . . . . . . . . . 15 5. RIFT: Routing in Fat Trees . . . . . . . . . . . . . . . . . 17
5.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . 16 5.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . 18
5.1.1. Properties . . . . . . . . . . . . . . . . . . . . . 16 5.1.1. Properties . . . . . . . . . . . . . . . . . . . . . 18
5.1.2. Generalized Topology View . . . . . . . . . . . . . . 16 5.1.2. Generalized Topology View . . . . . . . . . . . . . . 18
5.1.3. Fallen Leaf Problem . . . . . . . . . . . . . . . . . 26 5.1.3. Fallen Leaf Problem . . . . . . . . . . . . . . . . . 28
5.1.4. Discovering Fallen Leaves . . . . . . . . . . . . . . 28 5.1.4. Discovering Fallen Leaves . . . . . . . . . . . . . . 30
5.1.5. Addressing the Fallen Leaves Problem . . . . . . . . 29 5.1.5. Addressing the Fallen Leaves Problem . . . . . . . . 31
5.2. Specification . . . . . . . . . . . . . . . . . . . . . . 30 5.2. Specification . . . . . . . . . . . . . . . . . . . . . . 32
5.2.1. Transport . . . . . . . . . . . . . . . . . . . . . . 30 5.2.1. Transport . . . . . . . . . . . . . . . . . . . . . . 32
5.2.2. Link (Neighbor) Discovery (LIE Exchange) . . . . . . 31 5.2.2. Link (Neighbor) Discovery (LIE Exchange) . . . . . . 33
5.2.3. Topology Exchange (TIE Exchange) . . . . . . . . . . 33 5.2.3. Topology Exchange (TIE Exchange) . . . . . . . . . . 35
5.2.3.1. Topology Information Elements . . . . . . . . . . 33 5.2.3.1. Topology Information Elements . . . . . . . . . . 35
5.2.3.2. South- and Northbound Representation . . . . . . 33 5.2.3.2. South- and Northbound Representation . . . . . . 36
5.2.3.3. Flooding . . . . . . . . . . . . . . . . . . . . 36 5.2.3.3. Flooding . . . . . . . . . . . . . . . . . . . . 38
5.2.3.4. TIE Flooding Scopes . . . . . . . . . . . . . . . 36 5.2.3.4. TIE Flooding Scopes . . . . . . . . . . . . . . . 39
5.2.3.5. 'Flood Only Node TIEs' Bit . . . . . . . . . . . 39 5.2.3.5. 'Flood Only Node TIEs' Bit . . . . . . . . . . . 41
5.2.3.6. Initial and Periodic Database Synchronization . . 40 5.2.3.6. Initial and Periodic Database Synchronization . . 42
5.2.3.7. Purging and Roll-Overs . . . . . . . . . . . . . 40 5.2.3.7. Purging and Roll-Overs . . . . . . . . . . . . . 42
5.2.3.8. Southbound Default Route Origination . . . . . . 41 5.2.3.8. Southbound Default Route Origination . . . . . . 43
5.2.3.9. Northbound TIE Flooding Reduction . . . . . . . . 41 5.2.3.9. Northbound TIE Flooding Reduction . . . . . . . . 43
5.2.3.10. Special Considerations . . . . . . . . . . . . . 46 5.2.3.10. Special Considerations . . . . . . . . . . . . . 48
5.2.4. Reachability Computation . . . . . . . . . . . . . . 47 5.2.4. Reachability Computation . . . . . . . . . . . . . . 49
5.2.4.1. Northbound SPF . . . . . . . . . . . . . . . . . 47 5.2.4.1. Northbound SPF . . . . . . . . . . . . . . . . . 49
5.2.4.2. Southbound SPF . . . . . . . . . . . . . . . . . 48 5.2.4.2. Southbound SPF . . . . . . . . . . . . . . . . . 50
5.2.4.3. East-West Forwarding Within a Level . . . . . . . 48 5.2.4.3. East-West Forwarding Within a non-ToF Level . . . 50
5.2.5. Automatic Disaggregation on Link & Node Failures . . 48 5.2.4.4. East-West Links Within ToF Level . . . . . . . . 50
5.2.5.1. Positive, Non-transitive Disaggregation . . . . . 48 5.2.5. Automatic Disaggregation on Link & Node Failures . . 51
5.2.5.1. Positive, Non-transitive Disaggregation . . . . . 51
5.2.5.2. Negative, Transitive Disaggregation for Fallen 5.2.5.2. Negative, Transitive Disaggregation for Fallen
Leafs . . . . . . . . . . . . . . . . . . . . . . 52 Leafs . . . . . . . . . . . . . . . . . . . . . . 54
5.2.6. Attaching Prefixes . . . . . . . . . . . . . . . . . 56
5.2.6. Attaching Prefixes . . . . . . . . . . . . . . . . . 54 5.2.7. Optional Zero Touch Provisioning (ZTP) . . . . . . . 65
5.2.7. Optional Zero Touch Provisioning (ZTP) . . . . . . . 63 5.2.7.1. Terminology . . . . . . . . . . . . . . . . . . . 66
5.2.7.1. Terminology . . . . . . . . . . . . . . . . . . . 64 5.2.7.2. Automatic SystemID Selection . . . . . . . . . . 67
5.2.7.2. Automatic SystemID Selection . . . . . . . . . . 65 5.2.7.3. Generic Fabric Example . . . . . . . . . . . . . 68
5.2.7.3. Generic Fabric Example . . . . . . . . . . . . . 66 5.2.7.4. Level Determination Procedure . . . . . . . . . . 69
5.2.7.4. Level Determination Procedure . . . . . . . . . . 67 5.2.7.5. Resulting Topologies . . . . . . . . . . . . . . 70
5.2.7.5. Resulting Topologies . . . . . . . . . . . . . . 68 5.2.8. Stability Considerations . . . . . . . . . . . . . . 72
5.2.8. Stability Considerations . . . . . . . . . . . . . . 70 5.3. Further Mechanisms . . . . . . . . . . . . . . . . . . . 72
5.3. Further Mechanisms . . . . . . . . . . . . . . . . . . . 70 5.3.1. Overload Bit . . . . . . . . . . . . . . . . . . . . 72
5.3.1. Overload Bit . . . . . . . . . . . . . . . . . . . . 70 5.3.2. Optimized Route Computation on Leafs . . . . . . . . 72
5.3.2. Optimized Route Computation on Leafs . . . . . . . . 70 5.3.3. Mobility . . . . . . . . . . . . . . . . . . . . . . 73
5.3.3. Mobility . . . . . . . . . . . . . . . . . . . . . . 70 5.3.3.1. Clock Comparison . . . . . . . . . . . . . . . . 74
5.3.3.1. Clock Comparison . . . . . . . . . . . . . . . . 72
5.3.3.2. Interaction between Time Stamps and Sequence 5.3.3.2. Interaction between Time Stamps and Sequence
Counters . . . . . . . . . . . . . . . . . . . . 72 Counters . . . . . . . . . . . . . . . . . . . . 74
5.3.3.3. Anycast vs. Unicast . . . . . . . . . . . . . . . 73 5.3.3.3. Anycast vs. Unicast . . . . . . . . . . . . . . . 75
5.3.3.4. Overlays and Signaling . . . . . . . . . . . . . 73 5.3.3.4. Overlays and Signaling . . . . . . . . . . . . . 75
5.3.4. Key/Value Store . . . . . . . . . . . . . . . . . . . 74 5.3.4. Key/Value Store . . . . . . . . . . . . . . . . . . . 76
5.3.4.1. Southbound . . . . . . . . . . . . . . . . . . . 74 5.3.4.1. Southbound . . . . . . . . . . . . . . . . . . . 76
5.3.4.2. Northbound . . . . . . . . . . . . . . . . . . . 74 5.3.4.2. Northbound . . . . . . . . . . . . . . . . . . . 76
5.3.5. Interactions with BFD . . . . . . . . . . . . . . . . 74 5.3.5. Interactions with BFD . . . . . . . . . . . . . . . . 76
5.3.6. Fabric Bandwidth Balancing . . . . . . . . . . . . . 75 5.3.6. Fabric Bandwidth Balancing . . . . . . . . . . . . . 77
5.3.6.1. Northbound Direction . . . . . . . . . . . . . . 75 5.3.6.1. Northbound Direction . . . . . . . . . . . . . . 77
5.3.6.2. Southbound Direction . . . . . . . . . . . . . . 77 5.3.6.2. Southbound Direction . . . . . . . . . . . . . . 79
5.3.7. Label Binding . . . . . . . . . . . . . . . . . . . . 78 5.3.7. Label Binding . . . . . . . . . . . . . . . . . . . . 80
5.3.8. Segment Routing Support with RIFT . . . . . . . . . . 78 5.3.8. Segment Routing Support with RIFT . . . . . . . . . . 80
5.3.8.1. Global Segment Identifiers Assignment . . . . . . 78 5.3.8.1. Global Segment Identifiers Assignment . . . . . . 80
5.3.8.2. Distribution of Topology Information . . . . . . 78 5.3.8.2. Distribution of Topology Information . . . . . . 80
5.3.9. Leaf to Leaf Procedures . . . . . . . . . . . . . . . 79 5.3.9. Leaf to Leaf Procedures . . . . . . . . . . . . . . . 81
5.3.10. Address Family and Multi Topology Considerations . . 79 5.3.10. Address Family and Multi Topology Considerations . . 81
5.3.11. Reachability of Internal Nodes in the Fabric . . . . 79 5.3.11. Reachability of Internal Nodes in the Fabric . . . . 81
5.3.12. One-Hop Healing of Levels with East-West Links . . . 80 5.3.12. One-Hop Healing of Levels with East-West Links . . . 82
5.4. Security . . . . . . . . . . . . . . . . . . . . . . . . 80 5.4. Security . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.1. Security Model . . . . . . . . . . . . . . . . . . . 80 5.4.1. Security Model . . . . . . . . . . . . . . . . . . . 82
5.4.2. Security Mechanisms . . . . . . . . . . . . . . . . . 82 5.4.2. Security Mechanisms . . . . . . . . . . . . . . . . . 84
5.4.3. Security Envelope . . . . . . . . . . . . . . . . . . 82 5.4.3. Security Envelope . . . . . . . . . . . . . . . . . . 84
5.4.4. Weak Nonces . . . . . . . . . . . . . . . . . . . . . 85 5.4.4. Weak Nonces . . . . . . . . . . . . . . . . . . . . . 87
5.4.5. Lifetime . . . . . . . . . . . . . . . . . . . . . . 86 5.4.5. Lifetime . . . . . . . . . . . . . . . . . . . . . . 88
5.4.6. Key Management . . . . . . . . . . . . . . . . . . . 86 5.4.6. Key Management . . . . . . . . . . . . . . . . . . . 88
5.4.7. Security Association Changes . . . . . . . . . . . . 86 5.4.7. Security Association Changes . . . . . . . . . . . . 88
6. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1. Normal Operation . . . . . . . . . . . . . . . . . . . . 87 6. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2. Leaf Link Failure . . . . . . . . . . . . . . . . . . . . 88 6.1. Normal Operation . . . . . . . . . . . . . . . . . . . . 89
6.3. Partitioned Fabric . . . . . . . . . . . . . . . . . . . 89 6.2. Leaf Link Failure . . . . . . . . . . . . . . . . . . . . 90
6.3. Partitioned Fabric . . . . . . . . . . . . . . . . . . . 91
6.4. Northbound Partitioned Router and Optional East-West 6.4. Northbound Partitioned Router and Optional East-West
Links . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Links . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5. Multi-Plane Fabric and Negative Disaggregation . . . . . 92 7. Implementation and Operation: Further Details . . . . . . . . 93
7. Implementation and Operation: Further Details . . . . . . . . 92 7.1. Considerations for Leaf-Only Implementation . . . . . . . 93
7.1. Considerations for Leaf-Only Implementation . . . . . . . 92 7.2. Considerations for Spine Implementation . . . . . . . . . 94
7.2. Considerations for Spine Implementation . . . . . . . . . 93 7.3. Adaptations to Other Proposed Data Center Topologies . . 94
7.3. Adaptations to Other Proposed Data Center Topologies . . 93 7.4. Originating Non-Default Route Southbound . . . . . . . . 95
7.4. Originating Non-Default Route Southbound . . . . . . . . 94
8. Security Considerations . . . . . . . . . . . . . . . . . . . 95 8. Security Considerations . . . . . . . . . . . . . . . . . . . 95
8.1. General . . . . . . . . . . . . . . . . . . . . . . . . . 95 8.1. General . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.2. ZTP . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 8.2. ZTP . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.3. Lifetime . . . . . . . . . . . . . . . . . . . . . . . . 95 8.3. Lifetime . . . . . . . . . . . . . . . . . . . . . . . . 96
8.4. Packet Number . . . . . . . . . . . . . . . . . . . . . . 95 8.4. Packet Number . . . . . . . . . . . . . . . . . . . . . . 96
8.5. Outer Fingerprint Attacks . . . . . . . . . . . . . . . . 96 8.5. Outer Fingerprint Attacks . . . . . . . . . . . . . . . . 96
8.6. TIE Origin Fingerprint DoS Attacks . . . . . . . . . . . 96 8.6. TIE Origin Fingerprint DoS Attacks . . . . . . . . . . . 96
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 96 8.7. Host Implementations . . . . . . . . . . . . . . . . . . 97
10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 96 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 97
11. References . . . . . . . . . . . . . . . . . . . . . . . . . 97 9.1. Requested Multicast and Port Numbers . . . . . . . . . . 97
11.1. Normative References . . . . . . . . . . . . . . . . . . 97 9.2. Requested Registries with Suggested Values . . . . . . . 97
11.2. Informative References . . . . . . . . . . . . . . . . . 99 9.2.1. RIFT/common/AddressFamilyType . . . . . . . . . . . . 98
Appendix A. Sequence Number Binary Arithmetic . . . . . . . . . 102 9.2.1.1. Requested Entries . . . . . . . . . . . . . . . . 98
Appendix B. Information Elements Schema . . . . . . . . . . . . 103 9.2.2. RIFT/common/HierarchyIndications . . . . . . . . . . 98
B.1. common.thrift . . . . . . . . . . . . . . . . . . . . . . 103 9.2.2.1. Requested Entries . . . . . . . . . . . . . . . . 98
B.2. encoding.thrift . . . . . . . . . . . . . . . . . . . . . 109 9.2.3. RIFT/common/IEEE802_1ASTimeStampType . . . . . . . . 98
9.2.3.1. Requested Entries . . . . . . . . . . . . . . . . 98
9.2.4. RIFT/common/IPAddressType . . . . . . . . . . . . . . 98
9.2.4.1. Requested Entries . . . . . . . . . . . . . . . . 98
9.2.5. RIFT/common/IPPrefixType . . . . . . . . . . . . . . 99
9.2.5.1. Requested Entries . . . . . . . . . . . . . . . . 99
9.2.6. RIFT/common/IPv4PrefixType . . . . . . . . . . . . . 99
9.2.6.1. Requested Entries . . . . . . . . . . . . . . . . 99
9.2.7. RIFT/common/IPv6PrefixType . . . . . . . . . . . . . 99
9.2.7.1. Requested Entries . . . . . . . . . . . . . . . . 99
9.2.8. RIFT/common/PrefixSequenceType . . . . . . . . . . . 99
9.2.8.1. Requested Entries . . . . . . . . . . . . . . . . 99
9.2.9. RIFT/common/RouteType . . . . . . . . . . . . . . . . 100
9.2.9.1. Requested Entries . . . . . . . . . . . . . . . . 100
9.2.10. RIFT/common/TIETypeType . . . . . . . . . . . . . . . 100
9.2.10.1. Requested Entries . . . . . . . . . . . . . . . 100
9.2.11. RIFT/common/TieDirectionType . . . . . . . . . . . . 101
9.2.11.1. Requested Entries . . . . . . . . . . . . . . . 101
9.2.12. RIFT/encoding/Community . . . . . . . . . . . . . . . 101
9.2.12.1. Requested Entries . . . . . . . . . . . . . . . 101
9.2.13. RIFT/encoding/KeyValueTIEElement . . . . . . . . . . 101
9.2.13.1. Requested Entries . . . . . . . . . . . . . . . 101
9.2.14. RIFT/encoding/LIEPacket . . . . . . . . . . . . . . . 102
9.2.14.1. Requested Entries . . . . . . . . . . . . . . . 102
9.2.15. RIFT/encoding/LinkCapabilities . . . . . . . . . . . 103
9.2.15.1. Requested Entries . . . . . . . . . . . . . . . 103
9.2.16. RIFT/encoding/LinkIDPair . . . . . . . . . . . . . . 103
9.2.16.1. Requested Entries . . . . . . . . . . . . . . . 103
9.2.17. RIFT/encoding/Neighbor . . . . . . . . . . . . . . . 104
9.2.17.1. Requested Entries . . . . . . . . . . . . . . . 104
9.2.18. RIFT/encoding/NodeCapabilities . . . . . . . . . . . 104
9.2.18.1. Requested Entries . . . . . . . . . . . . . . . 104
9.2.19. RIFT/encoding/NodeFlags . . . . . . . . . . . . . . . 105
9.2.19.1. Requested Entries . . . . . . . . . . . . . . . 105
9.2.20. RIFT/encoding/NodeNeighborsTIEElement . . . . . . . . 105
9.2.20.1. Requested Entries . . . . . . . . . . . . . . . 105
9.2.21. RIFT/encoding/NodeTIEElement . . . . . . . . . . . . 105
9.2.21.1. Requested Entries . . . . . . . . . . . . . . . 105
9.2.22. RIFT/encoding/PacketContent . . . . . . . . . . . . . 106
9.2.22.1. Requested Entries . . . . . . . . . . . . . . . 106
9.2.23. RIFT/encoding/PacketHeader . . . . . . . . . . . . . 106
9.2.23.1. Requested Entries . . . . . . . . . . . . . . . 106
9.2.24. RIFT/encoding/PrefixAttributes . . . . . . . . . . . 107
9.2.24.1. Requested Entries . . . . . . . . . . . . . . . 107
9.2.25. RIFT/encoding/PrefixTIEElement . . . . . . . . . . . 107
9.2.25.1. Requested Entries . . . . . . . . . . . . . . . 107
9.2.26. RIFT/encoding/ProtocolPacket . . . . . . . . . . . . 107
9.2.26.1. Requested Entries . . . . . . . . . . . . . . . 107
9.2.27. RIFT/encoding/TIDEPacket . . . . . . . . . . . . . . 108
9.2.27.1. Requested Entries . . . . . . . . . . . . . . . 108
9.2.28. RIFT/encoding/TIEElement . . . . . . . . . . . . . . 108
9.2.28.1. Requested Entries . . . . . . . . . . . . . . . 108
9.2.29. RIFT/encoding/TIEHeader . . . . . . . . . . . . . . . 109
9.2.29.1. Requested Entries . . . . . . . . . . . . . . . 109
9.2.30. RIFT/encoding/TIEHeaderWithLifeTime . . . . . . . . . 110
9.2.30.1. Requested Entries . . . . . . . . . . . . . . . 110
9.2.31. RIFT/encoding/TIEID . . . . . . . . . . . . . . . . . 110
9.2.31.1. Requested Entries . . . . . . . . . . . . . . . 110
9.2.32. RIFT/encoding/TIEPacket . . . . . . . . . . . . . . . 111
9.2.32.1. Requested Entries . . . . . . . . . . . . . . . 111
9.2.33. RIFT/encoding/TIREPacket . . . . . . . . . . . . . . 111
9.2.33.1. Requested Entries . . . . . . . . . . . . . . . 111
10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 111
11. References . . . . . . . . . . . . . . . . . . . . . . . . . 112
11.1. Normative References . . . . . . . . . . . . . . . . . . 112
11.2. Informative References . . . . . . . . . . . . . . . . . 114
Appendix A. Sequence Number Binary Arithmetic . . . . . . . . . 116
Appendix B. Information Elements Schema . . . . . . . . . . . . 117
B.1. common.thrift . . . . . . . . . . . . . . . . . . . . . . 118
B.2. encoding.thrift . . . . . . . . . . . . . . . . . . . . . 124
Appendix C. Finite State Machines and Precise Operational Appendix C. Finite State Machines and Precise Operational
Specifications . . . . . . . . . . . . . . . . . . . 117 Specifications . . . . . . . . . . . . . . . . . . . 132
C.1. LIE FSM . . . . . . . . . . . . . . . . . . . . . . . . . 117 C.1. LIE FSM . . . . . . . . . . . . . . . . . . . . . . . . . 132
C.2. ZTP FSM . . . . . . . . . . . . . . . . . . . . . . . . . 123 C.2. ZTP FSM . . . . . . . . . . . . . . . . . . . . . . . . . 139
C.3. Flooding Procedures . . . . . . . . . . . . . . . . . . . 131 C.3. Flooding Procedures . . . . . . . . . . . . . . . . . . . 147
C.3.1. FloodState Structure per Adjacency . . . . . . . . . 132 C.3.1. FloodState Structure per Adjacency . . . . . . . . . 147
C.3.2. TIDEs . . . . . . . . . . . . . . . . . . . . . . . . 134 C.3.2. TIDEs . . . . . . . . . . . . . . . . . . . . . . . . 149
C.3.2.1. TIDE Generation . . . . . . . . . . . . . . . . . 134 C.3.2.1. TIDE Generation . . . . . . . . . . . . . . . . . 149
C.3.2.2. TIDE Processing . . . . . . . . . . . . . . . . . 135 C.3.2.2. TIDE Processing . . . . . . . . . . . . . . . . . 150
C.3.3. TIREs . . . . . . . . . . . . . . . . . . . . . . . . 136 C.3.3. TIREs . . . . . . . . . . . . . . . . . . . . . . . . 151
C.3.3.1. TIRE Generation . . . . . . . . . . . . . . . . . 136 C.3.3.1. TIRE Generation . . . . . . . . . . . . . . . . . 151
C.3.3.2. TIRE Processing . . . . . . . . . . . . . . . . . 136 C.3.3.2. TIRE Processing . . . . . . . . . . . . . . . . . 151
C.3.4. TIEs Processing on Flood State Adjacency . . . . . . 137 C.3.4. TIEs Processing on Flood State Adjacency . . . . . . 152
C.3.5. TIEs Processing When LSDB Received Newer Version on C.3.5. TIEs Processing When LSDB Received Newer Version on
Other Adjacencies . . . . . . . . . . . . . . . . . . 138 Other Adjacencies . . . . . . . . . . . . . . . . . . 153
C.3.6. Sending TIEs . . . . . . . . . . . . . . . . . . . . 138 C.3.6. Sending TIEs . . . . . . . . . . . . . . . . . . . . 153
Appendix D. Constants . . . . . . . . . . . . . . . . . . . . . 138 Appendix D. Constants . . . . . . . . . . . . . . . . . . . . . 153
D.1. Configurable Protocol Constants . . . . . . . . . . . . . 138 D.1. Configurable Protocol Constants . . . . . . . . . . . . . 153
Appendix E. TODO . . . . . . . . . . . . . . . . . . . . . . . . 140 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 155
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 140
1. Authors 1. Authors
This work is a product of a growing list of individuals. This work is a product of a list of individuals, all of whom are to be
considered major contributors regardless of whether their
names appear in the limited boilerplate author list.
Tony Przygienda, Ed | Alankar Sharma | Pascal Thubert Tony Przygienda, Ed. | Alankar Sharma | Pascal Thubert
Juniper Networks | Comcast | Cisco Juniper Networks | Comcast | Cisco
Bruno Rijsman | Ilya Vershkov | Dmitry Afanasiev Bruno Rijsman | Ilya Vershkov | Dmitry Afanasiev
Individual | Mellanox | Yandex Individual | Mellanox | Yandex
Don Fedyk | Alia Atlas | John Drake Don Fedyk | Alia Atlas | John Drake
Individual | Individual | Juniper Individual | Individual | Juniper
Table 1: RIFT Authors Table 1: RIFT Authors
2. Introduction 2. Introduction
Clos [CLOS] and Fat-Tree [FATTREE] topologies have gained prominence Clos [CLOS] and Fat-Tree [FATTREE] topologies have gained prominence
in today's networking, primarily as a result of the paradigm shift in today's networking, primarily as a result of the paradigm shift
towards a centralized data-center based architecture that is poised towards a centralized data-center based architecture that is poised
to deliver a majority of computation and storage services in the to deliver a majority of computation and storage services in the
future. Today's routing protocols were geared towards a future. Today's routing protocols were geared towards a
skipping to change at page 5, line 38 skipping to change at page 7, line 15
but rather because of the perceived capability to easily modify BGP and but rather because of the perceived capability to easily modify BGP and
the inherent difficulties with link-state [DIJKSTRA] based protocols the inherent difficulties with link-state [DIJKSTRA] based protocols
to optimize topology exchange and converge quickly in large-scale, to optimize topology exchange and converge quickly in large-scale,
densely meshed topologies. The incumbent protocols normally require densely meshed topologies. The incumbent protocols normally require
extensive configuration or provisioning during bring up and extensive configuration or provisioning during bring up and
re-dimensioning, which is only viable for a set of organizations with re-dimensioning, which is only viable for a set of organizations with
the corresponding networking operation skills and budgets. For the majority the corresponding networking operation skills and budgets. For the majority
of data center consumers, a preferable protocol would be one that of data center consumers, a preferable protocol would be one that
auto-configures itself and deals with failures and misconfigurations auto-configures itself and deals with failures and misconfigurations
with a minimum of human intervention. Such a solution would with a minimum of human intervention. Such a solution would
allow local IP fabric bandwidth to be consumed in a standardized allow local IP fabric bandwidth to be consumed in a 'standard
component fashion, i.e. provision it much faster and operate it at component' fashion, i.e. provision it much faster and operate it at
much lower costs, much like compute or storage is consumed today. much lower costs, much like compute or storage is consumed today.
In looking at the problem through the lens of data center In looking at the problem through the lens of data center
requirements, an optimal approach does not, however, seem to be a requirements, an optimal approach does not, however, seem to be a
simple modification of either a link-state (distributed computation) simple modification of either a link-state (distributed computation)
or distance-vector (diffused computation) approach but rather a or distance-vector (diffused computation) approach but rather a
mixture of both, colloquially best described as "link-state towards mixture of both, colloquially best described as "link-state towards
the spine" and "distance vector towards the leafs". In other words, the spine" and "distance vector towards the leafs". In other words,
"bottom" levels are flooding their link-state information in the "bottom" levels are flooding their link-state information in the
"northern" direction while each node generates under normal "northern" direction while each node generates under normal
skipping to change at page 6, line 20 skipping to change at page 7, line 46
simplified view of the resulting information and routes on a RIFT simplified view of the resulting information and routes on a RIFT
fabric. The top of the fabric is holding in its link-state database fabric. The top of the fabric is holding in its link-state database
the nodes below it and the routes to them. In the second row of the the nodes below it and the routes to them. In the second row of the
database table we indicate that partial information of other nodes in database table we indicate that partial information of other nodes in
the same level is available as well. The details of how this is the same level is available as well. The details of how this is
achieved will be postponed for the moment. When we look at the achieved will be postponed for the moment. When we look at the
"bottom" of the fabric, the leafs, we see that the topology is "bottom" of the fabric, the leafs, we see that the topology is
basically empty and they only hold a load balanced default route to basically empty and they only hold a load balanced default route to
the next level. the next level.
The balance of this document details the resulting protocol and fills The balance of this document details the requirements of a dedicated
in the missing details. fabric routing protocol, fills in the specification details and
ultimately includes resulting security considerations.
. [A,B,C,D] . [A,B,C,D]
. [E] . [E]
. +-----+ +-----+ . +-----+ +-----+
. | E | | F | A/32 @ [C,D] . | E | | F | A/32 @ [C,D]
. +-+-+-+ +-+-+-+ B/32 @ [C,D] . +-+-+-+ +-+-+-+ B/32 @ [C,D]
. | | | | C/32 @ C . | | | | C/32 @ C
. | | +-----+ | D/32 @ D . | | +-----+ | D/32 @ D
. | | | | . | | | |
. | +------+ | . | +------+ |
skipping to change at page 7, line 22 skipping to change at page 8, line 46
3.1. Terminology 3.1. Terminology
This section presents the terminology used in this document. It is This section presents the terminology used in this document. It is
assumed that the reader is thoroughly familiar with the terms and assumed that the reader is thoroughly familiar with the terms and
concepts used in OSPF [RFC2328] and IS-IS [ISO10589-Second-Edition], concepts used in OSPF [RFC2328] and IS-IS [ISO10589-Second-Edition],
[ISO10589] as well as the according graph theoretical concepts of [ISO10589] as well as the according graph theoretical concepts of
shortest path first (SPF) [DIJKSTRA] computation and directed acyclic shortest path first (SPF) [DIJKSTRA] computation and directed acyclic
graphs (DAG). graphs (DAG).
Crossbar: Physical arrangement of ports in a switching matrix
without implying any further scheduling or buffering disciplines.
Clos/Fat Tree: This document uses the terms Clos and Fat Tree
interchangeably; both always refer to a folded spine-and-leaf
topology with possibly multiple PoDs and one or multiple ToF
planes. Several modifications, such as leaf-2-leaf shortcuts and
multiple level shortcuts, are possible and are described further
in the document.
Folded Spine-and-Leaf: If the input and output stages of a Clos
fabric are analogous, the fabric can be "folded" to build a
"superspine" or top, which we call Top of Fabric (ToF) in this
document.
Level: Clos and Fat Tree networks are topologically partially Level: Clos and Fat Tree networks are topologically partially
ordered graphs and 'level' denotes the set of nodes at the same ordered graphs and 'level' denotes the set of nodes at the same
height in such a network, where the bottom level (leaf) is the height in such a network, where the bottom level (leaf) is the
level with lowest value. A node has links to nodes one level down level with lowest value. A node has links to nodes one level down
and/or one level up. Under some circumstances, a node may have and/or one level up. Under some circumstances, a node may have
links to nodes at the same level. As footnote: Clos terminology links to nodes at the same level. As footnote: Clos terminology
uses often the concept of "stage" but due to the folded nature of uses often the concept of "stage" but due to the folded nature of
the Fat Tree we do not use it to prevent misunderstandings. the Fat Tree we do not use it to prevent misunderstandings.
Superspine/Aggregation or Spine/Edge Levels: Traditional names in Superspine/Aggregation or Spine/Edge Levels: Traditional names in
skipping to change at page 8, line 45 skipping to change at page 10, line 34
Southbound Link: A link to a node one level down or in other words, Southbound Link: A link to a node one level down or in other words,
one level further south. one level further south.
East-West Link: A link between two nodes at the same level. East- East-West Link: A link between two nodes at the same level. East-
West links are normally not part of Clos or "fat-tree" topologies. West links are normally not part of Clos or "fat-tree" topologies.
Leaf shortcuts (L2L): East-West links at leaf level will need to be Leaf shortcuts (L2L): East-West links at leaf level will need to be
differentiated from East-West links at other levels. differentiated from East-West links at other levels.
Routing on the host (RotH): Modern data center architecture variant
where servers/leafs are multi-homed and consequently participate
in routing.
Southbound representation: Subset of topology information sent Southbound representation: Subset of topology information sent
towards a lower level. towards a lower level.
South Reflection: Often abbreviated just as "reflection" it defines South Reflection: Often abbreviated just as "reflection" it defines
a mechanism where South Node TIEs are "reflected" back up north to a mechanism where South Node TIEs are "reflected" back up north to
allow nodes in same level without E-W links to "see" each other. allow nodes in same level without E-W links to "see" each other.
TIE: This is an acronym for a "Topology Information Element". TIEs TIE: This is an acronym for a "Topology Information Element". TIEs
are exchanged between RIFT nodes to describe parts of a network are exchanged between RIFT nodes to describe parts of a network
such as links and address prefixes. A TIE can be thought of as such as links and address prefixes, in a fashion similar to ISIS
largely equivalent to ISIS LSPs or OSPF LSA. We will talk about LSPs or OSPF LSAs. We will talk about N-TIEs when talking about
N-TIEs when talking about TIEs in the northbound representation TIEs in the northbound representation and S-TIEs for the
and S-TIEs for the southbound equivalent. southbound equivalent.
Node TIE: This is an acronym for a "Node Topology Information Node TIE: This stands as acronym for a "Node Topology Information
Element", largely equivalent to OSPF Router LSA, i.e. it contains Element" that contains all adjacencies the node discovered and
all adjacencies the node discovered and information about node information about node itself.
itself.
Prefix TIE: This is an acronym for a "Prefix Topology Information Prefix TIE: This is an acronym for a "Prefix Topology Information
Element" and it contains all prefixes directly attached to this Element" and it contains all prefixes directly attached to this
node in case of a N-TIE and in case of S-TIE the necessary default node in case of a N-TIE and in case of S-TIE the necessary default
the node passes southbound. the node passes southbound.
Key Value TIE: A S-TIE that is carrying a set of key value pairs Key Value TIE: A S-TIE that is carrying a set of key value pairs
[DYNAMO]. It can be used to distribute information in the [DYNAMO]. It can be used to distribute information in the
southbound direction within the protocol. southbound direction within the protocol.
skipping to change at page 14, line 31 skipping to change at page 16, line 31
aggregating the prefixes to prevent black-holing in case of aggregating the prefixes to prevent black-holing in case of
failures. The de-aggregation should support maximum failures. The de-aggregation should support maximum
possible ECMP/N-ECMP remaining after failure. possible ECMP/N-ECMP remaining after failure.
REQ12: Reducing the scope of communication needed throughout the REQ12: Reducing the scope of communication needed throughout the
network on link and state failure, as well as reducing network on link and state failure, as well as reducing
advertisements of repeating or idiomatic information in advertisements of repeating or idiomatic information in
stable state is highly desirable since it leads to better stable state is highly desirable since it leads to better
stability and faster convergence behavior. stability and faster convergence behavior.
REQ13: Once a packet traverses a link in a "southbound" direction, REQ13: Under normal, fully converged condition, once a packet is
it must not take any further "northbound" steps along its forwarded along a link in a "southbound" direction, it must
path to delivery to its destination under normal, i.e. not take any further "northbound" links (Valley Free
fully converged, conditions. Taking a path through the Routing). Taking a path through the spine in cases where a
spine in cases where a shorter path is available is highly shorter path is available is highly undesirable (Bow Tying).
undesirable.
REQ14: Parallel links between same set of nodes must be REQ14: Parallel links between same set of nodes must be
distinguishable for SPF, failure and traffic engineering distinguishable for SPF, failure and traffic engineering
purposes. purposes.
REQ15: The protocol must not rely on interfaces having discernible REQ15: The protocol must support interfaces sharing the same
unique addresses, i.e. it must operate in presence of address. Specifically, it must operate in presence of
unnumbered links (even parallel ones) or links of a single unnumbered links (even parallel ones) and/or links of a
node having same addresses. single node being configured with same addresses.
REQ16: It would be desirable to achieve fast re-balancing of flows REQ16: It would be desirable to achieve fast re-balancing of flows
when links, especially towards the spines are lost or when links, especially towards the spines are lost or
provisioned without regressing to per flow traffic provisioned without regressing to per flow traffic
engineering which introduces significant amount of engineering which introduces significant amount of
complexity while possibly not being reactive enough to complexity while possibly not being reactive enough to
account for short-lived flows. account for short-lived flows.
REQ17: The control plane should be able to unambiguously determine REQ17: The control plane should be able to unambiguously determine
the current point of attachment (which port on which leaf the current point of attachment (which port on which leaf
node) of a prefix, even in a context of fast mobility, e.g., node) of a prefix, even in a context of fast mobility, e.g.,
when the prefix is a host address on a wireless node that 1) when the prefix is a host address on a wireless node that 1)
may associate to any of multiple access points (APs) that may associate to any of multiple access points (APs) that
are attached to different ports on a same leaf node or to are attached to different ports on a same leaf node or to
different leaf nodes, and 2) may move and reassociate different leaf nodes, and 2) may move and reassociate
several times to a different access point within a sub- several times to a different access point within a sub-
second period. second period.
REQ18: The protocol should provide security mechanisms that allow REQ18: The protocol must provide security mechanisms that allow the
to restrict nodes, especially leafs without proper operator to restrict nodes, especially leaf nodes without
credentials from forming three-way adjacencies. proper credentials, from forming a three-way adjacency and
participating in routing.
Following list represents possible requirements and requirements
under discussion:
PEND1: Supporting anything but point-to-point links is a non- Following list represents non-requirements:
requirement. Questions remain: for connecting to the
leaves, is there a case where multipoint is desirable? One
could still model it as point-to-point links; it seems there
is no need for anything more than a NBMA-type construct.
PEND2: What is the maximum scale of number leaf prefixes we need to PEND1: Supporting anything but point-to-point links is not
carry. 500'000 seems plenty even if we deploy RIFT down to necessary.
servers as leafs.
Finally, following are the non-requirements: Finally, following are the non-requirements:
NONREQ1: Broadcast media support is unnecessary. However, NONREQ1: Broadcast media support is unnecessary. However,
miscabling leading to multiple nodes on a broadcast miscabling leading to multiple nodes on a broadcast
segment must be operationally easily recognizable and segment must be operationally easily recognizable and
detectable while not taxing the protocol excessively. detectable while not taxing the protocol excessively.
NONREQ2: Purging link state elements is unnecessary given its NONREQ2: Purging link state elements is unnecessary given its
fragility and complexity and today's large memory size on fragility and complexity and today's large memory size on
skipping to change at page 16, line 4 skipping to change at page 17, line 46
NONREQ3: Special support for layer 3 multi-hop adjacencies is not NONREQ3: Special support for layer 3 multi-hop adjacencies is not
part of the protocol specification. Such support can be part of the protocol specification. Such support can be
easily provided by using tunneling technologies the same easily provided by using tunneling technologies the same
way IGPs today are solving the problem. way IGPs today are solving the problem.
5. RIFT: Routing in Fat Trees 5. RIFT: Routing in Fat Trees
Derived from the above requirements we present a detailed outline of Derived from the above requirements we present a detailed outline of
a protocol optimized for Routing in Fat Trees (RIFT) that in most a protocol optimized for Routing in Fat Trees (RIFT) that in most
abstract terms has many properties of a modified link-state protocol abstract terms has many properties of a modified link-state protocol
[RFC2328][ISO10589-Second-Edition] when "pointing north" and distance
[RFC2328][ISO10589-Second-Edition] when "pointing north" and path-
vector [RFC4271] protocol when "pointing south". While this is an vector [RFC4271] protocol when "pointing south". While this is an
unusual combination, it does quite naturally exhibit the desirable unusual combination, it does quite naturally exhibit the desirable
properties we seek. properties we seek.
5.1. Overview 5.1. Overview
5.1.1. Properties 5.1.1. Properties
The most singular property of RIFT is that it floods flat link-state The most singular property of RIFT is that it floods flat link-state
information northbound only so that each level obtains the full information northbound only so that each level obtains the full
skipping to change at page 16, line 31 skipping to change at page 18, line 26
propagates one hop south and is 're-advertised' by nodes at next propagates one hop south and is 're-advertised' by nodes at next
lower level, normally just the default route. However, RIFT uses lower level, normally just the default route. However, RIFT uses
flooding in the southern direction as well to avoid the necessity to flooding in the southern direction as well to avoid the necessity to
build an update per adjacency. We omit describing the East-West build an update per adjacency. We omit describing the East-West
direction for the moment. direction for the moment.
Those information flow constraints create not only an anisotropic Those information flow constraints create not only an anisotropic
protocol (i.e. the information is not distributed "evenly" or protocol (i.e. the information is not distributed "evenly" or
"clumped" but summarized along the N-S gradient) but also a "smooth" "clumped" but summarized along the N-S gradient) but also a "smooth"
information propagation where nodes do not receive the same information propagation where nodes do not receive the same
information from multiple fronts which would force them to perform a information from multiple directions at the same time. Normally,
diffused computation to tie-break the same reachability information accepting the same reachability on any link without understanding its
arriving on arbitrary links and ultimately force hop-by-hop topological significance forces tie-breaking on some kind of distance
forwarding on shortest-paths only. The application of those metric and ultimately leads in hop-by-hop forwarding substrates to
principles lead to RIFT having moreover the highly desirable utilization of variants of shortest paths only. RIFT under normal
properties of being loop-free and guaranteeing valley-free forwarding conditions does not need to reconcile same reachability information
behavior. from multiple directions and its computation principles (south
forwarding direction is always preferred) lead to valley-free
forwarding behavior. And since valley free routing is loop-free it
can use all feasible paths, another highly desirable property if
available bandwidth should be utilized to the maximum extent
possible.
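The valley-free property described above can be illustrated with a small check over the sequence of levels a packet visits. This is only a sketch under stated assumptions: the function name `is_valley_free` and the representation of a path as a list of integer levels (leaf being the lowest value) are hypothetical and not part of the specification.

```python
from typing import Sequence


def is_valley_free(levels: Sequence[int]) -> bool:
    """Return True if the path never goes north after having gone south.

    `levels` is the level of each node along the path, in forwarding
    order. This representation is an illustrative assumption.
    """
    gone_south = False
    for prev, cur in zip(levels, levels[1:]):
        if cur < prev:
            # Southbound step: from now on no northbound step is allowed.
            gone_south = True
        elif cur > prev and gone_south:
            # Northbound step after a southbound one forms a "valley".
            return False
    return True


# leaf -> spine -> ToF -> spine -> leaf is valley-free
print(is_valley_free([0, 1, 2, 1, 0]))
# leaf -> spine -> leaf -> spine -> leaf bounces through a leaf
print(is_valley_free([0, 1, 0, 1, 0]))
```

A path that passes this check cannot revisit a level in the northbound direction once it has turned south, which is the intuition behind the loop-free claim in the paragraph above.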
To account for the "northern" and the "southern" information split To account for the "northern" and the "southern" information split
the link state database is partitioned into "north representation" the link state database is accordingly partitioned into "north
and "south representation" TIEs, whereas in simplest terms the N-TIEs representation" and "south representation" TIEs. In simplest terms
contain a link state topology description of lower levels and the N-TIEs contain a link state topology description of lower levels
S-TIEs carry simply default routes. This oversimplified view will be and S-TIEs carry simply default routes of the level above. This
refined gradually in following sections while introducing protocol oversimplified view will be refined gradually in following sections
procedures aimed to fulfill the described requirements. while introducing protocol procedures aimed to fulfill the described
requirements.
5.1.2. Generalized Topology View 5.1.2. Generalized Topology View
This section will dwell on the topologies addressed by RIFT including This section will shed some light on the topologies addressed by RIFT
multi plane fabrics and their related implications. Readers that are including multi plane fabrics and their related implications.
only interested in single plane designs, i.e. all top-of-fabric nodes Readers that are only interested in single plane designs, i.e. all
being topologically equal and initially connected to all the switches top-of-fabric nodes being topologically equal and initially connected
at the level below them can skip this section and resulting to all the switches at the level below them can skip this section and
Section 5.2.5.2 as well. resulting Section 5.2.5.2 as well.
Given the difficulty of visualizing multi plane designs which are It is quite difficult to visualize multi plane designs which are
effectively multi-dimensional switching matrices we will introduce a effectively multi-dimensional switching matrices. To cope with that,
methodology allowing us to visualize the connectivity in a two- we will introduce a methodology allowing us to depict the
dimensional document and leverage the fact that we are dealing connectivity in a two-dimensional plane. Further, we will leverage
basically with crossbar fabrics stacked on top of each other where the fact that we are dealing basically with crossbar fabrics stacked
ports also align "on top of each other" in a regular fashion. on top of each other where ports align "on top of each other" in a
regular fashion.
As a word of caution to the reader at this point it should be
observed that the language used to describe Clos variations,
especially in multi-plane designs varies widely between sources.
This description follows the terminology introduced in Section 3.1 and it is
paramount to have it present to follow the rest of this section
correctly.
The typical topology for which RIFT is defined is built of a number P The typical topology for which RIFT is defined is built of a number P
of PoDs, connected together by a number S of spine nodes. A PoD node of PoDs, connected together by a number S of ToF nodes. A PoD node
has a number of ports called Radix, with half of them (K=Radix/2) has a number of ports called Radix, with half of them (K=Radix/2)
used to connect host devices from the south, and half to connect to used to connect host devices from the south, and half to connect to
interleaved PoD Top-Level switches to the north. Ratio K can be interleaved PoD Top-Level switches to the north. Ratio K can be
chosen differently without loss of generality when port speeds differ chosen differently without loss of generality when port speeds differ
or fabric is oversubscribed but K=R/2 allows for more readable or fabric is oversubscribed but K=R/2 allows for more readable
representation whereby there are as many ports facing north as south representation whereby there are as many ports facing north as south
on any intermediate node. We represent a node hence in a schematic on any intermediate node. We represent a node hence in a schematic
fashion with ports "sticking out" to its north and south rather than fashion with ports "sticking out" to its north and south rather than
by the usual real-world front faceplate designs of the day. by the usual real-world front faceplate designs of the day.
skipping to change at page 28, line 8 skipping to change at page 30, line 8
problem by distributing host routes will be able to converge only problem by distributing host routes will be able to converge only
using paths through leafs, i.e. the flooding of information on using paths through leafs, i.e. the flooding of information on
Leaf122 will go up to Top-of-Fabric A and then "loopback" over Leaf122 will go up to Top-of-Fabric A and then "loopback" over
other leafs to ToF B leading in extreme cases to traffic for other leafs to ToF B leading in extreme cases to traffic for
Leaf122 when presented to plane B taking an "inverted fabric" path Leaf122 when presented to plane B taking an "inverted fabric" path
where leafs start to serve as TOFs. where leafs start to serve as TOFs.
5.1.4. Discovering Fallen Leaves 5.1.4. Discovering Fallen Leaves
As we illustrate later and without further proof here, to deal with As we illustrate later and without further proof here, to deal with
fallen leafs in multi-plane designs RIFT requires all the ToF nodes fallen leafs in multi-plane designs when aggregation is used RIFT
to share the same topology database. This happens naturally in requires all the ToF nodes to share the same topology database. This
single plane design but needs additional considerations in multi- happens naturally in single plane design but needs additional
plane fabrics. To satisfy this RIFT in multi-plane designs relies at considerations in multi-plane fabrics. To satisfy this RIFT in
the ToF Level on ring interconnection of switches in multiple planes. multi-plane designs relies at the ToF Level on ring interconnection
Other solutions are possible but they either need more cabling or end of switches in multiple planes. Other solutions are possible but
up having much longer flooding path and/or single points of failure. they either need more cabling or end up having much longer flooding
path and/or single points of failure.
In more detail, by reserving two ports on each Top-of-Fabric node it In more detail, by reserving two ports on each Top-of-Fabric node it
is possible to connect them together in an interplane bi-directional is possible to connect them together in an interplane bi-directional
ring as illustrated in Figure 13 (where we show a bi-directional ring ring as illustrated in Figure 13 (where we show a bi-directional ring
connecting switches across planes). The rings will exchange full connecting switches across planes). The rings will exchange full
topology information between planes and with that allow consequently topology information between planes and with that allow consequently
by the means of transitive, negative disaggregation described in by the means of transitive, negative disaggregation described in
Section 5.2.5.2 to efficiently fix any possible fallen leaf scenario. Section 5.2.5.2 to efficiently fix any possible fallen leaf scenario.
Somewhat as a side-effect, the exchange of information fulfills the Somewhat as a side-effect, the exchange of information fulfills the
requirement to present full view of the fabric topology at the Top- requirement to present full view of the fabric topology at the Top-
skipping to change at page 30, line 44 skipping to change at page 32, line 44
packet typically occurs at the leaf level and the disaggregation must packet typically occurs at the leaf level and the disaggregation must
be transitive and reach all the leaves. In that case, the negative be transitive and reach all the leaves. In that case, the negative
disaggregation is necessary. The details on the RIFT approach to disaggregation is necessary. The details on the RIFT approach to
deal with fallen leafs in an optimal way are specified in deal with fallen leafs in an optimal way are specified in
Section 5.2.5.2. Section 5.2.5.2.
5.2. Specification 5.2. Specification
5.2.1. Transport 5.2.1. Transport
All packet formats are defined in Thrift models in Appendix B. All packet formats are defined in Thrift [thrift] models in
Appendix B.
The serialized model is carried in an envelope within a UDP frame The serialized model is carried in an envelope within a UDP frame
that provides security and allows validation/modification of several that provides security and allows validation/modification of several
important fields without de-serialization for performance and important fields without de-serialization for performance and
security reasons. security reasons.
5.2.2. Link (Neighbor) Discovery (LIE Exchange) 5.2.2. Link (Neighbor) Discovery (LIE Exchange)
LIE exchange happens over well-known administratively locally scoped LIE exchange happens over well-known administratively locally scoped
and configured or otherwise well-known IPv4 multicast address and configured or otherwise well-known IPv4 multicast address
[RFC2365] or link-local multicast scope [RFC4291] for IPv6 [RFC8200] [RFC2365] and/or link-local multicast scope [RFC4291] for IPv6
using a configured or otherwise a well-known destination UDP port [RFC8200] using a configured or otherwise a well-known destination
defined in Appendix D.1. LIEs SHOULD be sent with a TTL of 1 to UDP port defined in Appendix D.1. LIEs SHOULD be sent with a TTL of
prevent RIFT information reaching beyond a single L3 next-hop in the 1 to prevent RIFT information reaching beyond a single L3 next-hop in
topology. LIEs SHOULD be sent with network control precedence. the topology. LIEs SHOULD be sent with network control precedence.
Originating port of the LIE has no further significance other than Originating port of the LIE has no further significance other than
identifying the origination point. LIEs are exchanged over all links identifying the origination point. LIEs are exchanged over all links
running RIFT. An implementation MAY listen and send LIEs on IPv4 running RIFT.
and/or IPv6 multicast addresses. LIEs on same link are considered
part of the same negotiation independent of the address family they
arrive on. Observe further that the LIE source address may not
identify the peer uniquely in unnumbered or link-local address cases
so the response transmission MUST occur over the same interface the
LIEs have been received on. A node CAN use any of the adjacency's
source addresses it saw in LIEs on the specific interface during
adjacency formation to send TIEs. That implies that an
implementation MUST be ready to accept TIEs on all addresses it used
as source of LIE frames.
Observe further that the protocol does NOT support selective An implementation MAY listen and send LIEs on IPv4 and/or IPv6
disabling of address families or any local address changes in three multicast addresses. A node MUST NOT originate LIEs on an address
way state, i.e. if a link has entered three way IPv4 and/or IPv6 with family if it does not process received LIEs on that family. LIEs on
a neighbor on an adjacency and it wants to stop supporting one of the same link are considered part of the same negotiation independent of
families or change any of its local addresses, it has to tear down the address family they arrive on. Observe further that the LIE
and rebuild the adjacency. It also has to remove any information it source address may not identify the peer uniquely in unnumbered or
stored about adjacency's LIE source addresses seen. link-local address cases so the response transmission MUST occur over
the same interface the LIEs have been received on. A node CAN use
any of the adjacency's source addresses it saw in LIEs on the
specific interface during adjacency formation to send TIEs. That
implies that an implementation MUST be ready to accept TIEs on all
addresses it used as source of LIE frames.
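The recommendation that LIEs be sent with a TTL of 1 can be sketched with a plain UDP socket. This is a minimal illustration, not the protocol's reference behaviour: the helper name `make_lie_send_socket` is hypothetical, only IPv4 is shown, and the multicast group and destination port (defined in Appendix D.1) are deliberately left out rather than guessed.

```python
import socket


def make_lie_send_socket(ttl: int = 1) -> socket.socket:
    """Create an IPv4 UDP socket suitable for sending multicast LIEs.

    A multicast TTL of 1 keeps the packet from travelling beyond a
    single L3 hop. Group address and port are not configured here.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


sock = make_lie_send_socket()
print(sock.getsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL))
sock.close()
```

An implementation would additionally bind the socket to the interface on which the LIEs are exchanged, since responses must go out over the same interface the LIEs were received on.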
All RIFT routers MUST support IPv4 forwarding and MAY support IPv6 A three way adjacency over any address family implies support for
forwarding. A three way adjacency over IPv6 addresses implies IPv4 forwarding if the `v4_forwarding_capable` flag is set to true
support for IPv4 forwarding. and a node can use [RFC5549] type of forwarding in such a situation.
It is expected that the whole fabric supports the same type of
forwarding of address families on all the links. Operation of a
fabric where only some of the links are supporting forwarding on an
address family and others do not is outside the scope of this
specification.
Observe further that the protocol does NOT support selective
disabling of address families, disabling v4 forwarding capability or
any local address changes in three way state, i.e. if a link has
entered three way IPv4 and/or IPv6 with a neighbor on an adjacency
and it wants to stop supporting one of the families or change any of
its local addresses or stop v4 forwarding, it has to tear down and
rebuild the adjacency. It also has to remove any information it
stored about the adjacency such as LIE source addresses seen.
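The rule above (no selective disabling of address families, no local address changes and no disabling of v4 forwarding while in three way state) can be expressed as a small predicate. This is a sketch under stated assumptions: `LinkConfig` and `requires_adjacency_reset` are hypothetical names, and the fields only loosely mirror the text; the real data model is the Thrift schema in Appendix B.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class LinkConfig:
    """Illustrative local link settings (not the Thrift model)."""
    address_families: FrozenSet[str]
    local_addresses: FrozenSet[str]
    v4_forwarding_capable: bool


def requires_adjacency_reset(in_three_way: bool,
                             old: LinkConfig,
                             new: LinkConfig) -> bool:
    """Return True if applying `new` requires tearing down the adjacency."""
    if not in_three_way:
        # Before three way state nothing has to be torn down.
        return False
    family_removed = not old.address_families <= new.address_families
    addresses_changed = old.local_addresses != new.local_addresses
    v4_fwd_stopped = old.v4_forwarding_capable and not new.v4_forwarding_capable
    return family_removed or addresses_changed or v4_fwd_stopped


old = LinkConfig(frozenset({"ipv4", "ipv6"}), frozenset({"192.0.2.1"}), True)
new = LinkConfig(frozenset({"ipv4"}), frozenset({"192.0.2.1"}), True)
print(requires_adjacency_reset(True, old, new))
```

When the predicate is true, the node would tear down and rebuild the adjacency and forget the stored LIE source addresses, as the paragraph above requires.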
Unless Section 5.2.7 is used, each node is provisioned with the level Unless Section 5.2.7 is used, each node is provisioned with the level
at which it is operating and its PoD (or otherwise a default level at which it is operating and its PoD (or otherwise a default level
and "undefined" PoD are assumed; meaning that leafs do not need to be and "undefined" PoD are assumed; meaning that leafs do not need to be
configured at all if initial configuration values are all left at 0). configured at all if initial configuration values are all left at 0).
Nodes in the spine are configured with "any" PoD which has the same Nodes in the spine are configured with "any" PoD which has the same
value as "undefined" PoD; hence we will talk about "undefined/any" PoD. value as "undefined" PoD; hence we will talk about "undefined/any" PoD.
This information is propagated in the LIEs exchanged. This information is propagated in the LIEs exchanged.
Further definitions of leaf flags are found in Section 5.2.7 given Further definitions of leaf flags are found in Section 5.2.7 given
skipping to change at page 36, line 19 skipping to change at page 38, line 31
Figure 14: example TIEs generated in a 2 level spine-and-leaf Figure 14: example TIEs generated in a 2 level spine-and-leaf
topology topology
5.2.3.3. Flooding 5.2.3.3. Flooding
The mechanism used to distribute TIEs is the well-known (albeit The mechanism used to distribute TIEs is the well-known (albeit
modified in several respects to address fat tree requirements) modified in several respects to address fat tree requirements)
flooding mechanism used by today's link-state protocols. Although flooding mechanism used by today's link-state protocols. Although
flooding is initially more demanding to implement it avoids many flooding is initially more demanding to implement it avoids many
problems with update style used in diffused computation such as path problems with update style used in diffused computation such as
vector protocols. Since flooding tends to present an unscalable distance vector protocols. Since flooding tends to present an
burden in large, densely meshed topologies (fat trees being unscalable burden in large, densely meshed topologies (fat trees
unfortunately such a topology) we provide as solution a close to being unfortunately such a topology) we provide as solution a close
optimal global flood reduction and load balancing optimization in to optimal global flood reduction and load balancing optimization in
Section 5.2.3.9. Section 5.2.3.9.
As described before, TIEs themselves are transported over UDP with As described before, TIEs themselves are transported over UDP with
the ports indicated in the LIE exchanges and using the destination the ports indicated in the LIE exchanges and using the destination
address on which the LIE adjacency has been formed. For unnumbered address on which the LIE adjacency has been formed. For unnumbered
IPv4 interfaces same considerations apply as in equivalent OSPF case. IPv4 interfaces same considerations apply as in equivalent OSPF case.
On reception of a TIE with an undefined level value in the packet On reception of a TIE with an undefined level value in the packet
header the node SHOULD issue a warning and indiscriminately discard header the node SHOULD issue a warning and indiscriminately discard
the packet. the packet.
skipping to change at page 37, line 20 skipping to change at page 39, line 34
disconnected node in a given level to receive the S-TIEs of other disconnected node in a given level to receive the S-TIEs of other
nodes at its level, every *NODE* S-TIE is "reflected" northbound to nodes at its level, every *NODE* S-TIE is "reflected" northbound to
level from which it was received. It should be noted that East-West level from which it was received. It should be noted that East-West
links are included in South TIE flooding (except at ToF level); those links are included in South TIE flooding (except at ToF level); those
TIEs need to be flooded to satisfy algorithms in Section 5.2.4. In TIEs need to be flooded to satisfy algorithms in Section 5.2.4. In
that way nodes at same level can learn about each other without a that way nodes at same level can learn about each other without a
lower level, e.g. in case of leaf level. The precise flooding scopes lower level, e.g. in case of leaf level. The precise flooding scopes
are given in Table 3. Those rules govern as well what SHOULD be are given in Table 3. Those rules govern as well what SHOULD be
included in TIDEs on the adjacency. Again, East-West flooding scopes included in TIDEs on the adjacency. Again, East-West flooding scopes
are identical to South flooding scopes except in case of ToF East- are identical to South flooding scopes except in case of ToF East-
West links (rings). West links (rings) which are basically performing northbound
flooding.
Node S-TIE "south reflection" allows to support positive Node S-TIE "south reflection" allows to support positive
disaggregation on failures described in Section 5.2.5 and flooding disaggregation on failures described in Section 5.2.5 and flooding
reduction in Section 5.2.3.9. reduction in Section 5.2.3.9.
+-----------+---------------------+---------------+-----------------+ +-----------+---------------------+---------------+-----------------+
| Type / | South | North | East-West | | Type / | South | North | East-West |
| Direction | | | | | Direction | | | |
+-----------+---------------------+---------------+-----------------+ +-----------+---------------------+---------------+-----------------+
| node | flood if level of | flood if | flood only if | | node | flood if level of | flood if | flood only if |
skipping to change at page 47, line 17 skipping to change at page 49, line 17
A node has three sources of relevant information. A node knows the A node has three sources of relevant information. A node knows the
full topology south from the received N-TIEs. A node has the set of full topology south from the received N-TIEs. A node has the set of
prefixes with associated distances and bandwidths from received prefixes with associated distances and bandwidths from received
S-TIEs. S-TIEs.
To compute reachability, a node runs conceptually a northbound and a To compute reachability, a node runs conceptually a northbound and a
southbound SPF. We call that N-SPF and S-SPF. southbound SPF. We call that N-SPF and S-SPF.
Since neither computation can "loop", it is possible to compute non- Since neither computation can "loop", it is possible to compute non-
equal-cost or even k-shortest paths [EPPSTEIN] and "saturate" the equal-cost or even k-shortest paths [EPPSTEIN] and "saturate" the
fabric to the extent desired. fabric to the extent desired but we use simple, familiar SPF
algorithms and concepts here due to their prevalence in today's
routing.
5.2.4.1. Northbound SPF 5.2.4.1. Northbound SPF
N-SPF uses northbound and East-West adjacencies in the computing N-SPF uses northbound and East-West adjacencies in the computing
node's node N-TIEs (since if the node is a leaf it may not have node's node N-TIEs (since if the node is a leaf it may not have
generated a node S-TIE) when starting Dijkstra. Observe that N-SPF generated a node S-TIE) when starting Dijkstra. Observe that N-SPF
is really just a one hop variety since Node S-TIEs are not re-flooded is really just a one hop variety since Node S-TIEs are not re-flooded
southbound beyond a single level (or East-West) and with that the southbound beyond a single level (or East-West) and with that the
computation cannot progress beyond adjacent nodes. computation cannot progress beyond adjacent nodes.
skipping to change at page 48, line 8 skipping to change at page 50, line 8
multi-plane designs. multi-plane designs.
Other south prefixes found when crossing E-W link MAY be used if and only if Other south prefixes found when crossing E-W link MAY be used if and only if
1. no north neighbors are advertising same or supersuming non- 1. no north neighbors are advertising same or supersuming non-
default prefix AND default prefix AND
2. the node does not originate a non-default supersuming prefix 2. the node does not originate a non-default supersuming prefix
itself. itself.
i.e. the E-W link can be used as the gateway of last resort for a i.e. the E-W link can be used as a gateway of last resort for a
specific prefix only. Using south prefixes across E-W link can be specific prefix only. Using south prefixes across E-W link can be
beneficial e.g. on automatic de-aggregation in pathological fabric beneficial e.g. on automatic de-aggregation in pathological fabric
partitioning scenarios. partitioning scenarios.
A detailed example can be found in Section 6.4. A detailed example can be found in Section 6.4.
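The two conditions above can be sketched as a simple predicate. This is a minimal illustration, not part of the specification: the function name and its arguments are hypothetical, and it assumes that "supersuming" for the node's own prefixes means a strictly shorter covering prefix, while for north neighbors the same prefix also blocks use of the E-W link.

```python
from ipaddress import ip_network


def may_use_ew_prefix(prefix, north_prefixes, own_prefixes):
    """Return True if a south prefix learned across an E-W link may be used.

    prefix         -- the south prefix found when crossing the E-W link
    north_prefixes -- prefixes advertised by north neighbors (hypothetical input)
    own_prefixes   -- prefixes the node originates itself (hypothetical input)
    """
    p = ip_network(prefix)

    def covered(advertised, allow_equal):
        for text in advertised:
            n = ip_network(text)
            if n.version != p.version or n.prefixlen == 0:
                continue  # other address family, or the default route
            if p.subnet_of(n) and (allow_equal or n.prefixlen < p.prefixlen):
                return True
        return False

    # Condition 1: no north neighbor advertises the same or a supersuming
    # non-default prefix.
    # Condition 2: the node does not originate a non-default supersuming prefix.
    return not covered(north_prefixes, True) and not covered(own_prefixes, False)
```

With only a default route learned from the north, the E-W prefix acts as a gateway of last resort; any more specific covering prefix from the north, or originated locally, disables it.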
5.2.4.2. Southbound SPF 5.2.4.2. Southbound SPF
S-SPF uses only the southbound adjacencies in the node S-TIEs, i.e. S-SPF uses only the southbound adjacencies in the node S-TIEs, i.e.
progresses towards nodes at lower levels. Observe that E-W progresses towards nodes at lower levels. Observe that E-W
adjacencies are NEVER used in the computation. This enforces the adjacencies are NEVER used in the computation. This enforces the
requirement that a packet traversing in a southbound direction must requirement that a packet traversing in a southbound direction must
never change its direction. never change its direction.
S-SPF uses northbound adjacencies in node N-TIEs to verify backlink S-SPF uses northbound adjacencies in node N-TIEs to verify backlink
connectivity. connectivity.
5.2.4.3. East-West Forwarding Within a Level 5.2.4.3. East-West Forwarding Within a non-ToF Level
Ultimately, it should be observed that in presence of a "ring" of E-W Ultimately, it should be observed that in presence of a "ring" of E-W
links in a level neither SPF will provide a "ring protection" scheme links in any level (except ToF level) neither SPF will provide a
since such a computation would have to deal necessarily with breaking "ring protection" scheme since such a computation would have to deal
of "loops" in generic Dijkstra sense; an application for which RIFT necessarily with breaking of "loops" in generic Dijkstra sense; an
is not intended. It is outside the scope of this document how an application for which RIFT is not intended. It is outside the scope
underlay can be used to provide a full-mesh connectivity between of this document how an underlay can be used to provide a full-mesh
nodes in the same level that would allow for N-SPF to provide connectivity between nodes in the same level that would allow for
protection for a single node losing all its northbound adjacencies N-SPF to provide protection for a single node losing all its
(as long as any of the other nodes in the level are northbound northbound adjacencies (as long as any of the other nodes in the
connected). level are northbound connected).
Using south prefixes over horizontal links is optional and can Using south prefixes over horizontal links is optional and can
protect against pathological fabric partitioning cases that leave protect against pathological fabric partitioning cases that leave
only paths to destinations that would necessitate multiple changes of only paths to destinations that would necessitate multiple changes of
forwarding direction between north and south. forwarding direction between north and south.
5.2.4.4. East-West Links Within ToF Level
E-W ToF links behave in terms of flooding scopes defined in
Section 5.2.3.4 like northbound links. Even though a ToF node could
be tempted to use those links during southbound SPF this MUST NOT be
attempted since it may lead, e.g. in anycast cases, to routing loops.
An implementation could try to resolve the looping problem by following
on the ring strictly tie-broken shortest-paths only but the details
are outside this specification. And even then, the problem of proper
capacity provisioning of such links when they become traffic-bearing
in case of failures is vexing.
5.2.5. Automatic Disaggregation on Link & Node Failures 5.2.5. Automatic Disaggregation on Link & Node Failures
5.2.5.1. Positive, Non-transitive Disaggregation 5.2.5.1. Positive, Non-transitive Disaggregation
Under normal circumstances, node's S-TIEs contain just the Under normal circumstances, node's S-TIEs contain just the
adjacencies and a default route. However, if a node detects that its adjacencies and a default route. However, if a node detects that its
default IP prefix covers one or more prefixes that are reachable default IP prefix covers one or more prefixes that are reachable
through it but not through one or more other nodes at the same level, through it but not through one or more other nodes at the same level,
then it MUST explicitly advertise those prefixes in an S-TIE. then it MUST explicitly advertise those prefixes in an S-TIE.
Otherwise, some percentage of the northbound traffic for those Otherwise, some percentage of the northbound traffic for those
prefixes would be sent to nodes without according reachability, prefixes would be sent to nodes without according reachability,
causing it to be black-holed. Even when not black-holing, the causing it to be black-holed. Even when not black-holing, the
resulting forwarding could 'backhaul' packets through the higher resulting forwarding could 'backhaul' packets through the higher
level spines, clearly an undesirable condition affecting the blocking level spines, clearly an undesirable condition affecting the blocking
probabilities of the fabric. probabilities of the fabric.
We refer to the process of advertising additional prefixes southbound We refer to the process of advertising additional prefixes southbound
as 'positive de-aggregation' or 'positive dis-aggregation'. as 'positive de-aggregation' or 'positive dis-aggregation'. Such
dis-aggregation is non-transitive, i.e. its effects are always
contained to a single level of the fabric only. Naturally, multiple
node or link failures can lead to several independent instances of
positive dis-aggregation necessary to prevent looping or bow-tying
the fabric.
A node determines the set of prefixes needing de-aggregation using A node determines the set of prefixes needing de-aggregation using
the following steps: the following steps:
1. A DAG computation in the southern direction is performed first, 1. A DAG computation in the southern direction is performed first,
i.e. the N-TIEs are used to find all of prefixes it can reach and i.e. the N-TIEs are used to find all of prefixes it can reach and
the set of next-hops in the lower level for each of them. Such a the set of next-hops in the lower level for each of them. Such a
computation can be easily performed on a fat tree by e.g. setting computation can be easily performed on a fat tree by e.g. setting
all link costs in the southern direction to 1 and all northern all link costs in the southern direction to 1 and all northern
directions to infinity. We term set of those prefixes |R, and directions to infinity. We term set of those prefixes |R, and
skipping to change at page 54, line 29 skipping to change at page 56, line 29
northbound neighbors provided a negative prefix advertisement. This northbound neighbors provided a negative prefix advertisement. This
is the trigger to advertise this negative prefix transitively south is the trigger to advertise this negative prefix transitively south
and normally caused by the node being in a plane where the prefix and normally caused by the node being in a plane where the prefix
belongs to a fabric leaf that has "fallen" in this plane. Obviously, belongs to a fabric leaf that has "fallen" in this plane. Obviously,
when one of the northbound switches withdraws its negative when one of the northbound switches withdraws its negative
advertisement, the node has to withdraw its transitively provided advertisement, the node has to withdraw its transitively provided
negative prefix as well. negative prefix as well.
5.2.6. Attaching Prefixes 5.2.6. Attaching Prefixes
After the SPF is run, it is necessary to attach according prefixes. After SPF is run, it is necessary to attach the resulting
For S-SPF, prefixes from an N-TIE are attached to the originating reachability information in form of prefixes. For S-SPF, prefixes
node with that node's next-hop set and a distance equal to the from an N-TIE are attached to the originating node with that node's
prefix's cost plus the node's minimized path distance. The RIFT next-hop set and a distance equal to the prefix's cost plus the
route database, a set of (prefix, type=spf, path_distance, next-hop node's minimized path distance. The RIFT route database, a set of
set), accumulates these results. Obviously, the prefix retains its (prefix, prefix-type, attributes, path_distance, next-hop set),
type which is used to tie-break between the same prefix advertised accumulates these results.
with different types.
In case of N-SPF prefixes from each S-TIE need to also be added to In case of N-SPF prefixes from each S-TIE need to also be added to
the RIFT route database. The N-SPF is really just a stub so the the RIFT route database. The N-SPF is really just a stub so the
computing node needs simply to determine, for each prefix in an S-TIE computing node needs simply to determine, for each prefix in an S-TIE
that originated from adjacent node, what next-hops to use to reach that originated from adjacent node, what next-hops to use to reach
that node. Since there may be parallel links, the next-hops to use that node. Since there may be parallel links, the next-hops to use
can be a set; presence of the computing node in the associated Node can be a set; presence of the computing node in the associated Node
S-TIE is sufficient to verify that at least one link has S-TIE is sufficient to verify that at least one link has
bidirectional connectivity. The set of minimum cost next-hops from bidirectional connectivity. The set of minimum cost next-hops from
the computing node X to the originating adjacent node is determined. the computing node X to the originating adjacent node is determined.
Each prefix has its cost adjusted before being added into the RIFT Each prefix has its cost adjusted before being added into the RIFT
route database. The cost of the prefix is set to the cost received route database. The cost of the prefix is set to the cost received
plus the cost of the minimum distance next-hop to that neighbor while plus the cost of the minimum distance next-hop to that neighbor while
taking into account its attributes such as mobility per Section 5.3.3 taking into account its attributes such as mobility per Section 5.3.3
necessary. Then each prefix can be added into the RIFT route necessary. Then each prefix can be added into the RIFT route
database with the next_hop_set; ties are broken based upon type first database with the next_hop_set; ties are broken based upon type first
and then distance and further attributes. RIFT route preferences are and then distance and further attributes and only the best
normalized by the according thrift model type. combination is used for forwarding. RIFT route preferences are
normalized by the according Thrift [thrift] model type.
An exemplary implementation for node X follows: An example implementation for node X follows:
for each S-TIE for each S-TIE
if S-TIE.level > X.level if S-TIE.level > X.level
next_hop_set = set of minimum cost links to the S-TIE.originator next_hop_set = set of minimum cost links to the S-TIE.originator
next_hop_cost = minimum cost link to S-TIE.originator next_hop_cost = minimum cost link to S-TIE.originator
end if end if
for each prefix P in the S-TIE for each prefix P in the S-TIE
P.cost = P.cost + next_hop_cost P.cost = P.cost + next_hop_cost
if P not in route_database: if P not in route_database:
add (P, type=DistVector, P.cost, next_hop_set) to route_database add (P, P.cost, P.type, P.attributes, next_hop_set) to route_database
end if end if
if (P in route_database): if (P in route_database):
if route_database[P].cost > P.cost or route_database[P].type > P.type: if route_database[P].cost > P.cost or route_database[P].type > P.type:
update route_database[P] with (P, DistVector, P.cost, P.type, next_hop_set) update route_database[P] with (P, P.type, P.cost, P.attributes, next_hop_set)
else if route_database[P].cost == P.cost and route_database[P].type == P.type: else if route_database[P].cost == P.cost and route_database[P].type == P.type:
update route_database[P] with (P, DistVector, P.cost, P.type, update route_database[P] with (P, P.type, P.cost, P.attributes,
merge(next_hop_set, route_database[P].next_hop_set)) merge(next_hop_set, route_database[P].next_hop_set))
else else
// Not preferred route so ignore // Not preferred route so ignore
end if end if
end if end if
end for end for
end for end for
Figure 17: Adding Routes from S-TIE Positive and Negative Prefixes Figure 17: Adding Routes from S-TIE Positive and Negative Prefixes
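The pseudocode in Figure 17 can be approximated by the following runnable sketch. It is an illustration under stated assumptions rather than the draft's implementation: the `Route` record, the dictionary layout of S-TIEs and the `link_costs` map are hypothetical, a numerically lower type is assumed to be preferred, and candidates are compared by type first and cost second, following the prose rather than the `or` condition in the figure. S-TIEs that do not come from a higher level are skipped entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    prefix_type: int          # assumption: lower value is preferred
    cost: int
    attributes: dict
    next_hops: set = field(default_factory=set)


def attach_s_tie_prefixes(route_db, x_level, s_ties, link_costs):
    """Add prefixes from S-TIEs to route_db for computing node X.

    s_ties     -- list of {"originator", "level", "prefixes"} dicts (hypothetical)
    link_costs -- {originator: {link_name: cost}} for X's links (hypothetical)
    """
    for tie in s_ties:
        links = link_costs.get(tie["originator"], {})
        if tie["level"] <= x_level or not links:
            continue  # only adjacent nodes at a higher level contribute
        hop_cost = min(links.values())
        hop_set = {name for name, cost in links.items() if cost == hop_cost}
        for p in tie["prefixes"]:
            cost = p["cost"] + hop_cost
            key = (p["type"], cost)
            current = route_db.get(p["prefix"])
            if current is None or key < (current.prefix_type, current.cost):
                # New prefix or strictly better route: replace the entry.
                route_db[p["prefix"]] = Route(
                    p["type"], cost, p.get("attributes", {}), set(hop_set))
            elif key == (current.prefix_type, current.cost):
                # Equal preference: merge next hops (parallel links, ECMP).
                current.next_hops |= hop_set
            # Otherwise the route is not preferred and is ignored.
    return route_db
```

Two S-TIEs that advertise the same prefix with equal type and cost end up as a single entry whose next-hop set is the union of the minimum cost links towards both originators.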
skipping to change at page 56, line 15 skipping to change at page 58, line 15
will recurse in the case of nested negative prefix aggregations. will recurse in the case of nested negative prefix aggregations.
The rule of inheritance must also be maintained when a new prefix of The rule of inheritance must also be maintained when a new prefix of
intermediate length is inserted, or when the immediately aggregating intermediate length is inserted, or when the immediately aggregating
prefix is deleted from the routing table, making an even shorter prefix is deleted from the routing table, making an even shorter
aggregating prefix the one from which the negative routes now inherit aggregating prefix the one from which the negative routes now inherit
their adjacencies. As the aggregating prefix changes, all the their adjacencies. As the aggregating prefix changes, all the
negative routes must be recomputed, and then again the process may negative routes must be recomputed, and then again the process may
recurse in case of nested negative prefix aggregations. recurse in case of nested negative prefix aggregations.
Observe that despite seeming quite computationally expensive the Although these operations can be computationally expensive, the
operations are only necessary if the only available advertisements overall load on devices in the network is low because these
for a prefix are negative since tie-breaking always prefers positive computations are not run very often, as positive route advertisements
information for the prefix which stops any kind of recursion since are always preferred over negative ones. This prevents recursion in
positive information never inherits next hops. most cases because positive reachability information never inherits
next hops.
To make the negative disaggregation less abstract and provide an To make the negative disaggregation less abstract and provide an
example let us consider a ToP node T1 with 4 ToF parents S1..S4 as example let us consider a ToP node T1 with 4 ToF parents S1..S4 as
represented in Figure 18: represented in Figure 18:
+----+ +----+ +----+ +----+ N +----+ +----+ +----+ +----+ N
| S1 | | S1 | | S1 | | S1 | ^ | S1 | | S1 | | S1 | | S1 | ^
+----+ +----+ +----+ +----+ W< + >E +----+ +----+ +----+ +----+ W< + >E
| | | | v | | | | v
|+--------+ | | S |+--------+ | | S
skipping to change at page 58, line 27 skipping to change at page 60, line 27
| +--------+ | +--------+
+---> | Via S3 | +---> | Via S3 |
| +---------+ | +---------+
| |
| +--------+ | +--------+
+---> | Via S4 | +---> | Via S4 |
+--------+ +--------+
Figure 20: Abstract RIB after negative 2001:db8::/32 from S1 Figure 20: Abstract RIB after negative 2001:db8::/32 from S1
Negative 2001:db8::/32 entry inherits from ::/0, so the positive more The negative 2001:db8::/32 prefix entry inherits from ::/0, so the
specific routes are the complements to S1 in the set of next-hops for positive more specific routes are the complements to S1 in the set of
the default route. That entry is composed of S2, S3, and S4, or, in next-hops for the default route. That entry is composed of S2, S3,
other words, it uses all entries of the default route with a "hole and S4, or, in other words, it uses all entries of the default route
punched" for S1 into them. These are the next hops that are still with a "hole punched" for S1 into them. These are the next hops that
available to reach 2001:db8::/32, now that S1 advertised that it will are still available to reach 2001:db8::/32, now that S1 advertised
not forward 2001:db8::/32 anymore. Ultimately, those resulting next- that it will not forward 2001:db8::/32 anymore. Ultimately, those
hops are installed in FIB for the more specific route to resulting next-hops are installed in FIB for the more specific route
2001:db8::/32 as illustrated below: to 2001:db8::/32 as illustrated below:
+---------+ +---------------+ +---------+ +---------------+
| Default | | 2001:db8::/32 | | Default | | 2001:db8::/32 |
+---------+ +---------------+ +---------+ +---------------+
| | | |
| +--------+ | | +--------+ |
+---> | Via S1 | | +---> | Via S1 | |
| +--------+ | | +--------+ |
| | | |
| +--------+ | +--------+ | +--------+ | +--------+
skipping to change at page 62, line 34 skipping to change at page 64, line 34
Figure 24: Abstract FIB after loss of S3 Figure 24: Abstract FIB after loss of S3
Say that at that time, S4 would also disaggregate prefix Say that at that time, S4 would also disaggregate prefix
2001:db8:1::/48. This would mean that the FIB entry for 2001:db8:1::/48. This would mean that the FIB entry for
2001:db8:1::/48 becomes a discard route, and that would be the signal 2001:db8:1::/48 becomes a discard route, and that would be the signal
for T1 to disaggregate prefix 2001:db8:1::/48 negatively in a for T1 to disaggregate prefix 2001:db8:1::/48 negatively in a
transitive fashion with its own children. transitive fashion with its own children.
Finally, let us look at the case where S3 becomes available again as Finally, let us look at the case where S3 becomes available again as
default gateway, and a negative advertisement is received from S4 a default gateway, and a negative advertisement is received from S4
about prefix 2001:db8:2::/48 as opposed to 2001:db8:1::/48. Again, a about prefix 2001:db8:2::/48 as opposed to 2001:db8:1::/48. Again, a
negative route is stored in the RIB, and the more specific route to negative route is stored in the RIB, and the more specific route to
the complementing ToF nodes are installed in FIB. Since the complementing ToF nodes are installed in FIB. Since
2001:db8:2::/48 inherits from 2001:db8::/32, the positive FIB routes 2001:db8:2::/48 inherits from 2001:db8::/32, the positive FIB routes
are chosen by removing S4 from S2, S3, S4. The abstract FIB in T1 are chosen by removing S4 from S2, S3, S4. The abstract FIB in T1
now shows as illustrated in Figure 25: now shows as illustrated in Figure 25:
+-----------------+ +-----------------+
| 2001:db8:2::/48 | | 2001:db8:2::/48 |
+-----------------+ +-----------------+
skipping to change at page 63, line 33 skipping to change at page 65, line 33
| +--------+ | +--------+ | +--------+ | +--------+ | +--------+ | +--------+
| | | | | |
| +--------+ | +--------+ | +--------+ | +--------+ | +--------+ | +--------+
+---> | Via S4 | +---> | Via S4 | +---> | Via S4 | +---> | Via S4 | +---> | Via S4 | +---> | Via S4 |
+--------+ +--------+ +--------+ +--------+ +--------+ +--------+
Figure 25: Abstract FIB after negative 2001:db8:2::/48 from S4 Figure 25: Abstract FIB after negative 2001:db8:2::/48 from S4
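The inheritance rule walked through in this example can be sketched as follows. This is a simplified illustration, not the draft's algorithm: the function name and the two input dictionaries are hypothetical, and the sketch recomputes everything from scratch instead of updating incrementally. A negative prefix inherits the next hops of its longest covering prefix and removes the nodes that advertised it negatively; positive information never inherits.

```python
from ipaddress import ip_network


def resolve_next_hops(positive, negative):
    """Compute abstract FIB next-hop sets.

    positive -- {prefix: set of next hops} from positive advertisements
    negative -- {prefix: set of nodes that advertised the prefix negatively}
    """
    pos = {ip_network(p): set(nh) for p, nh in positive.items()}
    neg = {ip_network(p): set(adv) for p, adv in negative.items()}
    resolved = dict(pos)
    # Shorter prefixes first, so a covering negative prefix is resolved
    # before the nested negative prefixes that inherit from it.
    for net in sorted(neg, key=lambda n: n.prefixlen):
        if net in pos:
            continue  # positive information is always preferred
        parents = [c for c in resolved
                   if c.version == net.version and c != net and net.subnet_of(c)]
        if parents:
            inherited = resolved[max(parents, key=lambda c: c.prefixlen)]
        else:
            inherited = set()
        # "Punch a hole" for every node that advertised the prefix negatively.
        resolved[net] = inherited - neg[net]
    return {str(net): hops for net, hops in resolved.items()}
```

Applied to the example, a default route via S1..S4, a negative 2001:db8::/32 from S1 and a negative 2001:db8:2::/48 from S4 leave S2, S3, S4 for the /32 and S2, S3 for the /48. An empty resulting set corresponds to the discard route case described above.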
5.2.7. Optional Zero Touch Provisioning (ZTP) 5.2.7. Optional Zero Touch Provisioning (ZTP)
Each RIFT node can optionally operate in zero touch provisioning Each RIFT node can operate in zero touch provisioning (ZTP) mode,
(ZTP) mode, i.e. it has no configuration (unless it is a Top-of- i.e. it has no configuration (unless it is a Top-of-Fabric at the top
Fabric at the top of the topology or it must operate in the topology of the topology or it must operate in the topology as leaf and/or
as leaf and/or support leaf-2-leaf procedures) and it will fully support leaf-2-leaf procedures) and it will fully configure itself
configure itself after being attached to the topology. Configured after being attached to the topology. Configured nodes and nodes
nodes and nodes operating in ZTP can be mixed and will form a valid operating in ZTP can be mixed and will form a valid topology if
topology if achievable. achievable.
The derivation of the level of each node happens based on offers The derivation of the level of each node happens based on offers
received from its neighbors whereas each node (with possibly received from its neighbors whereas each node (with possibly
exceptions of configured leafs) tries to attach at the highest exceptions of configured leafs) tries to attach at the highest
possible point in the fabric. This guarantees that even if the possible point in the fabric. This guarantees that even if the
diffusion front reaches a node from "below" faster than from "above", diffusion front reaches a node from "below" faster than from "above",
it will greedily abandon already negotiated level derived from nodes it will greedily abandon already negotiated level derived from nodes
topologically below it and properly peers with nodes above. topologically below it and properly peers with nodes above.
The fabric is very consciously numbered from the top to allow for PoDs
skipping to change at page 65, line 37 skipping to change at page 67, line 37
all VOLs received. all VOLs received.
Highest Available Level Systems (HALS): Set of nodes offering HAL Highest Available Level Systems (HALS): Set of nodes offering HAL
VOLs. VOLs.
Highest Adjacency Three Way (HAT): Highest neighbor level of all the Highest Adjacency Three Way (HAT): Highest neighbor level of all the
formed three way adjacencies for the node. formed three way adjacencies for the node.
5.2.7.2. Automatic SystemID Selection 5.2.7.2. Automatic SystemID Selection
RIFT identifies each node via a SystemID which is a 64 bits wide RIFT nodes require a 64 bit SystemID which SHOULD be derived as
integer. It is relatively simple to derive a, for all practical EUI-64 MA-L according to [EUI64]. The organizationally
purposes collision free, value for each node on startup. For that governed portion of this ID (24 bits) can be used to generate
purpose, a node MUST use as system ID EUI-64 MA-L format [EUI64] multiple IDs if required to indicate more than one RIFT instance.
where the organizationally governed 24 bits can be used to generate
system IDs for multiple RIFT instances running on the system.
As a matter of operational concern, the router MUST ensure that such As a matter of operational concern, the router MUST ensure that such
identifier is not changing very frequently (or at least not without identifier is not changing very frequently (or at least not without
sending all its TIEs with fairly short lifetimes) since otherwise the sending all its TIEs with fairly short lifetimes) since otherwise the
network may be left with large amounts of stale TIEs in other nodes network may be left with large amounts of stale TIEs in other nodes
(though this is not necessarily a serious problem if the procedures (though this is not necessarily a serious problem if the procedures
described in Section 8 are implemented). described in Section 8 are implemented).
5.2.7.3. Generic Fabric Example 5.2.7.3. Generic Fabric Example
skipping to change at page 70, line 19 skipping to change at page 72, line 19
all nodes with highest level either leaving or entering the domain all nodes with highest level either leaving or entering the domain
(with some finer distinctions not explained further). It is (with some finer distinctions not explained further). It is
therefore recommended that each node is multi-homed towards nodes therefore recommended that each node is multi-homed towards nodes
with respective HAL offerings. Fortunately, this is the natural with respective HAL offerings. Fortunately, this is the natural
state of things for the topology variants considered in RIFT. state of things for the topology variants considered in RIFT.
5.3. Further Mechanisms 5.3. Further Mechanisms
5.3.1. Overload Bit 5.3.1. Overload Bit
Overload Bit MUST be respected in all according reachability The overload Bit MUST be respected in all according reachability
computations. A node with overload bit set SHOULD NOT advertise any computations. A node with overload bit set SHOULD NOT advertise any
reachability prefixes southbound except locally hosted ones. A node reachability prefixes southbound except locally hosted ones. A node
in overload SHOULD advertise all its locally hosted prefixes north in overload SHOULD advertise all its locally hosted prefixes north
and southbound. and southbound.
The leaf node SHOULD set the 'overload' bit on its node TIEs, since The leaf node SHOULD set the 'overload' bit on its node TIEs, since
if the spine nodes were to forward traffic not meant for the local if the spine nodes were to forward traffic not meant for the local
node, the leaf node does not have the topology information to prevent node, the leaf node does not have the topology information to prevent
a routing/forwarding loop. a routing/forwarding loop.
5.3.2. Optimized Route Computation on Leafs 5.3.2. Optimized Route Computation on Leafs
Since the leafs do see only "one hop away" they do not need to run a Since the leafs do see only "one hop away" they do not need to run a
full SPF but can simply gather prefix candidates from their neighbors "proper" SPF. Instead, they can gather the available prefix
and build the according routing table. candidates from their neighbors and build the routing table
accordingly.
A leaf will have no N-TIEs except its own and optionally from its A leaf will have no N-TIEs except its own and optionally from its
East-West neighbors. A leaf will have S-TIEs from its neighbors. East-West neighbors. A leaf will have S-TIEs from its neighbors.
Instead of creating a network graph from its N-TIEs and neighbor's Instead of creating a network graph from its N-TIEs and neighbor's
S-TIEs and then running an SPF, a leaf node can simply compute the S-TIEs and then running an SPF, a leaf node can simply compute the
minimum cost and next_hop_set to each leaf neighbor by examining its minimum cost and next_hop_set to each leaf neighbor by examining its
local adjacencies, determining bi-directionality from the associated local adjacencies, determining bi-directionality from the associated
N-TIE, and specifying the neighbor's next_hop_set set and cost from N-TIE, and specifying the neighbor's next_hop_set set and cost from
the minimum cost local adjacency to that neighbor. the minimum cost local adjacency to that neighbor.
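The shortcut described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the tuple layout of the local adjacencies and the `neighbor_reported_ids` map (the system IDs each neighbor lists as adjacent in its node TIE) are all hypothetical.

```python
def leaf_neighbor_next_hops(local_adjacencies, neighbor_reported_ids, my_id):
    """Compute (minimum cost, next_hop_set) per neighbor without running SPF.

    local_adjacencies     -- list of (neighbor_id, link_name, cost) tuples
    neighbor_reported_ids -- {neighbor_id: set of system IDs seen in its node TIE}
    my_id                 -- system ID of the computing leaf
    """
    result = {}
    for neighbor, link, cost in local_adjacencies:
        # Bi-directionality check: the neighbor must report this leaf.
        if my_id not in neighbor_reported_ids.get(neighbor, set()):
            continue
        best = result.get(neighbor)
        if best is None or cost < best[0]:
            result[neighbor] = (cost, {link})   # new minimum cost link
        elif cost == best[0]:
            best[1].add(link)                   # parallel link of equal cost
    return result
```

The returned cost and next-hop set per neighbor are what the leaf would then add to each prefix received in that neighbor's S-TIEs.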
skipping to change at page 71, line 10 skipping to change at page 73, line 15
5.3.3. Mobility 5.3.3. Mobility
It is a requirement for RIFT to maintain at the control plane a real It is a requirement for RIFT to maintain at the control plane a real
time status of which prefix is attached to which port of which leaf, time status of which prefix is attached to which port of which leaf,
even in a context of mobility where the point of attachment may even in a context of mobility where the point of attachment may
change several times in a subsecond period of time. change several times in a subsecond period of time.
There are two classical approaches to maintain such knowledge in an There are two classical approaches to maintain such knowledge in an
unambiguous fashion: unambiguous fashion:
time stamp: With this method, the infrastructure memorizes the time stamp: With this method, the infrastructure records the precise
precise time at which the movement is observed. One key advantage time at which the movement is observed. One key advantage of this
of this technique is that it has no dependency on the mobile technique is that it has no dependency on the mobile device. One
device. One drawback is that the infrastructure must be precisely drawback is that the infrastructure must be precisely synchronized
synchronized to be able to compare time stamps as observed by the to be able to compare time stamps as observed by the various
various points of attachment, e.g., using the variation of the points of attachment, e.g., using the variation of the Precision
Precision Time Protocol (PTP) IEEE Std. 1588 [IEEEstd1588], Time Protocol (PTP) IEEE Std. 1588 [IEEEstd1588], [IEEEstd8021AS]
[IEEEstd8021AS] designed for bridged LANs IEEE Std. 802.1AS designed for bridged LANs IEEE Std. 802.1AS [IEEEstd8021AS]. Both
[IEEEstd8021AS]. Both the precision of the synchronisation the precision of the synchronisation protocol and the resolution
protocol and the resolution of the time stamp must beat the of the time stamp must beat the highest possible roaming time on
highest possible roaming time on the fabric. Another drawback is the fabric. Another drawback is that the presence of the mobile
that the presence of the mobile device may be observed only device may be observed only asynchronously, e.g., after it starts
asynchronously, e.g., after it starts using an IP protocol such as using an IP protocol such as ARP [RFC0826], IPv6 Neighbor
ARP [RFC0826], IPv6 Neighbor Discovery [RFC4861][RFC4862], or DHCP Discovery [RFC4861][RFC4862], or DHCP [RFC2131][RFC8415].
[RFC2131][RFC8415].
sequence counter: With this method, a mobile node notifies its point sequence counter: With this method, a mobile node notifies its point
of attachment on arrival with a sequence counter that is of attachment on arrival with a sequence counter that is
incremented upon each movement. On the positive side, this method incremented upon each movement. On the positive side, this method
does not have a dependency on a precise sense of time, since the does not have a dependency on a precise sense of time, since the
sequence of movements is kept in order by the device. The sequence of movements is kept in order by the device. The
disadvantage of this approach is the lack of support for protocols disadvantage of this approach is the lack of support for protocols
that may be used by the mobile node to register its presence to that may be used by the mobile node to register its presence to
the leaf node with the capability to provide a sequence counter. the leaf node with the capability to provide a sequence counter.
Well-known issues with wrapping sequence counters must be Well-known issues with wrapping sequence counters must be
skipping to change at page 71, line 46 skipping to change at page 73, line 50
in both wrapping rules and comparison rules. A particular in both wrapping rules and comparison rules. A particular
knowledge of the source of the sequence counter is required to knowledge of the source of the sequence counter is required to
operate it, and the comparison between sequence counters from operate it, and the comparison between sequence counters from
heterogeneous sources can be hard to impossible. heterogeneous sources can be hard to impossible.
RIFT supports a hybrid approach contained in an optional RIFT supports a hybrid approach contained in an optional
`PrefixSequenceType` prefix attribute that we call a `monotonic `PrefixSequenceType` prefix attribute that we call a `monotonic
clock` consisting of a timestamp and optional sequence number. In clock` consisting of a timestamp and optional sequence number. In
case of presence of the attribute: case of presence of the attribute:
o The leaf node MUST advertise a time stamp of the latest sighting o The leaf node MAY advertise a time stamp of the latest sighting of
of a prefix, e.g., by snooping IP protocols or the node using the a prefix, e.g., by snooping IP protocols or the node using the
time at which it advertised the prefix. RIFT transports the time time at which it advertised the prefix. RIFT transports the time
stamp within the desired prefix N-TIEs as 802.1AS timestamp. stamp within the desired prefix N-TIEs as 802.1AS timestamp.
o RIFT may interoperate with the "update to 6LoWPAN Neighbor o RIFT may interoperate with the "update to 6LoWPAN Neighbor
Discovery" [RFC8505], which provides a method for registering a Discovery" [RFC8505], which provides a method for registering a
prefix with a sequence counter called a Transaction ID (TID). prefix with a sequence counter called a Transaction ID (TID).
RIFT transports in such case the TID in its native form. RIFT transports in such case the TID in its native form.
o RIFT also defines an abstract negative clock (ANSC) that compares o RIFT also defines an abstract negative clock (ANSC) that compares
as less than any other clock. By default, the lack of a as less than any other clock. By default, the lack of a
skipping to change at page 73, line 37 skipping to change at page 75, line 39
away. away.
Observe further that without support for [RFC8505] movements on the Observe further that without support for [RFC8505] movements on the
fabric within intervals smaller than 100msec will be seen as anycast. fabric within intervals smaller than 100msec will be seen as anycast.
5.3.3.4. Overlays and Signaling 5.3.3.4. Overlays and Signaling
RIFT is agnostic whether any overlay technology like [MIP, LISP, RIFT is agnostic whether any overlay technology like [MIP, LISP,
VxLAN, NVO3] and the associated signaling is deployed over it. But VxLAN, NVO3] and the associated signaling is deployed over it. But
it is expected that leaf nodes, and possibly Top-of-Fabric nodes can it is expected that leaf nodes, and possibly Top-of-Fabric nodes can
perform the according encapsulation. perform the correct encapsulation.
In the context of mobility, overlays provide a classical solution to In the context of mobility, overlays provide a classical solution to
avoid injecting mobile prefixes in the fabric and improve the avoid injecting mobile prefixes in the fabric and improve the
scalability of the solution. It makes sense on a data center that scalability of the solution. It makes sense on a data center that
already uses overlays to consider their applicability to the mobility already uses overlays to consider their applicability to the mobility
solution; as an example, a mobility protocol such as LISP may inform solution; as an example, a mobility protocol such as LISP may inform
the ingress leaf of the location of the egress leaf in real time. the ingress leaf of the location of the egress leaf in real time.
Another possibility is to consider that mobility as an underlay Another possibility is to consider that mobility as an underlay
service and support it in RIFT to an extent. The load on the fabric service and support it in RIFT to an extent. The load on the fabric
skipping to change at page 77, line 31 skipping to change at page 79, line 31
+---------+-----------+-------+-------+-----+ +---------+-----------+-------+-------+-----+
| Leaf111 | Spine 112 | 220 | 8 | 1 | | Leaf111 | Spine 112 | 220 | 8 | 1 |
+---------+-----------+-------+-------+-----+ +---------+-----------+-------+-------+-----+
| Leaf112 | Spine 111 | 120 | 7 | 2 | | Leaf112 | Spine 111 | 120 | 7 | 2 |
+---------+-----------+-------+-------+-----+ +---------+-----------+-------+-------+-----+
| Leaf112 | Spine 112 | 220 | 8 | 1 | | Leaf112 | Spine 112 | 220 | 8 | 1 |
+---------+-----------+-------+-------+-----+ +---------+-----------+-------+-------+-----+
Table 5: BAD Computation Table 5: BAD Computation
All the multiplications and additions are saturating, i.e. when If a calculation produces a result exceeding the range of the type,
exceeding range of the bandwidth type are set to highest possible e.g. bandwidth, the result is set to the highest possible value for
value of the type. that type.
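The saturating behavior described above can be sketched as follows. This is a minimal illustration only: the helper names and the choice of a signed 32-bit maximum for the bandwidth type are assumptions made for the example and are not taken from the specification.

```python
# Illustrative upper bound. The real bound is whatever the schema's
# bandwidth type can represent; a signed 32-bit integer is ASSUMED here.
BANDWIDTH_MAX = 2**31 - 1


def sat_add(a: int, b: int, limit: int = BANDWIDTH_MAX) -> int:
    """Add two non-negative values, clamping the result to `limit`."""
    return min(a + b, limit)


def sat_mul(a: int, b: int, limit: int = BANDWIDTH_MAX) -> int:
    """Multiply two non-negative values, clamping the result to `limit`."""
    return min(a * b, limit)
```

With these helpers an overflowing intermediate result never wraps around to a small or negative number; it simply stays at the highest value of the type.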
BAD is only computed for default routes. A node MAY compute and use BAD is only computed for default routes. A node MAY compute and use
BAD for any disaggregated prefixes or other RIFT routes. A node MAY BAD for any disaggregated prefixes or other RIFT routes. A node MAY
use another algorithm than BAD to weight northbound traffic based on use another algorithm than BAD to weight northbound traffic based on
bandwidth given that the algorithm is distributed and un-synchronized bandwidth given that the algorithm is distributed and un-synchronized
and ultimately, its correct behavior does not depend on uniformity of and ultimately, its correct behavior does not depend on uniformity of
balancing algorithms used in the fabric. E.g. it is conceivable that balancing algorithms used in the fabric. E.g. it is conceivable that
leafs could use real time link loads gathered by analytics to change leafs could use real time link loads gathered by analytics to change
the amount of traffic assigned to each default route next hop. the amount of traffic assigned to each default route next hop.
skipping to change at page 86, line 8 skipping to change at page 88, line 8
MUST use recent (i.e. within allowed difference) nonces reflected in MUST use recent (i.e. within allowed difference) nonces reflected in
the LIE exchange. The schema specifies maximum allowable nonce value the LIE exchange. The schema specifies maximum allowable nonce value
difference on a packet compared to reflected nonces in the LIEs. Any difference on a packet compared to reflected nonces in the LIEs. Any
packet received with nonces deviating more than the allowed delta packet received with nonces deviating more than the allowed delta
MUST be discarded without further computation of signatures to MUST be discarded without further computation of signatures to
prevent computation load attacks. prevent computation load attacks.
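The nonce check that precedes signature computation could look like the sketch below. The nonce width, the wrap-around handling and the value of the maximum delta are assumptions for illustration; the schema defines the real constants, and the function names are hypothetical.

```python
NONCE_BITS = 16        # ASSUMED nonce width for this sketch
MAX_NONCE_DELTA = 5    # hypothetical value; the schema defines the real one


def nonce_delta(a: int, b: int, bits: int = NONCE_BITS) -> int:
    """Distance between two nonces, treating the nonce space as circular."""
    modulus = 1 << bits
    diff = (a - b) % modulus
    return min(diff, modulus - diff)


def should_verify_signature(packet_nonce: int, reflected_nonce: int,
                            max_delta: int = MAX_NONCE_DELTA) -> bool:
    """Return False when the packet must be discarded before any
    signature computation because its nonce deviates too much."""
    return nonce_delta(packet_nonce, reflected_nonce) <= max_delta
```

The point of the ordering is that the cheap integer comparison runs first, so a flood of packets with stale nonces never reaches the expensive signature validation.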
In case where a secure implementation does not receive signatures or In case where a secure implementation does not receive signatures or
receives undefined nonces from neighbor indicating that it does not receives undefined nonces from neighbor indicating that it does not
support or verify signatures, it is a matter of local policy how such support or verify signatures, it is a matter of local policy how such
packets are treated. Any secure implementation MUST discard packets packets are treated. Any secure implementation may choose to either
where its local nonce is not correctly mirrored but it may choose to refuse forming an adjacency with an implementation not advertising
either refuse forming an adjacency with an implementation not signatures or valid nonces or simply keep on signing local packets
advertising signatures or valid nonces or simply keep on signing while accepting neighbor's packets without further security
local packets while accepting neighbor's packets without further verification.
verification beside checking for proper nonce reflection.
As a necessary exception, an implementation MUST advertise As a necessary exception, an implementation MUST advertise
`undefined_nonce` for remote nonce value when the FSM is not in 2-way `undefined_nonce` for remote nonce value when the FSM is not in 2-way
or 3-way state and accept an `undefined_nonce` for its local nonce or 3-way state and accept an `undefined_nonce` for its local nonce
value on packets in any other state than 3-way. value on packets in any other state than 3-way.
As optional optimization, an implementation MAY send one LIE with As optional optimization, an implementation MAY send one LIE with
previously negotiated neighbor's nonce to try to speed up a previously negotiated neighbor's nonce to try to speed up a
neighbor's transition from 3-way to 1-way and MUST revert to sending neighbor's transition from 3-way to 1-way and MUST revert to sending
`undefined_nonce` after that. `undefined_nonce` after that.
skipping to change at page 92, line 40 skipping to change at page 93, line 43
using the link A01 to A02 (whereas A02 will NOT use this link during using the link A01 to A02 (whereas A02 will NOT use this link during
N-SPF). Hence A01 will still advertise the default towards level 0 N-SPF). Hence A01 will still advertise the default towards level 0
and route unidirectionally using the horizontal link. and route unidirectionally using the horizontal link.
As further consideration, the moment A02 loses link N2 the situation As further consideration, the moment A02 loses link N2 the situation
evolves again. A01 will have no more northbound reachability while evolves again. A01 will have no more northbound reachability while
still seeing A03 advertising northbound adjacencies in its south node still seeing A03 advertising northbound adjacencies in its south node
tie. With that it will stop advertising a default route due to tie. With that it will stop advertising a default route due to
Section 5.2.3.8. Section 5.2.3.8.
6.5. Multi-Plane Fabric and Negative Disaggregation
TODO
7. Implementation and Operation: Further Details 7. Implementation and Operation: Further Details
7.1. Considerations for Leaf-Only Implementation 7.1. Considerations for Leaf-Only Implementation
RIFT can and is intended to be stretched to the lowest level in the RIFT can and is intended to be stretched to the lowest level in the
IP fabric to integrate ToRs or even servers. Since those entities IP fabric to integrate ToRs or even servers. Since those entities
would run as leafs only, it is worth to observe that a leaf only would run as leafs only, it is worth to observe that a leaf only
version is significantly simpler to implement and requires much less version is significantly simpler to implement and requires much less
resources: resources:
skipping to change at page 96, line 32 skipping to change at page 97, line 7
8.6. TIE Origin Fingerprint DoS Attacks 8.6. TIE Origin Fingerprint DoS Attacks
A compromised node can attempt to generate "fake TIEs" using other A compromised node can attempt to generate "fake TIEs" using other
nodes' TIE origin key identifiers. Albeit the ultimate validation of nodes' TIE origin key identifiers. Albeit the ultimate validation of
the origin fingerprint will fail in such scenarios and not progress the origin fingerprint will fail in such scenarios and not progress
further than immediately peering nodes, the resulting denial of further than immediately peering nodes, the resulting denial of
service attack seems unavoidable since the TIE origin key id is only service attack seems unavoidable since the TIE origin key id is only
protected by the, here assumed to be compromised, node. protected by the, here assumed to be compromised, node.
8.7. Host Implementations
It can be reasonably expected that with the proliferation of RotH,
servers, rather than dedicated networking devices, will constitute a
significant share of RIFT devices. Given their normally far wider
software envelope and the access granted to them, such servers are also
far more likely to be compromised and to present an attack vector on
the protocol. Hijacking of prefixes to attract traffic is a trust
problem and cannot be addressed within the protocol if the trust
model is breached, i.e. the server presents valid credentials to form
an adjacency and issue TIEs. However, in a more devious way, the
servers can present DoS (or even DDoS) vectors by issuing too many
LIE packets, flooding large amounts of N-TIEs and similar anomalies. A
prudent implementation hosting leafs should implement thresholds and
raise warnings when a leaf is advertising a number of TIEs in excess
of those.
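A threshold of the kind suggested above might be sketched as follows. The class name, the identifiers used for leafs and TIEs, and the limit itself are hypothetical; the specification does not prescribe any particular mechanism or value.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("rift.sketch")  # hypothetical logger name


class LeafTieMonitor:
    """Tracks distinct TIEs seen per leaf and warns above a local limit."""

    def __init__(self, max_ties_per_leaf: int):
        self.max_ties_per_leaf = max_ties_per_leaf
        self._ties = defaultdict(set)  # leaf id -> set of TIE numbers seen

    def record_tie(self, leaf_id: str, tie_nr: int) -> bool:
        """Record one advertised TIE; return True if the leaf is over limit."""
        self._ties[leaf_id].add(tie_nr)
        exceeded = len(self._ties[leaf_id]) > self.max_ties_per_leaf
        if exceeded:
            logger.warning("leaf %s advertises %d TIEs (limit %d)",
                           leaf_id, len(self._ties[leaf_id]),
                           self.max_ties_per_leaf)
        return exceeded
```

Counting distinct TIE numbers rather than received packets keeps ordinary refreshes of the same TIE from triggering the warning.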
9. IANA Considerations 9. IANA Considerations
This specification will request at an opportune time multiple This specification requests multicast address assignments and
registry points to exchange protocol packets in a standardized way, standard port numbers. Additionally registries for the schema are
amongst them multicast address assignments and standard port numbers. requested and suggested values provided that reflect the numbers
The schema itself defines many values and codepoints which can be allocated in the given schema.
considered registries themselves.
9.1. Requested Multicast and Port Numbers
This document requests allocation in the 'IPv4 Multicast Address
Space' registry the suggested value of 224.0.0.120 as
'ALL_V4_RIFT_ROUTERS' and in the 'IPv6 Multicast Address Space'
registry the suggested value of FF02::A1F7 as 'ALL_V6_RIFT_ROUTERS'.
This document requests allocation in the 'Service Name and Transport
Protocol Port Number Registry' the allocation of a suggested value of
914 on udp for 'RIFT_LIES_PORT' and suggested value of 915 for
'RIFT_TIES_PORT'.
9.2. Requested Registries with Suggested Values
This section requests registries that help govern the schema via
usual IANA registry procedures. Allocation of new values is always
performed via `Expert Review` action. IANA is requested to store the
schema version introducing the allocated value as well as,
optionally, its description when present. All values not suggested
are to be considered `Unassigned`. The range of every registry is a
16-bit integer.
9.2.1. RIFT/common/AddressFamilyType
address family
9.2.1.1. Requested Entries
Name Value Schema Version Description
Illegal 0 1.0
AddressFamilyMinValue 1 1.0
IPv4 2 1.0
IPv6 3 1.0
AddressFamilyMaxValue 4 1.0
9.2.2. RIFT/common/HierarchyIndications
flags indicating a node's behavior in case of ZTP
9.2.2.1. Requested Entries
Name Value Schema Version Description
leaf_only 0 1.0
leaf_only_and_leaf_2_leaf_procedures 1 1.0
top_of_fabric 2 1.0
9.2.3. RIFT/common/IEEE802_1ASTimeStampType
timestamp per IEEE 802.1AS, values MUST be interpreted in
implementation as unsigned
9.2.3.1. Requested Entries
Name Value Schema Version Description
AS_sec 1 1.0
AS_nsec 2 1.0
9.2.4. RIFT/common/IPAddressType
IP address type
9.2.4.1. Requested Entries
Name Value Schema Version Description
ipv4address 1 1.0
ipv6address 2 1.0
9.2.5. RIFT/common/IPPrefixType
prefix representing reachability.
@note: for interface addresses the protocol can propagate the address
part beyond the subnet mask and on reachability computation that has
to be normalized. The non-significant bits can be used for
operational purposes.
9.2.5.1. Requested Entries
Name Value Schema Version Description
ipv4prefix 1 1.0
ipv6prefix 2 1.0
9.2.6. RIFT/common/IPv4PrefixType
IP v4 prefix type
9.2.6.1. Requested Entries
Name Value Schema Version Description
address 1 1.0
prefixlen 2 1.0
9.2.7. RIFT/common/IPv6PrefixType
IP v6 prefix type
9.2.7.1. Requested Entries
Name Value Schema Version Description
address 1 1.0
prefixlen 2 1.0
9.2.8. RIFT/common/PrefixSequenceType
sequence of a prefix when it moves
9.2.8.1. Requested Entries
Name           Value  Schema   Description
                      Version
timestamp      1      1.0
transactionid  2      1.0      transaction ID set by the client,
                               e.g. in 6LoWPAN
9.2.9. RIFT/common/RouteType
RIFT route types.
@note: route types MUST be ordered on their preference. PGP prefixes
are most preferred, attracting traffic north (towards spine) and then
south; normal prefixes attract traffic south (towards leafs), i.e. a
prefix in a NORTH PREFIX TIE is preferred over one in a SOUTH PREFIX
TIE.
@note: The only purpose of those values is to introduce an ordering,
whereas an implementation can choose internally any other values as
long as the ordering is preserved.
9.2.9.1. Requested Entries
Name Value Schema Version Description
Illegal 0 1.0
RouteTypeMinValue 1 1.0
Discard 2 1.0
LocalPrefix 3 1.0
SouthPGPPrefix 4 1.0
NorthPGPPrefix 5 1.0
NorthPrefix 6 1.0
NorthExternalPrefix 7 1.0
SouthPrefix 8 1.0
SouthExternalPrefix 9 1.0
NegativeSouthPrefix 10 1.0
RouteTypeMaxValue 11 1.0
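The ordering role of these values can be illustrated with a short sketch. It assumes that a lower numeric value means a more preferred route type, which is consistent with the note that a north prefix is preferred over a south prefix but is not stated explicitly in the table; the function name is hypothetical.

```python
from enum import IntEnum


class RouteType(IntEnum):
    """Suggested values from the table above (min/max sentinels omitted)."""
    Discard = 2
    LocalPrefix = 3
    SouthPGPPrefix = 4
    NorthPGPPrefix = 5
    NorthPrefix = 6
    NorthExternalPrefix = 7
    SouthPrefix = 8
    SouthExternalPrefix = 9
    NegativeSouthPrefix = 10


def preferred(candidates):
    """Pick the most preferred route type, ASSUMING lower value wins."""
    return min(candidates)
```

An implementation is free to use different internal numbers as long as the relative order of the types is the same.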
9.2.10. RIFT/common/TIETypeType
type of TIE.
This enum indicates what TIE type the TIE is carrying. In case the
value is not known to the receiver, the TIE is re-flooded the same way
as prefix TIEs. This allows for future extensions of the protocol
within the same schema major with types opaque to some nodes. If the
flooding scope of a new type is not the same as for prefix TIEs, a
major version revision MUST be performed.
9.2.10.1. Requested Entries
Name Value Schema Version Description
Illegal 0 1.0
TIETypeMinValue 1 1.0
NodeTIEType 2 1.0
PrefixTIEType 3 1.0
PositiveDisaggregationPrefixTIEType 4 1.0
NegativeDisaggregationPrefixTIEType 5 1.0
PGPrefixTIEType 6 1.0
KeyValueTIEType 7 1.0
ExternalPrefixTIEType 8 1.0
TIETypeMaxValue 9 1.0
9.2.11. RIFT/common/TieDirectionType
direction of tie
9.2.11.1. Requested Entries
Name Value Schema Version Description
Illegal 0 1.0
South 1 1.0
North 2 1.0
DirectionMaxValue 3 1.0
9.2.12. RIFT/encoding/Community
community
9.2.12.1. Requested Entries
Name Value Schema Version Description
top 1 1.0
bottom 2 1.0
9.2.13. RIFT/encoding/KeyValueTIEElement
Generic key value pairs
9.2.13.1. Requested Entries
Name Value Schema Description
Version
keyvalues 1 1.0 if the same key repeats in multiple TIEs of
same node or with different values, behavior
is unspecified
9.2.14. RIFT/encoding/LIEPacket
RIFT LIE packet
@note this node's level is already included on the packet header
9.2.14.1. Requested Entries
Name                         Value  Schema   Description
                                    Version
name                         1      1.0      node or adjacency name
local_id                     2      1.0      local link ID
flood_port                   3      1.0      UDP port on which we can
                                             receive flooded TIEs
link_mtu_size                4      1.0      layer 3 MTU, used to
                                             discover a mismatch
link_bandwidth               5      1.0      local link bandwidth on the
                                             interface
neighbor                     6      1.0      reflects the neighbor once
                                             received to provide 3-way
                                             connectivity
pod                          7      1.0      node's PoD
node_capabilities            10     1.0      node capabilities shown in
                                             the LIE. The capabilities
                                             MUST match the capabilities
                                             shown in the Node TIEs,
                                             otherwise the behavior is
                                             unspecified. A node
                                             detecting the mismatch
                                             SHOULD generate a
                                             corresponding error
link_capabilities            11     1.0      capabilities of this link
holdtime                     12     1.0      required holdtime of the
                                             adjacency, i.e. how much
                                             time MUST expire without
                                             LIE for the adjacency to
                                             drop
label                        13     1.0      unsolicited, downstream
                                             assigned locally
                                             significant label value for
                                             the adjacency
not_a_ztp_offer              21     1.0      indicates that the level on
                                             the LIE MUST NOT be used to
                                             derive a ZTP level by the
                                             receiving node
you_are_flood_repeater       22     1.0      indicates to northbound
                                             neighbor that it should be
                                             reflooding this node's
                                             N-TIEs to achieve flood
                                             reduction and balancing for
                                             northbound flooding. To be
                                             ignored if received from a
                                             northbound adjacency
you_are_sending_too_quickly  23     1.0      can be optionally set to
                                             indicate to neighbor that
                                             packet losses are seen on
                                             reception based on packet
                                             numbers or the rate is too
                                             high. The receiver SHOULD
                                             temporarily slow down
                                             flooding rates
instance_name                24     1.0      instance name in case
                                             multiple RIFT instances
                                             running on same interface
9.2.15. RIFT/encoding/LinkCapabilities
link capabilities
9.2.15.1. Requested Entries
Name Value Schema Description
Version
bfd 1 1.0 indicates that the link's `local
ID` can be used as its BFD
discriminator and the link is
supporting BFD
v4_forwarding_capable 2 1.0 indicates whether the interface
will support v4 forwarding. This
MUST be set to true when LIEs
from a v4 address are sent and
MAY be set to true in LIEs on v6
address. If v4 and v6 LIEs
indicate contradicting
information the behavior is
unspecified.
9.2.16. RIFT/encoding/LinkIDPair
LinkID pair describes one of parallel links between two nodes
9.2.16.1. Requested Entries
Name Value Schema Description
Version
local_id 1 1.0 node-wide unique value for
the local link
remote_id 2 1.0 received remote link ID for
this link
platform_interface_index 10 1.0 describes the local
interface index of the link
platform_interface_name 11 1.0 describes the local
interface name
trusted_outer_security_key 12 1.0 indication whether the link
is secured, i.e. protected
by outer key, absence of
this element means no
indication, undefined outer
key means not secured
9.2.17. RIFT/encoding/Neighbor
neighbor structure
9.2.17.1. Requested Entries
Name Value Schema Version Description
originator 1 1.0 system ID of the originator
remote_id 2 1.0 ID of remote side of the link
9.2.18. RIFT/encoding/NodeCapabilities
capabilities the node supports. The schema may add to this field
future capabilities to indicate whether it will support
interpretation of future schema extensions on the same major
revision. Such fields MUST be optional and have an implicit or
explicit false default value. If a future capability changes route
selection or generates blackholes if some nodes are not supporting it
then a major version increment is unavoidable.
9.2.18.1. Requested Entries
Name Value Schema Description
Version
flood_reduction 1 1.0 can this node participate in
flood reduction
hierarchy_indications 2 1.0 does this node restrict itself to
be top-of-fabric or leaf only (in
ZTP) and does it support
leaf-2-leaf procedures
9.2.19. RIFT/encoding/NodeFlags
Flags the node sets
9.2.19.1. Requested Entries
Name Value Schema Description
Version
overload 1 1.0 indicates that node is in overload, do not
transit traffic through it
9.2.20. RIFT/encoding/NodeNeighborsTIEElement
neighbor of a node
9.2.20.1. Requested Entries
Name Value Schema Description
Version
level 1 1.0 level of neighbor
cost 3 1.0
link_ids 4 1.0 can carry description of multiple parallel
links in a TIE
bandwidth 5 1.0 total bandwidth to neighbor, this will be
normally sum of the bandwidths of all the
parallel links.
9.2.21. RIFT/encoding/NodeTIEElement
Description of a node.
It may occur multiple times in different TIEs, but if either
o capabilities values do not match, or
o flags values do not match, or
o neighbors repeat with different values,
the behavior is undefined and a warning SHOULD be generated.
Neighbors can, however, be distributed across multiple TIEs if the
sets are disjoint. Miscablings SHOULD be repeated in every node TIE,
otherwise the behavior is undefined.
@note: observe that absence of fields implies defined defaults
9.2.21.1. Requested Entries
Name Value Schema Description
Version
level 1 1.0 level of the node
neighbors 2 1.0 node's neighbors. If neighbor systemID
repeats in other node TIEs of same node
the behavior is undefined
capabilities 3 1.0 capabilities of the node
flags 4 1.0 flags of the node
name 5 1.0 optional node name for easier
operations
pod 6 1.0 PoD to which the node belongs
miscabled_links 10 1.0 if any local links are miscabled, the
indication is flooded
9.2.22. RIFT/encoding/PacketContent
content of a RIFT packet
9.2.22.1. Requested Entries
Name Value Schema Version Description
lie 1 1.0
tide 2 1.0
tire 3 1.0
tie 4 1.0
9.2.23. RIFT/encoding/PacketHeader
common RIFT packet header
9.2.23.1. Requested Entries
Name Value Schema Description
Version
major_version 1 1.0 major version type of protocol
minor_version 2 1.0 minor version type of protocol
sender 3 1.0 node sending the packet, in case of
LIE/TIRE/TIDE also the originator of it
level 4 1.0 level of the node sending the packet,
required on everything except LIEs. Lack
of presence on LIEs indicates
UNDEFINED_LEVEL and is used in ZTP
procedures.
9.2.24. RIFT/encoding/PrefixAttributes
9.2.24.1. Requested Entries
Name Value Schema Description
Version
metric 2 1.0 distance of the prefix
tags 3 1.0 generic unordered set of route tags,
can be redistributed to other
protocols or use within the context
of real time analytics
monotonic_clock 4 1.0 monotonic clock for mobile addresses
loopback 6 1.0 indicates if the interface is a node
loopback
directly_attached 7 1.0 indicates that the prefix is directly
attached, i.e. should be routed to
even if the node is in overload.
from_link 10 1.0 in case of locally originated
prefixes, i.e. interface addresses
this can describe which link the
address belongs to.
9.2.25. RIFT/encoding/PrefixTIEElement
TIE carrying prefixes
9.2.25.1. Requested Entries
Name Value Schema Description
Version
prefixes 1 1.0 prefixes with the associated attributes. if
the same prefix repeats in multiple TIEs of
same node behavior is unspecified
9.2.26. RIFT/encoding/ProtocolPacket
RIFT packet structure
9.2.26.1. Requested Entries
Name Value Schema Version Description
header 1 1.0
content 2 1.0
9.2.27. RIFT/encoding/TIDEPacket
TIDE with sorted TIE headers, if headers are unsorted, behavior is
undefined
9.2.27.1. Requested Entries
Name Value Schema Version Description
start_range 1 1.0 first TIE header in the tide packet
end_range 2 1.0 last TIE header in the tide packet
headers 3 1.0 _sorted_ list of headers
9.2.28. RIFT/encoding/TIEElement
single element in a TIE. enum `common.TIETypeType` in TIEID indicates
which elements MUST be present in the TIEElement. In case of
mismatch the unexpected elements MUST be ignored. In case an expected
element is missing, an error MUST be reported and the TIE MUST be
ignored.
This type can be extended with new optional elements for new
`common.TIETypeType` values without breaking the major but if it is
necessary to understand whether all nodes support the new type a node
capability must be added as well.
9.2.28.1. Requested Entries
Name                               Value  Schema   Description
                                          Version
node                               1      1.0      used in case of enum
                                                   common.TIETypeType.NodeTIEType
prefixes                           2      1.0      used in case of enum
                                                   common.TIETypeType.PrefixTIEType
positive_disaggregation_prefixes   3      1.0      positive prefixes (always
                                                   southbound). It MUST NOT be
                                                   advertised within a North
                                                   TIE and ignored otherwise
negative_disaggregation_prefixes   4      1.0      transitive, negative
                                                   prefixes (always southbound)
                                                   which MUST be aggregated and
                                                   propagated according to the
                                                   specification southwards
                                                   towards lower levels to heal
                                                   pathological upper level
                                                   partitioning, otherwise
                                                   blackholes may occur in
                                                   multiplane fabrics. It MUST
                                                   NOT be advertised within a
                                                   North TIE.
external_prefixes                  5      1.0      externally reimported
                                                   prefixes
keyvalues                          6      1.0      Key-Value store elements
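The element selection rule for a TIEElement can be sketched as below. Only the `node` and `prefixes` mappings are stated in the table; the remaining mappings are assumed from the similarity of the names, and the function, exception and dictionary representation are hypothetical.

```python
EXPECTED_ELEMENT = {
    "NodeTIEType": "node",        # stated in the table
    "PrefixTIEType": "prefixes",  # stated in the table
    # The mappings below are ASSUMED from the element names.
    "PositiveDisaggregationPrefixTIEType": "positive_disaggregation_prefixes",
    "NegativeDisaggregationPrefixTIEType": "negative_disaggregation_prefixes",
    "ExternalPrefixTIEType": "external_prefixes",
    "KeyValueTIEType": "keyvalues",
}


class MissingElementError(ValueError):
    """The element required by the TIE type is absent; the TIE is ignored."""


def select_element(tietype: str, element: dict):
    """Return (name, value) of the expected element, ignoring all others.

    Returns None for a TIE type this node does not know, so that the
    caller can treat the content as opaque.
    """
    expected = EXPECTED_ELEMENT.get(tietype)
    if expected is None:
        return None
    if expected not in element:
        raise MissingElementError(f"{tietype} requires '{expected}'")
    return expected, element[expected]
```

A caller would report the error and drop the TIE when `MissingElementError` is raised, and otherwise process only the returned element.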
9.2.29. RIFT/encoding/TIEHeader
Header of a TIE.
@note: TIEID space is a total order achieved by comparing the
elements in sequence defined and comparing each value as an unsigned
integer of according length.
@note: After sequence number the lifetime received on the envelope
must be used for comparison before further fields.
@note: `origination_time` and `origination_lifetime` are disregarded
for comparison purposes and carried purely for debugging/security
purposes if present.
9.2.29.1. Requested Entries
Name Value Schema Description
Version
tieid 2 1.0 ID of the tie
seq_nr 3 1.0 sequence number of the tie
origination_time 10 1.0 absolute timestamp when the TIE
was generated. This can be used on
fabrics with synchronized clock to
prevent lifetime modification
attacks.
origination_lifetime 12 1.0 original lifetime when the TIE was
generated. This can be used on
fabrics with synchronized clock to
prevent lifetime modification
attacks.
9.2.30. RIFT/encoding/TIEHeaderWithLifeTime
Header of a TIE as described in TIRE/TIDE.
9.2.30.1. Requested Entries
Name Value Schema Description
Version
header 1 1.0
remaining_lifetime 2 1.0 remaining lifetime that expires down
to 0 just like in ISIS. TIEs with
lifetimes differing by less than
`lifetime_diff2ignore` MUST be
considered EQUAL.
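The comparison of two copies of the same TIE by sequence number and then remaining lifetime can be sketched as follows. The default value of `lifetime_diff2ignore` and the function name are assumptions for the example; the schema defines the real constant, and the sketch does not cover any fields compared afterwards.

```python
LIFETIME_DIFF2IGNORE = 400  # ASSUMED illustrative value


def compare_tie_versions(seq_a: int, life_a: int, seq_b: int, life_b: int,
                         diff2ignore: int = LIFETIME_DIFF2IGNORE) -> int:
    """Return -1, 0 or 1 as copy A orders below, equal to or above copy B.

    Sequence numbers are compared first. Remaining lifetimes are compared
    next, and lifetimes differing by less than `diff2ignore` count as equal.
    """
    if seq_a != seq_b:
        return 1 if seq_a > seq_b else -1
    if abs(life_a - life_b) < diff2ignore:
        return 0
    return 1 if life_a > life_b else -1
```

Treating close lifetimes as equal avoids needless re-flooding when two nodes merely age the same TIE at slightly different rates.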
9.2.31. RIFT/encoding/TIEID
ID of a TIE
@note: TIEID space is a total order achieved by comparing the
elements in sequence defined and comparing each value as an unsigned
integer of according length.
9.2.31.1. Requested Entries
Name Value Schema Version Description
direction 1 1.0 direction of TIE
originator 2 1.0 indicates originator of the TIE
tietype 3 1.0 type of the tie
tie_nr 4 1.0 number of the tie
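The total order on the TIEID space can be sketched with a tuple whose fields follow the sequence of the table above. The class and helper names are hypothetical, and the fields are assumed to be already decoded as non-negative integers.

```python
from typing import NamedTuple


class TIEID(NamedTuple):
    """Fields in the order in which they are compared."""
    direction: int
    originator: int
    tietype: int
    tie_nr: int


def is_sorted(tieids) -> bool:
    """True if the sequence of TIEIDs is in non-decreasing order."""
    items = list(tieids)
    return all(a <= b for a, b in zip(items, items[1:]))
```

Python compares tuples field by field from left to right, which matches the rule of comparing the elements in the sequence defined. The same helper could be used to check that a list of headers is sorted before it is placed in a TIDE.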
9.2.32. RIFT/encoding/TIEPacket
TIE packet
9.2.32.1. Requested Entries
Name Value Schema Version Description
header 1 1.0
element 2 1.0
9.2.33. RIFT/encoding/TIREPacket
TIRE packet
9.2.33.1. Requested Entries
Name Value Schema Version Description
headers 1 1.0
10. Acknowledgments 10. Acknowledgments
A new routing protocol in its complexity is not a product of a parent A new routing protocol in its complexity is not a product of a parent
but of a village as the author list shows already. However, many but of a village as the author list shows already. However, many
more people provided input, fine-combed the specification based on more people provided input, fine-combed the specification based on
their experience in design or implementation. This section will make their experience in design or implementation. This section will make
an inadequate attempt in recording their contribution. an inadequate attempt in recording their contribution.
Many thanks to Naiming Shen for some of the early discussions around Many thanks to Naiming Shen for some of the early discussions around
skipping to change at page 98, line 30 skipping to change at page 113, line 16
Topology (MT) Routing in Intermediate System to Topology (MT) Routing in Intermediate System to
Intermediate Systems (IS-ISs)", RFC 5120, Intermediate Systems (IS-ISs)", RFC 5120,
DOI 10.17487/RFC5120, February 2008, DOI 10.17487/RFC5120, February 2008,
<https://www.rfc-editor.org/info/rfc5120>. <https://www.rfc-editor.org/info/rfc5120>.
[RFC5303] Katz, D., Saluja, R., and D. Eastlake 3rd, "Three-Way [RFC5303] Katz, D., Saluja, R., and D. Eastlake 3rd, "Three-Way
Handshake for IS-IS Point-to-Point Adjacencies", RFC 5303, Handshake for IS-IS Point-to-Point Adjacencies", RFC 5303,
DOI 10.17487/RFC5303, October 2008, DOI 10.17487/RFC5303, October 2008,
<https://www.rfc-editor.org/info/rfc5303>. <https://www.rfc-editor.org/info/rfc5303>.
[RFC5549] Le Faucheur, F. and E. Rosen, "Advertising IPv4 Network
Layer Reachability Information with an IPv6 Next Hop",
RFC 5549, DOI 10.17487/RFC5549, May 2009,
<https://www.rfc-editor.org/info/rfc5549>.
[RFC5709] Bhatia, M., Manral, V., Fanto, M., White, R., Barnes, M., [RFC5709] Bhatia, M., Manral, V., Fanto, M., White, R., Barnes, M.,
Li, T., and R. Atkinson, "OSPFv2 HMAC-SHA Cryptographic Li, T., and R. Atkinson, "OSPFv2 HMAC-SHA Cryptographic
Authentication", RFC 5709, DOI 10.17487/RFC5709, October Authentication", RFC 5709, DOI 10.17487/RFC5709, October
2009, <https://www.rfc-editor.org/info/rfc5709>. 2009, <https://www.rfc-editor.org/info/rfc5709>.
[RFC5881] Katz, D. and D. Ward, "Bidirectional Forwarding Detection [RFC5881] Katz, D. and D. Ward, "Bidirectional Forwarding Detection
(BFD) for IPv4 and IPv6 (Single Hop)", RFC 5881, (BFD) for IPv4 and IPv6 (Single Hop)", RFC 5881,
DOI 10.17487/RFC5881, June 2010, DOI 10.17487/RFC5881, June 2010,
<https://www.rfc-editor.org/info/rfc5881>. <https://www.rfc-editor.org/info/rfc5881>.
skipping to change at page 99, line 30 skipping to change at page 114, line 20
Decraene, B., Litkowski, S., and R. Shakir, "Segment Decraene, B., Litkowski, S., and R. Shakir, "Segment
Routing Architecture", RFC 8402, DOI 10.17487/RFC8402, Routing Architecture", RFC 8402, DOI 10.17487/RFC8402,
July 2018, <https://www.rfc-editor.org/info/rfc8402>. July 2018, <https://www.rfc-editor.org/info/rfc8402>.
[RFC8505] Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C. [RFC8505] Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C.
Perkins, "Registration Extensions for IPv6 over Low-Power Perkins, "Registration Extensions for IPv6 over Low-Power
Wireless Personal Area Network (6LoWPAN) Neighbor Wireless Personal Area Network (6LoWPAN) Neighbor
Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018, Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018,
<https://www.rfc-editor.org/info/rfc8505>. <https://www.rfc-editor.org/info/rfc8505>.
[thrift] Apache Software Foundation, "Thrift Interface Description
Language", <https://thrift.apache.org/docs/idl>.
11.2. Informative References 11.2. Informative References
[CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer [CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer
Communication Environments", IEEE International Parallel & Communication Environments", IEEE International Parallel &
Distributed Processing Symposium, 2011. Distributed Processing Symposium, 2011.
[DIJKSTRA] [DIJKSTRA]
Dijkstra, E., "A Note on Two Problems in Connexion with Dijkstra, E., "A Note on Two Problems in Connexion with
Graphs", Journal Numer. Math. , 1959. Graphs", Journal Numer. Math. , 1959.
skipping to change at page 103, line 7 skipping to change at page 117, line 44
1 < = > > > ? < < 1 < = > > > ? < <
2 < < = > > > ? < 2 < < = > > > ? <
3 < < < = > > > ? 3 < < < = > > > ?
4 ? < < < = > > > 4 ? < < < = > > >
5 > ? < < < = > > 5 > ? < < < = > >
6 > > ? < < < = > 6 > > ? < < < = >
7 > > > ? < < < = 7 > > > ? < < < =
Appendix B. Information Elements Schema Appendix B. Information Elements Schema
This section introduces the schema for information elements. This section introduces the schema for information elements. The IDL
is Thrift [thrift].
On schema changes that On schema changes that
1. change field numbers or 1. change field numbers or
2. add new *required* fields or 2. add new *required* fields or
3. remove any fields or 3. remove any fields or
4. change lists into sets, unions into structures or 4. change lists into sets, unions into structures or
5. change multiplicity of fields or 5. change multiplicity of fields or
6. change the name of any field or type or 6. change the name of any field or type or
7. change datatypes of any field or 7. change datatypes of any field or
skipping to change at page 103, line 40 skipping to change at page 118, line 29
10. change any enumeration type except extending `common.TIEType` 10. change any enumeration type except extending `common.TIEType`
(use of enumeration types is generally discouraged) (use of enumeration types is generally discouraged)
major version of the schema MUST increase. All other changes MUST major version of the schema MUST increase. All other changes MUST
increase minor version within the same major. increase minor version within the same major.
Observe however that introducing an optional field does not cause a Observe however that introducing an optional field does not cause a
major version increase even if the fields inside the structure are major version increase even if the fields inside the structure are
optional with defaults. optional with defaults.
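The versioning rule above can be sketched as a small classifier. This is a minimal illustration, not part of the schema: the `Change` names and `next_version` are hypothetical, only the change kinds visible in this excerpt (items 1 to 7 and 10) are modeled, and resetting the minor version to 0 on a major bump is an assumption the text does not state.

```python
from enum import Enum, auto


class Change(Enum):
    """Hypothetical labels for the schema change kinds listed above."""
    FIELD_NUMBER_CHANGED = auto()            # item 1
    REQUIRED_FIELD_ADDED = auto()            # item 2
    FIELD_REMOVED = auto()                   # item 3
    LIST_TO_SET_OR_UNION_TO_STRUCT = auto()  # item 4
    MULTIPLICITY_CHANGED = auto()            # item 5
    NAME_CHANGED = auto()                    # item 6
    DATATYPE_CHANGED = auto()                # item 7
    ENUM_CHANGED = auto()                    # item 10, other than extending common.TIEType
    TIETYPE_ENUM_EXTENDED = auto()           # the exception named in item 10
    OPTIONAL_FIELD_ADDED = auto()            # explicitly not a major change


# Changes that force a major version increase.
MAJOR_CHANGES = frozenset({
    Change.FIELD_NUMBER_CHANGED,
    Change.REQUIRED_FIELD_ADDED,
    Change.FIELD_REMOVED,
    Change.LIST_TO_SET_OR_UNION_TO_STRUCT,
    Change.MULTIPLICITY_CHANGED,
    Change.NAME_CHANGED,
    Change.DATATYPE_CHANGED,
    Change.ENUM_CHANGED,
})


def next_version(major, minor, changes):
    """Return the (major, minor) pair after applying a set of schema changes."""
    changes = list(changes)
    if any(change in MAJOR_CHANGES for change in changes):
        # Assumption: minor restarts at 0 within the new major.
        return (major + 1, 0)
    if changes:
        # All other changes increase the minor version within the same major.
        return (major, minor + 1)
    return (major, minor)
```

For example, adding only an optional field to a 2.0 schema yields 2.1 under this sketch, while removing any field yields 3.0.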
All signed integers as forced by Thrift support must be cast for All signed integers as forced by Thrift [thrift] support must be cast
internal purposes to equivalent unsigned values without discarding for internal purposes to equivalent unsigned values without
the signedness bit. An implementation SHOULD try to avoid using the discarding the signedness bit. An implementation SHOULD try to avoid
signedness bit when generating values. using the signedness bit when generating values.
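The cast described above can be sketched as follows. This is an illustrative helper set, not part of the schema: the function names are hypothetical, and the bit widths are those of the Thrift types involved (for example 16 for `UDPPortType`, 64 for `SystemIDType`).

```python
def to_unsigned(value, bits):
    """Reinterpret a signed Thrift integer as unsigned, keeping every bit."""
    return value & ((1 << bits) - 1)


def to_signed(value, bits):
    """Inverse mapping, used when handing an unsigned value back to Thrift."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        return value - (1 << bits)
    return value


def uses_msb(value, bits):
    """True if the value needs the signedness bit, which generators should avoid."""
    return bool(to_unsigned(value, bits) >> (bits - 1))
```

A 16-bit field decoded by Thrift as -1 is handled internally as 65535, and a generator can use `uses_msb` to check that a freshly chosen value stays clear of the signedness bit.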
The schema is normative. The schema is normative.
B.1. common.thrift B.1. common.thrift
/** /**
Thrift file with common definitions for RIFT Thrift file with common definitions for RIFT
*/ */
/** @note MUST be interpreted in implementation as unsigned 64 bits. /** @note MUST be interpreted in implementation as unsigned 64 bits.
* The implementation SHOULD NOT use the MSB. * The implementation SHOULD NOT use the MSB.
*/ */
typedef i64 SystemIDType typedef i64 SystemIDType
typedef i32 IPv4Address typedef i32 IPv4Address
/** this has to be of length long enough to accommodate prefix */ /** this has to be of length long enough to accommodate prefix */
typedef binary IPv6Address typedef binary IPv6Address
/** @note MUST be interpreted in implementation as unsigned */ /** @note MUST be interpreted in implementation as unsigned */
typedef i16 UDPPortType typedef i16 UDPPortType
/** @note MUST be interpreted in implementation as unsigned */ /** @note MUST be interpreted in implementation as unsigned */
skipping to change at page 105, line 25 skipping to change at page 120, line 17
struct IEEE802_1ASTimeStampType { struct IEEE802_1ASTimeStampType {
1: required i64 AS_sec; 1: required i64 AS_sec;
2: optional i32 AS_nsec; 2: optional i32 AS_nsec;
} }
/** generic counter type */ /** generic counter type */
typedef i64 CounterType typedef i64 CounterType
/** Platform Interface Index type, i.e. index of interface on hardware, can be used e.g. with /** Platform Interface Index type, i.e. index of interface on hardware, can be used e.g. with
RFC5837 */ RFC5837 */
typedef i32 PlatformInterfaceIndex typedef i32 PlatformInterfaceIndex
/** Flags indicating nodes behavior in case of ZTP and support /** flags indicating nodes behavior in case of ZTP
for special optimization procedures. It will force level to `leaf_level` or
`top-of-fabric` level accordingly and enable according procedures
*/ */
enum HierarchyIndications { enum HierarchyIndications {
/** forces level to `leaf_level` and enables according procedures */
leaf_only = 0, leaf_only = 0,
/** forces level to `leaf_level` and enables according procedures */
leaf_only_and_leaf_2_leaf_procedures = 1, leaf_only_and_leaf_2_leaf_procedures = 1,
/** forces level to `top_of_fabric` and enables according procedures */
top_of_fabric = 2, top_of_fabric = 2,
} }
const PacketNumberType undefined_packet_number = 0 const PacketNumberType undefined_packet_number = 0
/** This MUST be used when node is configured as top of fabric in ZTP. /** This MUST be used when node is configured as top of fabric in ZTP.
This is kept reasonably low to allow for fast ZTP convergence on This is kept reasonably low to allow for fast ZTP convergence on
failures. */ failures. */
const LevelType top_of_fabric_level = 24 const LevelType top_of_fabric_level = 24
/** default bandwidth on a link */ /** default bandwidth on a link */
const BandwithInMegaBitsType default_bandwidth = 100 const BandwithInMegaBitsType default_bandwidth = 100
skipping to change at page 106, line 34 skipping to change at page 121, line 26
/** round down interval when TIEs are sent with security hashes /** round down interval when TIEs are sent with security hashes
to prevent excessive computation. **/ to prevent excessive computation. **/
const LifeTimeInSecType rounddown_lifetime_interval = 60 const LifeTimeInSecType rounddown_lifetime_interval = 60
/** any `TieHeader` that has a smaller lifetime difference /** any `TieHeader` that has a smaller lifetime difference
than this constant is equal (if other fields equal). This than this constant is equal (if other fields equal). This
constant MUST be larger than `purge_lifetime` to avoid constant MUST be larger than `purge_lifetime` to avoid
retransmissions */ retransmissions */
const LifeTimeInSecType lifetime_diff2ignore = 400 const LifeTimeInSecType lifetime_diff2ignore = 400
/** default UDP port to run LIEs on */ /** default UDP port to run LIEs on */
const UDPPortType default_lie_udp_port = 911 const UDPPortType default_lie_udp_port = 914
/** default UDP port to receive TIEs on, that can be peer specific */ /** default UDP port to receive TIEs on, that can be peer specific */
const UDPPortType default_tie_udp_flood_port = 912 const UDPPortType default_tie_udp_flood_port = 915
/** default MTU link size to use */ /** default MTU link size to use */
const MTUSizeType default_mtu_size = 1400 const MTUSizeType default_mtu_size = 1400
/** default link being BFD capable */ /** default link being BFD capable */
const bool bfd_default = true const bool bfd_default = true
/** undefined nonce, equivalent to missing nonce */ /** undefined nonce, equivalent to missing nonce */
const NonceType undefined_nonce = 0; const NonceType undefined_nonce = 0;
/** outer security key id */ /** outer security key id, MUST be interpreted in implementation as unsigned */
typedef i8 OuterSecurityKeyID typedef i8 OuterSecurityKeyID
/** outer security key id */ /** security key id, MUST be interpreted in implementation as unsigned */
typedef i32 InnerSecurityKeyID
/** security key id */
typedef i32 TIESecurityKeyID typedef i32 TIESecurityKeyID
/** undefined key */ /** undefined key */
const TIESecurityKeyID undefined_securitykey_id = 0; const TIESecurityKeyID undefined_securitykey_id = 0;
/** Maximum delta (negative or positive) that a mirrored nonce can /** Maximum delta (negative or positive) that a mirrored nonce can
deviate from local value to be considered valid. If nonces are deviate from local value to be considered valid. If nonces are
changed every minute on both sides this opens statistically changed every minute on both sides this opens statistically
a `maximum_valid_nonce_delta` minutes window of identical LIEs, a `maximum_valid_nonce_delta` minutes window of identical LIEs,
TIE, TI(x)E replays. TIE, TI(x)E replays.
The interval cannot be too small since LIE FSM may change The interval cannot be too small since LIE FSM may change
states fairly quickly during ZTP without sending LIEs*/ states fairly quickly during ZTP without sending LIEs*/
const i16 maximum_valid_nonce_delta = 5; const i16 maximum_valid_nonce_delta = 5;
/** direction of tie */
/** indicates whether the direction is northbound/east-west
* or southbound */
enum TieDirectionType { enum TieDirectionType {
Illegal = 0, Illegal = 0,
South = 1, South = 1,
North = 2, North = 2,
DirectionMaxValue = 3, DirectionMaxValue = 3,
} }
/** address family */
enum AddressFamilyType { enum AddressFamilyType {
Illegal = 0, Illegal = 0,
AddressFamilyMinValue = 1, AddressFamilyMinValue = 1,
IPv4 = 2, IPv4 = 2,
IPv6 = 3, IPv6 = 3,
AddressFamilyMaxValue = 4, AddressFamilyMaxValue = 4,
} }
/** IP v4 prefix type */
struct IPv4PrefixType { struct IPv4PrefixType {
1: required IPv4Address address; 1: required IPv4Address address;
2: required PrefixLenType prefixlen; 2: required PrefixLenType prefixlen;
} }
/** IP v6 prefix type */
struct IPv6PrefixType { struct IPv6PrefixType {
1: required IPv6Address address; 1: required IPv6Address address;
2: required PrefixLenType prefixlen; 2: required PrefixLenType prefixlen;
} }
/** IP address type */
union IPAddressType { union IPAddressType {
1: optional IPv4Address ipv4address; 1: optional IPv4Address ipv4address;
2: optional IPv6Address ipv6address; 2: optional IPv6Address ipv6address;
} }
/** Prefix representing reachability. Observe that for interface /** prefix representing reachability.
addresses the protocol can propagate the address part beyond
the subnet mask and on reachability computation that has to @note: for interface
be normalized. The non-significant bits can be used for operational addresses the protocol can propagate the address part beyond
purposes. the subnet mask and on reachability computation that has to
be normalized. The non-significant bits can be used for operational
purposes.
*/ */
union IPPrefixType { union IPPrefixType {
1: optional IPv4PrefixType ipv4prefix; 1: optional IPv4PrefixType ipv4prefix;
2: optional IPv6PrefixType ipv6prefix; 2: optional IPv6PrefixType ipv6prefix;
} }
/** sequence of a prefix when it moves
/** @note: Sequence of a prefix. Comparison function:
if diff(timestamps) < 200msecs better transactionid wins
else better time wins
*/ */
struct PrefixSequenceType { struct PrefixSequenceType {
1: required IEEE802_1ASTimeStampType timestamp; 1: required IEEE802_1ASTimeStampType timestamp;
/** transaction ID set by client in e.g. in 6LoWPAN */
2: optional PrefixTransactionIDType transactionid; 2: optional PrefixTransactionIDType transactionid;
} }
/** Type of TIE. /** type of TIE.
This enum indicates what TIE type the TIE is carrying. This enum indicates what TIE type the TIE is carrying.
In case the value is not known to the receiver, In case the value is not known to the receiver,
re-flooded the same way as prefix TIEs. This allows for re-flooded the same way as prefix TIEs. This allows for
future extensions of the protocol within the same schema major future extensions of the protocol within the same schema major
with types opaque to some nodes unless the flooding scope is not with types opaque to some nodes unless the flooding scope is not
the same as prefix TIE, then a major version revision MUST the same as prefix TIE, then a major version revision MUST
be performed. be performed.
*/ */
enum TIETypeType { enum TIETypeType {
skipping to change at page 108, line 42 skipping to change at page 123, line 36
NodeTIEType = 2, NodeTIEType = 2,
PrefixTIEType = 3, PrefixTIEType = 3,
PositiveDisaggregationPrefixTIEType = 4, PositiveDisaggregationPrefixTIEType = 4,
NegativeDisaggregationPrefixTIEType = 5, NegativeDisaggregationPrefixTIEType = 5,
PGPrefixTIEType = 6, PGPrefixTIEType = 6,
KeyValueTIEType = 7, KeyValueTIEType = 7,
ExternalPrefixTIEType = 8, ExternalPrefixTIEType = 8,
TIETypeMaxValue = 9, TIETypeMaxValue = 9,
} }
/** @note: route types which MUST be ordered on their preference /** RIFT route types.
PGP prefixes are most preferred attracting
traffic north (towards spine) and then south @note: route types which MUST be ordered on their preference
normal prefixes are attracting traffic south (towards leafs), PGP prefixes are most preferred attracting
i.e. prefix in NORTH PREFIX TIE is preferred over SOUTH PREFIX TIE. traffic north (towards spine) and then south
normal prefixes are attracting traffic south (towards leafs),
i.e. prefix in NORTH PREFIX TIE is preferred over SOUTH PREFIX TIE.
@note: The only purpose of those values is to introduce an @note: The only purpose of those values is to introduce an
ordering whereas an implementation can choose internally ordering whereas an implementation can choose internally
any other values as long the ordering is preserved any other values as long the ordering is preserved
*/ */
enum RouteType { enum RouteType {
Illegal = 0, Illegal = 0,
RouteTypeMinValue = 1, RouteTypeMinValue = 1,
/** First legal value. */ /** first legal value. */
/** Discard routes are most preferred */ /** discard routes are most preferred */
Discard = 2, Discard = 2,
/** Local prefixes are directly attached prefixes on the /** local prefixes are directly attached prefixes on the
* system such as e.g. interface routes. * system such as e.g. interface routes.
*/ */
LocalPrefix = 3, LocalPrefix = 3,
/** advertised in S-TIEs */ /** advertised in S-TIEs */
SouthPGPPrefix = 4, SouthPGPPrefix = 4,
/** advertised in N-TIEs */ /** advertised in N-TIEs */
NorthPGPPrefix = 5, NorthPGPPrefix = 5,
/** advertised in N-TIEs */ /** advertised in N-TIEs */
NorthPrefix = 6, NorthPrefix = 6,
/** externally imported north */ /** externally imported north */
NorthExternalPrefix = 7, NorthExternalPrefix = 7,
/** advertised in S-TIEs, either normal prefix or positive disaggregation */ /** advertised in S-TIEs, either normal prefix or positive disaggregation */
SouthPrefix = 8, SouthPrefix = 8,
/** externally imported south */ /** externally imported south */
SouthExternalPrefix = 9, SouthExternalPrefix = 9,
/** negative, transitive prefixes are least preferred of /** negative, transitive prefixes are least preferred */
local variety */
NegativeSouthPrefix = 10, NegativeSouthPrefix = 10,
RouteTypeMaxValue = 11, RouteTypeMaxValue = 11,
} }
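The comparison rule given for `PrefixSequenceType` in common.thrift above ("if diff(timestamps) < 200msecs better transactionid wins else better time wins") can be sketched as below. The excerpt does not define "better", so this sketch assumes the later timestamp and the numerically larger transaction ID are better, ignores any wraparound of the transaction ID, treats a missing `AS_nsec` as 0, and falls back to the timestamp when either transaction ID is absent. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# 200 msecs expressed in nanoseconds.
WINDOW_NS = 200 * 1_000_000


@dataclass(frozen=True)
class PrefixSequence:
    """Illustrative stand-in for PrefixSequenceType."""
    as_sec: int                          # timestamp.AS_sec (required)
    as_nsec: Optional[int] = None        # timestamp.AS_nsec (optional)
    transactionid: Optional[int] = None  # optional transaction ID

    def nanos(self):
        # Assumption: a missing nanosecond part counts as 0.
        return self.as_sec * 1_000_000_000 + (self.as_nsec or 0)


def preferred(a, b):
    """Return whichever of the two sequences wins under the assumed rule."""
    close = abs(a.nanos() - b.nanos()) < WINDOW_NS
    both_have_ids = a.transactionid is not None and b.transactionid is not None
    if close and both_have_ids and a.transactionid != b.transactionid:
        # Timestamps too close to trust: assumed "better" ID is the larger one.
        return a if a.transactionid > b.transactionid else b
    # Otherwise the assumed "better" time is the later one.
    return a if a.nanos() >= b.nanos() else b
```

With two advertisements 100 ms apart, the one with the larger transaction ID wins even if its timestamp is slightly older; once they are a full second apart, the later timestamp wins regardless of transaction ID.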
B.2. encoding.thrift B.2. encoding.thrift
/** /**
Thrift file for packet encodings for RIFT Thrift file for packet encodings for RIFT
*/ */
include "common.thrift" /** Represents protocol encoding schema major version */
const common.VersionType protocol_major_version = 1
/** /** Represents protocol encoding schema minor version */
Thrift file for packet encodings for RIFT
Copyright (c) Juniper Networks, Inc., 2016-
All rights reserved.
*/
include "common.thrift"
namespace rs models
namespace py encoding
/** represents protocol encoding schema major version */
const common.VersionType protocol_major_version = 2
/** represents protocol encoding schema minor version */
const common.MinorVersionType protocol_minor_version = 0 const common.MinorVersionType protocol_minor_version = 0
/** common RIFT packet header */ /** common RIFT packet header */
struct PacketHeader { struct PacketHeader {
/** major version type of protocol */
1: required common.VersionType major_version = protocol_major_version; 1: required common.VersionType major_version = protocol_major_version;
/** minor version type of protocol */
2: required common.VersionType minor_version = protocol_minor_version; 2: required common.VersionType minor_version = protocol_minor_version;
/** this is the node sending the packet, in case of LIE/TIRE/TIDE /** node sending the packet, in case of LIE/TIRE/TIDE
also the originator of it */ * also the originator of it */
3: required common.SystemIDType sender; 3: required common.SystemIDType sender;
/** level of the node sending the packet, required on everything except /** level of the node sending the packet, required on everything except
* LIEs. Lack of presence on LIEs indicates UNDEFINED_LEVEL and is used * LIEs. Lack of presence on LIEs indicates UNDEFINED_LEVEL and is used
* in ZTP procedures. * in ZTP procedures.
*/ */
4: optional common.LevelType level; 4: optional common.LevelType level;
} }
/** Community serves as community for PGP purposes */ /** community */
struct Community { struct Community {
1: required i32 top; 1: required i32 top;
2: required i32 bottom; 2: required i32 bottom;
} }
/** Neighbor structure */ /** neighbor structure */
struct Neighbor { struct Neighbor {
/** system ID of the originator */
1: required common.SystemIDType originator; 1: required common.SystemIDType originator;
/** ID of remote side of the link */
2: required common.LinkIDType remote_id; 2: required common.LinkIDType remote_id;
} }
/** Capabilities the node supports. The schema may add to this /** capabilities the node supports. The schema may add to this
field future capabilities to indicate whether it will support field future capabilities to indicate whether it will support
interpretation of future schema extensions on the same major interpretation of future schema extensions on the same major
revision. Such fields MUST be optional and have an implicit or revision. Such fields MUST be optional and have an implicit or
explicit false default value. If a future capability changes route explicit false default value. If a future capability changes route
selection or generates blackholes if some nodes are not supporting selection or generates blackholes if some nodes are not supporting
it then a major version increment is unavoidable. it then a major version increment is unavoidable.
*/ */
struct NodeCapabilities { struct NodeCapabilities {
/** can this node participate in flood reduction */ /** can this node participate in flood reduction */
1: optional bool flood_reduction = 1: optional bool flood_reduction =
common.flood_reduction_default; common.flood_reduction_default;
/** does this node restrict itself to be top-of-fabric or /** does this node restrict itself to be top-of-fabric or
leaf only (in ZTP) and does it support leaf-2-leaf procedures */ leaf only (in ZTP) and does it support leaf-2-leaf procedures */
2: optional common.HierarchyIndications hierarchy_indications; 2: optional common.HierarchyIndications hierarchy_indications;
} }
/* Link capabilities */ /** link capabilities */
struct LinkCapabilities { struct LinkCapabilities {
/* indicates that the link's `local ID` can be used as its BFD /** indicates that the link's `local ID` can be used as its BFD
discriminator and the link is supporting BFD */ * discriminator and the link is supporting BFD */
1: optional bool bfd = 1: optional bool bfd =
common.bfd_default; common.bfd_default;
/** indicates whether the interface will support v4 forwarding. This MUST
* be set to true when LIEs from a v4 address are sent and MAY be set
* to true in LIEs on v6 address. If v4 and v6 LIEs indicate contradicting
* information the behavior is unspecified. */
2: optional bool v4_forwarding_capable =
true;
} }
/** RIFT LIE packet /** RIFT LIE packet
@note this node's level is already included on the packet header */ @note this node's level is already included on the packet header */
struct LIEPacket { struct LIEPacket {
/** optional node or adjacency name */ /** node or adjacency name */
1: optional string name; 1: optional string name;
/** local link ID */ /** local link ID */
2: required common.LinkIDType local_id; 2: required common.LinkIDType local_id;
/** UDP port to which we can receive flooded TIEs */ /** UDP port to which we can receive flooded TIEs */
3: required common.UDPPortType flood_port = 3: required common.UDPPortType flood_port =
common.default_tie_udp_flood_port; common.default_tie_udp_flood_port;
/** layer 3 MTU, used to discover mismatch. */ /** layer 3 MTU, used to discover mismatch. */
4: optional common.MTUSizeType link_mtu_size = 4: optional common.MTUSizeType link_mtu_size =
common.default_mtu_size; common.default_mtu_size;
/** local link bandwidth on the interface */ /** local link bandwidth on the interface */
5: optional common.BandwithInMegaBitsType link_bandwidth = 5: optional common.BandwithInMegaBitsType link_bandwidth =
common.default_bandwidth; common.default_bandwidth;
/** this will reflect the neighbor once received to provide /** reflects the neighbor once received to provide
3-way connectivity */ 3-way connectivity */
6: optional Neighbor neighbor; 6: optional Neighbor neighbor;
/** node's PoD */
7: optional common.PodType pod = 7: optional common.PodType pod =
common.default_pod; common.default_pod;
/** optional node capabilities shown in the LIE. The capabilities /** node capabilities shown in the LIE. The capabilities
MUST match the capabilities shown in the Node TIEs, otherwise MUST match the capabilities shown in the Node TIEs, otherwise
the behavior is unspecified. A node detecting the mismatch the behavior is unspecified. A node detecting the mismatch
SHOULD generate according error */ SHOULD generate according error */
10: optional NodeCapabilities node_capabilities; 10: optional NodeCapabilities node_capabilities;
/** capabilities of this link */
11: optional LinkCapabilities link_capabilities; 11: optional LinkCapabilities link_capabilities;
/** required holdtime of the adjacency, i.e. how much time /** required holdtime of the adjacency, i.e. how much time
MUST expire without LIE for the adjacency to drop */ MUST expire without LIE for the adjacency to drop */
12: required common.TimeIntervalInSecType holdtime = 12: required common.TimeIntervalInSecType holdtime =
common.default_lie_holdtime; common.default_lie_holdtime;
/** optional downstream assigned locally significant label /** unsolicited, downstream assigned locally significant label
value for the adjacency */ value for the adjacency */
13: optional common.LabelType label; 13: optional common.LabelType label;
/** indicates that the level on the LIE MUST NOT be used /** indicates that the level on the LIE MUST NOT be used
to derive a ZTP level by the receiving node */ to derive a ZTP level by the receiving node */
21: optional bool not_a_ztp_offer = 21: optional bool not_a_ztp_offer =
common.default_not_a_ztp_offer; common.default_not_a_ztp_offer;
/** indicates to northbound neighbor that it should /** indicates to northbound neighbor that it should
be reflooding this node's N-TIEs to achieve flood reduction and be reflooding this node's N-TIEs to achieve flood reduction and
balancing for northbound flooding. To be ignored if received from a balancing for northbound flooding. To be ignored if received from a
northbound adjacency */ northbound adjacency */
22: optional bool you_are_flood_repeater = 22: optional bool you_are_flood_repeater =
common.default_you_are_flood_repeater; common.default_you_are_flood_repeater;
/** can be optionally set to indicate to neighbor that packet losses are seen on /** can be optionally set to indicate to neighbor that packet losses are seen on
reception based on packet numbers or the rate is too high. The receiver SHOULD reception based on packet numbers or the rate is too high. The receiver SHOULD
temporarily slow down flooding rates. temporarily slow down flooding rates
*/ */
23: optional bool you_are_sending_too_quickly = 23: optional bool you_are_sending_too_quickly =
false; false;
/** instance name in case multiple RIFT instances running on same interface */
24: optional string instance_name;
} }
/** LinkID pair describes one of parallel links between two nodes */ /** LinkID pair describes one of parallel links between two nodes */
struct LinkIDPair { struct LinkIDPair {
/** node-wide unique value for the local link */ /** node-wide unique value for the local link */
1: required common.LinkIDType local_id; 1: required common.LinkIDType local_id;
/** received remote link ID for this link */ /** received remote link ID for this link */
2: required common.LinkIDType remote_id; 2: required common.LinkIDType remote_id;
/** optionally describes the local interface index of the link */ /** describes the local interface index of the link */
10: optional common.PlatformInterfaceIndex platform_interface_index; 10: optional common.PlatformInterfaceIndex platform_interface_index;
/** optionally describes the local interface name */ /** describes the local interface name */
11: optional string platform_interface_name; 11: optional string platform_interface_name;
/** optional indication whether the link is secured, i.e. protected by outer key, absence /** indication whether the link is secured, i.e. protected by outer key, absence
of this element means no indication, undefined outer key means not secured */ of this element means no indication, undefined outer key means not secured */
12: optional common.OuterSecurityKeyID trusted_outer_security_key; 12: optional common.OuterSecurityKeyID trusted_outer_security_key;
/** more properties of the link can go in here */
} }
/** ID of a TIE /** ID of a TIE
@note: TIEID space is a total order achieved by comparing the elements @note: TIEID space is a total order achieved by comparing the elements
in sequence defined and comparing each value as an in sequence defined and comparing each value as an
unsigned integer of according length. unsigned integer of according length.
*/ */
struct TIEID { struct TIEID {
/** indicates direction of the TIE */ /** direction of TIE */
1: required common.TieDirectionType direction; 1: required common.TieDirectionType direction;
/** indicates originator of the TIE */ /** indicates originator of the TIE */
2: required common.SystemIDType originator; 2: required common.SystemIDType originator;
/** type of the tie */
3: required common.TIETypeType tietype; 3: required common.TIETypeType tietype;
/** number of the tie */
4: required common.TIENrType tie_nr; 4: required common.TIENrType tie_nr;
} }
/** Header of a TIE. /** Header of a TIE.
@note: TIEID space is a total order achieved by comparing the elements @note: TIEID space is a total order achieved by comparing the elements
in sequence defined and comparing each value as an in sequence defined and comparing each value as an
unsigned integer of according length. unsigned integer of according length.
After sequence number the lifetime received on the envelope @note: After sequence number the lifetime received on the envelope
must be used for comparison before further fields. must be used for comparison before further fields.
`origination_time` and `origination_lifetime` are disregarded @note: `origination_time` and `origination_lifetime` are disregarded
for comparison purposes and carried purely for debugging/security for comparison purposes and carried purely for debugging/security
purposes if present. purposes if present.
*/ */
struct TIEHeader { struct TIEHeader {
/** ID of the tie */
2: required TIEID tieid; 2: required TIEID tieid;
/** sequence number of the tie */
3: required common.SeqNrType seq_nr; 3: required common.SeqNrType seq_nr;
/** optional absolute timestamp when the TIE /** absolute timestamp when the TIE
was generated. This can be used on fabrics with was generated. This can be used on fabrics with
synchronized clock to prevent lifetime modification attacks. */ synchronized clock to prevent lifetime modification attacks. */
10: optional common.IEEE802_1ASTimeStampType origination_time; 10: optional common.IEEE802_1ASTimeStampType origination_time;
/** optional original lifetime when the TIE /** original lifetime when the TIE
was generated. This can be used on fabrics with was generated. This can be used on fabrics with
synchronized clock to prevent lifetime modification attacks. */ synchronized clock to prevent lifetime modification attacks. */
12: optional common.LifeTimeInSecType origination_lifetime; 12: optional common.LifeTimeInSecType origination_lifetime;
} }
/** Header of a TIE as described in TIRE/TIDE. /** Header of a TIE as described in TIRE/TIDE.
*/ */
struct TIEHeaderWithLifeTime { struct TIEHeaderWithLifeTime {
1: required TIEHeader header; 1: required TIEHeader header;
/** remaining lifetime that expires down to 0 just like in ISIS. /** remaining lifetime that expires down to 0 just like in ISIS.
TIEs with lifetimes differing by less than `lifetime_diff2ignore` MUST TIEs with lifetimes differing by less than `lifetime_diff2ignore` MUST
be considered EQUAL. */ be considered EQUAL. */
2: required common.LifeTimeInSecType remaining_lifetime; 2: required common.LifeTimeInSecType remaining_lifetime;
} }
/** A TIDE with sorted TIE headers, if headers unsorted, behavior is undefined */ /** TIDE with sorted TIE headers, if headers are unsorted, behavior is undefined */
struct TIDEPacket { struct TIDEPacket {
/** all 00s marks starts */ /** first TIE header in the tide packet */
1: required TIEID start_range; 1: required TIEID start_range;
/** all FFs mark end */ /** last TIE header in the tide packet */
2: required TIEID end_range; 2: required TIEID end_range;
/** _sorted_ list of headers */ /** _sorted_ list of headers */
3: required list<TIEHeaderWithLifeTime> headers; 3: required list<TIEHeaderWithLifeTime> headers;
} }
/** A TIRE packet */ /** TIRE packet */
struct TIREPacket { struct TIREPacket {
1: required set<TIEHeaderWithLifeTime> headers; 1: required set<TIEHeaderWithLifeTime> headers;
} }
/** Neighbor of a node */ /** neighbor of a node */
struct NodeNeighborsTIEElement { struct NodeNeighborsTIEElement {
/** Level of neighbor */ /** level of neighbor */
1: required common.LevelType level; 1: required common.LevelType level;
/** Cost to neighbor. /** Cost to neighbor.
@note: All parallel links to same node @note: All parallel links to same node
incur same cost, in case the neighbor has multiple incur same cost, in case the neighbor has multiple
parallel links at different cost, the largest distance parallel links at different cost, the largest distance
(highest numerical value) MUST be advertised (highest numerical value) MUST be advertised
@note: any neighbor with cost <= 0 MUST be ignored in computations */ @note: any neighbor with cost <= 0 MUST be ignored in computations */
3: optional common.MetricType cost = common.default_distance; 3: optional common.MetricType cost = common.default_distance;
/** can carry description of multiple parallel links in a TIE */ /** can carry description of multiple parallel links in a TIE */
4: optional set<LinkIDPair> link_ids; 4: optional set<LinkIDPair> link_ids;
/** total bandwidth to neighbor, this will be normally sum of the /** total bandwidth to neighbor, this will be normally sum of the
bandwidths of all the parallel links. */ bandwidths of all the parallel links. */
5: optional common.BandwithInMegaBitsType bandwidth = 5: optional common.BandwithInMegaBitsType bandwidth =
common.default_bandwidth; common.default_bandwidth;
} }
/** Flags the node sets */ /** Flags the node sets */
struct NodeFlags { struct NodeFlags {
/** node is in overload, do not transit traffic through it */ /** indicates that node is in overload, do not transit traffic through it */
1: optional bool overload = common.overload_default; 1: optional bool overload = common.overload_default;
} }
/** Description of a node. /** Description of a node.
It may occur multiple times in different TIEs but if either It may occur multiple times in different TIEs but if either
* capabilities values do not match or * capabilities values do not match or
* flags values do not match or * flags values do not match or
* neighbors repeat with different values * neighbors repeat with different values
the behavior is undefined and a warning SHOULD be generated. the behavior is undefined and a warning SHOULD be generated.
Neighbors can be distributed across multiple TIEs however if Neighbors can be distributed across multiple TIEs however if
the sets are disjoint. Miscablings SHOULD be repeated in every the sets are disjoint. Miscablings SHOULD be repeated in every
node TIE, otherwise the behavior is undefined. node TIE, otherwise the behavior is undefined.
@note: observe that absence of fields implies defined defaults @note: observe that absence of fields implies defined defaults
*/ */
struct NodeTIEElement { struct NodeTIEElement {
/** level of the node */
1: required common.LevelType level; 1: required common.LevelType level;
/** If neighbor systemID repeats in other node TIEs of same node /** node's neighbors. If neighbor systemID repeats in other node TIEs of
the behavior is undefined. */ same node the behavior is undefined */
2: required map<common.SystemIDType, 2: required map<common.SystemIDType,
NodeNeighborsTIEElement> neighbors; NodeNeighborsTIEElement> neighbors;
/** capabilities of the node */
3: optional NodeCapabilities capabilities; 3: optional NodeCapabilities capabilities;
/** flags of the node */
4: optional NodeFlags flags; 4: optional NodeFlags flags;
/** optional node name for easier operations */ /** optional node name for easier operations */
5: optional string name; 5: optional string name;
/** PoD to which the node belongs */ /** PoD to which the node belongs */
6: optional common.PodType pod; 6: optional common.PodType pod;
/** if any local links are miscabled, the indication is flooded. */ /** if any local links are miscabled, the indication is flooded */
10: optional set<common.LinkIDType> miscabled_links; 10: optional set<common.LinkIDType> miscabled_links;
} }
struct PrefixAttributes { struct PrefixAttributes {
/** distance of the prefix */
2: required common.MetricType metric = common.default_distance; 2: required common.MetricType metric = common.default_distance;
/** generic unordered set of route tags, can be redistributed to other protocols or use /** generic unordered set of route tags, can be redistributed to other protocols or use
within the context of real time analytics */ within the context of real time analytics */
3: optional set<common.RouteTagType> tags; 3: optional set<common.RouteTagType> tags;
/** optional monotonic clock for mobile addresses */ /** monotonic clock for mobile addresses */
4: optional common.PrefixSequenceType monotonic_clock; 4: optional common.PrefixSequenceType monotonic_clock;
/** optionally indicates the interface is a node loopback */ /** indicates if the interface is a node loopback */
6: optional bool loopback = false; 6: optional bool loopback = false;
/** indicates that the prefix is directly attached, i.e. should be routed to even if /** indicates that the prefix is directly attached, i.e. should be routed to even if
the node is in overload. **/ the node is in overload. **/
7: optional bool directly_attached = true; 7: optional bool directly_attached = true;
/** in case of locally originated prefixes, i.e. interface addresses this can /** in case of locally originated prefixes, i.e. interface addresses this can
describe which link the address belongs to. */ describe which link the address belongs to. */
10: optional common.LinkIDType from_link; 10: optional common.LinkIDType from_link;
} }
/** multiple prefixes */ /** TIE carrying prefixes */
struct PrefixTIEElement { struct PrefixTIEElement {
/** prefixes with the associated attributes. /** prefixes with the associated attributes.
if the same prefix repeats in multiple TIEs of same node if the same prefix repeats in multiple TIEs of same node
behavior is unspecified */ behavior is unspecified */
1: required map<common.IPPrefixType, PrefixAttributes> prefixes; 1: required map<common.IPPrefixType, PrefixAttributes> prefixes;
} }
/** keys with their values */ /** Generic key value pairs */
struct KeyValueTIEElement { struct KeyValueTIEElement {
/** if the same key repeats in multiple TIEs of same node /** if the same key repeats in multiple TIEs of same node
or with different values, behavior is unspecified */ or with different values, behavior is unspecified */
1: required map<common.KeyIDType,string> keyvalues; 1: required map<common.KeyIDType,string> keyvalues;
} }
/** single element in a TIE. enum `common.TIETypeType` /** single element in a TIE. enum `common.TIETypeType`
in TIEID indicates which elements MUST be present in TIEID indicates which elements MUST be present
in the TIEElement. In case of mismatch the unexpected in the TIEElement. In case of mismatch the unexpected
elements MUST be ignored. In case of lack of expected elements MUST be ignored. In case of lack of expected
element the TIE an error MUST be reported and the TIE element the TIE an error MUST be reported and the TIE
MUST be ignored. MUST be ignored.
This type can be extended with new optional elements This type can be extended with new optional elements
for new `common.TIETypeType` values without breaking for new `common.TIETypeType` values without breaking
the major but if it is necessary to understand whether the major but if it is necessary to understand whether
all nodes support the new type a node capability must all nodes support the new type a node capability must
be added as well. be added as well.
*/ */
union TIEElement { union TIEElement {
/** in case of enum common.TIETypeType.NodeTIEType */ /** used in case of enum common.TIETypeType.NodeTIEType */
1: optional NodeTIEElement node; 1: optional NodeTIEElement node;
/** in case of enum common.TIETypeType.PrefixTIEType */ /** used in case of enum common.TIETypeType.PrefixTIEType */
2: optional PrefixTIEElement prefixes; 2: optional PrefixTIEElement prefixes;
/** positive prefixes (always southbound) /** positive prefixes (always southbound)
It MUST NOT be advertised within a North TIE. It MUST NOT be advertised within a North TIE and ignored otherwise
*/ */
3: optional PrefixTIEElement positive_disaggregation_prefixes; 3: optional PrefixTIEElement positive_disaggregation_prefixes;
/** transitive, negative prefixes (always southbound) which /** transitive, negative prefixes (always southbound) which
MUST be aggregated and propagated MUST be aggregated and propagated
according to the specification according to the specification
southwards towards lower levels to heal southwards towards lower levels to heal
pathological upper level partitioning, otherwise pathological upper level partitioning, otherwise
blackholes may occur in multiplane fabrics. blackholes may occur in multiplane fabrics.
It MUST NOT be advertised within a North TIE. It MUST NOT be advertised within a North TIE.
*/ */
4: optional PrefixTIEElement negative_disaggregation_prefixes; 4: optional PrefixTIEElement negative_disaggregation_prefixes;
/** externally reimported prefixes */ /** externally reimported prefixes */
5: optional PrefixTIEElement external_prefixes; 5: optional PrefixTIEElement external_prefixes;
/** Key-Value store elements */ /** Key-Value store elements */
6: optional KeyValueTIEElement keyvalues; 6: optional KeyValueTIEElement keyvalues;
/** @todo: policy guided prefixes */
} }
/** TIE packet */
struct TIEPacket { struct TIEPacket {
1: required TIEHeader header; 1: required TIEHeader header;
2: required TIEElement element; 2: required TIEElement element;
} }
/** content of a RIFT packet */
union PacketContent { union PacketContent {
1: optional LIEPacket lie; 1: optional LIEPacket lie;
2: optional TIDEPacket tide; 2: optional TIDEPacket tide;
3: optional TIREPacket tire; 3: optional TIREPacket tire;
4: optional TIEPacket tie; 4: optional TIEPacket tie;
} }
/** protocol packet structure */ /** RIFT packet structure */
struct ProtocolPacket { struct ProtocolPacket {
1: required PacketHeader header; 1: required PacketHeader header;
2: required PacketContent content; 2: required PacketContent content;
} }
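The comparison rule stated in the TIEHeader comment (sequence number first, then the lifetime received on the envelope, with `origination_time` and `origination_lifetime` disregarded) can be sketched as follows. This is a minimal illustration, not the normative algorithm: the class `HeaderWithLifeTime`, the function `compare_headers`, and the threshold value are hypothetical, and the choice that a longer remaining lifetime ranks as newer is an assumption that the text shown here does not state.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed threshold in seconds, for illustration only; the authoritative
# `lifetime_diff2ignore` constant is defined in the common schema.
LIFETIME_DIFF2IGNORE = 400


@dataclass(frozen=True)
class HeaderWithLifeTime:
    """Flattened stand-in for TIEHeaderWithLifeTime (hypothetical model)."""
    tieid: Tuple[int, ...]
    seq_nr: int
    remaining_lifetime: int
    # Carried for debugging/security only; never consulted below.
    origination_time: Optional[int] = None
    origination_lifetime: Optional[int] = None


def compare_headers(a, b, diff2ignore=LIFETIME_DIFF2IGNORE):
    """Return -1, 0 or 1 when `a` is older than, equal to, or newer than `b`."""
    if a.tieid != b.tieid:
        raise ValueError("headers describe different TIEs")
    # Step 1: the sequence number decides whenever it differs.
    if a.seq_nr != b.seq_nr:
        return 1 if a.seq_nr > b.seq_nr else -1
    # Step 2: lifetimes closer than the threshold are considered EQUAL.
    if abs(a.remaining_lifetime - b.remaining_lifetime) < diff2ignore:
        return 0
    # Assumption: the longer remaining lifetime is treated as newer.
    return 1 if a.remaining_lifetime > b.remaining_lifetime else -1
```

Keeping the origination fields out of the comparison means two copies of the same TIE that differ only in those fields still compare as equal.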
Appendix C. Finite State Machines and Precise Operational Appendix C. Finite State Machines and Precise Operational
Specifications Specifications
Some FSM figures are provided as [DOT] description due to limitations Some FSM figures are provided as [DOT] description due to limitations
of ASCII art. of ASCII art.
skipping to change at page 138, line 35 skipping to change at page 153, line 35
C.3.6. Sending TIEs C.3.6. Sending TIEs
On a periodic basis all TIEs with lifetime left > 0 MUST be sent out On a periodic basis all TIEs with lifetime left > 0 MUST be sent out
on the adjacency, removed from TIES_TX list and requeued onto on the adjacency, removed from TIES_TX list and requeued onto
TIES_RTX list. TIES_RTX list.
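The periodic pass described in C.3.6 can be sketched as below. This is only an illustration under stated assumptions: the queues are modeled as plain Python lists, and `remaining_lifetime` and `send` are hypothetical callbacks standing in for the TIE database lookup and the adjacency transmit path. The text says nothing here about TIEs whose lifetime has reached 0, so the sketch simply leaves them on TIES_TX.

```python
def send_ties(ties_tx, ties_rtx, remaining_lifetime, send):
    """Run one periodic transmission pass over the TIES_TX queue."""
    # Iterate over a copy so entries can be removed from ties_tx safely.
    for tie_id in list(ties_tx):
        if remaining_lifetime(tie_id) > 0:
            send(tie_id)             # transmit on the adjacency
            ties_tx.remove(tie_id)   # no longer pending first transmission
            ties_rtx.append(tie_id)  # now awaiting possible retransmission
```

A caller would invoke `send_ties` from whatever timer drives flooding on the adjacency; the timer interval is outside the scope of this sketch.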
Appendix D. Constants Appendix D. Constants
D.1. Configurable Protocol Constants D.1. Configurable Protocol Constants
This section gather constants that are provided in the schema files This section gathers constants that are provided in the schema files
and the document. and in the document.
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| | Type | Value | | | Type | Value |
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| LIE IPv4 | Default | 224.0.0.120 or all-rift-routers | | LIE IPv4 | Default | 224.0.0.120 or all-rift-routers |
| Multicast | Value, | to be assigned in IPv4 Multicast | | Multicast | Value, | to be assigned in IPv4 Multicast |
| Address | Configurable | Address Space Registry in Local | | Address | Configurable | Address Space Registry in Local |
| | | Network Control Block | | | | Network Control Block |
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| LIE IPv6 | Default | FF02::A1F7 or all-rift-routers to | | LIE IPv6 | Default | FF02::A1F7 or all-rift-routers to |
| Multicast | Value, | be assigned in IPv6 Multicast | | Multicast | Value, | be assigned in IPv6 Multicast |
| Address | Configurable | Address Assignments | | Address | Configurable | Address Assignments |
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| LIE | Default | 911 | | LIE | Default | 914 |
| Destination | Value, | | | Destination | Value, | |
| Port | Configurable | | | Port | Configurable | |
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| Level value | Constant | 24 | | Level value | Constant | 24 |
| for | | | | for | | |
| TOP_OF_FABRIC | | | | TOP_OF_FABRIC | | |
| flag | | | | flag | | |
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| Default LIE | Default | 3 seconds | | Default LIE | Default | 3 seconds |
| Holdtime | Value, | | | Holdtime | Value, | |
skipping to change at page 140, line 5 skipping to change at page 155, line 5
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
| MAX_TIEID | Constant | TIE Key with maximal values: | | MAX_TIEID | Constant | TIE Key with maximal values: |
| signifies end | | TIEID(originator=MAX_UINT64, | | signifies end | | TIEID(originator=MAX_UINT64, |
| of TIDEs | | tietype=TIETypeMaxValue, | | of TIDEs | | tietype=TIETypeMaxValue, |
| | | tie_nr=MAX_UINT64, | | | | tie_nr=MAX_UINT64, |
| | | direction=North) | | | | direction=North) |
+----------------+--------------+-----------------------------------+ +----------------+--------------+-----------------------------------+
Table 6: all_constants Table 6: all_constants
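The MAX_TIEID row can be illustrated with a small sketch that builds the key from the field values in the table and checks whether a TIDE's `end_range` marks the end of the TIDE sequence. The numeric codes `NORTH` and `TIE_TYPE_MAX_VALUE`, the tuple field order used for comparison, and the helper `is_last_tide` are assumptions made for illustration; the actual enum values and key ordering are defined by the schema and the specification.

```python
from typing import NamedTuple

MAX_UINT64 = 2**64 - 1

# Hypothetical numeric codes; the real values come from the common schema.
NORTH = 2
TIE_TYPE_MAX_VALUE = 7


class TIEID(NamedTuple):
    """Stand-in for the TIEID key; the field order here is an assumption."""
    direction: int
    originator: int
    tietype: int
    tie_nr: int


# Built from the field values listed in the MAX_TIEID row of the table.
MAX_TIEID = TIEID(
    direction=NORTH,
    originator=MAX_UINT64,
    tietype=TIE_TYPE_MAX_VALUE,
    tie_nr=MAX_UINT64,
)


def is_last_tide(end_range):
    """True when a TIDE's end_range equals MAX_TIEID, i.e. ends the TIDEs."""
    return end_range == MAX_TIEID
```

With these assumed codes, tuple comparison makes `MAX_TIEID` greater than or equal to any other key, which is the property a range terminator needs.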
Appendix E. TODO Authors' Addresses
o section on E-W superspine/ToF flooding scope to connect partitions Tony Przygienda (editor)
Juniper
1137 Innovation Way
Sunnyvale, CA
USA
Author's Address Email: prz@juniper.net
The RIFT Team Alankar Sharma
Comcast
1800 Bishops Gate Blvd
Mount Laurel, NJ 08054
US
Email: Alankar_Sharma@comcast.com
Pascal Thubert
Cisco Systems, Inc
Building D
45 Allee des Ormes - BP1200
MOUGINS - Sophia Antipolis 06254
FRANCE
Phone: +33 497 23 26 34
Email: pthubert@cisco.com
Bruno Rijsman
Individual
Email: fl0w@yandex-team.ru
Dmitry Afanasiev
Yandex
Email: fl0w@yandex-team.ru
 End of changes. 166 change blocks. 
434 lines changed or deleted 1205 lines changed or added
