--- 1/draft-ietf-opsawg-ntf-06.txt 2021-02-19 16:13:08.924794692 -0800 +++ 2/draft-ietf-opsawg-ntf-07.txt 2021-02-19 16:13:08.972795257 -0800 @@ -1,40 +1,40 @@ OPSAWG H. Song Internet-Draft Futurewei Intended status: Informational F. Qin -Expires: July 25, 2021 China Mobile +Expires: August 23, 2021 China Mobile P. Martinez-Julia NICT L. Ciavaglia Nokia A. Wang China Telecom - January 21, 2021 + February 19, 2021 Network Telemetry Framework - draft-ietf-opsawg-ntf-06 + draft-ietf-opsawg-ntf-07 Abstract Network telemetry is a technology for gaining network insight and facilitating efficient and automated network management. It encompasses various techniques for remote data generation, collection, correlation, and consumption. This document describes an architectural framework for network telemetry, motivated by challenges that are encountered as part of the operation of networks and by the requirements that ensue. Network telemetry, as necessitated by best industry practices, covers technologies and protocols that extend beyond conventional network Operations, Administration, and Management (OAM). The presented network - telemetry framework promises better flexibility, scalability, - accuracy, coverage, and performance. In addition, it facilitates the + telemetry framework promises flexibility, scalability, accuracy, + coverage, and performance. In addition, it facilitates the implementation of automated control loops to address both today's and tomorrow's network operational needs. This document clarifies the terminologies and classifies the modules and components of a network telemetry system from several different perspectives. The framework and taxonomy help to set a common ground for the collection of related work and provide guidance for related technique and standard developments. Status of This Memo @@ -44,78 +44,78 @@ Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). 
Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on July 25, 2021. + This Internet-Draft will expire on August 23, 2021. Copyright Notice Copyright (c) 2021 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 - 2. Background . . . . . . . . . . . . . . . . . . . . . . . . . 4 - 2.1. Telemetry Data Coverage . . . . . . . . . . . . . . . . . 5 - 2.2. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 5 - 2.3. Challenges . . . . . . . . . . . . . . . . . . . . . . . 7 - 2.4. Glossary . . . . . . . . . . . . . . . . . . . . . . . . 8 - 2.5. Network Telemetry . . . . . . . . . . . . . . . . . . . . 10 - 3. The Necessity of a Network Telemetry Framework . . . . . . . 12 - 4. Network Telemetry Framework . . . . . . . . . . . . . . . . . 13 - 4.1. Top Level Modules . . . . . . . . . . . . . . . . . . . . 13 - 4.1.1. Management Plane Telemetry . . . . . . . . . . . . . 17 - 4.1.2. Control Plane Telemetry . . . . . . . . . 
. . . . . . 17 - 4.1.3. Forwarding Plane Telemetry . . . . . . . . . . . . . 18 - 4.1.4. External Data Telemetry . . . . . . . . . . . . . . . 20 - 4.2. Second Level Function Components . . . . . . . . . . . . 20 - 4.3. Data Acquiring Mechanism and Type Abstraction . . . . . . 22 - 4.4. Existing Works Mapped in the Framework . . . . . . . . . 24 - 5. Evolution of Network Telemetry . . . . . . . . . . . . . . . 26 - 6. Security Considerations . . . . . . . . . . . . . . . . . . . 26 - 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 27 - 8. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 28 - 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 28 - 10. Informative References . . . . . . . . . . . . . . . . . . . 28 + 2. Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . 4 + 3. Background . . . . . . . . . . . . . . . . . . . . . . . . . 6 + 3.1. Telemetry Data Coverage . . . . . . . . . . . . . . . . . 7 + 3.2. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 7 + 3.3. Challenges . . . . . . . . . . . . . . . . . . . . . . . 9 + 3.4. Network Telemetry . . . . . . . . . . . . . . . . . . . . 10 + 4. The Necessity of a Network Telemetry Framework . . . . . . . 12 + 5. Network Telemetry Framework . . . . . . . . . . . . . . . . . 13 + 5.1. Top Level Modules . . . . . . . . . . . . . . . . . . . . 14 + 5.1.1. Management Plane Telemetry . . . . . . . . . . . . . 17 + 5.1.2. Control Plane Telemetry . . . . . . . . . . . . . . . 17 + 5.1.3. Forwarding Plane Telemetry . . . . . . . . . . . . . 18 + 5.1.4. External Data Telemetry . . . . . . . . . . . . . . . 20 + 5.2. Second Level Function Components . . . . . . . . . . . . 21 + 5.3. Data Acquisition Mechanism and Type Abstraction . . . . . 22 + 5.4. Mapping Existing Mechanisms into the Framework . . . . . 24 + 6. Evolution of Network Telemetry Applications . . . . . . . . . 25 + 7. Security Considerations . . . . . . . . . . . . . . . . . . . 26 + 8. 
IANA Considerations . . . . . . . . . . . . . . . . . . . . . 27 + 9. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 27 + 10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 28 + 11. Informative References . . . . . . . . . . . . . . . . . . . 28 Appendix A. A Survey on Existing Network Telemetry Techniques . 32 A.1. Management Plane Telemetry . . . . . . . . . . . . . . . 32 A.1.1. Push Extensions for NETCONF . . . . . . . . . . . . . 32 - A.1.2. gRPC Network Management Interface . . . . . . . . . . 33 + A.1.2. gRPC Network Management Interface . . . . . . . . . . 32 A.2. Control Plane Telemetry . . . . . . . . . . . . . . . . . 33 A.2.1. BGP Monitoring Protocol . . . . . . . . . . . . . . . 33 - A.3. Data Plane Telemetry . . . . . . . . . . . . . . . . . . 34 - A.3.1. The Alternate Marking technology . . . . . . . . . . 34 - A.3.2. Dynamic Network Probe . . . . . . . . . . . . . . . . 35 + A.3. Data Plane Telemetry . . . . . . . . . . . . . . . . . . 33 + A.3.1. The Alternate Marking (AM) technology . . . . . . . . 33 + A.3.2. Dynamic Network Probe . . . . . . . . . . . . . . . . 34 A.3.3. IP Flow Information Export (IPFIX) protocol . . . . . 35 A.3.4. In-Situ OAM . . . . . . . . . . . . . . . . . . . . . 35 - A.3.5. Postcard Based Telemetry . . . . . . . . . . . . . . 36 - A.4. External Data and Event Telemetry . . . . . . . . . . . . 36 + A.3.5. Postcard Based Telemetry . . . . . . . . . . . . . . 35 + A.4. External Data and Event Telemetry . . . . . . . . . . . . 35 A.4.1. Sources of External Events . . . . . . . . . . . . . 36 A.4.2. Connectors and Interfaces . . . . . . . . . . . . . . 37 - Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 38 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 37 1. Introduction Network visibility is the ability of management tools to see the state and behavior of a network, which is essential for successful network operation. 
Network Telemetry revolves around network data that can help provide insights about the current state of the network, including network devices, forwarding, control, and management planes, and that can be generated and obtained through a variety of techniques, including but not limited to network @@ -158,21 +158,102 @@ maintaining, and understanding a network telemetry system. Finally, we outline the evolution stages of the network telemetry system and discuss the potential security concerns. The purpose of the framework and taxonomy is to set a common ground for the collection of related work and provide guidance for future technique and standard developments. To the best of our knowledge, this document is the first such effort for network telemetry in industry standards organizations. -2. Background +2. Glossary + + Before further discussion, we list some key terminology and acronyms + used in this document. We intentionally differentiate between + the terms network telemetry and OAM. However, it should be + understood that there is not a hard-line distinction between the two + concepts. Rather, network telemetry is considered an extension + of OAM. It covers all the existing OAM protocols but puts more + emphasis on the newer and emerging techniques and protocols + concerning all aspects of network data from acquisition to + consumption. + + AI: Artificial Intelligence. In the network domain, AI refers to the + machine-learning based technologies for automated network + operation and other tasks. + + AM: Alternate Marking, a flow performance measurement method, + specified in [RFC8321]. + + BMP: BGP Monitoring Protocol, specified in [RFC7854]. + + DNP: Dynamic Network Probe, referring to programmable in-network + sensors for network monitoring and measurement. + + DPI: Deep Packet Inspection, referring to techniques that + examine packets beyond the L3/L4 headers. 
+ + gNMI: gRPC Network Management Interface, a network management + protocol from the OpenConfig Operator Working Group, mainly + contributed by Google. See [gnmi] for details. + + gRPC: gRPC Remote Procedure Call, an open source high performance RPC + framework that gNMI is based on. See [grpc] for details. + + IPFIX: IP Flow Information Export Protocol, specified in [RFC7011]. + + IOAM: In-situ OAM, a dataplane on-path telemetry technique. + + NETCONF: Network Configuration Protocol, specified in [RFC6241]. + + NetFlow: A Cisco protocol for flow record collecting, described in + [RFC3954]. + + Network Telemetry: The process and instrumentation for acquiring and + utilizing network data remotely for network monitoring and + operation. A general term for a large set of network visibility + techniques and protocols, concerning aspects like data generation, + collection, correlation, and consumption. Network telemetry + addresses the current network operation issues and enables smooth + evolution toward future intent-driven autonomous networks. + + NMS: Network Management System, referring to applications that allow + network administrators to manage a network. + + OAM: Operations, Administration, and Maintenance. A group of + network management functions that provide network fault + indication, fault localization, performance information, and data + and diagnosis functions. Most conventional network monitoring + techniques and protocols belong to network OAM. + + PBT: Postcard-Based Telemetry, a dataplane on-path telemetry + technique. + + SMIv2: Structure of Management Information Version 2, specified in + [RFC2578]. + + SNMP: Simple Network Management Protocol. Versions 1 and 2 are + specified in [RFC1157] and [RFC3416], respectively. + + YANG: The abbreviation of "Yet Another Next Generation". YANG is a + data modeling language for the definition of data sent over + network management protocols such as NETCONF and RESTCONF. + YANG is defined in [RFC6020]. 
+ + YANG ECA A YANG model for Event-Condition-Action policies, defined + in [I-D.wwx-netmod-event-yang]. + + YANG PUSH: A method to subscribe pushed data from remote YANG + datastore on network devices. Details are specified in [RFC8641] + and [RFC8639]. + +3. Background The term "big data" is used to describe the extremely large volume of data sets that can be analyzed computationally to reveal patterns, trends, and associations. Networks are undoubtedly a source of big data because of their scale and the volume of network traffic they forward. It is easy to see that network operations can benefit from network big data. Today one can access advanced big data analytics capability through a plethora of commercial and open source platforms (e.g., Apache @@ -185,22 +266,23 @@ tools can use the network data to detect and react on network faults, anomalies, and policy violations, as well as predicting future events. In turn, the network policy updates for planning, intrusion prevention, optimization, and self-healing may be applied. It is conceivable that an autonomic network [RFC7575] is the logical next step for network evolution following Software Defined Network (SDN), aiming to reduce (or even eliminate) human labor, make more efficient use of network resources, and provide better services more aligned with customer requirements. Intent-based Networking (IBN) - [I-D.irtf-nmrg-ibn-concepts-definitions] provides the necessary - mechanisms. Although it takes time to reach the ultimate goal, the + [I-D.irtf-nmrg-ibn-concepts-definitions] requires network visibility + and telemetry data in order to ensure that the network is behaving as + intended. Although it takes time to reach the ultimate goal, the journey has started nevertheless. However, while the data processing capability is improved and applications are hungry for more data, the networks lag behind in extracting and translating network data into useful and actionable information in efficient ways. 
The system bottleneck is shifting from data consumption to data supply. Both the number of network nodes and the traffic bandwidth keep increasing at a fast pace. The network configuration and policy change at smaller time slots than before. More subtle events and fine-grained data through all network @@ -211,101 +293,105 @@ any potential gaps. In the remainder of this section, first we clarify the scope of network data (i.e., telemetry data) concerned in the context. Then, we discuss several key use cases for today's and future network operations. Next, we show why the current network OAM techniques and protocols are insufficient for these use cases. The discussion underlines the need of new methods, techniques, and protocols which we assign under the umbrella term - Network Telemetry. -2.1. Telemetry Data Coverage +3.1. Telemetry Data Coverage Any information that can be extracted from networks (including data plane, control plane, and management plane) and used to gain visibility or as basis for actions is considered telemetry data. It includes statistics, event records and logs, snapshots of state, configuration data, etc. It also covers the outputs of any active and passive measurements [RFC7799]. Specially, raw data can be processed in-network before being sent to a data consumer. Such processed data is also considered telemetry data. A classification - of telemetry data is provided in Section 4. + of telemetry data is provided in Section 5. -2.2. Use Cases +3.2. Use Cases The following set of use cases is essential for network operations. While the list is by no means exhaustive, it is enough to highlight the requirements for data velocity, variety, volume, and veracity in networks. - Security: Network intrusion detection and prevention systems need to - monitor network traffic and activities and act upon anomalies. + o Security: Network intrusion detection and prevention systems need + to monitor network traffic and activities and act upon anomalies. 
Given increasingly sophisticated attack vectors coupled with increasingly severe consequences of security breaches, new tools and techniques need to be developed, relying on wider and deeper - visibility in networks. + visibility into networks. - Policy and Intent Compliance: Network policies are the rules that + o Policy and Intent Compliance: Network policies are the rules that constrain the services for network access, provide service differentiation, or enforce specific treatment on the traffic. For example, a service function chain is a policy that requires the selected flows to pass through a set of ordered network functions. Intent, as defined in + [I-D.irtf-nmrg-ibn-concepts-definitions], is a set of operational goals that a network should meet and outcomes that a network is supposed to deliver, defined in a declarative manner without specifying how to achieve or implement them. An intent requires a complex translation and mapping process before being applied on networks. While a policy or an intent is enforced, the compliance - needs to be verified and monitored continuously, and any violation - needs to be reported immediately. + needs to be verified and monitored continuously, relying on + visibility that is provided through network telemetry data, and + any violation needs to be reported immediately. - SLA Compliance: A Service-Level Agreement (SLA) defines the level of - service a user expects from a network operator, which include the - metrics for the service measurement and remedy/penalty procedures - when the service level misses the agreement. Users need to check - if they get the service as promised and network operators need to - evaluate how they can deliver the services that can meet the SLA - based on realtime network measurement. 
+ o SLA Compliance: A Service-Level Agreement (SLA) defines the level + of service a user expects from a network operator, which includes + the metrics for the service measurement and remedy/penalty + procedures when the service level misses the agreement. Users + need to check if they get the service as promised and network + operators need to evaluate how they can deliver the services that + can meet the SLA based on realtime network telemetry data, + including data from network measurements. - Root Cause Analysis: Any network failure can be the effect of a + o Root Cause Analysis: Any network failure can be the effect of a sequence of chained events. Troubleshooting and recovery require quick identification of the root cause of any observable issues. However, the root cause is not always straightforward to identify, especially when the failure is sporadic and the number of event messages, both related and unrelated to the same cause, is overwhelming. While machine learning technologies can be used for root cause analysis, it is up to the network to sense and provide the - relevant data. + relevant data to feed into machine learning applications. - Network Optimization: This covers all short-term and long-term + o Network Optimization: This covers all short-term and long-term network optimization techniques, including load balancing, Traffic Engineering (TE), and network planning. Network operators are motivated to optimize their network utilization and differentiate services for better Return On Investment (ROI) or lower Capital Expenditures (CAPEX). The first step is to know the real-time network conditions before applying policies for traffic manipulation. In some cases, micro-bursts need to be detected in a very short time-frame so that fine-grained traffic control can be applied to avoid network congestion. Long-term planning of network capacity and topology requires analysis of real-world network telemetry data that is obtained over long periods of time. 
- Event Tracking and Prediction: The visibility of traffic path and - performance is critical for services and applications that rely on - healthy network operation. Numerous related network events are of - interest to network operators. For example, Network operators - want to learn where and why packets are dropped for an application - flow. They also want to be warned of issues in advance so - proactive actions can be taken to avoid catastrophic consequences. + o Event Tracking and Prediction: The visibility into traffic path + and performance is critical for services and applications that + rely on healthy network operation. Numerous related network + events are of interest to network operators. For example, Network + operators want to learn where and why packets are dropped for an + application flow. They also want to be warned of issues in + advance so proactive actions can be taken to avoid catastrophic + consequences. -2.3. Challenges +3.3. Challenges For a long time, network operators have relied upon SNMP [RFC3416], Command-Line Interface (CLI), or Syslog to monitor the network. Some other OAM techniques as described in [RFC7276] are also used to facilitate network troubleshooting. These conventional techniques are not sufficient to support the above use cases for the following reasons: o Most use cases need to continuously monitor the network and dynamically refine the data collection in real-time. The poll- @@ -352,163 +438,82 @@ or lead to inaccurate results; on the other hand, the conventional active measurement techniques can interfere with the user traffic and their results are indirect. Techniques that can collect direct and on-demand data from user traffic are more favorable. These challenges were addressed by newer standards and techniques (e.g., IPFIX/Netflow, PSAMP, IOAM, and YANG-Push) and more are emerging. These standards and techniques need to be recognized and accommodated in a new framework. -2.4. 
Glossary - - Before further discussion, we list some key terminology and acronyms - used in this documents. We make an intended differentiation between - the terms of network telemetry and OAM. However, it should be - understood that there is not a hard-line distinction between the two - concepts. Rather, network telemetry is considered as the extension - of OAM. It covers all the existing OAM protocols but puts more - emphasis on the newer and emerging techniques and protocols - concerning all aspects of network data from acquisition to - consumption. - - AI: Artificial Intelligence. In network domain, AI refers to the - machine-learning based technologies for automated network - operation and other tasks. - - AM: Alternate Marking, a flow performance measurement method, - specified in [RFC8321]. - - BMP: BGP Monitoring Protocol, specified in [RFC7854]. - - DNP: Dynamic Network Probe, referring to programmable in-network - sensors for network monitoring and measurement. - - DPI: Deep Packet Inspection, referring to the techniques that - examines packet beyond packet L3/L4 headers. - - gNMI: gRPC Network Management Interface, a network management - protocol from OpenConfig Operator Working Group, mainly - contributed by Google. See [gnmi] for details. - - gRPC: gRPC Remote Procedure Call, a open source high performance RPC - framework that gNMI is based on. See [grpc] for details. - - IPFIX: IP Flow Information Export Protocol, specified in [RFC7011]. - - IOAM: In-situ OAM, a dataplane on-path telemetry technique. - - NETCONF: Network Configuration Protocol, specified in [RFC6241]. - - NetFlow: A Cisco protocol for flow record collecting, described in - [RFC3594]. - - Network Telemetry: The process and instrumentation for acquiring and - utilizing network data remotely for network monitoring and - operation. 
A general term for a large set of network visibility - techniques and protocols, concerning aspects like data generation, - collection, correlation, and consumption. Network telemetry - addresses the current network operation issues and enables smooth - evolution toward future intent-driven autonomous networks. - - NMS: Network Management System, referring to applications that allow - network administrators manage a network. - - OAM: Operations, Administration, and Maintenance. A group of - network management functions that provide network fault - indication, fault localization, performance information, and data - and diagnosis functions. Most conventional network monitoring - techniques and protocols belong to network OAM. - - PBT: Postcard-Based Telemetry, a dataplane on-path telemetry - technique. - - SMIv2 Structure of Management Information Version 2, specified in - [RFC2578]. - - SNMP: Simple Network Management Protocol. Version 1 and 2 are - specified in [RFC1157] and [RFC3416], respectively. - - YANG: The abbreviation of "Yet Another Next Generation". YANG is a - data modeling language for the definition of data sent over - network management protocols such as the NETCONF and RESTCONF. - YANG is defined in [RFC6020]. - - YANG ECA A YANG model for Event-Condition-Action policies, defined - in [I-D.wwx-netmod-event-yang]. - - YANG FSM: A YANG model that describes events, operations, and finite - state machine of YANG-defined network elements. - - YANG PUSH: A method to subscribe pushed data from remote YANG - datastore on network devices. Details are specified in [RFC8641] - and [RFC8639]. - -2.5. Network Telemetry +3.4. Network Telemetry Network telemetry has emerged as a mainstream technical term to refer to the network data collection and consumption techniques. Several network telemetry techniques and protocols (e.g., IPFIX [RFC7011] and - gPRC [grpc]) have been widely deployed. Network telemetry allows + gRPC [grpc]) have been widely deployed. 
Network telemetry allows separate entities to acquire data from network devices so that data can be visualized and analyzed to support network monitoring and operation. Network telemetry covers the conventional network OAM and has a wider scope. It is expected that network telemetry can provide the necessary network insight for autonomous networks and address the shortcomings of conventional OAM techniques. - Network telemetry usually assumes machines as data consumer rather + Network telemetry usually assumes machines as data consumers rather than human operators. Hence, the network telemetry can directly trigger the automated network operation, while in contrast some conventional OAM tools are designed and used to help human operators to monitor and diagnose the networks and guide manual network operations. Such a proposition leads to very different techniques. Although new network telemetry techniques are emerging and subject to continuous evolution, several characteristics of network telemetry have been well accepted. Note that network telemetry is intended to be an umbrella term covering a wide spectrum of techniques, so the following characteristics are not expected to be held by every specific technique. o Push and Streaming: Instead of polling data from network devices, telemetry collectors subscribe to streaming data pushed from data sources in network devices. o Volume and Velocity: The telemetry data is intended to be consumed by machines rather than by human being. Therefore, the data - volume is huge and the processing is often in realtime. + volume can be huge and the processing is optimized for the needs + of automation in realtime. o Normalization and Unification: Telemetry aims to address the overall network automation needs. 
Efforts are made to normalize the data representation and unify the protocols, so to simplify - data analysis and tying it all in with automation solutions + data analysis and provide integrated analysis across heterogeneous + devices and data sources across a network. o Model-based: The telemetry data is modeled in advance which allows applications to configure and consume data with ease. o Data Fusion: The data for a single application can come from multiple data sources (e.g., cross-domain, cross-device, and cross-layer) and needs to be correlated to take effect. o Dynamic and Interactive: Since the network telemetry means to be used in a closed control loop for network automation, it needs to run continuously and adapt to the dynamic and interactive queries from the network operation controller. In addition, an ideal network telemetry solution may also have the following features or properties: - o In-Network Customization: The data can be customized in network at - run-time to cater to the specific need of applications. This - needs the support of a programmable data plane which allows probes - with custom functions to be deployed at flexible locations. + o In-Network Customization: The data that is generated can be + customized in network at run-time to cater to the specific need of + applications. This needs the support of a programmable data plane + which allows probes with custom functions to be deployed at + flexible locations. o In-Network Data Aggregation and Correlation: Network devices and aggregation points can work out which events and what data needs to be stored, reported, or discarded thus reducing the load on the central collection and processing points while still ensuring that the right information is ready to be processed in a timely way. o In-Network Processing: Sometimes it is not necessary or feasible to gather all information to a central point to be processed and acted upon. 
It is possible for the data processing to be done in @@ -517,67 +522,68 @@ o Direct Data Plane Export: The data originated from the data plane forwarding chips can be directly exported to the data consumer for efficiency, especially when the data bandwidth is large and the real-time processing is required. o In-band Data Collection: In addition to the passive and active data collection approaches, the new hybrid approach allows to directly collect data for any target flow on its entire forwarding path [I-D.song-opsawg-ifit-framework]. - It is worth noting that, a network telemetry system should not be - intrusive to normal network operations, by avoiding the pitfall of - the "observer effect". That is, it should not change the network + It is worth noting that a network telemetry system should not be + intrusive to normal network operations by avoiding the pitfall of the + "observer effect". That is, it should not change the network behavior and affect the forwarding performance. Otherwise, the whole - purpose of network telemetry is defied. + purpose of network telemetry is compromised. Although in many cases a system for network telemetry involves a remote data collecting and consuming entity, it is important to understand that there are no inherent assumptions about how a system should be architected. Telemetry data producers and consumers can work in distributed or peer-to-peer fashions rather than assuming a centralized data consuming entity. In such cases, a network node can be the direct consumer of telemetry data from other nodes. -3. The Necessity of a Network Telemetry Framework +4. The Necessity of a Network Telemetry Framework Network data analytics and machine-learning technologies are applied for network operation automation, relying on abundant and coherent data from networks. Data acquisition that is limited to a single source and static in nature will in many cases not be sufficient to meet an application's telemetry data needs. 
As a result, multiple data sources, involving a variety of techniques and standards, will need to be integrated. It is desirable to have a framework that classifies and organizes different telemetry data sources and types, defines different components of a network telemetry system and their interactions, and helps coordinate and integrate multiple telemetry approaches across layers. This allows flexible combinations of data for different applications, while normalizing and simplifying interfaces. In detail, such a framework would benefit application development for the following reasons: o Future networks, autonomous or otherwise, depend on holistic and comprehensive network visibility. All the use cases and applications are better supported uniformly and coherently - under a single intelligent agent. Therefore, the protocols and - mechanisms should be consolidated into a minimum yet comprehensive - set. A telemetry framework can help to normalize the technique - developments. + under a single intelligent agent using an integrated, converged + mechanism and common telemetry data representations wherever + feasible. Therefore, the protocols and mechanisms should be + consolidated into a minimal yet comprehensive set. A telemetry + framework can help to normalize the technique developments. o Network visibility presents multiple viewpoints. For example, the device viewpoint takes the network infrastructure as the monitoring object from which the network topology and device status can be acquired; the traffic viewpoint takes the flows or packets as the monitoring object from which the traffic quality and path can be acquired. An application may need to switch its viewpoint during operation. It may also need to correlate a - service and its impact on network experience to acquire the + service and its impact on user experience to acquire the comprehensive information. 
o Applications require network telemetry to be elastic in order to make efficient use of network resources and reduce the impact of processing related to network telemetry on network performance. For example, routine network monitoring should cover the entire network with a low data sampling rate. Only when issues arise or critical trends emerge should telemetry data source be modified and telemetry data rates boosted as needed. @@ -587,38 +593,38 @@ A telemetry framework collects together all of the telemetry-related works from different sources and working groups within IETF. This makes it possible to assemble a comprehensive network telemetry system and to avoid repetitious or redundant work. The framework should cover the concepts and components from the standardization perspective. This document describes the modules which make up a network telemetry framework and decomposes the telemetry system into a set of distinct components that existing and future work can easily map to. -4. Network Telemetry Framework +5. Network Telemetry Framework The top level network telemetry framework partitions the network telemetry into four modules based on the telemetry data object source and represents their relationship. At the next level, the framework decomposes each module into separate components. Each of the modules follows the same underlying structure, with one component dedicated to the configuration of data subscriptions and data sources, a second component dedicated to encoding and exporting data, and a third component instrumenting the generation of telemetry related to the underlying resources. Throughout the framework, the same set of abstract data acquiring mechanisms and data types are applied. The two-level architecture with the uniform data abstraction helps accurately pinpoint a protocol or technique to its position in a network telemetry system or disaggregate a network telemetry system into manageable parts. -4.1. Top Level Modules +5.1. 
Top Level Modules Telemetry can be applied on the forwarding plane, the control plane, and the management plane in a network, as well as other sources out of the network, as shown in Figure 1. Therefore, we categorize the network telemetry into four distinct modules with each having its own interface to Network Operation Applications. +------------------------------+ | | | Network Operation |<-------+ @@ -641,56 +647,55 @@ | Telemetry | | | | | +---------------+--------------+ Figure 1: Modules in Layer Category of NTF The rationale of this partition lies in the different telemetry data objects which result in different data source and export locations. Such differences have profound implications on in-network data programming and processing capability, data encoding and transport - protocol, and data bandwidth and latency. + protocol, and required data bandwidth and latency. We summarize the major differences of the four modules in the - following table. They are compared from six aspects: + following table. They are compared from six angles: o Data Object o Data Export Location o Data Model - o Data Encoding o Telemetry Protocol o Transport Method - Data object is the target and source of each module. Because the - data source varies, the data export location varies. For example, - the forwarding plane data are mainly from the fast path(e.g., - forwarding chips) while the control plane data are mainly from the - slow path (e.g., main control CPU). For convenience and efficiency, - it is preferred to export the data from locations near the source. - Because each data export location has different capability, the - proper data model, encoding, and transport method cannot be kept the - same. For example, the forwarding chip has high throughput but - limited capacity for processing complex data and maintaining states, - while the main control CPU is capable of complex data and state - processing, but has limited bandwidth for high throughput data. 
As a - result, the suitable telemetry protocol for each module can be - different. Some representative techniques are shown in the - corresponding table blocks to highlight the technical diversity of - these modules. Note that the selected techniques just reflect the - de-facto state of the art and are not exhaustive. The key point is - that one cannot expect to use a universal protocol to cover all the - network telemetry requirements. + Data Object is the target and source of each module. Because the + data source varies, the location where data is mostly conveniently + exported also varies. For example, forwarding plane data mainly + originates from the fast path(e.g., forwarding chips) while control + plane data mainly originates from the slow path (e.g., main control + CPU). For convenience and efficiency, it is preferred to export the + data from locations near the source. Because each location that can + export data has different capability, the proper data model, + encoding, and transport method cannot be kept the same. For example, + the forwarding chip has high throughput but limited capacity for + processing complex data and maintaining states, while the main + control CPU is capable of complex data and state processing, but has + limited bandwidth for high throughput data. As a result, the + suitable telemetry protocol for each module can be different. Some + representative techniques are shown in the corresponding table blocks + to highlight the technical diversity of these modules. Note that the + selected techniques just reflect the de-facto state of the art and + are not exhaustive. The key point is that one cannot expect to use a + universal protocol to cover all the network telemetry requirements. 
+---------+--------------+--------------+--------------+-----------+ | Module | Control | Management | Forwarding | External | | | Plane | Plane | Plane | Data | +---------+--------------+--------------+--------------+-----------+ |Object | control | config. & | flow & packet| terminal, | | | protocol & | operation | QoS, traffic | social & | | | signaling, | state, MIB | stat., buffer| environ- | | | RIB, ACL | | & queue stat.| mental | +---------+--------------+--------------+--------------+-----------+ @@ -700,114 +705,119 @@ | | or fwding | | control CPU | | | | chip | | unlikely | | +---------+--------------+--------------+--------------+-----------+ |Data | YANG, | MIB, syslog, | template, | YANG | |Model | custom | YANG, | YANG, | | | | | custom | custom | | +---------+--------------+--------------+--------------+-----------+ |Data | GPB, JSON, | GPB, JSON, | plain | GPB, JSON | |Encoding | XML, plain | XML | | XML, plain| +---------+--------------+--------------+--------------+-----------+ - |Protocol | gRPC,NETCONF,| gPRC,NETCONF,| IPFIX, mirror| gRPC | + |Protocol | gRPC,NETCONF,| gRPC,NETCONF,| IPFIX, mirror| gRPC | | | IPFIX,mirror | | | | +---------+--------------+--------------+--------------+-----------+ |Transport| HTTP, TCP, | HTTP, TCP | UDP | HTTP,TCP | | | UDP | | | UDP | +---------+--------------+--------------+--------------+-----------+ Figure 2: Comparison of the Data Object Modules - Note that the interaction with the network operation applications can - be indirect. Some in-device data transfer is possible. For example, - in the management plane telemetry, the management plane may need to - acquire data from the data plane. Some of the operational states can - only be derived from the data plane such as the interface status and - statistics. For another example, the control plane telemetry may - need to access the Forwarding Information Base (FIB) in data plane. 
+ Note that the interaction with the applications that consume network + telemetry data can be indirect. Some in-device data transfer is + possible. For example, in the management plane telemetry, the + management plane may need to acquire data from the data plane. Some + of the operational states can only be derived from data plane data + sources such as the interface status and statistics. For another + example, obtaining control plane telemetry data may require the + ability to access the Forwarding Information Base (FIB) of the data + plane. On the other hand, an application may involve more than one plane and interact with multiple planes simultaneously. For example, an SLA compliance application may require both the data plane telemetry and the control plane telemetry. The requirements and challenges for each module are summarized as - follows. + follows (note that the requirements may pertain across all telemetry + modules; however, we emphasize those that are most pronounced for a + particular plane). -4.1.1. Management Plane Telemetry +5.1.1. Management Plane Telemetry The management plane of network elements interacts with the Network Management System (NMS), and provides information such as performance data, network logging data, network warning and defects data, and network statistics and state data. The management plane includes many protocols, including some that are considered "legacy", such as SNMP and syslog. Regardless of the protocol, management plane telemetry must address the following requirements: - Convenient Data Subscription: An application should have the freedom - to choose the data export means such as the data types and the - export frequency. + o Convenient Data Subscription: An application should have the + freedom to choose the data export means such as the data types and + the export frequency.
- Structured Data: For automatic network operation, machines will + o Structured Data: For automatic network operation, machines will replace human for network data comprehension. The schema languages such as YANG can efficiently describe structured data and normalize data encoding and transformation. - High Speed Data Transport: In order to keep up with the velocity of - information, a server needs to be able to send large amounts of + o High Speed Data Transport: In order to keep up with the velocity + of information, a server needs to be able to send large amounts of data at high frequency. Compact encoding formats are needed to compress the data and improve the data transport efficiency. The subscription mode, by replacing the query mode, reduces the interactions between clients and servers and helps to improve the server's efficiency. -4.1.2. Control Plane Telemetry +5.1.2. Control Plane Telemetry The control plane telemetry refers to the health condition monitoring of different network control protocols covering Layer 2 to Layer 7. Keeping track of the running status of these protocols is beneficial for detecting, localizing, and even predicting various network issues, as well as network optimization, in real-time and in fine - granularity. + granularity. Some particular challenges and issues faced by the + control plane telemetry are as follows: - One of the most challenging problems for the control plane telemetry - is how to correlate the End-to-End (E2E) Key Performance Indicators - (KPI) to a specific layer's KPIs. For example, an IPTV user may - describe his User Experience (UE) by the video fluency and - definition. Then in case of an unusually poor UE KPI or a service - disconnection, it is non-trivial to delimit and pinpoint the issue in - the responsible protocol layer (e.g., the Transport Layer or the - Network Layer), the responsible protocol (e.g., ISIS or BGP at the - Network Layer), and finally the responsible device(s) with specific - reasons. 
+ o One challenging problem for the control plane telemetry is how to + correlate the End-to-End (E2E) Key Performance Indicators (KPI) to + a specific layer's KPIs. For example, an IPTV user may describe + his User Experience (UE) by the video fluency and definition. + Then in case of an unusually poor UE KPI or a service + disconnection, it is non-trivial to delimit and pinpoint the issue + in the responsible protocol layer (e.g., the Transport Layer or + the Network Layer), the responsible protocol (e.g., ISIS or BGP at + the Network Layer), and finally the responsible device(s) with + specific reasons. - Traditional OAM-based approaches for control plane KPI measurement - include PING (L3), Tracert (L3), Y.1731 (L2), and so on. One common - issue behind these methods is that they only measure the KPIs instead - of reflecting the actual running status of these protocols, making - them less effective or efficient for control plane troubleshooting - and network optimization. + o Traditional OAM-based approaches for control plane KPI measurement + include PING (L3), Tracert (L3), Y.1731 (L2), and so on. One + common issue behind these methods is that they only measure the + KPIs instead of reflecting the actual running status of these + protocols, making them less effective or efficient for control + plane troubleshooting and network optimization. - An example of the control plane telemetry is the BGP monitoring - protocol (BMP), it is currently used to monitoring the BGP routes and - enables rich applications, such as BGP peer analysis, AS analysis, - prefix analysis, security analysis, and so on. However, the - monitoring of other layers, protocols and the cross-layer, cross- - protocol KPI correlations are still in their infancy (e.g., the IGP - monitoring is missing), which require further research. 
+ o An example of the control plane telemetry is the BGP monitoring + protocol (BMP), it is currently used to monitoring the BGP routes + and enables rich applications, such as BGP peer analysis, AS + analysis, prefix analysis, security analysis, and so on. However, + the monitoring of other layers, protocols and the cross-layer, + cross-protocol KPI correlations are still in their infancy (e.g., + the IGP monitoring is missing), which require further research. -4.1.3. Forwarding Plane Telemetry +5.1.3. Forwarding Plane Telemetry An effective forwarding plane telemetry system relies on the data that the network device can expose. The quality, quantity, and timeliness of data must meet some stringent requirements. This raises some challenges to the network data plane devices where the - first hand data originate. + first hand data originates. o A data plane device's main function is user traffic processing and forwarding. While supporting network visibility is important, the telemetry is just an auxiliary function, and it should not impede normal traffic processing and forwarding (i.e., the performance is not lowered and the behavior is not altered due to the telemetry functions). o Network operation applications require end-to-end visibility across various sources, which can result in a huge volume of data. @@ -831,62 +841,62 @@ work even though some devices are unaware of the system. This challenge is highly relevant to the standards and legacy networks. Although not specific to the forwarding plane, these challenges are more difficult to the forwarding plane because of the limited resource and flexibility. The data plane programmability is essential to support network telemetry. Newer data plane forwarding chips are equipped with advanced telemetry features and provide flexibility to support customized telemetry functions. -4.1.3.1. Technique Taxonomy - - There can be multiple possible dimensions to classify the forwarding - plane telemetry techniques. 
+ Technique Taxonomy: concerning about how one instruments the + telemetry, there can be multiple possible dimensions to classify the + forwarding plane telemetry techniques. - Active, Passive, and Hybrid: Active and passive methods (as well as + o Active, Passive, and Hybrid: This dimension concerns about the + end-to-end measurement. Active and passive methods (as well as the hybrid types) are well documented in [RFC7799]. Passive methods include TCPDUMP, IPFIX [RFC7011], sflow, and traffic mirroring. These methods usually have low data coverage. The bandwidth cost is very high in order to improve the data coverage. On the other hand, active methods include Ping, OWAMP [RFC4656], TWAMP [RFC5357], and Cisco's SLA Protocol [RFC6812]. These methods are intrusive and only provide indirect network measurement results. Hybrid methods, including in-situ OAM - [I-D.ietf-ippm-ioam-data], IPFPM [RFC8321], and Multipoint - Alternate Marking [I-D.fioccola-ippm-multipoint-alt-mark], provide - a well-balanced and more flexible approach. However, these - methods are also more complex to implement. + [I-D.ietf-ippm-ioam-data], Alternate-Marking (AM) [RFC8321], and + Multipoint Alternate Marking + [I-D.fioccola-ippm-multipoint-alt-mark], provide a well-balanced + and more flexible approach. However, these methods are also more + complex to implement. - In-Band and Out-of-Band: The telemetry data, before being exported + o In-Band and Out-of-Band: The telemetry data, before being exported to some collector, can be carried in user packets. Such methods are considered in-band (e.g., in-situ OAM [I-D.ietf-ippm-ioam-data]). If the telemetry data is directly exported to some collector without modifying the user packets, such methods are considered out-of-band (e.g., postcard-based INT). It is possible to have hybrid methods. For example, only the telemetry instruction or partial data is carried by user - packets (e.g., IPFPM [RFC8321]). + packets (e.g., AM [RFC8321]). 
- E2E and In-Network: Some E2E methods start from and end at the + o E2E and In-Network: Some E2E methods start from and end at the network end hosts (e.g., Ping). The other methods work in networks and are transparent to end hosts. However, if needed, in-network methods can be easily extended into end hosts. - Information Type: Depending on the telemetry objective, the methods + o Data Subject: Depending on the telemetry objective, the methods can be flow-based (e.g., in-situ OAM [I-D.ietf-ippm-ioam-data]), path-based (e.g., Traceroute), and node-based (e.g., IPFIX - [RFC7011]). The various data objects can be packet, flow record, measurement, states, and signal. -4.1.4. External Data Telemetry +5.1.4. External Data Telemetry Events that occur outside the boundaries of the network system are another important source of network telemetry. Correlating both internal telemetry data and external events with the requirements of network systems, as presented in [I-D.pedro-nmrg-anticipated-adaptation], provides a strategic and functional advantage to management operations. As with other sources of telemetry information, the data and events must meet strict requirements, especially in terms of timeliness, @@ -912,61 +922,61 @@ current and future devices and applications. Therefore, it must be easily mapped to current information models, such as in terms of YANG. Organizing together both internal and external telemetry information will be key for the general exploitation of the management possibilities of current and future network systems, as reflected in the incorporation of cognitive capabilities to new hardware and software (virtual) elements. -4.2. Second Level Function Components +5.2. 
Second Level Function Components Reflecting the best current practice, the telemetry module at each plane is further partitioned into five distinct components: - Data Query, Analysis, and Storage: This component works at the + o Data Query, Analysis, and Storage: This component works at the application layer. It is a part of the network management system at the receiver side. On the one hand, it is responsible for issuing data requirements. The data of interest can be modeled data through configuration or custom data through programming. The data requirements can be queries for one-shot data or subscriptions for events or streaming data. On the other hand, it receives, stores, and processes the returned data from network devices. Data analysis can be interactive to initiate further data queries. This component can reside in either network devices or remote controllers. It can be centralized and distributed, and involve one or more instances. - Data Configuration and Subscription: This component deploys data + o Data Configuration and Subscription: This component deploys data queries on devices. It determines the protocol and channel for applications to acquire desired data. This component is also responsible for configuring the desired data that might not be directly available from data sources. The subscription data can be described by models, templates, or programs. - Data Encoding and Export: This component determines how telemetry - data are delivered to the data analysis and storage component. - The data encoding and the transport protocol may vary due to the - data exporting location. + o Data Encoding and Export: This component determines how telemetry + data is delivered to the data analysis and storage component. The + data encoding and the transport protocol may vary due to the data + exporting location.
- Data Generation and Processing: The requested data needs to be + o Data Generation and Processing: The requested data needs to be captured, processed, and formatted in network devices from raw data sources. This may involve in-network computing and processing on either the fast path or the slow path in network devices. - Data Object and Source: This component determines the monitoring + o Data Object and Source: This component determines the monitoring object and original data source. The data source usually just provides raw data which needs further processing. A data source - can be considered a probe. A probe can be statically installed or - dynamically installed. + can be considered a probe. Some data sources can be dynamically + installed, while others will be more static. +----------------------------------------+ +----------------------------------------+ | | | | | Data Query, Analysis, & Storage | | | | + +-------+++ -----------------------------+ ||| ^^^ ||| ||| ||V ||| @@ -983,59 +993,62 @@ | & Processing | | | | | | | +----------------------------------------| | | | | | | | Data Object and Source | |-+ | |-+ +----------------------------------------+ Figure 3: Components in the Network Telemetry Framework -4.3. Data Acquiring Mechanism and Type Abstraction +5.3. Data Acquisition Mechanism and Type Abstraction Broadly speaking, network data can be acquired through subscription - (push) and query (poll). Subscription is a contract between + (push) and query (poll). A subscription is a contract between publisher and subscriber. After initial setup, the subscribed data is automatically delivered to registered subscribers until the subscription expires. Subscription can be partitioned into two sub modes: the Publish-Subscription (Pub-Sub) mode and the Subscription- Publish (Sub-Pub) mode. In the Pub-Sub mode, a publisher publishes pre-defined data and any qualified subscribers can subscribe the data as-is. 
In the Sub-Pub mode, a subscriber initiates a data request and sends it to a publisher; the publisher will deliver the requested - data when available. + data when available. While for both modes, the subscribed data is + pushed to the subscriber, the Sub-Pub mode allows subscribers to + customize their subscriptions. In contrast, query is used when a querier expects immediate and one-off feedback from network devices. The queried data may be directly extracted from some specific data source, or synthesized and processed from raw data. Query suits interactive network telemetry applications. - There are four types of data from network devices: + There are four types of data from network devices that a telemetry + data consumer can subscribe or query: - Simple Data: The data that are steadily available from some data + o Simple Data: The data that are steadily available from some data store or static probes in network devices. Such data can be specified by a YANG model. - Complex Data: The data need to be synthesized or processed in + o Complex Data: The data need to be synthesized or processed in network from raw data from one or more network devices. The data processing function can be statically or dynamically loaded into network devices. - Event-triggered Data: The data are conditionally acquired based on + o Event-triggered Data: The data are conditionally acquired based on the occurrence of some events. It can be actively pushed through subscription or passively polled through query. There are many ways to model events, including using Finite State Machine (FSM) or Event Condition Action (ECA) [I-D.wwx-netmod-event-yang]. - Streaming Data: The data are continuously generated. It can be time - series or the dump of databases. The streaming data reflect + o Streaming Data: The data are continuously generated. It can be + time series or the dump of databases.
The streaming data reflect realtime network states and metrics and require large bandwidth and processing power. The streaming data are always actively pushed to the subscribers. The above data types are not mutually exclusive. Rather, they often overlap. For example, event-triggered data can be simple or complex, and streaming data can be simple, complex, or triggered by events. The relationships of these data types are illustrated in Figure 4. +--------------+ @@ -1055,77 +1068,76 @@ Figure 4: Data Type Relationship Subscription usually deals with event-triggered data and streaming data, and query usually deals with simple data and complex data. But the other ways are also possible. The conventional OAM techniques are mostly about querying simple data. While these techniques are still useful, more advanced network telemetry techniques are designed mainly for event-triggered or streaming data subscription, and complex data query. -4.4. Existing Works Mapped in the Framework +5.4. Mapping Existing Mechanisms into the Framework - The following two tables provide a non-exhaustive list of existing - works (mainly published in IETF and with the emphasis on the latest - new technologies) and shows their positions in the framework. More - details can be found in Appendix A. + The following two tables show how the existing mechanisms (mainly + published in IETF and with the emphasis on the latest new + technologies) are positioned in the framework. Given the vast body + of existing work, we cannot provide an exhaustive list, so the + mechanisms in the tables should be considered as just examples. + Also, some comprehensive protocols and techniques may cover multiple + aspects or modules of the framework, so a name in a block only + emphasizes one particular characteristic of it. More details about + some listed mechanisms can be found in Appendix A. 
- The first table is based on the data acquiring mechanisms and data + The first table is based on the data acquisition mechanisms and data types. - +-----------------+---------------+----------------+ + +----------------------+-----------+--------------+ | | Query | Subscription | - | | | | - +-----------------+---------------+----------------+ - | Simple Data | SNMP, NETCONF,| SNMP, NETCONF | - | | YANG, BMP, | YANG, gRPC | - | | SMIv2, gRPC | | - +-----------------+---------------+----------------+ - | Complex Data | DNP, YANG FSM | DNP, YANG PUSH | - | | gRPC, NETCONF | gPRC, NETCONF | - +-----------------+---------------+----------------+ - | Event-triggered | DNP, NETCONF, | gRPC, NETCONF, | - | Data | YANG FSM | YANG PUSH, DNP | - | | | YANG FSM | - +-----------------+---------------+----------------+ - | Streaming Data | | gRPC, NETCONF, | - | | N/A | IOAM, PBT, DNP | - | | | IPFIX, IPFPM | - +-----------------+---------------+----------------+ + +----------------------+-----------+--------------+ + | Simple Data | SNMP | YANG | + +----------------------+-----------+--------------+ + | Complex Data | DNP | YANG PUSH | + +----------------------+-----------+--------------+ + | Event-triggered Data | DNP | YANG PUSH | + +----------------------+-----------+--------------+ + | Streaming Data | N/A | gRPC | + +----------------------+-----------+--------------+ Figure 5: Existing Work Mapping I The second table is based on the telemetry modules and components. +-------------+-----------------+---------------+--------------+ | | Management | Control | Forwarding | | | Plane | Plane | Plane | +-------------+-----------------+---------------+--------------+ | data config.| gRPC, NETCONF, | NETCONF/YANG | NETCONF/YANG,| | & subscribe | SMIv2,YANG PUSH | YANG PUSH | YANG PUSH | +-------------+-----------------+---------------+--------------+ | data gen. 
& | DNP, | DNP, | IOAM, PSAMP | - | process | YANG | YANG | PBT, IPFPM, | + | process | YANG | YANG | PBT, AM, | | | | | DNP | +-------------+-----------------+---------------+--------------+ | data | gRPC, NETCONF | BMP, NETCONF | IPFIX | | export | YANG PUSH | | | +-------------+-----------------+---------------+--------------+ Figure 6: Existing Work Mapping II -5. Evolution of Network Telemetry +6. Evolution of Network Telemetry Applications Network telemetry is a fast evolving technical area. As the network - moves towards the automated operation, network telemetry undergoes - several stages of evolution. Each stage is built upon the techniques - enabled by previous stages. + moves towards the automated operation, network telemetry applications + undergo several stages of evolution which add new layer of + requirements to the underlying network telemetry techniques. Each + stage is built upon the techniques adopted by the previous stages + plus some new requirements. Stage 0 - Static Telemetry: The telemetry data source and type are determined at design time. The network operator can only configure how to use it with limited flexibility. Stage 1 - Dynamic Telemetry: The custom telemetry data can be dynamically programmed or configured at runtime without interrupting the network operation, allowing a tradeoff among resource, performance, flexibility, and coverage. DNP is an effort towards this direction. @@ -1136,28 +1148,28 @@ Compared with Stage 1, the changes are frequent based on the real- time feedback. At this stage, some tasks can be automated, but human operators still need to sit in the middle to make decisions. Stage 3 - Closed-loop Telemetry: The telemetry is free from the interference of human operators, except for generating the reports. The intelligent network operation engine automatically issues the telemetry data requests, analyzes the data, and updates the network operations in closed control loops. 
- The most of the existing technologies belong to stage 0 and stage 1. - Individual stage 2 and stage 3 applications are also possible now. - However, the future autonomic networks may need a comprehensive - operation management system which relies on stage 2 and stage 3 - telemetry to cover all the network operation tasks. A well-defined - network telemetry framework is the first step towards this direction. + Existing technologies are ready for stage 0 and stage 1. Individual + stage 2 and stage 3 applications are also possible now. However, the + future autonomic networks may need a comprehensive operation + management system which works at stage 2 and stage 3 to cover all the + network operation tasks. A well-defined network telemetry framework + is the first step towards this direction. -6. Security Considerations +7. Security Considerations The complexity of network telemetry raises significant security implications. For example, telemetry data can be manipulated to exhaust various network resources at each plane as well as the data consumer; falsified or tampered data can mislead the decision making and paralyze networks; wrong configuration and programming for telemetry is equally harmful. Given that this document has proposed a framework for network telemetry and the telemetry mechanisms discussed are more extensive @@ -1195,25 +1207,25 @@ Based Access Control and Event-Condition-Action policies. Also, potential conflicts between network telemetry mechanisms must be detected accurately and resolved quickly to avoid unnecessary network telemetry traffic propagation escalating into an unintended or intended denial of service attack. Further study of the security issues will be required, and it is expected that the security mechanisms and protocols are developed and deployed along with a network telemetry system. -7. IANA Considerations +8. IANA Considerations This document includes no request to IANA. -8. Contributors +9.
Contributors The other contributors of this document are listed as follows. o Tianran Zhou o Zhenbin Li o Zhenqiang Li o Daniel King @@ -1210,33 +1222,32 @@ The other contributors of this document are listed as follows. o Tianran Zhou o Zhenbin Li o Zhenqiang Li o Daniel King - o Adrian Farrel o Alexander Clemm -9. Acknowledgments +10. Acknowledgments We would like to thank Greg Mirsky, Randy Presuhn, Joe Clarke, Victor Liu, James Guichard, Uri Blumenthal, Giuseppe Fioccola, Yunan Gu, Parviz Yegani, Young Lee, Qin Wu, and many others who have provided helpful comments and suggestions to improve this document. -10. Informative References +11. Informative References [gnmi] "gNMI - gRPC Network Management Interface", . [grpc] "gRPC, A high performance, open-source universal RPC framework", . [I-D.fioccola-ippm-multipoint-alt-mark] Fioccola, G., Cociglio, M., Sapio, A., and R. Sisto, @@ -1246,22 +1257,22 @@ [I-D.ietf-grow-bmp-adj-rib-out] Evens, T., Bayraktar, S., Lucente, P., Mi, K., and S. Zhuang, "Support for Adj-RIB-Out in BGP Monitoring Protocol (BMP)", draft-ietf-grow-bmp-adj-rib-out-07 (work in progress), August 2019. [I-D.ietf-grow-bmp-local-rib] Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente, "Support for Local RIB in BGP Monitoring Protocol (BMP)", - draft-ietf-grow-bmp-local-rib-08 (work in progress), - November 2020. + draft-ietf-grow-bmp-local-rib-09 (work in progress), + January 2021. [I-D.ietf-ippm-ioam-data] Brockners, F., Bhandari, S., and T. Mizrahi, "Data Fields for In-situ OAM", draft-ietf-ippm-ioam-data-11 (work in progress), November 2020. [I-D.ietf-netconf-distributed-notif] Zhou, T., Zheng, G., Voit, E., Graf, T., and P.
Francois, "Subscription to Distributed Notifications", draft-ietf- netconf-distributed-notif-01 (work in progress), November @@ -1479,21 +1489,21 @@ [I-D.ietf-grow-bmp-local-rib] are encapsulated in the BMP Route Monitoring Message and the BMP Route Mirroring Message, in the form of both initial table dump and real-time route update. In addition, BGP statistics are reported through the BMP Stats Report Message, which could be either timer triggered or event-driven. More BMP extensions can be explored to enrich the applications of BGP monitoring. A.3. Data Plane Telemetry -A.3.1. The Alternate Marking technology +A.3.1. The Alternate Marking (AM) technology The Alternate Marking method is efficient to perform packet loss, delay, and jitter measurements both in an IP and Overlay Networks, as presented in [RFC8321] and [I-D.fioccola-ippm-multipoint-alt-mark]. This technique can be applied to point-to-point and multipoint-to- multipoint flows. Alternate Marking creates batches of packets by alternating the value of 1 bit (or a label) of the packet header. These batches of packets are unambiguously recognized over the network and the comparison of packet counters for each batch allows @@ -1524,21 +1534,21 @@ calibrate how deep can be obtained monitoring data from the network by configuring measurement points roughly or meticulously. Using Alternate Marking, it is possible to monitor a Multipoint Network without examining in depth by using the Network Clustering (subnetworks that are portions of the entire network that preserve the same property of the entire network, called clusters). So in case there is packet loss or the delay is too high the filtering criteria could be specified more in order to perform a detailed analysis by using a different combination of clusters up to a per- - flow measurement as described in IPFPM [RFC8321]. + flow measurement as described in Alternate-Marking (AM) [RFC8321]. 
In summary, an application can configure end-to-end network monitoring. If the network does not experience issues, this approximate monitoring is good enough and is very cheap in terms of network resources. However, in case of problems, the application becomes aware of the issues from this approximate monitoring and, in order to localize the portion of the network that has issues, configures the measurement points more exhaustively. So a new detailed monitoring is performed. After the detection and resolution of the problem, the initial approximate monitoring can be used again.
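The batch-counter comparison behind Alternate Marking can be illustrated with a small sketch. The code below is not taken from [RFC8321]; it is a minimal Python illustration under simplifying assumptions: packets are not reordered, no batch is lost entirely, and the marking period is chosen arbitrarily. The function names (mark, count_batches, loss_per_batch) are hypothetical.

```python
from typing import List, Tuple

Packet = Tuple[float, int]  # (send time, marking bit)


def mark(send_times: List[float], period: float) -> List[Packet]:
    """Sender side: flip the marking bit every `period` time units."""
    return [(t, int(t // period) % 2) for t in send_times]


def count_batches(packets: List[Packet]) -> List[int]:
    """Measurement point: count packets per batch.

    A new batch starts whenever the marking bit changes. This assumes
    no reordering and that at least one packet of each batch arrives.
    """
    counts: List[int] = []
    last_bit = None
    for _, bit in packets:
        if bit != last_bit:
            counts.append(0)
            last_bit = bit
        counts[-1] += 1
    return counts


def loss_per_batch(upstream: List[int], downstream: List[int]) -> List[int]:
    """Packet loss per batch is the difference of the two counters."""
    return [u - d for u, d in zip(upstream, downstream)]


# Twelve packets, one per time unit, marking period of four time units.
sent = mark([float(t) for t in range(12)], period=4.0)
# Hypothetical loss: packets 5 and 9 never reach the second point.
received = [p for i, p in enumerate(sent) if i not in {5, 9}]

upstream_counts = count_batches(sent)
downstream_counts = count_batches(received)
loss = loss_per_batch(upstream_counts, downstream_counts)
print(upstream_counts, downstream_counts, loss)
```

In a deployment, each measurement point would typically keep one counter per marking value and read the idle counter while the other value is in use; the list of per-batch counts here stands in for those periodic readings.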