--- 1/draft-ietf-opsawg-ntf-12.txt 2021-12-03 12:13:10.805853365 -0800
+++ 2/draft-ietf-opsawg-ntf-13.txt 2021-12-03 12:13:10.853853994 -0800
@@ -1,25 +1,25 @@
 OPSAWG H. Song
 Internet-Draft Futurewei
 Intended status: Informational F. Qin
-Expires: 4 June 2022 China Mobile
+Expires: 6 June 2022 China Mobile
 P. Martinez-Julia
 NICT
 L. Ciavaglia
 Rakuten Mobile
 A. Wang
 China Telecom
- 1 December 2021
+ 3 December 2021
 Network Telemetry Framework
- draft-ietf-opsawg-ntf-12
+ draft-ietf-opsawg-ntf-13
 Abstract
 Network telemetry is a technology for gaining network insight and
 facilitating efficient and automated network management. It
 encompasses various techniques for remote data generation,
 collection, correlation, and consumption. This document describes
 an architectural framework for network telemetry, motivated by
 challenges that are encountered as part of the operation of networks
 and by the requirements that ensue. This document clarifies the
@@ -37,21 +37,21 @@
 Internet-Drafts are working documents of the Internet Engineering
 Task Force (IETF). Note that other groups may also distribute
 working documents as Internet-Drafts. The list of current Internet-
 Drafts is at https://datatracker.ietf.org/drafts/current/.
 Internet-Drafts are draft documents valid for a maximum of six months
 and may be updated, replaced, or obsoleted by other documents at any
 time. It is inappropriate to use Internet-Drafts as reference
 material or to cite them other than as "work in progress."
- This Internet-Draft will expire on 4 June 2022.
+ This Internet-Draft will expire on 6 June 2022.
 Copyright Notice
 Copyright (c) 2021 IETF Trust and the persons identified as the
 document authors. All rights reserved.
 This document is subject to BCP 78 and the IETF Trust's Legal
 Provisions Relating to IETF Documents (https://trustee.ietf.org/
 license-info) in effect on the date of publication of this document.
 Please review these documents carefully, as they describe your rights
@@ -61,21 +61,21 @@
 provided without warranty as described in the Revised BSD License.
 Table of Contents
 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
 1.1. Applicability Statement . . . . . . . . . . . . . . . . . 4
 1.2. Glossary . . . . . . . . . . . . . . . . . . . . . . . . 4
 2. Background . . . . . . . . . . . . . . . . . . . . . . . . . 6
 2.1. Telemetry Data Coverage . . . . . . . . . . . . . . . . . 7
 2.2. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 8
- 2.3. Challenges . . . . . . . . . . . . . . . . . . . . . . . 10
+ 2.3. Challenges . . . . . . . . . . . . . . . . . . . . . . . 9
 2.4. Network Telemetry . . . . . . . . . . . . . . . . . . . . 11
 2.5. The Necessity of a Network Telemetry Framework . . . . . 13
 3. Network Telemetry Framework . . . . . . . . . . . . . . . . . 14
 3.1. Top Level Modules . . . . . . . . . . . . . . . . . . . . 15
 3.1.1. Management Plane Telemetry . . . . . . . . . . . . . 18
 3.1.2. Control Plane Telemetry . . . . . . . . . . . . . . . 18
 3.1.3. Forwarding Plane Telemetry . . . . . . . . . . . . . 19
 3.1.4. External Data Telemetry . . . . . . . . . . . . . . . 21
 3.2. Second Level Function Components . . . . . . . . . . . . 22
 3.3. Data Acquisition Mechanism and Type Abstraction . . . . . 24
@@ -140,21 +140,21 @@
 different category of telemetry data and corresponding procedures.
 All the modules are internally structured in the same way, including
 components that allow the operator to configure data sources in
 regard to what data to generate and how to make that available to
 client applications, components that instrument the underlying data
 sources, and components that perform the actual rendering, encoding,
 and exporting of the generated data. We show how the network
 telemetry framework can benefit the current and future network
 operations.
 Based on the distinction of modules and function components, we can
 map the existing and emerging techniques and
- protocols into the framework. The framework can also simplify the
+ protocols into the framework. The framework can also simplify
 designing, maintaining, and understanding a network telemetry system.
 In addition, we outline the evolution stages of the network telemetry
 system and discuss the potential security concerns.
 The purpose of the framework and taxonomy is to set a common ground
 for the collection of related work and provide guidance for future
 technique and standard developments. To the best of our knowledge,
 this document is the first such effort for network telemetry in
 industry standards organizations. This document does not define
 specific technologies.
@@ -175,22 +175,22 @@
 Before further discussion, we list some key terminology and acronyms
 used in this document. We make an intended differentiation between
 the terms of network telemetry and OAM. However, it should be
 understood that there is not a hard-line distinction between the two
 concepts. Rather, network telemetry is considered as an extension of
 OAM. It covers all the existing OAM protocols but puts more emphasis
 on the newer and emerging techniques and protocols concerning all
 aspects of network data from acquisition to consumption.
- AI: Artificial Intelligence. In network domain, AI refers to the
- machine-learning based technologies for automated network
+ AI: Artificial Intelligence. In the network domain, AI refers to
+ the machine-learning based technologies for automated network
 operation and other tasks.
 AM: Alternate Marking, a flow performance measurement method,
 specified in [RFC8321].
 BMP: BGP Monitoring Protocol, specified in [RFC7854].
 DPI: Deep Packet Inspection, referring to the techniques that
 examines packet beyond packet L3/L4 headers.
@@ -243,21 +243,21 @@
 [I-D.ietf-ippm-ioam-direct-export].
 RESTCONF: An HTTP-based protocol that provides a programmatic
 interface for accessing data defined in YANG, using the datastore
 concepts defined in NETCONF, as specified in [RFC8040].
 SMIv2: Structure of Management Information Version 2, defining MIB
 objects, specified in [RFC2578].
 SNMP: Simple Network Management Protocol. Version 1, 2, and 3 are
- specified in [RFC1157], [RFC3416], and [RFC3414], respectively.
+ specified in [RFC1157], [RFC3416], and [RFC3411], respectively.
 XML: Extensible Markup Language is a markup language for data
 encoding that is both human-readable and machine-readable,
 specified by W3C [xml].
 YANG: YANG is a data modeling language for the definition of data
 sent over network management protocols such as the NETCONF and
 RESTCONF. YANG is defined in [RFC6020] and [RFC7950].
 YANG ECA: A YANG model for Event-Condition-Action policies, defined
@@ -285,25 +285,27 @@
 technologies, network big data analytics gives network operators an
 opportunity to gain network insights and move towards network
 autonomy. Some operators start to explore the application of
 Artificial Intelligence (AI) to make sense of network data. Software
 tools can use the network data to detect and react on network faults,
 anomalies, and policy violations, as well as predicting future
 events. In turn, the network policy updates for planning, intrusion
 prevention, optimization, and self-healing may be applied.
 It is conceivable that an autonomic network [RFC7575] is the logical
- next step for network evolution following Software Defined Network
+ next step for network evolution following Software Defined Networking
 (SDN), aiming to reduce (or even eliminate) human labor, make more
 efficient use of network resources, and provide better services more
- aligned with customer requirements. The related technique of
- Intent-based Networking (IBN)
+ aligned with customer requirements. The IETF ANIMA working group is
+ dedicated to developing and maintaining protocols and procedures for
+ automated network management and control of professionally-managed
+ networks. The related technique of Intent-based Networking (IBN)
 [I-D.irtf-nmrg-ibn-concepts-definitions] requires network visibility
 and telemetry data in order to ensure that the network is behaving as
 intended.
 However, while the data processing capability is improved and
 applications require more data to function better, the networks lag
 behind in extracting and translating network data into useful and
 actionable information in efficient ways. The system bottleneck is
 shifting from data consumption to data supply. Both the number of
 network nodes and the traffic bandwidth keep increasing at a fast
@@ -313,21 +315,21 @@
 a nutshell, it is a challenge to get enough high-quality data out of
 the network in a manner that is efficient, timely, and flexible.
 Therefore, we need to survey the existing technologies and protocols
 and identify any potential gaps.
 In the remainder of this section, first we clarify the scope of
 network data (i.e., telemetry data) relevant in this document. Then,
 we discuss several key use cases for today's and future network
 operations. Next, we show why the current network OAM techniques and
 protocols are insufficient for these use cases. The discussion
- underlines the need of new methods, techniques, and protocols, as
+ underlines the need for new methods, techniques, and protocols, as
 well as the extensions of existing ones, which we assign under the
 umbrella term - Network Telemetry.
 2.1. Telemetry Data Coverage
 Any information that can be extracted from networks (including data
 plane, control plane, and management plane) and used to gain
 visibility or as basis for actions is considered telemetry data. It
 includes statistics, event records and logs, snapshots of state,
 configuration data, etc. It also covers the outputs of any active
@@ -344,53 +346,54 @@
 2.2.
 Use Cases
 The following set of use cases is essential for network operations.
 While the list is by no means exhaustive, it is enough to highlight
 the requirements for data velocity, variety, volume, and veracity,
 the attributes of big data, in networks.
 * Security: Network intrusion detection and prevention systems need
 to monitor network traffic and activities and act upon anomalies.
- Given increasingly sophisticated attack vector coupled with
+ Given increasingly sophisticated attack vectors coupled with
 increasingly severe consequences of security breaches, new tools
 and techniques need to be developed, relying on wider and deeper
- visibility into networks. The ultimate goal is to achieve the
- security with no, or only minimal, human intervention.
+ visibility into networks. The ultimate goal is to achieve
+ security with no, or only minimal, human intervention, and without
+ disrupting legitimate traffic flows.
 * Policy and Intent Compliance: Network policies are the rules that
 constrain the services for network access, provide service
 differentiation, or enforce specific treatment on the traffic.
 For example, a service function chain is a policy that requires
 the selected flows to pass through a set of ordered network
 functions. Intent, as defined in
 [I-D.irtf-nmrg-ibn-concepts-definitions], is a set of operational
 goals that a network should meet and outcomes that a network is
 supposed to deliver, defined in a declarative manner without
 specifying how to achieve or implement them. An intent requires a
 complex translation and mapping process before being applied on
 networks. While a policy or intent is enforced, the compliance
 needs to be verified and monitored continuously by relying on
 visibility that is provided through network telemetry data. Any
Any - violation must be notified immediately, potentially resulting in + violation must be reported immediately, potentially resulting in updates to how the policy or intent is applied in the network to ensure that it remains in force, or otherwise alerting the network administrator to the policy or intent violation. * SLA Compliance: A Service-Level Agreement (SLA) is a service contract between a service provider and a client, which include the metrics for the service measurement and remedy/penalty procedures when the service level misses the agreement. Users need to check if they get the service as promised and network - operators need to evaluate how they can deliver the services that - can meet the SLA based on realtime network telemetry data, - including data from network measurements. + operators need to evaluate how they can deliver services that can + meet the SLA based on realtime network telemetry data, including + data from network measurements. * Root Cause Analysis: Many network failure can be the effect of a sequence of chained events. Troubleshooting and recovery require quick identification of the root cause of any observable issues. However, the root cause is not always straightforward to identify, especially when the failure is sporadic and the number of event messages, both related and unrelated to the same cause, is overwhelming. While technologies such as machine learning can be used for root cause analysis, it is up to the network to sense and provide the relevant diagnostic data which are either actively fed @@ -428,28 +431,29 @@ techniques are not sufficient to support the above use cases for the following reasons: * Most use cases need to continuously monitor the network and dynamically refine the data collection in real-time. Poll-based low-frequency data collection is ill-suited for these applications. 
 Subscription-based streaming data directly pushed from
 the data source (e.g., the forwarding chip) is preferred to
 provide sufficient data quantity and precision at scale.
- * Comprehensive data is needed from packet processing engines to
- traffic manager, from line cards to main control board, from user
- flows to control protocol packets, from device configurations to
- operations, and from physical layer to application layer.
- Conventional OAM only covers a narrow range of data (e.g., SNMP
- only handles data from the Management Information Base (MIB)).
- Classical network devices cannot provide all the necessary probes.
- More open and programmable network devices are therefore needed.
+ * Comprehensive data is needed, ranging from packet processing
+ engines to traffic manager, from line cards to main control board,
+ from user flows to control protocol packets, from device
+ configurations to operations, and from physical layer to
+ application layer. Conventional OAM only covers a narrow range of
+ data (e.g., SNMP only handles data from the Management Information
+ Base (MIB)). Classical network devices cannot provide all the
+ necessary probes. More open and programmable network devices are
+ therefore needed.
 * Many application scenarios need to correlate network-wide data
 from multiple sources (i.e., from distributed network devices,
 different components of a network device, or different network
 planes). A piecemeal solution is often lacking the capability to
 consolidate the data from multiple sources. The composition of a
 complete solution, as partly proposed by Autonomic Resource
 Control Architecture(ARCA)
 [I-D.pedro-nmrg-anticipated-adaptation], will be empowered and
 guided by a comprehensive framework.
@@ -461,21 +465,21 @@
 * Although some conventional OAM techniques support data push (e.g.,
 SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow [RFC3176]), the
 pushed data are limited to only predefined management plane
 warnings (e.g., SNMP Trap) or sampled user packets (e.g., sFlow).
 Network operators require the data with arbitrary source,
 granularity, and precision which are beyond the capability of the
 existing techniques.
 * The conventional passive measurement techniques can either consume
- excessive network resources and render excessive redundant data,
+ excessive network resources and produce excessive redundant data,
 or lead to inaccurate results; on the other hand, the conventional
 active measurement techniques can interfere with the user traffic
 and their results are indirect. Techniques that can collect
 direct and on-demand data from user traffic are more favorable.
 These challenges were addressed by newer standards and techniques
 (e.g., IPFIX/Netflow, Packet Sampling (PSAMP), IOAM, and YANG-Push)
 and more are emerging. These standards and techniques need to be
 recognized and accommodated in a new framework.
@@ -510,21 +514,21 @@
 telemetry collectors subscribe to streaming data pushed from data
 sources in network devices.
 * Volume and Velocity: The telemetry data is intended to be consumed
 by machines rather than by human being. Therefore, the data
 volume can be huge and the processing is optimized for the needs
 of automation in realtime.
 * Normalization and Unification: Telemetry aims to address the
 overall network automation needs. Efforts are made to normalize
- the data representation and unify the protocols, so to simplify
+ the data representation and unify the protocols, so as to simplify
 data analysis and provide integrated analysis across heterogeneous
 devices and data sources across a network.
 * Model-based: The telemetry data is modeled in advance which allows
 applications to configure and consume data with ease.
 * Data Fusion: The data for a single application can come from
 multiple data sources (e.g., cross-domain, cross-device, and
 cross-layer) based on common naming/ID and needs to be correlated
 to take effect.
@@ -619,21 +623,21 @@
 and path can be acquired. An application may need to switch its
 viewpoint during operation. It may also need to correlate a
 service and its impact on user experience to acquire the
 comprehensive information.
 * Applications require network telemetry to be elastic in order to
 make efficient use of network resources and reduce the impact of
 processing related to network telemetry on network performance.
 For example, routine network monitoring should cover the entire
 network with a low data sampling rate. Only when issues arise or
- critical trends emerge should telemetry data source be modified
+ critical trends emerge should telemetry data sources be modified
 and telemetry data rates boosted as needed.
 * Efficient data aggregation is critical for applications to reduce
 the overall quantity of data and improve the accuracy of analysis.
 A telemetry framework collects together all the telemetry-related
 works from different sources and working groups within IETF. This
 makes it possible to assemble a comprehensive network telemetry
 system and to avoid repetitious or redundant work. The framework
 should cover the concepts and components from the standardization
@@ -805,23 +809,23 @@
 follows (note that the requirements may pertain across all telemetry
 modules; however, we emphasize those that are most pronounced for a
 particular plane).
 3.1.1. Management Plane Telemetry
 The management plane of network elements interacts with the Network
 Management System (NMS), and provides information such as performance
 data, network logging data, network warning and defects data, and
 network statistics and state data. The management plane includes
- many protocols, including some that are considered "legacy", such as
- SNMP and syslog. Regardless the protocol, management plane telemetry
- must address the following requirements:
+ many protocols, including the classical SNMP and syslog. Regardless
+ the protocol, management plane telemetry must address the following
+ requirements:
 * Convenient Data Subscription: An application should have the
 freedom to choose which data is exported (see section 4.3) and the
 means and frequency of how that data is exported (e.g., on-change
 or periodic subscription).
 * Structured Data: For automatic network operation, machines will
 replace human for network data comprehension. Data modeling
 languages, such as YANG, can efficiently describe structured data
 and normalize data encoding and transformation.
@@ -863,23 +867,23 @@
 * Conventional OAM-based approaches for control plane KPI
 measurement include Ping (L3), Traceroute (L3), Y.1731 [y1731]
 (L2), and so on. One common issue behind these methods is that
 they only measure the KPIs instead of reflecting the actual
 running status of these protocols, making them less effective or
 efficient for control plane troubleshooting and network
 optimization.
 * An example of the control plane telemetry is the BGP monitoring
- protocol (BMP), it is currently used for monitoring the BGP routes
- and enables rich applications, such as BGP peer analysis, AS
- analysis, prefix analysis, and security analysis. However, the
+ protocol (BMP). It is currently used for monitoring the BGP
+ routes and enables rich applications, such as BGP peer analysis,
+ AS analysis, prefix analysis, and security analysis. However, the
 monitoring of other layers, protocols and the cross-layer, cross-
 protocol KPI correlations are still in their infancy (e.g., IGP
 monitoring is not as extensive as BMP), which require further
 research.
 * The requirement and solutions for network congestion avoidance are
 also applicable to the control plane telemetry.
 3.1.3. Forwarding Plane Telemetry
@@ -1027,21 +1031,21 @@
 data.
 On the other hand, it receives, stores, and processes the
 returned data from network devices. Data analysis can be
 interactive to initiate further data queries. This component can
 reside in either network devices or remote controllers. It can be
 centralized and distributed, and involve one or more instances.
 * Data Configuration and Subscription: This component manages data
 queries on devices. It determines the protocol and channel for
 applications to acquire desired data. This component is also
 responsible for configuring the desired data that might not be
- directly available form data sources. The subscription data can
+ directly available from data sources. The subscription data can
 be described by models, templates, or programs.
 * Data Encoding and Export: This component determines how telemetry
 data is delivered to the data analysis and storage component with
 access control. The data encoding and the transport protocol may
 vary due to the data export location.
 * Data Generation and Processing: The requested data needs to be
 captured, filtered, processed, and formatted in network devices
 from raw data sources. This may involve in-network computing and
@@ -1259,31 +1263,31 @@
 transporting, and analyzing a wide variety of data sources in support
 of network applications. The protocols, data formats, and
 configurations chosen to implement this framework will dictate the
 specific security considerations. These considerations may include:
 * Telemetry framework trust and policy model;
 * Role management and access control for enabling and disabling
 telemetry capabilities;
- * Protocol transport used telemetry data and inherent security
- capabilities;
+ * Protocol transport used for telemetry data and its inherent
+ security capabilities;
 * Telemetry data stores, storage encryption, methods of access, and
 retention practices;
 * Tracking telemetry events and any abnormalities that might
 identify malicious attacks using telemetry interfaces.
- * Authentication and signing of telemetry data to make data more
- trustworthy.
+ * Authentication and integrity protection of telemetry data to make
+ data more trustworthy.
 * Segregating the telemetry data traffic from the data traffic
 carried over the network (e.g., historically management access and
 management data may be carried via an independent management
 network).
 Some security considerations highlighted above may be minimized or
 negated with policy management of network telemetry. In a network
 telemetry deployment it would be advantageous to separate telemetry
 capabilities into different classes of policies, i.e., Role Based
@@ -1306,22 +1310,23 @@
 The other contributors of this document are Tianran Zhou, Zhenbin Li,
 Zhenqiang Li, Daniel King, Adrian Farrel, and Alexander Clemm
 8. Acknowledgments
 We would like to thank Rob Wilton, Greg Mirsky, Randy Presuhn, Joe
 Clarke, Victor Liu, James Guichard, Uri Blumenthal, Giuseppe
 Fioccola, Yunan Gu, Parviz Yegani, Young Lee, Qin Wu, Gyan Mishra,
 Ben Schwartz, Alexey Melnikov, Michael Scharf, Dhruv Dhody, Martin
 Duke, Roman Danyliw, Warren Kumari, Sheng Jiang, Lars Eggert, Eric
- Vyncke, Jean-Michel Combes, and many others who have provided helpful
- comments and suggestions to improve this document.
+ Vyncke, Jean-Michel Combes, Erik Kline, Benjamin Kaduk, and many
+ others who have provided helpful comments and suggestions to improve
+ this document.
 9. Informative References
 [gnmi] "gNMI - gRPC Network Management Interface", .
 [gpb] "Google Protocol Buffers", .
@@ -1424,25 +1429,25 @@
 [RFC2981] Kavasseri, R., Ed., "Event MIB", RFC 2981,
 DOI 10.17487/RFC2981, October 2000, .
 [RFC3176] Phaal, P., Panchen, S., and N. McKee, "InMon Corporation's
 sFlow: A Method for Monitoring Traffic in Switched and
 Routed Networks", RFC 3176, DOI 10.17487/RFC3176,
 September 2001, .
- [RFC3414] Blumenthal, U. and B. Wijnen, "User-based Security Model
- (USM) for version 3 of the Simple Network Management
- Protocol (SNMPv3)", STD 62, RFC 3414,
- DOI 10.17487/RFC3414, December 2002,
- .
+ [RFC3411] Harrington, D., Presuhn, R., and B. Wijnen, "An
+ Architecture for Describing Simple Network Management
+ Protocol (SNMP) Management Frameworks", STD 62, RFC 3411,
+ DOI 10.17487/RFC3411, December 2002,
+ .
 [RFC3416] Presuhn, R., Ed., "Version 2 of the Protocol Operations for
 the Simple Network Management Protocol (SNMP)", STD 62,
 RFC 3416, DOI 10.17487/RFC3416, December 2002, .
 [RFC3877] Chisholm, S. and D. Romascanu, "Alarm Management
 Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877,
 September 2004, .
@@ -1761,21 +1766,21 @@
 [I-D.ietf-ippm-ioam-direct-export] and IOAM Marking
 [I-D.song-ippm-postcard-based-telemetry], is a complementary
 technique to the passport-based IOAM. PBT directly exports data at
 each node through an independent packet. At the cost of higher
 bandwidth overhead and the need for data correlation, PBT shows
 several unique advantages. It can also help to identify packet drop
 location in case a packet is dropped on its forwarding path.
 A.3.6. Existing OAM for Specific Data Planes
- Various data planes raises unique OAM requirements. IETF has
+ Various data planes raise unique OAM requirements. IETF has
 published OAM technique and framework documents (e.g., [RFC8924] and
 [RFC5085]) targeting different data planes such as Multi-Protocol
 Label Switching (MPLS), L2 Virtual Private Network (L2-VPN), Network
 Virtualization Overlays (NVO3), Virtual Extensible LAN (VXLAN), Bit
 Indexed Explicit Replication (BIER), Service Function Chaining (SFC),
 Segment Routing (SR), and Deterministic Networking (DETNET). The
 aforementioned data plane telemetry techniques can be used to enhance
 the OAM capability on such data planes.
 A.4. External Data and Event Telemetry