draft-ietf-opsawg-ntf-01.txt   draft-ietf-opsawg-ntf-02.txt 
OPSAWG H. Song, Ed. OPSAWG H. Song, Ed.
Internet-Draft Futurewei Internet-Draft Futurewei
Intended status: Informational F. Qin Intended status: Informational F. Qin
Expires: December 13, 2019 China Mobile Expires: April 10, 2020 China Mobile
P. Martinez-Julia P. Martinez-Julia
NICT NICT
L. Ciavaglia L. Ciavaglia
Nokia Nokia
A. Wang A. Wang
China Telecom China Telecom
June 11, 2019 October 8, 2019
Network Telemetry Framework Network Telemetry Framework
draft-ietf-opsawg-ntf-01 draft-ietf-opsawg-ntf-02
Abstract Abstract
This document provides an architectural framework for network Network telemetry is the technology for gaining network insight and
telemetry to address the current and future network operation facilitating efficient and automated network management. It engages
challenges and requirements. As evidenced by some key various techniques for remote data collection, correlation, and
characteristics and industry practices, network telemetry covers consumption. This document provides an architectural framework for
technologies and protocols beyond the conventional network network telemetry, motivated by the network operation challenges and
Operations, Administration, and Management (OAM), so it promises requirements. As evidenced by some key characteristics and industry
better flexibility, scalability, accuracy, coverage, and performance practices, network telemetry covers technologies and protocols beyond
and allows automated control loops to suit both today's and the conventional network Operations, Administration, and Management
tomorrow's network operation requirements. This document clarifies (OAM). It promises better flexibility, scalability, accuracy,
the terminologies and classifies the modules and components of a coverage, and performance and allows automated control loops to suit
network telemetry system. The framework and taxonomy help to set a both today's and tomorrow's network operation. This document
common ground for the collection of related work and provide guidance clarifies the terminologies and classifies the modules and components
for future technique and standard developments. of a network telemetry system from several different perspectives.
To the best of our knowledge, this document is the first such effort
for network telemetry in industry standards organizations. The
framework and taxonomy help to set a common ground for the collection
of related work and provide guidance for future technique and
standard developments.
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/. Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 13, 2019. This Internet-Draft will expire on April 10, 2020.
Copyright Notice Copyright Notice
Copyright (c) 2019 IETF Trust and the persons identified as the Copyright (c) 2019 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of (https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3
2. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2. Challenges . . . . . . . . . . . . . . . . . . . . . . . 6 2.2. Challenges . . . . . . . . . . . . . . . . . . . . . . . 6
2.3. Glossary . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3. Glossary . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4. Network Telemetry . . . . . . . . . . . . . . . . . . . . 8 2.4. Network Telemetry . . . . . . . . . . . . . . . . . . . . 8
3. The Necessity of a Network Telemetry Framework . . . . . . . 9 3. The Necessity of a Network Telemetry Framework . . . . . . . 10
4. Network Telemetry Framework . . . . . . . . . . . . . . . . . 11 4. Network Telemetry Framework . . . . . . . . . . . . . . . . . 11
4.1. Data Acquiring Mechanisms . . . . . . . . . . . . . . . . 11 4.1. Data Acquiring Mechanisms and Data Types . . . . . . . . 12
4.2. Data Objects . . . . . . . . . . . . . . . . . . . . . . 12 4.2. Data Object Modules . . . . . . . . . . . . . . . . . . . 13
4.3. Function Components . . . . . . . . . . . . . . . . . . . 14 4.2.1. Requirements and Challenges for each Module . . . . . 15
4.4. Existing Works Mapped in the Framework . . . . . . . . . 16 4.3. Function Components . . . . . . . . . . . . . . . . . . . 19
5. Evolution of Network Telemetry . . . . . . . . . . . . . . . 17 4.4. Existing Works Mapped in the Framework . . . . . . . . . 21
6. Security Considerations . . . . . . . . . . . . . . . . . . . 18 5. Evolution of Network Telemetry . . . . . . . . . . . . . . . 22
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19 6. Security Considerations . . . . . . . . . . . . . . . . . . . 23
8. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 19 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 24
9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 19 8. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 24
10. References . . . . . . . . . . . . . . . . . . . . . . . . . 19 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 24
10.1. Normative References . . . . . . . . . . . . . . . . . . 19 10. Informative References . . . . . . . . . . . . . . . . . . . 24
10.2. Informative References . . . . . . . . . . . . . . . . . 20 Appendix A. A Survey on Existing Network Telemetry Techniques . 28
Appendix A. A Survey on Existing Network Telemetry Techniques . 23 A.1. Management Plane Telemetry . . . . . . . . . . . . . . . 28
A.1. Management Plane Telemetry . . . . . . . . . . . . . . . 23 A.1.1. Push Extensions for NETCONF . . . . . . . . . . . . . 28
A.1.1. Requirements and Challenges . . . . . . . . . . . . . 23 A.1.2. gRPC Network Management Interface . . . . . . . . . . 28
A.1.2. Push Extensions for NETCONF . . . . . . . . . . . . . 24 A.2. Control Plane Telemetry . . . . . . . . . . . . . . . . . 29
A.1.3. gRPC Network Management Interface . . . . . . . . . . 24 A.2.1. BGP Monitoring Protocol . . . . . . . . . . . . . . . 29
A.2. Control Plane Telemetry . . . . . . . . . . . . . . . . . 24 A.3. Data Plane Telemetry . . . . . . . . . . . . . . . . . . 29
A.2.1. Requirements and Challenges . . . . . . . . . . . . . 24 A.3.1. The IPFPM technology . . . . . . . . . . . . . . . . 29
A.2.2. BGP Monitoring Protocol . . . . . . . . . . . . . . . 25 A.3.2. Dynamic Network Probe . . . . . . . . . . . . . . . . 30
A.3. Data Plane Telemetry . . . . . . . . . . . . . . . . . . 25 A.3.3. IP Flow Information Export (IPFIX) protocol . . . . . 31
A.3.1. Requirements and Challenges . . . . . . . . . . . . . 26 A.3.4. In-Situ OAM . . . . . . . . . . . . . . . . . . . . . 31
A.3.2. Technique Taxonomy . . . . . . . . . . . . . . . . . 26 A.3.5. Postcard Based Telemetry . . . . . . . . . . . . . . 31
A.3.3. The IPFPM technology . . . . . . . . . . . . . . . . 27 A.4. External Data and Event Telemetry . . . . . . . . . . . . 31
A.3.4. Dynamic Network Probe . . . . . . . . . . . . . . . . 29 A.4.1. Sources of External Events . . . . . . . . . . . . . 32
A.3.5. IP Flow Information Export (IPFIX) protocol . . . . . 29 A.4.2. Connectors and Interfaces . . . . . . . . . . . . . . 33
A.3.6. In-Situ OAM . . . . . . . . . . . . . . . . . . . . . 29 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 33
A.3.7. Postcard Based Telemetry . . . . . . . . . . . . . . 30
A.4. External Data and Event Telemetry . . . . . . . . . . . . 30
A.4.1. Requirements and Challenges . . . . . . . . . . . . . 30
A.4.2. Sources of External Events . . . . . . . . . . . . . 31
A.4.3. Connectors and Interfaces . . . . . . . . . . . . . . 32
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 32
1. Introduction 1. Introduction
Network visibility is essential for network operation. Network Network visibility is the ability of management tools to see the
telemetry has been considered as an ideal means to gain sufficient state and behavior of a network. It is essential for successful
network visibility with better flexibility, scalability, accuracy, network operation. Network telemetry is the process of measuring,
coverage, and performance than conventional OAM technologies. correlating, recording, and distributing information about the
However, network telemetry is a vague term. The scope and coverage behavior of a network. Network telemetry has been considered as an
of it cause confusion and misunderstandings. It is beneficial to ideal means to gain sufficient network visibility with better
have an unambiguous concept and a clear architectural framework for flexibility, scalability, accuracy, coverage, and performance than
network telemetry, so we can better align the related technology and some conventional network Operations, Administration, and Management
standard work. (OAM) techniques.
First, we show some key characteristics of network telemetry which However, so far the term "network telemetry" lacks a solid and
set a clear distinction from the conventional network OAM and show unambiguous definition. The scope and coverage of it cause confusion
that some conventional OAM technologies can be considered a subset of and misunderstandings. It is beneficial to clarify the concept and
the network telemetry technologies. We then provide an architectural provide a clear architectural framework for network telemetry, so we
framework for network telemetry to meet the current and future can articulate the technical field, and better align the related
network operation requirements. Following the framework, we classify techniques and standard works.
the components of a network telemetry system so we can easily map the
existing and emerging techniques and protocols into the framework. To fulfill such an undertaking, we first discuss some key
Finally, we outline a roadmap for the evolution of the network characteristics of network telemetry which set a clear distinction
telemetry system. from the conventional network OAM and show that some conventional OAM
technologies can be considered a subset of the network telemetry
technologies. We then provide an architectural framework from three
different perspectives for network telemetry. We show how network
telemetry can meet the current and future network operation
requirements, and the challenges each telemetry module is facing.
Based on the distinction of modules and function components, we can
easily map the existing and emerging techniques and protocols into
the framework. Finally, we outline a roadmap for the evolution of
the network telemetry system and discuss the potential security
concerns for network telemetry.
The purpose of the framework and taxonomy is to set a common ground The purpose of the framework and taxonomy is to set a common ground
for the collection of related work and provide guidance for future for the collection of related work and provide guidance for future
technique and standard developments. technique and standard developments. To the best of our knowledge,
this document is the first such effort for network telemetry in
1.1. Requirements Language industry standards organizations.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 [RFC2119][RFC8174] when, and only when, they appear in all
capitals, as shown here.
2. Motivation 2. Motivation
Thanks to the advance of the computing and storage technologies, The term "big data" describes the extremely large volume
today's big data analytics gives network operators an unprecedented of data sets that can be analyzed computationally to reveal patterns,
opportunity to gain network insights and move towards network trends, and associations. The network is undoubtedly a source of big
autonomy. Some operators start to explore the application of data because of its scale and all the traffic that goes through it. It is
easy to see that network OAM can benefit from network big data.
Today one can easily access advanced big data analytics capability
through a plethora of commercial and open source platforms (e.g.,
Apache Hadoop), tools (e.g., Apache Spark), and techniques (e.g.,
machine learning). Thanks to the advance of computing and storage
technologies, network big data analytics gives network operators an
unprecedented opportunity to gain network insights and move towards
network autonomy. Some operators start to explore the application of
Artificial Intelligence (AI) to make sense of network data. Software Artificial Intelligence (AI) to make sense of network data. Software
tools can use the network data to detect and react on network faults, tools can use the network data to detect and react on network faults,
anomalies, and policy violations, as well as predicting future anomalies, and policy violations, as well as predicting future
events. In turn, the network policy updates for planning, intrusion events. In turn, the network policy updates for planning, intrusion
prevention, optimization, and self-healing may be applied. prevention, optimization, and self-healing may be applied.
It is conceivable that an intent-driven autonomic network [RFC7575] It is conceivable that an intent-driven autonomic network [RFC7575]
is the logical next step for network evolution following Software is the logical next step for network evolution following Software
Defined Network (SDN), aiming to reduce (or even eliminate) human Defined Network (SDN), aiming to reduce (or even eliminate) human
labor, make the most efficient usage of network resources, and labor, make the most efficient usage of network resources, and
skipping to change at page 5, line 35 skipping to change at page 5, line 48
issues. However, the root cause is not always straightforward to issues. However, the root cause is not always straightforward to
identify, especially when the failure is sporadic and the related identify, especially when the failure is sporadic and the related
and unrelated events are overwhelming. While machine learning and unrelated events are overwhelming. While machine learning
technologies can be used for root cause analysis, it is up to the technologies can be used for root cause analysis, it is up to the
network to sense and provide all the relevant data. network to sense and provide all the relevant data.
Network Optimization: This covers all short-term and long-term Network Optimization: This covers all short-term and long-term
network optimization techniques, including load balancing, Traffic network optimization techniques, including load balancing, Traffic
Engineering (TE), and network planning. Network operators are Engineering (TE), and network planning. Network operators are
motivated to optimize their network utilization and differentiate motivated to optimize their network utilization and differentiate
services for better ROI or lower CAPEX. The first step is to know services for better Return On Investment (ROI) or lower Capital
the real-time network conditions before applying policies for Expenditures (CAPEX). The first step is to know the real-time
traffic manipulation. In some cases, micro-bursts need to be network conditions before applying policies for traffic
detected in a very short time-frame so that fine-grained traffic manipulation. In some cases, micro-bursts need to be detected in
control can be applied to avoid network congestion. The long-term a very short time-frame so that fine-grained traffic control can
network capacity planning and topology augmentation also rely on be applied to avoid network congestion. The long-term network
the accumulated data of the network operations. capacity planning and topology augmentation also rely on the
accumulated data of the network operations.
Event Tracking and Prediction: The visibility of user traffic path Event Tracking and Prediction: The visibility of user traffic path
and performance is critical for healthy network operation. and performance is critical for healthy network operation.
Numerous related network events are of interest to network Numerous related network events are of interest to network
operators. For example, network operators always want to learn operators. For example, network operators always want to learn
where and why packets are dropped for an application flow. They where and why packets are dropped for an application flow. They
also want to be warned of issues in advance so proactive actions also want to be warned of issues in advance so proactive actions
can be taken to avoid catastrophic consequences. can be taken to avoid catastrophic consequences.
2.2. Challenges 2.2. Challenges
skipping to change at page 6, line 42 skipping to change at page 7, line 5
o Many application scenarios need to correlate network-wide data o Many application scenarios need to correlate network-wide data
from multiple sources (i.e., from distributed network devices, from multiple sources (i.e., from distributed network devices,
different components of a network device, or different network different components of a network device, or different network
planes). A piecemeal solution is often lacking the capability to planes). A piecemeal solution is often lacking the capability to
consolidate the data from multiple sources. The composition of a consolidate the data from multiple sources. The composition of a
complete solution, as partly proposed by Autonomic Resource complete solution, as partly proposed by Autonomic Resource
Control Architecture (ARCA) Control Architecture (ARCA)
[I-D.pedro-nmrg-anticipated-adaptation], will be empowered and [I-D.pedro-nmrg-anticipated-adaptation], will be empowered and
guided by a comprehensive framework. guided by a comprehensive framework.
o Some of the conventional OAM techniques (e.g., CLI and Syslog) are o Some of the conventional OAM techniques (e.g., CLI and Syslog)
lack of formal data model. The unstructured data hinder the tool lack a formal data model. The unstructured data hinder the tool
automation and application extensibility. Standardized data automation and application extensibility. Standardized data
models are essential to support the programmable networks. models are essential to support the programmable networks.
o Although some conventional OAM techniques support data push (e.g., o Although some conventional OAM techniques support data push (e.g.,
SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow), the pushed data SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow), the pushed data
are limited to only predefined management plane warnings (e.g., are limited to only predefined management plane warnings (e.g.,
SNMP Trap) or sampled user packets (e.g., sFlow). We require the SNMP Trap) or sampled user packets (e.g., sFlow). We require the
data with arbitrary source, granularity, and precision which are data with arbitrary source, granularity, and precision which are
beyond the capability of the existing techniques. beyond the capability of the existing techniques.
skipping to change at page 7, line 18 skipping to change at page 7, line 30
techniques can interfere with the user traffic and their results techniques can interfere with the user traffic and their results
are indirect. We need techniques that can collect direct and on- are indirect. We need techniques that can collect direct and on-
demand data from user traffic. demand data from user traffic.
2.3. Glossary 2.3. Glossary
Before further discussion, we list some key terminology and acronyms Before further discussion, we list some key terminology and acronyms
used in this document. We make an intentional distinction between used in this document. We make an intentional distinction between
network telemetry and network OAM. network telemetry and network OAM.
AI: Artificial Intelligence. Use machine-learning based AI: Artificial Intelligence. In network domain, AI refers to the
technologies to automate network operation. machine-learning based technologies for automated network
operation and other tasks.
BMP: BGP Monitoring Protocol BMP: BGP Monitoring Protocol, specified in [RFC7854].
DNP: Dynamic Network Probe DNP: Dynamic Network Probe, referring to programmable in-network
sensors for network monitoring and measurement.
DPI: Deep Packet Inspection DPI: Deep Packet Inspection, referring to techniques that
examine packets beyond the L3/L4 headers.
gNMI: gRPC Network Management Interface gNMI: gRPC Network Management Interface, a network management
protocol from OpenConfig Operator Working Group, mainly
contributed by Google. See [gnmi] for details.
gRPC: gRPC Remote Procedure Call gRPC: gRPC Remote Procedure Call, an open source high-performance RPC
framework that gNMI is based on. See [grpc] for details.
IPFIX: IP Flow Information Export Protocol IPFIX: IP Flow Information Export Protocol, specified in [RFC7011].
IPFPM: IP Flow Performance Measurement IPFPM: IP Flow Performance Measurement method, specified in
[RFC8321].
IOAM: In-situ OAM IOAM: In-situ OAM, a dataplane on-path telemetry technique.
NETCONF: Network Configuration Protocol NETCONF: Network Configuration Protocol, specified in [RFC6241].
Network Telemetry: Acquiring network data remotely for network Network Telemetry: Acquiring and processing network data remotely
monitoring and operation. A general term for a large set of for network monitoring and operation. A general term for a large
network visibility techniques and protocols, with the set of network visibility techniques and protocols, with the
characteristics defined in this document. Network telemetry characteristics defined in this document. Network telemetry
addresses the current network operation issues and enables smooth addresses the current network operation issues and enables smooth
evolution toward intent-driven autonomous networks. evolution toward intent-driven autonomous networks.
NMS: Network Management System NMS: Network Management System, referring to applications that allow
network administrators to manage a network's software and hardware
components. It usually records data from a network's remote
points to carry out central reporting to a system administrator.
OAM: Operations, Administration, and Maintenance. A group of OAM: Operations, Administration, and Maintenance. A group of
network management functions that provide network fault network management functions that provide network fault
indication, fault localization, performance information, and data indication, fault localization, performance information, and data
and diagnosis functions. Most conventional network monitoring and diagnosis functions. Most conventional network monitoring
techniques and protocols belong to network OAM. techniques and protocols belong to network OAM.
PBT: Postcard-Based Telemetry PBT: Postcard-Based Telemetry, a dataplane on-path telemetry
technique.
SNMP: Simple Network Management Protocol SNMP: Simple Network Management Protocol. Versions 1 and 2 are
specified in [RFC1157] and [RFC3416], respectively.
YANG: A data modeling language for NETCONF YANG: The abbreviation of "Yet Another Next Generation". YANG is a
data modeling language for the definition of data sent over
network management protocols such as NETCONF and RESTCONF.
YANG is defined in [RFC6020].
YANG FSM: A YANG model to define device side finite state machine YANG FSM: A YANG model that describes events, operations, and finite
state machine of YANG-defined network elements.
YANG PUSH: A method to subscribe to pushed data from a remote YANG YANG PUSH: A method to subscribe to pushed data from a remote YANG
datastore datastore on network devices.
2.4. Network Telemetry 2.4. Network Telemetry
Network telemetry has emerged as a mainstream technical term to refer Network telemetry has emerged as a mainstream technical term to refer
to the newer data collection and consumption techniques, to the newer data collection and consumption techniques,
distinguishing itself from the conventional techniques for network OAM. distinguishing itself from the conventional techniques for network OAM.
The representative techniques and protocols include IPFIX [RFC7011] The representative techniques and protocols include IPFIX [RFC7011]
and gRPC [I-D.kumar-rtgwg-grpc-protocol]. Network telemetry allows and gRPC [grpc]. Network telemetry allows separate entities to
separate entities to acquire data from network devices so that data acquire data from network devices so that data can be visualized and
can be visualized and analyzed to support network monitoring and analyzed to support network monitoring and operation. Network
operation. Network telemetry overlaps with the conventional network telemetry overlaps with the conventional network OAM and has a wider
OAM and has a wider scope than it. It is expected that network scope than it. It is expected that network telemetry can provide the
telemetry can provide the necessary network insight for autonomous necessary network insight for autonomous networks and address the
networks and address the shortcomings of conventional OAM techniques. shortcomings of conventional OAM techniques.
One difference between the network telemetry and the network OAM is One difference between the network telemetry and the network OAM is
that the network telemetry assumes machines as data consumers rather that the network telemetry assumes machines as data consumers rather
than human operators. Hence, the network telemetry can directly than human operators. Hence, the network telemetry can directly
trigger the automated network operation, while the conventional OAM trigger the automated network operation, while the conventional OAM
tools usually help human operators to monitor and diagnose the tools usually help human operators to monitor and diagnose the
networks and guide manual network operations. The difference leads networks and guide manual network operations. The difference leads
to very different techniques. to very different techniques.
Although the network telemetry techniques are just emerging and Although the network telemetry techniques are just emerging and
skipping to change at page 9, line 23 skipping to change at page 9, line 50
o Data Fusion: The data for a single application can come from o Data Fusion: The data for a single application can come from
multiple data sources (e.g., cross-domain, cross-device, and multiple data sources (e.g., cross-domain, cross-device, and
cross-layer) and needs to be correlated to take effect. cross-layer) and needs to be correlated to take effect.
o Dynamic and Interactive: Since network telemetry is meant to be o Dynamic and Interactive: Since network telemetry is meant to be
used in a closed control loop for network automation, it needs to used in a closed control loop for network automation, it needs to
run continuously and adapt to the dynamic and interactive queries run continuously and adapt to the dynamic and interactive queries
from the network operation controller. from the network operation controller.
Note that a technique does not need to have all the above In addition, an ideal network telemetry solution may also have the
characteristics to be qualified as telemetry. An ideal network following features or properties:
telemetry solution may also have the following features or
properties:
o In-Network Customization: The data can be customized in network at o In-Network Customization: The data can be customized in network at
run-time to cater to the specific need of applications. This run-time to cater to the specific need of applications. This
needs the support of a programmable data plane which allows probes needs the support of a programmable data plane which allows probes
to be deployed at flexible locations. to be deployed at flexible locations.
o In-Network Data Aggregation and Correlation: Network devices and
aggregation points can work out which events and what data need
to be stored, reported, or discarded thus reducing the load on the
central collection and processing points while still ensuring that
the right information is ready to be processed in a timely way.
o In-Network Processing and Action: Sometimes it is not necessary or
feasible to gather all information to a central point so that it
can be processed and acted upon. It is possible for the data
processing to be done in the network, and actions taken more
locally and more responsively.
o Direct Data Plane Export: The data originated from data plane can o Direct Data Plane Export: The data originated from data plane can
be directly exported to the data consumer for efficiency, be directly exported to the data consumer for efficiency,
especially when the data bandwidth is large and the real-time especially when the data bandwidth is large and the real-time
processing is required. processing is required.
o In-band Data Collection: In addition to the passive and active o In-band Data Collection: In addition to the passive and active
data collection approaches, the new hybrid approach allows data collection approaches, the new hybrid approach allows
direct data collection for any target flow on its entire forwarding direct data collection for any target flow on its entire forwarding
path. path.
o Non-intrusive: The telemetry system should avoid the pitfall of It is worth noting that, no matter how sophisticated a network
the "observer effect". That is, it should not change the network telemetry system is, it should not be intrusive to networks, by
behavior and affect the forwarding performance. avoiding the pitfall of the "observer effect". That is, it should
not change the network behavior and affect the forwarding
performance.
Although in many cases a network telemetry system is akin to the SDN
architecture, it is important to understand that network telemetry
does not imply the need for any centralized data processing and
analytics engine. Telemetry data producers and consumers can work
equally well in a distributed or peer-to-peer fashion.
3. The Necessity of a Network Telemetry Framework 3. The Necessity of a Network Telemetry Framework
Big data analytics and machine-learning based AI technologies are Big data analytics and machine-learning based AI technologies are
applied for network operation automation, relying on abundant data applied for network operation automation, relying on abundant data
from networks. The single-sourced and static data acquisition cannot from networks. The single-sourced and static data acquisition cannot
meet the data requirements. It is desirable to have a framework that meet the data requirements. It is desirable to have a framework that
integrates multiple telemetry approaches from different layers. This integrates multiple telemetry approaches from different layers. This
allows flexible combinations for different applications. The allows flexible combinations for different applications. The
framework would benefit application development for the following framework would benefit application development for the following
skipping to change at page 10, line 36 skipping to change at page 11, line 34
o Applications require network telemetry to be elastic in order to o Applications require network telemetry to be elastic in order to
efficiently use the network resource and reduce the performance efficiently use the network resource and reduce the performance
impact. Routine network monitoring covers the entire network with impact. Routine network monitoring covers the entire network with
low data sampling rate. When issues arise or trends emerge, the low data sampling rate. When issues arise or trends emerge, the
telemetry data source can be modified and the data rate can be telemetry data source can be modified and the data rate can be
boosted. boosted.
o Efficient data fusion is critical for applications to reduce the o Efficient data fusion is critical for applications to reduce the
overall quantity of data and improve the accuracy of analysis. overall quantity of data and improve the accuracy of analysis.
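The elastic behavior described above (routine monitoring at a low
sampling rate, boosted when issues arise) can be sketched as follows.
This is a minimal illustration only: the ElasticSampler class, its
"1 in N" interval parameters, and the boost/relax methods are
hypothetical names chosen for this example and are not defined by the
framework.

```python
class ElasticSampler:
    """Hypothetical '1 in N' sampler whose interval can be tightened."""

    def __init__(self, routine_interval: int, boosted_interval: int) -> None:
        if not 0 < boosted_interval <= routine_interval:
            raise ValueError("boosted interval must be in (0, routine_interval]")
        self.routine_interval = routine_interval
        self.boosted_interval = boosted_interval
        self.interval = routine_interval  # start in routine mode

    def boost(self) -> None:
        """Raise the data rate, e.g. when an issue or trend is detected."""
        self.interval = self.boosted_interval

    def relax(self) -> None:
        """Return to the low routine sampling rate."""
        self.interval = self.routine_interval

    def should_sample(self, seq: int) -> bool:
        """Sample every 'interval'-th item, counted by sequence number."""
        return seq % self.interval == 0
```

An application would call boost() when it detects an anomaly and
relax() once the investigation is over, so that the higher data rate
is only paid for while it is needed.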
So far, some telemetry related work has been done within IETF. A telemetry framework collects together all of the telemetry-related
However, the work is fragmented and scattered in different working work from different sources and working groups within the IETF. This
groups. The lack of coherence makes it difficult to assemble a makes it possible to assemble a comprehensive network telemetry
comprehensive network telemetry system and causes repetitive and system and to avoid repetitious or redundant work. The framework
redundant work. should cover the concepts and components from the standardization
perspective. This document clarifies the layered modules on which
A formal network telemetry framework is needed for constructing a the telemetry is exerted and decomposes the telemetry system into a
working system. The framework should cover the concepts and set of distinct components that the existing and future work can
components from the standardization perspective. This document easily map to.
clarifies the layers on which the telemetry is exerted and decomposes
the telemetry system into a set of distinct components that the
existing and future work can easily map to.
4. Network Telemetry Framework 4. Network Telemetry Framework
Network telemetry techniques can be classified from multiple Network telemetry techniques can be classified from multiple
dimensions. In this document, we provide three unique perspectives: dimensions. In this document, we provide three unique perspectives:
data acquiring mechanisms, data objects, and function components. data acquiring mechanisms, data objects, and function components.
4.1. Data Acquiring Mechanisms 4.1. Data Acquiring Mechanisms and Data Types
Broadly speaking, network data can be acquired through subscription Broadly speaking, network data can be acquired through subscription
(push) and query (poll). A subscriber may request data when it is (push) and query (poll). A subscriber may request data when it is
ready. It follows a Publish-Subscription (Pub-Sub) mode or a ready. It follows a Publish-Subscription (Pub-Sub) mode or a
Subscription-Publish (Sub-Pub) mode. In the Pub-Sub mode, pre- Subscription-Publish (Sub-Pub) mode. In the Pub-Sub mode, pre-
defined data are published and multiple qualified subscribers can defined data are published and multiple qualified subscribers can
subscribe to the data. In the Sub-Pub mode, a subscriber designates subscribe to the data. In the Sub-Pub mode, a subscriber designates
what data are of interest and demands the network devices to deliver what data are of interest and demands the network devices to deliver
the data when they are available. the data when they are available.
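The difference between the two subscription modes can be sketched in
a few lines. This is an illustrative model only: the Device class,
its method names, and the data item strings are assumptions made for
this example and do not correspond to any protocol defined in this
document.

```python
from typing import Callable, Dict, List, Set

Callback = Callable[[str, int], None]


class Device:
    """Hypothetical data source supporting both subscription modes."""

    def __init__(self, predefined: Set[str]) -> None:
        self.predefined = set(predefined)
        self.subscribers: Dict[str, List[Callback]] = {}

    def subscribe_predefined(self, item: str, callback: Callback) -> None:
        """Pub-Sub mode: only pre-defined data can be subscribed to."""
        if item not in self.predefined:
            raise KeyError(f"{item!r} is not published by this device")
        self.subscribers.setdefault(item, []).append(callback)

    def subscribe_designated(self, item: str, callback: Callback) -> None:
        """Sub-Pub mode: the subscriber designates the data of interest."""
        self.subscribers.setdefault(item, []).append(callback)

    def data_available(self, item: str, value: int) -> int:
        """Deliver a new value to every subscriber; return how many."""
        callbacks = self.subscribers.get(item, [])
        for callback in callbacks:
            callback(item, value)
        return len(callbacks)
```

In both modes the device pushes data when they become available; the
modes differ only in who decides which data items exist.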
skipping to change at page 11, line 47 skipping to change at page 12, line 41
Event-triggered Data: The data are conditionally acquired based on Event-triggered Data: The data are conditionally acquired based on
the occurrence of some event. An event can be modeled as a Finite the occurrence of some event. An event can be modeled as a Finite
State Machine (FSM). State Machine (FSM).
Streaming Data: The data are continuously or periodically generated. Streaming Data: The data are continuously or periodically generated.
It can be time series or the dump of databases. The streaming It can be time series or the dump of databases. The streaming
data reflect realtime network states and metrics and require large data reflect realtime network states and metrics and require large
bandwidth and processing power. bandwidth and processing power.
The above data types are not mutual exclusive. For example, event- The above data types are not mutually exclusive. For example, event-
triggered data can be simple or complex, and streaming data can be triggered data can be simple or complex, and streaming data can be
event triggered. The relationships of these data types are event triggered. The relationships of these data types are
illustrated in Figure 1. illustrated in Figure 1.
+--------------------------+ +--------------------------+
| +----------------------+ | | +----------------------+ |
| | +-----------------+ | | | | +-----------------+ | |
| | | +-------------+ | | | | | | +-------------+ | | |
| | | | Simple Data | | | | | | | | Simple Data | | | |
| | | +-------------+ | | | | | | +-------------+ | | |
| | | Complex Data | | | | | | Complex Data | | |
skipping to change at page 12, line 27 skipping to change at page 13, line 27
Figure 1: Data Type Relationship Figure 1: Data Type Relationship
Subscription usually deals with event-triggered data and streaming Subscription usually deals with event-triggered data and streaming
data, and query usually deals with simple data and complex data. It data, and query usually deals with simple data and complex data. It
is easy to see that conventional OAM techniques are mostly about is easy to see that conventional OAM techniques are mostly about
querying simple data only. While these techniques are still useful, querying simple data only. While these techniques are still useful,
advanced network telemetry techniques pay more attention to the other advanced network telemetry techniques pay more attention to the other
three data types, and prefer event/streaming data subscription and three data types, and prefer event/streaming data subscription and
complex data query over simple data query. complex data query over simple data query.
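As noted for event-triggered data, an event can be modeled as a
Finite State Machine. The sketch below shows one possible reading of
that idea: a two-state machine with hysteresis that reports only on
state transitions. The QueueDepthEvent class, the state names, and
the threshold parameters are hypothetical and chosen purely for
illustration.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    NORMAL = "normal"
    CONGESTED = "congested"


class QueueDepthEvent:
    """Hypothetical two-state FSM that reports only on transitions."""

    def __init__(self, high: int, low: int) -> None:
        self.high = high  # enter CONGESTED at or above this depth
        self.low = low    # return to NORMAL at or below this depth
        self.state = State.NORMAL

    def observe(self, depth: int) -> Optional[dict]:
        """Feed one measurement; return a report if the state changed."""
        if self.state is State.NORMAL and depth >= self.high:
            self.state = State.CONGESTED
        elif self.state is State.CONGESTED and depth <= self.low:
            self.state = State.NORMAL
        else:
            return None  # no transition, so nothing is exported
        return {"state": self.state.value, "depth": depth}
```

Compared with streaming every measurement, a subscriber to such an
event receives data only when the condition of interest changes.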
4.2. Data Objects 4.2. Data Object Modules
Telemetry can be applied on the forwarding plane, the control plane, Telemetry can be applied on the forwarding plane, the control plane,
and the management plane in a network, as well as other sources out and the management plane in a network, as well as other sources out
of the network, as shown in Figure 2. Therefore, we categorize the of the network, as shown in Figure 2. Therefore, we categorize the
network telemetry into four distinct modules. network telemetry into four distinct modules with each having its own
interface to Network Operation Applications.
+------------------------------+ +------------------------------+
| | | |
| Network Operation |<-------+ | Network Operation |<-------+
| Applications | | | Applications | |
| | | | | |
+------------------------------+ | +------------------------------+ |
^ ^ ^ | ^ ^ ^ |
| | | | | | | |
V | V V V | V V
skipping to change at page 13, line 28 skipping to change at page 14, line 28
| | | | | Event | | | | | | Event |
| ^ V | Management | | Telemetry | | ^ V | Management | | Telemetry |
+------|--------+ Plane | | | +------|--------+ Plane | | |
| V | Telemetry | +-----------+ | V | Telemetry | +-----------+
| Forwarding | | | Forwarding | |
| Plane <---> | | Plane <---> |
| Telemetry | | | Telemetry | |
| | | | | |
+---------------+--------------+ +---------------+--------------+
Figure 2: Layer Category of the Network Telemetry Framework Figure 2: Modules in Layer Category of NTF
The rationale of this partition lies in the different telemetry data The rationale of this partition lies in the different telemetry data
objects which result in different data source and export locations. objects which result in different data source and export locations.
Such differences have profound implications on in-network data Such differences have profound implications on in-network data
programming and processing capability, data encoding and transport programming and processing capability, data encoding and transport
protocol, and data bandwidth and latency. protocol, and data bandwidth and latency.
We summarize the major differences of the four modules in the We summarize the major differences of the four modules in the
following table. Some representative techniques are shown in some following table. They are mainly compared from six aspects: data
table blocks to highlight the technical diversity of these modules. object, data export location, data model, data encoding, telemetry
protocol, and transport method. Data object is the target and source
of each module. Because the data source varies, the data export
location varies. Because each data export location has different
capability, the proper data model, encoding, and transport method
cannot be kept the same. As a result, the suitable telemetry
protocol for each module can be different. Some representative
techniques are shown in some table blocks to highlight the technical
diversity of these modules. One cannot expect to use a universal
protocol to cover all the network telemetry requirements.
+---------+--------------+--------------+--------------+-----------+ +---------+--------------+--------------+--------------+-----------+
| Module | Control | Management | Forwarding | External | | Module | Control | Management | Forwarding | External |
| | Plane | Plane | Plane | Data | | | Plane | Plane | Plane | Data |
+---------+--------------+--------------+--------------+-----------+ +---------+--------------+--------------+--------------+-----------+
|Object | control | config. & | flow & packet| terminal, | |Object | control | config. & | flow & packet| terminal, |
| | protocol & | operation | QoS, traffic | social & | | | protocol & | operation | QoS, traffic | social & |
| | signaling, | state, MIB | stat., buffer| environ- | | | signaling, | state, MIB | stat., buffer| environ- |
| | RIB, ACL | | & queue stat.| mental | | | RIB, ACL | | & queue stat.| mental |
+---------+--------------+--------------+--------------+-----------+ +---------+--------------+--------------+--------------+-----------+
|Export | main control | main control | fwding chip | various | |Export | main control | main control | fwding chip | various |
|Location | CPU, | CPU | or linecard | | |Location | CPU, | CPU | or linecard | |
| | linecard CPU | | CPU; main | | | | linecard CPU | | CPU; main | |
| | or fwding | | control CPU | | | | or fwding | | control CPU | |
| | chip | | unlikely | | | | chip | | unlikely | |
+---------+--------------+--------------+--------------+-----------+ +---------+--------------+--------------+--------------+-----------+
|Model | YANG, | MIB, syslog, | template, | YANG | |Data | YANG, | MIB, syslog, | template, | YANG |
| | custom | YANG, | YANG, | | |Model | custom | YANG, | YANG, | |
| | | custom | custom | | | | | custom | custom | |
+---------+--------------+--------------+--------------+-----------+ +---------+--------------+--------------+--------------+-----------+
|Encoding | GPB, JSON, | GPB, JSON, | plain | GPB, JSON | |Data | GPB, JSON, | GPB, JSON, | plain | GPB, JSON |
| | XML, plain | XML | | XML, plain| |Encoding | XML, plain | XML | | XML, plain|
+---------+--------------+--------------+--------------+-----------+ +---------+--------------+--------------+--------------+-----------+
|Protocol | gRPC,NETCONF,| gRPC,NETCONF,| IPFIX, mirror| gRPC | |Protocol | gRPC,NETCONF,| gRPC,NETCONF,| IPFIX, mirror| gRPC |
| | IPFIX,mirror | | | | | | IPFIX,mirror | | | |
+---------+--------------+--------------+--------------+-----------+ +---------+--------------+--------------+--------------+-----------+
|Transport| HTTP, TCP, | HTTP, TCP | UDP | HTTP,TCP | |Transport| HTTP, TCP, | HTTP, TCP | UDP | HTTP,TCP |
| | UDP | | | UDP | | | UDP | | | UDP |
+---------+--------------+--------------+--------------+-----------+ +---------+--------------+--------------+--------------+-----------+
Figure 3: Layer Category of the Network Telemetry Framework Figure 3: Comparison of the Data Object Modules
Note that the interaction with the network operation applications can Note that the interaction with the network operation applications can
be indirect. For example, in the management plane telemetry, the be indirect. For example, in the management plane telemetry, the
management plane may need to acquire data from the data plane. Some management plane may need to acquire data from the data plane. Some
of the operational states can only be derived from the data plane of the operational states can only be derived from the data plane
such as the interface status and statistics. For another example, such as the interface status and statistics. For another example,
the control plane telemetry may need to access the FIB in data plane. the control plane telemetry may need to access the Forwarding
On the other hand, an application may involve more than one plane Information Base (FIB) in data plane. On the other hand, an
simultaneously. For example, an SLA compliance application may application may involve more than one plane simultaneously. For
require both the data plane telemetry and the control plane example, an SLA compliance application may require both the data
telemetry. plane telemetry and the control plane telemetry.
4.2.1. Requirements and Challenges for each Module
4.2.1.1. Management Plane Telemetry
The management plane of network elements interacts with the Network
Management System (NMS), and provides information such as performance
data, network logging data, network warning and defects data, and
network statistics and state data. Some legacy protocols, such as
SNMP and Syslog, are widely used for the management plane. However,
these protocols are insufficient to meet the requirements of the
future automated network operation applications.
New management plane telemetry protocols should consider the
following requirements:
Convenient Data Subscription: An application should have the freedom
to choose the data export means such as the data types and the
export frequency.
Structured Data: For automatic network operation, machines will
replace humans for network data comprehension. Schema languages
such as YANG can efficiently describe structured data and
normalize data encoding and transformation.
High Speed Data Transport: In order to retain the information, a
server needs to send a large amount of data at high frequency.
Compact encoding formats are needed to compress the data and
improve the data transport efficiency. The push mode, by
replacing the poll mode, can also reduce the interactions between
clients and servers, which helps to improve the server's
efficiency.
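The benefit of a compact encoding can be illustrated by encoding the
same records in two ways. The sketch below is an assumption-laden
example: the record layout (interface index, received bytes,
transmitted bytes) is invented for illustration, and the fixed-layout
binary format built with the Python struct module is only a stand-in
for a real compact encoding such as GPB.

```python
import json
import struct
from typing import List, Tuple

Sample = Tuple[int, int, int]  # (if_index, rx_bytes, tx_bytes), hypothetical

# Network byte order: one 32-bit index followed by two 64-bit counters.
RECORD = struct.Struct("!IQQ")


def encode_json(samples: List[Sample]) -> bytes:
    """Self-describing text encoding: field names repeat in every record."""
    records = [
        {"if_index": i, "rx_bytes": rx, "tx_bytes": tx} for i, rx, tx in samples
    ]
    return json.dumps(records).encode("utf-8")


def encode_binary(samples: List[Sample]) -> bytes:
    """Fixed-layout binary encoding: the schema is agreed out of band."""
    return b"".join(RECORD.pack(*sample) for sample in samples)


def decode_binary(payload: bytes) -> List[Sample]:
    """Recover the samples from the fixed-layout binary payload."""
    return [
        RECORD.unpack_from(payload, offset)
        for offset in range(0, len(payload), RECORD.size)
    ]
```

The binary form uses a constant 20 bytes per record because the
schema is shared in advance, which is the role a data model such as
YANG plays for structured telemetry data.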
4.2.1.2. Control Plane Telemetry
The control plane telemetry refers to the health condition monitoring
of different network protocols, which covers Layer 2 to Layer 7.
Keeping track of the running status of these protocols is beneficial
for detecting, localizing, and even predicting various network
issues, as well as network optimization, in real-time and in fine
granularity.
One of the most challenging problems for the control plane telemetry
is how to correlate the E2E Key Performance Indicators (KPI) to a
specific layer's KPIs. For example, an IPTV user may describe his
User Experience (UE) by the video fluency and definition. Then in
case of an unusually poor UE KPI or a service disconnection, it is
non-trivial work to delimit and localize the issue to the responsible
protocol layer (e.g., the Transport Layer or the Network Layer), the
responsible protocol (e.g., ISIS or BGP at the Network Layer), and
finally the responsible device(s) with specific reasons.
Traditional OAM-based approaches for control plane KPI measurement
include PING (L3), Tracert (L3), Y.1731 (L2) and so on. One common
issue behind these methods is that they only measure the KPIs instead
of reflecting the actual running status of these protocols, making
them less effective or efficient for control plane troubleshooting
and network optimization. An example of control plane telemetry is
the BGP Monitoring Protocol (BMP), which is currently used to
monitor BGP routes and enables rich applications, such as BGP peer
analysis, AS analysis, prefix analysis, security analysis, and so
on. However, the monitoring of other layers and protocols, as well
as cross-layer and cross-protocol KPI correlation, is still in its
infancy (e.g., IGP monitoring is missing) and requires substantial
further research.
4.2.1.3. Data Plane Telemetry
An effective data plane telemetry system relies on the data that the
network device can expose. The data's quality, quantity, and
timeliness must meet some stringent requirements. This raises some
challenges to the network data plane devices where the first hand
data originate.
o A data plane device's main function is user traffic processing and
forwarding. While supporting network visibility is important, the
telemetry is just an auxiliary function, and it should not impede
normal traffic processing and forwarding (i.e., the performance is
not lowered and the behavior is not altered due to the telemetry
functions).
o Network operation applications require end-to-end visibility
from various sources, which results in a huge volume of data.
However, the sheer data quantity should not stress the network
bandwidth, regardless of the data delivery approach (i.e., through
in-band or out-of-band channels).
o The data plane devices must provide timely data with the minimum
possible delay. Long processing, transport, storage, and analysis
delay can impact the effectiveness of the control loop and even
render the data useless.
o The data should be structured and labeled, and easy for
applications to parse and consume. At the same time, the data
types needed by applications can vary significantly. The data
plane devices need to provide enough flexibility and
programmability to support the precise data provision for
applications.
o The data plane telemetry should support incremental deployment and
work even if some devices are unaware of the system. This
challenge is highly relevant to the standards and legacy networks.
The industry has agreed that the data plane programmability is
essential to support network telemetry. Newer data plane chips are
all equipped with advanced telemetry features and provide flexibility
to support customized telemetry functions.
4.2.1.3.1. Technique Taxonomy
There can be multiple possible dimensions to classify the data plane
telemetry techniques.
Active and Passive: The active and passive methods (as well as the
hybrid types) are well documented in [RFC7799]. The passive
methods include TCPDUMP, IPFIX [RFC7011], sflow, and traffic
mirror. These methods usually have low data coverage, and
improving the coverage incurs a very high bandwidth cost.
On the other hand, the active methods include Ping, Traceroute,
OWAMP [RFC4656], and TWAMP [RFC5357]. These methods are intrusive
and only provide indirect network measurement results. The hybrid
methods, including in-situ OAM
[I-D.brockners-inband-oam-requirements], IPFPM [RFC8321], and
Multipoint Alternate Marking
[I-D.fioccola-ippm-multipoint-alt-mark], provide a well-balanced
and more flexible approach. However, these methods are also more
complex to implement.
In-Band and Out-of-Band: The telemetry data, before being exported
to some collector, can be carried in user packets. Such methods
are considered in-band (e.g., in-situ OAM
[I-D.brockners-inband-oam-requirements]). If the telemetry data
is directly exported to some collector without modifying the user
packets, such methods are considered out-of-band (e.g., postcard-
based INT). It is possible to have hybrid methods. For example,
only the telemetry instruction or partial data is carried by user
packets (e.g., IPFPM [RFC8321]).
E2E and In-Network: Some E2E methods start from and end at the
network end hosts (e.g., Ping). The other methods work in
networks and are transparent to end hosts. However, if needed,
the in-network methods can be easily extended into end hosts.
Flow, Path, and Node: Depending on the telemetry objective, the
methods can be flow-based (e.g., in-situ OAM
[I-D.brockners-inband-oam-requirements]), path-based (e.g.,
Traceroute), or node-based (e.g., IPFIX [RFC7011]).
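The four classification dimensions above can be captured in a small
lookup structure. The sketch below records only the classifications
that this section states explicitly and leaves every other dimension
unset. The Technique class, its field names, and the select() helper
are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Technique:
    name: str
    method: Optional[str] = None   # "active", "passive", or "hybrid"
    channel: Optional[str] = None  # "in-band", "out-of-band", or "hybrid"
    scope: Optional[str] = None    # "e2e" or "in-network"
    target: Optional[str] = None   # "flow", "path", or "node"


# Only values stated in the taxonomy text are filled in; None means
# the text does not classify the technique along that dimension.
CATALOG = [
    Technique("Ping", method="active", scope="e2e"),
    Technique("Traceroute", method="active", target="path"),
    Technique("OWAMP", method="active"),
    Technique("TWAMP", method="active"),
    Technique("TCPDUMP", method="passive"),
    Technique("sflow", method="passive"),
    Technique("IPFIX", method="passive", target="node"),
    Technique("in-situ OAM", method="hybrid", channel="in-band", target="flow"),
    Technique("IPFPM", method="hybrid", channel="hybrid"),
    Technique("postcard-based INT", channel="out-of-band"),
]


def select(**criteria: str) -> List[str]:
    """Return the names of techniques matching every given dimension."""
    return [
        t.name
        for t in CATALOG
        if all(getattr(t, key) == value for key, value in criteria.items())
    ]
```

For instance, select(method="hybrid") lists the hybrid methods named
in the text, and select(target="path") returns the path-based ones.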
4.2.1.4. External Data Telemetry
Events that occur outside the boundaries of the network system are
another important source of telemetry information. Correlating both
internal telemetry data and external events with the requirements of
network systems, as presented in Exploiting External Event Detectors
to Anticipate Resource Requirements for the Elastic Adaptation of
SDN/NFV Systems [I-D.pedro-nmrg-anticipated-adaptation], provides a
strategic and functional advantage to management operations.
As with other sources of telemetry information, the data and events
must meet strict requirements, especially in terms of timeliness,
which is essential to properly incorporate external event information
into management cycles. Thus, the specific challenges are described as
follows:
o The role of external event detector can be played by multiple
elements, including hardware (e.g. physical sensors, such as
seismometers) and software (e.g. Big Data sources that analyze
streams of information, such as Twitter messages). Thus, the
transmitted data must support different shapes but, at the same
time, follow a common but extensible ontology.
o Since the main function of the external event detectors is to
perform the notifications, their timeliness is assumed. However,
once messages have been dispatched, they must be quickly collected
and inserted into the control plane with variable priority, which
will be high for important sources and/or important events and low
for secondary ones.
o The ontology used by external detectors must be easily adopted by
current and future devices and applications. Therefore, it must
be easily mapped to current information models, such as in terms
of YANG.
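The requirement that dispatched messages be collected and inserted
with variable priority can be sketched with a priority queue. This is
a minimal illustration under stated assumptions: the EventQueue
class, the numeric priority scale (lower value means more important),
and the source names are hypothetical and are not part of any
specification referenced here.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class EventQueue:
    """Hypothetical collector that orders external events by priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, Any]] = []
        # Monotonic counter keeps arrival order among equal priorities.
        self._arrival = itertools.count()

    def push(self, priority: int, source: str, payload: Any) -> None:
        """Queue an event; a lower priority value is served first."""
        heapq.heappush(self._heap, (priority, next(self._arrival), source, payload))

    def pop(self) -> Tuple[str, Any]:
        """Remove and return the most important pending event."""
        _priority, _arrival, source, payload = heapq.heappop(self._heap)
        return source, payload

    def __len__(self) -> int:
        return len(self._heap)
```

With such a queue, a notification from an important source (for
example a physical sensor) overtakes earlier messages from secondary
sources, while messages of equal priority keep their arrival order.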
Organizing together both internal and external telemetry information
will be key for the general exploitation of the management
possibilities of current and future network systems, as reflected in
the incorporation of cognitive capabilities to new hardware and
software (virtual) elements.
4.3. Function Components 4.3. Function Components
At each plane, the telemetry can be further partitioned into five At each plane, the telemetry can be further partitioned into five
distinct components: distinct components:
Data Query, Analysis, and Storage: This component works at the Data Query, Analysis, and Storage: This component works at the
application layer. On the one hand, it is responsible for issuing application layer. On the one hand, it is responsible for issuing
data queries. The queries can be for modeled data through data queries. The queries can be for modeled data through
configuration or custom data through programming. The queries can configuration or custom data through programming. The queries can
be one shot or subscriptions for events or streaming data. On the be one shot or subscriptions for events or streaming data. On the
other hand, it receives, stores, and processes the returned data other hand, it receives, stores, and processes the returned data
from network devices. Data analysis can be interactive to from network devices. Data analysis can be interactive to
initiate further data queries. initiate further data queries. Note that this component can
reside in either network devices or remote controllers.
Data Configuration and Subscription: This component deploys data Data Configuration and Subscription: This component deploys data
queries on devices. It determines the protocol and channel for queries on devices. It determines the protocol and channel for
applications to acquire desired data. This component is also applications to acquire desired data. This component is also
responsible for configuring the desired data that might not be responsible for configuring the desired data that might not be
directly available from data sources. The subscription data can directly available from data sources. The subscription data can
be described by models, templates, or programs. be described by models, templates, or programs.
Data Encoding and Export: This component determines how telemetry Data Encoding and Export: This component determines how telemetry
data are delivered to the data analysis and storage component. data are delivered to the data analysis and storage component.
skipping to change at page 16, line 31 skipping to change at page 21, line 31
| & Processing | | & Processing |
| | | |
+----------------------------------------+ +----------------------------------------+
| | | |
| Data Object and Source | | Data Object and Source |
| | | |
+----------------------------------------+ +----------------------------------------+
Figure 4: Components in the Network Telemetry Framework Figure 4: Components in the Network Telemetry Framework
Since most existing standard-related work belongs to the first four
components, in the remainder of the document, we focus on these
components only.
4.4. Existing Works Mapped in the Framework 4.4. Existing Works Mapped in the Framework
The following two tables provide a non-exhaustive list of existing The following two tables provide a non-exhaustive list of existing
works (mainly published in IETF and with the emphasis on the latest works (mainly published in IETF and with the emphasis on the latest
new technologies) and show their positions in the framework. The new technologies) and show their positions in the framework. The
details about the mentioned work can be found in Appendix A. details about the mentioned work can be found in Appendix A.
+-----------------+---------------+----------------+ +-----------------+---------------+----------------+
| | Query | Subscription | | | Query | Subscription |
| | | | | | | |
+-----------------+---------------+----------------+ +-----------------+---------------+----------------+
| Simple Data | SNMP, NETCONF,| | | Simple Data | SNMP, NETCONF,| |
| | YANG, BMP, | | | | YANG, BMP, | |
| | IOAM, PBT,gRPC| | | | IOAM, PBT,gRPC| |
+-----------------+---------------+----------------+ +-----------------+---------------+----------------+
| Custom Data | DNP, YANG FSM | | | Complex Data | DNP, YANG FSM | |
| | gRPC, NETCONF | | | | gRPC, NETCONF | |
+-----------------+---------------+----------------+ +-----------------+---------------+----------------+
| Event-triggered | | gRPC, NETCONF, | | Event-triggered | | gRPC, NETCONF, |
| Data | | YANG PUSH, DNP | | Data | | YANG PUSH, DNP |
| | | IOAM, PBT, | | | | IOAM, PBT, |
| | | YANG FSM | | | | YANG FSM |
+-----------------+---------------+----------------+ +-----------------+---------------+----------------+
| Streaming Data | | gRPC, NETCONF, | | Streaming Data | | gRPC, NETCONF, |
| | | IOAM, PBT, DNP | | | | IOAM, PBT, DNP |
| | | IPFIX, IPFPM | | | | IPFIX, IPFPM |
+-----------------+---------------+----------------+ +-----------------+---------------+----------------+
Figure 5: Existing Work Mapping I Figure 5: Existing Work Mapping I
+--------------+---------------+----------------+---------------+ +--------------+---------------+----------------+---------------+
| | Management | Control | Forwarding | | | Management | Control | Forwarding |
| | Plane | Plane | Plane | | | Plane | Plane | Plane |
+--------------+---------------+----------------+---------------+ +--------------+---------------+----------------+---------------+
| data Config. | gRPC, NETCONF,| NETCONF/YANG | NETCONF/YANG, | | data Config. | gRPC, NETCONF,| NETCONF/YANG | NETCONF/YANG, |
| & subscrib. | YANG PUSH | | YANG FSM | | & subscrib. | YANG PUSH | | YANG FSM |
+--------------+---------------+----------------+---------------+ +--------------+---------------+----------------+---------------+
| data gen. & | DNP, | DNP, | In-situ OAM, | | data gen. & | DNP, | DNP, | IOAM, |
| processing | YANG | YANG | PBT, IPFPM, | | processing | YANG | YANG | PBT, IPFPM, |
| | | | DNP | | | | | DNP |
+--------------+---------------+----------------+---------------+ +--------------+---------------+----------------+---------------+
| data | gRPC, NETCONF | BMP, NETCONF | IPFIX | | data | gRPC, NETCONF | BMP, NETCONF | IPFIX |
| export | YANG PUSH | | | | export | YANG PUSH | | |
+--------------+---------------+----------------+---------------+ +--------------+---------------+----------------+---------------+
Figure 6: Existing Work Mapping II Figure 6: Existing Work Mapping II
5. Evolution of Network Telemetry 5. Evolution of Network Telemetry
As the network is evolving towards the automated operation, network As the network is evolving towards the automated operation, network
telemetry also undergoes several levels of evolution. telemetry also undergoes several levels of evolution.
Level 0 - Static Telemetry: The telemetry data is determined at Level 0 - Static Telemetry: The telemetry data source and type are
design time. The network operator can only configure how to use determined at design time. The network operator can only
it with limited flexibility. configure how to use it with limited flexibility.
Level 1 - Dynamic Telemetry: The telemetry data can be dynamically Level 1 - Dynamic Telemetry: The telemetry data can be dynamically
programmed or configured at runtime, allowing a tradeoff among programmed or configured at runtime, allowing a tradeoff among
resource, performance, flexibility, and coverage. DNP is an resource, performance, flexibility, and coverage. DNP is an
effort towards this direction. effort towards this direction.
Level 2 - Interactive Telemetry: The network operator can Level 2 - Interactive Telemetry: The network operator can
continuously customize the telemetry data in real time to reflect continuously customize the telemetry data in real time to reflect
the network operation's visibility requirements. At this level, the network operation's visibility requirements. At this level,
some tasks can be automated, although ultimately human operators some tasks can be automated, although ultimately human operators
skipping to change at page 19, line 30 skipping to change at page 24, line 30
Further discussion and development of this section will be required, Further discussion and development of this section will be required,
and it is expected that this security section, and subsequent policy and it is expected that this security section, and subsequent policy
section will be developed further. section will be developed further.
7. IANA Considerations 7. IANA Considerations
This document includes no request to IANA. This document includes no request to IANA.
8. Contributors 8. Contributors
The other major contributors of this document are listed as follows. The other contributors of this document are listed as follows.
o Tianran Zhou o Tianran Zhou
o Zhenbin Li o Zhenbin Li
o Daniel King o Daniel King
9. Acknowledgments o Adrian Farrel
We would like to thank Adrian Farrel, Randy Presuhn, Joe Clarke,
Victor Liu, James Guichard, Uri Blumenthal, Giuseppe Fioccola, Yunan
Gu, Parviz Yegani, Young Lee, Alexander Clemm, Qin Wu, and many
others who have provided helpful comments and suggestions to improve
this document.
10. References 9. Acknowledgments
10.1. Normative References We would like to thank Randy Presuhn, Joe Clarke, Victor Liu, James
Guichard, Uri Blumenthal, Giuseppe Fioccola, Yunan Gu, Parviz Yegani,
Young Lee, Alexander Clemm, Qin Wu, and many others who have provided
helpful comments and suggestions to improve this document.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 10. Informative References
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC [gnmi] "gNMI - gRPC Network Management Interface",
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://github.com/openconfig/reference/tree/master/rpc/
May 2017, <https://www.rfc-editor.org/info/rfc8174>. gnmi>.
10.2. Informative References [grpc] "gRPC, A high performance, open-source universal RPC
framework", <https://grpc.io>.
[I-D.brockners-inband-oam-requirements] [I-D.brockners-inband-oam-requirements]
Brockners, F., Bhandari, S., Dara, S., Pignataro, C., Brockners, F., Bhandari, S., Dara, S., Pignataro, C.,
Gredler, H., Leddy, J., Youell, S., Mozes, D., Mizrahi, Gredler, H., Leddy, J., Youell, S., Mozes, D., Mizrahi,
T., Lapukhov, P., and r. remy@barefootnetworks.com, T., Lapukhov, P., and r. remy@barefootnetworks.com,
"Requirements for In-situ OAM", draft-brockners-inband- "Requirements for In-situ OAM", draft-brockners-inband-
oam-requirements-03 (work in progress), March 2017. oam-requirements-03 (work in progress), March 2017.
[I-D.fioccola-ippm-multipoint-alt-mark] [I-D.fioccola-ippm-multipoint-alt-mark]
Fioccola, G., Cociglio, M., Sapio, A., and R. Sisto, Fioccola, G., Cociglio, M., Sapio, A., and R. Sisto,
"Multipoint Alternate Marking method for passive and "Multipoint Alternate Marking method for passive and
hybrid performance monitoring", draft-fioccola-ippm- hybrid performance monitoring", draft-fioccola-ippm-
multipoint-alt-mark-04 (work in progress), June 2018. multipoint-alt-mark-04 (work in progress), June 2018.
[I-D.ietf-grow-bmp-adj-rib-out] [I-D.ietf-grow-bmp-adj-rib-out]
Evens, T., Bayraktar, S., Lucente, P., Mi, K., and S. Evens, T., Bayraktar, S., Lucente, P., Mi, K., and S.
Zhuang, "Support for Adj-RIB-Out in BGP Monitoring Zhuang, "Support for Adj-RIB-Out in BGP Monitoring
Protocol (BMP)", draft-ietf-grow-bmp-adj-rib-out-05 (work Protocol (BMP)", draft-ietf-grow-bmp-adj-rib-out-07 (work
in progress), June 2019. in progress), August 2019.
[I-D.ietf-grow-bmp-local-rib] [I-D.ietf-grow-bmp-local-rib]
Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente, Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente,
"Support for Local RIB in BGP Monitoring Protocol (BMP)", "Support for Local RIB in BGP Monitoring Protocol (BMP)",
draft-ietf-grow-bmp-local-rib-04 (work in progress), June draft-ietf-grow-bmp-local-rib-05 (work in progress),
2019. August 2019.
[I-D.ietf-netconf-udp-pub-channel] [I-D.ietf-netconf-udp-pub-channel]
Zheng, G., Zhou, T., and A. Clemm, "UDP based Publication Zheng, G., Zhou, T., and A. Clemm, "UDP based Publication
Channel for Streaming Telemetry", draft-ietf-netconf-udp- Channel for Streaming Telemetry", draft-ietf-netconf-udp-
pub-channel-05 (work in progress), March 2019. pub-channel-05 (work in progress), March 2019.
[I-D.ietf-netconf-yang-push] [I-D.ietf-netconf-yang-push]
Clemm, A. and E. Voit, "Subscription to YANG Datastores", Clemm, A. and E. Voit, "Subscription to YANG Datastores",
draft-ietf-netconf-yang-push-25 (work in progress), May draft-ietf-netconf-yang-push-25 (work in progress), May
2019. 2019.
skipping to change at page 21, line 23 skipping to change at page 26, line 12
(gNMI)", draft-openconfig-rtgwg-gnmi-spec-01 (work in (gNMI)", draft-openconfig-rtgwg-gnmi-spec-01 (work in
progress), March 2018. progress), March 2018.
[I-D.pedro-nmrg-anticipated-adaptation] [I-D.pedro-nmrg-anticipated-adaptation]
Martinez-Julia, P., "Exploiting External Event Detectors Martinez-Julia, P., "Exploiting External Event Detectors
to Anticipate Resource Requirements for the Elastic to Anticipate Resource Requirements for the Elastic
Adaptation of SDN/NFV Systems", draft-pedro-nmrg- Adaptation of SDN/NFV Systems", draft-pedro-nmrg-
anticipated-adaptation-02 (work in progress), June 2018. anticipated-adaptation-02 (work in progress), June 2018.
[I-D.song-ippm-postcard-based-telemetry] [I-D.song-ippm-postcard-based-telemetry]
Song, H., Zhou, T., Li, Z., and J. Shin, "Postcard-based Song, H., Zhou, T., Li, Z., Shin, J., and K. Lee,
On-Path Flow Data Telemetry", draft-song-ippm-postcard- "Postcard-based On-Path Flow Data Telemetry", draft-song-
based-telemetry-03 (work in progress), April 2019. ippm-postcard-based-telemetry-05 (work in progress),
September 2019.
[I-D.song-opsawg-dnp4iq] [I-D.song-opsawg-dnp4iq]
Song, H. and J. Gong, "Requirements for Interactive Query Song, H. and J. Gong, "Requirements for Interactive Query
with Dynamic Network Probes", draft-song-opsawg-dnp4iq-01 with Dynamic Network Probes", draft-song-opsawg-dnp4iq-01
(work in progress), June 2017. (work in progress), June 2017.
[I-D.zhou-netconf-multi-stream-originators] [I-D.zhou-netconf-multi-stream-originators]
Zhou, T., Zheng, G., Voit, E., Clemm, A., and A. Bierman, Zhou, T., Zheng, G., Voit, E., Clemm, A., and A. Bierman,
"Subscription to Multiple Stream Originators", draft-zhou- "Subscription to Multiple Stream Originators", draft-zhou-
netconf-multi-stream-originators-04 (work in progress), netconf-multi-stream-originators-06 (work in progress),
March 2019. July 2019.
[RFC1157] Case, J., Fedor, M., Schoffstall, M., and J. Davin, [RFC1157] Case, J., Fedor, M., Schoffstall, M., and J. Davin,
"Simple Network Management Protocol (SNMP)", RFC 1157, "Simple Network Management Protocol (SNMP)", RFC 1157,
DOI 10.17487/RFC1157, May 1990, DOI 10.17487/RFC1157, May 1990,
<https://www.rfc-editor.org/info/rfc1157>. <https://www.rfc-editor.org/info/rfc1157>.
[RFC2981] Kavasseri, R., Ed., "Event MIB", RFC 2981, [RFC2981] Kavasseri, R., Ed., "Event MIB", RFC 2981,
DOI 10.17487/RFC2981, October 2000, DOI 10.17487/RFC2981, October 2000,
<https://www.rfc-editor.org/info/rfc2981>. <https://www.rfc-editor.org/info/rfc2981>.
skipping to change at page 22, line 19 skipping to change at page 27, line 10
[RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. [RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
Zekauskas, "A One-way Active Measurement Protocol Zekauskas, "A One-way Active Measurement Protocol
(OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006, (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006,
<https://www.rfc-editor.org/info/rfc4656>. <https://www.rfc-editor.org/info/rfc4656>.
[RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
RFC 5357, DOI 10.17487/RFC5357, October 2008, RFC 5357, DOI 10.17487/RFC5357, October 2008,
<https://www.rfc-editor.org/info/rfc5357>. <https://www.rfc-editor.org/info/rfc5357>.
[RFC6020] Bjorklund, M., Ed., "YANG - A Data Modeling Language for
the Network Configuration Protocol (NETCONF)", RFC 6020,
DOI 10.17487/RFC6020, October 2010,
<https://www.rfc-editor.org/info/rfc6020>.
[RFC6241] Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed., [RFC6241] Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed.,
and A. Bierman, Ed., "Network Configuration Protocol and A. Bierman, Ed., "Network Configuration Protocol
(NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011, (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011,
<https://www.rfc-editor.org/info/rfc6241>. <https://www.rfc-editor.org/info/rfc6241>.
[RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
"Specification of the IP Flow Information Export (IPFIX) "Specification of the IP Flow Information Export (IPFIX)
Protocol for the Exchange of Flow Information", STD 77, Protocol for the Exchange of Flow Information", STD 77,
RFC 7011, DOI 10.17487/RFC7011, September 2013, RFC 7011, DOI 10.17487/RFC7011, September 2013,
<https://www.rfc-editor.org/info/rfc7011>. <https://www.rfc-editor.org/info/rfc7011>.
skipping to change at page 23, line 18 skipping to change at page 28, line 13
<https://www.rfc-editor.org/info/rfc7854>. <https://www.rfc-editor.org/info/rfc7854>.
[RFC8321] Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli, [RFC8321] Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli,
L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi, L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi,
"Alternate-Marking Method for Passive and Hybrid "Alternate-Marking Method for Passive and Hybrid
Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321, Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321,
January 2018, <https://www.rfc-editor.org/info/rfc8321>. January 2018, <https://www.rfc-editor.org/info/rfc8321>.
Appendix A. A Survey on Existing Network Telemetry Techniques Appendix A. A Survey on Existing Network Telemetry Techniques
We provide an overview of the challenges and existing solutions for In this non-normative appendix, we provide an overview of some
each network telemetry module. existing techniques and standard proposals for each network telemetry
module.
A.1. Management Plane Telemetry A.1. Management Plane Telemetry
A.1.1. Requirements and Challenges A.1.1. Push Extensions for NETCONF
The management plane of the network element interacts with the
Network Management System (NMS), and provides information such as
performance data, network logging data, network warning and defects
data, and network statistics and state data. Some legacy protocols
are widely used for the management plane, such as SNMP and Syslog.
However, these protocols are insufficient to meet the requirements of
the automatic network operation applications.
New management plane telemetry protocols should consider the
following requirements:
Convenient Data Subscription: An application should have the freedom
to choose the data export means such as the data types and the
export frequency.
Structured Data: For automatic network operation, machines will
replace humans for network data comprehension. The schema
languages such as YANG can efficiently describe structured data
and normalize data encoding and transformation.
High Speed Data Transport: In order to retain the information, a
server needs to send a large amount of data at high frequency.
Compact encoding formats are needed to compress the data and
improve the data transport efficiency. The push mode, by
replacing the poll mode, can also reduce the interactions between
clients and servers, which helps to improve the server's
efficiency.
A.1.2. Push Extensions for NETCONF
NETCONF [RFC6241] is one popular network management protocol, which NETCONF [RFC6241] is one popular network management protocol, which
is also recommended by IETF. Although it can be used for data is also recommended by IETF. Although it can be used for data
collection, NETCONF is good at configurations. YANG Push collection, NETCONF is good at configurations. YANG Push
[I-D.ietf-netconf-yang-push] extends NETCONF and enables subscriber [I-D.ietf-netconf-yang-push] extends NETCONF and enables subscriber
applications to request a continuous, customized stream of updates applications to request a continuous, customized stream of updates
from a YANG datastore. Providing such visibility into changes made from a YANG datastore. Providing such visibility into changes made
upon YANG configuration and operational objects enables new upon YANG configuration and operational objects enables new
capabilities based on the remote mirroring of configuration and capabilities based on the remote mirroring of configuration and
operational state. Moreover, distributed data collection mechanism operational state. Moreover, distributed data collection mechanism
[I-D.zhou-netconf-multi-stream-originators] via UDP based publication [I-D.zhou-netconf-multi-stream-originators] via UDP based publication
channel [I-D.ietf-netconf-udp-pub-channel] provides enhanced channel [I-D.ietf-netconf-udp-pub-channel] provides enhanced
efficiency for the NETCONF based telemetry. efficiency for the NETCONF based telemetry.
A.1.3. gRPC Network Management Interface A.1.2. gRPC Network Management Interface
gRPC Network Management Interface (gNMI) gRPC Network Management Interface (gNMI)
[I-D.openconfig-rtgwg-gnmi-spec] is a network management protocol [I-D.openconfig-rtgwg-gnmi-spec] is a network management protocol
based on the gRPC [I-D.kumar-rtgwg-grpc-protocol] RPC (Remote based on the gRPC [I-D.kumar-rtgwg-grpc-protocol] RPC (Remote
Procedure Call) framework. With a single gRPC service definition, Procedure Call) framework. With a single gRPC service definition,
both configuration and telemetry can be covered. gRPC is an HTTP/2 both configuration and telemetry can be covered. gRPC is an HTTP/2
[RFC7540] based open source micro service communication framework. [RFC7540] based open source micro service communication framework.
It provides a number of capabilities which are well-suited for It provides a number of capabilities which are well-suited for
network telemetry, including: network telemetry, including:
skipping to change at page 24, line 43 skipping to change at page 29, line 9
o gRPC provides higher-level features consistency across platforms o gRPC provides higher-level features consistency across platforms
that common HTTP/2 libraries typically do not. This that common HTTP/2 libraries typically do not. This
characteristic is especially valuable for the fact that telemetry characteristic is especially valuable for the fact that telemetry
data collectors normally reside on a large variety of platforms. data collectors normally reside on a large variety of platforms.
o The built-in load-balancing and failover mechanism. o The built-in load-balancing and failover mechanism.
A.2. Control Plane Telemetry A.2. Control Plane Telemetry
A.2.1. Requirements and Challenges A.2.1. BGP Monitoring Protocol
The control plane telemetry refers to the health condition monitoring
of different network protocols, which covers Layer 2 to Layer 7.
Keeping track of the running status of these protocols is beneficial
for detecting, localizing, and even predicting various network
issues, as well as network optimization, in real-time and in fine
granularity.
One of the most challenging problems for the control plane telemetry
is how to correlate the E2E Key Performance Indicators (KPI) to a
specific layer's KPIs. For example, an IPTV user may describe his
User Experience (UE) by the video fluency and definition. Then in
case of an unusually poor UE KPI or a service disconnection, it is
non-trivial work to delimit and localize the issue to the responsible
protocol layer (e.g., the Transport Layer or the Network Layer), the
responsible protocol (e.g., ISIS or BGP at the Network Layer), and
finally the responsible device(s) with specific reasons.
Traditional OAM-based approaches for control plane KPI measurement
include PING (L3), Tracert (L3), Y.1731 (L2) and so on. One common
issue behind these methods is that they only measure the KPIs instead
of reflecting the actual running status of these protocols, making
them less effective or efficient for control plane troubleshooting
and network optimization. An example of the control plane telemetry
is the BGP monitoring protocol (BMP); it is currently used to
monitor the BGP routes and enables rich applications, such as BGP
peer analysis, AS analysis, prefix analysis, security analysis, and
so on. However, the monitoring of other layers, protocols and the
cross-layer, cross-protocol KPI correlations are still in their
infancy (e.g., the IGP monitoring is missing), which require
substantial further research.
A.2.2. BGP Monitoring Protocol
BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP
sessions and intended to provide a convenient interface for obtaining sessions and intended to provide a convenient interface for obtaining
route views. route views.
The BGP routing information is collected from the monitored device(s) The BGP routing information is collected from the monitored device(s)
to the BMP monitoring station by setting up the BMP TCP session. The to the BMP monitoring station by setting up the BMP TCP session. The
BGP peers are monitored by the BMP Peer Up and Peer Down BGP peers are monitored by the BMP Peer Up and Peer Down
Notifications. The BGP routes (including Adjacency_RIB_In [RFC7854], Notifications. The BGP routes (including Adjacency_RIB_In [RFC7854],
Adjacency_RIB_out [I-D.ietf-grow-bmp-adj-rib-out], and Local_Rib Adjacency_RIB_out [I-D.ietf-grow-bmp-adj-rib-out], and Local_Rib
[I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route [I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route
Monitoring Message and the BMP Route Mirroring Message, in the form Monitoring Message and the BMP Route Mirroring Message, in the form
of both initial table dump and real-time route update. In addition, of both initial table dump and real-time route update. In addition,
BGP statistics are reported through the BMP Stats Report Message, BGP statistics are reported through the BMP Stats Report Message,
which could be either timer triggered or event-driven. More BMP which could be either timer triggered or event-driven. More BMP
extensions can be explored to enrich the applications of BGP extensions can be explored to enrich the applications of BGP
monitoring. monitoring.
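As an illustration of how a monitoring station might distinguish the BMP message kinds named above, the following Python sketch decodes the 6-octet BMP common header (version, message length, message type) defined in [RFC7854]. The function name and the returned dictionary layout are assumptions made for this example only; they are not part of BMP or of any particular implementation.

```python
import struct

# Message type codes from the BMP common header (RFC 7854).
BMP_MSG_TYPES = {
    0: "Route Monitoring",
    1: "Statistics Report",
    2: "Peer Down Notification",
    3: "Peer Up Notification",
    4: "Initiation",
    5: "Termination",
    6: "Route Mirroring",
}


def parse_bmp_common_header(data: bytes) -> dict:
    """Decode the fixed 6-octet BMP common header (hypothetical helper)."""
    if len(data) < 6:
        raise ValueError("BMP common header needs at least 6 octets")
    # Network byte order: 1-octet version, 4-octet length, 1-octet type.
    version, length, msg_type = struct.unpack("!BIB", data[:6])
    return {
        "version": version,
        "length": length,
        "type": BMP_MSG_TYPES.get(msg_type, "Unknown"),
    }


# Example: a header announcing a 6-octet Initiation message.
example = struct.pack("!BIB", 3, 6, 4)
header = parse_bmp_common_header(example)
```

A real station would read the remaining `length - 6` octets from the TCP session and dispatch on the message type; that part is omitted here.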
A.3. Data Plane Telemetry A.3. Data Plane Telemetry
A.3.1. Requirements and Challenges
An effective data plane telemetry system relies on the data that the
network device can expose. The data's quality, quantity, and
timeliness must meet some stringent requirements. This raises some
challenges to the network data plane devices where the first hand
data originate.
o A data plane device's main function is user traffic processing and
forwarding. While supporting network visibility is important, the
telemetry is just an auxiliary function, and it should not impede
normal traffic processing and forwarding (i.e., the performance is
not lowered and the behavior is not altered due to the telemetry
functions).
o The network operation applications require end-to-end visibility
from various sources, which results in a huge volume of data.
However, the sheer data quantity should not stress the network
bandwidth, regardless of the data delivery approach (i.e., through
in-band or out-of-band channels).
o The data plane devices must provide timely data with the minimum
possible delay. Long processing, transport, storage, and analysis
delay can impact the effectiveness of the control loop and even
render the data useless.
o The data should be structured and labeled, and easy for
applications to parse and consume. At the same time, the data
types needed by applications can vary significantly. The data
plane devices need to provide enough flexibility and
programmability to support the precise data provision for
applications.
o The data plane telemetry should support incremental deployment and
work even though some devices are unaware of the system. This
challenge is highly relevant to the standards and legacy networks.
The industry has agreed that the data plane programmability is
essential to support network telemetry. Newer data plane chips are
all equipped with advanced telemetry features and provide flexibility
to support customized telemetry functions.
A.3.2. Technique Taxonomy
There can be multiple possible dimensions to classify the data plane
telemetry techniques.
Active and Passive: The active and passive methods (as well as the
hybrid types) are well documented in [RFC7799]. The passive
methods include TCPDUMP, IPFIX [RFC7011], sflow, and traffic
mirror. These methods usually have low data coverage. The
bandwidth cost is very high in order to improve the data coverage.
On the other hand, the active methods include Ping, Traceroute,
OWAMP [RFC4656], and TWAMP [RFC5357]. These methods are intrusive
and only provide indirect network measurement results. The hybrid
methods, including in-situ OAM
[I-D.brockners-inband-oam-requirements], IPFPM [RFC8321], and
Multipoint Alternate Marking
[I-D.fioccola-ippm-multipoint-alt-mark], provide a well-balanced
and more flexible approach. However, these methods are also more
complex to implement.
In-Band and Out-of-Band: The telemetry data, before being exported A.3.1. The IPFPM technology
to some collector, can be carried in user packets. Such methods
are considered in-band (e.g., in-situ OAM
[I-D.brockners-inband-oam-requirements]). If the telemetry data
is directly exported to some collector without modifying the user
packets, such methods are considered out-of-band (e.g., postcard-
based INT). It is possible to have hybrid methods. For example,
only the telemetry instruction or partial data is carried by user
packets (e.g., IPFPM [RFC8321]).
E2E and In-Network: Some E2E methods start from and end at the
network end hosts (e.g., Ping). The other methods work in
networks and are transparent to end hosts. However, if needed,
the in-network methods can be easily extended into end hosts.
Flow, Path, and Node: Depending on the telemetry objective, the
methods can be flow-based (e.g., in-situ OAM
[I-D.brockners-inband-oam-requirements]), path-based (e.g.,
Traceroute), and node-based (e.g., IPFIX [RFC7011]).
A.3.3. The IPFPM technology
The Alternate Marking method is efficient to perform packet loss, The Alternate Marking method is efficient to perform packet loss,
delay, and jitter measurements both in an IP and Overlay Networks, as delay, and jitter measurements both in an IP and Overlay Networks, as
presented in IPFPM [RFC8321] and presented in IPFPM [RFC8321] and
[I-D.fioccola-ippm-multipoint-alt-mark]. [I-D.fioccola-ippm-multipoint-alt-mark].
This technique can be applied to point-to-point and multipoint-to- This technique can be applied to point-to-point and multipoint-to-
multipoint flows. Alternate Marking creates batches of packets by multipoint flows. Alternate Marking creates batches of packets by
alternating the value of 1 bit (or a label) of the packet header. alternating the value of 1 bit (or a label) of the packet header.
These batches of packets are unambiguously recognized over the These batches of packets are unambiguously recognized over the
skipping to change at page 29, line 5 skipping to change at page 30, line 39
In summary, an application can configure end-to-end network In summary, an application can configure end-to-end network
monitoring. If the network does not experience issues, this monitoring. If the network does not experience issues, this
approximate monitoring is good enough and is very cheap in terms of approximate monitoring is good enough and is very cheap in terms of
network resources. However, in case of problems, the application network resources. However, in case of problems, the application
becomes aware of the issues from this approximate monitoring and, in becomes aware of the issues from this approximate monitoring and, in
order to localize the portion of the network that has issues, order to localize the portion of the network that has issues,
configures the measurement points more exhaustively. So a new configures the measurement points more exhaustively. So a new
detailed monitoring is performed. After the detection and resolution detailed monitoring is performed. After the detection and resolution
of the problem the initial approximate monitoring can be used again. of the problem the initial approximate monitoring can be used again.
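A minimal sketch of the packet loss computation that Alternate Marking enables is given below in Python. It assumes that two measurement points each keep one packet counter per marking batch; the batch indexing, the function name, and the sample counter values are hypothetical and serve only to illustrate the idea in [RFC8321].

```python
def packet_loss_per_batch(upstream: dict, downstream: dict) -> dict:
    """Return packets lost in each batch between two measurement points.

    Both arguments map a batch index to the number of packets counted
    for that batch (assumed counter layout for this sketch).
    """
    loss = {}
    for batch, sent in upstream.items():
        # A batch not seen downstream is treated as fully lost.
        received = downstream.get(batch, 0)
        loss[batch] = sent - received
    return loss


# Counters are read only after the marking bit has flipped, so the
# batch being compared is no longer changing at either point.
upstream_counts = {0: 100, 1: 100, 2: 100}
downstream_counts = {0: 100, 1: 97, 2: 100}
loss = packet_loss_per_batch(upstream_counts, downstream_counts)
```

With more measurement points configured along the path, the same comparison between adjacent points narrows down the segment where the loss occurs.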
A.3.4. Dynamic Network Probe A.3.2. Dynamic Network Probe
Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq] Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq]
provides a programmable means to customize the data that an provides a programmable means to customize the data that an
application collects from the data plane. A direct benefit of DNP is application collects from the data plane. A direct benefit of DNP is
the reduction of the exported data. A full DNP solution covers the reduction of the exported data. A full DNP solution covers
several components including data source, data subscription, and data several components including data source, data subscription, and data
generation. The data subscription needs to define the complex data generation. The data subscription needs to define the complex data
which can be composed and derived from the raw data sources. The which can be composed and derived from the raw data sources. The
data generation takes advantage of the moderate in-network computing data generation takes advantage of the moderate in-network computing
to produce the desired data. to produce the desired data.
While DNP can introduce unforeseeable flexibility to the data plane While DNP can introduce unforeseeable flexibility to the data plane
telemetry, it also faces some challenges. It requires a flexible telemetry, it also faces some challenges. It requires a flexible
data plane that can be dynamically reprogrammed at run-time. The data plane that can be dynamically reprogrammed at run-time. The
programming API is yet to be defined. programming API is yet to be defined.
A.3.5. IP Flow Information Export (IPFIX) protocol A.3.3. IP Flow Information Export (IPFIX) protocol
Traffic on a network can be seen as a set of flows passing through Traffic on a network can be seen as a set of flows passing through
network elements. IP Flow Information Export (IPFIX) [RFC7011] network elements. IP Flow Information Export (IPFIX) [RFC7011]
provides a means of transmitting traffic flow information for provides a means of transmitting traffic flow information for
administrative or other purposes. A typical IPFIX enabled system administrative or other purposes. A typical IPFIX enabled system
includes a pool of Metering Processes that collect data packets at one or includes a pool of Metering Processes that collect data packets at one or
more Observation Points, optionally filters them and aggregates more Observation Points, optionally filters them and aggregates
information about these packets. An Exporter then gathers each of information about these packets. An Exporter then gathers each of
the Observation Points together into an Observation Domain and sends the Observation Points together into an Observation Domain and sends
this information via the IPFIX protocol to a Collector. this information via the IPFIX protocol to a Collector.
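To make the Exporter-to-Collector exchange more concrete, the following Python sketch decodes the 16-octet IPFIX Message Header defined in [RFC7011] (version number, length, export time, sequence number, and Observation Domain ID). The function name and the dictionary keys are assumptions for this example; decoding of Template and Data Sets is out of scope here.

```python
import struct

IPFIX_VERSION = 10  # Version number carried in every IPFIX message header.


def parse_ipfix_header(data: bytes) -> dict:
    """Decode the fixed 16-octet IPFIX Message Header (hypothetical helper)."""
    if len(data) < 16:
        raise ValueError("IPFIX message header needs 16 octets")
    # Network byte order: two 16-bit fields followed by three 32-bit fields.
    version, length, export_time, sequence, domain_id = struct.unpack(
        "!HHIII", data[:16]
    )
    if version != IPFIX_VERSION:
        raise ValueError("not an IPFIX message")
    return {
        "length": length,
        "export_time": export_time,
        "sequence_number": sequence,
        "observation_domain_id": domain_id,
    }


# Example: a header-only message from Observation Domain 7.
raw = struct.pack("!HHIII", 10, 16, 1570492800, 42, 7)
ipfix_header = parse_ipfix_header(raw)
```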
A.3.6. In-Situ OAM A.3.4. In-Situ OAM
Traditional passive and active monitoring and measurement techniques Traditional passive and active monitoring and measurement techniques
are either inaccurate or resource-consuming. It is preferable to are either inaccurate or resource-consuming. It is preferable to
directly acquire data associated with a flow's packets when the directly acquire data associated with a flow's packets when the
packets pass through a network. In-situ OAM (iOAM) packets pass through a network. In-situ OAM (iOAM)
[I-D.brockners-inband-oam-requirements], a data generation technique, [I-D.brockners-inband-oam-requirements], a data generation technique,
embeds a new instruction header to user packets and the instruction embeds a new instruction header to user packets and the instruction
directs the network nodes to add the requested data to the packets. directs the network nodes to add the requested data to the packets.
Thus, at the path end, the packet's experience gained on the entire Thus, at the path end, the packet's experience gained on the entire
forwarding path can be collected. Such firsthand data is invaluable forwarding path can be collected. Such firsthand data is invaluable
to many network OAM applications. to many network OAM applications.
However, iOAM also faces some challenges. The issues on performance However, iOAM also faces some challenges. The issues on performance
impact, security, scalability and overhead limits, encapsulation impact, security, scalability and overhead limits, encapsulation
difficulties in some protocols, and cross-domain deployment need to difficulties in some protocols, and cross-domain deployment need to
be addressed. be addressed.
A.3.7. Postcard Based Telemetry A.3.5. Postcard Based Telemetry
PBT [I-D.song-ippm-postcard-based-telemetry] is an alternative to PBT [I-D.song-ippm-postcard-based-telemetry] is an alternative to
IOAM. PBT directly exports data at each node through an independent IOAM. PBT directly exports data at each node through an independent
packet. PBT solves several issues of IOAM. It can also help to packet. PBT solves several issues of IOAM. It can also help to
identify packet drop location in case a packet is dropped on its identify packet drop location in case a packet is dropped on its
forwarding path. forwarding path.
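The difference between the two collection styles can be sketched with a toy model in Python. It assumes a path expressed as a list of node names and a single optional drop point; both functions and their behavior at the dropping node are hypothetical simplifications, not the encapsulation or export formats of either proposal.

```python
def collect_in_packet(path, drop_at=None):
    """In-situ style: each node adds its data to the packet itself.

    The data is only available if the packet reaches the path end.
    """
    trace = []
    for node in path:
        if node == drop_at:
            return None  # packet lost, and the embedded data with it
        trace.append(node)
    return trace


def collect_postcards(path, drop_at=None):
    """Postcard style: each node exports its data in a separate packet."""
    postcards = []
    for node in path:
        if node == drop_at:
            break  # assumption: the dropping node exports nothing
        postcards.append(node)
    return postcards


path = ["A", "B", "C", "D"]
in_packet = collect_in_packet(path, drop_at="C")
postcards = collect_postcards(path, drop_at="C")
```

In this model the postcards from A and B still reach the collector, which suggests that the packet was dropped after B, while the in-packet trace is lost entirely.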
A.4. External Data and Event Telemetry A.4. External Data and Event Telemetry
A.4.1. Sources of External Events
Events that occur outside the boundaries of the network system are
another important source of telemetry information. Correlating both
internal telemetry data and external events with the requirements of
network systems, as presented in Exploiting External Event Detectors
to Anticipate Resource Requirements for the Elastic Adaptation of
SDN/NFV Systems [I-D.pedro-nmrg-anticipated-adaptation], provides a
strategic and functional advantage to management operations.
A.4.1. Requirements and Challenges
As with other sources of telemetry information, the data and events
must meet strict requirements, especially in terms of timeliness,
which is essential to properly incorporate external event information
to management cycles. Thus, the specific challenges are described as
follows:
o The role of external event detector can be played by multiple
elements, including hardware (e.g. physical sensors, such as
seismometers) and software (e.g. Big Data sources that analyze
streams of information, such as Twitter messages). Thus, the
transmitted data must support different shapes but, at the same
time, follow a common but extensible ontology.
o Since the main function of the external event detectors is to
perform the notifications, their timeliness is assumed. However,
once messages have been dispatched, they must be quickly collected
and inserted into the control plane with variable priority, which
will be high for important sources and/or important events and low
for secondary ones.
o The ontology used by external detectors must be easily adopted by
current and future devices and applications. Therefore, it must
be easily mapped to current information models, such as those
expressed in YANG.
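The requirements listed above can be illustrated with a minimal
sketch. It is not part of the framework: the record layout, the
priority tables, and every name below (ExternalEvent, EventQueue,
SOURCE_PRIORITY, EVENT_PRIORITY) are hypothetical and chosen only for
illustration. The sketch assumes a common set of fields plus a
free-form attribute map to stand in for a "common but extensible"
shape, and a priority queue in which an event is served early if
either its source or its event type is considered important.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ExternalEvent:
    """Common fields shared by all detectors, plus extensible attributes."""
    source: str
    event_type: str
    timestamp: float
    attributes: Dict[str, Any] = field(default_factory=dict)


# Hypothetical importance tables; a lower number is served first.
SOURCE_PRIORITY = {"seismometer": 0, "social-media-analyzer": 2}
EVENT_PRIORITY = {"earthquake": 0, "trending-topic": 2}
DEFAULT_PRIORITY = 3


def priority_of(event: ExternalEvent) -> int:
    """High priority if the source and/or the event type is important."""
    return min(
        SOURCE_PRIORITY.get(event.source, DEFAULT_PRIORITY),
        EVENT_PRIORITY.get(event.event_type, DEFAULT_PRIORITY),
    )


class EventQueue:
    """Collects dispatched events and releases them by priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, ExternalEvent]] = []
        # The counter keeps arrival order among events of equal priority
        # and prevents the heap from ever comparing two events directly.
        self._counter = itertools.count()

    def push(self, event: ExternalEvent) -> None:
        heapq.heappush(
            self._heap, (priority_of(event), next(self._counter), event)
        )

    def pop(self) -> ExternalEvent:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

In this sketch, mapping to an existing information model would amount
to translating the common fields and the attribute map into that
model's structures; that mapping is not shown here.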
Organizing internal and external telemetry information together
will be key for the general exploitation of the management
possibilities of current and future network systems, as reflected in
the incorporation of cognitive capabilities into new hardware and
software (virtual) elements.
A.4.2. Sources of External Events
To ensure that the information provided by external event detectors
and used by the network management solutions is meaningful for
management purposes, the network telemetry framework must ensure that
such detectors (sources) are easily connected to the management
solutions (sinks). This requires specifying a simple taxonomy of
detectors and matching it to the connectors and/or interfaces
required to connect them.
Once detectors are classified in such a taxonomy, their definitions are
skipping to change at page 32, line 26 skipping to change at page 33, line 22
other source types, a new information model, format, and reporting
protocol is required to integrate the detectors of this type with
the management solution.
Additional detector types can be added to the system, but they will
generally be the result of composing the properties offered by these
main classes. In any case, future revisions of the network telemetry
framework will include the required types that cover new
circumstances and that cannot be obtained by composition.
A.4.3. Connectors and Interfaces A.4.2. Connectors and Interfaces
To allow external event detectors to be properly integrated with
other management solutions, both elements must expose interfaces and
protocols that are subject to their particular objective. Since
external event detectors will be focused on providing their
information to their main consumers, which generally will not be
limited to the network management solutions, the framework must
include the definition of the required connectors for ensuring that
the interconnection between detectors (sources) and their consumers
within the management systems (sinks) is effective.
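One possible reading of the connector role described above is
sketched below. The draft does not define a connector API, so the
class ConnectorRegistry, its methods, and the detector class label
used in the usage note are assumptions made purely for illustration.
The sketch assumes that each detector class is paired with an adapter
that converts the detector's native report into a common dictionary,
which is then handed to every registered management sink.

```python
from typing import Any, Callable, Dict, List

Report = Dict[str, Any]
Adapter = Callable[[Any], Report]   # native detector output -> common report
Sink = Callable[[Report], None]     # consumer inside the management system


class ConnectorRegistry:
    """Hypothetical connector layer between detectors and sinks."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Adapter] = {}
        self._sinks: List[Sink] = []

    def register_adapter(self, detector_class: str, adapter: Adapter) -> None:
        # One adapter per detector class in the taxonomy.
        self._adapters[detector_class] = adapter

    def register_sink(self, sink: Sink) -> None:
        self._sinks.append(sink)

    def deliver(self, detector_class: str, raw: Any) -> int:
        """Normalize a raw report and pass it to all sinks.

        Returns the number of sinks that received the report. Raises
        KeyError if no adapter is registered for the detector class.
        """
        adapter = self._adapters.get(detector_class)
        if adapter is None:
            raise KeyError(f"no connector for detector class {detector_class!r}")
        report = adapter(raw)
        for sink in self._sinks:
            sink(report)
        return len(self._sinks)
```

For example, registering an adapter under a hypothetical "sensor"
class that turns a raw string reading into {"kind": "sensor",
"value": 4.5}, and a sink that appends reports to a list, lets
deliver("sensor", "4.5") hand the normalized report to that sink.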
 End of changes. 77 change blocks. 
393 lines changed or deleted 446 lines changed or added
