NFVRG                                                         C. Meirosu
Internet Draft                                                  Ericsson
Intended status:  Informational                             A. Manzalini
Expires: January 2017                                     Telecom Italia
                                                             R. Steinert
                                                            G. Marchetto
                                                   Politecnico di Torino
                                                          K. Pentikousis
                                                               S. Wright
                                                                P. Lynch
                                                                 W. John

                                                           July 8, 2016

            DevOps for Software-Defined Telecom Infrastructures

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at

   The list of Internet-Draft Shadow Directories can be accessed at

   This Internet-Draft will expire on January 8, 2017.

Meirosu, et al.        Expires January 8, 2017                 [Page 1]

Internet-Draft            DevOps Challenges                   July 2016

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Abstract


   Carrier-grade network management was optimized for environments built
   with monolithic physical nodes and involves significant deployment,
   integration and maintenance efforts from network service providers.
   The introduction of virtualization technologies, from the physical
   layer all the way up to the application layer, however, invalidates
   several well-established assumptions in this domain. This draft opens
   the discussion in NFVRG about challenges related to transforming the
   telecom network infrastructure into an agile, model-driven
   environment for communication services. We take inspiration from data
   center DevOps on the simplification and automation of management
   processes for a telecom service provider software-defined
   infrastructure (SDI). A number of challenges associated with
   operationalizing DevOps principles at scale in software-defined
   telecom networks are identified in three areas related to key
   programmable management processes.

Table of Contents

   1. Introduction...................................................3
   2. Software-Defined Telecom Infrastructure: Roles and DevOps
      principles.....................................................5
      2.1. Service Developer Role....................................6
      2.2. VNF Developer role........................................6
      2.3. System Integrator role....................................6
      2.4. Network Service Operator role.............................7
      2.5. Customer role.............................................7
      2.6. DevOps Principles.........................................7
   3. Continuous Integration.........................................9
   4. Continuous Delivery...........................................10
   5. Consistency, Availability and Partitioning Challenges.........10
   6. Stability and Real-Time Change Challenges.....................11
   7. Observability Challenges......................................13
   8. Verification Challenges.......................................15
   9. Testing Challenges............................................17
   10. Programmable management......................................18
   11. Security Considerations......................................20
   12. IANA Considerations..........................................20
   13. References...................................................20
      13.1. Informative References..................................20
   14. Contributors to earlier versions.............................23
   15. Acknowledgments..............................................23
   16. Authors' Addresses...........................................24

1. Introduction

   Carrier-grade network management was developed as an incremental
   solution once a particular network technology matured and came to be
   deployed in parallel with legacy technologies. This approach requires
   significant integration efforts when new network services are
   launched. Both centralized and distributed algorithms have been
   developed in order to solve very specific problems related to
   configuration, performance and fault management. However, such
   algorithms consider a network that is by and large functionally
   static. Thus, management processes related to introducing new or
   maintaining functionality are complex and costly due to significant
   efforts required for verification and integration.

   Network virtualization, by means of Software-Defined Networking (SDN)
   and Network Function Virtualization (NFV), creates an environment
   where network functions are no longer static or strictly embedded in
   physical boxes deployed at fixed points. The virtualized network is
   dynamic and open to fast-paced innovation enabling efficient network
   management and reduction of operating cost for network operators. A
   significant part of network capabilities is expected to become
   available through interfaces that resemble the APIs widespread within
   datacenters instead of the traditional telecom means of management
   such as the Simple Network Management Protocol, Command Line
   Interfaces or CORBA. Such an API-based approach, combined with the
   programmability offered by SDN interfaces [RFC7426], opens
   opportunities for handling infrastructure, resources, and Virtual
   Network Functions (VNFs) as code, employing techniques from software

   The efficiency and integration of existing management techniques in
   virtualized and dynamic network environments are limited, however.
   Monitoring tools, e.g. based on simple counters, physical network
   taps and active probing, do not scale well and provide only a small
   part of the observability features required in such a dynamic
   environment. Although huge amounts of monitoring data can be
   collected from the nodes, the typical granularity is rather static
   and coarse and management bandwidths may be limited. Debugging and
   troubleshooting techniques developed for software-defined
   environments are a research topic that has gathered interest in the
   research community in recent years. Still, it is yet to be explored
   how to integrate them into an operational network management system.
   Moreover, research tools developed in academia (such as NetSight
   [H2014], OFRewind [W2011], FlowChecker [S2010], etc.) were limited to
   solving very particular, well-defined problems, and oftentimes are
   not built for automation and integration into carrier-grade network
   operations workflows. As the virtualized network functions,
   infrastructure software and infrastructure hardware become more
   dynamic [NFVSWA], the monitoring, management and testing approaches
   also need to change.

   The topics at hand have already attracted several standardization
   organizations to look into the issues arising in this new
   environment. For example, IETF working groups have activities in the
   area of OAM and Verification for Service Function Chaining
   [I-D.aldrin-sfc-oam-framework] [I-D.lee-sfc-verification]. At IRTF,
   [RFC7149] asks a set of relevant questions regarding operations of
   SDNs. The ETSI NFV ISG defines the
   MANO interfaces [NFVMANO], and TMForum investigates gaps between
   these interfaces and existing specifications in [TR228]. The need for
   programmatic APIs in the orchestration of compute, network and
   storage resources is discussed in [I-D.unify-nfvrg-challenges].

   From a research perspective, problems related to operations of
   software-defined networks are in part outlined in [SDNsurvey] and
   research referring to both cloud and software-defined networks is
   discussed in [D4.1].

   The purpose of this first version of this document is to act as a
   discussion opener in NFVRG by describing a set of principles that are
   relevant for applying DevOps ideas to managing software-defined
   telecom network infrastructures. We identify a set of challenges
   related to developing tools, interfaces and protocols that would
   support these principles, and to leveraging standard APIs for
   simplifying management tasks.

2. Software-Defined Telecom Infrastructure: Roles and DevOps principles

   There is no single list of core principles of DevOps, but it is
   generally recognized as encompassing:

     .  Iterative development / Incremental feature content

     .  Continuous deployment

     .  Automated processes

     .  Holistic/Systemic views of development and deployment/operations

   With deployment/operations becoming increasingly linked with
   software development, and business needs driving more rapid
   deployments, agile methodologies are assumed as a basis for DevOps.
   Agile methods used in many software-focused companies aim at
   releasing small iterations of code to implement VNFs with high
   velocity and high quality into a production environment. Similarly,
   service providers are interested in releasing incremental
   improvements in the network services that they create from
   virtualized network functions. The cycle time for DevOps as applied
   in many open source projects is on the order of one quarter year, or
   13 weeks.

   The code needs to undergo a significant amount of automated testing
   and verification with pre-defined templates in a realistic setting.
   From the point of view of software-defined telecom infrastructure
   management, the network and service configuration is expected to
   continuously evolve as a result of network policy decomposition and
   refinement, service evolution, the updates, failovers or
   re-configuration of virtual functions, and additions/upgrades of
   infrastructure resources (e.g. whiteboxes, fibers). When
   troubleshooting the cause of unexpected behavior, fine-grained
   visibility into all resources supporting the virtual functions
   (either compute, or network-related) is paramount to facilitating
   fast resolution times. While compute resources are typically very
   well covered by debugging and profiling toolsets based on many years
   of advances in software engineering, programmable network resources
   are still a novelty and tools exploiting their potential are scarce.

2.1. Service Developer Role

   We identify two dimensions of the "developer" role in software-
   defined infrastructure (SDI).  The network service to be developed is
   captured in a network service descriptor (e.g. [IFA014]). One
   dimension relates to determining which high-level functions should be
   part of a particular service, deciding what logical interconnections
   are needed between these blocks and defining a set of high-level
   constraints or goals related to parameters that define, for instance,
   a Service Function Chain. This could be determined by the product
   owner for a particular family of services offered by a telecom
   provider. Or, it might be a key account representative that adapts an
   existing service template to the requirements of a particular
   customer by adding or removing a small number of functional entities.
   We refer to this person as the Service Developer and for simplicity
   (access control, training on technical background, etc.) we consider
   the role to be internal to the telecom provider.
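
   As an illustration, the three ingredients named above (high-level
   functions, logical interconnections and high-level constraints)
   could be captured in a small data structure. The sketch below is a
   hypothetical example: the class ServiceTemplate, its field names and
   the customized() helper are assumptions made for illustration only
   and do not follow the descriptor format of [IFA014].

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class ServiceTemplate:
    """Hypothetical service description; NOT the [IFA014] schema."""
    name: str
    functions: List[str]                  # high-level functional blocks
    links: List[Tuple[str, str]]          # logical interconnections
    constraints: Dict[str, float] = field(default_factory=dict)

    def validate(self) -> None:
        """Check that every link connects two declared functions."""
        known = set(self.functions)
        for src, dst in self.links:
            if src not in known or dst not in known:
                raise ValueError(f"link {src}->{dst} uses an unknown function")

    def customized(self, add: Iterable[str] = (),
                   remove: Iterable[str] = ()) -> "ServiceTemplate":
        """Return a per-customer variant with functions added or removed."""
        removed = set(remove)
        functions = [f for f in self.functions if f not in removed] + list(add)
        links = [(s, d) for s, d in self.links
                 if s not in removed and d not in removed]
        return ServiceTemplate(self.name, functions, links,
                               dict(self.constraints))


if __name__ == "__main__":
    base = ServiceTemplate(
        name="business-vpn",
        functions=["firewall", "nat", "dpi"],
        links=[("firewall", "nat"), ("nat", "dpi")],
        constraints={"max_latency_ms": 20.0},
    )
    base.validate()
    variant = base.customized(remove=["dpi"])
    variant.validate()
    print(variant.functions, variant.links)
```

   In this sketch, the product owner would maintain the base template,
   while a key account representative would derive a customer-specific
   variant from it without modifying the original.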

2.2. VNF Developer role

   Another dimension of the "developer" role is a person that writes the
   software code for a new virtual network function (VNF). The VNF then
   needs to be delivered as a package (e.g. [IFA011]) that includes
   various metadata for ingestion/integration into some service. Note
   that a VNF may span multiple virtual machines to support design
   objectives (e.g. for reliability or scalability). Depending on the
   actual VNF being developed, this person might be internal or external
   (e.g. a traditional equipment vendor) to the telecom provider. We
   refer to them as VNF Developers.

2.3. System Integrator role

   The System Integrator role is to some extent similar to the Service
   Developer: people in this role need to identify the components of the
   system to be delivered. However, for the Service Developer, the
   service components are pre-integrated, meaning that they have the
   right interfaces to interact with each other. In contrast, the
   System Integrator needs to develop the software that makes the
   system components interact with each other. As such, the System
   Integrator role combines aspects of the Developer roles and adds yet
   another dimension to them. Compared to the other Developer roles, the
   System Integrator might face additional challenges due to the fact
   that they might not have access to the source code of some of the
   components. This limits, for example, how fast they can address
   issues with components to be integrated, and may lead to an uneven
   workload depending on the release granularity of the different
   components that need to be integrated. Some system integration
   activities may take place on an industry basis in collaborative
   communities (e.g.

2.4. Network Service Operator role

   The role of a Network Service Operator is to ensure that the
   deployment processes are successful and that a set of performance
   indicators associated with a particular network service are met. The
   network service is supported on a specific set of infrastructure
   resources that may be owned and operated by that Network Service
   Operator, or provided under contract from some other infrastructure
   service provider.

2.5. Customer role

   A Customer contracts a telecom operator to provide one or more
   services. In SDI, the Customer may communicate with the provider in
   real time through an online portal. From the customer perspective,
   such portal interfaces become part of the service definition just
   like the data transfer aspects of the service. Compared to the
   Service Developer, the Customer is external to the operator and may
   define changes to their own service instance only in accordance with
   policies defined by the Service Developer. In addition to the usual
   per-service utilization statistics, in SDI the portal may enable the
   customer to trigger certain performance management or troubleshooting
   tools for the service. This, for example, enables the Customer to
   determine whether the root cause of a certain error or degradation
   condition that they observe is located in the telecom operator domain
   or not and may facilitate the interaction with the customer support

2.6. DevOps Principles

   In line with the generic DevOps concept outlined in [DevOpsP], we
   consider these four principles important for adapting DevOps ideas
   to SDI:

   * Automated processes: Deploy with repeatable, reliable processes:
   Service and VNF Developers should be supported by automated build,
   orchestrate and deploy processes that are identical in the
   development, test and production environments. Such processes need to
   be made reliable and trusted in the sense that they should reduce the
   chance of human error and provide visibility at each stage of the
   process, as well as have the possibility to enable manual
   interactions in certain key stages.

   * Holistic/systemic view: Develop and test against production-like
   systems: both Service Developers and VNF Developers need to have the
   opportunity to verify and debug their respective SDI code in systems
   that have characteristics which are very close to the production
   environment where the code is expected to be ultimately deployed.
   Customizations of Service Function Chains or VNFs could thus be
   released frequently to a production environment in compliance with
   policies set by the Operators. Adequate isolation and protection of
   the services active in the infrastructure from services being tested
   or debugged should be provided by the production environment.

   * Continuous: Monitor and validate operational quality: Service
   Developers, VNF Developers and Operators must be equipped with tools,
   automated as much as possible, that enable them to continuously
   monitor the operational quality of the services deployed on SDI.
   Monitoring tools should be complemented by tools that allow verifying
   and validating the operational quality of the service in line with
   established procedures which might be standardized (for example,
   Y.1564 Ethernet Activation [Y1564]) or defined through best practices
   specific to a particular telecom operator.

   * Iterative/Incremental: Amplify development cycle feedback loops: An
   integral part of the DevOps ethos is building a cross-cultural
   environment that bridges the cultural gap between the desire for
   continuous change by the Developers and the demand by the Operators
   for stability and reliability of the infrastructure. Feedback from
   customers is collected and transmitted throughout the organization.
   From a technical perspective, such cultural aspects could be
   addressed through common sets of tools and APIs that are aimed at
   providing a shared vocabulary for both Developers and Operators, as
   well as simplifying the reproduction of problematic situations in the
   development, test and operations environments.

   Network operators that would like to move to agile methods to deploy
   and manage their networks and services face a different environment
   compared to typical software companies where simplified trust
   relationships between personnel are the norm. In software companies,
   it is not uncommon that the same person may be rotating between
   different roles. In contrast, in a telecom service provider, there
   are strong organizational boundaries between suppliers (whether in
   Developer roles for network functions, or in Operator roles for
   outsourced services) and the carrier's own personnel that might also
   take both Developer and Operator roles. Extending DevOps principles
   across strong organizational boundaries (e.g. through co-creation or
   collaborative development in open source communities) may be a
   commercial challenge rather than a technical issue.

3. Continuous Integration

   Software integration is the process of bringing together the software
   component subsystems into one software system, and ensuring that the
   subsystems function together as a system. Software integration can
   apply regardless of the size of the software components. The
   objective of Continuous Integration is to prevent integration
   problems close to the expected release of a software development
   project into a production (operations) environment. Continuous
   Integration is therefore closely coupled with the notion of DevOps as
   a mechanism to ease the transition from development to operations.

   Continuous integration may result in multiple builds per day. It is
   also typically used in conjunction with test driven development
   approaches that integrate unit testing into the build process. The
   unit testing is typically automated through build servers. Such
   servers may implement a variety of additional static and dynamic
   tests as well as other quality control and documentation extraction
   functions. The reduced cycle times of continuous integration enable
   improved software quality by applying small efforts frequently.

   Continuous Integration applies to developers of VNFs as they integrate
   the components that they need to deliver their VNF. The VNFs may
   contain components developed by different teams within the VNF
   Provider, or may integrate code developed externally - e.g. in
   commercial code libraries or in open source communities.

   Service developers also apply continuous integration in the
   development of network services. Network services comprise various
   aspects, including VNFs and the connectivity within and between
   them, as well as various associated resource authorizations. The
   components of the network service are all dynamic, and largely
   represented by software that must be integrated regularly to maintain

   Some of the software components that Service Developers integrate may
   be sourced from VNF Providers or from open source communities.
   Service Developers and Network Service Operators are increasingly
   motivated to engage with open source communities [OSandS]. Open
   source interfaces supported by open source communities may be more
   useful than traditional paper interface specifications. Even where
   Service Providers are deeply engaged in the open source community
   (e.g. OPNFV), many of them may prefer to obtain the code through
   some software provider as a business practice. Such software
   providers have the same interests in software integration as other
   VNF providers. An open source integration community (e.g. OPNFV) may
   resolve common integration issues across the industry, reducing the
   need for integration issue resolution specific to particular

4. Continuous Delivery

   The practice of Continuous Delivery extends Continuous Integration by
   ensuring that the software (either VNF code or code for SDI)
   checked in on the mainline is always in a user deployable state and
   enables rapid deployment by those users. For critical systems such as
   telecommunications networks, Continuous Delivery may require
   including a manual trigger before the actual deployment in the live
   system, in contrast to the Continuous Deployment methodology, which
   is also part of DevOps processes in software companies.

   Automated Continuous Deployment systems may exceed 10 updates per
   day. Assuming an integration of 100 components, each with an average
   time to upgrade of 180 days, deployments on the order of every
   1.8 days might be expected. The telecom infrastructure is also very
   distributed - consider the case of cloud RAN use cases where the
   number of locations for deployment is of the order of the number of
   cell tower locations (~10^4..10^6). Deployments may need to be
   incremental across the infrastructure to reduce the risk of large-
   scale failures. Conversely, there may need to be rapid rollbacks to
   prior stable deployment configurations in the event of significant
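
   The estimate of one deployment every 1.8 days follows from dividing
   the average upgrade interval by the number of components. The sketch
   below reproduces that arithmetic and adds one possible way of
   splitting a large number of sites into incremental rollout waves.
   The function names and the tenfold wave growth are assumptions made
   for illustration; they are not taken from any deployment system
   discussed in this document.

```python
from typing import List


def mean_days_between_deployments(num_components: int,
                                  mean_upgrade_interval_days: float) -> float:
    """Expected spacing between deployments when each component is
    upgraded independently, on average once per interval."""
    if num_components <= 0:
        raise ValueError("num_components must be positive")
    return mean_upgrade_interval_days / num_components


def rollout_waves(num_sites: int, first_wave: int = 1,
                  growth: int = 10) -> List[int]:
    """Hypothetical incremental rollout: each wave targets `growth`
    times more sites than the previous one, until all are covered."""
    if first_wave <= 0 or growth <= 1:
        raise ValueError("first_wave must be > 0 and growth must be > 1")
    waves, remaining, size = [], num_sites, first_wave
    while remaining > 0:
        batch = min(size, remaining)   # never exceed the sites left
        waves.append(batch)
        remaining -= batch
        size *= growth
    return waves


if __name__ == "__main__":
    # 100 components, each upgraded every 180 days on average
    print(mean_days_between_deployments(100, 180))
    # 10^4 sites, the lower bound of the cloud RAN range quoted above
    print(rollout_waves(10_000))
```

   Starting with a small wave limits the number of sites exposed to a
   faulty release, at the cost of a longer overall rollout.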

5. Consistency, Availability and Partitioning Challenges

   The CAP theorem [CAP] states that any networked shared-data system
   can have at most two of the following three properties: 1)
   Consistency (C), equivalent to having a single up-to-date copy of
   the data; 2) high Availability (A) of that data (for updates); and
   3) tolerance to network Partitions (P).

   Looking at a telecom SDI as a distributed computational system
   (routing/forwarding packets can be seen as a computational problem),
   just two of the three CAP properties will be possible at the same
   time. The general idea is that two of the three have to be chosen:
   CP favors consistency, AP favors availability, and CA assumes that
   there are no partitions. This has profound implications for
   technologies that need to be developed in line with the "deploy with
   repeatable, reliable processes" principle for configuring SDI
   states. Latency or delay and partitioning properties are closely
   related, and such relation becomes more important in the case of
   telecom service providers where Devs and Ops interact with widely
   distributed infrastructure.
   Limitations of interactions between centralized management and
   distributed control need to be carefully examined in such
   environments. Traditionally, connectivity was the main concern: C
   and A were about delivering packets to their destination. The
   features and capabilities of SDN and NFV are changing the concerns:
   for example, in SDN, control plane Partitions no longer imply data
   plane Partitions, so A does not imply C. In practice, CAP reflects
   the need for a
   balance between local/distributed operations and remote/centralized

   In addition to CAP aspects related to individual protocols,
   interdependencies between CAP choices for both resources and VNFs
   that are interconnected in a forwarding graph need to be considered.
   This is particularly relevant for the "Monitor and Validate
   Operational Quality" principle, as apart from transport protocols,
   most OAM functionality is generally configured in processes that are
   separated from the configuration of the monitored entities. Also,
   partitioning in a monitoring plane implemented through VNFs executed
   on compute resources does not necessarily mean that the dataplane of
   the monitored VNF was partitioned as well.
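
   The difference between a CP and an AP choice for configuration state
   can be made concrete with a toy model. The sketch below is only an
   illustration under simplifying assumptions: the class
   ReplicatedConfig, its two in-memory replicas and the partitioned
   flag are hypothetical and do not model any particular controller or
   datastore.

```python
class ReplicatedConfig:
    """Toy two-replica configuration store (hypothetical)."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False   # True when the replicas cannot talk

    def write(self, replica: int, key: str, value: str) -> bool:
        """Return True if the write was accepted."""
        if not self.partitioned:
            for store in self.replicas:       # replicate to every copy
                store[key] = value
            return True
        if self.mode == "CP":
            return False                      # refuse: keep consistency
        self.replicas[replica][key] = value   # AP: accept, copies diverge
        return True

    def consistent(self) -> bool:
        """True when both replicas hold the same configuration."""
        return self.replicas[0] == self.replicas[1]


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        store = ReplicatedConfig(mode)
        store.write(0, "mtu", "1500")
        store.partitioned = True
        accepted = store.write(0, "mtu", "9000")
        print(mode, "accepted:", accepted, "consistent:", store.consistent())
```

   Under a partition, the CP variant stays consistent but rejects the
   update, while the AP variant accepts it and leaves the replicas to
   be reconciled once connectivity is restored.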

6. Stability and Real-Time Change Challenges

   The dimensions, dynamicity and heterogeneity of networks are growing
   continuously. Monitoring and managing the network behavior in order
   to meet technical and business objectives is becoming increasingly
   complicated and challenging, especially when considering the need of
   predicting and taming potential instabilities.

   In general, instability in networks may have primary effects both
   jeopardizing the performance and compromising an optimized use of
   resources, even across multiple layers: in fact, instability of end-
   to-end communication paths may depend both on the underlying
   transport network, as well as the higher level components specific to
   flow control and dynamic routing. For example, arguments for
   introducing advanced flow admission control are essentially derived
   from the observation that the network otherwise behaves in an
   inefficient and potentially unstable manner. Even with resource
   over-provisioning, a network without an efficient flow admission
   control has instability regions that can even lead to congestion
   collapse in certain configurations. Another example is the
   instability which is characteristic of any dynamically adaptive
   routing system. Routing
   instability, which can be (informally) defined as the quick change of
   network reachability and topology information, has a number of
   possible origins, including problems with connections, router
   failures, high levels of congestion, software configuration errors,
   transient physical and data link problems, and software bugs.

   As a matter of fact, the states monitored and used to implement the
   different control and management functions in network nodes are
   governed by several low-level configuration commands. There are
   several dependencies among these states and the logic updating the
   states in real time (most of which are not synchronized
   automatically). Normally, high-level network goals (such as the
   connectivity matrix, load-balancing, traffic engineering goals,
   survivability requirements, etc.) are translated into low-level
   configuration commands (mostly manually) individually executed on the
   network elements (e.g., forwarding table, packet filters, link-
   scheduling weights, and queue-management parameters, as well as
   tunnels and NAT mappings). Network instabilities due to configuration
   errors can spread from node to node and propagate throughout the

   DevOps in the data center is a source of inspiration regarding how to
   simplify and automate management processes for software-defined
   infrastructure. Although the low-level configuration could be
   automated by DevOps tools such as CFEngine [C2015], Puppet [P2015]
   and Ansible [A2015], the high-level goal translation towards tool-
   specific syntax is still a manual process. In addition, while
   carrier-grade configuration tools using the NETCONF protocol support
   complex atomic transaction management (which reduces the potential
   for instability), Ansible requires third-party components to support
   rollbacks and the Puppet transactions are not atomic.
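
   The value of atomic transactions for stability can be illustrated
   with an all-or-nothing update across several nodes. The sketch below
   is a simplified model, not an implementation of NETCONF or of any of
   the tools named above: the Device class, its fail_on fault injection
   and the apply_atomically() helper are hypothetical names introduced
   for illustration.

```python
class Device:
    """Hypothetical managed node holding a flat key/value configuration."""

    def __init__(self, name, config=None, fail_on=None):
        self.name = name
        self.config = dict(config or {})
        self.fail_on = fail_on   # key whose update is rejected (simulated fault)

    def apply(self, changes):
        """Apply changes one key at a time; may fail part-way through."""
        for key, value in changes.items():
            if key == self.fail_on:
                raise RuntimeError(f"{self.name}: rejected {key}")
            self.config[key] = value


def apply_atomically(devices, changes):
    """Apply `changes` to every device, or restore all snapshots if any
    device rejects them. Returns True on success, False on rollback."""
    snapshots = [(dev, dict(dev.config)) for dev in devices]
    try:
        for dev in devices:
            dev.apply(changes)
    except RuntimeError:
        for dev, saved in snapshots:   # roll back every device
            dev.config = saved
        return False
    return True


if __name__ == "__main__":
    a = Device("edge-1", {"mtu": 1500})
    b = Device("edge-2", {"mtu": 1500}, fail_on="vlan")
    ok = apply_atomically([a, b], {"mtu": 9000, "vlan": 10})
    print(ok, a.config, b.config)
```

   Without the rollback step, the first node would keep the new
   settings while the second would be left half-configured, which is
   the kind of inconsistent state that can trigger instabilities.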

   As a specific example, automated configuration functions are expected
   to take the form of a "control loop" that monitors (i.e., measures)
   current states of the network, performs a computation, and then
   reconfigures the network. These types of functions must work
   correctly even in the presence of failures, variable delays in
   communicating with a distributed set of devices, and frequent changes
   in network conditions. Nevertheless, cascading and nesting of
   automated configuration processes can lead to the emergence of
   non-linear network behaviors, and as such sudden instabilities (i.e.
   identical local dynamics can give rise to widely different global
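
   A minimal sketch of such a monitor, compute and reconfigure loop is
   given below. It assumes a hypothetical scaling function that adjusts
   the number of instances of a VNF according to the measured load per
   instance; the function names and the threshold values are
   illustrative assumptions. The example only shows how a single
   loop can oscillate when its thresholds are chosen poorly; it does
   not model cascaded or nested loops.

```python
def control_step(load_per_instance, instances, low=0.3, high=0.8):
    """Compute step: return the new instance count.

    The gap between `low` and `high` is a dead band, so that small
    load fluctuations do not trigger a reconfiguration."""
    if load_per_instance > high:
        return instances + 1
    if load_per_instance < low and instances > 1:
        return instances - 1
    return instances


def run_loop(offered_load, instances=1, **thresholds):
    """Monitor -> compute -> reconfigure over a series of load samples.

    Returns the instance count after each iteration."""
    history = []
    for total in offered_load:                  # monitor: one sample
        per_instance = total / instances
        instances = control_step(per_instance, instances, **thresholds)
        history.append(instances)               # reconfigure: new state
    return history


if __name__ == "__main__":
    steady_load = [1.0] * 6
    # Wide dead band: the loop settles after one scaling action.
    print(run_loop(steady_load, low=0.3, high=0.8))
    # Narrow dead band: the loop flips between one and two instances
    # although the offered load never changes.
    print(run_loop(steady_load, low=0.6, high=0.8))
```

   With a constant offered load, the wide dead band converges to two
   instances, whereas the narrow one keeps reconfiguring the network
   on every iteration.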

7. Observability Challenges

   Monitoring algorithms need to operate in a scalable manner while
   providing the specified level of observability in the network, either
   for operation purposes (Ops part) or for debugging in a development
   phase (Dev part). We consider the following challenges:

   * Scalability - relates to the granularity of network observability,
   computational efficiency, communication overhead, and strategic
   placement of monitoring functions.

   * Distributed operation and information exchange between monitoring
   functions - monitoring functions supported by the nodes may perform
   specific operations (such as aggregation or filtering) locally on the
   collected data or within a defined data neighborhood and forward only
   the result to a management system. Such operation may require
   modifications of existing standards and development of protocols for
   efficient information exchange and messaging between monitoring
   functions. Different levels of granularity may need to be offered for
   the data exchanged through the interfaces, depending on the Dev or
   Ops role. Modern messaging systems widely employed in datacenter
   environments, such as Apache Kafka [AK2015], were optimized for
   messages that are considerably larger than a single counter value
   (the typical SNMP GET use case); note the relation between
   throughput and record size reported in [K2014]. It is also debatable
   to what extent properties such as message persistence within the bus
   are needed in a carrier environment, where MIBs already offer a
   certain level of persistence of management data at the node level.
   In addition, such messaging systems require the use of IP
   addressing, which might not be needed when the monitored data is
   consumed by a function within the same node.

   * Common communication channel between monitoring functions and
   higher layer entities (orchestration, control or management systems)
   - a single communication channel for configuration and measurement
   data of diverse monitoring functions running on heterogeneous hard-
   and software environments. In telecommunication environments,
   infrastructure assets span not only large geographical areas, but
   also a wide range of technology domains, ranging from CPEs, access-,
   aggregation-, and transport networks, to datacenters. This
   heterogeneity of hard- and software platforms requires higher layer
   entities to utilize various parallel communication channels for
   either configuration or data retrieval of monitoring functions within
   these technology domains. To address automation and advances in
   monitoring programmability, software defined telecommunication
   infrastructures would benefit from a single flexible communication
   channel, thereby supporting the dynamicity of virtualized
   environments. Such a channel should ideally support propagation of
   configuration, signalling, and results from monitoring functions;
   support carrier-grade operations in terms of availability and
   multi-tenant features; support highly distributed and hierarchical
   architectures, keeping messages as local as possible; be
   lightweight, topology independent, and network address agnostic; and
   offer flexibility in terms of transport mechanisms and programming
   language support. Existing popular state-of-the-art message queuing
   systems such as RabbitMQ [R2015] fulfill many of these requirements.
   However, they utilize centralized brokers, which pose a single point
   of failure and raise scalability concerns in widely distributed NFV
   environments. Furthermore, transport support is limited to TCP/IP.
   ZeroMQ [Z2015], on the other hand, lacks any advanced features for
   carrier-grade
   operations, including high-availability, authentication, and tenant

   * Configurability and conditional observability - monitoring
   functions that go beyond measuring simple metrics (such as delay, or
   packet loss) require expressive monitoring annotation languages for
   describing the functionality such that it can be programmed by a
   controller. Monitoring algorithms implementing self-adaptive
   monitoring behavior relative to local network situations may employ
   such annotation languages to receive high-level objectives (KPIs
   controlling tradeoffs between accuracy and measurement frequency, for
   example) and conditions for varying the measurement intensity. Steps
   in this direction were taken by DevOps tools such as Splunk
   [S2015], whose collecting agent has the ability to load particular
   apps that in turn access specific counters or log files. However,
   such apps are tool specific and may also require deploying additional
   agents that are specific to the application, library or
   infrastructure node being monitored. Choosing which objects to
   monitor in such an environment means deploying a tool-specific script
   that configures the monitoring app.

   * Automation - includes mapping of monitoring functionality from a
   logical forwarding graph to virtual or physical instances executing
   in the infrastructure, as well as placement and re-placement of
   monitoring functionality for required observability coverage and
   configuration consistency upon updates in a dynamic network
   environment. Puppet [P2015] manifests or Ansible [A2015] playbooks
   could be used for automating the deployment of monitoring agents, for
   example those used by Splunk [S2015]. However, both manifests and
   playbooks were designed to represent the desired system configuration
   snapshot at a particular moment in time - they would now need to be
   generated automatically by the orchestration tools instead of being
   written by a DevOps person.

   * Actionable data

   Data produced by observability tools could be utilized in a wide
   category of processes, ranging from billing and dimensioning to real-
   time troubleshooting and optimization. In order to allow for data-
   driven automated decisions and actuations based on these decisions,
   the data needs to be actionable. We define actionable data as being
   representative for a particular context or situation and an adequate
   input towards a decision. Ensuring actionable data is challenging in
   a number of ways, including: defining adaptive correlation and
   sampling windows, filtering and aggregation methods that are adapted
   or coordinated with the actual consumer of the data, and developing
   analytical and predictive methods that account for the uncertainty or
   incompleteness of the data.

   * Data Virtualization

   Data is key in helping both Developers and Operators perform their
   tasks. Traditional Network Management Systems were optimized for
   using one database that contains the master copy of the operational
   statistics and logs of network nodes. Ensuring access to this data
   from across the organization is challenging because privacy and
   business secrets need to be strictly protected. In DevOps-driven
   environments, data needs to be made available to Developers and their
   test environments. Data virtualization collectively defines a set of
   technologies that ensure that restricted copies of the partial data
   needed for a particular task may be made available while enforcing
   strict access control. Beyond simple access control, data
   virtualization needs to address scalability challenges involved in
   copying large amounts of operational data as well as automatically
   disposing of it when the task authorized for using it has finished.
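
   As an illustration of the local aggregation and filtering described
   under "Distributed operation and information exchange" above, the
   following sketch shows a monitoring function that summarizes raw
   counter samples locally and forwards a result only when it is worth
   reporting. The function names, the summary fields, and the
   alarm_threshold parameter are assumptions made for this example and
   are not taken from any existing monitoring tool.

```python
from statistics import mean
from typing import Dict, Iterable, List, Optional


def summarize(samples: Iterable[float]) -> Dict[str, float]:
    """Aggregate raw samples into a compact summary (local aggregation)."""
    values: List[float] = list(samples)
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


def report(samples: Iterable[float],
           alarm_threshold: float) -> Optional[Dict[str, float]]:
    """Return the summary to forward, or None if nothing needs reporting.

    Filtering rule (an assumption for this sketch): forward the summary
    only when the largest sample reaches the alarm threshold.
    """
    summary = summarize(samples)
    if summary["count"] == 0 or summary["max"] < alarm_threshold:
        return None
    return summary
```

   With this approach, the management system receives one small record
   per reporting interval instead of every individual counter reading,
   at the cost of losing per-sample detail.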

8. Verification Challenges

   Enabling ongoing verification of code is an important goal of
   continuous integration as part of the data center DevOps concept. In
   a telecom SDI, service definitions, decompositions and configurations
   need to be expressed in machine-readable encodings. For example,
   configuration parameters could be expressed in terms of YANG data
   models. However, the infrastructure management layers (such as
   Software-Defined Network Controllers and Orchestration functions)
   might not always export such machine-readable descriptions of the
   runtime configuration state. In this case, the management layer
   itself could be expected to include a verification process that has
   the same challenges as the stand-alone verification processes we
   outline later in this section. In that sense, verification can be
   considered as a set of features providing gatekeeper functions to
   verify both the abstract service models and the proposed resource
   configuration before or right after the actual instantiation on the
   infrastructure layer takes place.

   A verification process can involve different layers of the network
   and service architecture. Starting from a high-level verification of
   the customer input (for example, a Service Graph as defined in
   [I-D.unify-nfvrg-challenges]), the verification process could go more
   in depth to reflect on the Service Function Chain configuration. At
   the lowest layer, the verification would handle the actual set of
   forwarding rules and other configuration parameters associated with a
   Service Function Chain instance. This enables the verification of
   more quantitative properties (e.g. compliance with resource
   availability), as well as a more detailed and precise verification of
   the abovementioned topological ones. Existing SDN verification tools
   could be deployed in this context, but the majority of them only
   operate on flow space rules commonly expressed using OpenFlow syntax.
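
   A check at the lowest layer can be sketched as follows. The sketch
   is deliberately simplistic: it assumes that, for the flow under
   test, each node has at most one next hop determined by packet
   headers alone, and it models the rules as a plain dictionary rather
   than OpenFlow syntax. The function names are hypothetical. It
   verifies that traffic entering at the ingress reaches the egress
   without loops or black holes and traverses the required service
   functions in order.

```python
from typing import Dict, List, Optional, Sequence


def trace_path(next_hop: Dict[str, str], ingress: str,
               egress: str) -> Optional[List[str]]:
    """Follow per-node next-hop rules from ingress to egress.

    Returns the traversed path, or None if a loop or a node without a
    matching rule (black hole) is encountered.
    """
    path = [ingress]
    seen = {ingress}
    node = ingress
    while node != egress:
        node = next_hop.get(node)
        if node is None or node in seen:
            return None
        seen.add(node)
        path.append(node)
    return path


def visits_in_order(path: Sequence[str], required: Sequence[str]) -> bool:
    """True if `required` occurs in `path` as an ordered subsequence."""
    remaining = iter(path)
    return all(function in remaining for function in required)


def verify_chain(next_hop: Dict[str, str], ingress: str, egress: str,
                 required: Sequence[str]) -> bool:
    """Check reachability and function ordering for one flow."""
    path = trace_path(next_hop, ingress, egress)
    return path is not None and visits_in_order(path, required)
```

   For example, with the rules {"in": "fw", "fw": "nat", "nat": "out"},
   a chain that requires "fw" before "nat" passes the check, while a
   chain that requires "nat" before "fw" does not.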

   Moreover, such verification tools were designed for networks where
   the flow rules are necessary and sufficient to determine the
   forwarding state. This assumption is valid in networks composed only
   of network functions that forward traffic by analyzing only the
   packet headers (e.g., simple routers, stateless firewalls, etc.).
   Unfortunately, most real networks contain active network
   functions, represented by middle-boxes that dynamically change the
   forwarding path of a flow according to function-local algorithms and
   an internal state (that is based on the received packets), e.g., load
   balancers, packet marking modules and intrusion detection systems.
   The existing verification tools do not consider active network
   functions because they do not incorporate the dynamic transformation
   of internal state into the verification process.

   Defining a set of verification tools that can account for active
   network functions is a significant challenge. In order to perform
   verification based on formal properties of the system, the internal
   states of an active (virtual or not) network function would need to
   be represented. Although these states would increase the verification
   process complexity (e.g., using simple model checking would not be
   feasible due to state explosion), they help to better represent the
   forwarding behavior in real networks. A way to address this challenge
   is by attempting to summarize the internal state of an active network
   function in a way that allows for the verification process to finish
   within a reasonable time interval.

9. Testing Challenges

   Testing in an NFV environment changes the methodology used. The
   main challenge is the ability to isolate the Device Under Test (DUT).
   When testing physical devices, which are dedicated to a specific
   function, isolation of this function is relatively simple: isolate
   the DUT by surrounding it with emulations from test devices. This
   achieves isolation of the DUT, in a black box fashion, for any type
   of testing. In an NFV environment, the DUT becomes a component of a
   software infrastructure which can't be isolated. For example, testing
   a VNF can't be achieved without the presence of the NFVI and MANO
   components. In addition, the NFVI and MANO components can greatly
   influence the behavior and the performance of the VNF under test.

   With this in mind, in NFV, the isolation of the DUT becomes a new
   concept: the VNF Under Test (VUT) becomes part of an environment that
   consists of the rest of the necessary architecture components (the
   test environment). In the previous example, the VNF becomes the VUT,
   while the MANO and NFVI become the test environment. Then, isolation
   of the VUT becomes a matter of configuration management, where the
   configuration of the test environment is kept fixed for each test of
   the VUT. So the MANO policies for instantiation, scaling, and
   placement, as well as the NFVI parameters such as HW used, CPU
   pinning, etc., must remain fixed for each iterative test of the VNF.
   Only by keeping the configurations constant can the VNF tests be
   compared to each other. If any test environment configurations are
   changed between tests, the behavior of the VNF can be impacted, thus
   negating any comparison of the results.
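
   One way to enforce this in a test harness is to record a fingerprint
   of the test environment configuration with every result and to
   refuse comparisons across different fingerprints. The sketch below
   assumes that the MANO policies and NFVI parameters can be exported
   as a JSON-serializable dictionary; the key names and the
   env_fingerprint field are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List


def environment_fingerprint(config: Dict[str, Any]) -> str:
    """Hash a canonical JSON encoding of the test environment config.

    Keys are sorted so that the same settings always produce the same
    digest, regardless of the order in which they were collected.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def comparable(runs: List[Dict[str, Any]]) -> bool:
    """True if all runs were executed in the same test environment."""
    return len({run["env_fingerprint"] for run in runs}) <= 1
```

   A result set in which one run used different CPU pinning would then
   be flagged as not comparable before any performance figures are
   evaluated.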

   Of course, there are instances of testing where the inverse is
   desired: the configuration of the test environment is changed between
   each test, while the VNF configuration is kept constant. As an
   example, this type of methodology would be used in order to discover
   the optimum configuration of the NFVI for a particular VNF workload.
   Another similar but daunting challenge is the introduction of co-
   located tenants in the same environment as the VNF under test. The
   workload on these "neighbors" can greatly influence the behavior and
   performance of the VNF under test, but the test itself is invaluable
   to understand the impact of such a configuration.

   Another challenge is the usage of test devices (traffic generator,
   emulator) that share the same infrastructure as the VNF under test.
   This can create a situation as above, where the neighbor competes for
   resources with the VUT itself, which can invalidate test results.
   If a test architecture such as this is necessary (testing east-west
   traffic, for example), then care must be taken to configure the test
   devices such that they are isolated from the VUT in terms of allowed
   resources, and that they don't impact the VUT's ability to acquire
   resources to operate in all conditions.

   NFV offers new features that didn't exist as such previously, or
   modifies existing mechanisms. Examples of new features are dynamic
   scaling of VNFs and network services (NS), standardized acceleration
   mechanisms and the presence of the virtualization layer, which
   includes the vSwitch. An example of a mechanism that changes with NFV
   is how fault detection and fault recovery are handled. Fault recovery
   could now be handled by MANO by invoking mechanisms such as
   live migration or snapshots in order to recover the state of a VNF
   and restore operation quickly. While the end results are expected to
   be the same as before, since the mechanism is very different,
   rigorous testing is highly recommended to validate those results.

   Dynamic scaling of VNFs is a new concept in NFV. VNFs that require
   more resources will have them dynamically allocated on demand, and
   then subsequently released when not needed anymore. This is clearly a
   benefit arising from SDI. For each type of VNF, specific metrics will
   be used as input to conditions that will trigger a scaling operation,
   orchestrated by MANO. Testing this mechanism requires a methodology
   tailored to the specific operation of the VNF, in order to properly
   reach the monitored metrics and exercise the conditions leading to a
   scaling trigger. For example, a firewall VNF will be triggered for
   scaling on very different metrics than a 3GPP MME, because the two
   VNFs accomplish different functions. Since there will normally be a
   collection of metrics that are monitored in order to trigger a
   scaling operation, the testing methodology must be constructed in
   such a way as to address all combinations of those metrics. Metrics
   for a particular VNF may include sessions, session
   instantiations/second, throughput, etc. These metrics will be
   observed in relation to the given resources for the VNF.
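
   Addressing all combinations of the monitored metrics can be sketched
   as a Cartesian product over the load levels chosen for each metric.
   The metric names and load levels in the usage note are illustrative
   assumptions. A real methodology would derive them from the scaling
   conditions configured for the specific VNF.

```python
from itertools import product
from typing import Dict, Iterator, Sequence


def scaling_test_cases(
        levels: Dict[str, Sequence[float]]) -> Iterator[Dict[str, float]]:
    """Yield one test case per combination of metric load levels.

    `levels` maps a metric name to the load levels to exercise for it.
    Metric names are sorted so that the order of cases is reproducible.
    """
    names = sorted(levels)
    for combination in product(*(levels[name] for name in names)):
        yield dict(zip(names, combination))
```

   For instance, exercising two levels (below and above the trigger)
   for each of sessions, session instantiations per second, and
   throughput yields 2 x 2 x 2 = 8 test cases. The number of cases grows
   multiplicatively with each additional metric or level, which is one
   reason the methodology has to be tailored to the VNF.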

10. Programmable management

   The ability to automate a set of actions to be performed on the
   infrastructure, be it virtual or physical, is key to productivity
   increases following the application of DevOps principles. Previous
   sections in this document touched on different dimensions of

   -  Section 5 approached programmability in the context of developing
     new capabilities for monitoring and for dynamically setting
     configuration parameters of deployed monitoring functions

   -  Section 7 reflected on the need to determine the correctness of
     actions that are to be applied to the infrastructure as a result
     of executing a set of high-level instructions

   -  Section 8 considered programmability in the perspective of an
     interface to facilitate dynamic orchestration of troubleshooting
     steps towards building workflows and for reducing the manual steps
     required in troubleshooting processes

   We expect that programmable network management - along the lines of
   [RFC7426] - will draw more interest as we move forward. For example,
   in [I-D.unify-nfvrg-challenges], the authors identify the need for
   presenting programmable interfaces that accept instructions in a
   standards-supported manner for the Two-way Active Measurement
   Protocol (TWAMP). More specifically, an excellent
   example in this case is traffic measurements, which are extensively
   used today to determine SLA adherence as well as debug and
   troubleshoot pain points in service delivery. TWAMP is both widely
   implemented by all established vendors and deployed by most global
   operators. However, TWAMP management and control today relies solely
   on diverse and proprietary tools provided by the respective vendors
   of the equipment. For large, virtualized, and dynamically
   instantiated infrastructures where network functions are placed
   according to orchestration algorithms, proprietary mechanisms for
   managing TWAMP measurements have severe limitations. For example,
   today's TWAMP implementations are managed by vendor-specific,
   typically command-line interfaces (CLI), which can be scripted on a
   platform-by-platform basis. As a result, although the control and
   test measurement protocols are standardized, their respective
   management is not. This dramatically hinders the possibility of
   integrating such deployed functionality in the SP-DevOps concept. In
   this particular case, recent efforts in the IPPM WG
   [I-D.cmzrjp-ippm-twamp-yang] aim to define a standard TWAMP data
   model and effectively increase the programmability of TWAMP
   deployments in the future.

   Data center DevOps tools, such as those surveyed in [D4.1], developed
   proprietary methods for describing and interacting through interfaces
   with the managed infrastructure. Within certain communities, they
   became de-facto standards in the same way particular CLIs became
   de-facto standards for Internet professionals. Although open-source
   components and strong community involvement exist, the diversity
   of the new languages and interfaces creates a burden both for
   vendors, who must choose which ones to prioritize for support and
   then develop the functionality, and for operators, who must
   determine what fits best for the requirements of their systems.

11. Security Considerations

   DevOps principles are typically practiced within the context of a
   single organization, i.e., a single trust domain. Extending DevOps
   practices across strong organizational boundaries (e.g. between
   commercial organizations) requires consideration of additional threat
   models. Additional validation procedures may be required to ingest
   and accept code changes arising from outside an organization.

12. IANA Considerations

   This memo includes no request to IANA.

13. References

13.1. Informative References

   [NFVMANO] ETSI, "Network Function Virtualization (NFV) Management
             and Orchestration V0.6.1 (draft)", Jul. 2014

   [I-D.aldrin-sfc-oam-framework]   S. Aldrin, R. Pignataro, N. Akiya.
             "Service Function Chaining Operations, Administration and
             Maintenance Framework", draft-aldrin-sfc-oam-framework-02,
             (work in progress), July 2015.

   [I-D.lee-sfc-verification] S. Lee and M. Shin. "Service Function
             Chaining Verification", draft-lee-sfc-verification-00,
             (work in progress), February 2014.

   [RFC7426] E. Haleplidis (Ed.), K. Pentikousis (Ed.), S. Denazis, J.
             Hadi Salim, D. Meyer, and O. Koufopavlou, "Software Defined
             Networking (SDN):  Layers and Architecture Terminology",
             RFC 7426, January 2015

   [RFC7149] M. Boucadair and C. Jacquenet. "Software-Defined Networking:
             A Perspective from within a Service Provider Environment",
             RFC 7149, March 2014.

   [TR228]   TMForum Gap Analysis Related to MANO Work. TR228, May 2014

   [I-D.unify-nfvrg-challenges]  R. Szabo et al. "Unifying Carrier and
             Cloud Networks: Problem Statement and Challenges", draft-
             unify-nfvrg-challenges-03 (work in progress), October 2016

   [I-D.cmzrjp-ippm-twamp-yang]  Civil, R., Morton, A., Zheng, L.,
             Rahman, R., Jethanandani, M., and K. Pentikousis, "Two-Way
             Active Measurement Protocol (TWAMP) Data Model", draft-
             cmzrjp-ippm-twamp-yang-02 (work in progress), October 2015.

   [D4.1]    W. John et al. D4.1 Initial requirements for the SP-DevOps
             concept, universal node capabilities and proposed tools,
             August 2014.

   [SDNsurvey] D. Kreutz, F. M. V. Ramos, P. Verissimo, C. Esteve
             Rothenberg, S. Azodolmolky, S. Uhlig. "Software-Defined
             Networking: A Comprehensive Survey." To appear in
             proceedings of the IEEE, 2015.

   [DevOpsP] "DevOps, the IBM Approach" 2013. [Online].

   [Y1564]   ITU-T Recommendation Y.1564: Ethernet service activation
             test methodology, March 2011

   [CAP]     E. Brewer, "CAP twelve years later: How the "rules" have
             changed", IEEE Computer, vol.45, no.2, pp.23,29, Feb. 2012.

   [H2014]  N. Handigol, B. Heller, V. Jeyakumar, D. Mazieres, N.
             McKeown; "I Know What Your Packet Did Last Hop: Using
             Packet Histories to Troubleshoot Networks", In Proceedings
             of the 11th USENIX Symposium on Networked Systems Design
             and Implementation (NSDI 14), pp.71-95

   [W2011]  A. Wundsam, D. Levin, S. Seetharaman, A. Feldmann;
             "OFRewind: Enabling Record and Replay Troubleshooting for
             Networks". In Proceedings of the Usenix Annual Technical
             Conference (Usenix ATC '11), pp 327-340

   [S2010]  E. Al-Shaer and S. Al-Haj. "FlowChecker: configuration
             analysis and verification of federated Openflow
             infrastructures" In Proceedings of the 3rd ACM workshop on
             Assurable and usable security configuration (SafeConfig
             '10). Pp. 37-44

   [OSandS]  S. Wright, D. Druta, "Open Source and Standards: The Role
             of Open Source in the Dialogue between Research and
             Standardization" Globecom Workshops (GC Wkshps), 2014 ,
             pp.650,655, 8-12 Dec. 2014

   [C2015]  CFEngine. Online: http://cfengine.com/product/what-is-
             cfengine/, retrieved Sep 23, 2015.

   [P2015]  Puppet. Online: http://puppetlabs.com/puppet/what-is-puppet,
             retrieved Sep 23, 2015.

   [A2015]  Ansible. Online: http://docs.ansible.com/ , retrieved Sep
             23, 2015.

   [AK2015] Apache Kafka. Online:
             http://kafka.apache.org/documentation.html, retrieved Sep
             23, 2015.

   [S2015]  Splunk. Online: http://www.splunk.com/en_us/products/splunk-
             light.html , retrieved Sep 23, 2015.

   [K2014]  J. Kreps. Benchmarking Apache Kafka: 2 Million Writes Per
             Second (On Three Cheap Machines). Online:
             retrieved Sep 23, 2015.

   [R2015]  RabbitMQ. Online: https://www.rabbitmq.com/ , retrieved Oct
             13, 2015

   [IFA014] ETSI, Network Functions Virtualisation (NFV); Management and
             Orchestration Network Service Templates Specification ,
             DGS/NFV-IFA014, Work In Progress

   [IFA011] ETSI, Network Functions Virtualisation (NFV); Management and
             Orchestration; VNF Packaging Specification, DGS/NFV-IFA011,
             Work in Progress

   [NFVSWA] ETSI, Network functions Virtualisation; Virtual Network
             Functions Architecture,  GS NFV-SWA 001 v1.1.1 (2014)

   [Z2015]  ZeroMQ. Online: http://zeromq.org/ , retrieved Oct 13, 2015

14. Contributors to earlier versions

   J. Kim (Deutsche Telekom), S. Sharma (iMinds), I. Papafili (OTE)

15. Acknowledgments

   The research leading to these results has received funding from the
   European Union Seventh Framework Programme FP7/2007-2013 under grant
   agreement no. 619609 - the UNIFY project. The views expressed here
   are those of the authors only. The European Commission is not liable
   for any use that may be made of the information in this document.

   We would like to thank in particular the UNIFY WP4 contributors, the
   internal reviewers of the UNIFY WP4 deliverables and Russ White and
   Ramki Krishnan for their suggestions.

   This document was prepared using 2-Word-v2.0.template.dot.

16. Authors' Addresses

   Catalin Meirosu
   Ericsson Research
   S-16480 Stockholm, Sweden
   Email: catalin.meirosu@ericsson.com

   Antonio Manzalini
   Telecom Italia
   Via Reiss Romoli, 274
   10148 - Torino, Italy
   Email: antonio.manzalini@telecomitalia.it

   Rebecca Steinert
   SICS Swedish ICT AB
   Box 1263, SE-16429 Kista, Sweden
   Email: rebste@sics.se

   Guido Marchetto
   Politecnico di Torino
   Corso Duca degli Abruzzi 24
   10129 - Torino, Italy
   Email: guido.marchetto@polito.it

   Kostas Pentikousis
   Travelping GmbH
   Koernerstrasse 7-10
   Berlin 10785
   Email: k.pentikousis@travelping.com

   Steven Wright
   AT&T Services Inc.
   1057 Lenox Park Blvd NE, STE 4D28
   Atlanta, GA 30319
   Email: sw3588@att.com

   Pierre Lynch
   800 Perimeter Park Drive, Suite A
   Morrisville, NC 27560
   Email: plynch@ixiacom.com

   Wolfgang John
   Ericsson Research
   S-16480 Stockholm, Sweden
   Email: wolfgang.john@ericsson.com
