--- 1/draft-ietf-ccamp-alarm-module-02.txt 2018-09-20 05:13:21.987634816 -0700 +++ 2/draft-ietf-ccamp-alarm-module-03.txt 2018-09-20 05:13:22.123638087 -0700 @@ -1,19 +1,19 @@ Network Working Group S. Vallin Internet-Draft Stefan Vallin AB Intended status: Standards Track M. Bjorklund -Expires: February 9, 2019 Cisco - August 8, 2018 +Expires: March 24, 2019 Cisco + September 20, 2018 YANG Alarm Module - draft-ietf-ccamp-alarm-module-02 + draft-ietf-ccamp-alarm-module-03 Abstract This document defines a YANG module for alarm management. It includes functions for alarm list management, alarm shelving and notifications to inform management systems. There are also RPCs to manage the operator state of an alarm and administrative alarm procedures. The module carefully maps to relevant alarm standards. Status of This Memo @@ -24,21 +24,21 @@ Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on February 9, 2019. + This Internet-Draft will expire on March 24, 2019. Copyright Notice Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents @@ -52,69 +52,71 @@ 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Terminology and Notation . . . . . . . . . . . . . . . . 3 2. Objectives . . . . . . . . . . . 
. . . . . . . . . . . . . . 4 3. Alarm Module Concepts . . . . . . . . . . . . . . . . . . . . 5 3.1. Alarm Definition . . . . . . . . . . . . . . . . . . . . 5 3.2. Alarm Type . . . . . . . . . . . . . . . . . . . . . . . 5 3.3. Identifying the Alarming Resource . . . . . . . . . . . . 7 3.4. Identifying Alarm Instances . . . . . . . . . . . . . . . 8 3.5. Alarm Life-Cycle . . . . . . . . . . . . . . . . . . . . 8 - 3.5.1. Resource Alarm Life-Cycle . . . . . . . . . . . . . . 8 - 3.5.2. Operator Alarm Life-cycle . . . . . . . . . . . . . . 9 + 3.5.1. Resource Alarm Life-Cycle . . . . . . . . . . . . . . 9 + 3.5.2. Operator Alarm Life-cycle . . . . . . . . . . . . . . 10 3.5.3. Administrative Alarm Life-Cycle . . . . . . . . . . . 10 3.6. Root Cause, Impacted Resources and Related Alarms . . . . 10 3.7. Alarm Shelving . . . . . . . . . . . . . . . . . . . . . 11 3.8. Alarm Profiles . . . . . . . . . . . . . . . . . . . . . 11 - 4. Alarm Data Model . . . . . . . . . . . . . . . . . . . . . . 11 - 4.1. Alarm Control . . . . . . . . . . . . . . . . . . . . . . 12 - 4.1.1. Alarm Shelving . . . . . . . . . . . . . . . . . . . 12 - 4.2. Alarm Inventory . . . . . . . . . . . . . . . . . . . . . 12 - 4.3. Alarm Summary . . . . . . . . . . . . . . . . . . . . . . 13 - 4.4. The Alarm List . . . . . . . . . . . . . . . . . . . . . 13 - 4.5. The Shelved Alarms List . . . . . . . . . . . . . . . . . 13 - 4.6. Alarm Profiles . . . . . . . . . . . . . . . . . . . . . 14 - 4.7. RPCs and Actions . . . . . . . . . . . . . . . . . . . . 14 - 4.8. Notifications . . . . . . . . . . . . . . . . . . . . . . 14 - 5. Alarm YANG Module . . . . . . . . . . . . . . . . . . . . . . 14 - 6. X.733 Extensions . . . . . . . . . . . . . . . . . . . . . . 44 - 7. The X.733 Mapping Module . . . . . . . . . . . . . . . . . . 44 - 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 55 - 9. Security Considerations . . . . . . . . . . . . . . . . . . . 56 - 10. Acknowledgements . . . . . . . 
. . . . . . . . . . . . . . . 57 - 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 57 - 11.1. Normative References . . . . . . . . . . . . . . . . . . 57 - 11.2. Informative References . . . . . . . . . . . . . . . . . 58 - Appendix A. Vendor-specific Alarm-Types Example . . . . . . . . 59 - Appendix B. Alarm Inventory Example . . . . . . . . . . . . . . 60 - Appendix C. Alarm List Example . . . . . . . . . . . . . . . . . 61 - Appendix D. Alarm Shelving Example . . . . . . . . . . . . . . . 62 - Appendix E. X.733 Mapping Example . . . . . . . . . . . . . . . 63 - Appendix F. Background and Usability Requirements . . . . . . . 64 - F.1. Alarm Concepts . . . . . . . . . . . . . . . . . . . . . 64 - F.1.1. Alarm type . . . . . . . . . . . . . . . . . . . . . 64 - F.2. Usability Requirements . . . . . . . . . . . . . . . . . 65 - - Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 68 + 4. Alarm Data Model . . . . . . . . . . . . . . . . . . . . . . 12 + 4.1. Alarm Control . . . . . . . . . . . . . . . . . . . . . . 13 + 4.1.1. Alarm Shelving . . . . . . . . . . . . . . . . . . . 13 + 4.2. Alarm Inventory . . . . . . . . . . . . . . . . . . . . . 13 + 4.3. Alarm Summary . . . . . . . . . . . . . . . . . . . . . . 14 + 4.4. The Alarm List . . . . . . . . . . . . . . . . . . . . . 15 + 4.5. The Shelved Alarms List . . . . . . . . . . . . . . . . . 17 + 4.6. Alarm Profiles . . . . . . . . . . . . . . . . . . . . . 17 + 4.7. RPCs and Actions . . . . . . . . . . . . . . . . . . . . 17 + 4.8. Notifications . . . . . . . . . . . . . . . . . . . . . . 17 + 5. Alarm YANG Module . . . . . . . . . . . . . . . . . . . . . . 18 + 6. X.733 Extensions . . . . . . . . . . . . . . . . . . . . . . 47 + 7. The X.733 Mapping Module . . . . . . . . . . . . . . . . . . 48 + 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 59 + 9. Security Considerations . . . . . . . . . . . . . . . . . . . 59 + 10. Acknowledgements . . . . . . . . . . . . . . 
. . . . . . . . 60 + 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 60 + 11.1. Normative References . . . . . . . . . . . . . . . . . . 60 + 11.2. Informative References . . . . . . . . . . . . . . . . . 61 + Appendix A. Vendor-specific Alarm-Types Example . . . . . . . . 62 + Appendix B. Alarm Inventory Example . . . . . . . . . . . . . . 63 + Appendix C. Alarm List Example . . . . . . . . . . . . . . . . . 64 + Appendix D. Alarm Shelving Example . . . . . . . . . . . . . . . 65 + Appendix E. X.733 Mapping Example . . . . . . . . . . . . . . . 66 + Appendix F. Background and Usability Requirements . . . . . . . 67 + F.1. Alarm Concepts . . . . . . . . . . . . . . . . . . . . . 67 + F.1.1. Alarm type . . . . . . . . . . . . . . . . . . . . . 67 + F.2. Relationships to other alarm standards . . . . . . . . . 68 + F.2.1. Alarm definition . . . . . . . . . . . . . . . . . . 68 + F.2.2. Data model . . . . . . . . . . . . . . . . . . . . . 70 + F.3. Usability Requirements . . . . . . . . . . . . . . . . . 72 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 75 1. Introduction This document defines a YANG [RFC7950] module for alarm management. - The purpose is to define a standardised alarm interface for network + The purpose is to define a standardized alarm interface for network devices that can be easily integrated into management applications. The model is also applicable as a northbound alarm interface in the management applications. Alarm monitoring is a fundamental part of monitoring the network. Raw alarms from devices do not always tell the status of the network services or necessarily point to the root cause. However, being able - to feed alarms to the alarm management application in a standardised + to feed alarms to the alarm management application in a standardized format is a starting point for performing higher level network assurance tasks. 
The design of the module is based on experience from using and implementing available alarm standards from ITU [X.733], 3GPP [ALARMIRP] and ANSI [ISA182]. 1.1. Terminology and Notation The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", @@ -148,20 +150,24 @@ for example: an interface, a process. o Alarm Instance: The alarm state for a specific resource and alarm type. For example (GigabitEthernet0/15, link-alarm). An entry in the alarm list. o Alarm Inventory: A list of all possible alarm types on a system. o Alarm Shelving: Blocking alarms according to specific criteria. + o Corrective Action: An action taken by an operator or automation + routine in order to minimize the impact of the alarm or to resolve + the root cause. + o Management System: The alarm management application that consumes the alarms, i.e., acts as a client. o System: The system that implements this YANG alarm module, i.e., acts as a server. This corresponds to a network device or a management application that provides a north-bound alarm interface. Tree diagrams used in this document follow the notation defined in [RFC8340]. @@ -213,21 +219,21 @@ There are two main things to remember from this definition: 1. the definition focuses on leaving out events and logging information in general. Alarms should only be used for undesired states that require action. 2. the definition also focuses on alarms as a state on a resource, not the notifications that report the state changes. See Appendix F for more motivation and consequences around this - definition. + definition as well as how it relates to other alarm standards. 3.2. Alarm Type This document defines an alarm type with an alarm type id and an alarm type qualifier. The alarm type id is modeled as a YANG identity. With YANG identities, new alarm types can be defined in a distributed fashion. YANG identities are hierarchical, which means that a hierarchy of alarm types can be defined.
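As a rough illustration of what a hierarchy of alarm types allows, the Python sketch below models identity derivation with a plain dictionary and checks whether one alarm type is derived from another. The identity names, the BASES table and the function derived_from_or_self are assumptions made for this example only; they are not defined by the alarm module, and the sketch simplifies YANG 1.1 by giving every identity a single base.

```python
# Hypothetical table mapping each identity to its single base identity.
# The names are illustrative only and are not defined by ietf-alarms.
BASES = {
    "vendor-alarms": "alarm-type",
    "communications-alarm": "vendor-alarms",
    "link-alarm": "communications-alarm",
}


def derived_from_or_self(identity, ancestor):
    """Return True if identity equals ancestor or derives from it."""
    current = identity
    while current is not None:
        if current == ancestor:
            return True
        # Walk one step up the hierarchy; None means the top was reached.
        current = BASES.get(current)
    return False
```

A management system could use a check of this kind to apply one alarm procedure to every alarm type that sits below a given base identity.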
@@ -237,33 +243,34 @@ The use of YANG identities means that all possible alarms are identified at design time. This explicit declaration of alarm types makes it easier to allow for alarm qualification reviews and preparation of alarm actions and documentation. There are occasions where the alarm types are not known at design time. For example, a system with digital inputs that allows users to connect detectors (e.g., smoke detector) to the inputs. In this case it is a configuration action that says that certain connectors - are fire alarms for example. A potential drawback of this is that - there is a big risk that alarm operators will receive alarm types as - a surprise, they do not know how to resolve the problem since a - defined alarm procedure does not necessarily exist. To avoid this - risk the system MUST publish all possible alarm types in the alarm - inventory, see Section 4.2. + are fire alarms for example. In order to allow for dynamic addition of alarm types the alarm - module also allows for further qualification of the identity based - alarm type using a string. + module allows for further qualification of the identity based alarm + type using a string. A potential drawback of this is that there is a + big risk that alarm operators will receive alarm types as a surprise; + they do not know how to resolve the problem since a defined alarm + procedure does not necessarily exist. To avoid this risk the system + MUST publish all possible alarm types in the alarm inventory, see + Section 4.2. - A vendor or standard can then define their own alarm-type hierarchy. - The example below shows a hierarchy based on X.733 event types: + A vendor or standards organization can define their own alarm-type + hierarchy.
The example below shows a hierarchy based on X.733 event + types: import ietf-alarms { prefix al; } identity vendor-alarms { base al:alarm-type; } identity communications-alarm { base vendor-alarms; } @@ -321,20 +328,34 @@ A server SHOULD strive to minimize the number of dynamically defined alarm types. 3.3. Identifying the Alarming Resource It is of vital importance to be able to refer to the alarming resource. This reference must be as fine-grained as possible. If the alarming resource exists in the data tree then an instance- identifier MUST be used with the full path to the object. + When the module is used in a controller/orchestrator/manager the + original device resource identification can be modified to include + the device in the path. The details depend on how devices are + identified, and are out of scope for this specification. + + Example: + + The original device alarm might identify the resource as + "/dev:interfaces/dev:interface[dev:name='FastEthernet1/0']". + + The resource identification in the manager could look something + like: "/mgr:devices/mgr:device[mgr:name='xyz123']/dev:interfaces/ + dev:interface[dev:name='FastEthernet1/0']" + This module also allows for alternate naming of the alarming resource if it is not available in the data tree. 3.4. Identifying Alarm Instances A primary goal of this alarm module is to remove any ambiguity in how alarm notifications are mapped to an update of an alarm instance. X.733 and especially 3GPP were not really clear on this point. This YANG alarm module states that the tuple (resource, alarm type identifier, alarm type qualifier) corresponds to a single alarm @@ -507,36 +528,80 @@ The fundamental parts of the data model are the "alarm-list" with associated notifications and the "alarm-inventory" list of all possible alarm types. These MUST be implemented by a system. 
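As an illustration of the identification rule in Section 3.4, the Python sketch below keeps an alarm list keyed by the tuple (resource, alarm type identifier, alarm type qualifier), so that repeated notifications for the same tuple update a single alarm instance. The Alarm class, the alarm_list dictionary and the raise_alarm and clear_alarm helpers are assumptions made for this example and are not part of the module; the "link-alarm" type and the empty qualifier are likewise illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Key of an alarm instance: (resource, alarm-type-id, alarm-type-qualifier).
AlarmKey = Tuple[str, str, str]


@dataclass
class Alarm:
    perceived_severity: str
    is_cleared: bool = False
    status_changes: List[str] = field(default_factory=list)


# Hypothetical in-memory alarm list; one entry per key tuple.
alarm_list: Dict[AlarmKey, Alarm] = {}


def raise_alarm(resource, alarm_type_id, alarm_type_qualifier, severity):
    """Create the alarm instance or update it if the key already exists."""
    key = (resource, alarm_type_id, alarm_type_qualifier)
    alarm = alarm_list.get(key)
    if alarm is None:
        alarm = Alarm(perceived_severity=severity)
        alarm_list[key] = alarm
    alarm.perceived_severity = severity
    alarm.is_cleared = False
    alarm.status_changes.append(severity)
    return alarm


def clear_alarm(resource, alarm_type_id, alarm_type_qualifier):
    """Mark the alarm as cleared; the last severity is kept unchanged."""
    alarm = alarm_list[(resource, alarm_type_id, alarm_type_qualifier)]
    alarm.is_cleared = True
    alarm.status_changes.append("cleared")
    return alarm


RESOURCE = "/dev:interfaces/dev:interface[dev:name='FastEthernet1/0']"
raise_alarm(RESOURCE, "link-alarm", "", "minor")
raise_alarm(RESOURCE, "link-alarm", "", "major")
clear_alarm(RESOURCE, "link-alarm", "")
```

After the three calls the list still holds one alarm for the interface, with its severity history recorded on that single instance.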
The rest of the data model is made conditional with the YANG features "operator-actions", "alarm-shelving", "alarm-history", "alarm-summary", "alarm-profile", and "severity-assignment". The data model has the following overall structure: + +--rw control + | +--rw max-alarm-status-changes? union + | +--rw (notify-status-changes)? + | | ... + | +--rw alarm-shelving {alarm-shelving}? + | ... + +--ro alarm-inventory + | +--ro alarm-type* [alarm-type-id alarm-type-qualifier] + | ... + +--ro summary {alarm-summary}? + | +--ro alarm-summary* [severity] + | | ... + | +--ro shelves-active? empty {alarm-shelving}? + +--ro alarm-list + | +--ro number-of-alarms? yang:gauge32 + | +--ro last-changed? yang:date-and-time + | +--ro alarm* [resource alarm-type-id alarm-type-qualifier] + | ... + +--ro shelved-alarms {alarm-shelving}? + | +--ro number-of-shelved-alarms? yang:gauge32 + | +--ro alarm-shelf-last-changed? yang:date-and-time + | +--ro shelved-alarm* + | [resource alarm-type-id alarm-type-qualifier] + | ... + +--rw alarm-profile* + [alarm-type-id alarm-type-qualifier-match resource] + {alarm-profile}? + +--rw alarm-type-id al:alarm-type-id + +--rw alarm-type-qualifier-match string + +--rw resource al:resource-match + +--rw description string + +--rw alarm-severity-assignment-profile + {severity-assignment}? + ... + 4.1. Alarm Control The "/alarms/control/notify-status-changes" choice controls if notifications are sent for all state changes, only raise and clear, or only notifications more severe than a configured level. This feature in combination with alarm shelving corresponds to the ITU Alarm Report Control functionality. Every alarm has a list of status changes; this is a circular list. The length of this list is controlled by "/alarms/control/max-alarm-status-changes". 4.1.1. Alarm Shelving The shelving control tree is shown below: + +--rw control + +--rw alarm-shelving {alarm-shelving}?
+ +--rw shelf* [name] + +--rw name string + +--rw resource* resource-match + +--rw alarm-type-id? alarm-type-id + +--rw alarm-type-qualifier-match? string + +--rw description? string + Shelved alarms are shown in a dedicated shelved alarm list. The instrumentation MUST move shelved alarms from the alarm list (/alarms/alarm-list) to the shelved alarm list (/alarms/shelved- alarms/). Shelved alarms do not generate any notifications. When the shelving criteria is removed or changed the alarm list MUST be updated to the correct actual state of the alarms. Shelving and unshelving can only be performed by editing the shelf configuration. It cannot be performed on individual alarms. The server will add an operator state indicating that the alarm was @@ -564,32 +629,99 @@ the alarm type qualifier MUST populate this list. The optional leaf-list "resource" in the alarm inventory enables the system to publish for which resources a given alarm type may appear. A server MUST implement the alarm inventory in order to enable controlled alarm procedures in the client. The alarm inventory tree is shown below: + +--ro alarm-inventory + +--ro alarm-type* [alarm-type-id alarm-type-qualifier] + +--ro alarm-type-id alarm-type-id + +--ro alarm-type-qualifier alarm-type-qualifier + +--ro resource* resource-match + +--ro has-clear boolean + +--ro severity-levels* severity + +--ro description string + 4.3. Alarm Summary - The alarm summary list summarises alarms per severity; how many + The alarm summary list summarizes alarms per severity; how many cleared, cleared and closed, and closed. It also gives an indication if there are shelved alarms. The alarm summary tree is shown below: + +--ro summary {alarm-summary}? + +--ro alarm-summary* [severity] + | +--ro severity severity + | +--ro total? yang:gauge32 + | +--ro cleared? yang:gauge32 + | +--ro cleared-not-closed? yang:gauge32 + | | {operator-actions}? + | +--ro cleared-closed? yang:gauge32 + | | {operator-actions}? 
+ | +--ro not-cleared-closed? yang:gauge32 + | | {operator-actions}? + | +--ro not-cleared-not-closed? yang:gauge32 + | {operator-actions}? + +--ro shelves-active? empty {alarm-shelving}? + 4.4. The Alarm List The alarm list (/alarms/alarm-list) is a function from (resource, - alarm type, alarm type qualifier) to the current alarm state. + alarm type, alarm type qualifier) to the current composite alarm + state. The composite state includes states for the resource life- + cycle such as severity, clearance flag and operator states such as + acknowledgment. + + +--ro alarm-list + +--ro number-of-alarms? yang:gauge32 + +--ro last-changed? yang:date-and-time + +--ro alarm* [resource alarm-type-id alarm-type-qualifier] + +--ro resource resource + +--ro alarm-type-id alarm-type-id + +--ro alarm-type-qualifier alarm-type-qualifier + +--ro alt-resource* resource + +--ro related-alarm* + | [resource alarm-type-id alarm-type-qualifier] + | +--ro resource + | | -> /alarms/alarm-list/alarm/resource + | +--ro alarm-type-id leafref + | +--ro alarm-type-qualifier leafref + +--ro impacted-resource* resource + +--ro root-cause-resource* resource + +--ro time-created yang:date-and-time + +--ro is-cleared boolean + +--ro last-changed yang:date-and-time + +--ro perceived-severity severity + +--ro alarm-text alarm-text + +--ro status-change* [time] {alarm-history}? + | +--ro time yang:date-and-time + | +--ro perceived-severity severity-with-clear + | +--ro alarm-text alarm-text + +--ro operator-state-change* [time] {operator-actions}? + | +--ro time yang:date-and-time + | +--ro operator string + | +--ro state operator-state + | +--ro text? string + +---x set-operator-state {operator-actions}? + | +---w input + | +---w state writable-operator-state + | +---w text? string + +---n operator-action {operator-actions}? + +-- time yang:date-and-time + +-- operator string + +-- state operator-state + +-- text? 
string Every alarm has three important states: the resource clearance state "is-cleared", the severity "perceived-severity" and the operator state available in the operator state change list. In order to see the alarm history the resource state changes are available in the "status-change" list and the operator history is available in the "operator-state-change" list. 4.5. The Shelved Alarms List @@ -600,20 +732,31 @@ 4.6. Alarm Profiles Alarm profiles (/alarms/alarm-profile/) is a list of configurable alarm types. The list supports configurable alarm severity levels in the container "alarm-severity-assignment-profile". If an alarm matches the configured alarm type it MUST use the configured severity level(s) instead of the system default. This configuration MUST also be represented in the alarm inventory. + +--rw alarm-profile* + [alarm-type-id alarm-type-qualifier-match resource] + {alarm-profile}? + +--rw alarm-type-id al:alarm-type-id + +--rw alarm-type-qualifier-match string + +--rw resource al:resource-match + +--rw description string + +--rw alarm-severity-assignment-profile + {severity-assignment}? + +--rw severity-levels* al:severity + 4.7. RPCs and Actions The alarm module supports RPCs and actions to manage the alarms: "purge-alarms" (rpc): delete alarms according to specific criteria, for example all cleared alarms older than a specific date. "compress-alarms" (rpc): compress the status-change list for the alarms. @@ -631,21 +774,21 @@ operator state on an alarm, like acknowledge. If the alarm inventory is changed, for example a new card type is inserted, a notification will tell the management application that new alarm types are available. 5. Alarm YANG Module This YANG module references [RFC6991].
- file "ietf-alarms@2018-08-08.yang" + file "ietf-alarms@2018-09-20.yang" module ietf-alarms { yang-version 1.1; namespace "urn:ietf:params:xml:ns:yang:ietf-alarms"; prefix al; import ietf-yang-types { prefix yang; reference "RFC 6991: Common YANG Data Types."; } @@ -754,21 +897,21 @@ The key words 'MUST', 'MUST NOT', 'REQUIRED', 'SHALL', 'SHALL NOT', 'SHOULD', 'SHOULD NOT', 'RECOMMENDED', 'MAY', and 'OPTIONAL' in the module text are to be interpreted as described in RFC 2119 (https://tools.ietf.org/html/rfc2119). This version of this YANG module is part of RFC XXXX (https://tools.ietf.org/html/rfcXXXX); see the RFC itself for full legal notices."; - revision 2018-08-08 { + revision 2018-09-20 { description "Initial revision."; reference "RFC XXXX: YANG Alarm Module"; } /* * Features */ feature operator-actions { @@ -2073,21 +2211,21 @@ mapping provided by the system is in conflict with other management systems or not considered correct. Note that the IETF Alarm Module term 'resource' is synonymous to the ITU term 'managed object'. 7. The X.733 Mapping Module This YANG module references [X.733] and [X.736]. - file "ietf-alarms-x733@2018-08-08.yang" + file "ietf-alarms-x733@2018-09-20.yang" module ietf-alarms-x733 { yang-version 1.1; namespace "urn:ietf:params:xml:ns:yang:ietf-alarms-x733"; prefix x733; import ietf-alarms { prefix al; } import ietf-yang-types { prefix yang; @@ -2128,24 +2266,23 @@ The module uses an integer and a corresponding string for probable cause instead of a globally defined enumeration, in order to be able to manage conflicting enumeration definitions. 
A single globally defined enumeration is challenging to maintain."; reference "ITU Recommendation X.733: Information Technology - Open Systems Interconnection - System Management: Alarm Reporting Function"; - revision 2018-08-08 { + revision 2018-09-20 { description "Initial revision."; - reference "RFC XXXX: YANG Alarm Module"; } /* * Features */ feature configure-x733-mapping { description "The system supports configurable X733 mapping from @@ -2747,20 +2883,25 @@ "The semantics of alarm definitions: enabling systematic reasoning about alarms. International Journal of Network Management, Volume 22, Issue 3, John Wiley and Sons, Ltd, http://dx.doi.org/10.1002/nem.800", March 2012. [EEMUA] EEMUA Publication No. 191 Engineering Equipment and Materials Users Association, London, 2 edition., "Alarm Systems: A Guide to Design, Management and Procurement.", 2007. + [G.7710] ITU-T, "SERIES G: TRANSMISSION SYSTEMS AND MEDIA, DIGITAL + SYSTEMS AND NETWORKS Data over Transport - Generic aspects + - Transport network control aspects. Common equipment + management function requirements", 2012. + [ISA182] International Society of Automation,ISA, "ANSI/ISA- 18.2-2009 Management of Alarm Systems for the Process Industries", 2009. [RFC3877] Chisholm, S. and D. Romascanu, "Alarm Management Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877, September 2004, . [RFC8340] Bjorklund, M. and L. Berger, Ed., "YANG Tree Diagrams", BCP 215, RFC 8340, DOI 10.17487/RFC8340, March 2018, @@ -2958,52 +3099,52 @@ Appendix F. Background and Usability Requirements This section gives background information regarding design choices in the alarm module. It also defines usability requirements for alarms. Alarm usability is important for an alarm interface. A data-model will help in defining the format but if the actual alarms are of low value we have not gained the goal of alarm management. 
- The telecommunication domain has standardised an alarm interface in + The telecommunication domain has standardized an alarm interface in ITU-T X.733 [X.733]. This continued in mobile networks within the - 3GPP organisation [ALARMIRP]. Although SNMP is the dominant - mechanism for monitoring devices, IETF did not early on standardise + 3GPP organization [ALARMIRP]. Although SNMP is the dominant + mechanism for monitoring devices, IETF did not early on standardize an alarm MIB. Instead, management systems interpreted the enterprise specific traps per MIB and device to build an alarm list. When finally The Alarm MIB [RFC3877] was published, it had to address the existence of enterprise traps and map these into alarms. This requirement led to a MIB that is not always easy to use. F.1. Alarm Concepts There are two misconceptions regarding alarms and alarm interfaces that are important to sort out. The first problem is that alarms are mixed with events in general. Alarms MUST correspond to an undesirable state that needs corrective action. Many implementations of alarm interfaces do not adhere to this principle and just send events in general. In order to qualify as an alarm, there must exist a corrective action. If that is not true, it is an event that can go into logs. - The other misconception is that the term "alarm" refers to the - notification itself. Rather, an alarm is a state of a resource in - the system. The alarm notifications report state changes of the - alarm, such as alarm raise and alarm clear. - "One of the most important principles of alarm management is that an alarm requires an action. This means that if the operator does not need to respond to an alarm (because unacceptable consequences do not occur), then it is not an alarm. Following this cardinal rule will help eliminate many potential alarm management issues." [ISA182] + The other misconception is that the term "alarm" refers to the + notification itself. 
Rather, an alarm is a state of a resource in + the system. The alarm notifications report state changes of the + alarm, such as alarm raise and alarm clear. + F.1.1. Alarm type Since every alarm has a corresponding corrective action, a vendor can prepare a list of available alarms and their corrective actions. We use the term "alarm type" to refer to every possible alarm that could be active in the system. Alarm types are also fundamental in order to provide a state-based alarm list. The alarm list correlates alarm state changes for the same alarm type and the same resource into one alarm. @@ -3011,159 +3152,341 @@ Different alarm interfaces use different mechanisms to define alarm types, ranging from simple error numbers to more advanced mechanisms like the X.733 triplet of event type, probable cause and specific problem. A common misunderstanding is that individual alarm notifications are alarm types. This is not correct; e.g., "link-up" and "link-down" are two notifications reporting different states for the same alarm type, "link-alarm". -F.2. Usability Requirements +F.2. Relationships to other alarm standards - Common alarm problems and the cause of the problems are summarised in - Table 1. This summary is adopted to networking based on the ISA + This section briefly describes how this alarm module relates to other + relevant alarm standards. It covers the definition of the concept of + an alarm and the data models of the referenced alarm standards. + +F.2.1. Alarm definition + + The table below summarizes relevant definitions of the term "alarm". + + +------------+---------------------------+--------------------------+ + | Standard | Definition | Comment | + +------------+---------------------------+--------------------------+ + | X.733 | error: A deviation of a | The X.733 alarm | + | [X.733] | system from normal | definition is focused on | + | | operation. fault: The | the notification as such | + | | physical or algorithmic | and not the state.
It | + | | cause of a malfunction. | also uses the basic | + | | Faults manifest | criteria of deviation | + | | themselves as errors. | from normal condition. | + | | alarm: A notification, of | There is no requirement | + | | the form defined by this | for an operation action | + | | function, of a specific | to be required. | + | | event. An alarm may or | | + | | may not represent an | | + | | error. | | + | | | | + | G.7710 | Alarms are indications | The G.7710 definition is | + | [G.7710] | that are automatically | close to the original | + | | generated by an NE as a | X.733 definition. | + | | result of the declaration | | + | | of a failure. | | + | | | | + | Alarm MIB | Alarm: Persistent | RFC 3877 defines alarm | + | [RFC3877] | indication of a fault. | referring back to "a | + | | Fault: Lasting error or | deviation from normal | + | | warning condition. | operation". This is | + | | Error: A deviation of a | problematic, since this | + | | system from normal | might not require an | + | | operation. | operator action. The | + | | | alarm MIB is state | + | | | oriented rather than | + | | | notification oriented, | + | | | an alarm is a "lasting | + | | | condition", not a | + | | | discrete notification | + | | | reporting about a | + | | | condition state change. | + | | | | + | ISA | Alarm: An audible and/or | The ISA standard adds an | + | [ISA182] | visible means of | important requirement to | + | | indicating to the | the "deviation from | + | | operator an equipment | normal condition state"; | + | | malfunction, process | requiring a response. | + | | deviation or abnormal | | + | | condition requiring a | | + | | response. | | + | | | | + | EEMUA | An alarm is an event to | This is the foundation | + | [EEMUA] | which an operator must | for the definition of | + | | knowingly react,respond, | alarm in this document. | + | | and acknowledge - not | It focuses on the core | + | | simply acknowledge and | criteria that an action | + | | ignore. 
| is really needed. | + | | | | + | 3GPP Alarm | 3GPP v15: An alarm | The latest 3GPP Alarm | + | IRP | signifies an undesired | IRP version uses | + | [ALARMIRP] | condition of a resource | literally the same alarm | + | | (e.g. network element, | definition as this alarm | + | | link) for which an | module. It is worth | + | | operator action is | noting that earlier | + | | required. It emphasizes a | versions used a | + | | key requirement that | definition not requiring | + | | operators [...] should | an operator action and | + | | not be informed about an | the more broad | + | | undesired condition | definition of deviation | + | | unless it requires | from normal condition. | + | | operator action. 3GPP | The earlier version also | + | | v12: alarm: abnormal | defined an alarm as a | + | | network entity condition, | special case of "event". | + | | which categorizes an | | + | | event as a fault. fault: | | + | | a deviation of a system | | + | | from normal operation, | | + | | which may result in the | | + | | loss of operational | | + | | capabilities [...] | | + +------------+---------------------------+--------------------------+ + + Table 1: Definition of alarm in standards + + The definition of alarm has evolved from a focus on events + reporting a deviation from normal operation towards a definition of an + undesired *state* which *requires an operator action*. + +F.2.2. Data model + + This section describes how this YANG alarm module relates to other + standard data models. Note that we cover other data models for + alarm interfaces, not other standards such as SDO-specific + alarms. + +F.2.2.1. X.733 + + X.733 has acted as a base for several alarm data models over the + years. The YANG alarm module differs in the following ways: + + X.733 models the alarm list as a list of notifications.
+      The YANG alarm module defines the alarm list as the current alarm
+      states for the resources, which is generated from the state change
+      reporting notifications.
+
+      In X.733 an alarm can have the severity level clear. In the YANG
+      alarm module "clear" is not a severity level; it is a separate
+      state of the alarm. An alarm can, for example, have the following
+      states: (major, cleared), (minor, not cleared).
+
+      X.733 uses a flat, globally defined, enumerated "probable cause"
+      to identify alarm types. This alarm module uses a hierarchical
+      YANG identity, alarm-type. This enables delegation of alarm types
+      within organizations. It also lets management reason about
+      "abstract" alarm-types corresponding to base identities, see
+      Section 3.2.
+
+      The YANG alarm module has not included the majority of the X.733
+      alarm attributes. Rather, these are defined in an augmenting
+      module if "strict" X.733 compliance is needed.
+
+F.2.2.2. RFC3877, the Alarm MIB
+
+   The MIB in RFC3877 takes a different approach: rather than defining a
+   concrete data-model for alarms, it defines a model to map existing
+   SNMP managed-objects and notifications into alarm states and alarm
+   notifications. This was necessary since MIBs were already defined
+   with both managed objects and notifications indicating alarms, for
+   example linkUp and linkDown notifications in combination with
+   ifAdminState and ifOperState. So RFC3877 cannot really be compared
+   to the alarm YANG module in that sense.
+
+   The Alarm MIB maps existing MIB definitions into alarms in the
+   alarmModelTable. The upside of that is that an SNMP manager can at
+   runtime read the possible alarm types. This corresponds to the
+   alarmInventory in the alarm YANG module.
+
+F.2.2.3. 3GPP Alarm IRP
+
+   The 3GPP Alarm IRP is an evolution of X.733. Main differences
+   between the alarm YANG module and 3GPP are:
+
+      3GPP keeps the majority of the X.733 attributes; the alarm YANG
+      module does not.
+
+      3GPP introduced overlapping and possibly conflicting keys for
+      alarms: alarmId and (managed object, event type, probable cause,
+      specific problem). (See Annex C in [X.733], Example 3.) In the
+      YANG alarm module the key for identifying an alarm instance is
+      clearly defined by (resource, alarm-type, alarm-type-qualifier).
+      See also Section 3.4 for more information.
+
+      The alarm YANG module clearly separates the resource/
+      instrumentation life cycle from the operator life cycle. 3GPP
+      allows operators to set the alarm severity to clear; this is not
+      allowed by this module. Instead, an operator closes an alarm,
+      which does not affect the severity.
+
+F.2.2.4. G.7710
+
+   G.7710 is different from the previously referenced alarm standards.
+   It does not define a data-model for alarm reporting. It defines
+   common equipment management function requirements including alarm
+   instrumentation. The scope is transport networks.
+
+   The requirements in G.7710 correspond to features in the alarm YANG
+   module in the following way:
+
+      Alarm Severity Assignment Profile (ASAP): the alarm profile
+      "/alarms/alarm-profile/".
+
+      Alarm Reporting Control (ARC): alarm shelving "/alarms/control/
+      alarm-shelving/" and the ability to control alarm notifications
+      "/alarms/control/notify-status-changes".
+
+F.3. Usability Requirements
+
+   Common alarm problems and their causes are summarized in
+   Table 2. This summary is adapted to networking based on the ISA
    [ISA182] and EEMUA [EEMUA] standards.

    +------------------+--------------------------------+---------------+
    | Problem          | Cause                          | How this      |
    |                  |                                | module        |
    |                  |                                | addresses the |
    |                  |                                | cause         |
    +------------------+--------------------------------+---------------+
    | Alarms are       | "Nuisance" alarms (chattering  | Strict        |
    | generated but    | alarms and fleeting alarms),   | definition of |
    | they are ignored | faulty hardware, redundant     | alarms        |
    | by the operator. | alarms, cascading alarms,      | requiring     |
    |                  | incorrect alarm settings,      | corrective    |
    |                  | alarms have not been           | response.     |
-   |                  | rationalised, the alarms       | Alarm         |
+   |                  | rationalized, the alarms       | Alarm         |
    |                  | represent log information      | requirements  |
-   |                  | rather than true alarms.       | in Table 2.   |
+   |                  | rather than true alarms.       | in Table 3.   |
    |                  |                                |               |
    | When alarms      | Insufficient alarm response    | The alarm     |
    | occur, operators | procedures and not well        | inventory     |
    | do not know how  | defined alarm types.           | lists all     |
    | to respond.      |                                | alarm types   |
    |                  |                                | and           |
    |                  |                                | corrective    |
    |                  |                                | actions.      |
    |                  |                                | Alarm         |
    |                  |                                | requirements  |
-   |                  |                                | in Table 2.   |
+   |                  |                                | in Table 3.   |
    |                  |                                |               |
    | The alarm        | Nuisance alarms, stale alarms, | The alarm     |
    | display is full  | alarms from equipment not in   | definition    |
    | of alarms, even  | service.                       | and alarm     |
    | when there is    |                                | shelving.     |
    | nothing wrong.   |                                |               |
    |                  |                                |               |
    | During a         | Incorrect prioritization of    | State-based   |
    | failure,         | alarms. Not using advanced     | alarm model,  |
    | operators are    | alarm techniques (e.g. state-  | alarm rate    |
    | flooded with so  | based alarming).               | requirements  |
-   | many alarms that |                                | in Table 3    |
-   | they do not know |                                | and Table 4   |
+   | many alarms that |                                | in Table 4    |
+   | they do not know |                                | and Table 5   |
    | which ones are   |                                |               |
    | the most         |                                |               |
    | important.       |                                |               |
    +------------------+--------------------------------+---------------+

-                    Table 1: Alarm Problems and Causes
+                    Table 2: Alarm Problems and Causes

    Based upon the above problems, EEMUA gives the following definition
    of a good alarm:

    +----------------+--------------------------------------------------+
    | Characteristic | Explanation                                      |
    +----------------+--------------------------------------------------+
    | Relevant       | Not spurious or of low operational value.        |
    |                |                                                  |
    | Unique         | Not duplicating another alarm.                   |
    |                |                                                  |
    | Timely         | Not long before any response is needed or too    |
    |                | late to do anything.                             |
    |                |                                                  |
-   | Prioritised    | Indicating the importance that the operator      |
+   | Prioritized    | Indicating the importance that the operator      |
    |                | deals with the problem.                          |
    |                |                                                  |
    | Understandable | Having a message which is clear and easy to      |
    |                | understand.                                      |
    |                |                                                  |
    | Diagnostic     | Identifying the problem that has occurred.       |
    |                |                                                  |
    | Advisory       | Indicative of the action to be taken.            |
    |                |                                                  |
    | Focusing       | Drawing attention to the most important issues.  |
    +----------------+--------------------------------------------------+

-                    Table 2: Definition of a Good Alarm
+                    Table 3: Definition of a Good Alarm

-   Vendors SHOULD rationalise all alarms according to above. Another
+   Vendors SHOULD rationalize all alarms according to above. Another
    crucial requirement is acceptable alarm notification rates. Vendors
    SHOULD make sure that they do not exceed the recommendations from
    EEMUA below:

    +-----------------------------------+-------------------------------+
    | Long Term Alarm Rate in Steady    | Acceptability                 |
    | Operation                         |                               |
    +-----------------------------------+-------------------------------+
    | More than one per minute          | Very likely to be             |
    |                                   | unacceptable.                 |
    |                                   |                               |
    | One per 2 minutes                 | Likely to be over-demanding.  |
    |                                   |                               |
    | One per 5 minutes                 | Manageable.                   |
    |                                   |                               |
    | Less than one per 10 minutes      | Very likely to be acceptable. |
    +-----------------------------------+-------------------------------+

-               Table 3: Acceptable Alarm Rates, Steady State
+               Table 4: Acceptable Alarm Rates, Steady State

    +----------------------------+--------------------------------------+
    | Number of alarms displayed | Acceptability                        |
    | in 10 minutes following a  |                                      |
    | major network problem      |                                      |
    +----------------------------+--------------------------------------+
    | More than 100              | Definitely excessive and very likely |
    |                            | to lead the operator to abandon      |
    |                            | the use of the alarm system.         |
    |                            |                                      |
    | 20-100                     | Hard to cope with.                   |
    |                            |                                      |
    | Under 10                   | Should be manageable - but may be    |
    |                            | difficult if several of the alarms   |
    |                            | require a complex operator response. |
    +----------------------------+--------------------------------------+

-                  Table 4: Acceptable Alarm Rates, Burst
+                  Table 5: Acceptable Alarm Rates, Burst

-   The numbers in Table 3 and Table 4 are the sum of all alarms for a
+   The numbers in Table 4 and Table 5 are the sum of all alarms for a
    network being managed from one alarm console. So every individual
    system or NMS contributes to these numbers.

    Vendors SHOULD make sure that the following rules are used in
    designing the alarm interface:

    1. Rationalize the alarms in the system to ensure that every alarm
       is necessary, has a purpose, and follows the cardinal rule -
       that it requires an operator response. Adhere to the rules of
-      Table 2
+      Table 3

    2. Audit the quality of the alarms. Talk with the operators about
       how well the alarm information supports them. Do they know what
       to do in the event of an alarm? Are they able to quickly
       diagnose the problem and determine the corrective action? Does
-      the alarm text adhere to the requirements in Table 2?
+      the alarm text adhere to the requirements in Table 3?

    3. Analyze and benchmark the performance of the system and compare
-      it to the recommended metrics in Table 3 and Table 4. Start by
+      it to the recommended metrics in Table 4 and Table 5. Start by
       identifying nuisance alarms, standing alarms at normal state and
       startup.

 Authors' Addresses

    Stefan Vallin
    Stefan Vallin AB

    Email: stefan@wallan.se

    Martin Bjorklund