draft-ietf-dime-overload-reqs-02.txt   draft-ietf-dime-overload-reqs-03.txt 
Network Working Group E. McMurry Network Working Group E. McMurry
Internet-Draft B. Campbell Internet-Draft B. Campbell
Intended status: Standards Track Tekelec Intended status: Standards Track Tekelec
Expires: June 20, 2013 December 17, 2012 Expires: July 19, 2013 January 15, 2013
Diameter Overload Control Requirements Diameter Overload Control Requirements
draft-ietf-dime-overload-reqs-02 draft-ietf-dime-overload-reqs-03
Abstract Abstract
When a Diameter server or agent becomes overloaded, it needs to be When a Diameter server or agent becomes overloaded, it needs to be
able to gracefully reduce its load, typically by informing clients to able to gracefully reduce its load, typically by informing clients to
reduce sending traffic for some period of time. Otherwise, it must reduce sending traffic for some period of time. Otherwise, it must
continue to expend resources parsing and responding to Diameter continue to expend resources parsing and responding to Diameter
messages, possibly resulting in congestion collapse. The existing messages, possibly resulting in congestion collapse. The existing
mechanisms provided by Diameter are not sufficient for this purpose. Diameter mechanisms, listed in Section 3, are not sufficient for this
This document describes the limitations of the existing mechanisms, purpose. This document describes the limitations of the existing
and provides requirements for new overload management mechanisms. mechanisms in Section 4. Requirements for new overload management
mechanisms are provided in Section 7.
Status of this Memo Status of this Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on June 20, 2013. This Internet-Draft will expire on July 19, 2013.
Copyright Notice Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
skipping to change at page 2, line 20 skipping to change at page 2, line 21
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Causes of Overload . . . . . . . . . . . . . . . . . . . . 3 1.1. Causes of Overload . . . . . . . . . . . . . . . . . . . . 3
1.2. Effects of Overload . . . . . . . . . . . . . . . . . . . 5 1.2. Effects of Overload . . . . . . . . . . . . . . . . . . . 5
1.3. Overload vs. Network Congestion . . . . . . . . . . . . . 5 1.3. Overload vs. Network Congestion . . . . . . . . . . . . . 5
1.4. Diameter Applications in a Broader Network . . . . . . . . 5 1.4. Diameter Applications in a Broader Network . . . . . . . . 5
1.5. Documentation Conventions . . . . . . . . . . . . . . . . 6 1.5. Documentation Conventions . . . . . . . . . . . . . . . . 6
2. Overload Scenarios . . . . . . . . . . . . . . . . . . . . . . 6 2. Overload Scenarios . . . . . . . . . . . . . . . . . . . . . . 6
2.1. Peer to Peer Scenarios . . . . . . . . . . . . . . . . . . 7 2.1. Peer to Peer Scenarios . . . . . . . . . . . . . . . . . . 7
2.2. Agent Scenarios . . . . . . . . . . . . . . . . . . . . . 9 2.2. Agent Scenarios . . . . . . . . . . . . . . . . . . . . . 9
2.3. Interconnect Scenario . . . . . . . . . . . . . . . . . . 12 2.3. Interconnect Scenario . . . . . . . . . . . . . . . . . . 12
3. Extensibility . . . . . . . . . . . . . . . . . . . . . . . . 13 3. Existing Mechanisms . . . . . . . . . . . . . . . . . . . . . 13
4. Existing Mechanisms . . . . . . . . . . . . . . . . . . . . . 14 4. Issues with the Current Mechanisms . . . . . . . . . . . . . . 14
5. Issues with the Current Mechanisms . . . . . . . . . . . . . . 14 4.1. Problems with Implicit Mechanism . . . . . . . . . . . . . 15
5.1. Problems with Implicit Mechanism . . . . . . . . . . . . . 15 4.2. Problems with Explicit Mechanisms . . . . . . . . . . . . 15
5.2. Problems with Explicit Mechanisms . . . . . . . . . . . . 15 5. Diameter Overload Case Studies . . . . . . . . . . . . . . . . 16
6. Diameter Overload Case Studies . . . . . . . . . . . . . . . . 16 5.1. Overload in Mobile Data Networks . . . . . . . . . . . . . 16
6.1. Overload in Mobile Data Networks . . . . . . . . . . . . . 16 5.2. 3GPP Study on Core Network Overload . . . . . . . . . . . 17
6.2. 3GPP Study on Core Network Overload . . . . . . . . . . . 17 6. Extensibility and Application Independence . . . . . . . . . . 18
7. Solution Requirements . . . . . . . . . . . . . . . . . . . . 18 7. Solution Requirements . . . . . . . . . . . . . . . . . . . . 19
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 23 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 24
9. Security Considerations . . . . . . . . . . . . . . . . . . . 23 9. Security Considerations . . . . . . . . . . . . . . . . . . . 24
9.1. Access Control . . . . . . . . . . . . . . . . . . . . . . 24 9.1. Access Control . . . . . . . . . . . . . . . . . . . . . . 24
9.2. Denial-of-Service Attacks . . . . . . . . . . . . . . . . 24 9.2. Denial-of-Service Attacks . . . . . . . . . . . . . . . . 24
9.3. Replay Attacks . . . . . . . . . . . . . . . . . . . . . . 24 9.3. Replay Attacks . . . . . . . . . . . . . . . . . . . . . . 25
9.4. Man-in-the-Middle Attacks . . . . . . . . . . . . . . . . 25 9.4. Man-in-the-Middle Attacks . . . . . . . . . . . . . . . . 25
9.5. Compromised Hosts . . . . . . . . . . . . . . . . . . . . 25 9.5. Compromised Hosts . . . . . . . . . . . . . . . . . . . . 25
10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 25 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 26
10.1. Normative References . . . . . . . . . . . . . . . . . . . 25 10.1. Normative References . . . . . . . . . . . . . . . . . . . 26
10.2. Informative References . . . . . . . . . . . . . . . . . . 26 10.2. Informative References . . . . . . . . . . . . . . . . . . 26
Appendix A. Contributors . . . . . . . . . . . . . . . . . . . . 26 Appendix A. Contributors . . . . . . . . . . . . . . . . . . . . 27
Appendix B. Acknowledgements . . . . . . . . . . . . . . . . . . 26 Appendix B. Acknowledgements . . . . . . . . . . . . . . . . . . 27
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 27 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 27
1. Introduction 1. Introduction
When a Diameter [RFC6733] server or agent becomes overloaded, it When a Diameter [RFC6733] server or agent becomes overloaded, it
needs to be able to gracefully reduce its load, typically by needs to be able to gracefully reduce its load, typically by
informing clients to reduce sending traffic for some period of time. informing clients to reduce sending traffic for some period of time.
Otherwise, it must continue to expend resources parsing and Otherwise, it must continue to expend resources parsing and
responding to Diameter messages, possibly resulting in congestion responding to Diameter messages, possibly resulting in congestion
collapse. The existing mechanisms provided by Diameter are not collapse. The existing mechanisms provided by Diameter are not
sufficient for this purpose. This document describes the limitations sufficient for this purpose. This document describes the limitations
of the existing mechanisms, and provides requirements for new of the existing mechanisms, and provides requirements for new
overload management mechanisms. overload management mechanisms.
This document draws on [RFC5390] and the work done on SIP overload This document draws on the work done on SIP overload control
control as well as on overload practices in SS7 networks and studies ([RFC5390], [RFC6357]) as well as on experience gained via overload
done by 3GPP. handling in Signaling System No. 7 (SS7) networks and studies done by
the Third Generation Partnership Project (3GPP) (Section 5).
Diameter is not typically an end-user protocol; rather it is Diameter is not typically an end-user protocol; rather it is
generally used as one component in support of some end-user activity. generally used as one component in support of some end-user activity.
For example, a WiFi access point might use Diameter to authenticate
and authorize user access via 802.11. Overload in a network that For example, a SIP server might use Diameter to authenticate and
uses Diameter applications will likely spill over into the end-user authorize user access. Overload in the Diameter backend
application network. The impact of Diameter overload on the client infrastructure will likely impact the experience observed by the end
application (a client application may use the Diameter protocol and user in the SIP application.
other protocols to do its job) is beyond the scope of this document.
The impact of Diameter overload on the client application (a client
application may use the Diameter protocol and other protocols to do
its job) is beyond the scope of this document.
This document presents non-normative descriptions of causes of This document presents non-normative descriptions of causes of
overload along with related scenarios and studies. Finally, it overload along with related scenarios and studies. Finally, it
offers a set of normative requirements for an improved overload offers a set of normative requirements for an improved overload
indication mechanism. indication mechanism.
1.1. Causes of Overload 1.1. Causes of Overload
Overload occurs when an element, such as a Diameter server or agent, Overload occurs when an element, such as a Diameter server or agent,
has insufficient resources to successfully process all of the traffic has insufficient resources to successfully process all of the traffic
skipping to change at page 5, line 19 skipping to change at page 5, line 26
transaction volumes. If a Diameter node becomes overloaded, or even transaction volumes. If a Diameter node becomes overloaded, or even
worse, fails completely, a large number of messages may be lost very worse, fails completely, a large number of messages may be lost very
quickly. Even with redundant servers, many messages can be lost in quickly. Even with redundant servers, many messages can be lost in
the time it takes for failover to complete. While a Diameter client the time it takes for failover to complete. While a Diameter client
or agent should be able to retry such requests, an overloaded peer or agent should be able to retry such requests, an overloaded peer
may cause a sudden large increase in the number of may cause a sudden large increase in the number of
transactions needing to be retried, rapidly filling local queues or transactions needing to be retried, rapidly filling local queues or
otherwise contributing to local overload. Therefore Diameter devices otherwise contributing to local overload. Therefore Diameter devices
need to be able to shed load before critical failures can occur. need to be able to shed load before critical failures can occur.
Diameter depends heavily on the "Authentication, Authorization,
and Accounting (AAA) Transport Profile" [RFC3539], which states
assumptions about the scale of AAA services which may be incorrect
for current uses of Diameter. In particular, the document
suggests that AAA services will typically be low volume and that
traffic will typically be application-driven. Section 2.1 of that
document uses an example of a 48-port NAS. However, Diameter is
commonly used in large-scale mobile data environments, where a
typical client could be a packet gateway that serves millions of
users, and generates Diameter messages at network-driven rates.
1.3. Overload vs. Network Congestion 1.3. Overload vs. Network Congestion
This document uses the term "overload" to refer to application-layer This document uses the term "overload" to refer to application-layer
overload at Diameter nodes. This is distinct from "network overload at Diameter nodes. This is distinct from "network
congestion", that is, congestion that occurs at the lower networking congestion", that is, congestion that occurs at the lower networking
layers that may impact the delivery of Diameter messages between layers that may impact the delivery of Diameter messages between
nodes. The authors recognize that element overload and network nodes. The authors recognize that element overload and network
congestion are interrelated, and that overload can contribute to congestion are interrelated, and that overload can contribute to
network congestion and vice versa. network congestion and vice versa.
skipping to change at page 13, line 37 skipping to change at page 13, line 37
shared between components within a network operator's network. shared between components within a network operator's network.
Network operators may not want to convey topology or operational Network operators may not want to convey topology or operational
information, which limits how much overload and loading information information, which limits how much overload and loading information
can be sent. For the interconnect scenario shown, Server 2 may want can be sent. For the interconnect scenario shown, Server 2 may want
to signal overload to Server 1, to affect traffic coming from Network to signal overload to Server 1, to affect traffic coming from Network
Operator 1. Operator 1.
This case is distinct from those internal to a network operator's This case is distinct from those internal to a network operator's
network, where there may be many more elements in a more complicated network, where there may be many more elements in a more complicated
topology. Also, the elements in the interconnect network may not topology. Also, the elements in the interconnect network may not
support diameter overload control, and the network operators may not support Diameter overload control, and the network operators may not
want the interconnect network to use overload or loading information. want the interconnect network to use overload or loading information.
They may only want the information to pass through the interconnect They may only want the information to pass through the interconnect
network without further processing or action by the interconnect network without further processing or action by the interconnect
network even if the elements in the interconnect network do support network even if the elements in the interconnect network do support
diameter overload control. Diameter overload control.
3. Extensibility
Given the variety of scenarios diameter elements can be deployed in,
and the variety of roles they can fulfill with diameter and other
technologies, a single algorithm for handling overload may not be
sufficient. This effort cannot anticipate all possible future
scenarios and roles. Extensibility, particularly of algorithms used
to deal with overload, will be important to cover these cases.
4. Existing Mechanisms 3. Existing Mechanisms
Diameter offers both implicit and explicit mechanisms for a Diameter Diameter offers both implicit and explicit mechanisms for a Diameter
node to learn that a peer is overloaded or unreachable. The implicit node to learn that a peer is overloaded or unreachable. The implicit
mechanism is simply the lack of responses to requests. If a client mechanism is simply the lack of responses to requests. If a client
fails to receive a response in a certain time period, it assumes the fails to receive a response in a certain time period, it assumes the
upstream peer is unavailable, or overloaded to the point of effective upstream peer is unavailable, or overloaded to the point of effective
unavailability. The watchdog mechanism [RFC3539] ensures that a unavailability. The watchdog mechanism [RFC3539] ensures that a
certain rate of transaction responses occurs even when there is certain rate of transaction responses occurs even when there is
otherwise little or no other Diameter traffic. otherwise little or no other Diameter traffic.
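
The implicit mechanism described above can be illustrated with a minimal sketch. The class name, method names, and the 5-second timeout are assumptions made for illustration only; they are not taken from [RFC6733] or [RFC3539], and a real implementation would also drive the watchdog exchange.

```python
from dataclasses import dataclass, field


@dataclass
class PeerMonitor:
    """Hypothetical tracker for one upstream peer (illustrative only)."""

    response_timeout: float                      # seconds to wait for an answer (assumed value)
    pending: dict = field(default_factory=dict)  # request id -> time the request was sent

    def request_sent(self, request_id: str, now: float) -> None:
        # Remember when the request left so its age can be checked later.
        self.pending[request_id] = now

    def answer_received(self, request_id: str) -> None:
        # An answer arrived, so the request no longer counts against the peer.
        self.pending.pop(request_id, None)

    def peer_suspect(self, now: float) -> bool:
        """Return True if any request has gone unanswered longer than the timeout."""
        return any(now - sent > self.response_timeout
                   for sent in self.pending.values())


monitor = PeerMonitor(response_timeout=5.0)
monitor.request_sent("req-1", now=0.0)
print(monitor.peer_suspect(now=3.0))  # still within the timeout
print(monitor.peer_suspect(now=6.0))  # no answer in time: peer presumed unavailable or overloaded
```

Note that the only signal available to the client in this sketch is silence, so it cannot tell an overloaded peer from a failed one or from a transport problem.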
skipping to change at page 14, line 49 skipping to change at page 14, line 40
issues with transport (e.g. congestion propagation and window issues with transport (e.g. congestion propagation and window
management) are managed at that level. But even with a congestion- management) are managed at that level. But even with a congestion-
managed transport, a Diameter node can become overloaded at the managed transport, a Diameter node can become overloaded at the
Diameter protocol or application layers due to the causes described Diameter protocol or application layers due to the causes described
in Section 1.1, and congestion-managed transports do not provide in Section 1.1, and congestion-managed transports do not provide
facilities (and are at the wrong level) to handle server overload. facilities (and are at the wrong level) to handle server overload.
Transport-level congestion management is also not sufficient to Transport-level congestion management is also not sufficient to
address overload in cases of multi-hop and multi-destination address overload in cases of multi-hop and multi-destination
signaling. signaling.
5. Issues with the Current Mechanisms 4. Issues with the Current Mechanisms
The currently available Diameter mechanisms for indicating an The currently available Diameter mechanisms for indicating an
overload condition are not adequate to avoid service outages due to overload condition are not adequate to avoid service outages due to
overload. This inadequacy may, in turn, contribute to broader overload. This inadequacy may, in turn, contribute to broader
congestion collapse due to unresponsive Diameter nodes causing congestion collapse due to unresponsive Diameter nodes causing
application or transport layer retransmissions. In particular, they application or transport layer retransmissions. In particular, they
do not allow a Diameter agent or server to shed load as it approaches do not allow a Diameter agent or server to shed load as it approaches
overload. At best, a node can only indicate that it needs to overload. At best, a node can only indicate that it needs to
entirely stop receiving requests, i.e. that it has effectively entirely stop receiving requests, i.e. that it has effectively
failed. Even that is problematic due to the inability to indicate failed. Even that is problematic due to the inability to indicate
durational validity on the transient errors available in the base durational validity on the transient errors available in the base
Diameter protocol. Diameter offers no mechanism to allow a node to Diameter protocol. Diameter offers no mechanism to allow a node to
indicate different overload states for different categories of indicate different overload states for different categories of
messages, for example, if it is overloaded for one Diameter messages, for example, if it is overloaded for one Diameter
application but not another. application but not another.
5.1. Problems with Implicit Mechanism 4.1. Problems with Implicit Mechanism
The implicit mechanism doesn't allow an agent or server to inform the The implicit mechanism doesn't allow an agent or server to inform the
client of a problem until it is effectively too late to do anything client of a problem until it is effectively too late to do anything
about it. The client does not know to take action until the upstream about it. The client does not know to take action until the upstream
node has effectively failed. A Diameter node has no opportunity to node has effectively failed. A Diameter node has no opportunity to
shed load early to avoid collapse in the first place. shed load early to avoid collapse in the first place.
Additionally, the implicit mechanism cannot distinguish between Additionally, the implicit mechanism cannot distinguish between
overload of a Diameter node and network congestion. Diameter treats overload of a Diameter node and network congestion. Diameter treats
the failure to receive an answer as a transport failure. the failure to receive an answer as a transport failure.
5.2. Problems with Explicit Mechanisms 4.2. Problems with Explicit Mechanisms
The Diameter specification is ambiguous on how a client should handle The Diameter specification is ambiguous on how a client should handle
receipt of a DIAMETER_TOO_BUSY response. The base specification receipt of a DIAMETER_TOO_BUSY response. The base specification
[RFC6733] indicates that the sending client should attempt to send [RFC6733] indicates that the sending client should attempt to send
the request to a different peer. It makes no suggestion that a the the request to a different peer. It makes no suggestion that the
receipt of a DIAMETER_TOO_BUSY response should affect future Diameter receipt of a DIAMETER_TOO_BUSY response should affect future Diameter
messages in any way. messages in any way.
The Authentication, Authorization, and Accounting (AAA) Transport The Authentication, Authorization, and Accounting (AAA) Transport
Profile [RFC3539] recommends that a AAA node that receives a "Busy" Profile [RFC3539] recommends that a AAA node that receives a "Busy"
response fail over all remaining requests to a different agent or response fail over all remaining requests to a different agent or
server. But while the Diameter base specification explicitly depends server. But while the Diameter base specification explicitly depends
on RFC3539 to define transport behavior, it does not refer to RFC3539 on RFC3539 to define transport behavior, it does not refer to RFC3539
in the description of behavior on receipt of DIAMETER_TOO_BUSY. in the description of behavior on receipt of DIAMETER_TOO_BUSY.
There's a strong likelihood that at least some implementations will There's a strong likelihood that at least some implementations will
skipping to change at page 16, line 40 skipping to change at page 16, line 32
DIAMETER_UNABLE_TO_DELIVER, or using DPR with cause code BUSY also DIAMETER_UNABLE_TO_DELIVER, or using DPR with cause code BUSY also
have no mechanisms for specifying the scope or cause of the failure, have no mechanisms for specifying the scope or cause of the failure,
or the durational validity. or the durational validity.
The issues with error responses in [RFC6733] extend beyond the The issues with error responses in [RFC6733] extend beyond the
particular issues for overload control and have been addressed in an particular issues for overload control and have been addressed in an
ad hoc fashion by various implementations. Addressing these in a ad hoc fashion by various implementations. Addressing these in a
standard way would be a useful exercise, but it is beyond the scope standard way would be a useful exercise, but it is beyond the scope
of this document. of this document.
6. Diameter Overload Case Studies 5. Diameter Overload Case Studies
6.1. Overload in Mobile Data Networks 5.1. Overload in Mobile Data Networks
As the number of Third Generation (3G) and Long Term Evolution (LTE) As the number of Third Generation (3G) and Long Term Evolution (LTE)
enabled smartphone devices continues to expand in mobility networks, enabled smartphone devices continues to expand in mobility networks,
there have been situations where high signaling traffic load led to there have been situations where high signaling traffic load led to
overload events at the Diameter-based Home Location Registers (HLR) overload events at the Diameter-based Home Location Registers (HLR)
and/or Home Subscriber Servers (HSS) [TR23.843]. The root causes of and/or Home Subscriber Servers (HSS) [TR23.843]. The root causes of
the HLR congestion events were manifold but included hardware failure the HLR congestion events were manifold but included hardware failure
and procedural errors. The result was high signaling traffic load on and procedural errors. The result was high signaling traffic load on
the HLR and HSS. the HLR and HSS.
The 3GPP architecture [TS23.002] makes extensive use of Diameter. It The 3GPP architecture [TS23.002] makes extensive use of Diameter. It
is used for mobility management [TS29.272] (and others), IMS is used for mobility management [TS29.272] (and others), IP
[TS29.228] (and others), policy and charging control [TS29.212] (and Multimedia Subsystem (IMS) [TS29.228] (and others), policy and
others) as well as other functions. The details of the architecture charging control [TS29.212] (and others) as well as other functions.
are out of scope for this document, but it is worth noting that there The details of the architecture are out of scope for this document,
are quite a few Diameter applications, some with quite large amounts but it is worth noting that there are quite a few Diameter
of Diameter signaling in deployed networks. applications, some with quite large amounts of Diameter signaling in
deployed networks.
The 3GPP specifications do not currently address overload for The 3GPP specifications do not currently address overload for
Diameter applications or provide an equivalent load control mechanism Diameter applications or provide an equivalent load control mechanism
to those provided in the more traditional SS7 elements in GSM to those provided in the more traditional SS7 elements in Global
[TS29.002]. The capabilities specified in the 3GPP standards do not System for Mobile Communications (GSM) [TS29.002]. The capabilities
adequately address the abnormal condition where excessively high specified in the 3GPP standards do not adequately address the
signaling traffic load situations are experienced. abnormal condition where excessively high signaling traffic load
situations are experienced.
Smartphones contribute much more heavily, relative to non- Smartphones, an increasingly large percentage of mobile devices,
smartphones, to the continuation of a registration surge due to their contribute much more heavily, relative to non-smartphones, to the
very aggressive registration algorithms. The aggressive smartphone continuation of a registration surge due to their very aggressive
logic is designed to: registration algorithms. Smartphone behavior contributes to network
loading and can contribute to overload conditions. The aggressive
smartphone logic is designed to:
a. always have voice and data registration, and a. always have voice and data registration, and
b. constantly try to be on 3G or LTE data (and thus on 3G voice or b. constantly try to be on 3G or LTE data (and thus on 3G voice or
VoLTE) for their added benefits. VoLTE) for their added benefits.
Non-smartphones typically have logic to wait for a time period after Non-smartphones typically have logic to wait for a time period after
registering successfully on voice and data. registering successfully on voice and data.
The smartphone aggressive registration is problematic in two ways: The smartphone aggressive registration is problematic in two ways:
o first by generating excessive signaling load towards the HLR that o first by generating excessive signaling load towards the HLR that
is ten times that from a non-smartphone, is ten times that from a non-smartphone,
o and second by causing continual registration attempts when a o and second by causing continual registration attempts when a
network failure affects registrations through the 3G data network. network failure affects registrations through the 3G data network.
6.2. 3GPP Study on Core Network Overload 5.2. 3GPP Study on Core Network Overload
A study in 3GPP SA2 on core network overload has produced the A study in 3GPP SA2 on core network overload has produced the
technical report [TR23.843]. This enumerates several causes of technical report [TR23.843]. This enumerates several causes of
overload in mobile core networks including portions that are signaled overload in mobile core networks including portions that are signaled
using Diameter. This document is a work in progress and is not using Diameter. This document is a work in progress and is not
complete. However, it is useful for pointing out scenarios and the complete. However, it is useful for pointing out scenarios and the
general need for an overload control mechanism for Diameter. general need for an overload control mechanism for Diameter.
It is common for mobile networks to employ more than one radio It is common for mobile networks to employ more than one radio
technology and to do so in an overlay fashion with multiple technology and to do so in an overlay fashion with multiple
technologies present in the same location (such as GSM, UMTS or CDMA technologies present in the same location (such as 2nd or 3rd
along with LTE). This presents opportunities for traffic storms when generation mobile technologies along with LTE). This presents
issues occur on one overlay and not another as all devices that had opportunities for traffic storms when issues occur on one overlay and
been on the overlay with issues switch. This causes a large amount not another as all devices that had been on the overlay with issues
of Diameter traffic as locations and policies are updated. switch. This causes a large amount of Diameter traffic as locations
and policies are updated.
Another scenario called out by this study is a flood of registration Another scenario called out by this study is a flood of registration
and mobility management events caused by some element in the core and mobility management events caused by some element in the core
network failing. This flood of traffic from end nodes falls under network failing. This flood of traffic from end nodes falls under
the network-initiated traffic flood category. There is likely to the network-initiated traffic flood category. There is likely to
also be traffic resulting directly from the component failure in this also be traffic resulting directly from the component failure in this
case. A similar flood can occur when elements or components recover case. A similar flood can occur when elements or components recover
as well. as well.
Subscriber initiated traffic floods are also indicated in this study Subscriber initiated traffic floods are also indicated in this study
as an overload mechanism where a large number of mobile devices as an overload mechanism where a large number of mobile devices
attempting to access services at the same time, such as in response attempting to access services at the same time, such as in response
to an entertainment event or a catastrophic event. to an entertainment event or a catastrophic event.
While this 3GPP study is concerned with the broader effects of these While this 3GPP study is concerned with the broader effects of these
scenarios on wireless networks and their elements, they have scenarios on wireless networks and their elements, they have
implications specifically for Diameter signaling. One of the goals implications specifically for Diameter signaling. One of the goals
of this document is to provide guidance for a core mechanism that can of this document is to provide guidance for a core mechanism that can
be used to mitigate the scenarios called out by this study. be used to mitigate the scenarios called out by this study.
6. Extensibility and Application Independence
Given the variety of scenarios Diameter elements can be deployed in,
and the variety of roles they can fulfill with Diameter and other
technologies, a single algorithm for handling overload may not be
sufficient. This effort cannot anticipate all possible future
scenarios and roles. Extensibility, particularly of algorithms used
to deal with overload, will be important to cover these cases.
Similarly, the scopes that overload information may apply to may
include cases that have not yet been considered. Extensibility in
this area will also be important.
The basic mechanism is intended to be application-independent, that
is, a Diameter node can use it across any existing and future
Diameter applications and expect reasonable results. Certain
Diameter applications might, however, benefit from application-
specific behavior over and above the mechanism's defaults. For
example, an application specification might specify relative
priorities of messages or selection of a specific overload control
algorithm.
7. Solution Requirements 7. Solution Requirements
This section proposes requirements for an improved mechanism to This section proposes requirements for an improved mechanism to
control Diameter overload, with the goals of improving the issues control Diameter overload, with the goals of improving the issues
described in Section 5 and supporting the scenarios described in described in Section 4 and supporting the scenarios described in
Section 2 Section 2
REQ 1: The overload control mechanism MUST provide a communication REQ 1: The overload control mechanism MUST provide a communication
method for Diameter nodes to exchange load and overload method for Diameter nodes to exchange load and overload
information. information.
REQ 2: [Open Issue: The following requirement has generated list REQ 2: The mechanism MUST allow Diameter nodes to support overload
discussion that is unresolved at the time of this writing. control regardless of which Diameter applications they
The discussion concerns whether this requirement is needed support.
at all, whether it should include the "MUST NOT require
specification changes" language vs saying that it should not
force changes large enough to require new application IDs,
and whether we should include additional language to forbid
assumptions about the behavior of specific implementations.]
The overload control mechanism MUST be useable with any
existing or future Diameter application. It MUST NOT
require specification changes for existing Diameter
applications.
REQ 3: The overload control mechanism MUST limit the impact of REQ 3: The overload control mechanism MUST limit the impact of
overload on the overall useful throughput of a Diameter overload on the overall useful throughput of a Diameter
server, even when the incoming load on the network is far in server, even when the incoming load on the network is far in
excess of its capacity. The overall useful throughput under excess of its capacity. The overall useful throughput under
load is the ultimate measure of the value of an overload load is the ultimate measure of the value of an overload
control mechanism. control mechanism.
REQ 4: Diameter allows requests to be sent from either side of a REQ 4: Diameter allows requests to be sent from either side of a
connection and either side of a connection may have need to connection and either side of a connection may have need to
skipping to change at page 20, line 5 skipping to change at page 20, line 17
decisions using the most currently available information. decisions using the most currently available information.
REQ 9: The mechanism MUST function across fully loaded as well as REQ 9: The mechanism MUST function across fully loaded as well as
quiescent transport connections. This is partially derived quiescent transport connections. This is partially derived
from the requirements for stability and hysteresis control from the requirements for stability and hysteresis control
above. above.
REQ 10: Consumers of overload state indications MUST be able to REQ 10: Consumers of overload state indications MUST be able to
determine when the overload condition improves or ends. determine when the overload condition improves or ends.
REQ 11: The overload control mechanism MUST be scalable. That is, REQ 11: The overload control mechanism MUST be able to operate in
it MUST be able to operate in different sized networks. networks of different sizes.
REQ 12: When a single network node fails, goes into overload, or REQ 12: When a single network node fails, goes into overload, or
suffers from reduced processing capacity, the mechanism MUST suffers from reduced processing capacity, the mechanism MUST
make it possible to limit the impact of this on other nodes make it possible to limit the impact of this on other nodes
in the network. This helps to prevent a small-scale failure in the network. This helps to prevent a small-scale failure
from becoming a widespread outage. from becoming a widespread outage.
REQ 13: The mechanism MUST NOT introduce substantial additional work REQ 13: The mechanism MUST NOT introduce substantial additional work
for a node in an overloaded state. For example, a requirement for a node in an overloaded state. For example, a requirement
for an overloaded node to send overload information every for an overloaded node to send overload information every
skipping to change at page 20, line 45 skipping to change at page 21, line 12
environment with a mix of nodes that do, and nodes that do environment with a mix of nodes that do, and nodes that do
not, support the mechanism. not, support the mechanism.
REQ 17: In a mixed environment with nodes that support the overload REQ 17: In a mixed environment with nodes that support the overload
control mechanism and that do not, the mechanism MUST result control mechanism and that do not, the mechanism MUST result
in at least as much useful throughput as would have resulted in at least as much useful throughput as would have resulted
if the mechanism were not present. It SHOULD result in less if the mechanism were not present. It SHOULD result in less
severe congestion in this environment. severe congestion in this environment.
REQ 18: In a mixed environment of nodes that support the overload REQ 18: In a mixed environment of nodes that support the overload
control mechanism and that do not, users and operators of control mechanism and that do not, the mechanism MUST NOT
nodes that do not support the mechanism MUST NOT unfairly preclude elements that support overload control from
benefit from the mechanism. treating elements that do not support overload control in an
equitable fashion relative to those that do. Users and
operators of nodes that do not support the mechanism MUST
NOT unfairly benefit from the mechanism. The mechanism
specification SHOULD provide guidance to implementors for
dealing with elements not supporting overload control.
REQ 19: It MUST be possible to use the mechanism between nodes in REQ 19: It MUST be possible to use the mechanism between nodes in
different realms and in different administrative domains. different realms and in different administrative domains.
REQ 20: Any explicit overload indication MUST distinguish between REQ 20: Any explicit overload indication MUST distinguish between
actual overload, as opposed to other, non-overload related actual overload, as opposed to other, non-overload related
failures. failures.
REQ 21: In cases where a network node fails, is so overloaded that REQ 21: In cases where a network node fails, is so overloaded that
it cannot process messages, or cannot communicate due to a it cannot process messages, or cannot communicate due to a
skipping to change at page 26, line 10 skipping to change at page 26, line 26
RFC 2914, September 2000. RFC 2914, September 2000.
[RFC3539] Aboba, B. and J. Wood, "Authentication, Authorization and [RFC3539] Aboba, B. and J. Wood, "Authentication, Authorization and
Accounting (AAA) Transport Profile", RFC 3539, June 2003. Accounting (AAA) Transport Profile", RFC 3539, June 2003.
10.2. Informative References 10.2. Informative References
[RFC5390] Rosenberg, J., "Requirements for Management of Overload in [RFC5390] Rosenberg, J., "Requirements for Management of Overload in
the Session Initiation Protocol", RFC 5390, December 2008. the Session Initiation Protocol", RFC 5390, December 2008.
[RFC6357] Hilt, V., Noel, E., Shen, C., and A. Abdelal, "Design
Considerations for Session Initiation Protocol (SIP)
Overload Control", RFC 6357, August 2011.
[TR23.843] [TR23.843]
3GPP, "Study on Core Network Overload Solutions", 3GPP, "Study on Core Network Overload Solutions",
TR 23.843 0.6.0, October 2012. TR 23.843 0.6.0, October 2012.
[IR.34] GSMA, "Inter-Service Provider IP Backbone Guidelines", [IR.34] GSMA, "Inter-Service Provider IP Backbone Guidelines",
IR 34 7.0, January 2012. IR 34 7.0, January 2012.
[IR.88] GSMA, "LTE Roaming Guidelines", IR 88 7.0, January 2012. [IR.88] GSMA, "LTE Roaming Guidelines", IR 88 7.0, January 2012.
[TS23.002] [TS23.002]
 End of changes. 32 change blocks. 
99 lines changed or deleted 111 lines changed or added
