Audio/Video Transport Working                                    G. Hunt
Group                                                           P. Arden
Internet-Draft                                                        BT
Intended status: Informational                              July 7, 2008
Expires: January 8, 2009

                    Monitoring Architectures for RTP

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at

   The list of Internet-Draft Shadow Directories can be accessed at

   This Internet-Draft will expire on January 8, 2009.

Hunt & Arden             Expires January 8, 2009                [Page 1]

Internet-Draft        RTP Monitoring Architectures             July 2008


   This memo is intended to stimulate discussion on a hierarchical
   monitoring architecture for RTP, including a scheme for the
   definition of lower-layer metrics which are usable by a range of
   applications.  Systematic investigation of a monitoring architecture
   for RTP/RTCP was requested at the IETF71 (Philadelphia) AVT session.

   This first version of the draft is restricted to transport metrics
   and to a subset of audio application metrics, but it is envisaged
   that future work should extend this to other applications,
   principally video.

Table of Contents

   1.  Requirements notation
   2.  Introduction
   3.  Transport layer metrics
     3.1.  Option 1 - Monitoring every packet
     3.2.  Option 2 - Real-time histogram methods
     3.3.  Option 3 - Monitoring by exception
     3.4.  Option 4 - Application-specific monitoring
   4.  RTP terminal metrics
   5.  Application layer metrics
     5.1.  Requirements for speech quality monitoring metrics
     5.2.  The audio hierarchy
     5.3.  Individual network transport and terminal parameters
           affecting speech quality
     5.4.  Composite objective speech quality metrics
   6.  Choosing transport protocols for metrics
     6.1.  RTCP as a transport for metrics - advantages and
           disadvantages
       6.1.1.  Advantages of RTCP
       6.1.2.  Disadvantages of RTCP
   7.  IANA Considerations
   8.  Security Considerations
   9.  Acknowledgments
   10. Informative References
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Requirements notation

   This memo is informative and as such contains no normative
   requirements.

2.  Introduction

   The development of multiple metrics for transport and application
   quality monitoring has been identified as a potential problem for
   RTP/RTCP interoperability.  The AVT group has requested work on an
   architectural framework for monitoring which recognises that
   different applications layered on RTP may have some monitoring
   requirements in common, which should be satisfied by a common design.
   When this work was initiated, the objective was to design a framework
   and a small number of re-usable metrics at each appropriate layer to
   reduce implementation costs and to maximise inter-operability.  Since
   then, work-in-progress on [GUIDELINES] has stated that RTCP should be
   used primarily to provide information to peer RTP systems, whilst
   information used for network management should be carried by out-of-
   band protocols.  By implication, AVT should not work on metrics or
   their transport in RTCP unless they are motivated by RTP-system-to-
   RTP-system requirements.  However, metrics supporting network and
   service management are still required for RTP and the applications
   transported over it, to support many significant real-world

   Service providers may wish to answer some or all of the following:

   o  is a user experiencing a problem?

   o  what is the nature of the problem?

   o  how severe is the problem?

   o  what is the location of the problem?

   Metrics of transport performance and application performance,
   considered either on an isolated per-session basis or as a collection
   of metrics for multiple sessions using a common network component,
   can answer or contribute to answers to some or all of these
   questions.

   One example which might lead to a shared metric arises from a shared
   requirement for monitoring of packet transport, which might be useful
   for every media type (audio, video, text, messaging) carried over
   RTP.

   Another example is the set of applications all of which transmit
   audio, including streaming audio speech, streaming music, two-party
   conversational speech, and audio conferencing.  This set of
   applications might be able to share a suitably defined set of audio
   metrics, e.g. for parameters such as noise floor, mean level, or
   amplitude clipping.  The subset of interactive speech applications
   may be able to use common additional metrics related to interactivity
   (e.g. media delay and echo) which are not applicable to all audio
   applications.  Some or all of these audio metrics may be applicable
   to the audio channel(s) of a video application, such as IP TV or
   conversational video.

   [Editor's note: need to add a video-based view and examples]

   Metrics of RTP transport performance usually relate to single packet
   network segments, whilst metrics of application performance are more
   likely to represent the end-to-end connection which may include
   transmission over non-packet networks and/or over multiple packet
   networks.  Access to, and integration of, multiple sets of packet
   transport metrics relevant to a single connection typically present
   difficulties in current networks.

   Metrics are typically measured in an RTP system but may be required
   at another RTP system or at a non-RTP system.  Hence transport of
   metrics is often required.  Metrics might be transported alongside
   RTP media using the extensibility mechanism defined in [RFC3611] but
   this is not an input requirement.  Other methods may be used if RTCP
   XR blocks are not suitable or another method offers significant
   technical advantages.  Following the work-in-progress in [GUIDELINES]
   which restricts the usage of RTCP, the method for transporting
   metrics need not be RTCP and should be chosen independently of the
   metrics themselves.  If the transport is not by RTCP, it is likely
   that multiple transport mechanisms should be permitted, and probably
   should not be restricted by AVT.

   For transport metrics, the IETF and other SDOs have already defined a
   wide choice of potentially useful metrics.  Some of these metrics
   may embed arbitrary design choices, or be application-specific.  It
   is a goal of this work to find generic and re-usable metrics.  This
   may result in a preference for some of the existing metrics over
   others, or in the definition of alternative metrics meeting the
   architectural goals of this work.

   For metrics at layers higher than transport, metrics are developed by
   a variety of external SDOs, e.g. by ITU-T for voice telephony

   The development of application metrics is an active field.  Any
   framework should be extensible to accommodate useful innovations when
   there is a consensus for their adoption.

   It is obviously desirable to achieve some consensus (the more, the
   better) on a set of useful metrics (the fewer, the better) which may
   be widely implemented, widely inter-operated, and widely understood.

   Large data sets of raw measurements must be condensed into a smaller
   set of metrics or statistics before any agent (human or machine) can
   make decisions based on them.  It has been suggested that AVT might
   remain "metric-neutral" by storing and transporting raw measurement
   data, rather than the condensed metrics (see Option 1 below).  Even
   if data volumes are sufficiently small to make this feasible, some
   layer must perform the condensation and hence commit to specific
   metrics.

   A four-step process is suggested.  The AVT community may wish to
   contribute to some of these steps.

   1.  Choose a set of metrics which is useful for each application.

   2.  Classify each member of the sets of metrics according to the
       architectural layer which they monitor, creating sets of per-
       application, per-layer metrics.

   3.  Define a set of required metrics at each layer as the union of
       the application-specific sets in each layer.  This should include
       the selection of only one from any group of metrics with
       overlapping or nearly-overlapping capabilities, leading to agreed
       sets of per-layer metrics.  All of these metrics should be
       available within the architecture, but each application may
       select a subset which meets its needs.  Most RTP end systems and
       RTP mixers implement only a subset of possible RTP applications,
       and clearly these devices need not implement any metric which is
       relevant only to applications which they do not support.

   4.  Choose one or more transport protocols for those cases where
       metrics are measured at one location but must be available at
       another, e.g. to cause a reaction in an RTP system's peer, or for
       network or service management purposes.
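
   Steps 1 to 3 above can be illustrated with a minimal sketch.  The
   application names, metric names, and the overlap group below are
   hypothetical placeholders chosen for illustration; they are not
   proposed by this memo.  The sketch classifies each chosen metric by
   layer, maps overlapping metrics onto one preferred member, and forms
   the per-layer union.

```python
from collections import defaultdict

# Step 1 and 2 (hypothetical data): each application lists the metrics it
# finds useful, each tagged with the architectural layer it monitors.
APP_METRICS = {
    "voip": {("transport", "packet_loss"),
             ("transport", "delay_variation"),
             ("application", "echo_return_loss")},
    "iptv": {("transport", "packet_loss"),
             ("transport", "jitter"),
             ("application", "audio_level")},
}

# Hypothetical groups of overlapping metrics; the first name is preferred.
OVERLAPS = [("delay_variation", "jitter")]


def per_layer_union(app_metrics, overlaps=()):
    """Step 3: union of the per-application sets, one metric per overlap group."""
    preferred = {name: group[0] for group in overlaps for name in group}
    layers = defaultdict(set)
    for metrics in app_metrics.values():
        for layer, name in metrics:
            layers[layer].add(preferred.get(name, name))
    return {layer: sorted(names) for layer, names in layers.items()}


LAYER_METRICS = per_layer_union(APP_METRICS, OVERLAPS)
```

   Each application would then select, from the resulting per-layer
   sets, only the subset which meets its needs.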

   The fourth step seems at first sight to be of secondary
   importance ("We've chosen our metrics, now all we have to do is to
   transport them") but the choice of transport protocols may be tightly
   constrained, for example because the measuring point has limited
   performance and/or limited access bandwidth and/or is in a different
   trust domain.

   Section 3 describes some options for metrics of transport
   performance.  This includes an initial quantitative investigation of
   the feasibility of becoming "metric-neutral" by sending raw
   measurement data rather than condensed metrics.

   Section 5 starts the process of describing requirements for
   application-layer monitoring and the metrics frameworks available to
   meet them.  In this first version of the draft, the description is
   limited to interactive speech and takes most of its material from the
   work of ITU-T.

   Section 6 discusses the choice of transport protocols, including
   discussion of the merits of RTCP which remains a candidate protocol.

3.  Transport layer metrics

   The objective is to provide a set of metrics which characterise the
   three transport impairments of packet loss, packet delay, and packet
   delay variation.  These metrics should be usable by any application
   which uses RTP transport.

3.1.  Option 1 - Monitoring every packet

   Most transport metrics, almost by definition, condense a large amount
   of information about packet arrivals into a small number of
   statistics.  Usually, the aim of the statistics is to present key
   features of any transport impairments in ways which are readily
   understood by the operators of the network, with the minimum of
   distracting additional information.  Unfortunately there are multiple
   ways to condense data about packet arrivals, and the "key features"
   (those impairments which result in degraded application performance)
   are likely to be application-dependent.  Given this, it is not
   surprising that there are no known provably optimal metrics for the
   three transport impairments.  There are instead multiple heuristic
   metrics.

   The aim of "monitoring every packet" is to ensure that the
   information reported is not dependent on the application.  In this
   scheme, RTP systems will report arrival data for each individual RTP
   packet.  RTP (or other) systems receiving this "raw" data may use it
   to calculate any preferred heuristic metrics, but such calculations
   and the reporting of the results (e.g. to a session control layer or
   a management layer) are outside the scope of RTP and RTCP.

   Run-length encoding (RLE) is a well-known technique for compressing
   per-packet information about packet loss.  The efficiency of RLE
   compression falls as the packet loss fraction increases, so the
   volume of metrics data produced is unpredictable.
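
   The dependence of RLE efficiency on the loss pattern can be seen in
   a minimal sketch.  The encoding below is a generic list of (flag,
   run length) pairs, chosen for illustration; it is not the chunk
   format of any particular report block, and the loss patterns are
   invented.

```python
from itertools import groupby


def rle(received):
    """Run-length encode a sequence of booleans (True = packet received)."""
    return [(flag, len(list(run))) for flag, run in groupby(received)]


# 250 packets (one 5 s cycle at 20 ms packetisation) with no loss
# collapse into a single run.
no_loss = [True] * 250

# The same 250 packets with every tenth packet lost break into many runs.
periodic_loss = [(i % 10) != 9 for i in range(250)]

print(len(rle(no_loss)), "run(s) without loss")
print(len(rle(periodic_loss)), "run(s) with 10% isolated loss")
```

   The number of runs, and hence the size of the report, depends on
   how the losses are distributed as well as on how many there are.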

   If packet round-trip delay is measured using the technique described
   in [RFC3550] section 6.4.1 and [RFC3550] Figure 2, the rate of
   measurement is low (at most one measurement per RTCP measurement
   cycle) and the volume of data involved in reporting the result is

   There are no obvious techniques for substantial compression of data
   related to the arrival times of individual packets, but such data is
   needed to compute packet delay variation.  Hence it appears that an
   item of data must be sent per packet, if packet delay variation is to
   be calculated from "raw" data.

   The following calculation estimates the volume of data needed to send
   per-packet data, assuming a simple logarithmic scheme to code the
   delay variation.

   Consider the raw delay variation metric D(1,j) using the notation of
   [RFC3550] section 6.4.1.  If delay variation, relative to that of the
   first packet of the connection, is measured in RTP timestamp units,
   delay could be coded on a compressed "logarithmic" scale similar to
   G.711 A-law, which can code with a resolution of 1 unit on the
   uncompressed chord, and resolutions 2, 4, 8, 16, 32, 64 on each
   successive more compressed chord to give a range of +/- 2048.  This
   would correspond to +/- 2048/8000s ~ +/- 250ms for 8kHz sampled
   speech (enough to cover jitter), whilst using 1 byte per packet.
   Modifications would be needed for other sampling rates.  It might be
   necessary to standardise a timing unit resolution independent of the
   sampling clock.  Specific reserved values could be used to indicate
   that an expected packet did not arrive.
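
   One possible realisation of such a one-byte code is sketched below.
   It is an illustration only: the bit layout (sign bit, 3-bit chord,
   4-bit step), the saturation at +/- 2047, and the use of the
   otherwise unused "negative zero" value to mark a lost packet are
   assumptions of this sketch, not part of G.711 or of any RTCP
   format.

```python
LOST = 0x80      # assumption: "negative zero" is reserved for "packet lost"
MAX_MAG = 2047   # largest magnitude representable in 7 bits of chord/step


def encode(d):
    """Code a delay variation d (integer RTP timestamp units) into one byte."""
    sign = 0x80 if d < 0 else 0
    mag = min(abs(d), MAX_MAG)            # saturate outside the coded range
    if mag < 32:
        code = mag                        # chords 0 and 1: resolution 1
    else:
        chord = mag.bit_length() - 4      # 32..63 -> 2, ..., 1024..2047 -> 7
        step = (mag >> (chord - 1)) & 0x0F
        code = (chord << 4) | step        # resolution 2, 4, ..., 64
    # A negative d always has code >= 1, so LOST (0x80) is never produced.
    return sign | code


def decode(byte):
    """Return the lower edge of the coded interval, or None for a lost packet."""
    if byte == LOST:
        return None
    sign = -1 if byte & 0x80 else 1
    code = byte & 0x7F
    chord, step = code >> 4, code & 0x0F
    mag = code if chord < 2 else (16 + step) << (chord - 1)
    return sign * mag
```

   With an 8 kHz RTP clock, the coded range of 2047 units corresponds
   to roughly 256 ms, and the worst-case quantisation error on the
   outermost chord is below 64 units (8 ms).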

   To estimate data volume, consider a low-bandwidth codec like G.729
   with 20ms packetisation.  Over a 5s RTCP cycle there will be 250
   media packets and 102 bytes/packet (20ms G.729 in RTP/UDP/IP/Ethernet
   including preamble and Inter-Frame Gap) for a total media layer-2
   bandwidth of 25500 bytes/5s (about 40kbit/s). 1 byte per received
   packet is 250 bytes "raw data" and an overhead of 82 bytes (RTP/UDP/
   IP/Ethernet, same basis) - say 350 bytes total including some
   identification (SSRCs etc).  This is a fraction 350/25500~1.4% which
   is within RTP guidelines for RTCP bandwidth.  The corresponding
   calculation for G.711 with 10ms packetisation is 81000 bytes/5s media
   and a 600-byte "raw transport report" or 0.75%.
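
   The arithmetic above can be reproduced with a short sketch.  The
   function name and parameters are illustrative.  The payload sizes
   (20 bytes for 20ms of G.729, 80 bytes for 10ms of G.711) follow
   from the codec bit rates and are consistent with the per-packet
   and per-cycle totals quoted above.

```python
def report_fraction(payload_bytes, packet_ms, report_bytes,
                    cycle_s=5, overhead_bytes=82):
    """Return (packets per cycle, media bytes per cycle, report/media ratio).

    overhead_bytes is the per-packet RTP/UDP/IP/Ethernet overhead used in
    the text (including preamble and Inter-Frame Gap).
    """
    packets = cycle_s * 1000 // packet_ms
    media_bytes = packets * (payload_bytes + overhead_bytes)
    return packets, media_bytes, report_bytes / media_bytes


g729 = report_fraction(payload_bytes=20, packet_ms=20, report_bytes=350)
g711 = report_fraction(payload_bytes=80, packet_ms=10, report_bytes=600)

for name, (packets, media, fraction) in (("G.729", g729), ("G.711", g711)):
    print(f"{name}: {packets} packets, {media} bytes, {fraction:.2%}")
```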

   However, the use of D(i,j) [RFC3550] for estimation of packet delay
   variation relies on a fixed relationship in the source RTP system
   between the RTP timestamp and the transmission time of the packet
   onto the wire.  This fixed relationship is not guaranteed even for
   audio coding and is almost certainly significantly wrong for many
   video formats, where the RTP timestamp indicates the sampling instant
   of a frame which may be encoded into multiple packets sent at
   significantly different times throughout a frame interval.  It could
   be argued that the current RTP framework provides no means for
   reliable estimation of packet delay variation in general, despite the
   usefulness of the D(i,j) metric for simple audio streams.  This could
   lead to a conclusion that an RTP-based measure of packet delay
   variation is not re-usable across RTP applications other than simple
   VoIP codecs.

   Logically, digital signal processors (DSPs) would be used to
   calculate metrics, including the per-packet data described above.
   Current advice is that an additional overhead of 600 bytes per
   channel is needed to store measurement results before periodic
   transmission, and as such, the per-channel memory required to support
   this option will increase memory requirements on infrastructure
   devices.  As memory in currently deployed infrastructure gateways is
   sized for optimum performance, cost and power, adding this
   measurement function would reduce channel density, which ultimately
   impacts cost and power.  Including additional memory in future
   designs has the same cost and power impacts.

   The principle that RTP systems should send per-packet reception
   report data, and correspondingly that the RTP (or other) system
   receiving this report data should calculate the metrics of its choice
   from this data, results in a requirement for computation both at the
   RTP system which sends the per-packet report and at the RTP (or
   other) system which receives the report.  If DSPs are used to perform
   this computation in the system which receives the report, there is a
   further demand on the memory of the DSP devices involved.  If
   general-purpose computing devices are used, then the cost of these
   devices may be significant.  For example, for a 16000 channel trunk
   media gateway implementing the scheme above and using 10ms
   packetisation, the gateway must code or decode a total of 3200000
   bytes of data per second.
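
   The gateway figure can be checked as follows.  The sketch assumes
   one byte of report data per packet, as in the coding scheme above,
   and assumes that the total counts both directions of each channel
   (one coded, one decoded); the second assumption is an
   interpretation of the text, and the function name is illustrative.

```python
def gateway_report_bytes_per_second(channels, packet_ms,
                                    bytes_per_packet=1, directions=2):
    """Per-packet report data handled by a gateway each second."""
    packets_per_second = 1000 // packet_ms
    return channels * packets_per_second * bytes_per_packet * directions


print(gateway_report_bytes_per_second(channels=16000, packet_ms=10))
```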

   Note that this general method of supplying raw data from the RTP
   system is the only one which gives the system which receives the data
   the flexibility to calculate any chosen transport metric for upward
   reporting.  All other methods below either omit or condense data,
   such that the RTP (or other) system receiving the report is informed
   only about certain aspects of the transport performance which was
   measured at the remote RTP system.  However, the method does not
   report on the impairment to the far-end application caused by the
   impairment to the outgoing transport.  For example, it provides no
   information about far-end jitter buffer events or late packets deemed
   lost by the application.  This is considered further in Section 4.

3.2.  Option 2 - Real-time histogram methods

   There are several potentially useful metrics which rely on the
   accumulation of a histogram in real time, so that a packet arrival
   results in a counter being incremented rather than in the creation of
   a new data item.  These metrics may be gathered with a low and
   predictable storage requirement.  Each counter corresponds to a
   single class interval or "bin" of the histogram.  Examples of metrics
   which may be accumulated in this way include the observed
   distribution of packet delay variation, and the number of packets
   lost per unit time interval.

   Different networks may have very different expected and achieved
   levels of performance, but it may be useful to fix the number of
   class intervals in the reported histogram to give a predictable
   volume of data.  This can be achieved by starting with small class
   intervals ("bin widths") and automatically increasing the width (e.g.
   by factors of two) if outliers are seen beyond the current upper
   limit of the histogram.  Data already accumulated may be assigned
   unambiguously to the new set of bins, given some simple conditions on
   the relationship between the old and new origins and bin widths.
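
   A minimal sketch of such a self-scaling histogram follows.  It
   assumes the simplest conditions under which re-binning is
   unambiguous: the origin stays at zero, the bin width doubles, and
   the number of bins is even, so that each new bin is exactly the
   union of two old bins.  The class name and defaults are
   illustrative.

```python
class DoublingHistogram:
    """Fixed number of bins starting at zero; bin width doubles on overflow."""

    def __init__(self, bins=8, width=1):
        if bins % 2:
            raise ValueError("number of bins must be even")
        self.counts = [0] * bins
        self.width = width

    def add(self, value):
        """Record one non-negative sample, e.g. a delay variation in ms."""
        while value >= self.width * len(self.counts):
            self._double()
        self.counts[int(value // self.width)] += 1

    def _double(self):
        # Merge adjacent pairs of old bins into the lower half of the
        # histogram and clear the upper half; no sample changes its meaning.
        half = len(self.counts) // 2
        merged = [self.counts[2 * i] + self.counts[2 * i + 1]
                  for i in range(half)]
        self.counts = merged + [0] * half
        self.width *= 2
```

   Storage stays constant at one counter per bin however wide the
   observed range becomes, at the cost of coarser resolution once
   outliers have been seen.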

   A significant disadvantage of the histogram method is the loss of any
   information about time-domain correlations between the samples which
   build the histogram.  For example, a histogram of packet delay
   variation provides no indication of whether successive samples of
   packet delay variation were uncorrelated, or alternatively that the
   packet delay variation showed a highly-correlated low-frequency

3.3.  Option 3 - Monitoring by exception

   An entity which both monitors the packet stream, and has sufficient
   knowledge of the application to know when transport impairments may
   have degraded the application's performance, may choose to send
   exception reports containing details of the transport impairments to
   a receiving system.  The crossing of a transport impairment
   threshold, or some application-layer event, would trigger such
   reports.  RTP end systems and mixers are likely to contain
   application implementations which may, in principle, identify this
   type of exception.

   It is likely that RTP translators will not contain suitable
   implementations which could identify such exceptions.

   On-path devices such as routers and switches are not likely to be
   aware of RTP at all.  Even if they are aware of RTP, they are
   unlikely to be aware of the RTP-level performance required by
   specific applications, and hence they are unlikely to be able to
   identify the level of impairment at which exceptional transport
   conditions may start to affect application performance.

   This type of monitoring typically requires the storage of recent data
   in a FIFO (e.g. a circular buffer) so that data relevant to the
   period just before and just after the exception may be reported.  It
   is not usually helpful to report transport data only from the period
   following an exception event detected by an application.  This
   imposes some storage requirement (though less than needed for Option
   1).  It also implies the existence of additional cross-layer
   primitives or APIs to trigger the transport layer to generate and
   send its exception report.  Such a capability might be considered
   architecturally undesirable, in that it complicates one or more
   interfaces above the RTP layer.
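
   The buffering described above can be sketched with a bounded FIFO.
   The class, its method names, and the window sizes are hypothetical;
   in particular, on_exception() stands in for whatever cross-layer
   primitive an implementation might provide.

```python
from collections import deque


class ExceptionReporter:
    """Keep recent per-packet samples; report those around an exception."""

    def __init__(self, before=50, after=50):
        if after < 1:
            raise ValueError("after must be at least 1")
        self.history = deque(maxlen=before)  # circular buffer of recent samples
        self.after = after
        self.pending = None                  # report currently being assembled
        self.remaining = 0
        self.reports = []                    # completed exception reports

    def on_packet(self, sample):
        """Called by the transport layer for every received packet."""
        if self.pending is not None:
            self.pending.append(sample)
            self.remaining -= 1
            if self.remaining == 0:
                self.reports.append(self.pending)
                self.pending = None
        self.history.append(sample)

    def on_exception(self):
        """Called by the application layer (hypothetical cross-layer hook)."""
        if self.pending is None:
            self.pending = list(self.history)  # data from just before the event
            self.remaining = self.after        # then collect data from after it
```

   The storage cost is bounded by the "before" and "after" window
   sizes, independent of the length of the session.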

3.4.  Option 4 - Application-specific monitoring

   This is a business-as-usual option which suggests that the current
   approach should not be changed, based on the idea that previous
   application-specific approaches such as that of [RFC3611] were valid.
   If a large category of RTP applications (such as VoIP) has a
   requirement for a unique set of transport metrics, arising from its
   different requirements of the transport, then it seems reasonable for
   each application category to define its preferred set of metrics to
   describe transport impairments.  We expect that there will be few
   such categories, probably fewer than 10.

   It may be easier to achieve interworking for a well-defined set of
   application-specific metrics than it would be in the case that
   applications select a profile from a palette of many independent re-
   usable metrics.

4.  RTP terminal metrics

   By "RTP terminal metrics" we mean metrics relating to the way a
   terminal deals with transport impairments affecting the incident RTP
   stream.  These may include de-jitter buffering, packet loss
   concealment, and the use of redundant streams (if any) for correction
   of error or loss.

   An example of such a metric is a count of packets arriving too late
   to be played out at current de-jitter buffer settings.
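
   A count of late packets might be computed as sketched below.  The
   sketch assumes a fixed playout delay, a playout clock anchored to
   the arrival of the first packet, and arrival times expressed in RTP
   timestamp units; real de-jitter buffers are often adaptive, and the
   function name is illustrative.

```python
def count_late_packets(arrivals, playout_delay):
    """Count packets arriving after their playout deadline.

    arrivals: iterable of (rtp_timestamp, arrival_time) pairs, both in
    the same clock units.  playout_delay: de-jitter buffer depth in the
    same units.
    """
    late = 0
    offset = None
    for rtp_timestamp, arrival_time in arrivals:
        if offset is None:
            # Anchor the playout clock to the first packet received.
            offset = arrival_time - rtp_timestamp
        deadline = rtp_timestamp + offset + playout_delay
        if arrival_time > deadline:
            late += 1
    return late
```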

5.  Application layer metrics

5.1.  Requirements for speech quality monitoring metrics

   RTP transport can be used for different application types such as IP
   (including public internet) and non-IP.  It can also apply to
   different user group sizes running over networks ranging in size from
   a small closed user group through an enterprise system to national
   and international networks.  Engineering judgment is required to
   choose the most suitable set of speech quality monitoring metrics for
   the type of application and the size of the network the application
   is running on.  Some metrics are more suitable for monitoring service
   level agreements (SLAs), others may be required for regular routine
   monitoring, and still others may be required for fault diagnosis.
   The resolution of the metrics may also be different for different
   types of monitoring.  These considerations make it difficult to
   propose a "one size fits all" set of metrics.  However some general
   points can be made and it is also useful to propose a minimum set of
   metrics.

   Mean Opinion Score (MOS) speech quality metrics such as MOS-LQO for
   listening quality and MOS-CQO for conversation quality (see later
   section for further discussion of MOS metrics) are useful for
   measuring end-to-end speech quality.  However they typically require
   significant time and processing power to produce a result and some
   MOS-LQO test methods require test calls that consume bandwidth.  This
   rules out MOS metrics for frequent large-scale monitoring.  Also,
   methods for measuring conversational MOS are not yet mature enough
   for VoIP monitoring applications, even though many vendors are
   using an E-model [G.107] approach in the absence of anything else.
   This leaves only MOS-LQO as an overall composite speech quality
   metric and, being a listening-only metric, it does not take account
   of interactive effects such as fixed delay and echo.  Nevertheless,
   MOS-LQO is often used for SLAs and usually provides a better
   estimate of what a user actually experiences than a single network
   or terminal metric or a group of such metrics.  A poor MOS score by
   itself gives little indication of the cause of a problem, however,
   and further metrics are required for diagnostic purposes.

   A proposed minimum set of metrics with suggested resolutions is as
   follows:

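     +----------------------------------+------------+--------------+
     | Metric                           | Resolution | Range        |
     +----------------------------------+------------+--------------+
     | MOS-LQO                          | 0.1 MOS    | 1 to 5       |
     |                                  |            |              |
     | Received speech level            | 0.1 dB     | -60 to +10   |
     |                                  |            |              |
     | Received noise level             | 0.5 dB     | -130 to +10  |
     |                                  |            |              |
     | Echo return loss                 | 0.1 dB     | 6 to 40      |
     |                                  |            |              |
     | Round trip delay                 | 1 ms       | 1 ms to 65 s |
     |                                  |            |              |
     | Packet delay variation or jitter | 1 ms       | 1 ms to 65 s |
     |                                  |            |              |
     | Packet loss                      | 1 packet   | 0 to 2^24    |
     +----------------------------------+------------+--------------+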

                                  Table 1

   [Editor's note: More detail required here in a future draft to add
   information about meaningful measurement durations and whether
   measurements should include mean and peak values etc.  Also require
   some discussion around "second level" metrics such as jitter buffer
   parameters for diagnosis of more complicated problems.]

   Note that some voiceband data applications running over the same
   transport network as voice applications may require much lower values
   of packet loss and packet delay variation than would be required for
   voice applications alone.

   A reporting system for these metrics should be capable of
   accommodating intermediate network and terminal parameters as well as
   end-to-end quality metrics for both monitoring and diagnostic
   purposes.

   This minimum set of metrics should allow a wide range of problems to
   be diagnosed, particularly if metrics are available at intermediate
   points in the network as well as at the endpoints.  Echo return loss
   and delay can be used to establish whether echo is a problem (which
   would not affect the MOS-LQO score, as this is a listening-only
   measurement).  Poor MOS-LQO scores could be caused by several
   factors, but individual measures of packet loss, jitter and noise
   levels could be used to establish the presence or absence of these
   degradations.  Finally, the level of received speech gives an
   indication of whether the operating point is correct and whether
   possible distortion or poor signal-to-noise are causing problems.

   The codec type will often be known, and this can be very useful for
   diagnostic purposes if, for example, typical MOS scores and
   susceptibility to packet loss are known for that codec.  Knowledge
   of network topology is also very useful and can, for example,
   indicate possible bandwidth bottlenecks.

5.2.  The audio hierarchy

   The audio hierarchy can be broadly split into listening (one-way) and
   conversation (two-way, or multi-way conferencing) applications.
   These categories can be further split as shown in Figure 1.  In
   addition, ITU-T has defined a number of bandwidth categories:
   narrowband (300 to 3400 Hz), wideband (50 to 7000 Hz), super wideband
   (50 to 14000 Hz) and full band (20 to 20000 Hz).

           Listening             Conversation
                 |                      |
       -------------              -----------
      |             |            |           |
   Streaming  Non-streaming   Two-way  Conferencing
                                     |             |
                               Non-spatial       Spatial

                       Figure 1: The audio hierarchy
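
   The bandwidth categories listed above lend themselves to a simple
   lookup.  The following Python sketch is illustrative only: the names
   BANDWIDTH_CATEGORIES and narrowest_category are hypothetical and are
   not defined by ITU-T or by this memo.  It returns the narrowest
   category whose passband covers a given audio band.

```python
from typing import Optional

# Passbands in Hz as listed in the text, ordered narrowest first.
BANDWIDTH_CATEGORIES = [
    ("narrowband", 300, 3400),
    ("wideband", 50, 7000),
    ("super wideband", 50, 14000),
    ("full band", 20, 20000),
]


def narrowest_category(low_hz: float, high_hz: float) -> Optional[str]:
    """Return the narrowest category whose passband covers [low_hz, high_hz]."""
    for name, cat_low, cat_high in BANDWIDTH_CATEGORIES:
        if cat_low <= low_hz and high_hz <= cat_high:
            return name
    return None  # the band is wider than "full band"
```

   For example, a band of 100 to 6000 Hz does not fit within narrowband
   but does fit within wideband, so the sketch reports "wideband".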

   The following sections concentrate on one-way (listening only) and
   two-way (conversational) telephony applications, for which several
   composite speech quality metrics exist in ITU-T Recommendations.
   Similar considerations could apply to other applications such as
   conferencing and this should be addressed in further drafts.
   Suitable metrics for spatial conferencing are more difficult to
   derive at this stage since the technology is still relatively new.

5.3.  Individual network transport and terminal parameters affecting
      speech quality

   Parameters affecting both listening and conversation quality include:

   o  Listening level

   o  Noise (both electrical circuit noise and environmental noise)

   o  Distortion (including amplitude clipping and codec distortion)

   o  Syllable clipping

   o  Comfort noise and voice activity detection

   o  Packet delay variation and jitter buffer operation

   o  Packet loss

   Listening levels that are either too quiet or too loud can be
   unpleasant and make communication difficult.

   High noise levels can make listening difficult and, in a
   conversation, high background noise levels may cause a speaker to
   raise their voice level so that they can hear themselves above the
   noise.

   Certain types of signal distortion such as amplitude clipping can be
   very unpleasant.

   Syllable clipping occurs when the speech at the start or end of a
   syllable is missing and can cause words to be misunderstood.

   Voice activity detection is used to sense periods of voice
   inactivity, which are then transmitted as silence periods to reduce
   bandwidth.  Artificial noise (comfort noise) is then injected on the
   receiving side of a connection to mask the silence caused by the
   voice activity detection.  Without the comfort noise injection, the
   listener might think that the connection had died.  However, the
   contrast between comfort noise and transmitted background noise may
   be unpleasant for the listener if the comfort noise has not been well
   matched to the background noise.

   Packet delay variation caused by the underlying transport has to be
   "smoothed out" by using a jitter buffer to temporarily store received
   speech and then play it back at a uniform rate.  Jitter buffers that
   are too short or have been incorrectly implemented may cause packet
   loss or "stuttering" of speech, and jitter buffer delays that are
   too long add unduly to the overall delay of a connection.  For speech
   or music applications (not data), adaptive jitter buffers that reduce
   delay as much as possible whilst minimising the risk of packet loss
   are preferable.  However, buffer length adaptations must be carefully
   managed to ensure they are inaudible.  This is usually achieved by
   ensuring that such adaptations occur during silence intervals.
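
   The trade-off between buffer length and late-packet loss can be
   illustrated with a minimal Python sketch.  It assumes a fixed (not
   adaptive) playout delay measured from the time each packet was sent;
   the function name and the sample delay values are hypothetical and
   are chosen only for illustration.

```python
from typing import List, Tuple


def simulate_fixed_jitter_buffer(
    network_delays_ms: List[float], playout_delay_ms: float
) -> Tuple[int, int]:
    """Return (played, late) packet counts for a fixed playout delay.

    Each packet is scheduled for playout playout_delay_ms after it was
    sent, so it is usable only if its network delay does not exceed
    that value.  Later arrivals are discarded and count as loss.
    """
    played = sum(1 for delay in network_delays_ms if delay <= playout_delay_ms)
    late = len(network_delays_ms) - played
    return played, late


# Hypothetical one-way delays for six consecutive packets.
delays = [20, 25, 60, 22, 45, 21]
short_buffer = simulate_fixed_jitter_buffer(delays, 40)  # two packets arrive late
long_buffer = simulate_fixed_jitter_buffer(delays, 60)   # none late, 20 ms more delay
```

   With these sample values the shorter playout delay discards two of
   the six packets, while the longer one plays all six at the cost of
   20 ms of additional end-to-end delay.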

   Finally, packet loss causes temporary loss of the signal, which may
   become unintelligible as a result.

   In addition, a good conversational experience requires interactivity
   between parties, which in turn requires low-delay, low-echo
   applications.  Additional parameters affecting conversation quality
   can therefore be listed as follows:

   o  Delay

   o  Talker echo

   o  Listener echo

   o  Double-talk performance

   o  Sidetone

   Long delays affect interactivity and can cause one party to think
   that the other party is being "very slow" in answering.  In extreme
   cases, very long delays can be very confusing and can cause one party
   to talk over the other party.  The only way round this problem is for
   the conversation to become half duplex, where each party takes it in
   turns to speak and each makes it clear when they have finished
   speaking.

   Echo can be caused either by electrical reflections at a 2-wire to
   4-wire converter or by acoustic or mechanical transmission paths
   between microphone and earphone.  The latter effect is characterised
   by the terminal coupling loss.  Talker echo causes the speaker to
   hear an echo of his or her own voice and can be very confusing.
   Listener echo is generally less common and occurs when the listener
   hears an echo of the speaker's voice.  Short echo delays cause the
   signal to sound hollow or slightly reverberant, whilst longer delays
   cause a distinct echo or echoes.

   Echo cancellers are used to minimise echo, but can cause other
   problems if not carefully designed.  For example, periods of double-
   talk where both parties are speaking at the same time may cause the
   canceller to diverge and produce echo.

   Sidetone is local feedback from the speaker's microphone to their
   earpiece, which lets them know that the connection is still "live".
   Without this feedback, the connection would sound "dead", which would
   be confusing.  The level, frequency response and distortion of the
   sidetone can all affect the user's experience.

5.4.  Composite objective speech quality metrics

   In addition to the individual "network" or "terminal" metrics
   described in the previous section, there are several composite speech
   quality metrics for objectively measuring end-to-end overall speech
   quality, based on a 5-point scale defined as follows:

   o  5 = Excellent

   o  4 = Good

   o  3 = Fair

   o  2 = Poor

   o  1 = Bad

   A measurement using the scale just described results in a Mean
   Opinion Score (MOS), which represents the mean of several opinions
   obtained from a subjective test.  Mean opinion score terminology is
   defined in [P.800.1].
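
   As a simple illustration of how a MOS is formed from individual
   opinions on the 5-point scale, consider the following Python sketch.
   The function name and the sample votes are hypothetical; the sketch
   covers only the averaging step and says nothing about how a
   subjective test should be designed or conducted.

```python
from typing import Iterable

# The 5-point opinion scale described above.
OPINION_SCALE = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}


def mean_opinion_score(votes: Iterable[int]) -> float:
    """Return the arithmetic mean of votes given on the 5-point scale."""
    votes = list(votes)
    if not votes:
        raise ValueError("at least one opinion is required")
    for vote in votes:
        if vote not in OPINION_SCALE:
            raise ValueError(f"vote {vote!r} is not on the 5-point scale")
    return sum(votes) / len(votes)


# Four hypothetical listeners vote Excellent, Good, Fair and Fair.
example_mos = mean_opinion_score([5, 4, 3, 3])
```

   The four sample votes sum to 15, giving a MOS of 3.75.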

   The composite speech quality metrics are useful for commissioning and
   Service Level Agreements (SLAs), but (as previously discussed)
   additional diagnostic information is required when these metrics
   fall below threshold values.

   Composite objective speech quality metrics can be divided into
   listening quality (MOS-LQO) and conversational quality (MOS-CQO).
   The ITU-T has produced several recommendations for measuring these
   composite speech quality metrics [P.561], [P.562], [P.563], [P.564],
   [P.862], [P.862.1], and [P.862.2].  A hierarchy of the various ITU
   speech quality test methods is shown in Figure 2.

                Objective speech quality test methods
                     |                       |
                 Listening             Conversation
                     |                       |
            -----------------                |
           |                 |               |
      Intrusive        Non-intrusive        INMD
     Double-ended      Single-ended     P.561,P.562
           |                 |               |
           |            -----------          |
         PESQ         P.563      P.564     P.CQO
   P.862, P.862.1   Estimate    Estimate   under
       P.862.2        based     based on   development
     WB extension   on speech   IP n/work
           |         payload    parameters

          Figure 2: Hierarchy of ITU Speech quality test methods

   Double-ended test methods (P.862/P.862.1/P.862.2) rely on a reference
   signal that is injected at one end of the network and then captured
   at the other end of the network.  The reference and degraded signals
   are compared and an auditory transform that models the human hearing
   system is then applied to produce the final MOS value.  In contrast,
   single-ended systems do not require a reference signal and rely
   solely on the speech payload (e.g. P.563) or on IP network
   parameters (e.g. P.564).  P.563 measures several individual
   characteristics of the received speech signal and then combines the
   results to form a MOS-LQO, which has been verified against
   subjectively scored degraded speech files.  P.564 uses several IP
   network parameters and permitted RTCP-XR data, again to produce a
   MOS-LQO.  In general, double-ended methods are more accurate because
   they have a reference signal against which to compare the degraded
   signal.

   P.561 describes an In-service Non-intrusive Measurement Device (INMD)
   for making in-service measurements of several voice and network
   parameters, which can then be used to produce a conversational mean
   opinion score as described in P.562.  However, the algorithm in P.562
   was originally intended for TDM rather than IP applications and
   therefore can only be applied to situations where the impact of IP
   impairments is negligible.  The term "In-service" means that the
   measurements are made during real customer calls.

   In addition to the recommendations already mentioned, there is also a
   planning tool called the E-Model described in another ITU-T
   recommendation [G.107].  This was not designed for monitoring
   applications, but has unfortunately been misused for this purpose by
   several vendors.

   Another objective measurement tool is described in an ITU-R
   Recommendation [BS.1387].  Perceptual Evaluation of Audio Quality
   (PEAQ) has generally been optimised for the assessment of music
   signals rather than speech and is applicable to high-quality coded
   audio systems as used by broadcasters for example.

   The listening quality methods already mentioned (P.862/P.862.1/
   P.862.2, P.563 and P.564) all produce MOS-LQO values as their primary
   outputs and require as input either speech or, in the case of P.564,
   individual network parameters.  Each can be used at intermediate
   points or endpoints of the network, provided that appropriate
   interfaces are available.  Except in the case of P.564, these methods
   either require computational power at the measurement point, or the
   speech file has to be captured and sent to a server for processing.
   In the latter case, the speech file is too large for transport by
   RTCP.  By contrast, a P.564 MOS-LQO calculation relies only on packet
   header information and permitted information from RTCP-XR, i.e.
   relatively lightweight data.

   P.561/P.562 is the only ITU conversational monitoring method
   (although P.CQO is under development) and it requires the following
   parameters to be measured:

   o  Active speech level

   o  Noise level (psophometrically weighted)

   o  Speech activity factor

   o  Speech echo path delay

   And at least one of:

   o  Echo loss

   o  Echo path loss

   o  Speech echo path loss

   Class D INMDs [P.561] for IP applications are required to implement
   the following functions:

   o  De-jitter buffer

   o  Voice decoder

   o  Comfort noise generator

   o  Error concealment process

   and are required to measure packet delay variation and IP packet
   loss.

   P.562 uses these input parameters to calculate a MOS-CQO score.
   However, as already mentioned, the algorithm is at present suitable
   only for situations where the impact of IP impairments is negligible.

6.  Choosing transport protocols for metrics

   Metrics related to RTP sessions are measured by RTP systems but may
   use any convenient transport mechanism "horizontally" to other RTP
   systems or "northbound" to session control or management systems,
   e.g. RTCP XR [RFC3611], SNMP [RFC3410], SIP [RFC3261] headers or
   attachments, or TR-069 mechanisms [DSLF-TR-069].

6.1.  RTCP as a transport for metrics - advantages and disadvantages

   RTCP XR remains at least a candidate transport protocol for
   metrics, though note that [GUIDELINES] states explicitly that "The
   amount of information going into RTCP reports should primarily target
   the peer (and thus include information that can be meaningfully
   reacted upon).  Gathering and reporting statistics beyond this is not
   an RTCP task and should be addressed by out-of-band protocols".

   If RTCP is used, AVT need define only a generic means to transport
   arbitrary payloads.  Such a means is already available in the form of
   RTCP XR block types [RFC3611].  If the data is self-describing, e.g.
   based on ASN.1 [X.680] or XML [XML], or if usage is standardised in
   profiles, it would be possible to transmit many different collections
   of data whilst using only a small number of codepoints from the
   limited namespace of XR report block types.  As a minimum, only one
   XR block type codepoint need be allocated per SDO, with delegation to
   the SDO to manage a namespace defined by a type field in the payload.
   The measurements of round-trip delay and packet loss could still use
   the established mechanisms from RFC 3550.
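
   The delegation scheme described above can be sketched in Python as
   follows.  The outer framing (8-bit block type, 8-bit type-specific
   field, 16-bit block length in 32-bit words minus one) is the XR
   report block header of [RFC3611].  Everything inside the block is an
   assumption made for illustration: the 16-bit SDO-managed type field,
   the 16-bit payload length, the function names, and the block type
   value 200 are all hypothetical and are not allocated or defined
   anywhere.

```python
import struct


def pack_xr_block(block_type: int, type_specific: int, contents: bytes) -> bytes:
    """Frame contents as one RTCP XR report block (RFC 3611 block header)."""
    if not 0 <= block_type <= 255 or not 0 <= type_specific <= 255:
        raise ValueError("BT and type-specific fields are 8 bits each")
    padding = (-len(contents)) % 4   # blocks must be a multiple of 32 bits
    body = contents + b"\x00" * padding
    # Block length counts 32-bit words including the header, minus one,
    # which equals the number of words in the body.
    block_length = len(body) // 4
    return struct.pack("!BBH", block_type, type_specific, block_length) + body


def pack_sdo_metrics(sdo_block_type: int, metrics_type: int, payload: bytes) -> bytes:
    """Hypothetical SDO layout: 16-bit type, 16-bit payload length, payload."""
    contents = struct.pack("!HH", metrics_type, len(payload)) + payload
    return pack_xr_block(sdo_block_type, 0, contents)


# Placeholder values: block type 200 and metrics type 7 are illustrative only.
example_block = pack_sdo_metrics(200, 7, b"\x01\x02\x03")
```

   In this sketch a single XR block type identifies the SDO, and the
   inner type field selects one of that SDO's metric collections.  The
   inner length field lets a receiver discard the padding added to
   reach a 32-bit boundary.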

   This approach is analogous to the definition of codec payload formats
   for RTP.  A specification could define how metrics payloads are
   carried in RTCP, and how SDP (including offer/answer) is used to
   request an RTP system to send a metrics payload.  The approach
   decouples the RTCP base protocol (transport format, routing, and
   transmission rate rules, and RTCP's base metrics) from less generic
   use cases.

6.1.1.  Advantages of RTCP

   RTCP uses the same transport as the RTP media path and hence, if
   media can be transmitted, it is likely that RTCP can also be
   transmitted, although for connections not using [RTPRTCPMUX] this is
   subject to possible difficulties with NAT and firewall devices, which
   may sometimes not open a port for RTCP.

   RTCP uses the same transport as the RTP media path so will normally
   experience the same transport performance as that experienced by the
   RTP media packets.  Firstly this allows an RTCP-based mechanism to
   make a representative measurement of round-trip delay.  Secondly, if
   QoS mechanisms such as expedited forwarding (EF) have been
   implemented in support of the RTP media traffic, the transport is
   likely to be low-delay and possibly also low-loss, compared with a
   best-efforts class.
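
   For illustration, the round-trip delay calculation defined in
   [RFC3550] can be sketched in Python.  A sender that receives a
   reception report block subtracts the block's LSR (last SR timestamp)
   and DLSR (delay since last SR) fields from the arrival time of the
   report, all expressed in units of 1/65536 seconds.  The function
   name is hypothetical; the sample values follow the worked example
   given in RFC 3550.

```python
def rtcp_round_trip_seconds(arrival: int, lsr: int, dlsr: int) -> float:
    """Round-trip time from one reception report block.

    arrival: middle 32 bits of the NTP time at which the report arrived
    lsr:     LSR field of the report block (middle 32 bits of NTP time)
    dlsr:    DLSR field of the report block
    All three values are in units of 1/65536 seconds.
    """
    # The timestamps wrap, so the subtraction is done modulo 2**32.
    rtt_units = (arrival - lsr - dlsr) & 0xFFFFFFFF
    return rtt_units / 65536.0


# Values from the worked example in RFC 3550.
example_rtt = rtcp_round_trip_seconds(0xB7108000, 0xB7052000, 0x00054000)
```

   With these values the sketch gives a round-trip time of 6.125
   seconds.  Because the RTCP packets share the media path, the result
   is representative of the delay seen by the RTP media packets.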

   Existing transport devices (for example, SBCs, BGWs, NAT) have often
   been implemented to allow RTCP to transit transparently on the next
   higher UDP port.  The devices are unlikely to pass another protocol
   for the transport of metrics without modification.  This would make
   it harder to introduce any non-RTCP protocol for transport of
   metrics.
6.1.2.  Disadvantages of RTCP

   RTCP is usually carried over an unreliable UDP/IP transport.  Any
   monitoring scheme using RTCP as its transport must be designed to
   tolerate message loss and duplication.

   Bandwidth for the transport of RTCP may be limited.  [RFC3550]
   explicitly limits the bandwidth consumed by RTCP traffic to 5% of the
   bandwidth used by RTP media.  Even without this limitation, the
   volume of traffic which is allowed access to EF queues may be
   policed, such that large fractions of RTCP traffic might result in
   high loss for both the RTCP traffic and for RTP media.
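
   The effect of the 5% limit on the volume of metrics that can be
   carried may be estimated with the following Python sketch.  It is a
   deliberate simplification: it ignores the full RTCP transmission
   interval rules of [RFC3550] and treats the limit as a flat fraction
   of the media bandwidth.  The function name and the sample figures
   are hypothetical.

```python
def rtcp_bytes_per_interval(
    media_bw_bps: float, interval_s: float, rtcp_fraction: float = 0.05
) -> float:
    """Bytes of RTCP that fit in one reporting interval.

    media_bw_bps:  bandwidth used by RTP media, in bits per second
    interval_s:    reporting interval, in seconds
    rtcp_fraction: share of media bandwidth allowed for RTCP (5% here)
    The result is shared by all RTCP senders in the session.
    """
    return media_bw_bps * rtcp_fraction * interval_s / 8.0


# A hypothetical 64 kbit/s voice stream reported every 5 seconds.
example_budget = rtcp_bytes_per_interval(64000, 5)
```

   For these sample figures the budget is about 2000 bytes per
   interval for all RTCP traffic in the session, including the standard
   sender and receiver reports, which indicates how little room remains
   for extended metrics.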

7.  IANA Considerations


8.  Security Considerations

   This document itself contains no normative text and hence should not
   give rise to any new security considerations, to be confirmed.

   [Editor's note - should this section consider security merits/
   demerits of proposals for alternative protocols to RTCP?]

9.  Acknowledgments

   This document was originally motivated by ideas from Colin Perkins.
   The authors would like to thank Graeme Gibbs at BT, and Debbie
   Greenstreet and her TI colleagues for their review comments.

10.  Informative References

   [BS.1387]  ITU-R, "Recommendation BS.1387.  Method for objective
              measurements of perceived audio quality", November 2001.

   [DSLF-TR-069]
              DSL Forum, "TR-069 CPE WAN Management Protocol v1.1",
              December 2007.

   [G.107]    ITU-T, "Recommendation G.107.  The E-model, a
              computational model for use in transmission planning",
              March 2005.

   [GUIDELINES]
              Ott, J., "Guidelines for Extending the RTP Control
              Protocol (RTCP)", ID draft-ott-avt-rtcp-guidelines-01,
              June 2008.

   [P.561]    ITU-T, "Recommendation P.561.  In-service non-intrusive
              measurement device - Voice service measurements",
              July 2002.

   [P.562]    ITU-T, "Recommendation P.562.  Analysis and interpretation
              of INMD voice-service measurements", May 2004.

   [P.563]    ITU-T, "Recommendation P.563.  Single-ended method for
              objective speech quality assessment in narrow-band
              telephony applications", May 2004.

   [P.564]    ITU-T, "Recommendation P.564.  Conformance testing for
              narrowband voice over IP transmission quality assessment
              models", November 2007.

   [P.800.1]  ITU-T, "Recommendation P.800.1.  Mean Opinion Score (MOS)
              terminology", July 2006.

   [P.862]    ITU-T, "Recommendation P.862.  Perceptual evaluation of
              speech quality (PESQ): An objective method for end-to-end
              speech quality assessment of narrow-band telephone
              networks and speech codecs", February 2001.

   [P.862.1]  ITU-T, "Recommendation P.862.1.  Mapping function for
              transforming P.862 raw result scores to MOS-LQO",
              November 2003.

   [P.862.2]  ITU-T, "Recommendation P.862.2.  Wideband extension to
              Recommendation P.862 for the assessment of wideband
              telephone networks and speech codecs", November 2007.

   [RFC3261]  Rosenberg, J., "SIP: Session Initiation Protocol",
              RFC 3261, June 2002.

   [RFC3410]  Case, J., "Introduction and Applicability Statements for
              Internet Standard Management Framework", RFC 3410,
              December 2002.

   [RFC3550]  Schulzrinne, H., "RTP: A Transport Protocol for Real-Time
              Applications", RFC 3550, July 2003.

   [RFC3611]  Friedman, T., "RTP Control Protocol Extended Reports (RTCP
              XR)", RFC 3611, November 2003.

   [RTPRTCPMUX]
              Perkins, C., "Multiplexing RTP Data and Control Packets on
              a Single Port", ID draft-ietf-avt-rtp-and-rtcp-mux-07,
              August 2007.

   [X.680]    ITU-T, "Recommendation X.680.  Abstract Syntax Notation
              One (ASN.1): Specification of basic notation", July 2002.

   [XML]      W3C, "Extensible Markup Language (XML) 1.0 (Fourth
              Edition)", September 2006.

Authors' Addresses

   Geoff Hunt
   Orion 1 PP9
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk  IP5 3RE
   United Kingdom

   Phone: +44 1473 608325
   Email: geoff.hunt@bt.com

   Philip Arden
   Orion 3/7 PP4
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk  IP5 3RE
   United Kingdom

   Phone: +44 1473 644192
   Email: philip.arden@bt.com

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
