[Docs] [txt|pdf|xml] [Tracker] [Email] [Nits]
Versions: 00
RTCWEB J. Rosenberg
Internet-Draft M. Kaufman
Intended status: Informational M. Hiie
Expires: August 12, 2011 F. Audet
Skype
February 8, 2011
An Architectural Framework for Browser based Real-Time Communications
(RTC)
draft-rosenberg-rtcweb-framework-00
Abstract
This document defines an architectural framework for browser-based
real-time communications (RTC). We propose a media component model,
where the browser provides an API abstraction which models media
components and connections. The underlying protocols within the
browser provide for a minimum set of functionality related to
transport of media.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on August 12, 2011.
Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
Rosenberg, et al. Expires August 12, 2011 [Page 1]
Internet-Draft RTCWEB-fw February 2011
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. The Media Component Model . . . . . . . . . . . . . . . . . . 4
3. The Role of Signaling . . . . . . . . . . . . . . . . . . . . 5
4. The Role of Media Transport . . . . . . . . . . . . . . . . . 8
5. Benefits of the Media Component Model . . . . . . . . . . . . 9
5.1. Enabling Innovation . . . . . . . . . . . . . . . . . . . 9
5.2. The Importance of Flexibility . . . . . . . . . . . . . . 10
6. Interoperability with Existing VoIP Gear . . . . . . . . . . . 11
7. Informative References . . . . . . . . . . . . . . . . . . . . 12
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 13
Rosenberg, et al. Expires August 12, 2011 [Page 2]
Internet-Draft RTCWEB-fw February 2011
1. Introduction
Real-time communications (RTC) remains one of the few - if only -
classes of desktop applications that is not yet possible using the
native capabilities of the web browser. These applications run
natively on the desktop, or are powered by plugins. The
functionality provided by these desktop clients is rich and complex -
ranging from user interface, to real-time notifications, to call
signaling and call processing, to instant messaging and presence, and
of course - the real-time media stack itself, including codecs,
transport, firewall and NAT traversal, security, and so on.
Given the breadth of functionality in today's desktop RTC clients,
careful consideration needs to be paid to how that functionality
manifests in the browser. What functionality lives within the
browser itself? What functionality lives on top of it - either in
client-side Javascript or within servers? What protocols are spoken
by the browser itself? What protocols can be implemented within the
Javascript? What protocols need to be standardized, and which do
not? Pictorially, the question is what protocols, APIs, and
functionality reside within the box marked "Browser RTC Function" in
Figure 1. Indeed, the central question is what functionality resides
in that box, as the functionality will ultimately dictate the
protocols that interface to it, and the APIs which control it.
Rosenberg, et al. Expires August 12, 2011 [Page 3]
Internet-Draft RTCWEB-fw February 2011
+------------------------+ On-the-wire
| | Protocols
| Servers |--------->
| |
| |
+------------------------+
^
|
|
| HTTP/
| Websockets
|
|
|
|
+----------------------------+
| Javascript/HTML/CSS |
+----------------------------+
Other ^ ^RTC
APIs | |APIs
+---|-----------------|------+
| | | |
| +---------+|
| | Browser || On-the-wire
| Browser | RTC || Protocols
| | Function|----------->
| | ||
| | ||
| +---------+|
+---------------------|------+
|
V
Native OS Services
Figure 1: Browser Model
2. The Media Component Model
It is our position that the functionality that manifests within the
box be a media component model. In this model, the browser
Rosenberg, et al. Expires August 12, 2011 [Page 4]
Internet-Draft RTCWEB-fw February 2011
implements the necessary functionality to perform the real-time
processing of media, starting from capture/render, through
encapsulation in real-time transport protocols sent over the
Internet. This functionality must be built into the browser, rather
than within Javascript, due to its tight timing requirements and
complexity. Furthermore, the functionality manifest as a set of
loosely coupled components, each of which performs some aspect of the
real-time processing. Each component has APIs which allow that
component to be configured (with sensible defaults where
appropriate), along with APIs that allow applications to gather
information and statistics about the performance of that module.
The modules would include the codec itself, the acoustic echo
canceller (AEC), the jitter buffer, audio and video pre-processing
modules, and network transport components (including encryption and
integrity protection of media) which speak specific transport
protocols (such as the Real-Time Transport Protocol (RTP)). The
media component model is purposefully minimalistic. It opts for
maximizing the functionality that lives outside of the browser itself
- within Javascript or servers. In particular, only functionality
which is real-time - which cannot be done using Javascript or server
functionality - resides within the browser itself. As explained in
Section 5, this facilitates innovation, differentiation, and
development velocity - all of the key characteristics that have made
the web what it is.
As an example, a codec component implementing Opus
[I-D.ietf-codec-opus] might be represented by a Javascript object
with properties that mirror the configuration settings of the codec
itself - the sample rate (one of narrowband, mediumband, wideband or
super-wideband), the packet rate (number of frames per packet), the
bitrate (which can vary between 6 and 40kbps), a slider that adjusts
the packet loss resilience, a Boolean which indicates whether inband
FEC should be used, and another Boolean which indicates whether to
apply silence suppression. Of course, all of these parameters might
have reasonable defaults so that non-expert programmers can just make
it work. However, an advanced programmer could force a mode or
change a setting as needed. After all, the Opus codec itself makes
these parameters tunable exactly because there is no one right value;
the correct setting depends on the application scenario and needs of
the developer.
3. The Role of Signaling
It is our view that signaling is accomplished using a combination of
existing client-server web protocols (HTTP, COMET, and websockets)
and standards-based server-to-server protocols, such as SIP. A view
Rosenberg, et al. Expires August 12, 2011 [Page 5]
Internet-Draft RTCWEB-fw February 2011
of the "browser RTC Trapezoid" is shown in Figure 2.
+-----------+ +-----------+
| Web/ | | Web/ |
| SIP | SIP | SIP |
| |-------------| |
| Server | | Server |
| | | |
+-----------+ +-----------+
/ \
/ \ Proprietary over
/ \ HTTP/Websockets
/ \
/ Proprietary over \
/ HTTP/Websockets \
/ \
+-----------+ +-----------+
|JS/HTML/CSS| |JS/HTML/CSS|
+-----------+ +-----------+
+-----------+ +-----------+
| | | |
| | | |
| Browser | ------------------------- | Browser |
| | Media | |
| | | |
+-----------+ +-----------+
Figure 2: Browser RTC Trapezoid
In this example, a call is placed between two different providers.
They use a SIP-based interface to federate between them. However,
each of their respective browser-based clients signals to its server
using proprietary application protocols built ontop of HTTP and
Websockets. For example, provider A might offer simple calling
services, and have a very simple web services interface for placing
calls:
http://calling.providerA.com/call?target=joe@providerB.com&myIP=1.2.3.4:4476
Which takes only the called party and local IP/port as arguments.
Provider A's server infrastructure - some combination of web and SIP
servers built in any way it likes - uses the identity of the target,
along with previously-known information on the capabilities of the
caller's browser learned through a web-services registration, to
Rosenberg, et al. Expires August 12, 2011 [Page 6]
Internet-Draft RTCWEB-fw February 2011
generate a SIP INVITE. This arrives at provider B's server
infrastructure, which alerts its browser-based client of the incoming
call. Provider B might be an enterprise service provider, and offer
much richer features and signaling. Provider B uses a websocket
interface to the browser, providing it the identity of the caller,
the list of available codecs, and so on. B's service provider offers
web-services based APIs for answering the call, declining it, sending
to voicemail, redirecting to another number, parking it, and so on.
APIs within the browser allow each side to instruct the browsers to
send media, including selection of media types and codecs. In this
model, there is no SIP in the browser. It is our view that SIP has
no place within the browser.
SIP is an application protocol - providing call setup, registration,
codec negotiation, chat and presence, amongst other features. For
each and every new feature that is desired to run between a SIP
client and a SIP server, a new standard must be defined and then
implemented. The feature set is indeed vast, considering the wealth
of potential endpoints, ranging from simple consumer "voice only"
clients, to richer videophones, to voice and video multiparty
conferencing (including content sharing), to low-end enterprise
phones, to high end executive admin phones, to contact centers
endpoints, and beyond. Each of those requires more and more SIP
extensions in order to function. This has resulted in a growing
number of specifications, with diminishing returns of
interoperability and feature velocity. As an example, the BLISS
working group in IETF was formed to tackle some basic business phone
features - including line sharing, park, call queuing, and automated
call handling. Each of these individual features requires one or
more specifications, and needs to be designed to meet the needs of
all of the participants in the process.
There are two important consequences of this. First, the requirement
of standardization acts as a huge deterrent to innovation. Indeed,
in many ways, it is anathema to the very notion of how the web is
supposed to work. In the web model, the provider can define
arbitrary content to render to users, craft arbitrary UI, and define
arbitrary messaging from the browser back to the server, all without
standardization or change to the web browser. Google does not need
to wait for the browsers to implement IMAP in order to provide mail
service. Facebook does not need the browser to have XMPP or SIP to
enable presence and instant messaging. Why is call processing any
different? Why should Skype or any other real-time communications
provider be constrained by standardized application protocols? Each
provider should be able to design and innovate what it needs, and not
be constrained by the functionalities of the application protocols
burned into the browser.
Rosenberg, et al. Expires August 12, 2011 [Page 7]
Internet-Draft RTCWEB-fw February 2011
While it is true that standardization will be required in order to
extend these features between domains, that standardization process
can be the successor - not the predecessor - to successsful
deployment and usage of the feature within a domain. Furthermore,
many features and services do not need to be extended between
domains. Many of the BLISS features are good examples of this.
Inclusion of SIP in the browser for client to server signaling will
also harm interoperability. Unfortunately, SIP interoperability
betweend endpoints and servers has been relatively poor; working only
for basic call setup, teardown, and basic features. Important
concepts like configuration remain poorly standardized and almost
never interoperate. The web has certainly had interoperability
problems, but the nature of those problems is different. In the web,
content providers often need to code differently for different
browsers, but at least they can deliver their application
functionality. On the other hand, with SIP phones, many cases
features simply do not and cannot work, and this cannot be resolved
through software development on behalf of the SIP provider.
Interoperability is improved when there are fewer standards and not
more. Instead of adding SIP and its extensions to the browser,
application providers can use the tools that are already there - HTTP
and websockets, and then define whatever signaling functions they
desire ontop, without interoperability consequences.
Make no mistake - SIP remains important as a glue between service
providers, and between server infrastructure within service provider
networks. However, in a web context, there is simply no need for SIP
support in the browser.
4. The Role of Media Transport
Unlike signaling, media transport does need to be in the browser, for
two important reasons:
1. It operates in real-time and does not fit well with the
programming model of Javascript
2. It needs to flow between endpoints directly - over UDP - in order
to achieve low latency, and therefore requires standardization in
order to interoperate with other providers or endpoints
The second point is important. Unlike most other web protocols,
real-time media needs to be sent from the browser client to
recipients other than the origin server or domain from which the web
content came from. This is essential for ensuring low latency
operations - one of the key metrics of quality in Voice over IP
Rosenberg, et al. Expires August 12, 2011 [Page 8]
Internet-Draft RTCWEB-fw February 2011
systems. In some cases, the recipient will be another browser
endpoint from the same provider. However, it could be a desktop
client or mobile client from the same provider, or as shown in
Figure 2, it could be a browser endpoint or desktop endpoint from
another service provider. In all cases, a direct connection - indeed
a direct UDP connection - is important whenever possible.
From a security perspective however, the browser cannot just have an
API that tells it to send arbitrary UDP datagrams or even
standardized-format voice (or worse - video) media packets to an
arbitary IP address. The former introduces the opportunity for
malicious JavaScript to craft packets that mimick other application
protocols and send them to arbitrary endpoints (for example, an
enterprise SNMP server). The latter would introduce a substantial
opportunity for denial-of-service attacks. Malicious Javascript
could tell the browser to "spam" an unwitting recipient with high
bandwidth video. In the voice literature, this is referred to as the
voice hammer attack [RFC5245]. In existing voice systems, this
attack is possible but not likely due to the closed nature of most of
the software and systems. In a web environment, where all it takes
is one line of malicious Javascript, the attack becomes almost a
certainty.
To avoid this attack, a simple handshake can be utilized. The
browser should support a simple STUN-based [RFC5389] connection
handshake. The exchange of the STUN transaction ID prior to
transmission of media prevents the attack.
5. Benefits of the Media Component Model
There are several important benefits of the media component model
proposed here.
5.1. Enabling Innovation
One of the reasons why the Web has been successful as a user
interface platform is the short turn-around time to deploy new
versions of web-based services. Often, these new versions are
experiments that vary small details which are important to make the
service successful. It is the fine granularity of user interface
elements in HTML and related technologies that allow this
experimentation with details. As there is no agreed-upon
configuration of real-time audio/video communication technologies
that always delivers the best result, we think that it is essential
to give the application developers the same benefit of short turn-
around time and ability to experiment with details. Therefore, the
real-time communication primitives offered by user agents to web
Rosenberg, et al. Expires August 12, 2011 [Page 9]
Internet-Draft RTCWEB-fw February 2011
applications/services should be fine-grained enough to allow for
enhanced configurations and possibly new scenarios. Also, these
interfaces to the primitives should allow gathering real-world data
in enough detail on how the primitives are operating, to enable the
feedback loop of deploy-measure-reconfigure-redeploy.
One of the areas where perhaps the most innovation can be expected is
signaling - one only needs to look at the plethora of standards
around SIP. Proposing user-agent vendors to implement all these
standards is a sure way to make the common denominator across user
agents marginal. Instead, the browser already has a programmability
model (JavaScript) that can handle all these use cases, and more,
provided the programming environment has access to the underlying
media components as we propose here. Drawing again parallels from
user interface development, there is an undecided problem of what
should be executed by the user agent, and what by the web servers
(e.g. validation). Similar gray boundary between the client and the
server exists in the field of real-time communications. Therefore we
propose to leave standardization of signaling out of scope for this
activity, and let the web service providers define signaling as they
see fit.
5.2. The Importance of Flexibility
There are obviously tradeoffs between built-in functionality and
programmability. It is often tempting to provide the web page author
with a simple and relatively inflexible way of expressing their
intent so as to minimize the page author's effort and accelerate
adoption. As an example, the "<blink>" tag was adopted much more
rapidly than it would have been if blinking text could have only been
implemented by writing a JavaScript timer task to manipulate the DOM
style objects.
On the other hand, such built-in functionality comes at two important
costs. First, each browser implementation must implement the
functionality, and the more which is moved from JavaScript to
built-in functionality the more code must be present for that
implemention. Second, and more important, the page author is now
restricted to the subset of functionality which is provided by these
browser implementations.
The "<video>" tag as it currently stands is an excellent example.
While it does make it possible for a page author to embed video
playback within a page without relying on external plug-ins (and
without knowing much more than the URL of the video they wish to
play), it also leaves the implementation of advanced functionality -
such as adaptive multi-bitrate streaming - in the hands of the
browser developer, not the page author. Unless all vendors agree on
Rosenberg, et al. Expires August 12, 2011 [Page 10]
Internet-Draft RTCWEB-fw February 2011
a standard for transmission of such videos (including things like the
file format for manifest information), this advanced functionality
will be not available across the browser landscape. Most
importantly, the logic - the actual decisions about when to switch
rates and why - becomes buried deep inside the browser, hard or
potentially impossible to adjust for various circumstances.
An alternative approach for adaptive multi-bitrate video streaming
was recently adopted by the Flash Player. The video object simply
has an API for receiving bits to be played back. The script engine
(and thus the script author, usually through the use of a pre-
existing library) becomes responsible for determining which bits to
download and which bits to pass to the video object. This enables
adaptive multi-bitrate HTTP streaming video, but it also enables any
number of other uses, many of which were not even contemplated by the
providers of that API. It also means that upgrades to this logic
come in the form of new script libraries, and not in the form of an
upgrade to the Flash Player itself.
We advocate a similar approach here whenever it is possible. With
the exception of the passing of real-time data to and from the media
components (which we believe must communicate directly in order to
meet real-time latency constraints) we advocate placing all of the
logic outside of the browser itself and instead into the hands of the
page author through JavaScript APIs. These APIs may be more complex
to use for some cases, but they minimize the implementation effort on
the part of the browser vendor and can provide functionality that has
not yet been contemplated.
An example of this might be the peer-to-peer NAT traversal problem.
Rather than having an API for "browser, please use ICE [RFC5245] to
open a connection to another peer" we would instead have APIs like
"browser, please send an ICE-compatible STUN [RFC5389] probe to the
following candidate address". This allows the actual logic, the
sequencing, the choice of what to implement at the client and what to
offload to the server, to be in the hands of the JavaScript
developer. We expect that libraries to implement common
functionality (such as ICE, which could be built ontop of this) will
become readily and freely available, and so in short order the extra
work required for a page author to work with these lower level APIs
becomes insignificant.
6. Interoperability with Existing VoIP Gear
In order for Browser-based Real-Time Communication to be successful,
it is essential to ensure a good level of interoperability with
existing VoIP gear. This means that a strong baseline for
Rosenberg, et al. Expires August 12, 2011 [Page 11]
Internet-Draft RTCWEB-fw February 2011
interoperability of end-to-end media needs to be defined.
The amount of VoIP gear currently deployed is very substantial for
both VoIP Service Providers and Enterprise IP Telephony. In both
cases, media is transported on RTP/RTCP [RFC3550] using codecs such
as G.711 and G.729. Signaling for call control uses SIP [RFC3261],
H.323, H.248/Megaco, and a wide range of proprietary protocols.
Inter-domain, the signaling protocol is mostly SIP.
Interoperability at the signaling level can be handled by gateways,
and is outside the scope of this paper. Media interoperability
however needs to be addressed. It is not acceptable to rely on
servers to convert media from one transport (and codec) to another
because it introduces significant challenges. First, it requires a
large number of servers to do the actual transcoding, which increases
cost. Second, it affects the routing of media by adding an
additional leg to the transport, which increases end-to-end delay,
and therefore decreases voice quality. If there is codec
translation, it decreases voice quality even further. And finally,
it can potentially complicate end-to-end security.
Interoperability means working with reality, and not just standards.
As such, it is important that browsers support basic RTP transport
for voice and support the G.711 codec. Furthermore, they should
interoperate with network-based session border controllers, which are
the most commonly deployed technique for NAT traversal in existing
networks. They should also support security, based on SRTP
[RFC3711].
7. Informative References
[RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing,
"Session Traversal Utilities for NAT (STUN)", RFC 5389,
October 2008.
[I-D.ietf-codec-opus]
Valin, J. and K. Vos, "Definition of the Opus Audio
Codec", draft-ietf-codec-opus-02 (work in progress),
February 2011.
[RFC5245] Rosenberg, J., "Interactive Connectivity Establishment
(ICE): A Protocol for Network Address Translator (NAT)
Traversal for Offer/Answer Protocols", RFC 5245,
April 2010.
[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.
Jacobson, "RTP: A Transport Protocol for Real-Time
Rosenberg, et al. Expires August 12, 2011 [Page 12]
Internet-Draft RTCWEB-fw February 2011
Applications", STD 64, RFC 3550, July 2003.
[RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
A., Peterson, J., Sparks, R., Handley, M., and E.
Schooler, "SIP: Session Initiation Protocol", RFC 3261,
June 2002.
[RFC3711] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
Norrman, "The Secure Real-time Transport Protocol (SRTP)",
RFC 3711, March 2004.
Authors' Addresses
Jonathan Rosenberg
Skype
Monmouth, NJ
US
Email: jdrosen@skype.net
URI: http://www.jdrosen.net
Matthew Kaufman
Skype
Palo Alto, CA
US
Email: matthew.kaufman@skype.net
Magnus Hiie
Skype
Palo Alto, CA
US
Email: magnus.hiie@skype.net
Francois Audet
Skype
Palo Alto, CA
US
Email: francois.audet@skype.net
Rosenberg, et al. Expires August 12, 2011 [Page 13]
Html markup produced by rfcmarkup 1.129d, available from
https://tools.ietf.org/tools/rfcmarkup/