Network Working Group                                        A. Romanow
Request for Comments: 4297                                        Cisco
Category: Informational                                        J. Mogul
                                                                      HP
                                                               T. Talpey
                                                                  NetApp
                                                               S. Bailey
                                                               Sandburst
                                                           December 2005

      Remote Direct Memory Access (RDMA) over IP Problem Statement

Status of This Memo

This memo provides information for the Internet community.  It does
not specify an Internet standard of any kind.  Distribution of this
memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2005).

Abstract

Overhead due to the movement of user data in the end-system network
I/O processing path at high speeds is significant, and has limited
the use of Internet protocols in interconnection networks, and the
Internet itself -- especially where high bandwidth, low latency,
and/or low overhead are required by the hosted application.

This document examines this overhead, and addresses an architectural,
IP-based "copy avoidance" solution for its elimination, by enabling
Remote Direct Memory Access (RDMA).

Table of Contents

1. Introduction
2. The High Cost of Data Movement Operations in Network I/O
   2.1. Copy Avoidance Improves Processing Overhead
3. Memory Bandwidth Is the Root Cause of the Problem
4. High Copy Overhead Is Problematic for Many Key Internet Applications
5. Copy Avoidance Techniques
   5.1. A Conceptual Framework: DDP and RDMA
6. Conclusions
7. Security Considerations
8. Terminology
9. Acknowledgements
10. Informative References

1. Introduction

This document considers the problem of high host processing overhead
associated with the movement of user data to and from the network
interface under high speed conditions.  This problem is often
referred to as the "I/O bottleneck" [CT90].  More specifically, the
source of high overhead that is of interest here is data movement
operations, i.e., copying.  The throughput of a system may therefore
be limited by the overhead of this copying.  This issue is not to be
confused with TCP offload, which is not addressed here.  High speed
refers to conditions where the network link speed is high, relative
to the bandwidths of the host CPU and memory.  With today's computer
systems, one Gigabit per second (Gbits/s) and over is considered high
speed.

High costs associated with copying are an issue primarily for large
scale systems.  Although smaller systems such as rack-mounted PCs and
small workstations would benefit from a reduction in copying
overhead, the benefit to smaller machines will be primarily in the
next few years as they scale the amount of bandwidth they handle.
Today, it is large system machines with high bandwidth feeds, usually
multiprocessors and clusters, that are adversely affected by copying
overhead.  Examples of such machines include all varieties of
servers: database servers, storage servers, application servers for
transaction processing, for e-commerce, and web serving, content
distribution, video distribution, backups, data mining and decision
support, and scientific computing.

Note that such servers almost exclusively service many concurrent
sessions (transport connections), which, in aggregate, are
responsible for > 1 Gbits/s of communication.  Nonetheless, the cost
of copying overhead for a particular load is the same whether from
few or many sessions.

The I/O bottleneck, and the role of data movement operations, have
been widely studied in research and industry over the last
approximately 14 years, and we draw freely on these results.
Historically, the I/O bottleneck has received attention whenever new
networking technology has substantially increased line rates: 100
Megabit per second (Mbits/s) Fast Ethernet and Fibre Distributed Data
Interface [FDDI], 155 Mbits/s Asynchronous Transfer Mode [ATM], and
1 Gbits/s Ethernet.  In earlier speed transitions, the availability
of memory bandwidth allowed the I/O bottleneck issue to be deferred.
Now, however, this is no longer the case.  While the I/O problem is
significant at 1 Gbits/s, it is the introduction of 10 Gbits/s
Ethernet that is motivating an upsurge of activity in industry and
research [IB, VI, CGY01, Ma02, MAF+02].

Because of high overhead of end-host processing in current
implementations, the TCP/IP protocol stack is not used for high speed
transfer.  Instead, special purpose network fabrics, using a
technology generally known as Remote Direct Memory Access (RDMA),
have been developed and are widely used.  RDMA is a set of mechanisms
that allow the network adapter, under control of the application, to
steer data directly into and out of application buffers.  Examples of
such interconnection fabrics include Fibre Channel [FIBRE] for block
storage transfer, Virtual Interface Architecture [VI] for database
clusters, and Infiniband [IB], Compaq Servernet [SRVNET], and
Quadrics [QUAD] for System Area Networks.  These link level
technologies limit application scaling in both distance and size,
meaning that the number of nodes cannot be arbitrarily large.

This problem statement substantiates the claim that in network I/O
processing, high overhead results from data movement operations,
specifically copying; and that copy avoidance significantly decreases
this processing overhead.  It describes when and why the high
processing overheads occur, explains why the overhead is problematic,
and points out which applications are most affected.

The document goes on to discuss why the problem is relevant to the
Internet and to Internet-based applications.  Applications that
store, manage, and distribute the information of the Internet are
well suited to applying the copy avoidance solution.  They will
benefit by avoiding high processing overheads, which removes limits
to the available scaling of tiered end-systems.  Copy avoidance also
eliminates latency for these systems, which can further benefit
effective distributed processing.

In addition, this document introduces an architectural approach to
solving the problem, which is developed in detail in [BT05].  It also
discusses how the proposed technology may introduce security concerns
and how they should be addressed.

Finally, this document includes a Terminology section to aid as a
reference for several new terms introduced by RDMA.

2. The High Cost of Data Movement Operations in Network I/O

A wealth of data from research and industry shows that copying is
responsible for substantial amounts of processing overhead.  It
further shows that even in carefully implemented systems, eliminating
copies significantly reduces the overhead, as referenced below.

Clark et al. [CJRS89] in 1989 shows that TCP [Po81] overhead
processing is attributable to both operating system costs (such as
interrupts, context switches, process management, buffer management,
timer management) and the costs associated with processing individual
bytes (specifically, computing the checksum and moving data in
memory).  They found that moving data in memory is the more important
of the costs, and their experiments show that memory bandwidth is the
greatest source of limitation.  In the data presented [CJRS89], 64%
of the measured microsecond overhead was attributable to data
touching operations, and 48% was accounted for by copying.  The
system measured Berkeley TCP on a Sun-3/60 using 1460 Byte Ethernet
packets.

In a well-implemented system, copying can occur between the network
interface and the kernel, and between the kernel and application
buffers; there are two copies, each of which incurs two memory bus
crossings, one for the read and one for the write.  Although in
certain circumstances it is possible to do better, usually two copies
are required on receive.

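The sketch below illustrates where these copies occur on receive.  It
is a schematic model only: the buffer names and the fixed packet size
are illustrative and do not correspond to any particular operating
system's code.  Each memcpy() stands for one copy, and therefore one
read plus one write across the memory bus.

   /* Schematic model of a conventional 2-copy receive path.
    * Each memcpy() is one copy: a read plus a write, i.e., two
    * memory bus crossings. */
   #include <string.h>

   #define PKT 1460                /* one Ethernet-sized payload  */

   static char nic_dma_area[PKT];  /* where the adapter deposits
                                      the incoming frame          */
   static char kernel_buf[PKT];    /* kernel network buffer       */
   static char app_buf[PKT];       /* application receive buffer  */

   void receive_one_packet(void)
   {
       /* Copy 1: network interface buffer to kernel buffer. */
       memcpy(kernel_buf, nic_dma_area, PKT);

       /* Copy 2: kernel buffer to application buffer. */
       memcpy(app_buf, kernel_buf, PKT);

       /* Four bus crossings of CPU copying per packet, in
        * addition to the adapter's initial deposit of the data. */
   }
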
Subsequent work has consistently shown the same phenomenon as the
earlier Clark study.  A number of studies report results that
data-touching operations, checksumming and data movement, dominate
the processing costs for messages longer than 128 Bytes [BS96, CGY01,
Ch96, CJRS89, DAPP93, KP96].  For smaller sized messages, per-packet
overheads dominate [KP96, CGY01].

The percentage of overhead due to data-touching operations increases
with packet size, since time spent on per-byte operations scales
linearly with message size [KP96].  For example, Chu [Ch96] reported
substantial per-byte latency costs as a percentage of total
networking software costs for an MTU size packet on a SPARCstation/20
running memory-to-memory TCP tests over networks with 3 different MTU
sizes.  The percentages of total software costs attributable to
per-byte operations were:

      1500 Byte Ethernet   18-25%
      4352 Byte FDDI       35-50%
      9180 Byte ATM        55-65%

Although many studies report results for data-touching operations,
including checksumming and data movement together, much work has
focused just on copying [BS96, Br99, Ch96, TK95].  For example,
[KP96] reports results that separate processing times for checksum
from data movement operations.  For the 1500 Byte Ethernet size, 20%
of total processing overhead time is attributable to copying.  The
study used 2 DECstations 5000/200 connected by an FDDI network.  (In
this study, checksum accounts for 30% of the processing time.)

2.1. Copy Avoidance Improves Processing Overhead

A number of studies show that eliminating copies substantially
reduces overhead.  For example, results from copy-avoidance in the
IO-Lite system [PDZ99], which aimed at improving web server
performance, show a throughput increase of 43% over an optimized web
server, and 137% improvement over an Apache server.  The system was
implemented in a 4.4BSD-derived UNIX kernel, and the experiments used
a server system based on a 333MHz Pentium II PC connected to a
switched 100 Mbits/s Fast Ethernet.

There are many other examples where elimination of copying using a
variety of different approaches showed significant improvement in
system performance [CFF+94, DP93, EBBV95, KSZ95, TK95, Wa97].  We
will discuss the results of one of these studies in detail in order
to clarify the significant degree of improvement produced by copy
avoidance [Ch02].

Recent work by Chase et al. [CGY01], measuring CPU utilization, shows
that avoiding copies reduces CPU time spent on data access from 24%
to 15% at 370 Mbits/s for a 32 KBytes MTU using an AlphaStation
XP1000 and a Myrinet adapter [BCF+95].  This is an absolute
improvement of 9% due to copy avoidance.

The total CPU utilization was 35%, with data access accounting for
24%.  Thus, the relative importance of reducing copies is 26%.  At
370 Mbits/s, the system is not very heavily loaded.  The relative
improvement in achievable bandwidth is 34%.  This is the improvement
we would see if copy avoidance were added when the machine was
saturated by network I/O.

Note that improvement from the optimization becomes more important if
the overhead it targets is a larger share of the total cost.  This is
what happens if other sources of overhead, such as checksumming, are
eliminated.  In [CGY01], after removing checksum overhead, copy
avoidance reduces CPU utilization from 26% to 10%.  This is a 16%
absolute reduction, a 61% relative reduction, and a 160% relative
improvement in achievable bandwidth.

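The arithmetic relating these figures can be reconstructed as shown
below.  This is our reading of the numbers quoted above from [CGY01],
not a calculation taken from that paper, and any small differences
from the quoted 34% and 61% are rounding effects.

   /* Reconstruction of the arithmetic behind the [CGY01] figures
    * quoted above (our interpretation; the paper reports only the
    * resulting percentages). */
   #include <stdio.h>

   int main(void)
   {
       /* With checksumming: 35% total CPU utilization, of which
        * data access falls from 24% to 15% when copies are
        * avoided. */
       double total      = 0.35, before = 0.24, after = 0.15;
       double absolute   = before - after;          /* 0.09,   9% */
       double importance = absolute / total;        /* 0.257, 26% */
       double bw_gain    = total / (total - absolute) - 1.0;
                                                    /* 0.346, 34% */

       /* Without checksum overhead: total utilization drops from
        * 26% to 10% when copies are avoided. */
       double abs2 = 0.26 - 0.10;                   /* 0.16,  16% */
       double rel2 = abs2 / 0.26;                   /* 0.615, 61% */
       double bw2  = 0.26 / 0.10 - 1.0;             /* 1.60, 160% */

       printf("importance %.1f%%, bandwidth gain %.1f%%\n",
              100 * importance, 100 * bw_gain);
       printf("reduction %.1f%% (%.1f%% absolute), "
              "bandwidth gain %.1f%%\n",
              100 * rel2, 100 * abs2, 100 * bw2);
       return 0;
   }
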
In fact, today's network interface hardware commonly offloads the
checksum, which removes the other source of per-byte overhead.  Such
interfaces also coalesce interrupts to reduce per-packet costs.
Thus, today copying costs account for a relatively larger part of CPU
utilization than previously, and therefore relatively more benefit is
to be gained in reducing them.  (Of course, this argument would be
specious if the amount of overhead were insignificant, but it has
been shown to be substantial [BS96, Br99, Ch96, KP96, TK95].)

3. Memory Bandwidth Is the Root Cause of the Problem

Data movement operations are expensive because memory bandwidth is
scarce relative to network bandwidth and CPU bandwidth [PAC+97].
This trend existed in the past and is expected to continue into the
future [HP97, STREAM], especially in large multiprocessor systems.

With copies crossing the bus twice per copy, network processing
overhead is high whenever network bandwidth is large in comparison to
CPU and memory bandwidths.  Generally, with today's end-systems, the
effects are observable at network speeds over 1 Gbits/s.  In fact,
with multiple bus crossings it is possible to see the bus bandwidth
being the limiting factor for throughput.  This prevents such an
end-system from simultaneously achieving full network bandwidth and
full application performance.

A common question is whether an increase in CPU processing power
alleviates the problem of high processing costs of network I/O.  The
answer is no; it is the memory bandwidth that is the issue.  Faster
CPUs do not help if the CPU spends most of its time waiting for
memory [CGY01].

The widening gap between microprocessor performance and memory
performance has long been a widely recognized and well-understood
problem [PAC+97].  Hennessy [HP97] shows microprocessor performance
grew from 1980-1998 at 60% per year, while the access time to DRAM
improved at 10% per year, giving rise to an increasing
"processor-memory performance gap".

Another source of relevant data is the STREAM Benchmark Reference
Information website, which provides information on the STREAM
benchmark [STREAM].  The benchmark is a simple synthetic benchmark
program that measures sustainable memory bandwidth (in MBytes/s) and
the corresponding computation rate for simple vector kernels measured
in MFLOPS.  The website tracks information on sustainable memory
bandwidth for hundreds of machines and all major vendors.

Results show measured system performance statistics.  Processing
performance from 1985-2001 increased at 50% per year on average, and
sustainable memory bandwidth from 1975 to 2001 increased at 35% per
year, on average, over all the systems measured.  A similar 15% per
year lead of processing bandwidth over memory bandwidth shows up in
another statistic, machine balance [Mc95], a measure of the relative
rate of CPU to memory bandwidth (FLOPS/cycle) / (sustained memory
ops/cycle) [STREAM].

Network bandwidth has been increasing about 10-fold roughly every 8
years, a compound growth rate of roughly 33% per year.

A typical example illustrates that the memory bandwidth compares
unfavorably with link speed.  The STREAM benchmark shows that a
modern uniprocessor PC, for example the 1.2 GHz Athlon in 2001, will
move the data 3 times in doing a receive operation: once for the
network interface to deposit the data in memory, and twice for the
CPU to copy the data.  With 1 GBytes/s of memory bandwidth, meaning
one read or one write, the machine could handle approximately 2.67
Gbits/s of network bandwidth, one third the copy bandwidth.  But this
assumes 100% utilization, which is not possible, and more importantly
the machine would be totally consumed!  (A rule of thumb for
databases is that 20% of the machine should be required to service
I/O, leaving 80% for the database application.  And, the less, the
better.)

In 2001, 1 Gbits/s links were common.  An application server may
typically have two 1 Gbits/s connections: one connection backend to a
storage server and one front-end, say for serving HTTP [FGM+99].
Thus, the communications could use 2 Gbits/s.  In our typical
example, the machine could handle 2.7 Gbits/s at its theoretical
maximum while doing nothing else.  This means that the machine
basically could not keep up with the communication demands in 2001;
with the relative growth trends, the situation only gets worse.

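A back-of-the-envelope version of this example is sketched below.
The inputs are the figures quoted above; the calculation simply
divides the available memory bandwidth by the number of memory
transits per received byte.

   /* Back-of-the-envelope calculation for the example above: with
    * three memory transits per received byte and 1 GByte/s of
    * memory bandwidth, the sustainable network rate is about
    * 2.67 Gbit/s. */
   #include <stdio.h>

   int main(void)
   {
       double mem_bw = 1e9;   /* 1 GByte/s: one read or one write */
       double moves  = 3.0;   /* NIC deposit + copy read + write  */

       double net_bytes = mem_bw / moves;      /* ~333 MByte/s */
       double net_bits  = net_bytes * 8.0;     /* ~2.67 Gbit/s */

       printf("sustainable network rate: %.2f Gbit/s\n",
              net_bits / 1e9);

       /* Against the 2 Gbit/s of demand from two 1 Gbit/s links,
        * the machine is already near its theoretical ceiling, and
        * this assumes the memory system does nothing else. */
       return 0;
   }
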
4. High Copy Overhead Is Problematic for Many Key Internet
   Applications

If a significant portion of resources on an application machine is
consumed in network I/O rather than in application processing, it
becomes difficult for the application to scale, i.e., to handle more
clients or to offer more services.

Several years ago the most affected applications were streaming
multimedia, parallel file systems, and supercomputing on clusters
[BS96].  In addition, today the applications that suffer from copying
overhead are more central in Internet computing -- they store,
manage, and distribute the information of the Internet and the
enterprise.  They include database applications doing transaction
processing, e-commerce, web serving, decision support, content
distribution, video distribution, and backups.  Clusters are
typically used for this category of application, since they have
advantages of availability and scalability.

Today, these applications, which provide and manage Internet and
corporate information, are typically run in data centers that are
organized into three logical tiers.  One tier is typically a set of
web servers connecting to the WAN.  The second tier is a set of
application servers that run the specific applications, usually on
more powerful machines, and the third tier is backend databases.
Physically, the first two tiers -- web server and application server
-- are usually combined [Pi01].  For example, an e-commerce server
communicates with a database server and with a customer site, or a
content distribution server connects to a server farm, or an OLTP
server connects to a database and a customer site.

When network I/O uses too much memory bandwidth, performance on
network paths between tiers can suffer.  (There might also be
performance issues on Storage Area Network paths used either by the
database tier or the application tier.)  The high overhead from
network-related memory copies diverts system resources from other
application processing.  It also can create bottlenecks that limit
total system performance.

There is high motivation to maximize the processing capacity of each
CPU because scaling by adding CPUs, one way or another, has
drawbacks.  For example, adding CPUs to a multiprocessor will not
necessarily help because a multiprocessor improves performance only
when the memory bus has additional bandwidth to spare.  Clustering
can add additional complexity to handling the applications.

In order to scale a cluster or multiprocessor system, one must
proportionately scale the interconnect bandwidth.  Interconnect
bandwidth governs the performance of communication-intensive parallel
applications; if this (often expressed in terms of "bisection
bandwidth") is too low, adding additional processors cannot improve
system throughput.  Interconnect latency can also limit the
performance of applications that frequently share data between
processors.

So, excessive overheads on network paths in a "scalable" system can
both require the use of more processors than optimal and reduce the
marginal utility of those additional processors.

Copy avoidance scales a machine upwards by removing at least
two-thirds of the bus bandwidth load from the "very best" 1-copy (on
receive) implementations, and removes at least 80% of the bandwidth
overhead from the 2-copy implementations.

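The crossing counts behind the "two-thirds" and "80%" figures can be
made explicit, as sketched below.  The accounting is the one used in
Sections 2 and 3: one bus crossing for the adapter's deposit of the
data, plus a read and a write for each CPU copy.

   /* Bus-crossing accounting behind the figures quoted above. */
   #include <stdio.h>

   int main(void)
   {
       double one_copy = 3.0;  /* deposit + one copy (read+write) */
       double two_copy = 5.0;  /* deposit + two copies            */
       double direct   = 1.0;  /* adapter deposits directly into
                                  the application buffer          */

       printf("removed from 1-copy path: %.0f%%\n",
              100.0 * (one_copy - direct) / one_copy);   /* ~67% */
       printf("removed from 2-copy path: %.0f%%\n",
              100.0 * (two_copy - direct) / two_copy);   /*  80% */
       return 0;
   }
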
The removal of bus bandwidth requirements, in turn, removes
bottlenecks from the network processing path and increases the
throughput of the machine.  On a machine with limited bus bandwidth,
the advantage of removing this load is immediately evident, as the
host can attain full network bandwidth.  Even on a machine with bus
bandwidth adequate to sustain full network bandwidth, removal of bus
bandwidth load serves to increase the availability of the machine for
the processing of user applications, in some cases dramatically.

An example showing poor performance with copies and improved scaling
with copy avoidance is illustrative.  The IO-Lite work [PDZ99] shows
higher server throughput servicing more clients using a zero-copy
system.  In an experiment designed to mimic real world web conditions
by simulating the effect of TCP WAN connections on the server, the
performance of 3 servers was compared.  One server was Apache,
another was an optimized server called Flash, and the third was the
Flash server running IO-Lite, called Flash-Lite with zero copy.  The
measurement was of throughput in requests/second as a function of the
number of slow background clients that could be served.  As the table
shows, Flash-Lite has better throughput, especially as the number of
clients increases.

                    Throughput (reqs/s)

   #Clients      Apache      Flash      Flash-Lite
   --------      ------      -----      ----------
          0         520        610             890
         16         390        490             890
         32         360        490             850
         64         360        490             890
        128         310        450             880
        256         310        440             820

Traditional Web servers (which mostly send data and can keep most of
their content in the file cache) are not the worst case for copy
overhead.  Web proxies (which often receive as much data as they
send) and complex Web servers based on System Area Networks or
multi-tier systems will suffer more from copy overheads than in the
example above.

5. Copy Avoidance Techniques

There has been extensive research investigation and industry
experience with two main alternative approaches to eliminating data
movement overhead, often along with improving other Operating System
processing costs.  In one approach, hardware and/or software changes
within a single host reduce processing costs.  In another approach,
memory-to-memory networking [MAF+02], the exchange of explicit data
placement information between hosts allows them to reduce processing
costs.

The single host approaches range from new hardware and software
architectures [KSZ95, Wa97, DWB+93] to new or modified software
systems [BS96, Ch96, TK95, DP93, PDZ99].  In the approach based on
using a networking protocol to exchange information, the network
adapter, under control of the application, places data directly into
and out of application buffers, reducing the need for data movement.
Commonly this approach is called RDMA, Remote Direct Memory Access.

As discussed below, research and industry experience have shown that
copy avoidance techniques within the receiver processing path alone
are problematic.  The research special-purpose host adapter systems
had good performance and can be seen as precursors of the commercial
RDMA-based adapters [KSZ95, DWB+93].  In software, many
implementations have successfully achieved zero-copy transmit, but
few have accomplished zero-copy receive.  Those that have done so
impose strict alignment and no-touch requirements on the application,
greatly reducing the portability and usefulness of the
implementation.

In contrast, experience with memory-to-memory systems that permit
RDMA has been satisfactory; performance has been good and there have
not been system or networking difficulties.  RDMA is a single
solution.  Once implemented, it can be used with any OS and machine
architecture, and it does not need to be revised when either of these
changes.

In early work, one goal of the software approaches was to show that
TCP could go faster with appropriate OS support [CJRS89, CFF+94].
While this goal was achieved, further investigation and experience
showed that, though it is possible to craft software solutions,
specific system optimizations have been complex, fragile,
interdependent with other system parameters in complex ways, and
often of only marginal improvement [CFF+94, CGY01, Ch96, DAPP93,
KSZ95, PDZ99].  The network I/O system interacts with other aspects
of the Operating System, such as machine architecture, file I/O, and
disk I/O [Br99, Ch96, DP93].

For example, the Solaris Zero-Copy TCP work [Ch96], which relies on
page remapping, shows that the results are highly interdependent with
other systems, such as the file system, and that the particular
optimizations are specific for particular architectures, meaning that
for each variation in architecture, optimizations must be re-crafted
[Ch96].

With RDMA, application I/O buffers are mapped directly, and the
authorized peer may access them without incurring additional
processing overhead.  When RDMA is implemented in hardware, arbitrary
data movement can be performed without involving the host CPU at all.

A number of research projects and industry products have been based
on the memory-to-memory approach to copy avoidance.  These include
U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], Infiniband [IB],
and Winsock Direct [Pi01].  Several memory-to-memory systems have
been widely used and have generally been found to be robust, to have
good performance, and to be relatively simple to implement.  These
include VI [VI], Myrinet [BCF+95], Quadrics [QUAD], and Compaq/Tandem
Servernet [SRVNET].  Networks based on these memory-to-memory
architectures have been used widely in scientific applications and in
data centers for block storage, file system access, and transaction
processing.

By exporting direct memory access "across the wire", applications may
direct the network stack to manage all data directly from application
buffers.  A large and growing class of applications that take
advantage of such capabilities has already emerged.  It includes all
the major databases, as well as network protocols such as Sockets
Direct [SDP].

5.1. A Conceptual Framework: DDP and RDMA

An RDMA solution can be usefully viewed as being composed of two
distinct components: "direct data placement (DDP)" and "remote direct
memory access (RDMA) semantics".  They are distinct in purpose and
also in practice -- they may be implemented as separate protocols.

The more fundamental of the two is the direct data placement
facility.  This is the means by which memory is exposed to the remote
peer in an appropriate fashion, and the means by which the peer may
access it, for instance, reading and writing.

The RDMA control functions are semantically layered atop direct data
placement.  Included are operations that provide "control" features,
such as connection and termination, and the ordering of operations
and signaling their completions.  A "send" facility is provided.

While the functions (and potentially protocols) are distinct,
historically both aspects taken together have been referred to as
"RDMA".  The facilities of direct data placement are useful in and of
themselves, and may be employed by other upper layer protocols to
facilitate data transfer.  Therefore, it is often useful to refer to
DDP as the data placement functionality and RDMA as the control
aspect.

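A minimal interface sketch may help make this division of labor
concrete.  The names and signatures below are hypothetical
illustrations of the conceptual split only; they are not taken from
[BT05] or from any existing RDMA API.

   /* Hypothetical C interface illustrating the DDP/RDMA split.
    * All names are invented for illustration. */
   #include <stddef.h>

   typedef unsigned long ddp_stag_t;  /* steering tag naming an
                                         exposed memory region    */

   /* DDP: expose (and later withdraw) local memory so that data
    * arriving from the peer can be placed into it directly,
    * without intermediate buffering or copying. */
   ddp_stag_t ddp_register(void *buf, size_t len);
   void       ddp_deregister(ddp_stag_t stag);

   /* RDMA semantics, layered above DDP: control operations that
    * use the placement facility, including remote reads and
    * writes, an ordinary "send" facility, and the ordering and
    * signaling of completions. */
   int rdma_write(int conn, ddp_stag_t remote, size_t offset,
                  const void *local, size_t len);
   int rdma_read(int conn, ddp_stag_t remote, size_t offset,
                 void *local, size_t len);
   int rdma_send(int conn, const void *msg, size_t len);
   int rdma_wait_completion(int conn);

In such a split, an upper layer protocol could employ the placement
functions alone, or the full control set above them, matching the
distinction drawn in this section.
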
[BT05] develops an architecture for DDP and RDMA atop the Internet
Protocol Suite, and is a companion document to this problem
statement.

6. Conclusions

This Problem Statement concludes that an IP-based, general solution
for reducing processing overhead in end-hosts is desirable.

It has shown that the high overhead of processing network data leads
to end-host bottlenecks.  These bottlenecks are in large part
attributable to the copying of data.  The bus bandwidth of machines
has historically been limited, and the bandwidth of high-speed
interconnects taxes it heavily.

An architectural solution that alleviates these bottlenecks best
addresses the issue.  Further, the high speed of today's
interconnects and the deployment of these hosts on Internet
Protocol-based networks lead to the desirability of layering such a
solution on the Internet Protocol Suite.  The architecture described
in [BT05] is such a proposal.

7. Security Considerations

Solutions to the problem of reducing copying overhead in high
bandwidth transfers may introduce new security concerns.  Any
proposed solution must be analyzed for security vulnerabilities, and
any such vulnerabilities addressed.  Potential security weaknesses --
due to resource issues that might lead to denial-of-service attacks,
overwrites and other concurrent operations, the ordering of
completions as required by the RDMA protocol, the granularity of
transfer, and any other identified vulnerabilities -- need to be
examined and described, and an adequate resolution to them found.

Layered atop Internet transport protocols, the RDMA protocols will
gain leverage from and must permit integration with Internet security
standards, such as IPsec and TLS [IPSEC, TLS].  However, there may be
implementation ramifications for certain security approaches with
respect to RDMA, due to its copy avoidance.

IPsec, operating to secure the connection on a packet-by-packet
basis, seems to be a natural fit for securing RDMA placement, which
operates in conjunction with transport.  Because RDMA enables an
implementation to avoid buffering, it is preferable to perform all
applicable security protection prior to processing of each segment
by the transport and RDMA layers.  Such a layering enables the most
efficient secure RDMA implementation.

The TLS record protocol, on the other hand, is layered on top of
reliable transports and cannot provide such security assurance until
an entire record is available, which may require the buffering
and/or assembly of several distinct messages prior to TLS
processing.  This defers RDMA processing and introduces overheads
that RDMA is designed to avoid.  Therefore, TLS is viewed as
potentially a less natural fit for protecting the RDMA protocols.
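The practical difference between the two layerings can be sketched
informally as follows.  The decrypt callables and the placement
engine are hypothetical stand-ins for this example, not real IPsec
or TLS interfaces.

   # Hypothetical sketch only: contrasts per-packet protection
   # (IPsec-style) with record-oriented protection (TLS-style).

   def receive_ipsec_style(packets, engine, decrypt_packet):
       # Each segment is verified and decrypted independently, so the
       # transport/DDP layers can place it directly on arrival.
       for pkt in packets:
           segment = decrypt_packet(pkt)
           engine.place(segment)

   class RecordReassembler:
       # Nothing can be handed to DDP until a full record has been
       # buffered and verified; this is the intermediate buffering
       # that RDMA is designed to avoid.
       def __init__(self, record_len, decrypt_record):
           self.record_len = record_len
           self.decrypt_record = decrypt_record
           self.pending = bytearray()

       def receive(self, data, engine):
           self.pending += data
           while len(self.pending) >= self.record_len:
               record = bytes(self.pending[:self.record_len])
               del self.pending[:self.record_len]
               for segment in self.decrypt_record(record):
                   engine.place(segment)
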
It is necessary to guarantee properties such as confidentiality,
integrity, and authentication on an RDMA communications channel.
However, these properties cannot defend against all attacks from
properly authenticated peers, which might be malicious, compromised,
or buggy.  Therefore, the RDMA design must address protection
against such attacks.  For example, an RDMA peer should not be able
to read or write memory regions without prior consent.
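One informal way to model "prior consent" is an explicit
registration step that grants or withholds remote access rights per
region.  The names below (MemoryRegion, ProtectionTable) are
assumptions of this sketch, not part of any RDMA specification or
verbs interface.

   # Hypothetical sketch only: not a real RDDP or verbs API.
   from dataclasses import dataclass

   @dataclass
   class MemoryRegion:
       steering_tag: int
       buffer: bytearray
       remote_read: bool = False
       remote_write: bool = False   # nothing remotely writable by default

   class ProtectionTable:
       def __init__(self):
           self.regions = {}

       def register(self, tag, length,
                    remote_read=False, remote_write=False):
           # Consent is granted only here, by the local application.
           self.regions[tag] = MemoryRegion(tag, bytearray(length),
                                            remote_read, remote_write)

       def check_write(self, tag, offset, length):
           region = self.regions.get(tag)
           if region is None or not region.remote_write:
               raise PermissionError("no consent for remote write")
           if offset + length > len(region.buffer):
               raise PermissionError("access outside registered region")
           return region
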
Further, it must not be possible to evade memory consistency checks
at the recipient.  The RDMA design must allow the recipient to rely
on its consistent memory contents by explicitly controlling peer
access to memory regions at appropriate times.
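Continuing the same hypothetical sketch, explicit control "at
appropriate times" can be modeled as the application invalidating a
region before it relies on the region's contents; the class name is
again invented for illustration.

   # Hypothetical continuation of the ProtectionTable sketch above.
   class RevocableProtectionTable(ProtectionTable):
       def invalidate(self, tag):
           # After invalidation, any later peer access naming this
           # tag fails the check_write() test and is rejected, so the
           # local application can rely on the region's contents.
           self.regions.pop(tag, None)
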
Peer connections that do not pass authentication and authorization
checks by upper layers must not be permitted to begin processing in
RDMA mode with an inappropriate endpoint.  Once associated, peer
accesses to memory regions must be authenticated and made subject to
authorization checks in the context of the association and
connection on which they are to be performed, prior to any transfer
operation or data being accessed.
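A per-access authorization check scoped to the association can be
sketched in the same hypothetical style; AssociationScopedTable and
authorize() are invented names for illustration only.

   # Hypothetical sketch only: region advertisements are scoped to
   # the association on which they were made, so a steering tag
   # presented on a different connection is rejected before any data
   # is touched.
   class AssociationScopedTable:
       def __init__(self):
           self.regions = {}            # (association_id, tag) -> region

       def register(self, association_id, tag, region):
           self.regions[(association_id, tag)] = region

       def authorize(self, association_id, tag):
           region = self.regions.get((association_id, tag))
           if region is None:
               raise PermissionError("tag not valid on this association")
           return region
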
The RDMA protocols must ensure that these region protections remain
under strict application control.  This is particularly important in
the Internet context, where remote access to local memory by a
network peer can be exported globally.

8. Terminology

This section contains general terminology definitions for this
document and for Remote Direct Memory Access in general.

Remote Direct Memory Access (RDMA)
     A method of accessing memory on a remote system in which the
     local system specifies the location of the data to be
     transferred.

RDMA Protocol
     A protocol that supports RDMA Operations to transfer data
     between systems.

Fabric
     The collection of links, switches, and routers that connect a
     set of systems.

Storage Area Network (SAN)
     A network where disks, tapes, and other storage devices are
     made available to one or more end-systems via a fabric.

System Area Network
     A network where clustered systems share services, such as
     storage and interprocess communication, via a fabric.

Fibre Channel (FC)
     An ANSI standard link layer with associated protocols,
     typically used to implement Storage Area Networks.  [FIBRE]

Virtual Interface Architecture (VI, VIA)
     An RDMA interface definition developed by an industry group and
     implemented with a variety of differing wire protocols.  [VI]

Infiniband (IB)
     An RDMA interface, protocol suite and link layer specification
     defined by an industry trade association.  [IB]

9. Acknowledgements

Jeff Chase generously provided many useful insights and information.
Thanks to Jim Pinkerton for many helpful discussions.

10. Informative References

[ATM]     The ATM Forum, "Asynchronous Transfer Mode Physical Layer
          Specification", af-phy-0015.000, etc., available from
          http://www.atmforum.com/standards/approved.html.

[BCF+95]  N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik,
          C. L. Seitz, J. N. Seizovic, and W. Su, "Myrinet - A
          gigabit-per-second local-area network", IEEE Micro,
          February 1995.

[BJM+96]  G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, J.
          Wilkes, "An implementation of the Hamlyn send-managed
          interface architecture", in Proceedings of the Second
          Symposium on Operating Systems Design and Implementation,
          USENIX Assoc., October 1996.

[BLA+94]  M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W.
          Felten, "A virtual memory mapped network interface for the
          SHRIMP multicomputer", in Proceedings of the 21st Annual
          Symposium on Computer Architecture, April 1994,
          pp. 142-153.

[Br99]    J. C. Brustoloni, "Interoperation of copy avoidance in
          network and file I/O", Proceedings of IEEE Infocom, 1999,
          pp. 534-542.

[BS96]    J. C. Brustoloni, P. Steenkiste, "Effects of buffering
          semantics on I/O performance", Proceedings OSDI'96,
          USENIX, Seattle, WA, October 1996, pp. 277-291.

[BT05]    Bailey, S. and T. Talpey, "The Architecture of Direct Data
          Placement (DDP) And Remote Direct Memory Access (RDMA) On
          Internet Protocols", RFC 4296, December 2005.

[CFF+94]  C-H Chang, D. Flower, J. Forecast, H. Gray, B. Hawe, A.
          Nadkarni, K. K. Ramakrishnan, U. Shikarpur, K. Wilde,
          "High-performance TCP/IP and UDP/IP networking in DEC
          OSF/1 for Alpha AXP", Proceedings of the 3rd IEEE
          Symposium on High Performance Distributed Computing,
          August 1994, pp. 36-42.

[CGY01]   J. S. Chase, A. J. Gallatin, and K. G. Yocum, "End system
          optimizations for high-speed TCP", IEEE Communications
          Magazine, Volume 39, Issue 4, April 2001, pp. 68-74,
          http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}.

[Ch96]    H. K. Chu, "Zero-copy TCP in Solaris", Proc. of the USENIX
          1996 Annual Technical Conference, San Diego, CA, January
          1996.

[Ch02]    Jeffrey Chase, Personal communication.

[CJRS89]  D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An
          analysis of TCP processing overhead", IEEE Communications
          Magazine, Volume 27, Issue 6, June 1989, pp. 23-29.

[CT90]    D. D. Clark, D. Tennenhouse, "Architectural considerations
          for a new generation of protocols", Proceedings of the ACM
          SIGCOMM Conference, 1990.

[DAFS]    DAFS Collaborative, "Direct Access File System
          Specification v1.0", September 2001, available from
          http://www.dafscollaborative.org.

[DAPP93]  P. Druschel, M. B. Abbott, M. A. Pagels, L. L. Peterson,
          "Network subsystem design", IEEE Network, July 1993,
          pp. 8-17.

[DP93]    P. Druschel, L. L. Peterson, "Fbufs: a high-bandwidth
          cross-domain transfer facility", Proceedings of the 14th
          ACM Symposium of Operating Systems Principles, December
          1993.

[DWB+93]  C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards,
          J. Lumley, "Afterburner: architectural support for high-
          performance protocols", Technical Report, HP Laboratories
          Bristol, HPL-93-46, July 1993.

[EBBV95]  T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A
          user-level network interface for parallel and distributed
          computing", Proc. of the 15th ACM Symposium on Operating
          Systems Principles, Copper Mountain, Colorado, December
          3-6, 1995.

[FDDI]    International Standards Organization, "Fibre Distributed
          Data Interface", ISO/IEC 9314, committee drafts available
          from http://www.iso.org.

[FGM+99]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
          Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
          Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

[FIBRE]   ANSI Technical Committee T10, "Fibre Channel Protocol
          (FCP)" (and as revised and updated), ANSI X3.269:1996
          [R2001], committee draft available from
          http://www.t10.org/drafts.htm#FibreChannel.

[HP97]    J. L. Hennessy, D. A. Patterson, Computer Organization and
          Design, 2nd Edition, San Francisco: Morgan Kaufmann
          Publishers, 1997.

[IB]      InfiniBand Trade Association, "InfiniBand Architecture
          Specification, Volumes 1 and 2", Release 1.1, November
          2002, available from http://www.infinibandta.org/specs.

[IPSEC]   Kent, S. and R. Atkinson, "Security Architecture for the
          Internet Protocol", RFC 2401, November 1998.

[KP96]    J. Kay, J. Pasquale, "Profiling and reducing processing
          overheads in TCP/IP", IEEE/ACM Transactions on Networking,
          Vol. 4, No. 6, pp. 817-828, December 1996.

[KSZ95]   K. Kleinpaste, P. Steenkiste, B. Zill, "Software support
          for outboard buffering and checksumming", SIGCOMM'95.

[Ma02]    K. Magoutis, "Design and Implementation of a Direct Access
          File System (DAFS) Kernel Server for FreeBSD", in
          Proceedings of USENIX BSDCon 2002 Conference, San
          Francisco, CA, February 11-14, 2002.

[MAF+02]  K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J. S.
          Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E.
          Gabber, "Structure and Performance of the Direct Access
          File System (DAFS)", in Proceedings of the 2002 USENIX
          Annual Technical Conference, Monterey, CA, June 9-14,
          2002.

[Mc95]    J. D. McCalpin, "A Survey of memory bandwidth and machine
          balance in current high performance computers", IEEE TCCA
          Newsletter, December 1995.

[Ne00]    A. Newman, "IDC report paints conflicted picture of server
          market circa 2004", ServerWatch, July 24, 2000,
          http://serverwatch.internet.com/news/2000_07_24_a.html.

[Pa01]    M. Pastore, "Server shipments for 2000 surpass those in
          1999", ServerWatch, February 7, 2001,
          http://serverwatch.internet.com/news/2001_02_07_a.html.

[PAC+97]  D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K.
          Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "A case for
          intelligent RAM: IRAM", IEEE Micro, April 1997.

[PDZ99]   V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified
          I/O buffering and caching system", Proc. of the 3rd
          Symposium on Operating Systems Design and Implementation,
          New Orleans, LA, February 1999.

[Pi01]    J. Pinkerton, "Winsock Direct: The Value of System Area
          Networks", May 2001, available from
          http://www.microsoft.com/windows2000/techinfo/
          howitworks/communications/winsock.asp.

[Po81]    Postel, J., "Transmission Control Protocol", STD 7, RFC
          793, September 1981.

[QUAD]    Quadrics Ltd., Quadrics QSNet product information,
          available from
          http://www.quadrics.com/website/pages/02qsn.html.

[SDP]     InfiniBand Trade Association, "Sockets Direct Protocol
          v1.0", Annex A of InfiniBand Architecture Specification
          Volume 1, Release 1.1, November 2002, available from
          http://www.infinibandta.org/specs.

[SRVNET]  R. Horst, "TNet: A reliable system area network", IEEE
          Micro, pp. 37-45, February 1995.

[STREAM]  J. D. McCalpin, The STREAM Benchmark Reference
          Information, http://www.cs.virginia.edu/stream/.

[TK95]    M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O
          framework for UNIX", Technical Report, SMLI TR-95-39, May
          1995.

[TLS]     Dierks, T. and C. Allen, "The TLS Protocol Version 1.0",
          RFC 2246, January 1999.

[VI]      D. Cameron and G. Regnier, "The Virtual Interface
          Architecture", ISBN 0971288704, Intel Press, April 2002,
          more info at http://www.intel.com/intelpress/via/.

[Wa97]    J. R. Walsh, "DART: Fast application-level networking via
          data-copy avoidance", IEEE Network, July/August 1997,
          pp. 28-38.

Authors' Addresses

Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810 USA

Phone: +1 978 689 1614
EMail: steph@sandburst.com

Jeffrey C. Mogul
HP Labs
Hewlett-Packard Company
1501 Page Mill Road, MS 1117
Palo Alto, CA 94304 USA

Phone: +1 650 857 2206 (EMail preferred)
EMail: JeffMogul@acm.org

Allyn Romanow
Cisco Systems, Inc.
170 W. Tasman Drive
San Jose, CA 95134 USA

Phone: +1 408 525 8836
EMail: allyn@cisco.com

Tom Talpey
Network Appliance
1601 Trapelo Road
Waltham, MA 02451 USA

Phone: +1 781 768 5329
EMail: thomas.talpey@netapp.com

Full Copyright Statement

Copyright (C) The Internet Society (2005).

This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.

This document and the information contained herein are provided on
an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed
to pertain to the implementation or use of the technology described
in this document or the extent to which any license under such
rights might or might not be available; nor does it represent that
it has made any independent effort to identify any such rights.
Information on the procedures with respect to rights in RFC
documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use
of such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository
at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard.  Please address the information to the IETF at
ietf-ipr@ietf.org.

Acknowledgement

Funding for the RFC Editor function is currently provided by the
Internet Society.