Internet-Draft                                           Brent Callaghan
Expires: August 2005                               Sun Microsystems, Inc.
                                                               Tom Talpey
                                                   Network Appliance, Inc.

Document: draft-ietf-nfsv4-rpcrdma-01                      February, 2005

                       RDMA Transport for ONC RPC

Status of this Memo
By submitting this Internet-Draft, I certify that any applicable
patent or other IPR claims of which I am aware have been disclosed,
or will be disclosed, and any of which I become aware will be
disclosed, in accordance with RFC 3668.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time.  It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in
progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt  The list of
Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Copyright Notice

Copyright (C) The Internet Society (2005).  All Rights Reserved.
Abstract

A protocol is described providing RDMA as a new transport for ONC
RPC.  The RDMA transport binding conveys the benefits of efficient,
bulk data transport over high speed networks, while providing for
minimal change to RPC applications and with no required revision of
the application RPC protocol, or the RPC protocol itself.
Table of Contents

   1.    Introduction
   2.    Abstract RDMA Model
   3.    Protocol Outline
   3.1.    Short Messages
   3.2.    Data Chunks
   3.3.    Flow Control
   3.4.    XDR Encoding with Chunks
   3.5.    Padding
   3.6.    XDR Decoding with Read Chunks
   3.7.    XDR Decoding with Write Chunks
   3.8.    RPC Call and Reply
   4.    RPC RDMA Message Layout
   4.1.    RPC over RDMA Header
   4.2.    XDR Language Description
   5.    Large Chunkless Messages
   5.1.    Message as an RDMA Read Chunk
   5.2.    RDMA Write of Long Replies (Reply Chunks)
   5.3.    RPC over RDMA header errors
   6.    Connection Configuration Protocol
   6.1.    Initial Connection State
   6.2.    Protocol Description
   7.    Memory Registration Overhead
   8.    Errors and Error Recovery
   9.    Node Addressing
   10.   RPC Binding
   11.   Security
   12.   IANA Considerations
   13.   Acknowledgements
   14.   Normative References
   15.   Informative References
   16.   Authors' Addresses
   17.   Full Copyright Statement
         Acknowledgement
1. Introduction

RDMA is a technique for efficient movement of data between end
nodes, which becomes increasingly compelling over high speed
transports.  By directing data into destination buffers as it is
sent on a network, and placing it via direct memory access by
hardware, the double benefit of faster transfers and reduced host
overhead is obtained.
ONC RPC [RFC1831] is a remote procedure call protocol that has been
run over a variety of transports.  Most RPC implementations today
use UDP or TCP.  RPC messages are defined in terms of an eXternal
Data Representation (XDR) [RFC1832] which provides a canonical data
representation across a variety of host architectures.  An XDR data
stream is conveyed differently on each type of transport.  On UDP,
RPC messages are encapsulated inside datagrams, while on a TCP byte
stream, RPC messages are delineated by a record marking protocol.
An RDMA transport also conveys RPC messages in a unique fashion
that must be fully described if client and server implementations
are to interoperate.
RDMA transports present new semantics unlike the behaviors of
either UDP or TCP alone.  They retain message delineations like UDP
while also providing a reliable, sequenced data transfer like TCP.
And, they provide the new efficient, bulk transfer service of RDMA.
RDMA transports are therefore naturally viewed as a new transport
type by ONC RPC.
RDMA as a transport will benefit the performance of RPC protocols
that move large "chunks" of data, since RDMA hardware excels at
moving data efficiently between host memory and a high speed
network with little or no host CPU involvement.  In this context,
the NFS protocol, in all its versions, is an obvious beneficiary of
RDMA.  A complete problem statement is discussed in [NFSRDMAPS],
and related NFSv4 issues are discussed in [NFSSESS].  Many other
RPC-based protocols will also benefit.
Although the RDMA transport described here provides relatively
transparent support for any RPC application, the proposal goes
further in describing mechanisms that can optimize the use of RDMA
with more active participation by the RPC application.
2. Abstract RDMA Model

An RPC transport is responsible for conveying an RPC message from a
sender to a receiver.  An RPC message is either an RPC call from a
client to a server, or an RPC reply from the server back to the
client.  An RPC message contains an RPC call header followed by
arguments if the message is an RPC call, or an RPC reply header
followed by results if the message is an RPC reply.  The call
header contains a transaction ID (XID) followed by the program and
procedure number as well as a security credential.  An RPC reply
header begins with an XID that matches that of the RPC call
message, followed by a security verifier and results.  All data in
an RPC message is XDR encoded.  For a complete description of the
RPC protocol and XDR encoding, see [RFC1831] and [RFC1832].
This protocol assumes an abstract model for RDMA transports.  The
following terms, common in the RDMA lexicon, are used in this
document.  A more complete glossary of RDMA terms can be found in
[RDMA].
o  Registered Memory

   All data moved via tagged RDMA operations must be resident in
   registered memory at its destination.  This protocol assumes
   that each segment of registered memory is identified with a
   steering tag of no more than 32 bits and memory addresses of
   up to 64 bits in length.
o  RDMA Send

   The RDMA provider supports an RDMA Send operation with
   completion signalled at the receiver when data is placed in a
   pre-posted buffer.  The amount of transferred data is limited
   only by the size of the receiver's buffer.  Sends complete at
   the receiver in the order they were issued at the sender.
o  RDMA Write

   The RDMA provider supports an RDMA Write operation to directly
   place data in the receiver's buffer.  An RDMA Write is
   initiated by the sender and completion is signalled at the
   sender.  No completion is signalled at the receiver.  The
   sender uses a steering tag, memory address and length of the
   remote destination buffer.  RDMA Writes are not necessarily
   ordered with respect to one another, but are ordered with
   respect to RDMA Sends; a subsequent RDMA Send completion must
   be obtained at the receiver to notify that prior RDMA Write
   data has been successfully placed in the receiver's memory.
o  RDMA Read

   The RDMA provider supports an RDMA Read operation to directly
   place peer source data in the requester's buffer.  An RDMA
   Read is initiated by the receiver and completion is signalled
   at the receiver.  The receiver provides steering tags, memory
   addresses and a length for the remote source and local
   destination buffers.  Since the peer at the data source
   receives no notification of RDMA Read completion, there is an
   assumption that on receiving the data the receiver will signal
   completion with an RDMA Send message, so that the peer can
   free the source buffers and the associated steering tags.
This protocol is designed to function with equivalent semantics
over all appropriate RDMA transports.  In its abstract form, this
protocol does not implement RDMA directly.  Instead, it conveys to
the RPC peer information sufficient to direct an RDMA
implementation to perform transfers containing RPC data, and to
communicate their result(s).  It therefore becomes a useful,
implementable standard when mapped onto a specific RDMA transport,
such as iWARP [RDDP] or Infiniband [IB].
3. Protocol Outline

An RPC message can be conveyed in identical fashion, whether it is
a call or reply message.  In each case, the transmission of the
message proper is preceded by transmission of a transport-specific
header for use by RPC over RDMA transports.  This header is
analogous to the record marking used for RPC over TCP, but is more
extensive, since RDMA transports support several modes of data
transfer and it is important to allow the client and server to use
the most efficient mode for any given transfer.  Multiple segments
of a message may be transferred in different ways to different
remote memory destinations.
All transfers of a call or reply begin with an RDMA Send which
transfers at least the RPC over RDMA header, usually with the call
or reply message appended, or at least some part thereof.  Because
the size of what may be transmitted via RDMA Send is limited by the
size of the receiver's pre-posted buffer, the RPC over RDMA
transport provides a number of methods to reduce the amount
transferred by means of the RDMA Send, when necessary, by
transferring various parts of the message using RDMA Read and RDMA
Write.
3.1. Short Messages

Many RPC messages are quite short.  For example, the NFS version 3
GETATTR request is only 56 bytes: 20 bytes of RPC header plus a 32
byte filehandle argument and 4 bytes of length.  The reply to this
common request is about 100 bytes.
There is no benefit in transferring such small messages with an
RDMA Read or Write operation.  The overhead in transferring
steering tags and memory addresses is justified only by large
transfers.  The critical message size that justifies RDMA transfer
will vary depending on the RDMA implementation and network, but is
typically of the order of a few kilobytes.  It is appropriate to
transfer a short message with an RDMA Send to a pre-posted buffer.
The RPC over RDMA header with the short message (call or reply)
immediately following is transferred using a single RDMA Send
operation.
Short RPC messages over an RDMA transport will look like this:

      Client                                Server
         |               RPC Call           |
    Send |  ------------------------------> |
         |                                  |
         |               RPC Reply          |
         |  <------------------------------ | Send
3.2. Data Chunks

Some protocols, like NFS, have RPC procedures that can transfer
very large "chunks" of data in the RPC call or reply and would
cause the maximum send size to be exceeded if one tried to transfer
them as part of the RDMA Send.  These large chunks typically range
from a kilobyte to a megabyte or more.  An RDMA transport can
transfer large chunks of data more efficiently via the direct
placement of an RDMA Read or RDMA Write operation.  Using direct
placement instead of in-line transfer not only avoids expensive
data copies, but provides correct data alignment at the
destination.
3.3. Flow Control

It is critical to provide RDMA Send flow control for an RDMA
connection.  RDMA receive operations will fail if a pre-posted
receive buffer is not available to accept an incoming RDMA Send,
and repeated occurrences of such errors can be fatal to the
connection.  This is a departure from conventional TCP/IP
networking where buffers are allocated dynamically on an as-needed
basis, and pre-posting is not required.
It is not practical to provide for fixed credit limits at the RPC
server.  Fixed limits scale poorly, since posted buffers are
dedicated to the associated connection until consumed by receive
operations.  Additionally for protocol correctness, the server must
always be able to reply to client requests, whether or not new
buffers have been posted to accept future receives.
Flow control for RDMA Send operations is implemented as a simple
request/grant protocol in the RPC over RDMA header associated with
each RPC message.  The RPC over RDMA header for RPC call messages
contains a requested credit value for the server, which may be
dynamically adjusted by the caller to match its expected needs.
The RPC over RDMA header for RPC reply messages provides the
granted result, which may have any value except it may not be zero
when no in-progress operations are present at the server, since
such a value would result in deadlock.  The value may be adjusted
up or down at each opportunity to match the server's needs or
policies.
While RPC calls may complete in any order, the current flow control
limit at the RPC server is known to the RPC client from the Send
ordering properties.  It is always the most recent server-granted
credits minus the number of requests in flight.
Certain RDMA implementations may impose additional flow control
restrictions, such as limits on RDMA Read operations in progress at
the responder. Because these operations are outside the scope of
this protocol, they are not addressed and must be provided for by
other layers. For example, a simple upper layer RPC consumer might
perform single-issue RDMA Read requests, while a more
sophisticated, multithreaded RPC consumer may implement its own
FIFO queue of such operations.
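As a purely illustrative, non-normative sketch of the credit
accounting described above, a client might track its available Send
credits as follows (the structure and function names here are
hypothetical and not part of the protocol):

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical per-connection credit state kept by an RPC
    * client.  "granted" holds the credit value from the most
    * recently received RPC over RDMA reply header; "in_flight"
    * counts requests sent but not yet completed. */
   struct rpcrdma_credit_state {
       uint32_t granted;     /* most recent server-granted credits */
       uint32_t in_flight;   /* requests currently outstanding     */
   };

   /* The current limit is the most recently granted value minus
    * the number of requests in flight; a new call (or Done
    * message, which also consumes a credit) may be sent only
    * while this returns true. */
   static bool
   rpcrdma_may_send(const struct rpcrdma_credit_state *cs)
   {
       return cs->in_flight < cs->granted;
   }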
3.4. XDR Encoding with Chunks

The data comprising an RPC call or reply message is marshaled or
serialized into a contiguous stream by an XDR routine.  XDR data
types such as integers, strings, arrays and linked lists are
commonly implemented over two very simple functions that encode
either an XDR data unit (32 bits) or an array of bytes.
Normally, the separate data items in an RPC call or reply are
encoded as a contiguous sequence of bytes for network transmission
over UDP or TCP.  However, in the case of an RDMA transport, local
routines such as XDR encode can determine that (for instance) an
opaque byte array is large enough to be more efficiently moved via
an RDMA data transfer operation like RDMA Read or RDMA Write.
Semantically speaking, the protocol has no restriction regarding
data types which may or may not be chunked.  In practice however,
efficiency considerations lead to the conclusion that certain data
types are not generally "chunkable".  Typically, only opaque and
aggregate data types which may attain substantial size are
considered to be eligible.  With today's hardware this size may be
a kilobyte or more.  However any object may be chosen for chunking
in any given message.

The eligibility of XDR data items to be candidates for being moved
as data chunks (as opposed to being marshalled inline) is not
specified by the RPC over RDMA protocol.  Chunk eligibility
criteria must be determined by each upper layer in order to provide
for an interoperable specification. One such example with
rationale, for the NFS protocol family, is provided in [NFSDDP].
The interface by which an upper layer implementation communicates
the eligibility of a data item locally to RPC for chunking is out
of scope for this specification. In many implementations, it is
possible to implement a transparent RPC chunking facility.
However, such implementations may lead to inefficiencies, either
because they require the RPC layer to perform expensive
registration and deregistration of memory "on the fly", or they may
require using RDMA chunks in reply messages, along with the
resulting additional handshaking with the RPC over RDMA peer.
However, these issues are purely local and implementations are free
to innovate.
When sending any message (request or reply) that contains an
eligible large data chunk, the XDR encoding routine avoids moving
the data into the XDR stream. Instead, it does not encode the data
portion, but records the address and size of each chunk in a
separate "read chunk list" encoded within RPC RDMA transport-
specific headers. Such chunks will be transferred via RDMA Read
operations initiated by the receiver.
When the read chunks are to be moved via RDMA, the memory for each
chunk must be registered. This registration may take place within
XDR itself, providing for full transparency to upper layers, or it
may be performed by any other specific local implementation.
Additionally, when making an RPC call that can result in bulk data
transferred in the reply, it is desirable to provide chunks to
accept the data directly via RDMA Write.  These write chunks will
therefore be pre-filled by the server prior to responding, and XDR
decode at the client will not be required.  These chunks undergo a
similar registration and advertisement via "write chunk lists"
built as a part of XDR encoding.

Some RPC client implementations are not able to determine where an
RPC call's results reside during the "encode" phase.  This makes it
difficult or impossible for the RPC client layer to encode the
write chunk list at the time of building the request. In this
case, it is difficult for the RPC implementation to provide
transparency to the RPC consumer, which may require recoding to
provide result information at this earlier stage.
Therefore if the RPC client does not make a write chunk list
available to receive the result, then the RPC server must return
data inline in the reply, or if it so chooses, via a read chunk
list. RPC clients are discouraged from omitting write chunk lists
for eligible replies, due to the lower performance of the
additional handshaking to perform data transfer, and the
requirement that the server must expose (and preserve) the reply
data for a period of time. In the absence of a server-provided
read chunk list in the reply, if the encoded reply overflows the
inline buffer, the RPC will fail.
When any data within a message is provided via either read or write
chunks, the chunk itself refers only to the data portion of the XDR
stream element.  In particular, for counted fields (e.g. a "<>"
encoding) the byte count which is encoded as part of the field
remains in the XDR stream, as well as being encoded in the chunk
list.  Only the data portion is elided.  This is important to
maintain upper layer implementation compatibility - both the count
and the data must be transferred as part of the XDR stream.  In
addition, any byte count in the XDR stream must match the sum of
the byte counts present in the corresponding read or write chunk
list.  If they do not agree, an RPC protocol encoding error
results.
The following items are contained in a chunk list entry.
Handle
   Steering tag or handle obtained when the chunk memory is
   registered for RDMA.

Length
   The length of the chunk in bytes.

Offset
   The offset or memory address of the chunk.  In order to
   support the widest array of RDMA implementations, as well as
   the most general steering tag scheme, this field is
   unconditionally included in each chunk list entry.

Position
   For data which is to be encoded, the position in the XDR
   stream where the chunk would normally reside.  Note that the
   chunk therefore inserts its data into the XDR stream at this
   position, but its transfer is no longer "inline".  Also note
   it is possible that a contiguous sequence of chunks might all
   have the same position.  For data which is to be decoded, no
   "position" is used.
When XDR marshaling is complete, the chunk list is XDR encoded,
then sent to the receiver prepended to the RPC message.  Any source
data for a read chunk, or the destination of a write chunk, remain
behind in the sender's registered memory and their actual payload
is not marshalled into the request or reply.
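For concreteness, a chunk list entry carrying the items described
above might be represented in memory as in the following
non-normative C sketch.  The names are hypothetical; the normative
on-the-wire form is the XDR given in the XDR Language Description
(Section 4.2).

   #include <stdint.h>

   /* One registered memory segment: steering tag (handle), length,
    * and the offset or memory address of the chunk. */
   struct rpcrdma_segment {
       uint32_t handle;   /* steering tag from registration     */
       uint32_t length;   /* length of the chunk in bytes       */
       uint64_t offset;   /* offset or address, up to 64 bits   */
   };

   /* A read chunk additionally carries the XDR stream position at
    * which its data would normally reside; chunks intended for
    * decode (write chunks) carry no position. */
   struct rpcrdma_read_chunk {
       uint32_t position;              /* XDR stream position */
       struct rpcrdma_segment target;
   };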
   +----------------+----------------+-------------
   | RPC over RDMA  |                |
   | header w/      |   RPC Header   | Non-chunk args/results
   | chunks         |                |
   +----------------+----------------+-------------
Read chunk lists and write chunk lists are structured somewhat
differently.  This is due to the different usage - read chunks are
decoded and indexed by their position in the XDR data stream, their
size is always known, and they may be used for both arguments and
results.  Write chunks on the other hand are used only for results,
and have neither a preassigned offset in the XDR stream nor a size
until the results are produced.  The mapping of write chunks onto
designated NFS procedures and their results is described in
[NFSDDP].
Therefore, read chunks are encoded into a read chunk list as a
single array, with each entry tagged by its position in the XDR
stream.  Write chunks are encoded as a list of arrays of RDMA
buffers, with each list element (an array) providing buffers for a
separate result.  Individual write chunk list elements may thereby
be partially or fully filled, or in fact not filled at all.
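The two shapes can be pictured with the following non-normative C
sketch (hypothetical names; the segment structure is as in the
earlier sketch, and the wire format remains the XDR of Section
4.2):

   #include <stddef.h>
   #include <stdint.h>

   struct rpcrdma_segment {
       uint32_t handle;
       uint32_t length;
       uint64_t offset;
   };

   /* Read chunks: one flat array, each entry tagged with its XDR
    * stream position. */
   struct rpcrdma_read_chunk {
       uint32_t position;
       struct rpcrdma_segment target;
   };

   /* Write chunks: a list whose elements are arrays of segments,
    * one array per returned result.  An array may end up fully,
    * partially, or not at all filled by the server. */
   struct rpcrdma_write_array {
       uint32_t nsegs;                 /* segments in this array */
       struct rpcrdma_segment *segs;
   };

   struct rpcrdma_write_list {
       size_t nresults;                /* one array per result */
       struct rpcrdma_write_array *results;
   };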
3.5. Padding

Alignment of specific opaque data enables certain scatter/gather
optimizations.  Padding leverages the useful property that RDMA
transfers preserve alignment of data, even when they are placed
into pre-posted receive buffers by Sends.
Many servers can make good use of such padding.  Padding allows the
chaining of RDMA receive buffers such that any data transferred by
RDMA on behalf of RPC requests will be placed into appropriately
aligned buffers on the system that receives the transfer.  In this
way, the need for servers to perform RDMA Read to satisfy all but
the largest client writes is obviated.
The effect of padding is demonstrated below showing prior bytes on
an XDR stream (XXX) followed by an opaque field consisting of four
length bytes (LLLL) followed by data bytes (DDDD).  The receiver of
the RDMA Send has posted two chained receive buffers.  Without
padding, the opaque data is split across the two buffers.  With the
addition of padding bytes (ppp) prior to the first data byte, the
data can be forced to align correctly in the second buffer.
                    Buffer 1                Buffer 2
   Unpadded      --------------          --------------

   XXXXXXXLLLLDDDDDDDDDDDDDD   --->   XXXXXXXLLLLDDD  DDDDDDDDDDD
   Padded

   XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp  DDDDDDDDDDDDDD
Padding is implemented completely within the RDMA transport
encoding, flagged with a specific message type.  Where padding is
applied, two values are passed to the peer: an "rdma_align" which
is the padding value used, and "rdma_thresh", which is the opaque
data size at or above which padding is applied.  For instance, if
the server is using chained 4 KB receive buffers, then up to (4 KB
- 1) padding bytes could be used to achieve alignment of the data.
If padding is to apply only to chunks at least 1 KB in size, then
the threshold should be set to 1 KB.  The XDR routine at the peer
will consult these values when decoding opaque values.  Where the
decoded length exceeds the rdma_thresh, the XDR decode will skip
over the appropriate padding as indicated by rdma_align and the
current XDR stream position.
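One plausible reading of this rule, shown as a non-normative C
sketch with hypothetical names: when a decoded opaque length meets
the threshold, the decoder skips enough padding to carry the
current stream position to the next rdma_align boundary.

   #include <stdint.h>

   /* Number of padding bytes to skip before an opaque field of
    * "length" bytes, given the negotiated rdma_align and
    * rdma_thresh values and the current XDR stream position.
    * This is only an illustration; the exact rule is fixed by the
    * RDMA_MSGP message definition. */
   static uint32_t
   rpcrdma_pad_to_skip(uint32_t rdma_align, uint32_t rdma_thresh,
                       uint32_t length, uint32_t xdr_pos)
   {
       if (rdma_align == 0 || length < rdma_thresh)
           return 0;                     /* padding not applied */
       return (rdma_align - (xdr_pos % rdma_align)) % rdma_align;
   }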
3.6. XDR Decoding with Read Chunks

The XDR decode process moves data from an XDR stream into a data
structure provided by the client or server application.  Where
elements of the destination data structure are buffers or strings,
the RPC application can either pre-allocate storage to receive the
data, or leave the string or buffer fields null and allow the XDR
decode stage of RPC processing to automatically allocate storage of
sufficient size.
When decoding a message from an RDMA transport, the receiver first
XDR decodes the chunk lists from the RPC over RDMA header, then
proceeds to decode the body of the RPC message (arguments or
results).  Whenever the XDR offset in the decode stream matches
that of a chunk in the read chunk list, the XDR routine initiates
an RDMA Read to bring over the chunk data into locally registered
memory for the destination buffer.

When processing an RPC request, the RPC receiver (server)
acknowledges its completion of use of the source buffers by simply
replying to the RPC sender (client), and the peer may free all
source buffers advertised by the request.
When processing an RPC reply, after completing such a transfer the
RPC receiver (client) must issue an RDMA_DONE message (described in
Section 3.8) to notify the peer (server) that the source buffers
can be freed.
The read chunk list is constructed and used entirely within the
RPC/XDR layer.  Other than specifying the minimum chunk size, the
management of the read chunk list is automatic and transparent to
an RPC application.
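The decode-time behaviour described above can be sketched as
follows.  This is a non-normative illustration; the helper
functions are hypothetical and stand in for a local memory
registration facility and the RDMA provider's Read operation.

   #include <stddef.h>
   #include <stdint.h>

   struct rpcrdma_segment { uint32_t handle, length; uint64_t offset; };
   struct rpcrdma_read_chunk {
       uint32_t position;
       struct rpcrdma_segment target;
   };

   /* Hypothetical local helpers assumed by this sketch. */
   extern void *register_local_buffer(size_t len, uint32_t *lhandle);
   extern int   rdma_read(uint32_t rhandle, uint64_t roffset,
                          void *buf, uint32_t lhandle, uint32_t len);

   /* Fetch every read chunk whose position matches the current
    * XDR decode offset.  A contiguous run of chunks may share the
    * same position, so all matches are consumed. */
   static int
   fetch_chunks_at(const struct rpcrdma_read_chunk *chunks, size_t n,
                   uint32_t xdr_offset)
   {
       for (size_t i = 0; i < n; i++) {
           if (chunks[i].position != xdr_offset)
               continue;
           uint32_t lhandle;
           void *buf = register_local_buffer(chunks[i].target.length,
                                             &lhandle);
           if (buf == NULL ||
               rdma_read(chunks[i].target.handle,
                         chunks[i].target.offset,
                         buf, lhandle, chunks[i].target.length) != 0)
               return -1;    /* reported as a transport error */
       }
       return 0;
   }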
3.7. XDR Decoding with Write Chunks

When a "write chunk list" is provided for the results of the RPC
call, the server must provide any corresponding data via RDMA Write
to the memory referenced in the chunk list entries.  The RPC reply
conveys this by returning the write chunk list to the client with
the lengths rewritten to match the actual transfer.  The XDR
"decode" of the reply therefore performs no local data transfer but
merely returns the length obtained from the reply.
Each decoded result consumes one entry in the write chunk list,
which in turn consists of an array of RDMA segments.  The length is
therefore the sum of all returned lengths in all segments
comprising the corresponding list entry.  As each list entry is
"decoded", the entire entry is consumed.
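A non-normative sketch of this length computation (hypothetical
names):

   #include <stddef.h>
   #include <stdint.h>

   struct rpcrdma_segment {
       uint32_t handle;
       uint32_t length;   /* rewritten by the server to the number
                             of bytes actually RDMA Written */
       uint64_t offset;
   };

   /* "Decoding" one write chunk list entry consumes the whole
    * entry and yields only the total transferred length; no local
    * data movement takes place. */
   static uint64_t
   write_chunk_result_length(const struct rpcrdma_segment *segs,
                             size_t nsegs)
   {
       uint64_t total = 0;
       for (size_t i = 0; i < nsegs; i++)
           total += segs[i].length;
       return total;
   }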
The write chunk list is constructed and used by the RPC
application.  The RPC/XDR layer simply conveys the list between
client and server and initiates the RDMA Writes back to the client.
The mapping of write chunk list entries to procedure arguments must
be determined for each protocol.  An example of a mapping is
described in [NFSDDP].
3.8. RPC Call and Reply

The RDMA transport for RPC provides three methods of moving data
between client and server:
In-line
   Data are moved between client and server within an RDMA Send.

RDMA Read
   Data are moved between client and server via an RDMA Read
   operation via steering tag, address and offset obtained from a
   read chunk list.

RDMA Write
   Result data is moved from server to client via an RDMA Write
   operation via steering tag, address and offset obtained from a
   write chunk list or reply chunk in the client's RPC call
   message.
These methods of data movement may occur in combinations within a
single RPC.  For instance, an RPC call may contain some in-line
data along with some large chunks transferred via RDMA Read by the
server.  The reply to that call may have some result chunks that
the server RDMA Writes back to the client.  The following protocol
interactions illustrate RPC calls that use these methods to move
RPC message data:

An RPC with write chunks in the call message looks like this:
      Client                                Server
         |     RPC Call + Write Chunk list  |
    Send |  ------------------------------> |
         |                                  |
         |               Chunk 1            |
         |  <------------------------------ | Write
         |                  :               |
         |               Chunk n            |
         |  <------------------------------ | Write
         |                                  |
         |               RPC Reply          |
         |  <------------------------------ | Send
In the presence of write chunks, RDMA ordering provides the
guarantee that any RDMA Write operations from the server have
completed prior to the client's RPC reply processing.
An RPC with read chunks in the call message looks like this:

      Client                                Server
         |     RPC Call + Read Chunk list   |
    Send |  ------------------------------> |
         |                                  |
         |               Chunk 1            |
         |  +------------------------------ | Read
         |  v-----------------------------> |
         |                  :               |
         |               Chunk n            |
         |  +------------------------------ | Read
         |  v-----------------------------> |
         |                                  |
         |               RPC Reply          |
         |  <------------------------------ | Send

An RPC with read chunks in the reply message looks like this:

      Client                                Server
         |               RPC Call           |
    Send |  ------------------------------> |
         |                                  |
         |     RPC Reply + Read Chunk list  |
         |  <------------------------------ | Send
         |                                  |
         |               Chunk 1            |
    Read |  ------------------------------+ |
         |  <-----------------------------v |
         |                  :               |
         |               Chunk n            |
    Read |  ------------------------------+ |
         |  <-----------------------------v |
         |                                  |
         |               Done               |
    Send |  ------------------------------> |
The final Done message allows the client to signal the server that
it has received the chunks, so the server can de-register and free
the memory holding the chunks.  A Done completion is not necessary
for an RPC call, since the RPC reply Send is itself a receive
completion notification.
The Done message has no effect on protocol latency since the client
has no expectation of a reply from the server.  Nor does it
adversely affect bandwidth since it is only 16 bytes in length.  In
the event that the client fails to return the Done message within
some timeout period, the server may conclude that a protocol
violation has occurred and close the RPC connection, or it may
proceed with a de-register and free its chunk buffers.  This may
result in a fatal RDMA error if the client later attempts to
perform an RDMA Read operation, which amounts to the same thing.
It is important to note that the Done message consumes a credit at
the server.  The client must account for this in its accounting of
available credits, and the server should replenish the credit
consumed by Done at its earliest opportunity.
Finally, it is possible to conceive of RPC exchanges that involve
any or all combinations of write chunks in the RPC call, read
chunks in the RPC call, and read chunks in the RPC reply.  Support
for such exchanges is straightforward from a protocol perspective,
but in practice such exchanges would be quite rare, limited to
upper layer protocol exchanges which transferred bulk data in both
the call and corresponding reply.
4. RPC RDMA Message Layout

RPC call and reply messages are conveyed across an RDMA transport
with a prepended RPC over RDMA header.  The RPC over RDMA header
includes data for RDMA flow control credits, padding parameters and
lists of addresses that provide direct data placement via RDMA Read
and Write operations.  The layout of the RPC message itself is
unchanged from that described in [RFC1831] except for the possible
exclusion of large data chunks that will be moved by RDMA Read or
Write operations.  If the RPC message (along with the RPC over RDMA
header) is too long for the posted receive buffer (even after any
large chunks are removed), then the entire RPC message can be moved
separately as a chunk, leaving just the RPC over RDMA header in the
RDMA Send.
4.1. RPC over RDMA Header

The RPC over RDMA header begins with four 32-bit fields that are
always present and which control the RDMA interaction including
RDMA-specific flow control.  These are then followed by a number of
items such as chunk lists and padding which may or may not be
present depending on the type of transmission.  The four fields
which are always present are:
1. Transaction ID (XID).

   The XID generated for the RPC call and reply. Having the XID
   at the beginning of the message makes it easy to establish the
   message context. This XID mirrors the XID in the RPC header,
   and takes precedence. The receiver may ignore the XID in the
   RPC header, if it so chooses.
2. Version number.

   This version of the RPC RDMA message protocol is 1. The
   version number must be increased by one whenever the format of
   the RPC RDMA messages is changed.
3. Flow control credit value.

   When sent in an RPC call message, the requested value is
   provided. When sent in an RPC reply message, the granted value
   is returned. RPC calls must not be sent in excess of the
   currently granted limit.
4. Message type.

   o RDMA_MSG = 0 indicates that chunk lists and RPC message
     follow.

   o RDMA_NOMSG = 1 indicates that after the chunk lists there is
     no RPC message. In this case, the chunk lists provide
     information to allow the message proper to be transferred
     using RDMA Read or Write and thus is not appended to the RPC
     over RDMA header.

   o RDMA_MSGP = 2 indicates that a chunk list and RPC message
     with some padding follow.

   o RDMA_DONE = 3 indicates that the message signals the
     completion of a chunk transfer via RDMA Read.

   o RDMA_ERROR = 4 is used to signal any detected error(s) in
     the RPC RDMA chunk encoding.
Because the version number is encoded as part of this header, and
the RDMA_ERROR message type is used to indicate errors, these first
four fields and the start of the following message body must always
remain aligned at these fixed offsets for all versions of the RPC
over RDMA header.
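As a non-normative illustration, the fixed portion of the header
might be represented in C as shown below; the type and field names
are assumptions made for this sketch, and the authoritative
on-the-wire form is the XDR given in section 4.2.

    #include <stdint.h>

    /* Non-normative sketch of the four fixed 32-bit header fields. */
    enum rdma_proc {
        RDMA_MSG   = 0,   /* chunk lists and RPC message follow        */
        RDMA_NOMSG = 1,   /* chunk lists only; message moved as chunk  */
        RDMA_MSGP  = 2,   /* chunk lists and padded RPC message follow */
        RDMA_DONE  = 3,   /* signals completion of an RDMA Read        */
        RDMA_ERROR = 4    /* signals a version or chunk encoding error */
    };

    struct rpcrdma_fixed_header {
        uint32_t xid;     /* mirrors, and takes precedence over,
                             the XID in the RPC header                 */
        uint32_t vers;    /* RPC over RDMA version, currently 1        */
        uint32_t credit;  /* credits requested (call) or granted (reply) */
        uint32_t proc;    /* one of enum rdma_proc                     */
    };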
For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write
chunk lists follow. If the Read chunk list is null (a 32 bit word
of zeros), then there are no chunks to be transferred separately
and the RPC message follows in its entirety. If non-null, then
it's the beginning of an XDR encoded sequence of Read chunk list
entries. If the Write chunk list is non-null, then an XDR encoded
sequence of Write chunk entries follows.
If the message type is RDMA_MSGP, then two additional fields that
specify the padding alignment and threshold are inserted prior to
the Read and Write chunk lists.
A header of message type RDMA_MSG or RDMA_MSGP will be followed by
the RPC call or RPC reply message body, beginning with the XID.
The XID in the RDMA_MSG or RDMA_MSGP header must match this.
+--------+---------+---------+-----------+-------------+----------
|        |         |         |  Message  |    NULLs    | RPC Call
|  XID   | Version | Credits |   Type    |     or      |    or
|        |         |         |           | Chunk Lists | Reply Msg
+--------+---------+---------+-----------+-------------+----------
Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or
RPC message follows. As an implementation hint: a gather operation
on the Send of the RDMA RPC message can be used to marshal the
initial header, the chunk list, and the RPC message itself.
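A minimal sketch of that hint follows. post_rdma_send() is a
hypothetical stand-in for whatever Send-with-gather operation the
local RDMA provider offers; it is not defined by this protocol.

    #include <sys/uio.h>

    /*
     * Hypothetical provider call: posts one RDMA Send whose payload is
     * the concatenation of the iovec elements, all in registered memory.
     */
    extern int post_rdma_send(void *endpoint, const struct iovec *iov,
                              int iovcnt);

    static int
    send_rpc_rdma_msg(void *ep,
                      void *hdr,    size_t hdrlen,    /* fixed header        */
                      void *chunks, size_t chunklen,  /* encoded chunk lists */
                      void *rpcmsg, size_t rpclen)    /* RPC call/reply body */
    {
        struct iovec iov[3] = {
            { hdr,    hdrlen   },
            { chunks, chunklen },
            { rpcmsg, rpclen   },
        };
        return post_rdma_send(ep, iov, 3);
    }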
4.2. XDR Language Description
Here is the message layout in XDR language.
    struct xdr_rdma_segment {
        uint32 handle;    /* Registered memory handle */
        uint32 length;    /* Length of the chunk in bytes */
        uint64 offset;    /* Chunk virtual address or offset */
    };
5. Large Chunkless Messages
The receiver of RDMA Send messages is required to have previously
posted one or more correctly sized buffers. The client can inform
the server of the maximum size of its RDMA Send messages via the
Connection Configuration Protocol described later in this document.
Since RPC messages are frequently small, memory savings can be
achieved by posting small buffers. Even large messages like NFS
READ or WRITE will be quite small once the chunks are removed from
the message. However, there may be large, chunkless messages that
would demand a very large buffer be posted. A good example is an
NFS READDIR reply which may contain a large number of small
filename strings. Also, the NFS version 4 protocol [RFC3530]
features COMPOUND request and reply messages of unbounded length.
Ideally, each upper layer will negotiate these limits. However, it
is frequently necessary to provide a transparent solution.
5.1. Message as an RDMA Read Chunk
One relatively simple method is to have the client identify any RPC
message that exceeds the server's posted buffer size and move it
separately as a chunk, i.e. reference it as the first entry in the
read chunk list with an XDR position of zero.
    +--------+---------+---------+------------+-------------+
    |  XID   | Version | Credits | RDMA_NOMSG | Chunk Lists |
    |        |         |         |            |             |
    +--------+---------+---------+------------+-------------+
             |
             |  +----------
             |  | Long RPC Call
             +->| or
                | Reply Message
                +----------
If the receiver gets an RPC over RDMA header with a message type of
RDMA_NOMSG and finds an initial read chunk list entry with a zero
XDR position, it allocates a registered buffer and issues an RDMA
Read of the long RPC message into it. The receiver then proceeds
to XDR decode the RPC message as if it had received it in-line with
the Send data. Further decoding may issue additional RDMA Reads to
bring over additional chunks.
Although the handling of long messages requires one extra network
turnaround, in practice these messages should be rare if the posted
receive buffers are correctly sized, and of course they will be
non-existent for RDMA-aware upper layers.
An RPC with long reply returned via RDMA Read looks like this:
    Client                                Server
       |            RPC Call                 |
  Send |  ------------------------------>    |
       |                                     |
       |       RPC over RDMA Header          |
       |  <------------------------------    | Send
       |                                     |
       |        Long RPC Reply Msg           |
  Read |  ------------------------------+    |
       |  <-----------------------------v    |
       |                                     |
       |              Done                   |
  Send |  ------------------------------>    |
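The client-side choice described in this section could be sketched
as follows; maxcall_sendsize and the helper functions are
assumptions standing in for local interfaces, not elements of the
protocol.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical local helpers. */
    extern int  register_memory(const void *addr, size_t len,
                                uint32_t *handle);
    extern void add_read_chunk(void *hdr, uint32_t xdr_position,
                               uint32_t handle, const void *addr,
                               size_t len);
    extern int  send_inline(void *ep, void *hdr, const void *msg,
                            size_t len);                 /* RDMA_MSG   */
    extern int  send_header_only(void *ep, void *hdr);   /* RDMA_NOMSG */

    /* Non-normative sketch: in-line call versus "message as a chunk". */
    static int
    send_call(void *ep, void *hdr, const void *rpcmsg, size_t len,
              size_t maxcall_sendsize)
    {
        if (len <= maxcall_sendsize)
            return send_inline(ep, hdr, rpcmsg, len);

        uint32_t handle;
        if (register_memory(rpcmsg, len, &handle) != 0)
            return -1;
        /* XDR position zero marks the entire message as the chunk. */
        add_read_chunk(hdr, 0, handle, rpcmsg, len);
        return send_header_only(ep, hdr);
    }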
5.2. RDMA Write of Long Replies (Reply Chunks)
A superior method of handling long, chunkless RPC replies is to
have the client post a large buffer into which the server can write
a large RPC reply. This has the advantage that an RDMA Write may
be slightly faster in network latency than an RDMA Read.
Additionally, for a reply it removes the need for the RDMA_DONE
message that would otherwise be required if the large reply were
returned as a Read chunk.
This protocol supports direct return of a large reply via the
inclusion of an optional rdma_reply write chunk after the read
chunk list and the write chunk list. The client allocates a buffer
sized to receive a large reply and enters its steering tag, address
and length in the rdma_reply write chunk. If the reply message is
too long to return in-line with an RDMA Send (exceeds the size of
the client's posted receive buffer), even with read chunks removed,
then the server RDMA Writes the RPC reply message into the buffer
indicated by the rdma_reply chunk. If the client doesn't provide
an rdma_reply chunk, or if it's too small, then the message must be
returned as a Read chunk.
An RPC with long reply returned via RDMA Write looks like this:
    Client                                Server
       |     RPC Call with rdma_reply        |
  Send |  ------------------------------>    |
       |                                     |
       |        Long RPC Reply Msg           |
       |  <------------------------------    | Write
       |                                     |
       |       RPC over RDMA Header          |
       |  <------------------------------    | Send
The use of RDMA Write to return long replies requires that the
client application anticipate a long reply and have some knowledge
of its size so that a correctly sized buffer can be allocated.
This is certainly true of NFS READDIR replies, where the client
already provides an upper bound on the size of the encoded
directory fragment to be returned by the server.
The use of these "reply chunks" is highly efficient and convenient
for both client and server. Their use is encouraged for eligible
RPC operations such as NFS READDIR, which would otherwise require
extensive chunk management within the results or use of RDMA Read
and a Done message.
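A server's choice among an in-line reply, the rdma_reply chunk, and
a message-as-read-chunk reply might follow the precedence sketched
below; the helper names are hypothetical, and only the ordering of
the checks comes from the text above.

    #include <stddef.h>
    #include <stdint.h>

    struct xdr_rdma_segment {      /* mirrors the XDR in section 4.2 */
        uint32_t handle;
        uint32_t length;
        uint64_t offset;
    };

    /* Hypothetical local helpers. */
    extern int reply_inline(void *ep, const void *msg, size_t len);
    extern int reply_via_reply_chunk(void *ep, const void *msg, size_t len,
                                     const struct xdr_rdma_segment *rc);
    extern int reply_as_read_chunk(void *ep, const void *msg, size_t len);

    /* Non-normative sketch of server reply dispatch. */
    static int
    send_reply(void *ep, const void *msg, size_t len,
               size_t client_recv_size,
               const struct xdr_rdma_segment *rdma_reply)  /* NULL if absent */
    {
        if (len <= client_recv_size)
            return reply_inline(ep, msg, len);
        if (rdma_reply != NULL && rdma_reply->length >= len)
            return reply_via_reply_chunk(ep, msg, len, rdma_reply);
        /* No usable reply chunk: fall back to a Read chunk, which will
         * later require a Done message from the client. */
        return reply_as_read_chunk(ep, msg, len);
    }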
5.3. RPC over RDMA Header Errors
When a peer receives an RPC RDMA message, it must perform certain
basic validity checks on the header and chunk contents. If errors
are detected in an RPC request, an RDMA_ERROR reply should be
generated.
Two types of errors are defined, version mismatch and invalid chunk
format. When the peer detects an RPC over RDMA header version
which it does not support (currently this draft defines only
version 1), it replies with an error code of ERR_VERS, and provides
the low and high inclusive version numbers it does, in fact,
support. The version number in this reply can be any value
otherwise valid at the receiver. When other decoding errors are
detected in the header or chunks, either an RPC decode error may be
returned, or the error code ERR_CHUNK.
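Informally, an RDMA_ERROR reply carries the information sketched
below in C; the symbolic names here are illustrative assumptions,
and the authoritative encoding is the protocol's XDR.

    #include <stdint.h>

    /* Non-normative sketch of the content of an RDMA_ERROR reply.
     * Numeric values are assigned by the protocol's XDR, not here.
     */
    enum rpcrdma_errcode {
        ERR_VERS,          /* header version not supported            */
        ERR_CHUNK          /* bad header or chunk encoding            */
    };

    struct rpcrdma_err_vers {
        uint32_t low;      /* lowest RPC over RDMA version supported  */
        uint32_t high;     /* highest RPC over RDMA version supported */
    };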
6. Connection Configuration Protocol
RDMA Send operations require the receiver to post one or more
buffers at the RDMA connection endpoint, each large enough to
receive the largest Send message. Buffers are consumed as Send
messages are received. If a buffer is too small, or if there are
no buffers posted, the RDMA transport may return an error and break
the RDMA connection. The receiver must post sufficient, correctly
sized buffers to avoid buffer overrun or capacity errors.
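For example, a receiver might pre-post a ring of equally sized
buffers, as in this non-normative sketch; alloc_registered() and
post_receive_buffer() are hypothetical stand-ins for the local
provider's memory and receive-posting operations.

    #include <stddef.h>

    extern void *alloc_registered(size_t len);             /* hypothetical */
    extern int   post_receive_buffer(void *ep, void *buf,
                                     size_t len);          /* hypothetical */

    /* Pre-post 'count' receive buffers, each large enough for the
     * largest Send the peer may issue.
     */
    static int
    post_receives(void *ep, unsigned int count, size_t max_send_size)
    {
        for (unsigned int i = 0; i < count; i++) {
            void *buf = alloc_registered(max_send_size);
            if (buf == NULL ||
                post_receive_buffer(ep, buf, max_send_size) != 0)
                return -1;
        }
        return 0;
    }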
The protocol described above includes only a mechanism for managing
the number of such receive buffers, and no explicit features to
allow the client and server to provision or control buffer sizing,
nor any other session parameters.
In the past, this type of connection management has not been
necessary for RPC. RPC over UDP or TCP does not have a protocol to
negotiate the link. The server can get a rough idea of the maximum
size of messages from the server protocol code. However, a
protocol to negotiate transport features on a more dynamic basis is
desirable.
The Connection Configuration Protocol allows the client to pass its
connection requirements to the server, and allows the server to
inform the client of its connection limits.
6.1. Initial Connection State
This protocol will be used for connection setup prior to the use of
another RPC protocol that uses the RDMA transport. It operates
in-band, i.e. it uses the connection itself to negotiate the
connection parameters. To provide a basis for connection
negotiation, the connection is assumed to provide a basic level of
interoperability: the ability to exchange at least one RPC message
at a time that is at least 1 KB in size. The server may exceed
this basic level of configuration, but the client must not assume
it.
6.2. Protocol Description
Version 1 of the protocol consists of a single procedure that
allows the client to inform the server of its connection
requirements and the server to return connection information to the
client.
The maxcall_sendsize argument is the maximum size of an RPC call
message that the client will send in-line in an RDMA Send message
to the server. The server may return a maxcall_sendsize value that
is smaller or larger than the client's request. The client must
not send an in-line call message larger than what the server will
accept. The maxcall_sendsize limits only the size of in-line RPC
calls. It does not limit the size of long RPC messages transferred
as an initial chunk in the Read chunk list.
The maxreply_sendsize is the maximum size of an in-line RPC message
that the client will accept from the server.
The maxrdmaread is the maximum number of RDMA Reads which may be
active at the peer. This number correlates to the incoming RDMA
Read count ("IRD") configured into each originating endpoint by the
client or server. If more than this number of RDMA Read operations
by the connected peer are issued simultaneously, connection loss or
suboptimal flow control may result; therefore the value should be
observed at all times. The peers' values need not be equal. If
zero, the peer must not issue requests which require RDMA Read to
satisfy, as no transfer will be possible.
The align value is the value recommended by the server for opaque
data values such as strings and counted byte arrays. The client
can use this value to compute the number of prepended pad bytes
when XDR encoding opaque values in the RPC call message.
    typedef unsigned int uint32;
    struct config_rdma_req {
        uint32 maxcall_sendsize;   /* max size of in-line RPC call */
        uint32 maxreply_sendsize;  /* max size of in-line RPC reply */
        uint32 maxrdmaread;        /* max active RDMA Reads at client */
    };
    struct config_rdma_reply {
        uint32 maxcall_sendsize;   /* max call size accepted by server */
        uint32 align;              /* server's receive buffer alignment */
        uint32 maxrdmaread;        /* max active RDMA Reads at server */
    };
    program CONFIG_RDMA_PROG {
        version VERS1 {
            /*
             * Config call/reply
             */
            config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
        } = 1;
    } = nnnnnn;    <-- Need program number assigned
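Assuming rpcgen-style stubs generated from the definitions above
(the C structures, the xdr_config_rdma_req and xdr_config_rdma_reply
routines, and the CONF_RDMA constant), a client exchange could be
sketched as follows; the numeric values and error handling are
illustrative only.

    #include <rpc/rpc.h>

    static int
    negotiate_connection(CLIENT *clnt, struct config_rdma_reply *limits)
    {
        struct config_rdma_req req = {
            .maxcall_sendsize  = 1024,   /* largest in-line call we send  */
            .maxreply_sendsize = 4096,   /* largest in-line reply we take */
            .maxrdmaread       = 8,      /* RDMA Reads we accept at once  */
        };
        struct timeval tv = { 25, 0 };

        if (clnt_call(clnt, CONF_RDMA,
                      (xdrproc_t)xdr_config_rdma_req,   (caddr_t)&req,
                      (xdrproc_t)xdr_config_rdma_reply, (caddr_t)limits,
                      tv) != RPC_SUCCESS)
            return -1;

        /* limits->maxcall_sendsize, limits->align and limits->maxrdmaread
         * now bound what this client may send on the connection.
         */
        return 0;
    }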
7. Memory Registration Overhead
RDMA requires that all data be transferred between registered
memory regions at the source and destination. All protocol headers
as well as separately transferred data chunks must use registered
memory. Since the cost of registering and de-registering memory
can be a large proportion of the RDMA transaction cost, it is
important to minimize registration activity. This is easily
achieved within RPC controlled memory by allocating chunk list data
and RPC headers in a reusable way from pre-registered pools.
The data chunks transferred via RDMA may occupy memory that
persists outside the bounds of the RPC transaction. Hence, the
default behavior of an RPC over RDMA transport is to register and
de-register these chunks on every transaction. However, this is
not a limitation of the protocol - only of the existing local RPC
API. The API is easily extended through such functions as
rpc_control(3) to change the default behavior so that the
application can assume responsibility for controlling memory
registration through an RPC-provided registered memory allocator.
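One possible shape for such an extension is sketched below;
RPC_SET_MEMALLOC and the allocator structure are hypothetical names
invented for this example and are not an existing rpc_control(3)
request.

    #include <stddef.h>
    #include <rpc/rpc.h>

    /* Hypothetical allocator hook: applications allocate chunk data
     * from pre-registered memory, avoiding per-transaction
     * registration and deregistration.
     */
    struct rpc_mem_allocator {
        void *(*alloc)(size_t len);   /* returns registered memory         */
        void  (*free)(void *p);       /* releases it without deregistering */
    };

    #define RPC_SET_MEMALLOC 100      /* hypothetical request code */

    extern bool_t rpc_control(int request, void *info);

    static void
    install_allocator(struct rpc_mem_allocator *a)
    {
        (void) rpc_control(RPC_SET_MEMALLOC, a);
    }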
8. Errors and Error Recovery
Error reporting and recovery is outside the scope of this protocol.
It is assumed that the link itself will provide some degree of
error detection and retransmission. Additionally, the RPC layer
itself can accept errors from the link level and recover via
retransmission. RPC recovery can handle complete loss and
re-establishment of the link.
9. Node Addressing
In setting up a new RDMA connection, the first action by an RPC
client will be to obtain a transport address for the server. The
mechanism used to obtain this address, and to open an RDMA
connection, is dependent on the type of RDMA transport, and is
outside the scope of this protocol.
10. RPC Binding
RPC services normally register with a portmap or rpcbind service,
which associates an RPC program number with a service address. In
the case of UDP or TCP, the service address for NFS is normally
port 2049. This policy should be no different with RDMA
interconnects.
One possibility is to have the server's portmapper register itself
on the RDMA interconnect at a "well known" service address. On UDP
or TCP, this corresponds to port 111. A client could connect to
this service address and use the portmap protocol to obtain a
service address in response to a program number, e.g. a VI
discriminator or an Infiniband GID.
11. Security
ONC RPC provides its own security via the RPCSEC_GSS framework
[RFC2203]. RPCSEC_GSS can provide message authentication,
integrity checking, and privacy. This security mechanism will be
unaffected by the RDMA transport. The data integrity and privacy
features alter the body of the message, presenting it as a single
chunk. For large messages the chunk may be large enough to qualify
for RDMA Read transfer. However, there is much data movement
associated with computation and verification of integrity, or
encryption/decryption, so any performance advantage will be lost.
There should be no new issues here with exposed addresses. The
only exposed addresses here are in the chunk list and in the
transport packets generated by an RDMA. The data contained in
these addresses is adequately protected by RPCSEC_GSS integrity and
privacy. RPCSEC_GSS security mechanisms are typically implemented
by the host CPU. This additional data movement and CPU use may
cancel out much of the RDMA direct placement and offload benefit.
A more appropriate security mechanism for RDMA links may be link-
level protection, like IPSec, which may be co-located in the RDMA
link hardware. The use of link-level protection may be negotiated
through the use of a new RPCSEC_GSS mechanism like the Credential
Cache GSS Mechanism [CCM].
12. IANA Considerations
As a new RPC transport, this protocol should have no effect on RPC
program numbers or registered port numbers. The new RPC transport
should be assigned a new RPC "netid". If adopted, the Connection
Configuration protocol described herein will require an RPC program
number assignment.
13. Acknowledgements
The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
Kleiman, Mike Eisler, Mark Wittle and Shantanu Mehendale for their
contributions to this document.
14. Normative References
[RFC1831]
   R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification
   Version 2", Standards Track RFC,
   http://www.ietf.org/rfc/rfc1831.txt

[RFC1832]
   R. Srinivasan, "XDR: External Data Representation Standard",
   Standards Track RFC, http://www.ietf.org/rfc/rfc1832.txt

[RFC1813]
   B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
   Specification", Informational RFC,
   http://www.ietf.org/rfc/rfc1813.txt

[RFC3530]
   S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M.
   Eisler, D. Noveck, "NFS version 4 Protocol", Standards Track RFC,
   http://www.ietf.org/rfc/rfc3530.txt

[RFC2203]
   M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol Specification",
   Standards Track RFC, http://www.ietf.org/rfc/rfc2203.txt
15. Informative References
[RDMA]
   R. Recio et al, "An RDMA Protocol Specification", Internet Draft
   Work in Progress,
   http://www.ietf.org/internet-drafts/draft-ietf-rddp-rdmap-03.txt

[CCM]
   M. Eisler, N. Williams, "CCM: The Credential Cache GSS Mechanism",
   Internet Draft Work in Progress,
   http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-ccm-03.txt

[NFSDDP]
   B. Callaghan, T. Talpey, "NFS Direct Data Placement", Internet
   Draft Work in Progress,
   http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-nfsdirect-01.txt

[RDDP]
   Remote Direct Data Placement Working Group Charter,
   http://www.ietf.org/html.charters/rddp-charter.html

[NFSRDMAPS]
   T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet
   Draft Work in Progress,
   http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-nfs-rdma-problem-statement-02.txt

[NFSSESS]
   T. Talpey, S. Shepler, J. Bauman, "NFSv4 Session Extensions",
   Internet Draft Work in Progress,
   http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-nfs-sess-01.txt

[IB]
   Infiniband Architecture Specification, http://www.infinibandta.org
16. Authors' Addresses
Brent Callaghan
1614 Montalto Dr.
Mountain View, California 94040 USA

Phone: +1 650 968 2333
EMail: brent.callaghan@gmail.com
Tom Talpey
Network Appliance, Inc.
375 Totten Pond Road
Waltham, MA 02451 USA

Phone: +1 781 768 5329
EMail: thomas.talpey@netapp.com
17. Full Copyright Statement
Copyright (C) The Internet Society (2005). This document is
subject to the rights, licenses and restrictions contained in BCP
78 and except as set forth therein, the authors retain all their
rights.
This document and the information contained herein are provided on
an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed
to pertain to the implementation or use of the technology described
in this document or the extent to which any license under such
rights might or might not be available; nor does it represent that
it has made any independent effort to identify any such rights.
Information on the procedures with respect to rights in RFC
documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use
of such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository
at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.
Acknowledgement
Funding for the RFC Editor function is currently provided by the
Internet Society.