Internet Engineering Task Force (IETF)                         T. Talpey
Request for Comments: 5666                                  Unaffiliated
Category: Standards Track                                   B. Callaghan
ISSN: 2070-1721                                                    Apple
                                                            January 2010

    Remote Direct Memory Access Transport for Remote Procedure Call

Abstract

This document describes a protocol providing Remote Direct Memory
Access (RDMA) as a new transport for Remote Procedure Call (RPC).
The RDMA transport binding conveys the benefits of efficient, bulk-
data transport over high-speed networks, while providing for minimal
change to RPC applications and with no required revision of the
application RPC protocol, or the RPC protocol itself.

Status of This Memo

This is an Internet Standards Track document.

This document is a product of the Internet Engineering Task Force
(IETF).  It represents the consensus of the IETF community.  It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG).  Further information on
Internet Standards is available in Section 2 of RFC 5741.

Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
http://www.rfc-editor.org/info/rfc5666.

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.  Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.

This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November
10, 2008.  The person(s) controlling the copyright in some of this
material may not have granted the IETF Trust the right to allow
modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other
than English.
Table of Contents

   1. Introduction
      1.1. Requirements Language
   2. Abstract RDMA Requirements
   3. Protocol Outline
      3.1. Short Messages
      3.2. Data Chunks
      3.3. Flow Control
      3.4. XDR Encoding with Chunks
      3.5. XDR Decoding with Read Chunks
      3.6. XDR Decoding with Write Chunks
      3.7. XDR Roundup and Chunks
      3.8. RPC Call and Reply
      3.9. Padding
   4. RPC RDMA Message Layout
      4.1. RPC-over-RDMA Header
      4.2. RPC-over-RDMA Header Errors
      4.3. XDR Language Description
   5. Long Messages
      5.1. Message as an RDMA Read Chunk
      5.2. RDMA Write of Long Replies (Reply Chunks)
   6. Connection Configuration Protocol
      6.1. Initial Connection State
      6.2. Protocol Description
   7. Memory Registration Overhead
   8. Errors and Error Recovery
   9. Node Addressing
   10. RPC Binding
   11. Security Considerations
   12. IANA Considerations
   13. Acknowledgments
   14. References
      14.1. Normative References
      14.2. Informative References
1. Introduction

Remote Direct Memory Access (RDMA) [RFC5040, RFC5041] [IB] is a
technique for efficient movement of data between end nodes, which
becomes increasingly compelling over high-speed transports.  By
directing data into destination buffers as it is sent on a network,
and placing it via direct memory access by hardware, the double
benefit of faster transfers and reduced host overhead is obtained.

Open Network Computing Remote Procedure Call (ONC RPC, or simply,
RPC) [RFC5531] is a remote procedure call protocol that has been run
over a variety of transports.  Most RPC implementations today use UDP
or TCP.  RPC messages are defined in terms of an eXternal Data
Representation (XDR) [RFC4506], which provides a canonical data
representation across a variety of host architectures.  An XDR data
stream is conveyed differently on each type of transport.  On UDP,
RPC messages are encapsulated inside datagrams, while on a TCP byte
stream, RPC messages are delineated by a record marking protocol.  An
RDMA transport also conveys RPC messages in a unique fashion that
must be fully described if client and server implementations are to
interoperate.

RDMA transports present new semantics unlike the behaviors of either
UDP or TCP alone.  They retain message delineations like UDP, while
also providing a reliable, sequenced data transfer like TCP.  Also,
they provide the new efficient, bulk-transfer service of RDMA.  RDMA
transports are therefore naturally viewed as a new transport type by
RPC.

RDMA as a transport will benefit the performance of RPC protocols
that move large "chunks" of data, since RDMA hardware excels at
moving data efficiently between host memory and a high-speed network
with little or no host CPU involvement.  In this context, the Network
File System (NFS) protocol, in all its versions [RFC1094] [RFC1813]
[RFC3530] [RFC5661], is an obvious beneficiary of RDMA.  A complete
problem statement is discussed in [RFC5532], and related NFSv4 issues
are discussed in [RFC5661].  Many other RPC-based protocols will also
benefit.

Although the RDMA transport described here provides relatively
transparent support for any RPC application, the proposal goes
further in describing mechanisms that can optimize the use of RDMA
with more active participation by the RPC application.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
2. Abstract RDMA Requirements

An RPC transport is responsible for conveying an RPC message from a
sender to a receiver.  An RPC message is either an RPC call from a
client to a server, or an RPC reply from the server back to the
client.  An RPC message contains an RPC call header followed by
arguments if the message is an RPC call, or an RPC reply header
followed by results if the message is an RPC reply.  The call header
contains a transaction ID (XID) followed by the program and procedure
number as well as a security credential.  An RPC reply header begins
with an XID that matches that of the RPC call message, followed by a
security verifier and results.  All data in an RPC message is XDR
encoded.  For a complete description of the RPC protocol and XDR
encoding, see [RFC5531] and [RFC4506].

This protocol assumes the following abstract model for RDMA
transports.  These terms, common in the RDMA lexicon, are used in
this document.  A more complete glossary of RDMA terms can be found
in [RFC5040].

o  Registered Memory
   All data moved via tagged RDMA operations is resident in
   registered memory at its destination.  This protocol assumes
   that each segment of registered memory MUST be identified with a
   steering tag of no more than 32 bits and memory addresses of up
   to 64 bits in length.

o  RDMA Send
   The RDMA provider supports an RDMA Send operation with
   completion signaled at the receiver when data is placed in a
   pre-posted buffer.  The amount of transferred data is limited
   only by the size of the receiver's buffer.  Sends complete at
   the receiver in the order they were issued at the sender.

o  RDMA Write
   The RDMA provider supports an RDMA Write operation to directly
   place data in the receiver's buffer.  An RDMA Write is initiated
   by the sender, and completion is signaled at the sender.  No
   completion is signaled at the receiver.  The sender uses a
   steering tag, memory address, and length of the remote
   destination buffer.  RDMA Writes are not necessarily ordered
   with respect to one another, but are ordered with respect to
   RDMA Sends; a subsequent RDMA Send completion obtained at the
   receiver guarantees that prior RDMA Write data has been
   successfully placed in the receiver's memory.

o  RDMA Read
   The RDMA provider supports an RDMA Read operation to directly
   place peer source data in the requester's buffer.  An RDMA Read
   is initiated by the receiver, and completion is signaled at the
   receiver.  The receiver provides steering tags, memory
   addresses, and a length for the remote source and local
   destination buffers.  Since the peer at the data source receives
   no notification of RDMA Read completion, there is an assumption
   that on receiving the data, the receiver will signal completion
   with an RDMA Send message, so that the peer can free the source
   buffers and the associated steering tags.

This protocol is designed to be carried over all RDMA transports
meeting the stated requirements.  This protocol conveys to the RPC
peer information sufficient for that RPC peer to direct an RDMA layer
to perform transfers containing RPC data and to communicate their
result(s).  For example, it is readily carried over RDMA transports
such as the Internet Wide Area RDMA Protocol (iWARP) [RFC5040,
RFC5041] or InfiniBand [IB].
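The following non-normative C sketch illustrates this abstract model.
All type and function names here are purely illustrative; they are
not part of this protocol or of any particular RDMA provider's API,
which will expose equivalent operations through its own interfaces.

   #include <stdint.h>
   #include <stddef.h>

   /* A segment of registered memory: a steering tag of no more than
    * 32 bits, an address of up to 64 bits, and a length. */
   struct rdma_segment {
       uint32_t stag;      /* steering tag from memory registration */
       uint64_t address;   /* beginning memory address */
       uint32_t length;    /* length of the segment in bytes */
   };

   /* Send: data arrives in a pre-posted buffer; completion is
    * signaled at the receiver, in the order Sends were issued. */
   int rdma_send(int conn, const void *buf, size_t len);

   /* Write: places data directly in the remote buffer named by the
    * segment; completion is signaled only at the sender. */
   int rdma_write(int conn, const struct rdma_segment *dst,
                  const void *src, size_t len);

   /* Read: fetches remote source data into a local buffer; completion
    * is signaled only at the initiating receiver. */
   int rdma_read(int conn, const struct rdma_segment *src,
                 void *dst, size_t len);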
3. Protocol Outline

An RPC message can be conveyed in identical fashion, whether it is a
call or reply message.  In each case, the transmission of the message
proper is preceded by transmission of a transport-specific header for
use by RPC-over-RDMA transports.  This header is analogous to the
record marking used for RPC over TCP, but is more extensive, since
RDMA transports support several modes of data transfer; it is
important to allow the upper-layer protocol to specify the most
efficient mode for each of the segments in a message.  Multiple
segments of a message may thereby be transferred in different ways to
different remote memory destinations.

All transfers of a call or reply begin with an RDMA Send that
transfers at least the RPC-over-RDMA header, usually with the call or
reply message appended, or at least some part thereof.  Because the
size of what may be transmitted via RDMA Send is limited by the size
of the receiver's pre-posted buffer, the RPC-over-RDMA transport
provides a number of methods to reduce the amount transferred by
means of the RDMA Send, when necessary, by transferring various parts
of the message using RDMA Read and RDMA Write.

RPC-over-RDMA framing replaces all other RPC framing (such as TCP
record marking) when used atop an RPC/RDMA association, even though
the underlying RDMA protocol may itself be layered atop a protocol
with a defined RPC framing (such as TCP).  It is, however, possible
for RPC/RDMA to be dynamically enabled, in the course of negotiating
the use of RDMA via an upper-layer exchange.  Because RPC framing
delimits an entire RPC request or reply, the resulting shift in
framing must occur between distinct RPC messages, and in concert with
the transport.
3.1. Short Messages

Many RPC messages are quite short.  For example, the NFS version 3
GETATTR request is only 56 bytes: 20 bytes of RPC header, plus a
32-byte file handle argument and 4 bytes of length.  The reply to
this common request is about 100 bytes.

There is no benefit in transferring such small messages with an RDMA
Read or Write operation.  The overhead in transferring steering tags
and memory addresses is justified only by large transfers.  The
critical message size that justifies RDMA transfer will vary
depending on the RDMA implementation and network, but is typically of
the order of a few kilobytes.  It is appropriate to transfer a short
message with an RDMA Send to a pre-posted buffer.  The RPC-over-RDMA
header with the short message (call or reply) immediately following
is transferred using a single RDMA Send operation.

Short RPC messages over an RDMA transport:

   RPC Client                                RPC Server
       |                 RPC Call                 |
  Send |  ------------------------------------>   |
       |                                          |
       |                 RPC Reply                |
       |  <------------------------------------   |  Send
3.2. Data Chunks

Some protocols, like NFS, have RPC procedures that can transfer very
large chunks of data in the RPC call or reply and would cause the
maximum send size to be exceeded if one tried to transfer them as
part of the RDMA Send.  These large chunks typically range from a
kilobyte to a megabyte or more.  An RDMA transport can transfer large
chunks of data more efficiently via the direct placement of an RDMA
Read or RDMA Write operation.  Using direct placement instead of
inline transfer not only avoids expensive data copies, but provides
correct data alignment at the destination.
3.3. Flow Control

It is critical to provide RDMA Send flow control for an RDMA
connection.  RDMA receive operations will fail if a pre-posted
receive buffer is not available to accept an incoming RDMA Send, and
repeated occurrences of such errors can be fatal to the connection.
This is a departure from conventional TCP/IP networking, where
buffers are allocated dynamically on an as-needed basis, and where
pre-posting is not required.

It is not practical to provide for fixed credit limits at the RPC
server.  Fixed limits scale poorly, since posted buffers are
dedicated to the associated connection until consumed by receive
operations.  Additionally, for protocol correctness, the RPC server
must always be able to reply to client requests, whether or not new
buffers have been posted to accept future receives.  (Note that the
RPC server may in fact be a client at some other layer.  For example,
NFSv4 callbacks are processed by the NFSv4 client, acting as an RPC
server.  The credit discussions apply equally in either case.)

Flow control for RDMA Send operations is implemented as a simple
request/grant protocol in the RPC-over-RDMA header associated with
each RPC message.  The RPC-over-RDMA header for RPC call messages
contains a requested credit value for the RPC server, which MAY be
dynamically adjusted by the caller to match its expected needs.  The
RPC-over-RDMA header for the RPC reply messages provides the granted
result, which MAY have any value except it MUST NOT be zero when no
in-progress operations are present at the server, since such a value
would result in deadlock.  The value MAY be adjusted up or down at
each opportunity to match the server's needs or policies.

The RPC client MUST NOT send unacknowledged requests in excess of
this granted RPC server credit limit.  If the limit is exceeded, the
RDMA layer may signal an error, possibly terminating the connection.
Even if an error does not occur, it is OPTIONAL that the server
handle the excess request(s), and it MAY return an RPC error to the
client.  Also note that the never-zero requirement implies that an
RPC server MUST always provide at least one credit to each connected
RPC client from which no requests are outstanding.  The client would
deadlock otherwise, unable to send another request.

While RPC calls complete in any order, the current flow control limit
at the RPC server is known to the RPC client from the Send ordering
properties.  It is always the most recent server-granted credit value
minus the number of requests in flight.
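As an illustration of this accounting, the following non-normative C
sketch tracks the client-side credit limit; the structure and
function names are hypothetical, not part of this protocol.

   #include <stdint.h>
   #include <stdbool.h>

   struct rpcrdma_credits {
       uint32_t granted;     /* most recent server-granted credits */
       uint32_t in_flight;   /* unacknowledged requests outstanding */
   };

   /* The client MUST NOT exceed the granted credit limit. */
   static bool can_send(const struct rpcrdma_credits *c)
   {
       return c->in_flight < c->granted;
   }

   /* Each reply carries a newly granted value in its RPC-over-RDMA
    * header; it MUST NOT be zero when no requests are in progress,
    * so the client can always eventually send again. */
   static void credits_on_reply(struct rpcrdma_credits *c,
                                uint32_t newly_granted)
   {
       c->granted = newly_granted;
       c->in_flight--;
   }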
Certain RDMA implementations may impose additional flow control
restrictions, such as limits on RDMA Read operations in progress at
the responder.  Because these operations are outside the scope of
this protocol, they are not addressed and SHOULD be provided for by
other layers.  For example, a simple upper-layer RPC consumer might
perform single-issue RDMA Read requests, while a more sophisticated,
multithreaded RPC consumer might implement its own First In, First
Out (FIFO) queue of such operations.  For further discussion of
possible protocol implementations capable of negotiating these
values, see Section 6 "Connection Configuration Protocol" of this
document, or [RFC5661].
3.4. XDR Encoding with Chunks

The data comprising an RPC call or reply message is marshaled or
serialized into a contiguous stream by an XDR routine.  XDR data
types such as integers, strings, arrays, and linked lists are
commonly implemented over two very simple functions that encode
either an XDR data unit (32 bits) or an array of bytes.

Normally, the separate data items in an RPC call or reply are encoded
as a contiguous sequence of bytes for network transmission over UDP
or TCP.  However, in the case of an RDMA transport, local routines
such as XDR encode can determine that (for instance) an opaque byte
array is large enough to be more efficiently moved via an RDMA data
transfer operation like RDMA Read or RDMA Write.

Semantically speaking, the protocol has no restriction regarding data
types that may or may not be represented by a read or write chunk.
In practice, however, efficiency considerations lead to the
conclusion that certain data types are not generally "chunkable".
Typically, only those opaque and aggregate data types that may attain
substantial size are considered to be eligible.  With today's
hardware, this size may be a kilobyte or more.  However, any object
MAY be chosen for chunking in any given message.
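A non-normative sketch of such an eligibility test follows; the
threshold value is purely illustrative, as the appropriate cutoff
depends on the RDMA implementation and network and is not specified
by this protocol.

   #include <stdbool.h>
   #include <stddef.h>

   /* Hypothetical cutoff, not normative. */
   #define CHUNK_THRESHOLD 4096

   /* An upper layer might treat only opaque or aggregate items that
    * may attain substantial size as chunk-eligible. */
   static bool chunk_eligible(size_t element_len)
   {
       return element_len >= CHUNK_THRESHOLD;
   }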
The eligibility of XDR data items to be candidates for being moved as
data chunks (as opposed to being marshaled inline) is not specified
by the RPC-over-RDMA protocol.  Chunk eligibility criteria MUST be
determined by each upper layer in order to provide for an
interoperable specification.  One such example with rationale, for
the NFS protocol family, is provided in [RFC5667].

The interface by which an upper-layer implementation communicates the
eligibility of a data item locally to RPC for chunking is out of
scope for this specification.  In many implementations, it is
possible to implement a transparent RPC chunking facility.  However,
such implementations may lead to inefficiencies, either because they
require the RPC layer to perform expensive registration and
de-registration of memory "on the fly", or they may require using
RDMA chunks in reply messages, along with the resulting additional
handshaking with the RPC-over-RDMA peer.  However, these issues are
internal and generally confined to the local interface between RPC
and its upper layers, one in which implementations are free to
innovate.  The only requirement is that the resulting RPC RDMA
protocol sent to the peer is valid for the upper layer.  See, for
example, [RFC5667].

When sending any message (request or reply) that contains an eligible
large data chunk, the XDR encoding routine avoids moving the data
into the XDR stream.  Instead, it does not encode the data portion,
but records the address and size of each chunk in a separate "read
chunk list" encoded within RPC RDMA transport-specific headers.  Such
chunks will be transferred via RDMA Read operations initiated by the
receiver.

When the read chunks are to be moved via RDMA, the memory for each
chunk is registered.  This registration may take place within XDR
itself, providing for full transparency to upper layers, or it may be
performed by any other specific local implementation.

Additionally, when making an RPC call that can result in bulk data
transferred in the reply, write chunks MAY be provided to accept the
data directly via RDMA Write.  These write chunks will therefore be
pre-filled by the RPC server prior to responding, and XDR decode of
the data at the client will not be required.  These chunks undergo a
similar registration and advertisement via "write chunk lists" built
as a part of XDR encoding.

Some RPC client implementations are not able to determine where an
RPC call's results reside during the "encode" phase.  This makes it
difficult or impossible for the RPC client layer to encode the write
chunk list at the time of building the request.  In this case, it is
difficult for the RPC implementation to provide transparency to the
RPC consumer, which may require recoding to provide result
information at this earlier stage.

Therefore, if the RPC client does not make a write chunk list
available to receive the result, then the RPC server MAY return data
inline in the reply, or, if the upper-layer specification permits, it
MAY be returned via a read chunk list.  It is NOT RECOMMENDED that
upper-layer RPC client protocol specifications omit write chunk lists
for eligible replies, due to the lower performance of the additional
handshaking to perform data transfer, and the requirement that the
RPC server must expose (and preserve) the reply data for a period of
time.  In the absence of a server-provided read chunk list in the
reply, if the encoded reply overflows the posted receive buffer, the
RPC will fail with an RDMA transport error.

When any data within a message is provided via either read or write
chunks, the chunk itself refers only to the data portion of the XDR
stream element.  In particular, for counted fields (e.g., a "<>"
encoding) the byte count that is encoded as part of the field remains
in the XDR stream, and is also encoded in the chunk list.  The data
portion is, however, elided from the encoded XDR stream, and is
transferred as part of chunk list processing.  It is important to
maintain upper-layer implementation compatibility -- both the count
and the data must be transferred as part of the logical XDR stream.
While the chunk list processing results in the data being available
to the upper-layer peer for XDR decoding, the length present in the
chunk list entries is not.  Any byte count in the XDR stream MUST
match the sum of the byte counts present in the corresponding read or
write chunk list.  If they do not agree, an RPC protocol encoding
error results.

The following items are contained in a chunk list entry.

Handle
   Steering tag or handle obtained when the chunk memory is
   registered for RDMA.

Length
   The length of the chunk in bytes.

Offset
   The offset or beginning memory address of the chunk.  In order
   to support the widest array of RDMA implementations, as well as
   the most general steering tag scheme, this field is
   unconditionally included in each chunk list entry.

   While zero-based offset schemes are available in many RDMA
   implementations, their use by RPC requires individual
   registration of each read or write chunk.  On many such
   implementations, this can be a significant overhead.  By
   providing an offset in each chunk, many pre-registration or
   region-based registrations can be readily supported, and by
   using a single, universal chunk representation, the RPC RDMA
   protocol implementation is simplified to its most general form.

Position
   For data that is to be encoded, the position in the XDR stream
   where the chunk would normally reside.  Note that the chunk
   therefore inserts its data into the XDR stream at this position,
   but its transfer is no longer "inline".  Also note therefore
   that all chunks belonging to a single RPC argument or result
   will have the same position.  For data that is to be decoded, no
   position is used.
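The following non-normative C sketch shows one way an implementation
might represent such an entry locally, together with the byte-count
agreement check required above.  The field and function names are
illustrative only; the on-the-wire representation is the XDR of
Section 4.3.

   #include <stdint.h>
   #include <stdbool.h>
   #include <stddef.h>

   /* One chunk list entry, mirroring the fields above. */
   struct rpcrdma_chunk {
       uint32_t handle;     /* steering tag from registration */
       uint32_t length;     /* length of the chunk in bytes */
       uint64_t offset;     /* beginning memory address */
       uint32_t position;   /* XDR stream position; used only for
                               data that is to be encoded */
   };

   /* A byte count in the XDR stream MUST equal the sum of the byte
    * counts in the corresponding chunk list; disagreement is an RPC
    * protocol encoding error. */
   static bool chunk_counts_agree(uint32_t xdr_count,
                                  const struct rpcrdma_chunk *chunks,
                                  size_t nchunks)
   {
       uint64_t sum = 0;
       for (size_t i = 0; i < nchunks; i++)
           sum += chunks[i].length;
       return sum == xdr_count;
   }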
When XDR marshaling is complete, the chunk list is XDR encoded, then
sent to the receiver prepended to the RPC message.  Any source data
for a read chunk, or the destination of a write chunk, remain behind
in the sender's registered memory, and their actual payload is not
marshaled into the request or reply.

   +----------------+----------------+-------------
   |  RPC-over-RDMA |                |
   |   header w/    |   RPC Header   | Non-chunk args/results
   |     chunks     |                |
   +----------------+----------------+-------------
Read chunk lists and write chunk lists are structured somewhat
differently.  This is due to the different usage -- read chunks are
decoded and indexed by their argument's or result's position in the
XDR data stream; their size is always known.  Write chunks, on the
other hand, are used only for results, and have neither a preassigned
offset in the XDR stream nor a size until the results are produced,
since the buffers may be only partially filled, or may not be used
for results at all.  Their presence in the XDR stream is therefore
not known until the reply is processed.  The mapping of write chunks
onto designated NFS procedures and their results is described in
[RFC5667].

Therefore, read chunks are encoded into a read chunk list as a single
array, with each entry tagged by its (known) size and its argument's
or result's position in the XDR stream.  Write chunks are encoded as
a list of arrays of RDMA buffers, with each list element (an array)
providing buffers for a separate result.  Individual write chunk list
elements MAY thereby result in being partially or fully filled, or in
fact not being filled at all.  Unused write chunks, or unused bytes
in write chunk buffer lists, are not returned as results, and their
memory is returned to the upper layer as part of RPC completion.
However, the RPC layer MUST NOT assume that the buffers have not been
modified.
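A non-normative C sketch of the two shapes just described; all type
names are illustrative, and the normative wire format is the XDR of
Section 4.3.

   #include <stdint.h>
   #include <stddef.h>

   struct rdma_segment {            /* handle/length/offset triple */
       uint32_t handle;
       uint32_t length;
       uint64_t offset;
   };

   /* Read chunk list: a single flat array, each entry tagged by its
    * argument's or result's XDR stream position. */
   struct read_chunk {
       uint32_t position;
       struct rdma_segment target;
   };
   struct read_chunk_list {
       size_t nentries;
       struct read_chunk *entries;
   };

   /* Write chunk list: a list of arrays, one element (an array of
    * segments) per result; entries may end up partially filled or
    * not filled at all. */
   struct write_chunk {
       size_t nsegments;
       struct rdma_segment *segments;
   };
   struct write_chunk_list {
       size_t nentries;
       struct write_chunk *entries;
   };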
3.5. XDR Decoding with Read Chunks

The XDR decode process moves data from an XDR stream into a data
structure provided by the RPC client or server application.  Where
elements of the destination data structure are buffers or strings,
the RPC application can either pre-allocate storage to receive the
data or leave the string or buffer fields null and allow the XDR
decode stage of RPC processing to automatically allocate storage of
sufficient size.

When decoding a message from an RDMA transport, the receiver first
XDR decodes the chunk lists from the RPC-over-RDMA header, then
proceeds to decode the body of the RPC message (arguments or
results).  Whenever the XDR offset in the decode stream matches that
of a chunk in the read chunk list, the XDR routine initiates an RDMA
Read to bring over the chunk data into locally registered memory for
the destination buffer.
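A non-normative C sketch of this decode-time matching follows.  The
rdma_read() declaration stands in for whatever operation the local
RDMA provider actually offers; all names are hypothetical.

   #include <stdint.h>
   #include <stddef.h>

   struct read_chunk {           /* local form of a read chunk entry */
       uint32_t position;        /* XDR stream position */
       uint32_t handle;          /* steering tag */
       uint32_t length;          /* bytes to transfer */
       uint64_t offset;          /* remote address */
   };

   /* Hypothetical provider operation: fetch remote data locally. */
   int rdma_read(uint32_t handle, uint64_t offset, void *dst,
                 uint32_t len);

   /* When the XDR decode offset matches the next read chunk's
    * position, the chunk data is fetched via RDMA Read rather than
    * copied from the inline stream. */
   static int decode_element(uint32_t xdr_offset,
                             const struct read_chunk *next_chunk,
                             void *dst)
   {
       if (next_chunk != NULL && next_chunk->position == xdr_offset)
           return rdma_read(next_chunk->handle, next_chunk->offset,
                            dst, next_chunk->length);
       return 0;   /* element is inline; normal XDR decode applies */
   }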
When processing an RPC request, the RPC receiver (RPC server)
acknowledges its completion of use of the source buffers by simply
replying to the RPC sender (client), and the peer may then free all
source buffers advertised by the request.

When processing an RPC reply, after completing such a transfer, the
RPC receiver (client) MUST issue an RDMA_DONE message (described in
Section 3.8) to notify the peer (server) that the source buffers can
be freed.

The read chunk list is constructed and used entirely within the
RPC/XDR layer.  Other than specifying the minimum chunk size, the
management of the read chunk list is automatic and transparent to an
RPC application.
3.6. XDR Decoding with Write Chunks

When a write chunk list is provided for the results of the RPC call,
the RPC server MUST provide any corresponding data via RDMA Write to
the memory referenced in the chunk list entries.  The RPC reply
conveys this by returning the write chunk list to the client with the
lengths rewritten to match the actual transfer.  The XDR decode of
the reply therefore performs no local data transfer but merely
returns the length obtained from the reply.

Each decoded result consumes one entry in the write chunk list, which
in turn consists of an array of RDMA segments.  The length is
therefore the sum of all returned lengths in all segments comprising
the corresponding list entry.  As each list entry is decoded, the
entire entry is consumed.
The write chunk list is constructed and used by the RPC application.
The RPC/XDR layer simply conveys the list between client and server
and initiates the RDMA Writes back to the client.  The mapping of
write chunk list entries to procedure arguments MUST be determined
for each protocol.  An example of a mapping is described in
[RFC5667].
3.7. XDR Roundup and Chunks

The XDR protocol requires 4-byte alignment of each new encoded
element in any XDR stream. This requirement is for efficiency and
ease of decode/unmarshaling at the receiver -- if the XDR stream
buffer begins on a native machine boundary, then the XDR elements
will lie on similarly predictable offsets in memory.

Within XDR, when non-4-byte encodes (such as an odd-length string or
bulk data) are marshaled, their length is encoded literally, while
their data is padded to begin the next element at a 4-byte boundary
in the XDR stream. For TCP or RDMA inline encoding, this minimal
overhead is required because the transport-specific framing relies on
the fact that the relative offset of the elements in the XDR stream
from the start of the message determines the XDR position during
decode.
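
The roundup rule itself is simple arithmetic. The following C sketch
(illustrative only) computes the number of pad bytes that follow an
element of a given length so that the next element begins on a 4-byte
boundary.

   #include <stdint.h>

   #define XDR_UNIT 4

   /* Pad bytes needed after an element of length "len" so that the
    * next XDR element starts on a 4-byte boundary. */
   static uint32_t
   xdr_pad_bytes(uint32_t len)
   {
           return (XDR_UNIT - (len % XDR_UNIT)) % XDR_UNIT;
   }

   /* For example, a 5-byte opaque is followed by xdr_pad_bytes(5) == 3
    * pad bytes, so it occupies 4 (length) + 5 (data) + 3 (pad) bytes
    * in the XDR stream. */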
On the other hand, RPC/RDMA Read chunks carry the XDR position of
each chunked element and length of the Chunk segment, and can be
placed by the receiver exactly where they belong in the receiver's
memory without regard to the alignment of their position in the XDR
stream. Since any rounded-up data is not actually part of the upper
layer's message, the receiver will not reference it, and there is no
reason to set it to any particular value in the receiver's memory.

When roundup is present at the end of a sequence of chunks, the
length of the sequence will terminate it at a non-4-byte XDR
position. When the receiver proceeds to decode the remaining part of
the XDR stream, it inspects the XDR position indicated by the next
chunk. Because this position will not match (else roundup would not
have occurred), the receiver decoding will fall back to inspecting
the remaining inline portion. If, in turn, no data remains to be
decoded from the inline portion, then the receiver MUST conclude that
roundup is present, and therefore it advances the XDR decode position
to that indicated by the next chunk (if any). In this way, roundup
is passed without ever actually transferring additional XDR bytes.
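
A receiver might implement this check along the following lines.
This is a non-normative sketch: xdr_getpos() and xdr_setpos() are the
usual XDR position primitives, while the inline_remaining parameter
and the chunk-position struct are hypothetical stand-ins for state a
real decoder would track.

   #include <rpc/rpc.h>
   #include <stddef.h>
   #include <stdint.h>

   struct read_chunk_pos {
           uint32_t position;      /* XDR position of the next chunk */
   };

   /* If the inline stream is exhausted and the next chunk's XDR
    * position lies beyond the current decode position, the gap can
    * only be roundup: advance without transferring any pad bytes. */
   static void
   maybe_skip_roundup(XDR *xdrs, size_t inline_remaining,
                      const struct read_chunk_pos *next_chunk)
   {
           if (inline_remaining == 0 && next_chunk != NULL &&
               next_chunk->position > xdr_getpos(xdrs))
                   (void)xdr_setpos(xdrs, next_chunk->position);
   }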
Some protocol operations over RPC/RDMA, for instance, NFS writes of
data encountered at the end of a file or in direct I/O situations,
commonly yield these roundups within RDMA Read Chunks. Because any
roundup bytes are not actually present in the data buffers being
written, memory for these bytes would come from noncontiguous
buffers, either as an additional memory registration segment or as an
additional Chunk. The overhead of these operations can be
significant to the sender in marshaling them, and even higher to the
receiver in transferring them. Senders SHOULD therefore avoid
encoding individual RDMA Read Chunks for roundup whenever possible.
It is acceptable, but not necessary, to include roundup data in an
existing RDMA Read Chunk, but only if it is already present in the
XDR stream to carry upper-layer data.

Note that there is no exposure of additional data at the sender due
to eliding roundup data from the XDR stream, since any additional
sender buffers are never exposed to the peer. The data is literally
not there to be transferred.

For RDMA Write Chunks, a simpler encoding method applies. Again,
roundup bytes are not transferred; instead, the chunk length sent to
the receiver in the reply is simply increased to include any roundup.
Because of the requirement that the RDMA Write Chunks are filled
sequentially without gaps, this situation can only occur on the final
chunk receiving data. Therefore, there is no opportunity for roundup
data to insert misalignment or positional gaps into the XDR stream.
3.8. RPC Call and Reply

The RDMA transport for RPC provides three methods of moving data
between RPC client and server:

Inline
   Data is moved between RPC client and server within an RDMA Send.

RDMA Read
   Data is moved between RPC client and server via an RDMA Read
   operation, using a steering tag, address, and offset obtained from
   a read chunk list.

RDMA Write
   Result data is moved from RPC server to client via an RDMA Write
   operation, using a steering tag, address, and offset obtained from
   a write chunk list or reply chunk in the client's RPC call
   message.
These methods of data movement may occur in combinations within a
single RPC. For instance, an RPC call may contain some inline data
along with some large chunks to be transferred via RDMA Read to the
server. The reply to that call may have some result chunks that the
server RDMA Writes back to the client. The following protocol
interactions illustrate RPC calls that use these methods to move RPC
message data:
An RPC with write chunks in the call message:

    RPC Client                             RPC Server
        |      RPC Call + Write Chunk list     |
   Send | ------------------------------>      |
        |                                      |
        |               Chunk 1                |
        | <------------------------------      | Write
        |                  :                   |
        |               Chunk n                |
        | <------------------------------      | Write
        |                                      |
        |              RPC Reply               |
        | <------------------------------      | Send
In the presence of write chunks, RDMA ordering provides the guarantee
that all data in the RDMA Write operations has been placed in memory
prior to the client's RPC reply processing.
An RPC with read chunks in the call message:

    RPC Client                             RPC Server
        |      RPC Call + Read Chunk list      |
   Send | ------------------------------>      |
        |                                      |
        |               Chunk 1                |
        | +------------------------------      | Read
        | v----------------------------->      |
        |                  :                   |
        |               Chunk n                |
        | +------------------------------      | Read
        | v----------------------------->      |
        |                                      |
        |              RPC Reply               |
        | <------------------------------      | Send
An RPC with read chunks in the reply message:

    RPC Client                             RPC Server
        |               RPC Call               |
   Send | ------------------------------>      |
        |                                      |
        |      RPC Reply + Read Chunk list     |
        | <------------------------------      | Send
        |                                      |
        |               Chunk 1                |
   Read | ------------------------------+      |
        | <-----------------------------v      |
        |                  :                   |
        |               Chunk n                |
   Read | ------------------------------+      |
        | <-----------------------------v      |
        |                                      |
        |                Done                  |
   Send | ------------------------------>      |
The final Done message allows the RPC client to signal the server
that it has received the chunks, so the server can de-register and
free the memory holding the chunks. A Done completion is not
necessary for an RPC call, since the RPC reply Send is itself a
receive completion notification. In the event that the client fails
to return the Done message within some timeout period, the server MAY
conclude that a protocol violation has occurred and close the RPC
connection, or it MAY proceed to de-register and free its chunk
buffers. This may result in a fatal RDMA error if the client later
attempts to perform an RDMA Read operation, which amounts to the same
thing.
The use of read chunks in RPC reply messages is much less efficient
than providing write chunks in the originating RPC calls, due to the
additional message exchanges, the need for the RPC server to
advertise buffers to the peer, the necessity of the server
maintaining a timer for the purpose of recovery from misbehaving
clients, and the need for additional memory registration. Their use
is NOT RECOMMENDED by upper layers where efficiency is a primary
concern [RFC5667]. However, they MAY be employed by upper-layer
protocol bindings that are primarily concerned with transparency,
since they can frequently be implemented completely within the RPC
lower layers.
It is important to note that the Done message consumes a credit at
the RPC server. The RPC server SHOULD provide sufficient credits to
the client to allow the Done message to be sent without deadlock
(driving the outstanding credit count to zero). The RPC client MUST
account for its required Done messages to the server in its
accounting of available credits, and the server SHOULD replenish any
credit consumed by its use of such exchanges at its earliest
opportunity.
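
The credit arithmetic implied by this rule can be sketched as
follows; the function and parameter names are illustrative, not part
of the protocol. A call whose reply will arrive as read chunks must
reserve a second credit for its eventual RDMA_DONE.

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical client-side check: may another call be sent
    * without risking deadlock on a later Done message? */
   static bool
   can_send_call(uint32_t granted, uint32_t outstanding,
                 bool reply_uses_read_chunks)
   {
           /* One credit for the call itself, plus one for the
            * RDMA_DONE if the reply will use read chunks. */
           uint32_t needed = reply_uses_read_chunks ? 2 : 1;

           return outstanding + needed <= granted;
   }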
Finally, it is possible to conceive of RPC exchanges that involve any
or all combinations of write chunks in the RPC call, read chunks in
the RPC call, and read chunks in the RPC reply. Support for such
exchanges is straightforward from a protocol perspective, but in
practice such exchanges would be quite rare, limited to upper-layer
protocol exchanges that transferred bulk data in both the call and
corresponding reply.
3.9. Padding

Alignment of specific opaque data enables certain scatter/gather
optimizations. Padding leverages the useful property that RDMA
transfers preserve alignment of data, even when they are placed into
pre-posted receive buffers by Sends.

Many servers can make good use of such padding. Padding allows the
chaining of RDMA receive buffers such that any data transferred by
RDMA on behalf of RPC requests will be placed into appropriately
aligned buffers on the system that receives the transfer. In this
way, the need for servers to perform RDMA Read to satisfy all but the
largest client writes is obviated.

The effect of padding is demonstrated below, showing prior bytes on
an XDR stream ("XXX" in the figure below) followed by an opaque field
consisting of four length bytes ("LLLL") followed by data bytes
("DDD"). The receiver of the RDMA Send has posted two chained
receive buffers. Without padding, the opaque data is split across
the two buffers. With the addition of padding bytes ("ppp") prior to
the first data byte, the data can be forced to align correctly in the
second buffer.
                                Buffer 1              Buffer 2
   Unpadded                 --------------        --------------

   XXXXXXXLLLLDDDDDDDDDDDDDD   --->   XXXXXXXLLLLDDD  DDDDDDDDDDD

   Padded

   XXXXXXXLLLLpppDDDDDDDDDDDDDD   --->   XXXXXXXLLLLppp  DDDDDDDDDDDDDD
Padding is implemented completely within the RDMA transport encoding,
flagged with a specific message type. Where padding is applied, two
values are passed to the peer: an "rdma_align", which is the padding
value used, and "rdma_thresh", which is the opaque data size at or
above which padding is applied. For instance, if the server is using
chained 4 KB receive buffers, then up to (4 KB - 1) padding bytes
could be used to achieve alignment of the data. The XDR routine at
the peer MUST consult these values when decoding opaque values.
Where the decoded length exceeds the rdma_thresh, the XDR decode MUST
skip over the appropriate padding as indicated by rdma_align and the
current XDR stream position.
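
A decoder honoring this rule might skip padding as sketched below;
the function name is illustrative, and rdma_align and rdma_thresh are
the values received in the RDMA_MSGP header.

   #include <rpc/rpc.h>
   #include <stdint.h>

   /* Skip the pad the sender inserted so that the opaque data landed
    * aligned at the receiver.  Error handling is omitted. */
   static void
   skip_opaque_padding(XDR *xdrs, uint32_t opaque_len,
                       uint32_t rdma_align, uint32_t rdma_thresh)
   {
           if (opaque_len >= rdma_thresh && rdma_align != 0) {
                   u_int pos = xdr_getpos(xdrs);
                   u_int pad = (rdma_align - (pos % rdma_align))
                               % rdma_align;

                   (void)xdr_setpos(xdrs, pos + pad);
           }
   }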
4. RPC RDMA Message Layout

RPC call and reply messages are conveyed across an RDMA transport
with a prepended RPC-over-RDMA header. The RPC-over-RDMA header
includes data for RDMA flow control credits, padding parameters, and
lists of addresses that provide direct data placement via RDMA Read
and Write operations. The layout of the RPC message itself is
unchanged from that described in [RFC5531] except for the possible
exclusion of large data chunks that will be moved by RDMA Read or
Write operations. If the RPC message (along with the RPC-over-RDMA
header) is too long for the posted receive buffer (even after any
large chunks are removed), then the entire RPC message MAY be moved
separately as a chunk, leaving just the RPC-over-RDMA header in the
RDMA Send.
4.1. RPC-over-RDMA Header

The RPC-over-RDMA header begins with four 32-bit fields that are
always present and that control the RDMA interaction including
RDMA-specific flow control. These are then followed by a number of
items such as chunk lists and padding that MAY or MUST NOT be present
depending on the type of transmission. The four fields that are
always present are:

1. Transaction ID (XID).

   The XID generated for the RPC call and reply. Having the XID at
   the beginning of the message makes it easy to establish the
   message context. This XID MUST be the same as the XID in the RPC
   header. The receiver MAY perform its processing based solely on
   the XID in the RPC-over-RDMA header, and thereby ignore the XID in
   the RPC header, if it so chooses.

2. Version number.

   This version of the RPC RDMA message protocol is 1. The version
   number MUST be increased by 1 whenever the format of the RPC RDMA
   messages is changed.

3. Flow control credit value.

   When sent in an RPC call message, the requested value is provided.
   When sent in an RPC reply message, the granted value is returned.
   RPC calls SHOULD NOT be sent in excess of the currently granted
   limit.

4. Message type.

   o  RDMA_MSG = 0 indicates that chunk lists and RPC message follow.

   o  RDMA_NOMSG = 1 indicates that after the chunk lists there is no
      RPC message. In this case, the chunk lists provide information
      to allow the message proper to be transferred using RDMA Read
      or Write and thus is not appended to the RPC-over-RDMA header.

   o  RDMA_MSGP = 2 indicates that a chunk list and RPC message with
      some padding follow.

   o  RDMA_DONE = 3 indicates that the message signals the completion
      of a chunk transfer via RDMA Read.

   o  RDMA_ERROR = 4 is used to signal any detected error(s) in the
      RPC RDMA chunk encoding.

Because the version number is encoded as part of this header, and the
RDMA_ERROR message type is used to indicate errors, these first four
fields and the start of the following message body MUST always remain
aligned at these fixed offsets for all versions of the RPC-over-RDMA
header.

For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write
chunk lists follow. If the Read chunk list is null (a 32-bit word of
zeros), then there are no chunks to be transferred separately and the
RPC message follows in its entirety. If non-null, then it's the
beginning of an XDR encoded sequence of Read chunk list entries. If
the Write chunk list is non-null, then an XDR encoded sequence of
Write chunk entries follows.

If the message type is RDMA_MSGP, then two additional fields that
specify the padding alignment and threshold are inserted prior to the
Read and Write chunk lists.

A header of message type RDMA_MSG or RDMA_MSGP MUST be followed by
the RPC call or RPC reply message body, beginning with the XID. The
XID in the RDMA_MSG or RDMA_MSGP header MUST match this.
   +--------+---------+---------+-----------+-------------+----------
   |        |         |         |  Message  |    NULLs    | RPC Call
   |  XID   | Version | Credits |   Type    |     or      |    or
   |        |         |         |           | Chunk Lists | Reply Msg
   +--------+---------+---------+-----------+-------------+----------
Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or
RPC message follows. As an implementation hint: a gather operation
on the Send of the RDMA RPC message can be used to marshal the
initial header, the chunk list, and the RPC message itself.
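
That gather hint can be pictured as a three-element scatter/gather
list. The sketch below uses a plain iovec for illustration only; an
actual implementation would build the provider-specific gather list
(for example, for a verbs post-send), which varies by RDMA API.

   #include <stddef.h>
   #include <sys/uio.h>

   /* Build the gather list for a single RDMA Send: header, chunk
    * lists, then the RPC message body, so no intermediate copy is
    * needed.  The actual post-send call is provider specific and
    * omitted here. */
   static void
   build_send_gather(struct iovec gather[3],
                     void *hdr, size_t hdr_len,
                     void *chunks, size_t chunks_len,
                     void *msg, size_t msg_len)
   {
           gather[0].iov_base = hdr;    gather[0].iov_len = hdr_len;
           gather[1].iov_base = chunks; gather[1].iov_len = chunks_len;
           gather[2].iov_base = msg;    gather[2].iov_len = msg_len;
   }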
4.2. RPC-over-RDMA Header Errors

When a peer receives an RPC RDMA message, it MUST perform the
following basic validity checks on the header and chunk contents. If
such errors are detected in the request, an RDMA_ERROR reply MUST be
generated.

Two types of errors are defined, version mismatch and invalid chunk
format. When the peer detects an RPC-over-RDMA header version that
it does not support (currently this document defines only version 1),
it replies with an error code of ERR_VERS, and provides the low and
high inclusive version numbers it does, in fact, support. The
version number in this reply MUST be any value otherwise valid at the
receiver. When other decoding errors are detected in the header or
chunks, either an RPC decode error MAY be returned or the RPC/RDMA
error code ERR_CHUNK MUST be returned.
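
These checks reduce to a short validation routine, sketched here in
C. The constant and function names are illustrative; a result of 0
means the header passed both checks.

   #include <stdint.h>

   #define RPCRDMA_VERSION 1   /* the only version defined here */

   enum rpc_rdma_errcode { ERR_VERS = 1, ERR_CHUNK = 2 };

   /* Basic validity checks on a received header.  "chunks_ok" stands
    * in for the result of the receiver's chunk-list XDR decode. */
   static int
   check_rdma_header(uint32_t rdma_vers, int chunks_ok)
   {
           if (rdma_vers != RPCRDMA_VERSION)
                   return ERR_VERS;   /* reply carries the supported
                                         low/high version range */
           if (!chunks_ok)
                   return ERR_CHUNK;  /* or an RPC-level decode error */
           return 0;
   }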
4.3. XDR Language Description

Here is the message layout in XDR language.

   struct xdr_rdma_segment {
      uint32 handle;           /* Registered memory handle */
      uint32 length;           /* Length of the chunk in bytes */
      uint64 offset;           /* Chunk virtual address or offset */
   };

   struct xdr_read_chunk {
      uint32 position;         /* Position in XDR stream */
      struct xdr_rdma_segment target;
   };

   struct xdr_read_list {
      struct xdr_read_chunk entry;
      struct xdr_read_list *next;
   };

   struct xdr_write_chunk {
      struct xdr_rdma_segment target<>;
   };

   struct xdr_write_list {
      struct xdr_write_chunk entry;
      struct xdr_write_list *next;
   };

   struct rdma_msg {
      uint32 rdma_xid;         /* Mirrors the RPC header xid */
      uint32 rdma_vers;        /* Version of this protocol */
      uint32 rdma_credit;      /* Buffers requested/granted */
      rdma_body rdma_body;
   };

   enum rdma_proc {
      RDMA_MSG=0,    /* An RPC call or reply msg */
      RDMA_NOMSG=1,  /* An RPC call or reply msg - separate body */
      RDMA_MSGP=2,   /* An RPC call or reply msg with padding */
      RDMA_DONE=3,   /* Client signals reply completion */
      RDMA_ERROR=4   /* An RPC RDMA encoding error */
   };

   union rdma_body switch (rdma_proc proc) {
      case RDMA_MSG:
        rpc_rdma_header rdma_msg;
      case RDMA_NOMSG:
        rpc_rdma_header_nomsg rdma_nomsg;
      case RDMA_MSGP:
        rpc_rdma_header_padded rdma_msgp;
      case RDMA_DONE:
        void;
      case RDMA_ERROR:
        rpc_rdma_error rdma_error;
   };

   struct rpc_rdma_header {
      struct xdr_read_list *rdma_reads;
      struct xdr_write_list *rdma_writes;
      struct xdr_write_chunk *rdma_reply;
      /* rpc body follows */
   };

   struct rpc_rdma_header_nomsg {
      struct xdr_read_list *rdma_reads;
      struct xdr_write_list *rdma_writes;
      struct xdr_write_chunk *rdma_reply;
   };

   struct rpc_rdma_header_padded {
      uint32 rdma_align;       /* Padding alignment */
      uint32 rdma_thresh;      /* Padding threshold */
      struct xdr_read_list *rdma_reads;
      struct xdr_write_list *rdma_writes;
      struct xdr_write_chunk *rdma_reply;
      /* rpc body follows */
   };

   enum rpc_rdma_errcode {
      ERR_VERS = 1,
      ERR_CHUNK = 2
   };

   union rpc_rdma_error switch (rpc_rdma_errcode err) {
      case ERR_VERS:
        uint32 rdma_vers_low;
        uint32 rdma_vers_high;
      case ERR_CHUNK:
        void;
      default:
        uint32 rdma_extra[8];
   };
5. Long Messages

The receiver of RDMA Send messages is required by RDMA to have
previously posted one or more adequately sized buffers. The RPC
client can inform the server of the maximum size of its RDMA Send
messages via the Connection Configuration Protocol described later in
this document.

Since RPC messages are frequently small, memory savings can be
achieved by posting small buffers. Even large messages like NFS READ
or WRITE will be quite small once the chunks are removed from the
message. However, there may be large messages that would demand a
very large buffer be posted, where the contents of the buffer may not
be a chunkable XDR element. A good example is an NFS READDIR reply,
which may contain a large number of small filename strings. Also,
the NFS version 4 protocol [RFC3530] features COMPOUND request and
reply messages of unbounded length.

Ideally, each upper layer will negotiate these limits. However, it
is frequently necessary to provide a transparent solution.
5.1. Message as an RDMA Read Chunk

One relatively simple method is to have the client identify any RPC
message that exceeds the RPC server's posted buffer size and move it
separately as a chunk, i.e., reference it as the first entry in the
read chunk list with an XDR position of zero.
Normal Message

   +--------+---------+---------+------------+-------------+----------
   |        |         |         |            |             | RPC Call
   |  XID   | Version | Credits |  RDMA_MSG  | Chunk Lists |    or
   |        |         |         |            |             | Reply Msg
   +--------+---------+---------+------------+-------------+----------

Long Message

   +--------+---------+---------+------------+-------------+
   |        |         |         |            |             |
   |  XID   | Version | Credits | RDMA_NOMSG | Chunk Lists |
   |        |         |         |            |             |
   +--------+---------+---------+------------+-------------+
                                                   |
                                                   |  +----------
                                                   |  | Long RPC Call
                                                   +->|      or
                                                      | Reply Message
                                                      +----------
If the receiver gets an RPC-over-RDMA header with a message type of
RDMA_NOMSG and finds an initial read chunk list entry with a zero XDR
position, it allocates a registered buffer and issues an RDMA Read of
the long RPC message into it. The receiver then proceeds to XDR
decode the RPC message as if it had received it inline with the Send
data. Further decoding may issue additional RDMA Reads to bring over
additional chunks.
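
Detection of such a long message reduces to a simple test, sketched
below. The struct is a simplified, hypothetical rendering of the
read list bindings; only the fields the test needs are shown, and
RDMA_NOMSG carries the value 1 as defined in Section 4.1.

   #include <stdbool.h>
   #include <stdint.h>

   #define RDMA_NOMSG 1

   struct read_list_head {
           uint32_t position;   /* XDR position of the first chunk */
   };

   /* A message of type RDMA_NOMSG whose first read chunk has XDR
    * position zero carries the entire RPC message as a chunk to be
    * fetched via RDMA Read. */
   static bool
   is_long_message(uint32_t msg_type,
                   const struct read_list_head *reads)
   {
           return msg_type == RDMA_NOMSG &&
                  reads != NULL && reads->position == 0;
   }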
Although the handling of long messages requires one extra network
turnaround, in practice these messages will be rare if the posted
receive buffers are correctly sized, and of course they will be
non-existent for RDMA-aware upper layers.
A long call RPC with request supplied via RDMA Read

    RPC Client                             RPC Server
        |         RPC-over-RDMA Header         |
   Send | ------------------------------>      |
        |                                      |
        |          Long RPC Call Msg           |
        | +------------------------------      | Read
        | v----------------------------->      |
        |                                      |
        |         RPC-over-RDMA Reply          |
        | <------------------------------      | Send
An RPC with long reply returned via RDMA Read

    RPC Client                             RPC Server
        |               RPC Call               |
   Send | ------------------------------>      |
        |                                      |
        |         RPC-over-RDMA Header         |
        | <------------------------------      | Send
        |                                      |
        |          Long RPC Reply Msg          |
   Read | ------------------------------+      |
        | <-----------------------------v      |
        |                                      |
        |                Done                  |
   Send | ------------------------------>      |
It is possible for a single RPC procedure to employ both a long call
for its arguments and a long reply for its results. However, such an
operation is atypical, as few upper layers define such exchanges.
5.2. RDMA Write of Long Replies (Reply Chunks)

A superior method of handling long RPC replies is to have the RPC
client post a large buffer into which the server can write a large
RPC reply. This has the advantage that an RDMA Write may be slightly
faster in network latency than an RDMA Read, and does not require the
server to wait for the completion as it must for RDMA Read.
Additionally, for a reply it removes the need for an RDMA_DONE
message if the large reply is returned as a Read chunk.

This protocol supports direct return of a large reply via the
inclusion of an OPTIONAL rdma_reply write chunk after the read chunk
list and the write chunk list. The client allocates a buffer sized
to receive a large reply and enters its steering tag, address, and
length in the rdma_reply write chunk. If the reply message is too
long to return inline with an RDMA Send (exceeds the size of the
client's posted receive buffer), even with read chunks removed, then
the RPC server performs an RDMA Write of the RPC reply message into
the buffer indicated by the rdma_reply chunk. If the client doesn't
provide an rdma_reply chunk, or if it's too small, then if the
upper-layer specification permits, the message MAY be returned as a
Read chunk.
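
The server's choice among these reply paths can be sketched as
follows; the names are illustrative, and a reply_chunk_size of zero
stands for a missing rdma_reply chunk. The read chunk fallback is
permitted only where the upper-layer binding allows it.

   #include <stddef.h>

   typedef enum {
           REPLY_INLINE,        /* ordinary RDMA Send */
           REPLY_WRITE_CHUNK,   /* RDMA Write into rdma_reply */
           REPLY_READ_CHUNK     /* client pulls via RDMA Read */
   } reply_path;

   static reply_path
   choose_reply_path(size_t reply_len, size_t inline_max,
                     size_t reply_chunk_size)
   {
           if (reply_len <= inline_max)
                   return REPLY_INLINE;
           if (reply_chunk_size >= reply_len)
                   return REPLY_WRITE_CHUNK;
           /* Only if the upper-layer specification permits. */
           return REPLY_READ_CHUNK;
   }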
An RPC with long reply returned via RDMA Write

    RPC Client                             RPC Server
        |      RPC Call with rdma_reply        |
   Send | ------------------------------>      |
        |                                      |
        |          Long RPC Reply Msg          |
        | <------------------------------      | Write
        |                                      |
        |         RPC-over-RDMA Header         |
        | <------------------------------      | Send
The use of RDMA Write to return long replies requires that the client
applications anticipate a long reply and have some knowledge of its
size so that an adequately sized buffer can be allocated. This is
certainly true of NFS READDIR replies, where the client already
provides an upper bound on the size of the encoded directory fragment
to be returned by the server.

The use of these "reply chunks" is highly efficient and convenient
for both RPC client and server. Their use is encouraged for eligible
RPC operations such as NFS READDIR, which would otherwise require
extensive chunk management within the results or use of RDMA Read and
a Done message [RFC5667].
6. Connection Configuration Protocol

RDMA Send operations require the receiver to post one or more buffers
at the RDMA connection endpoint, each large enough to receive the
largest Send message. Buffers are consumed as Send messages are
received. If a buffer is too small, or if there are no buffers
posted, the RDMA transport MAY return an error and break the RDMA
connection. The receiver MUST post sufficient, adequately sized
buffers to avoid buffer overrun or capacity errors.
The protocol described above includes only a mechanism for managing
the number of such receive buffers and no explicit features to allow
the RPC client and server to provision or control buffer sizing, nor
any other session parameters.

In the past, this type of connection management has not been
necessary for RPC. RPC over UDP or TCP does not have a protocol to
negotiate the link. The server can get a rough idea of the maximum
size of messages from the server protocol code. However, a protocol
to negotiate transport features on a more dynamic basis is desirable.

The Connection Configuration Protocol allows the client to pass its
connection requirements to the server, and allows the server to
inform the client of its connection limits.

Use of the Connection Configuration Protocol by an upper layer is
OPTIONAL.
6.1. Initial Connection State

This protocol MAY be used for connection setup prior to the use of
another RPC protocol that uses the RDMA transport. It operates
in-band, i.e., it uses the connection itself to negotiate the
connection parameters. To provide a basis for connection
negotiation, the connection is assumed to provide a basic level of
interoperability: the ability to exchange at least one RPC message at
a time that is at least 1 KB in size. The server MAY exceed this
basic level of configuration, but the client MUST NOT assume more
than one, and MUST receive a valid reply from the server carrying the
actual number of available receive messages, prior to sending its
next request.
6.2. Protocol Description

Version 1 of the Connection Configuration Protocol consists of a
single procedure that allows the client to inform the server of its
connection requirements and the server to return connection
information to the client.

The maxcall_sendsize argument is the maximum size of an RPC call
message that the client MAY send inline in an RDMA Send message to
the server. The server MAY return a maxcall_sendsize value that is
smaller or larger than the client's request. The client MUST NOT
send an inline call message larger than what the server will accept.
The maxcall_sendsize limits only the size of inline RPC calls. It
does not limit the size of long RPC messages transferred as an
initial chunk in the Read chunk list.

The maxreply_sendsize is the maximum size of an inline RPC message
that the client will accept from the server.

The maxrdmaread is the maximum number of RDMA Reads that may be
active at the peer. This number correlates to the incoming RDMA
Read count ("IRD") configured into each originating endpoint by the
client or server. If more than this number of RDMA Read operations
by the connected peer are issued simultaneously, connection loss or
suboptimal flow control may result; therefore, the value SHOULD be
observed at all times. The peers' values need not be equal. If
zero, the peer MUST NOT issue requests that require RDMA Read to
satisfy, as no transfer will be possible.

The align value is the value recommended by the server for opaque
data values such as strings and counted byte arrays. The client MAY
use this value to compute the number of prepended pad bytes when XDR
encoding opaque values in the RPC call message.
   typedef unsigned int uint32;

   struct config_rdma_req {
      uint32  maxcall_sendsize;
                      /* max size of inline RPC call */
      uint32  maxreply_sendsize;
                      /* max size of inline RPC reply */
      uint32  maxrdmaread;
                      /* max active RDMA Reads at client */
   };

   struct config_rdma_reply {
      uint32  maxcall_sendsize;
                      /* max call size accepted by server */
      uint32  align;
                      /* server's receive buffer alignment */
      uint32  maxrdmaread;
                      /* max active RDMA Reads at server */
   };

   program CONFIG_RDMA_PROG {
      version VERS1 {
         /*
          * Config call/reply
          */
         config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
      } = 1;
   } = 100417;
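
As a usage illustration, a client might invoke the procedure as
below. This sketch assumes a CLIENT handle already bound to the
server and XDR routines generated by rpcgen from the definitions
above; the request values are arbitrary examples.

   #include <rpc/rpc.h>
   #include <sys/time.h>

   static void
   negotiate_connection(CLIENT *clnt)
   {
           struct config_rdma_req req = {
                   .maxcall_sendsize  = 1024,
                   .maxreply_sendsize = 4096,
                   .maxrdmaread       = 8,
           };
           struct config_rdma_reply rep;
           struct timeval tv = { 10, 0 };

           if (clnt_call(clnt, CONF_RDMA,
                         (xdrproc_t)xdr_config_rdma_req, (caddr_t)&req,
                         (xdrproc_t)xdr_config_rdma_reply, (caddr_t)&rep,
                         tv) != RPC_SUCCESS)
                   clnt_perror(clnt, "CONF_RDMA");
           /* On success, rep holds the server's limits. */
   }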
7. Memory Registration Overhead

RDMA requires that all data be transferred between registered memory
regions at the source and destination. All protocol headers as well
as separately transferred data chunks use registered memory. Since
the cost of registering and de-registering memory can be a large
proportion of the RDMA transaction cost, it is important to minimize
registration activity. This is easily achieved within RPC-controlled
memory by allocating chunk list data and RPC headers in a reusable
way from pre-registered pools.

The data chunks transferred via RDMA MAY occupy memory that persists
outside the bounds of the RPC transaction. Hence, the default
behavior of an RPC-over-RDMA transport is to register and de-register
these chunks on every transaction. However, this is not a limitation
of the protocol -- only of the existing local RPC API. The API is
easily extended through such functions as rpc_control(3) to change
the default behavior so that the application can assume
responsibility for controlling memory registration through an
RPC-provided registered memory allocator.
8. Errors and Error Recovery

RPC RDMA protocol errors are described in Section 4. RPC errors and
RPC error recovery are not affected by the protocol, and proceed as
for any RPC error condition. RDMA transport error reporting and
recovery are outside the scope of this protocol.

It is assumed that the link itself will provide some degree of error
detection and retransmission. iWARP's Marker PDU Aligned (MPA) layer
(when used over TCP), the Stream Control Transmission Protocol
(SCTP), and the InfiniBand link layer all provide Cyclic Redundancy
Check (CRC) protection of the RDMA payload, and CRC-class protection
is a general attribute of such transports. Additionally, the RPC
layer itself can accept errors from the link level and recover via
retransmission. RPC recovery can handle complete loss and
re-establishment of the link.

See Section 11 for further discussion of the use of RPC-level
integrity schemes to detect errors and related efficiency issues.
9. Node Addressing
In setting up a new RDMA connection, the first action by an RPC
client will be to obtain a transport address for the server. The
mechanism used to obtain this address, and to open an RDMA
connection, is dependent on the type of RDMA transport, and is the
responsibility of each RPC protocol binding and its local
implementation.
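As one illustration only: on fabrics managed by the OpenFabrics RDMA
connection manager (librdmacm), the address-resolution and connection
steps might be sketched as below. Nothing in this protocol mandates
this API; the function names are those of librdmacm, the queue sizes
are arbitrary, and error handling is minimal.

   /*
    * Hedged sketch, not part of this specification: obtaining a
    * transport address and opening an RDMA connection using the
    * librdmacm API (one possible local implementation choice).
    */
   #include <string.h>
   #include <rdma/rdma_cma.h>

   static struct rdma_cm_id *
   rpcrdma_connect(const char *host, const char *port)
   {
       struct rdma_addrinfo hints, *res;
       struct ibv_qp_init_attr attr;
       struct rdma_cm_id *id;

       memset(&hints, 0, sizeof(hints));
       hints.ai_port_space = RDMA_PS_TCP;   /* reliable, connected */
       if (rdma_getaddrinfo(host, port, &hints, &res) != 0)
           return NULL;                     /* address not resolved */

       memset(&attr, 0, sizeof(attr));
       attr.cap.max_send_wr = attr.cap.max_recv_wr = 16;
       attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
       attr.sq_sig_all = 1;

       /* Create an endpoint bound to the resolved address, then
        * perform the CM connection exchange with the server. */
       if (rdma_create_ep(&id, res, NULL, &attr) != 0) {
           rdma_freeaddrinfo(res);
           return NULL;
       }
       rdma_freeaddrinfo(res);
       if (rdma_connect(id, NULL) != 0) {
           rdma_destroy_ep(id);
           return NULL;
       }
       return id;                           /* connected endpoint */
   }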
10. RPC Binding
RPC services normally register with a portmap or rpcbind [RFC1833]
service, which associates an RPC program number with a service
address. (In the case of UDP or TCP, the service address for NFS is
normally port 2049.) This policy is no different with RDMA
interconnects, although it may require the allocation of port numbers
appropriate to each upper-layer binding that uses the RPC framing
defined here.
When mapped atop the iWARP [RFC5040, RFC5041] transport, which uses
IP port addressing due to its layering on TCP and/or SCTP, port
mapping is trivial and consists merely of issuing the port in the
connection process. The NFS/RDMA protocol service address has been
assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP.
When mapped atop InfiniBand [IB], which uses a Group Identifier
(GID)-based service endpoint naming scheme, a translation MUST be
employed. One such translation is defined in the InfiniBand Port
Addressing Annex [IBPORT], which is appropriate for translating IP
port addressing to the InfiniBand network. Therefore, in this case,
IP port addressing may be readily employed by the upper layer.
When a mapping standard or convention exists for IP ports on an RDMA
interconnect, there are several possibilities for each upper layer to
consider:
   One possibility is to have an upper-layer server register its
   mapped IP port with the rpcbind service, under the netid (or
   netid's) defined here. An RPC/RDMA-aware client can then resolve
   its desired service to a mappable port, and proceed to connect.
   This is the most flexible and compatible approach, for those upper
   layers that are defined to use the rpcbind service; a sketch of
   this approach follows the list.
   A second possibility is to have the server's portmapper register
   itself on the RDMA interconnect at a "well-known" service address.
   (On UDP or TCP, this corresponds to port 111.) A client could
   connect to this service address and use the portmap protocol to
   obtain a service address in response to a program number, e.g., an
   iWARP port number, or an InfiniBand GID.
   Alternatively, the client could simply connect to the mapped
   well-known port for the service itself, if it is appropriately
   defined. By convention, the NFS/RDMA service, when operating atop
   such an InfiniBand fabric, will use the same 20049 assignment as
   for iWARP.
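As a sketch of the first possibility above: with the TI-RPC
interfaces, a client might resolve an RPC/RDMA service through
rpcbind as follows. This assumes the local netconfig database lists
the "rdma" netid defined in Section 12; the NFS program and version
numbers are examples only.

   /*
    * Hedged sketch, not part of this specification: resolving an
    * RPC/RDMA service address via rpcbind, using TI-RPC and the
    * "rdma" netid.
    */
   #include <rpc/rpc.h>
   #include <netconfig.h>

   #define NFS_PROGRAM 100003
   #define NFS_VERSION 3

   static int
   resolve_rdma_service(const char *host, struct netbuf *svcaddr)
   {
       struct netconfig *nconf;
       bool_t ok;

       nconf = getnetconfigent("rdma");   /* netid from Section 12 */
       if (nconf == NULL)
           return -1;              /* no RDMA transport configured */

       /* Query the remote rpcbind; the service address is returned
        * in the caller-supplied netbuf, whose buffer must be
        * preallocated. */
       ok = rpcb_getaddr(NFS_PROGRAM, NFS_VERSION, nconf,
                         svcaddr, host);
       freenetconfigent(nconf);
       return ok ? 0 : -1;
   }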
Historically, different RPC protocols have taken different approaches
to their port assignment; therefore, the specific method is left to
each RPC/RDMA-enabled upper-layer binding, and not addressed here.
In Section 12, "IANA Considerations", this specification defines two
new "netid" values, to be used for registration of upper layers atop
iWARP [RFC5040, RFC5041] and (when a suitable port translation
service is available) InfiniBand [IB]. Additional RDMA-capable
networks MAY define their own netids, or if they provide a port
translation, MAY share the one defined here.
11. Security Considerations
RPC provides its own security via the RPCSEC_GSS framework [RFC2203].
RPCSEC_GSS can provide message authentication, integrity checking,
and privacy. This security mechanism will be unaffected by the RDMA
transport. The data integrity and privacy features alter the body of
the message, presenting it as a single chunk. For large messages,
the chunk may be large enough to qualify for RDMA Read transfer.
However, there is much data movement associated with computation and
verification of integrity, or encryption/decryption, so certain
performance advantages may be lost.
For efficiency, a more appropriate security mechanism for RDMA links
may be link-level protection, such as certain configurations of
IPsec, which may be co-located in the RDMA hardware. The use of
link-level protection MAY be negotiated through the use of the new
RPCSEC_GSS mechanism defined in [RFC5403] in conjunction with the
Channel Binding mechanism [RFC5056] and IPsec Channel Connection
Latching [RFC5660]. Use of such mechanisms is REQUIRED where
integrity and/or privacy is desired, and where efficiency is
required.
An additional consideration is the protection of the integrity and
privacy of local memory by the RDMA transport itself. The use of
RDMA by RPC MUST NOT introduce any vulnerabilities to system memory
contents, or to memory owned by user processes. These protections
are provided by the RDMA layer specifications, and specifically their
security models. It is REQUIRED that any RDMA provider used for RPC
transport be conformant to the requirements of [RFC5042] in order to
satisfy these protections.
Once delivered securely by the RDMA provider, any RDMA-exposed
addresses will contain only RPC payloads in the chunk lists,
transferred under the protection of RPCSEC_GSS integrity and privacy.
By these means, the data will be protected end-to-end, as required by
the RPC layer security model.
Where upper-layer protocols choose to supply results to the requester
via read chunks, a server resource deficit can arise if the client
does not promptly acknowledge their status via the RDMA_DONE message.
This can potentially lead to a denial-of-service situation, with a
single client unfairly (and unnecessarily) consuming server RDMA
resources. Servers for such upper-layer protocols MUST protect
against this situation, originating from one or many clients. For
example, a time-based window of buffer availability may be offered;
if the client fails to obtain the data within the window, it will
simply retry using ordinary RPC retry semantics. Or, a more severe
method would be for the server to simply close the client's RDMA
connection, freeing the RDMA resources and allowing the server to
reclaim them.
A fairer and more useful method is provided by the protocol itself.
The server MAY use the rdma_credit value to limit the number of
outstanding requests for each client. By including the number of
outstanding RDMA_DONE completions in the computation of available
client credits, the server can limit its exposure to each client, and
therefore provide uninterrupted service as its resources permit.
However, the server must ensure that it does not decrease the credit
count to zero with this method, since the RDMA_DONE message is not
acknowledged. If the credit count were to drop to zero solely due to
outstanding RDMA_DONE messages, the client would deadlock, since it
would never obtain a new credit with which to continue. Therefore,
if the server adjusts credits to zero for outstanding RDMA_DONE, it
MUST withhold its reply to at least one message in order to provide
the next credit. The time-based window (or any other appropriate
method) SHOULD be used by the server to recover resources in the
event that the client never returns.
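The computation described in the preceding two paragraphs might be
sketched as follows. The structure and field names are illustrative
assumptions; clamping the grant to one credit is shown as one way to
satisfy the requirement that the count never reach zero solely due to
outstanding RDMA_DONE messages.

   /*
    * Hedged sketch, not part of this specification: server-side
    * credit computation that counts unacknowledged RDMA_DONE
    * completions against the per-client limit, but never advertises
    * zero credits.
    */
   struct client_state {
       unsigned int max_credits;       /* server's per-client limit */
       unsigned int done_outstanding;  /* RDMA_DONE not yet received */
   };

   static unsigned int rdma_credit_grant(const struct client_state *cs)
   {
       unsigned int credits;

       if (cs->done_outstanding >= cs->max_credits)
           credits = 0;
       else
           credits = cs->max_credits - cs->done_outstanding;

       /* The grant must not drop to zero solely due to outstanding
        * RDMA_DONE; the server withholds replies instead (see text)
        * and always leaves the client one credit to continue with. */
       if (credits == 0)
           credits = 1;
       return credits;
   }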
The "Connection Configuration Protocol", when used, MUST be The Connection Configuration Protocol, when used, MUST be protected
protected by an appropriate RPC security flavor, to ensure it is by an appropriate RPC security flavor, to ensure it is not attacked
not attacked in the process of initiating an RPC/RDMA connection. in the process of initiating an RPC/RDMA connection.
12. IANA Considerations
Three new assignments are specified by this document:

   - A new set of RPC "netids" for resolving RPC/RDMA services

   - Optional service port assignments for upper-layer bindings

   - An RPC program number assignment for the configuration protocol

These assignments have been established, as below.

The new RPC transport has been assigned an RPC "netid", which is an
rpcbind [RFC1833] string used to describe the underlying protocol in
order for RPC to select the appropriate transport framing, as well as
the format of the service addresses and ports.

The following "Netid" registry strings are defined for this purpose:

   NC_RDMA "rdma"
   NC_RDMA6 "rdma6"

These netids MAY be used for any RDMA network satisfying the
requirements of Section 2, and able to identify service endpoints
using IP port addressing, possibly through use of a translation
service as described above in Section 10, "RPC Binding". The "rdma"
netid is to be used when IPv4 addressing is employed by the
underlying transport, and "rdma6" for IPv6 addressing.

The netid assignment policy and registry are defined in [RFC5665].
As a new RPC transport, this protocol has no effect on RPC program
numbers or existing registered port numbers. However, new port
numbers MAY be registered for use by RPC/RDMA-enabled services, as
appropriate to the new networks over which the services will operate.

For example, the NFS/RDMA service defined in [RFC5667] has been
assigned the port 20049, in the IANA registry:

   nfsrdma 20049/tcp  Network File System (NFS) over RDMA
   nfsrdma 20049/udp  Network File System (NFS) over RDMA
   nfsrdma 20049/sctp Network File System (NFS) over RDMA

The OPTIONAL Connection Configuration Protocol described herein
requires an RPC program number assignment. The value "100417" has
been assigned:

   rdmaconfig 100417 rpc.rdmaconfig

The RPC program number assignment policy and registry are defined in
[RFC5531].

13. Acknowledgments

The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David
Robinson, and Mallikarjun Chadalapaka for their contributions to this
document.

14. References

14.1. Normative References

   [RFC1833]  Srinivasan, R., "Binding Protocols for ONC RPC Version
              2", RFC 1833, August 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2203]  Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
              Specification", RFC 2203, September 1997.

   [RFC4506]  Eisler, M., Ed., "XDR: External Data Representation
              Standard", STD 67, RFC 4506, May 2006.

   [RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
              Protocol (DDP) / Remote Direct Memory Access Protocol
              (RDMAP) Security", RFC 5042, October 2007.

   [RFC5056]  Williams, N., "On the Use of Channel Bindings to Secure
              Channels", RFC 5056, November 2007.

   [RFC5403]  Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, February
              2009.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, May 2009.

   [RFC5660]  Williams, N., "IPsec Channels: Connection Latching",
              RFC 5660, October 2009.

   [RFC5665]  Eisler, M., "IANA Considerations for Remote Procedure
              Call (RPC) Network Identifiers and Universal Address
              Formats", RFC 5665, January 2010.

14.2. Informative References

   [RFC1094]  Sun Microsystems, "NFS: Network File System Protocol
              specification", RFC 1094, March 1989.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813, June 1995.

   [RFC3530]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
              Beame, C., Eisler, M., and D. Noveck, "Network File
              System (NFS) version 4 Protocol", RFC 3530, April 2003.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, October 2007.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley,
              "Direct Data Placement over Reliable Transports",
              RFC 5041, October 2007.

   [RFC5532]  Talpey, T. and C. Juszczak, "Network File System (NFS)
              Remote Direct Memory Access (RDMA) Problem Statement",
              RFC 5532, May 2009.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System Version 4 Minor Version 1
              Protocol", RFC 5661, January 2010.

   [RFC5667]  Talpey, T. and B. Callaghan, "Network File System (NFS)
              Direct Data Placement", RFC 5667, January 2010.

   [IB]       InfiniBand Trade Association, "InfiniBand Architecture
              Specifications", available from
              http://www.infinibandta.org.

   [IBPORT]   InfiniBand Trade Association, "IP Addressing Annex",
              available from http://www.infinibandta.org.

Authors' Addresses

   Tom Talpey
   170 Whitman St.
   Stow, MA 01775 USA

   EMail: tmtalpey@gmail.com

   Brent Callaghan
   Apple Computer, Inc.
   MS: 302-4K
   2 Infinite Loop
   Cupertino, CA 95014 USA

   EMail: brentc@apple.com