Internet-Draft                                           Brent Callaghan
Expires: April 2006                                           Tom Talpey
Document: draft-ietf-nfsv4-rpcrdma-02                      October, 2005

                      RDMA Transport for ONC RPC
Status of this Memo
By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time.  It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in
progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt  The list of
Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Copyright Notice
Copyright (C) The Internet Society (2005). All Rights Reserved.
Abstract
A protocol is described providing RDMA as a new transport for ONC
RPC.  The RDMA transport binding conveys the benefits of efficient,
bulk data transport over high speed networks, while providing for
minimal change to RPC applications and with no required revision of
the application RPC protocol, or the RPC protocol itself.
Table of Contents
1.     Introduction
2.     Abstract RDMA Requirements
3.     Protocol Outline
3.1.   Short Messages
3.2.   Data Chunks
3.3.   Flow Control
3.4.   XDR Encoding with Chunks
3.5.   Padding
3.6.   XDR Decoding with Read Chunks
3.7.   XDR Decoding with Write Chunks
3.8.   RPC Call and Reply
4.     RPC RDMA Message Layout
4.1.   RPC over RDMA Header
4.2.   RPC over RDMA header errors
4.3.   XDR Language Description
5.     Long Messages
5.1.   Message as an RDMA Read Chunk
5.2.   RDMA Write of Long Replies (Reply Chunks)
6.     Connection Configuration Protocol
6.1.   Initial Connection State
6.2.   Protocol Description
7.     Memory Registration Overhead
8.     Errors and Error Recovery
9.     Node Addressing
10.    RPC Binding
11.    Security
12.    IANA Considerations
13.    Acknowledgements
14.    Normative References
15.    Informative References
16.    Authors' Addresses
17.    Intellectual Property and Copyright Statements
       Acknowledgement
1. Introduction
RDMA is a technique for efficient movement of data between end
nodes, which becomes increasingly compelling over high speed
transports.  By directing data into destination buffers as it is
sent on a network, and placing it via direct memory access by
hardware, the double benefit of faster transfers and reduced host
overhead is obtained.
the NFS protocol, in all its versions, is an obvious beneficiary of
RDMA.  A complete problem statement is discussed in [NFSRDMAPS],
and related NFSv4 issues are discussed in [NFSSESS].  Many other
RPC-based protocols will also benefit.
Although the RDMA transport described here provides relatively
transparent support for any RPC application, the proposal goes
further in describing mechanisms that can optimize the use of RDMA
with more active participation by the RPC application.
2. Abstract RDMA Requirements
An RPC transport is responsible for conveying an RPC message from a
sender to a receiver.  An RPC message is either an RPC call from a
client to a server, or an RPC reply from the server back to the
client.  An RPC message contains an RPC call header followed by
arguments if the message is an RPC call, or an RPC reply header
followed by results if the message is an RPC reply.  The call
header contains a transaction ID (XID) followed by the program and
procedure number as well as a security credential.  An RPC reply
header begins with an XID that matches that of the RPC call
message, followed by a security verifier and results.  All data in
an RPC message is XDR encoded.  For a complete description of the
RPC protocol and XDR encoding, see [RFC1831] and [RFC1832].
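As a point of reference, the following C sketch renders the call
and reply header structure just described.  It is a simplified,
illustrative rendering only; the authoritative definition is the
XDR in [RFC1831], and the field names here are not drawn from any
particular implementation.

   #include <stdint.h>

   struct rpc_opaque_auth {
       uint32_t flavor;              /* e.g. AUTH_NONE, AUTH_SYS */
       uint32_t len;                 /* body length, at most 400 bytes */
       uint8_t  body[400];
   };

   struct rpc_call_header {
       uint32_t xid;                 /* transaction ID, echoed in reply */
       uint32_t mtype;               /* 0 = CALL */
       uint32_t rpcvers;             /* RPC protocol version, always 2 */
       uint32_t prog;                /* program number */
       uint32_t vers;                /* program version */
       uint32_t proc;                /* procedure number */
       struct rpc_opaque_auth cred;  /* security credential */
       struct rpc_opaque_auth verf;  /* verifier */
       /* XDR-encoded arguments follow */
   };

   struct rpc_reply_header {
       uint32_t xid;                 /* matches the call's XID */
       uint32_t mtype;               /* 1 = REPLY */
       /* reply status, security verifier and results follow */
   };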
This protocol assumes the following abstract model for RDMA
transports.  These terms, common in the RDMA lexicon, are used in
this document.  A more complete glossary of RDMA terms can be found
in [RDMA].
o Registered Memory
   All data moved via tagged RDMA operations must be resident in
   registered memory at its destination.  This protocol assumes
   that each segment of registered memory may be identified with
   a steering tag of no more than 32 bits and memory addresses of
   up to 64 bits in length.
o RDMA Send
   The RDMA provider supports an RDMA Send operation with
   completion signalled at the receiver when data is placed in a
   pre-posted buffer.  The amount of transferred data is limited
   only by the size of the receiver's buffer.  Sends complete at
   the receiver in the order they were issued at the sender.
o RDMA Write
   place peer source data in the requester's buffer.  An RDMA
   Read is initiated by the receiver and completion is signalled
   at the receiver.  The receiver provides steering tags, memory
   addresses and a length for the remote source and local
   destination buffers.  Since the peer at the data source
   receives no notification of RDMA Read completion, there is an
   assumption that on receiving the data the receiver will signal
   completion with an RDMA Send message, so that the peer can
   free the source buffers and the associated steering tags.
This protocol is designed to be carried over all RDMA transports
meeting the stated requirements.  This protocol conveys to the RPC
peer information sufficient for that RPC peer to direct an RDMA
layer to perform transfers containing RPC data, and to communicate
their result(s).  For example, it is readily carried over RDMA
transports such as iWARP [RDDP] or Infiniband [IB].
3. Protocol Outline
An RPC message can be conveyed in identical fashion, whether it is
a call or reply message.  In each case, the transmission of the
message proper is preceded by transmission of a transport-specific
header for use by RPC over RDMA transports.  This header is
analogous to the record marking used for RPC over TCP, but is more
extensive, since RDMA transports support several modes of data
transfer and it is important to allow the client and server to use
transfers.  The critical message size that justifies RDMA transfer
will vary depending on the RDMA implementation and network, but is
typically of the order of a few kilobytes.  It is appropriate to
transfer a short message with an RDMA Send to a pre-posted buffer.
The RPC over RDMA header with the short message (call or reply)
immediately following is transferred using a single RDMA Send
operation.
Short RPC messages over an RDMA transport will look like this:
   RPC Client                          RPC Server
       |              RPC Call                 |
  Send | ------------------------------>       |
       |                                       |
       |              RPC Reply                |
       | <------------------------------       | Send
3.2. Data Chunks
Some protocols, like NFS, have RPC procedures that can transfer
very large "chunks" of data in the RPC call or reply and would
cause the maximum send size to be exceeded if one tried to transfer
them as part of the RDMA Send.  These large chunks typically range
from a kilobyte to a megabyte or more.  An RDMA transport can
transfer large chunks of data more efficiently via the direct
placement of an RDMA Read or RDMA Write operation.  Using direct
placement instead of inline transfer not only avoids expensive data
copies, but provides correct data alignment at the destination.
3.3. Flow Control
It is critical to provide RDMA Send flow control for an RDMA
connection.  RDMA receive operations will fail if a pre-posted
receive buffer is not available to accept an incoming RDMA Send,
and repeated occurrences of such errors can be fatal to the
connection.  This is a departure from conventional TCP/IP
networking where buffers are allocated dynamically on an as-needed
basis, and pre-posting is not required.
It is not practical to provide for fixed credit limits at the RPC
server.  Fixed limits scale poorly, since posted buffers are
dedicated to the associated connection until consumed by receive
operations.  Additionally, for protocol correctness, the RPC server
must always be able to reply to client requests, whether or not new
buffers have been posted to accept future receives.  (Note that the
RPC server may in fact be a client at some other layer.  For
example, NFSv4 callbacks are processed by the NFSv4 client, acting
as an RPC server.  The credit discussions apply equally in either
case.)
Flow control for RDMA Send operations is implemented as a simple
request/grant protocol in the RPC over RDMA header associated with
each RPC message.  The RPC over RDMA header for RPC call messages
contains a requested credit value for the RPC server, which may be
dynamically adjusted by the caller to match its expected needs.
The RPC over RDMA header for the RPC reply messages provides the
granted result, which may have any value except it may not be zero
when no in-progress operations are present at the server, since
such a value would result in deadlock.  The value may be adjusted
up or down at each opportunity to match the server's needs or
policies.
The RPC client must not send unacknowledged requests in excess of
this granted RPC server credit limit. If the limit is exceeded,
the RDMA layer may signal an error, possibly terminating the
connection. Even if an error does not occur, there is no
requirement that the server must handle the excess request(s), and
it may return an RPC error to the client. Also note that the
never-zero requirement implies that an RPC server must always
provide at least one credit to each connected RPC client. It does
not, however, require that the server always be prepared to
receive a request from each client, for example when the server is
busy processing all granted client requests.
While RPC calls may complete in any order, the current flow control
limit at the RPC server is known to the RPC client from the Send
ordering properties.  It is always the most recent server-granted
credit value minus the number of requests in flight.
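The following minimal C sketch illustrates this client-side credit
accounting.  The structure and function names are illustrative
assumptions; the protocol mandates only the behavior.

   #include <stdbool.h>
   #include <stdint.h>

   struct rdma_credit_state {
       uint32_t granted;    /* from the last reply's RPC over RDMA header */
       uint32_t in_flight;  /* requests sent but not yet replied to */
   };

   /* A request may be sent only while requests in flight remain
    * below the most recent server-granted credit value. */
   static bool can_send_request(const struct rdma_credit_state *cs)
   {
       return cs->in_flight < cs->granted;
   }

   static void on_reply_received(struct rdma_credit_state *cs,
                                 uint32_t new_grant)
   {
       cs->granted = new_grant;     /* server may adjust up or down */
       cs->in_flight--;             /* the reply retires one request */
   }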
Certain RDMA implementations may impose additional flow control
restrictions, such as limits on RDMA Read operations in progress at
the responder.  Because these operations are outside the scope of
this protocol, they are not addressed and must be provided for by
other layers.  For example, a simple upper layer RPC consumer might
perform single-issue RDMA Read requests, while a more
sophisticated, multithreaded RPC consumer may implement its own
FIFO queue of such operations.  For further discussion of possible
protocol implementations capable of negotiating these values, see
section 6 "Connection Configuration Protocol" of this draft, or
[NFSSESS].
3.4. XDR Encoding with Chunks
The data comprising an RPC call or reply message is marshaled or
serialized into a contiguous stream by an XDR routine.  XDR data
types such as integers, strings, arrays and linked lists are
commonly implemented over two very simple functions that encode
either an XDR data unit (32 bits) or an array of bytes.
Normally, the separate data items in an RPC call or reply are
encoded as a contiguous sequence of bytes for network transmission
over UDP or TCP.  However, in the case of an RDMA transport, local
routines such as XDR encode can determine that (for instance) an
opaque byte array is large enough to be more efficiently moved via
an RDMA data transfer operation like RDMA Read or RDMA Write.
Semantically speaking, the protocol has no restriction regarding
data types which may or may not be represented by a read or write
chunk.  In practice however, efficiency considerations lead to the
conclusion that certain data types are not generally "chunkable".
Typically, only opaque and aggregate data types which may attain
substantial size are considered to be eligible.  With today's
hardware this size may be a kilobyte or more.  However any object
may be chosen for chunking in any given message.
The eligibility of XDR data items to be candidates for being moved
as data chunks (as opposed to being marshalled inline) is not
specified by the RPC over RDMA protocol.  Chunk eligibility
criteria must be determined by each upper layer in order to provide
for an interoperable specification.  One such example with
rationale, for the NFS protocol family, is provided in [NFSDDP].
The interface by which an upper layer implementation communicates
the eligibility of a data item locally to RPC for chunking is out
of scope for this specification.  In many implementations, it is
possible to implement a transparent RPC chunking facility.
However, such implementations may lead to inefficiencies, either
because they require the RPC layer to perform expensive
registration and deregistration of memory "on the fly", or they may
require using RDMA chunks in reply messages, along with the
resulting additional handshaking with the RPC over RDMA peer.
These issues are internal, however, and generally confined to the
local interface between RPC and its upper layers, one in which
implementations are free to innovate.  The only requirement is that
the resulting RPC RDMA protocol sent to the peer is valid for the
upper layer.  See for example [NFSDDP].
When sending any message (request or reply) that contains an
eligible large data chunk, the XDR encoding routine avoids moving
the data into the XDR stream.  Instead, it does not encode the data
portion, but records the address and size of each chunk in a
separate "read chunk list" encoded within RPC RDMA transport-
specific headers.  Such chunks will be transferred via RDMA Read
operations initiated by the receiver.
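The following C sketch illustrates one way an encoding routine
might divert an eligible opaque field into the read chunk list
while leaving the byte count in the XDR stream.  The helper
functions and the threshold value are illustrative assumptions,
not part of this protocol.

   #include <stddef.h>
   #include <stdint.h>

   #define CHUNK_THRESHOLD 1024     /* "a kilobyte or more", assumed */

   struct read_chunk {
       uint32_t position;   /* XDR stream offset where the data belongs */
       uint32_t handle;     /* steering tag from memory registration */
       uint32_t length;
       uint64_t offset;     /* memory address of the data */
   };

   /* Hypothetical local helpers, assumed to exist. */
   uint32_t register_memory(const void *addr, size_t len, uint64_t *off);
   void     xdr_encode_u32(uint32_t val);
   void     xdr_encode_bytes(const void *data, size_t len);
   uint32_t xdr_stream_pos(void);

   void encode_opaque(const void *data, size_t len,
                      struct read_chunk *list, size_t *nchunks)
   {
       xdr_encode_u32((uint32_t)len);       /* count stays inline */
       if (len >= CHUNK_THRESHOLD) {
           struct read_chunk *c = &list[(*nchunks)++];
           c->position = xdr_stream_pos();  /* where the data would go */
           c->handle   = register_memory(data, len, &c->offset);
           c->length   = (uint32_t)len;     /* data itself is elided */
       } else {
           xdr_encode_bytes(data, len);     /* small: encode inline */
       }
   }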
When the read chunks are to be moved via RDMA, the memory for each
chunk must be registered.  This registration may take place within
XDR itself, providing for full transparency to upper layers, or it
may be performed by any other specific local implementation.
Additionally, when making an RPC call that can result in bulk data
transferred in the reply, it is desirable to provide chunks to
accept the data directly via RDMA Write.  These write chunks will
therefore be pre-filled by the RPC server prior to responding, and
XDR decode at the client will not be required.  These chunks
undergo a similar registration and advertisement via "write chunk
lists" built as a part of XDR encoding.
Some RPC client implementations are not able to determine where an
RPC call's results reside during the "encode" phase.  This makes it
difficult or impossible for the RPC client layer to encode the
write chunk list at the time of building the request.  In this
case, it is difficult for the RPC implementation to provide
transparency to the RPC consumer, which may require recoding to
provide result information at this earlier stage.
Therefore, if the RPC client does not make a write chunk list
available to receive the result, then the RPC server must return
data inline in the reply, or if it so chooses, via a read chunk
list.  RPC clients are discouraged from omitting write chunk lists
for eligible replies, due to the lower performance of the
additional handshaking to perform data transfer, and the
requirement that the RPC server must expose (and preserve) the
reply data for a period of time.  In the absence of a server-
provided read chunk list in the reply, if the encoded reply
overflows the posted receive buffer, the RPC will fail.
When any data within a message is provided via either read or write
chunks, the chunk itself refers only to the data portion of the XDR
stream element.  In particular, for counted fields (e.g. a "<>"
encoding) the byte count which is encoded as part of the field
remains in the XDR stream, and is also encoded in the chunk list.
The data portion is however elided from the encoded XDR stream, and
is transferred as part of chunk list processing.  This is important
to maintain upper layer implementation compatibility - both the
count and the data must be transferred as part of the logical XDR
stream.  While the chunk list processing results in the data being
available to the upper layer peer for XDR decoding, the length
present in the chunk list entries is not.  Any byte count in the
XDR stream must match the sum of the byte counts present in the
corresponding read or write chunk list.  If they do not agree, an
RPC protocol encoding error results.
The following items are contained in a chunk list entry.
Handle
   Steering tag or handle obtained when the chunk memory is
   registered for RDMA.
Length
   The length of the chunk in bytes.
Offset
   The offset or beginning memory address of the chunk.  In order
   to support the widest array of RDMA implementations, as well
   as the most general steering tag scheme, this field is
   unconditionally included in each chunk list entry.
While zero-based offset schemes are available in many RDMA
implementations, their use by RPC requires individual
registration of each read or write chunk. On many such
implementations this can be a significant overhead. By
providing an offset in each chunk, many pre-registration or
region-based registrations can be readily supported, and by
using a single, universal chunk representation, the RPC RDMA
protocol implementation is simplified to its most general
form.
Position
   For data which is to be encoded, the position in the XDR
   stream where the chunk would normally reside.  Note that the
   chunk therefore inserts its data into the XDR stream at this
   position, but its transfer is no longer "inline".  Also note
   it is possible that a contiguous sequence of chunks might all
   have the same position.  For data which is to be decoded, no
   "position" is used.
When XDR marshaling is complete, the chunk list is XDR encoded,
   +----------------+----------------+-------------
   | RPC over RDMA  |                |
   |    header w/   |   RPC Header   | Non-chunk args/results
   |     chunks     |                |
   +----------------+----------------+-------------
Read chunk lists and write chunk lists are structured somewhat
differently.  This is due to the different usage - read chunks are
decoded and indexed by their position in the XDR data stream, their
size is always known, and may be used for both arguments and
results.  Write chunks on the other hand are used only for results,
and have neither a preassigned offset in the XDR stream, nor a size
until the results are produced, since the buffers may not be used
for results at all, or may be partially filled.  Their presence in
the XDR stream is therefore not known until the reply is processed.
The mapping of Write chunks onto designated NFS procedures and
their results is described in [NFSDDP].
Therefore, read chunks are encoded into a read chunk list as a
single array, with each entry tagged by its (known) size and
position in the XDR stream.  Write chunks are encoded as a list of
arrays of RDMA buffers, with each list element (an array) providing
buffers for a separate result.  Individual write chunk list
elements may thereby be partially or fully filled, or in fact not
filled at all.  Unused write chunks, or unused bytes in write chunk
buffer lists, are not returned as results, and their memory is
returned to the upper layer as part of RPC completion.  However,
the RPC layer should not assume that the buffers have not been
modified.
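The structural difference can be summarized in C as follows; the
types are illustrative, and the corresponding on-the-wire XDR
appears in section 4.3.

   #include <stdint.h>

   struct rdma_segment {
       uint32_t handle;     /* steering tag */
       uint32_t length;     /* bytes */
       uint64_t offset;     /* memory address or offset */
   };

   /* Read chunks: one flat array of position-tagged entries. */
   struct read_chunk_entry {
       uint32_t position;           /* XDR stream position; size known */
       struct rdma_segment target;
   };

   /* Write chunks: a list whose elements are arrays of segments,
    * one array per result. */
   struct write_chunk_array {
       uint32_t nsegments;
       struct rdma_segment *segments;
       struct write_chunk_array *next;   /* buffers for next result */
   };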
3.5. Padding
Alignment of specific opaque data enables certain scatter/gather
optimizations.  Padding leverages the useful property that RDMA
transfers preserve alignment of data, even when they are placed
into pre-posted receive buffers by Sends.
Many servers can make good use of such padding.  Padding allows the
chaining of RDMA receive buffers such that any data transferred by
If padding is to apply only to chunks at least 1 KB in size, then
the threshold should be set to 1 KB.  The XDR routine at the peer
will consult these values when decoding opaque values.  Where the
decoded length exceeds the rdma_thresh, the XDR decode will skip
over the appropriate padding as indicated by rdma_align and the
current XDR stream position.
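The following C sketch shows one plausible decode-side computation
of the padding to skip, under the assumption that pad bytes align
the data following the opaque count to an rdma_align boundary.
This is an illustration only; the exact rule is a property of the
peer's advertised values.

   #include <stdint.h>

   /* Pad bytes to skip at XDR stream offset 'pos' before decoding
    * an opaque field of 'len' bytes. */
   static uint32_t pad_to_skip(uint32_t pos, uint32_t len,
                               uint32_t rdma_align, uint32_t rdma_thresh)
   {
       if (len <= rdma_thresh || rdma_align == 0)
           return 0;                          /* no padding applies */
       return (rdma_align - pos % rdma_align) % rdma_align;
   }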
3.6. XDR Decoding with Read Chunks
The XDR decode process moves data from an XDR stream into a data
structure provided by the RPC client or server application.  Where
elements of the destination data structure are buffers or strings,
the RPC application can either pre-allocate storage to receive the
data, or leave the string or buffer fields null and allow the XDR
decode stage of RPC processing to automatically allocate storage of
sufficient size.
When decoding a message from an RDMA transport, the receiver first
XDR decodes the chunk lists from the RPC over RDMA header, then
proceeds to decode the body of the RPC message (arguments or
results).  Whenever the XDR offset in the decode stream matches
that of a chunk in the read chunk list, the XDR routine initiates
an RDMA Read to bring over the chunk data into locally registered
memory for the destination buffer.
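In outline, that decode path might resemble the following C
sketch, where rdma_read() stands in for whatever operation the
local RDMA provider offers; all names here are assumptions.

   #include <stddef.h>
   #include <stdint.h>

   struct read_chunk {
       uint32_t position, handle, length;
       uint64_t offset;
   };

   /* Hypothetical: RDMA Read from the peer's (handle, offset) into
    * locally registered memory at 'dst', waiting for completion. */
   int rdma_read(uint32_t handle, uint64_t offset,
                 void *dst, uint32_t len);

   int decode_field(uint32_t xdr_pos, void *dst,
                    const struct read_chunk *chunks, size_t nchunks)
   {
       for (size_t i = 0; i < nchunks; i++) {
           if (chunks[i].position == xdr_pos)   /* chunked: fetch */
               return rdma_read(chunks[i].handle, chunks[i].offset,
                                dst, chunks[i].length);
       }
       return 0;   /* not chunked: decode inline (not shown) */
   }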
When processing an RPC request, the RPC receiver (RPC server)
acknowledges its completion of use of the source buffers by simply
replying to the RPC sender (client), and the peer may free all
source buffers advertised by the request.
When processing an RPC reply, after completing such a transfer the
RPC receiver (client) must issue an RDMA_DONE message (described in
Section 3.8) to notify the peer (server) that the source buffers
can be freed.
The read chunk list is constructed and used entirely within the
RPC/XDR layer.  Other than specifying the minimum chunk size, the
management of the read chunk list is automatic and transparent to
an RPC application.
3.7. XDR Decoding with Write Chunks
When a "write chunk list" is provided for the results of the RPC When a "write chunk list" is provided for the results of the RPC
call, the server must provide any corresponding data via RDMA Write call, the RPC server must provide any corresponding data via RDMA
to the memory referenced in the chunk list entries. The RPC reply Write to the memory referenced in the chunk list entries. The RPC
conveys this by returning the write chunk list to the client with reply conveys this by returning the write chunk list to the client
the lengths rewritten to match the actual transfer. The XDR with the lengths rewritten to match the actual transfer. The XDR
"decode" of the reply therefore performs no local data transfer but "decode" of the reply therefore performs no local data transfer but
merely returns the length obtained from the reply. merely returns the length obtained from the reply.
Each decoded result consumes one entry in the write chunk list,
which in turn consists of an array of RDMA segments.  The length is
therefore the sum of all returned lengths in all segments
comprising the corresponding list entry.  As each list entry is
"decoded", the entire entry is consumed.
The write chunk list is constructed and used by the RPC
application.  The RPC/XDR layer simply conveys the list between
client and server and initiates the RDMA Writes back to the client.
The mapping of write chunk list entries to procedure arguments must
be determined for each protocol.  An example of a mapping is
described in [NFSDDP].
3.8. RPC Call and Reply
The RDMA transport for RPC provides three methods of moving data
between RPC client and server:
Inline
   Data are moved between RPC client and server within an RDMA
   Send.
RDMA Read
   Data are moved between RPC client and server via an RDMA Read
   operation, using a steering tag, address and offset obtained
   from a read chunk list.
RDMA Write
   Result data is moved from RPC server to client via an RDMA
   Write operation, using a steering tag, address and offset
   obtained from a write chunk list or reply chunk in the
   client's RPC call message.
These methods of data movement may occur in combinations within a
single RPC.  For instance, an RPC call may contain some inline data
along with some large chunks to be transferred via RDMA Read to the
server.  The reply to that call may have some result chunks that
the server RDMA Writes back to the client.  The following protocol
interactions illustrate RPC calls that use these methods to move
RPC message data:
An RPC with write chunks in the call message looks like this:
   RPC Client                          RPC Server
       |    RPC Call + Write Chunk list        |
  Send | ------------------------------>       |
       |                                       |
       |              Chunk 1                  |
       | <------------------------------       | Write
       |                 :                     |
       |              Chunk n                  |
       | <------------------------------       | Write
       |                                       |
       |              RPC Reply                |
       | <------------------------------       | Send
In the presence of write chunks, RDMA ordering provides the
guarantee that all data in the RDMA Write operations has been
placed in memory prior to the client's RPC reply processing.
An RPC with read chunks in the call message looks like this:
   RPC Client                          RPC Server
       |    RPC Call + Read Chunk list         |
  Send | ------------------------------>       |
       |                                       |
       |              Chunk 1                  |
       | +------------------------------       | Read
       | v----------------------------->       |
       |                 :                     |
       |              Chunk n                  |
       | +------------------------------       | Read
       | v----------------------------->       |
       |                                       |
       |              RPC Reply                |
       | <------------------------------       | Send
And an RPC with read chunks in the reply message looks like this:
   RPC Client                          RPC Server
       |              RPC Call                 |
  Send | ------------------------------>       |
       |                                       |
       |    RPC Reply + Read Chunk list        |
       | <------------------------------       | Send
       |                                       |
       |              Chunk 1                  |
  Read | ------------------------------+       |
       | <-----------------------------v       |
       |                 :                     |
       |              Chunk n                  |
  Read | ------------------------------+       |
       | <-----------------------------v       |
       |                                       |
       |               Done                    |
  Send | ------------------------------>       |
The final Done message allows the RPC client to signal the server
that it has received the chunks, so the server can de-register and
free the memory holding the chunks.  A Done completion is not
necessary for an RPC call, since the RPC reply Send is itself a
receive completion notification.  In the event that the client
fails to return the Done message within some timeout period, the
server may conclude that a protocol violation has occurred and
close the RPC connection, or it may proceed with a de-register and
free its chunk buffers.  This may result in a fatal RDMA error if
the client later attempts to perform an RDMA Read operation, which
amounts to the same thing.
The use of read chunks in RPC reply messages is much less efficient
than providing write chunks in the originating RPC calls, due to
the additional message exchanges, the need for the RPC server to
advertise buffers to the peer, the necessity of the server
maintaining a timer for the purpose of recovery from misbehaving
clients, and the need for additional memory registration.  Their
use is not recommended by upper layers where efficiency is a
primary concern [NFSDDP].  However, they may be employed by upper
layer protocol bindings which are primarily concerned with
transparency, since they can frequently be implemented completely
within the RPC lower layers.
It is important to note that the Done message consumes a credit at
the RPC server.  The RPC server should provide sufficient credits
to the client to allow the Done message to be sent without deadlock
(driving the outstanding credit count to zero).  The RPC client
must account for its required Done messages to the server in its
accounting of available credits, and the server should replenish
any credit consumed by its use of such exchanges at its earliest
opportunity.
Finally, it is possible to conceive of RPC exchanges that involve
any or all combinations of write chunks in the RPC call, read
chunks in the RPC call, and read chunks in the RPC reply.  Support
for such exchanges is straightforward from a protocol perspective,
but in practice such exchanges would be quite rare, limited to
upper layer protocol exchanges which transferred bulk data in both
the call and corresponding reply.
4. RPC RDMA Message Layout
   +--------+---------+---------+-----------+-------------+----------
   |        |         |         |  Message  |    NULLs    | RPC Call
   |  XID   | Version | Credits |   Type    |     or      |    or
   |        |         |         |           | Chunk Lists | Reply Msg
   +--------+---------+---------+-----------+-------------+----------
Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or
RPC message follows.  As an implementation hint: a gather operation
on the Send of the RDMA RPC message can be used to marshal the
initial header, the chunk list, and the RPC message itself.
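A sketch of that gather operation in C, with a hypothetical
post_send() standing in for whatever scatter/gather Send interface
the local RDMA provider offers:

   #include <stdint.h>

   struct sge { void *addr; uint32_t length; };  /* gather element */

   int post_send(const struct sge *sgl, int num_sge);  /* hypothetical */

   int send_rpc_rdma_msg(void *hdr, uint32_t hdr_len,
                         void *chunks, uint32_t chunks_len,
                         void *rpcmsg, uint32_t rpc_len)
   {
       struct sge sgl[3] = {
           { hdr,    hdr_len    },   /* XID, version, credits, type */
           { chunks, chunks_len },   /* read/write chunk lists */
           { rpcmsg, rpc_len    },   /* the RPC call or reply proper */
       };
       return post_send(sgl, 3);     /* one RDMA Send, no copies */
   }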
4.2. RPC over RDMA header errors
When a peer receives an RPC RDMA message, it must perform certain
basic validity checks on the header and chunk contents. If errors
are detected in an RPC request, an RDMA_ERROR reply should be
generated.
Two types of errors are defined: version mismatch and invalid chunk
format. When the peer detects an RPC over RDMA header version
which it does not support (currently this draft defines only
version 1), it replies with an error code of ERR_VERS, and provides
the low and high inclusive version numbers it does, in fact,
support. The version number in this reply can be any value
otherwise valid at the receiver. When other decoding errors are
detected in the header or chunks, either an RPC decode error may be
returned, or the error code ERR_CHUNK.
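For illustration, the variable part of an ERR_VERS reply might be
formed as in the following C sketch, which follows the XDR of
section 4.3; the encoding helper and the numeric error code values
are assumptions.

   #include <stdint.h>

   #define RDMA_VERS_LOW   1          /* only version 1 is defined */
   #define RDMA_VERS_HIGH  1

   enum { ERR_VERS = 1, ERR_CHUNK = 2 };  /* values assumed */

   void xdr_encode_u32(uint32_t val);     /* hypothetical helper */

   static void encode_version_error(void)
   {
       /* ...XID, version, credits, and message type RDMA_ERROR
        * precede this in the header... */
       xdr_encode_u32(ERR_VERS);
       xdr_encode_u32(RDMA_VERS_LOW);     /* rdma_vers_low */
       xdr_encode_u32(RDMA_VERS_HIGH);    /* rdma_vers_high */
   }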
4.3. XDR Language Description
Here is the message layout in XDR language.
   struct xdr_rdma_segment {
      uint32 handle;           /* Registered memory handle */
      uint32 length;           /* Length of the chunk in bytes */
      uint64 offset;           /* Chunk virtual address or offset */
   };
   struct xdr_read_chunk {
      uint32 position;         /* Position in XDR stream */
      struct xdr_rdma_segment target;
   };
   union rpc_rdma_error switch (rpc_rdma_errcode) {
      case ERR_VERS:
         uint32 rdma_vers_low;
         uint32 rdma_vers_high;
      case ERR_CHUNK:
         void;
      default:
         uint32 rdma_extra[8];
   };
5. Long Messages
The receiver of RDMA Send messages is required to have previously
posted one or more adequately sized buffers.  The RPC client can
inform the server of the maximum size of its RDMA Send messages via
the Connection Configuration Protocol described later in this
document.
Since RPC messages are frequently small, memory savings can be
achieved by posting small buffers.  Even large messages like NFS
READ or WRITE will be quite small once the chunks are removed from
the message.  However, there may be large messages that would
demand a very large buffer be posted, where the contents of the
buffer may not be a chunkable XDR element.  A good example is an
NFS READDIR reply which may contain a large number of small
filename strings.  Also, the NFS version 4 protocol [RFC3530]
features COMPOUND request and reply messages of unbounded length.

Ideally, each upper layer will negotiate these limits.  However, it
is frequently necessary to provide a transparent solution.
5.1. Message as an RDMA Read Chunk
One relatively simple method is to have the client identify any RPC
message that exceeds the RPC server's posted buffer size and move
it separately as a chunk, i.e. reference it as the first entry in
the read chunk list with an XDR position of zero.
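A sketch of this sender-side test in C, under the assumption of
hypothetical registration and chunk-list marshaling helpers (the
draft defines neither); the two resulting layouts are shown in the
figures below:

   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical marshaling hooks, assumed for this sketch. */
   extern uint32_t register_memory(const void *buf, size_t len,
                                   uint64_t *offset);
   extern void add_read_chunk(uint32_t xdr_position, uint32_t handle,
                              uint32_t length, uint64_t offset);

   /* If the encoded call exceeds the server's posted buffer size,
    * reference the whole message as the first read chunk at XDR
    * position zero and Send only the RPC over RDMA header
    * (RDMA_NOMSG).  Returns 1 for the long-message path. */
   static int choose_long_message(const void *rpc_msg, size_t msg_len,
                                  size_t maxcall_sendsize)
   {
       uint64_t offset;
       uint32_t handle;

       if (msg_len <= maxcall_sendsize)
           return 0;                      /* inline, RDMA_MSG */

       handle = register_memory(rpc_msg, msg_len, &offset);
       add_read_chunk(0, handle, (uint32_t)msg_len, offset);
       return 1;                          /* header only, RDMA_NOMSG */
   }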
Normal Message

   +--------+---------+---------+------------+-------------+----------
   |        |         |         |            |             | RPC Call
   |  XID   | Version | Credits |  RDMA_MSG  | Chunk Lists |    or
   |        |         |         |            |             | Reply Msg
   +--------+---------+---------+------------+-------------+----------
Long Message

skipping to change at page 22, line 31
                                          |  +----------
                                          |  | Long RPC Call
                                          +->|    or
                                             | Reply Message
                                             +----------
If the receiver gets an RPC over RDMA header with a message type of
RDMA_NOMSG and finds an initial read chunk list entry with a zero
XDR position, it allocates a registered buffer and issues an RDMA
Read of the long RPC message into it.  The receiver then proceeds
to XDR decode the RPC message as if it had received it inline with
the Send data.  Further decoding may issue additional RDMA Reads to
bring over additional chunks.
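The receiving side of the same path, again as a hedged sketch; the
buffer allocator, RDMA Read call and XDR decoder below are
stand-ins, not interfaces defined by this protocol:

   #include <stddef.h>
   #include <stdint.h>

   struct read_chunk {                /* decoded first list entry */
       uint32_t position;             /* XDR stream position */
       uint32_t handle, length;
       uint64_t offset;
   };

   /* Hypothetical local primitives, assumed for this sketch. */
   extern void *alloc_registered_buffer(size_t len);
   extern int   rdma_read(void *dst, uint32_t handle, uint64_t offset,
                          uint32_t length);
   extern int   xdr_decode_rpc(const void *msg, size_t len);

   /* On RDMA_NOMSG with a position-zero first read chunk, fetch the
    * long RPC message by RDMA Read, then decode it as if inline. */
   static int receive_long_message(const struct read_chunk *first)
   {
       void *buf;

       if (first == NULL || first->position != 0)
           return -1;                 /* not a long message */

       buf = alloc_registered_buffer(first->length);
       if (buf == NULL ||
           rdma_read(buf, first->handle, first->offset,
                     first->length) != 0)
           return -1;

       return xdr_decode_rpc(buf, first->length);
   }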
Although the handling of long messages requires one extra network
turnaround, in practice these messages should be rare if the posted
receive buffers are correctly sized, and of course they will be
non-existent for RDMA-aware upper layers.
An RPC with long reply returned via RDMA Read looks like this:
   RPC Client                              RPC Server
       |             RPC Call                  |
  Send | ------------------------------>       |
       |                                       |
       |        RPC over RDMA Header           |
       | <------------------------------       | Send
       |                                       |
       |         Long RPC Reply Msg            |
  Read | ------------------------------+       |
       | <-----------------------------v       |
       |                                       |
       |               Done                    |
  Send | ------------------------------>       |
5.2. RDMA Write of Long Replies (Reply Chunks)
A superior method of handling long RPC replies is to have the RPC
client post a large buffer into which the server can write a large
RPC reply.  This has the advantage that an RDMA Write may be
slightly faster in network latency than an RDMA Read.
Additionally, it removes the need for the RDMA_DONE message that
returning the large reply as a Read chunk would require.
This protocol supports direct return of a large reply via the
inclusion of an optional rdma_reply write chunk after the read
chunk list and the write chunk list.  The client allocates a buffer
sized to receive a large reply and enters its steering tag, address
and length in the rdma_reply write chunk.  If the reply message is
too long to return inline with an RDMA Send (exceeds the size of
the client's posted receive buffer), even with read chunks removed,
then the RPC server performs an RDMA Write of the RPC reply message
into the buffer indicated by the rdma_reply chunk.  If the client
doesn't provide an rdma_reply chunk, or if it's too small, then the
message must be returned as a Read chunk.
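Those rules amount to a three-way decision on the server.  A
minimal sketch, with purely illustrative names:

   #include <stddef.h>

   enum reply_path { REPLY_INLINE, REPLY_WRITE_CHUNK,
                     REPLY_READ_CHUNK };

   /* Inline Send when the reply fits the client's posted receive
    * buffer; else RDMA Write into an adequate rdma_reply chunk;
    * else fall back to a Read chunk, which costs an extra
    * turnaround and an RDMA_DONE. */
   static enum reply_path choose_reply_path(size_t reply_len,
                                            size_t client_recv_bufsize,
                                            size_t rdma_reply_len)
   {
       if (reply_len <= client_recv_bufsize)
           return REPLY_INLINE;
       if (rdma_reply_len >= reply_len)
           return REPLY_WRITE_CHUNK;
       return REPLY_READ_CHUNK;
   }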
An RPC with long reply returned via RDMA Write looks like this:
   RPC Client                              RPC Server
       |      RPC Call with rdma_reply         |
  Send | ------------------------------>       |
       |                                       |
       |         Long RPC Reply Msg            |
       | <------------------------------       | Write
       |                                       |
       |        RPC over RDMA Header           |
       | <------------------------------       | Send
The use of RDMA Write to return long replies requires that the
client application anticipate a long reply and have some knowledge
of its size so that an adequately sized buffer can be allocated.
This is certainly true of NFS READDIR replies, where the client
already provides an upper bound on the size of the encoded
directory fragment to be returned by the server.
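For example, a client might size its rdma_reply buffer directly
from the READDIR count argument it is about to send; the header
slack constant below is an assumption of the sketch, not a protocol
value:

   #include <stddef.h>
   #include <stdint.h>

   #define REPLY_HEADER_SLACK 256     /* assumed allowance for RPC
                                         and NFS reply headers */

   /* NFSv3 READDIR's "count" argument bounds the encoded directory
    * data the server may return, so it also bounds the reply. */
   static size_t readdir_reply_bufsize(uint32_t readdir_count)
   {
       return (size_t)readdir_count + REPLY_HEADER_SLACK;
   }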
The use of these "reply chunks" is highly efficient and convenient
for both RPC client and server.  Their use is encouraged for
eligible RPC operations such as NFS READDIR, which would otherwise
require extensive chunk management within the results or use of
RDMA Read and a Done message [NFSDDP].
6. Connection Configuration Protocol
RDMA Send operations require the receiver to post one or more
buffers at the RDMA connection endpoint, each large enough to
receive the largest Send message.  Buffers are consumed as Send
messages are received.  If a buffer is too small, or if there are
no buffers posted, the RDMA transport may return an error and break
the RDMA connection.  The receiver must post sufficient, adequately
sized buffers to avoid buffer overrun or capacity errors.
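A sketch of the receive-side obligation, assuming a hypothetical
endpoint API (real RDMA verbs interfaces differ in detail):

   #include <stddef.h>

   struct endpoint;                   /* opaque connection endpoint */

   /* Hypothetical primitives, assumed for this sketch. */
   extern void *alloc_registered_buffer(size_t len);
   extern int   post_receive(struct endpoint *ep, void *buf,
                             size_t len);

   /* Keep "credits" receive buffers of the negotiated inline size
    * posted, so no peer Send arrives without a buffer to land in. */
   static int post_receive_buffers(struct endpoint *ep,
                                   unsigned credits,
                                   size_t maxcall_sendsize)
   {
       unsigned i;

       for (i = 0; i < credits; i++) {
           void *buf = alloc_registered_buffer(maxcall_sendsize);
           if (buf == NULL ||
               post_receive(ep, buf, maxcall_sendsize) != 0)
               return -1;
       }
       return 0;
   }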
The protocol described above includes only a mechanism for managing
the number of such receive buffers, and no explicit features to
allow the RPC client and server to provision or control buffer
sizing, nor any other session parameters.
In the past, this type of connection management has not been
necessary for RPC.  RPC over UDP or TCP does not have a protocol to
negotiate the link.  The server can get a rough idea of the maximum
size of messages from the server protocol code.  However, a
protocol to negotiate transport features on a more dynamic basis is
desirable.
The Connection Configuration Protocol allows the client to pass its
connection requirements to the server, and allows the server to

skipping to change at page 25, line 23
band, i.e. it uses the connection itself to negotiate the
connection parameters.  To provide a basis for connection
negotiation, the connection is assumed to provide a basic level of
interoperability: the ability to exchange at least one RPC message
at a time that is at least 1 KB in size.  The server may exceed
this basic level of configuration, but the client must not assume
it.
6.2. Protocol Description
Version 1 of the Connection Configuration protocol consists of a
single procedure that allows the client to inform the server of its
connection requirements and the server to return connection
information to the client.
The maxcall_sendsize argument is the maximum size of an RPC call
message that the client will send inline in an RDMA Send message to
the server.  The server may return a maxcall_sendsize value that is
smaller or larger than the client's request.  The client must not
send an inline call message larger than what the server will
accept.  The maxcall_sendsize limits only the size of inline RPC
calls.  It does not limit the size of long RPC messages transferred
as an initial chunk in the Read chunk list.
The maxreply_sendsize is the maximum size of an inline RPC message
that the client will accept from the server.
The maxrdmaread is the maximum number of RDMA Reads which may be
active at the peer.  This number correlates to the incoming RDMA
Read count ("IRD") configured into each originating endpoint by the
client or server.  If more than this number of RDMA Read operations
by the connected peer are issued simultaneously, connection loss or
suboptimal flow control may result; therefore the value should be
observed at all times.  The peers' values need not be equal.  If
zero, the peer must not issue requests which
skipping to change at page 26, line 10
The align value is the value recommended by the server for opaque
data values such as strings and counted byte arrays.  The client
can use this value to compute the number of prepended pad bytes
when XDR encoding opaque values in the RPC call message.
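For instance, the pad computation might look like the following
sketch (the function name is illustrative; the XDR for the
procedure itself follows):

   #include <stdint.h>

   /* Pad bytes to prepend so the opaque data that follows lands on
    * the server's preferred alignment.  "align" comes from
    * config_rdma_reply; "xdr_pos" is the current XDR offset. */
   static uint32_t prepend_pad(uint32_t xdr_pos, uint32_t align)
   {
       if (align == 0)
           return 0;
       return (align - (xdr_pos % align)) % align;
   }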
   typedef unsigned int uint32;

   struct config_rdma_req {
      uint32  maxcall_sendsize;
              /* max size of inline RPC call */
      uint32  maxreply_sendsize;
              /* max size of inline RPC reply */
      uint32  maxrdmaread;
              /* max active RDMA Reads at client */
   };

   struct config_rdma_reply {
      uint32  maxcall_sendsize;
              /* max call size accepted by server */
      uint32  align;
              /* server's receive buffer alignment */
      uint32  maxrdmaread;
skipping to change at page 27, line 10
default behavior of an RPC over RDMA transport is to register and
de-register these chunks on every transaction.  However, this is
not a limitation of the protocol - only of the existing local RPC
API.  The API is easily extended through such functions as
rpc_control(3) to change the default behavior so that the
application can assume responsibility for controlling memory
registration through an RPC-provided registered memory allocator.
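How such an extension might look from the application, as a purely
illustrative sketch: the request code and callback type below are
assumptions, not part of any existing rpc_control(3) interface:

   #include <stddef.h>
   #include <rpc/rpc.h>

   /* Assumed extension: direct the transport to obtain chunk memory
    * from an application-supplied registered-memory allocator. */
   #define RPC_RDMA_SET_ALLOCATOR 100 /* hypothetical request code */

   typedef void *(*rdma_alloc_t)(size_t len);

   static void *my_registered_alloc(size_t len)
   {
       /* ... allocate from a pre-registered memory pool ... */
       (void)len;
       return NULL;                   /* stub for the sketch */
   }

   static int use_app_registration(void)
   {
       rdma_alloc_t fn = my_registered_alloc;

       return rpc_control(RPC_RDMA_SET_ALLOCATOR, &fn) ? 0 : -1;
   }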
8. Errors and Error Recovery
RPC RDMA protocol errors are described in section 4.  RPC errors
and RPC error recovery are not affected by the protocol, and
proceed as for any RPC error condition.  RDMA Transport error
reporting and recovery are outside the scope of this protocol.
It is assumed that the link itself will provide some degree of
error detection and retransmission.  iWARP's MPA layer (when used
over TCP), SCTP, and the Infiniband link layer all provide CRC
protection of the RDMA payload, and CRC-class protection is a
general attribute of such transports.  Additionally, the RPC layer
itself can accept errors from the link level and recover via
retransmission.  RPC recovery can handle complete loss and re-
establishment of the link.
See section 11 for further discussion of the use of RPC-level
integrity schemes to detect errors, and related efficiency issues.
9. Node Addressing
In setting up a new RDMA connection, the first action by an RPC
client will be to obtain a transport address for the server.  The
mechanism used to obtain this address, and to open an RDMA
connection, is dependent on the type of RDMA transport, and is the
responsibility of each RPC protocol binding and its local
implementation.
10. RPC Binding
RPC services normally register with a portmap or rpcbind service,
which associates an RPC program number with a service address.  In
the case of UDP or TCP, the service address for NFS is normally
port 2049.  This policy should be no different with RDMA
interconnects.
One possibility is to have the server's portmapper register itself
on the RDMA interconnect at a "well known" service address.  On UDP
or TCP, this corresponds to port 111.  A client could connect to
this service address and use the portmap protocol to obtain a
service address in response to a program number, e.g. an iWARP port
number, or an Infiniband GID.
11. Security
ONC RPC provides its own security via the RPCSEC_GSS framework
[RFC2203].  RPCSEC_GSS can provide message authentication,
integrity checking, and privacy.  This security mechanism will be
unaffected by the RDMA transport.  The data integrity and privacy
features alter the body of the message, presenting it as a single
chunk.  For large messages the chunk may be large enough to qualify
for RDMA Read transfer.  However, there is much data movement
associated with computation and verification of integrity, or
encryption/decryption, so certain performance advantages may be
lost.
For efficiency, a more appropriate security mechanism for RDMA
links may be link-level protection, such as IPSec, which may be
co-located in the RDMA hardware.  The use of link-level protection
may be negotiated through the use of a new RPCSEC_GSS mechanism
like the Credential Cache GSS Mechanism [CCM].  Use of such
mechanisms is recommended where end-to-end integrity and/or privacy
is desired, and where efficiency is required.

There are no new issues here with exposed addresses.  The only
exposed addresses here are in the chunk list and in the transport
packets transferred via RDMA.  The data contained in these
addresses continues to be protected by RPCSEC_GSS integrity and
privacy.
12. IANA Considerations
As a new RPC transport, this protocol should have no effect on RPC
program numbers or registered port numbers.  The new RPC transport
should be assigned a new RPC "netid".  If adopted, the Connection
Configuration protocol described herein will require an RPC program
number assignment.
13. Acknowledgements

skipping to change at page 29, line 28
http://www.ietf.org/rfc/rfc3530.txt
[RFC2203]
M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol Specification",
Standards Track RFC, http://www.ietf.org/rfc/rfc2203.txt
15. Informative References
[RDMA]
R. Recio et al, "An RDMA Protocol Specification", Internet Draft
Work in Progress, draft-ietf-rddp-rdmap
[CCM]
M. Eisler, N. Williams, "CCM: The Credential Cache GSS Mechanism",
Internet Draft Work in Progress, draft-ietf-nfsv4-ccm
[NFSDDP]
B. Callaghan, T. Talpey, "NFS Direct Data Placement", Internet
Draft Work in Progress, draft-ietf-nfsv4-nfsdirect
[RDDP]
Remote Direct Data Placement Working Group Charter,
http://www.ietf.org/html.charters/rddp-charter.html
[NFSRDMAPS]
T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet
Draft Work in Progress, draft-ietf-nfsv4-nfs-rdma-problem-statement
[NFSSESS]
T. Talpey, S. Shepler, J. Bauman, "NFSv4 Session Extensions",
Internet Draft Work in Progress, draft-ietf-nfsv4-nfs-sess
[IB]
Infiniband Architecture Specification, http://www.infinibandta.org
16. Authors' Addresses
Brent Callaghan
1614 Montalto Dr.
Mountain View, California 94040 USA

skipping to change at page 30, line 25

EMail: brent.callaghan@gmail.com
Tom Talpey
Network Appliance, Inc.
375 Totten Pond Road
Waltham, MA 02451 USA

Phone: +1 781 768 5329
EMail: thomas.talpey@netapp.com
17. Intellectual Property and Copyright Statements
Intellectual Property Statement
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed
to pertain to the implementation or use of the technology described
in this document or the extent to which any license under such
rights might or might not be available; nor does it represent that
it has made any independent effort to identify any such rights.
Information on the procedures with respect to rights in RFC
documents can be found in BCP 78 and BCP 79.
skipping to change at page 31, line 6
of such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository
at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard.  Please address the information to the IETF at
ietf-ipr@ietf.org.
Disclaimer of Validity
This document and the information contained herein are provided on
an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.
Copyright Statement
Copyright (C) The Internet Society (2005). This document is
subject to the rights, licenses and restrictions contained in BCP
78, and except as set forth therein, the authors retain all their
rights.
Acknowledgement
Funding for the RFC Editor function is currently provided by the
Internet Society.