Internet-Draft                                               Tom Talpey
Expires: April 2007                                     Brent Callaghan
Document: draft-ietf-nfsv4-rpcrdma-04                     October, 2006

                       RDMA Transport for ONC RPC
Status of this Memo

By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.
RPC. The RDMA transport binding conveys the benefits of efficient,
bulk data transport over high speed networks, while providing for
minimal change to RPC applications and with no required revision of
the application RPC protocol, or the RPC protocol itself.
Table of Contents

1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . .  2
2.  Abstract RDMA Requirements . . . . . . . . . . . . . . . . .  3
3.  Protocol Outline . . . . . . . . . . . . . . . . . . . . . .  4
3.1.  Short Messages . . . . . . . . . . . . . . . . . . . . . .  5
3.2.  Data Chunks  . . . . . . . . . . . . . . . . . . . . . . .  5
3.3.  Flow Control . . . . . . . . . . . . . . . . . . . . . . .  6
3.4.  XDR Encoding with Chunks . . . . . . . . . . . . . . . . .  7
3.5.  XDR Decoding with Read Chunks  . . . . . . . . . . . . . . 11
3.6.  XDR Decoding with Write Chunks . . . . . . . . . . . . . . 11
3.7.  XDR Roundup and Chunks . . . . . . . . . . . . . . . . . . 12
3.8.  RPC Call and Reply . . . . . . . . . . . . . . . . . . . . 13
3.9.  Padding  . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.  RPC RDMA Message Layout  . . . . . . . . . . . . . . . . . . 17
4.1.  RPC over RDMA Header . . . . . . . . . . . . . . . . . . . 17
4.2.  RPC over RDMA header errors  . . . . . . . . . . . . . . . 19
4.3.  XDR Language Description . . . . . . . . . . . . . . . . . 19
5.  Long Messages  . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.  Message as an RDMA Read Chunk  . . . . . . . . . . . . . . 22
5.2.  RDMA Write of Long Replies (Reply Chunks)  . . . . . . . . 24
6.  Connection Configuration Protocol  . . . . . . . . . . . . . 25
6.1.  Initial Connection State . . . . . . . . . . . . . . . . . 26
6.2.  Protocol Description . . . . . . . . . . . . . . . . . . . 26
7.  Memory Registration Overhead . . . . . . . . . . . . . . . . 28
8.  Errors and Error Recovery  . . . . . . . . . . . . . . . . . 28
9.  Node Addressing  . . . . . . . . . . . . . . . . . . . . . . 28
10.  RPC Binding . . . . . . . . . . . . . . . . . . . . . . . . 29
11.  Security  . . . . . . . . . . . . . . . . . . . . . . . . . 30
12.  IANA Considerations . . . . . . . . . . . . . . . . . . . . 30
13.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . 31
14.  Normative References  . . . . . . . . . . . . . . . . . . . 31
15.  Informative References  . . . . . . . . . . . . . . . . . . 32
16.  Authors' Addresses  . . . . . . . . . . . . . . . . . . . . 33
17.  Intellectual Property and Copyright Statements  . . . . . . 33
Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . .  34
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
this document are to be interpreted as described in [RFC2119].
1. Introduction
RDMA is a technique for efficient movement of data between end
nodes, which becomes increasingly compelling over high speed
transports. By directing data into destination buffers as it is
sent on a network, and placing it via direct memory access by
hardware, the double benefit of faster transfers and reduced host
overhead is obtained.
message, followed by a security verifier and results. All data in
an RPC message is XDR encoded. For a complete description of the
RPC protocol and XDR encoding, see [RFC1831] and [RFC1832].

This protocol assumes the following abstract model for RDMA
transports. These terms, common in the RDMA lexicon, are used in
this document. A more complete glossary of RDMA terms can be found
in [RDMAP].
o  Registered Memory

   All data moved via tagged RDMA operations is resident in
   registered memory at its destination. This protocol assumes
   that each segment of registered memory MUST be identified with
   a steering tag of no more than 32 bits and memory addresses of
   up to 64 bits in length.
o  RDMA Send

   The RDMA provider supports an RDMA Send operation with
   completion signalled at the receiver when data is placed in a
   pre-posted buffer. The amount of transferred data is limited
   only by the size of the receiver's buffer. Sends complete at
   the receiver in the order they were issued at the sender.
o  RDMA Write

   The RDMA provider supports an RDMA Write operation to directly
   place data in the receiver's buffer. An RDMA Write is
   initiated by the sender and completion is signalled at the
   sender. No completion is signalled at the receiver. The
   sender uses a steering tag, memory address and length of the
   remote destination buffer. RDMA Writes are not necessarily
   ordered with respect to one another, but are ordered with
   respect to RDMA Sends; a subsequent RDMA Send completion
   obtained at the receiver guarantees that prior RDMA Write data
   has been successfully placed in the receiver's memory.
o  RDMA Read

   The RDMA provider supports an RDMA Read operation to directly
   place peer source data in the requester's buffer. An RDMA
   Read is initiated by the receiver and completion is signalled
   at the receiver. The receiver provides steering tags, memory
   addresses and a length for the remote source and local
   destination buffers. Since the peer at the data source
   receives no notification of RDMA Read completion, there is an
   assumption that on receiving the data the receiver will signal
All transfers of a call or reply begin with an RDMA Send which
transfers at least the RPC over RDMA header, usually with the call
or reply message appended, or at least some part thereof. Because
the size of what may be transmitted via RDMA Send is limited by the
size of the receiver's pre-posted buffer, the RPC over RDMA
transport provides a number of methods to reduce the amount
transferred by means of the RDMA Send, when necessary, by
transferring various parts of the message using RDMA Read and RDMA
Write.
RPC over RDMA framing replaces all other RPC framing (such as TCP
record marking) when used atop an RPC/RDMA association, even though
the underlying RDMA protocol may itself be layered atop a protocol
with a defined RPC framing (such as TCP). An upper layer may
however define an exchange to dynamically enable RPC/RDMA on an
existing RPC association. Any such exchange must be carefully
architected so as to prevent any ambiguity as to the framing in use
for each side of the connection. Because RPC/RDMA framing delimits
an entire RPC request or reply, any such shift must occur between
distinct RPC messages.
3.1. Short Messages
Many RPC messages are quite short. For example, the NFS version 3
GETATTR request is only 56 bytes: 20 bytes of RPC header plus a 32
byte filehandle argument and 4 bytes of length. The reply to this
common request is about 100 bytes.
There is no benefit in transferring such small messages with an
RDMA Read or Write operation. The overhead in transferring
steering tags and memory addresses is justified only by large
transfers. The critical message size that justifies RDMA transfer
will vary depending on the RDMA implementation and network, but is
typically of the order of a few kilobytes. It is appropriate to
transfer a short message with an RDMA Send to a pre-posted buffer.
The RPC over RDMA header with the short message (call or reply)
immediately following is transferred using a single RDMA Send
operation.
Short RPC messages over an RDMA transport look like this:
     RPC Client                           RPC Server
         |               RPC Call               |
    Send |   ------------------------------>    |
         |                                      |
         |               RPC Reply              |
         |   <------------------------------    | Send
3.2. Data Chunks
copies, but provides correct data alignment at the destination.
3.3. Flow Control

It is critical to provide RDMA Send flow control for an RDMA
connection. RDMA receive operations will fail if a pre-posted
receive buffer is not available to accept an incoming RDMA Send,
and repeated occurrences of such errors can be fatal to the
connection. This is a departure from conventional TCP/IP
networking where buffers are allocated dynamically on an as-needed
basis, and where pre-posting is not required.
It is not practical to provide for fixed credit limits at the RPC
server. Fixed limits scale poorly, since posted buffers are
dedicated to the associated connection until consumed by receive
operations. Additionally for protocol correctness, the RPC server
must always be able to reply to client requests, whether or not new
buffers have been posted to accept future receives. (Note that the
RPC server may in fact be a client at some other layer. For
example, NFSv4 callbacks are processed by the NFSv4 client, acting
as an RPC server. The credit discussions apply equally in either
case.)
Flow control for RDMA Send operations is implemented as a simple
request/grant protocol in the RPC over RDMA header associated with
each RPC message. The RPC over RDMA header for RPC call messages
contains a requested credit value for the RPC server, which MAY be
dynamically adjusted by the caller to match its expected needs.
The RPC over RDMA header for the RPC reply messages provides the
granted result, which MAY have any value except it MUST NOT be zero
when no in-progress operations are present at the server, since
such a value would result in deadlock. The value MAY be adjusted
up or down at each opportunity to match the server's needs or
policies.
The RPC client MUST NOT send unacknowledged requests in excess of
this granted RPC server credit limit. If the limit is exceeded,
the RDMA layer may signal an error, possibly terminating the
connection. Even if an error does not occur, it is NOT REQUIRED
that the server handle the excess request(s), and it MAY return an
RPC error to the client. Also note that the never-zero requirement
implies that an RPC server MUST always provide at least one credit
to each connected RPC client. This does not, however, require that
the server always be prepared to receive a request from each
client, for example when the server is busy processing all granted
client requests.
While RPC calls may complete in any order, the current flow control
limit at the RPC server is known to the RPC client from the Send
ordering properties. It is always the most recent server-granted
credit value minus the number of requests in flight, as the sketch
below illustrates.
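To make the credit accounting concrete, the following is a minimal
sketch, in C, of client-side bookkeeping under the assumption of a
hypothetical connection structure; none of these names are defined
by this protocol.

   #include <stdint.h>

   /*
    * Hedged sketch: client-side credit accounting.  The available
    * send window is the most recent server-granted credit value
    * minus the number of requests currently in flight.
    */
   struct rpcrdma_conn {
       uint32_t granted_credits;    /* from the last reply header */
       uint32_t requests_inflight;  /* sent, not yet replied to */
   };

   static int
   rpcrdma_may_send(const struct rpcrdma_conn *c)
   {
       /* the client MUST NOT exceed the granted credit limit */
       return c->requests_inflight < c->granted_credits;
   }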
Certain RDMA implementations may impose additional flow control
restrictions, such as limits on RDMA Read operations in progress at
the responder. Because these operations are outside the scope of
this protocol, they are not addressed and SHOULD be provided for by
other layers. For example, a simple upper layer RPC consumer might
perform single-issue RDMA Read requests, while a more
sophisticated, multithreaded RPC consumer might implement its own
FIFO queue of such operations. For further discussion of possible
protocol implementations capable of negotiating these values, see
section 6 "Connection Configuration Protocol" of this draft, or
[NFSv4.1].
3.4. XDR Encoding with Chunks

The data comprising an RPC call or reply message is marshaled or
serialized into a contiguous stream by an XDR routine. XDR data
types such as integers, strings, arrays and linked lists are
opaque byte array is large enough to be more efficiently moved via
an RDMA data transfer operation like RDMA Read or RDMA Write.

Semantically speaking, the protocol has no restriction regarding
data types which may or may not be represented by a read or write
chunk. In practice however, efficiency considerations lead to the
conclusion that certain data types are not generally "chunkable".
Typically, only opaque and aggregate data types which may attain
substantial size are considered to be eligible. With today's
hardware this size may be a kilobyte or more. However any object
MAY be chosen for chunking in any given message.
The eligibility of XDR data items to be candidates for being moved
as data chunks (as opposed to being marshaled inline) is not
specified by the RPC over RDMA protocol. Chunk eligibility
criteria MUST be determined by each upper layer in order to provide
for an interoperable specification. One such example with
rationale, for the NFS protocol family, is provided in [NFSDDP].
The interface by which an upper layer implementation communicates
the eligibility of a data item locally to RPC for chunking is out
of scope for this specification. In many implementations, it is
possible to implement a transparent RPC chunking facility.
However, such implementations may lead to inefficiencies, either
because they require the RPC layer to perform expensive
registration and deregistration of memory "on the fly", or they may
When sending any message (request or reply) that contains an
eligible large data chunk, the XDR encoding routine avoids moving
the data into the XDR stream. Instead, it does not encode the data
portion, but records the address and size of each chunk in a
separate "read chunk list" encoded within RPC RDMA transport-
specific headers. Such chunks will be transferred via RDMA Read
operations initiated by the receiver.
When the read chunks are to be moved via RDMA, the memory for each
chunk is registered. This registration may take place within XDR
itself, providing for full transparency to upper layers, or it may
be performed by any other specific local implementation.
Additionally, when making an RPC call that can result in bulk data
transferred in the reply, it is desirable to provide chunks to
accept the data directly via RDMA Write. These write chunks will
therefore be pre-filled by the RPC server prior to responding, and
XDR decode of the data at the client will not be required. These
chunks undergo a similar registration and advertisement via "write
chunk lists" built as a part of XDR encoding.
Some RPC client implementations are not able to determine where an
RPC call's results reside during the "encode" phase. This makes it
difficult or impossible for the RPC client layer to encode the
write chunk list at the time of building the request. In this
case, it is difficult for the RPC implementation to provide
transparency to the RPC consumer, which may require recoding to
provide result information at this earlier stage.
Therefore if the RPC client does not make a write chunk list
available to receive the result, then the RPC server MAY return
data inline in the reply, or if the upper layer specification
permits, it MAY be returned via a read chunk list. It is NOT
RECOMMENDED that upper layer RPC client protocol specifications
omit write chunk lists for eligible replies, due to the lower
performance of the additional handshaking to perform data transfer,
and the requirement that the RPC server must expose (and preserve)
the reply data for a period of time. In the absence of a server-
provided read chunk list in the reply, if the encoded reply
overflows the posted receive buffer, the RPC will fail with an RDMA
transport error.
When any data within a message is provided via either read or write
chunks, the chunk itself refers only to the data portion of the XDR
stream element. In particular, for counted fields (e.g. a "<>"
encoding) the byte count which is encoded as part of the field
remains in the XDR stream, and is also encoded in the chunk list.
The data portion is however elided from the encoded XDR stream, and
is transferred as part of chunk list processing. This is important
to maintain upper layer implementation compatibility - both the
count and the data must be transferred as part of the logical XDR
stream. While the chunk list processing results in the data being
available to the upper layer peer for XDR decoding, the length
present in the chunk list entries is not. Any byte count in the
XDR stream MUST match the sum of the byte counts present in the
corresponding read or write chunk list. If they do not agree, an
RPC protocol encoding error results.
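As an illustration of this rule, a receiver might validate a
counted field against its chunk list along the lines of the
following hedged C sketch; the function and parameter names are
hypothetical.

   #include <stddef.h>
   #include <stdint.h>

   /*
    * Hedged sketch: the byte count encoded in the XDR stream must
    * equal the sum of the byte counts in the corresponding read
    * or write chunk list, else an RPC encoding error results.
    */
   static int
   chunk_counts_agree(uint32_t xdr_count,
                      const uint32_t *chunk_lens, size_t nchunks)
   {
       uint64_t sum = 0;
       for (size_t i = 0; i < nchunks; i++)
           sum += chunk_lens[i];
       return sum == (uint64_t)xdr_count;
   }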
The following items are contained in a chunk list entry.

Handle

   Steering tag or handle obtained when the chunk memory is
   registered for RDMA.
Length
   region-based registrations can be readily supported, and by
   using a single, universal chunk representation, the RPC RDMA
   protocol implementation is simplified to its most general
   form.
Position

   For data which is to be encoded, the position in the XDR
   stream where the chunk would normally reside. Note that the
   chunk therefore inserts its data into the XDR stream at this
   position, but its transfer is no longer "inline". Also note
   therefore that all chunks belonging to a single RPC argument
   or result will have the same position. For data which is to
   be decoded, no position is used.
When XDR marshaling is complete, the chunk list is XDR encoded,
then sent to the receiver prepended to the RPC message. Any source
data for a read chunk, or the destination of a write chunk, remains
behind in the sender's registered memory and its actual payload is
not marshaled into the request or reply.
   +----------------+----------------+-------------
   |  RPC over RDMA |                |
   |  header w/     |   RPC Header   | Non-chunk args/results
   |  chunks        |                |
   +----------------+----------------+-------------
Read chunk lists and write chunk lists are structured somewhat
differently. This is due to the different usage - read chunks are
decoded and indexed by their argument's or result's position in the
XDR data stream; their size is always known. Write chunks on the
other hand are used only for results, and have neither a
preassigned offset in the XDR stream, nor a size until the results
are produced, since the buffers may be only partially filled, or
may not be used for results at all. Their presence in the XDR
stream is therefore not known until the reply is processed. The
mapping of Write chunks onto designated NFS procedures and their
results is described in [NFSDDP].
Therefore, read chunks are encoded into a read chunk list as a
single array, with each entry tagged by its (known) size and its
argument's or result's position in the XDR stream. Write chunks
are encoded as a list of arrays of RDMA buffers, with each list
element (an array) providing buffers for a separate result.
Individual write chunk list elements MAY thereby result in being
partially or fully filled, or in fact not being filled at all.
Unused write chunks, or unused bytes in write chunk buffer lists,
are not returned as results, and their memory is returned to the
upper layer as part of RPC completion. However, the RPC layer
SHOULD NOT assume that the buffers have not been modified.
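The two shapes can be pictured as C types as follows, reusing the
illustrative chunk_entry sketched above; these types are
assumptions for exposition, not the protocol's XDR.

   /*
    * Hedged sketch: a read chunk list is a single array of
    * position-tagged segments; a write chunk list is a list of
    * arrays, one array of segments per result.
    */
   struct read_chunk_list {
       uint32_t nchunks;
       struct chunk_entry *chunks;    /* each carries a position */
   };

   struct write_chunk_array {
       uint32_t nsegments;
       struct chunk_entry *segments;  /* buffers for one result */
   };

   struct write_chunk_list {
       uint32_t narrays;
       struct write_chunk_array *arrays; /* one entry per result */
   };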
3.5. XDR Decoding with Read Chunks
The XDR decode process moves data from an XDR stream into a data
structure provided by the RPC client or server application. Where
elements of the destination data structure are buffers or strings,
the RPC application can either pre-allocate storage to receive the
data, or leave the string or buffer fields null and allow the XDR
decode stage of RPC processing to automatically allocate storage of
sufficient size.
When decoding a message from an RDMA transport, the receiver first
XDR decodes the chunk lists from the RPC over RDMA header, then
proceeds to decode the body of the RPC message (arguments or
results). Whenever the XDR offset in the decode stream matches
that of a chunk in the read chunk list, the XDR routine initiates
an RDMA Read to bring over the chunk data into locally registered
memory for the destination buffer.
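In a hedged C sketch (hypothetical stream type and helpers, with
the chunk_entry type sketched in section 3.4), the decode step
looks like this:

   #include <stddef.h>
   #include <stdint.h>

   /*
    * Hedged sketch: decode one element, pulling it via RDMA Read
    * when the current XDR offset matches a read chunk's position.
    * rdma_read() and decode_inline() stand in for local RDMA and
    * XDR facilities; they are not defined by this protocol.
    */
   struct xdr_stream { uint32_t offset; /* ... */ };

   void rdma_read(void *dest, const struct chunk_entry *src);
   void decode_inline(struct xdr_stream *xdr, void *dest);

   static void
   decode_next(struct xdr_stream *xdr,
               const struct chunk_entry *rc, size_t nchunks,
               size_t *next, void *dest)
   {
       if (*next < nchunks && xdr->offset == rc[*next].position) {
           rdma_read(dest, &rc[*next]);  /* direct placement */
           xdr->offset += rc[*next].length;
           (*next)++;
       } else {
           decode_inline(xdr, dest);     /* ordinary XDR decode */
       }
   }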
When processing an RPC request, the RPC receiver (RPC server)
acknowledges its completion of use of the source buffers by simply
replying to the RPC sender (client), and the peer may then free all
source buffers advertised by the request.
When processing an RPC reply, after completing such a transfer the
RPC receiver (client) MUST issue an RDMA_DONE message (described in
Section 3.8) to notify the peer (server) that the source buffers
can be freed.
The read chunk list is constructed and used entirely within the
RPC/XDR layer. Other than specifying the minimum chunk size, the
management of the read chunk list is automatic and transparent to
an RPC application.
3.6. XDR Decoding with Write Chunks

When a "write chunk list" is provided for the results of the RPC
call, the RPC server MUST provide any corresponding data via RDMA
Write to the memory referenced in the chunk list entries. The RPC
reply conveys this by returning the write chunk list to the client
with the lengths rewritten to match the actual transfer. The XDR
"decode" of the reply therefore performs no local data transfer but
merely returns the length obtained from the reply.
Each decoded result consumes one entry in the write chunk list,
which in turn consists of an array of RDMA segments. The length is
therefore the sum of all returned lengths in all segments
comprising the corresponding list entry. As each list entry is
"decoded", the entire entry is consumed.
The write chunk list is constructed and used by the RPC
application. The RPC/XDR layer simply conveys the list between
client and server and initiates the RDMA Writes back to the client.
The mapping of write chunk list entries to procedure arguments MUST
be determined for each protocol. An example of a mapping is
described in [NFSDDP].
3.7. XDR Roundup and Chunks
The XDR protocol requires 4-byte alignment of each new encoded
element in any XDR stream. This requirement is for efficiency and
ease of decode/unmarshaling at the receiver - if the XDR stream
buffer begins on a native machine boundary, then the XDR elements
will lie on similarly predictable offsets in memory.
Within XDR, when non-4-byte encodes (such as an odd-length string
or bulk data) are marshaled, their length is encoded literally,
while their data is padded to begin the next element at a 4-byte
boundary in the XDR stream. For TCP or RDMA inline encoding, this
minimal overhead is required because the transport-specific framing
relies on the fact that the relative offset of the elements in the
XDR stream from the start of the message determines the XDR
position during decode.
On the other hand, RPC/RDMA Read chunks carry the XDR position of
each chunked element and length of the Chunk segment, and can be
placed by the receiver exactly where they belong in the receiver's
memory without regard to the alignment of their position in the XDR
stream. Since any rounded-up data is not actually part of the
upper layer's message, the receiver will not reference it, and
there is no reason to set it to any particular value in the
receiver's memory.
When roundup is present at the end of a sequence of chunks, the
length of the sequence will terminate it at a non-4-byte XDR
position. When the receiver proceeds to decode the remaining part
of the XDR stream, it inspects the XDR position indicated by the
next chunk. Because this position will not match (else roundup
would not have occurred), the receiver decoding will fall back to
inspecting the remaining inline portion. If, in turn, no data
remains to be decoded from the inline portion, then the receiver
MUST conclude that roundup is present, and therefore advances the
XDR decode position to that indicated by the next chunk (if any).
In this way, roundup is passed without ever actually transferring
additional XDR bytes.
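This inference can be sketched in C as follows (hypothetical
names, reusing the stream and chunk types assumed in earlier
sections):

   /* assumed local helper, not defined by this protocol */
   size_t inline_bytes_remaining(const struct xdr_stream *xdr);

   /*
    * Hedged sketch: if the next read chunk's XDR position is
    * ahead of the current offset and no inline data remains, the
    * gap can only be roundup; skip forward without transferring
    * any bytes.
    */
   static void
   skip_roundup(struct xdr_stream *xdr,
                const struct chunk_entry *next_chunk)
   {
       if (next_chunk != NULL &&
           xdr->offset != next_chunk->position &&
           inline_bytes_remaining(xdr) == 0)
           xdr->offset = next_chunk->position;
   }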
Some protocol operations over RPC/RDMA, for instance NFS writes of
data encountered at the end of a file or in direct i/o situations,
commonly yield these roundups within RDMA Read Chunks. Because any
roundup bytes are not actually present in the data buffers being
written, memory for these bytes would come from noncontiguous
buffers, either as an additional memory registration segment, or as
an additional Chunk. The overhead of these operations can be
significant to the sender, which must marshal them, and even higher
to the receiver, which must transfer them. Senders SHOULD therefore
avoid encoding individual RDMA Read Chunks for roundup whenever
possible. It is acceptable, but not necessary, to include roundup
data in an existing RDMA Read Chunk, but only if it is already
present in the XDR stream to carry upper layer data.
Note that there is no exposure of additional data at the sender due
to eliding roundup data from the XDR stream, since any additional
sender buffers are never exposed to the peer. The data is
literally not there to be transferred.
For RDMA Write Chunks, a simpler encoding method applies. Again,
roundup bytes are not transferred; instead, the chunk length sent
to the receiver in the reply is simply increased to include any
roundup. Because of the requirement that the RDMA Write chunks are
filled sequentially without gaps, this situation can only occur on
the final chunk receiving data. Therefore there is no opportunity
for roundup data to insert misalignment or positional gaps into the
XDR stream.
3.8. RPC Call and Reply

The RDMA transport for RPC provides three methods of moving data
between RPC client and server:

Inline

   Data are moved between RPC client and server within an RDMA
   Send.

RDMA Read
         |                                      |
         |                 Done                 |
    Send |   ------------------------------>    |
The final Done message allows the RPC client to signal the server
that it has received the chunks, so the server can de-register and
free the memory holding the chunks. A Done completion is not
necessary for an RPC call, since the RPC reply Send is itself a
receive completion notification. In the event that the client
fails to return the Done message within some timeout period, the
server MAY conclude that a protocol violation has occurred and
close the RPC connection, or it MAY proceed with a de-register and
free its chunk buffers. This may result in a fatal RDMA error if
the client later attempts to perform an RDMA Read operation, which
amounts to the same thing.
The use of read chunks in RPC reply messages is much less efficient
than providing write chunks in the originating RPC calls, due to
the additional message exchanges, the need for the RPC server to
advertise buffers to the peer, the necessity of the server
maintaining a timer for the purpose of recovery from misbehaving
clients, and the need for additional memory registration. Their
use is NOT RECOMMENDED by upper layers where efficiency is a
primary concern [NFSDDP]. However, they MAY be employed by upper
layer protocol bindings which are primarily concerned with
transparency, since they can frequently be implemented completely
within the RPC lower layers.
It is important to note that the Done message consumes a credit at
the RPC server. The RPC server SHOULD provide sufficient credits
to the client to allow the Done message to be sent without deadlock
(driving the outstanding credit count to zero). The RPC client
MUST account for its required Done messages to the server in its
accounting of available credits, and the server SHOULD replenish
any credit consumed by its use of such exchanges at its earliest
opportunity.
Finally, it is possible to conceive of RPC exchanges that involve
any or all combinations of write chunks in the RPC call, read
chunks in the RPC call, and read chunks in the RPC reply. Support
for such exchanges is straightforward from a protocol perspective,
but in practice such exchanges would be quite rare, limited to
upper layer protocol exchanges which transferred bulk data in both
the call and corresponding reply.
3.9. Padding
Alignment of specific opaque data enables certain scatter/gather
optimizations. Padding leverages the useful property that RDMA
transfers preserve alignment of data, even when they are placed
into pre-posted receive buffers by Sends.
Many servers can make good use of such padding. Padding allows the
chaining of RDMA receive buffers such that any data transferred by
RDMA on behalf of RPC requests will be placed into appropriately
aligned buffers on the system that receives the transfer. In this
way, the need for servers to perform RDMA Read to satisfy all but
the largest client writes is obviated.
The effect of padding is demonstrated below showing prior bytes on
an XDR stream (XXX) followed by an opaque field consisting of four
length bytes (LLLL) followed by data bytes (DDDD). The receiver of
the RDMA Send has posted two chained receive buffers. Without
padding, the opaque data is split across the two buffers. With the
addition of padding bytes (ppp) prior to the first data byte, the
data can be forced to align correctly in the second buffer.
                                        Buffer 1       Buffer 2
   Unpadded                          --------------  --------------
   XXXXXXXLLLLDDDDDDDDDDDDDD --->    XXXXXXXLLLLDDD  DDDDDDDDDDD

   Padded
   XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp  DDDDDDDDDDDDDD
Padding is implemented completely within the RDMA transport
encoding, flagged with a specific message type. Where padding is
applied, two values are passed to the peer: an "rdma_align" which
is the padding value used, and "rdma_thresh", which is the opaque
data size at or above which padding is applied. For instance, if
the server is using chained 4 KB receive buffers, then up to (4 KB
- 1) padding bytes could be used to achieve alignment of the data.
The XDR routine at the peer MUST consult these values when decoding
opaque values. Where the decoded length exceeds the rdma_thresh,
the XDR decode MUST skip over the appropriate padding as indicated
by rdma_align and the current XDR stream position.
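The decode-side rule can be sketched as follows in hedged C;
rdma_align and rdma_thresh are the header values described above,
while the stream type and helpers are assumptions.

   /* assumed local helpers, not defined by this protocol */
   uint32_t pad_bytes(uint32_t offset, uint32_t align);
   void copy_opaque(struct xdr_stream *xdr, uint32_t len);

   /*
    * Hedged sketch: when decoding an opaque field whose length is
    * at or above rdma_thresh, first skip the pad bytes that the
    * sender inserted before the data, computed from rdma_align
    * and the current XDR stream position.
    */
   static void
   decode_padded_opaque(struct xdr_stream *xdr, uint32_t len,
                        uint32_t rdma_align, uint32_t rdma_thresh)
   {
       if (rdma_align != 0 && len >= rdma_thresh)
           xdr->offset += pad_bytes(xdr->offset, rdma_align);
       copy_opaque(xdr, len);  /* ordinary opaque data decode */
   }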
4. RPC RDMA Message Layout

RPC call and reply messages are conveyed across an RDMA transport
with a prepended RPC over RDMA header. The RPC over RDMA header
includes data for RDMA flow control credits, padding parameters and
lists of addresses that provide direct data placement via RDMA Read
and Write operations. The layout of the RPC message itself is
unchanged from that described in [RFC1831] except for the possible
exclusion of large data chunks that will be moved by RDMA Read or
Write operations. If the RPC message (along with the RPC over RDMA
header) is too long for the posted receive buffer (even after any
large chunks are removed), then the entire RPC message MAY be moved
separately as a chunk, leaving just the RPC over RDMA header in the
RDMA Send.
4.1. RPC over RDMA Header

The RPC over RDMA header begins with four 32-bit fields that are
always present and which control the RDMA interaction including
RDMA-specific flow control. These are then followed by a number of
items such as chunk lists and padding which may or may not be
present depending on the type of transmission. The four fields
which are always present are:
1. Transaction ID (XID).

   The XID generated for the RPC call and reply. Having the XID
   at the beginning of the message makes it easy to establish the
   message context. This XID mirrors the XID in the RPC header,
   and takes precedence. The receiver MAY ignore the XID in the
   RPC header, if it so chooses.
2. Version number.

   This version of the RPC RDMA message protocol is 1. The
   version number MUST be increased by one whenever the format of
   the RPC RDMA messages is changed.
3. Flow control credit value.

   When sent in an RPC call message, the requested value is
   provided. When sent in an RPC reply message, the granted
   value is returned. RPC calls SHOULD NOT be sent in excess of
   the currently granted limit. (A brief, illustrative credit-
   accounting sketch follows the message type list below.)
4. Message type.

   o RDMA_MSG = 0 indicates that chunk lists and RPC message
     follow.

   o RDMA_NOMSG = 1 indicates that after the chunk lists there
     is no RPC message. In this case, the chunk lists provide
     information to allow the message proper to be transferred
     using RDMA Read or Write and inserted in its place by the
     receiver.

   o RDMA_MSGP = 2 indicates that a chunk list and RPC message
     with some padding follow.

   o RDMA_DONE = 3 indicates that the message signals the
     completion of a chunk transfer via RDMA Read.

   o RDMA_ERROR = 4 is used to signal any detected error(s) in
     the RPC RDMA chunk encoding.
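The credit check referenced in field 3 might look like the
following non-normative C sketch; the structure and function names
are invented for illustration.

   #include <stdint.h>

   /* Hypothetical sender-side credit state, one per connection. */
   struct rpcrdma_credit_state {
           uint32_t granted;    /* last value granted by peer    */
           uint32_t inflight;   /* calls sent, replies not seen  */
   };

   /* A call SHOULD NOT be sent in excess of the granted limit. */
   static int
   rpcrdma_may_send_call(const struct rpcrdma_credit_state *cs)
   {
           return cs->inflight < cs->granted;
   }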
Because the version number is encoded as part of this header, and
the RDMA_ERROR message type is used to indicate errors, these first
four fields and the start of the following message body MUST always
remain aligned at these fixed offsets for all versions of the RPC
over RDMA header.
For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write
chunk lists follow. If the Read chunk list is null (a 32-bit word
of zeros), then there are no chunks to be transferred separately
and the RPC message follows in its entirety. If non-null, it is
the beginning of an XDR encoded sequence of Read chunk list
entries. If the Write chunk list is non-null, then an XDR encoded
sequence of Write chunk entries follows.
If the message type is RDMA_MSGP, then two additional fields that
specify the padding alignment and threshold are inserted prior to
the Read and Write chunk lists.
A header of message type RDMA_MSG or RDMA_MSGP MUST be followed by
the RPC call or RPC reply message body, beginning with the XID.
The XID in the RDMA_MSG or RDMA_MSGP header MUST match this.
   +--------+---------+---------+-----------+-------------+----------
   |        |         |         |  Message  |    NULLs    | RPC Call
   |  XID   | Version | Credits |   Type    |     or      |    or
   |        |         |         |           | Chunk Lists | Reply Msg
   +--------+---------+---------+-----------+-------------+----------
Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or
RPC message follows. As an implementation hint: a gather operation
on the Send of the RDMA RPC message can be used to marshal the
initial header, the chunk list, and the RPC message itself.
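As a non-normative illustration of this gather hint, the C sketch
below posts a three-segment Send. Here struct rdma_sge and
rdma_post_send() are placeholders for whatever the local RDMA API
provides (for example, ibv_sge and ibv_post_send in the OpenFabrics
verbs); only the three-segment layout is the point.

   #include <stdint.h>

   /* Placeholder scatter/gather element; real RDMA APIs differ. */
   struct rdma_sge {
           void     *addr;     /* start of registered region     */
           uint32_t  length;   /* bytes to send from this region */
           uint32_t  lkey;     /* local registration handle      */
   };

   /* Assumed provided by the local RDMA stack. */
   extern int rdma_post_send(void *qp, struct rdma_sge *sgl, int n);

   /*
    * Marshal one RDMA_MSG Send from three pre-built pieces:
    * the fixed header, the chunk lists, and the RPC message body.
    */
   int
   rpcrdma_send_msg(void *qp,
                    struct rdma_sge hdr,     /* XID..message type */
                    struct rdma_sge chunks,  /* read/write lists  */
                    struct rdma_sge rpcmsg)  /* RPC call or reply */
   {
           struct rdma_sge sgl[3] = { hdr, chunks, rpcmsg };
           return rdma_post_send(qp, sgl, 3);
   }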
4.2. RPC over RDMA header errors
When a peer receives an RPC RDMA message, it MUST perform the
following basic validity checks on the header and chunk contents.
If such errors are detected in the request, an RDMA_ERROR reply
MUST be generated.
Two types of errors are defined, version mismatch and invalid chunk
format. When the peer detects an RPC over RDMA header version
which it does not support (currently this draft defines only
version 1), it replies with an error code of ERR_VERS, and provides
the low and high inclusive version numbers it does, in fact,
support. The version number in this reply MAY be any value
otherwise valid at the receiver. When other decoding errors are
detected in the header or chunks, either an RPC decode error MAY be
returned, or the RPC/RDMA error code ERR_CHUNK MUST be returned.
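These checks might be implemented as in the following non-normative
C sketch. chunks_decode_ok() is an assumed local helper, and only
the version and chunk checks mirror the text above.

   #include <stdint.h>
   #include <arpa/inet.h>

   #define RPCRDMA_VERSION 1

   enum rpcrdma_errcode { ERR_VERS = 1, ERR_CHUNK = 2 };

   /* Assumed local helper: decode and validate the chunk lists. */
   extern int chunks_decode_ok(const uint8_t *hdr, uint32_t len);

   /*
    * Returns 0 if the header passes basic validity checks,
    * otherwise the error code to place in an RDMA_ERROR reply.
    */
   int
   rpcrdma_check_header(const uint32_t *hdr32, uint32_t len)
   {
           uint32_t vers = ntohl(hdr32[1]); /* second 32-bit field */

           if (vers != RPCRDMA_VERSION)
                   return ERR_VERS;  /* reply with vers low/high  */

           if (!chunks_decode_ok((const uint8_t *)hdr32, len))
                   return ERR_CHUNK;

           return 0;
   }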
4.3. XDR Language Description

Here is the message layout in XDR language.
   struct xdr_rdma_segment {
           uint32 handle;   /* Registered memory handle */
           uint32 length;   /* Length of the chunk in bytes */
           uint64 offset;   /* Chunk virtual address or offset */
   };
   enum rpc_rdma_errcode {
           ERR_VERS = 1,
           ERR_CHUNK = 2
   };

   union rpc_rdma_error switch (rpc_rdma_errcode err) {
           case ERR_VERS:
             uint32 rdma_vers_low;
             uint32 rdma_vers_high;
           case ERR_CHUNK:
             void;
           default:
             uint32 rdma_extra[8];
   };
5. Long Messages
The receiver of RDMA Send messages is required by RDMA to have
previously posted one or more adequately sized buffers. The RPC
client can inform the server of the maximum size of its RDMA Send
messages via the Connection Configuration Protocol described later
in this document.
Since RPC messages are frequently small, memory savings can be
achieved by posting small buffers. Even large messages like NFS
READ or WRITE will be quite small once the chunks are removed from
the message. However, there may be large messages that would
demand a very large buffer be posted, where the contents of the
buffer may not be a chunkable XDR element. A good example is an
NFS READDIR reply which may contain a large number of small
filename strings. Also, the NFS version 4 protocol [RFC3530]
features COMPOUND request and reply messages of unbounded length.
5.1. Message as an RDMA Read Chunk
If the receiver gets an RPC over RDMA header with a message type of
RDMA_NOMSG and finds an initial read chunk list entry with a zero
XDR position, it allocates a registered buffer and issues an RDMA
Read of the long RPC message into it. The receiver then proceeds
to XDR decode the RPC message as if it had received it inline with
the Send data. Further decoding may issue additional RDMA Reads to
bring over additional chunks.
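A non-normative sketch of this receiver path follows; all helper
names are invented for illustration.

   #include <stddef.h>
   #include <stdint.h>

   /* Assumed local helpers, named for illustration only. */
   extern void *alloc_registered(uint32_t len);
   extern int   rdma_read_into(void *dst, const void *chunk);
   extern int   xdr_decode_rpc(const void *msg, uint32_t len);

   int
   rpcrdma_recv_long_call(const void *read_chunk, /* XDR pos 0 */
                          uint32_t msg_len)
   {
           void *msg = alloc_registered(msg_len);

           /* Pull the long RPC call over with an RDMA Read, then
            * decode it exactly as if it had arrived inline. */
           if (msg == NULL || rdma_read_into(msg, read_chunk) != 0)
                   return -1;
           return xdr_decode_rpc(msg, msg_len);
   }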
Although the handling of long messages requires one extra network
turnaround, in practice these messages will be rare if the posted
receive buffers are correctly sized, and of course they will be
non-existent for RDMA-aware upper layers.
A long call RPC with request supplied via RDMA Read looks like
this:

      RPC Client                      RPC Server
          |     RPC over RDMA Header      |
     Send | ----------------------------> |
          |                               |
          |      Long RPC Call Msg        |
          | +---------------------------- | Read
          | v---------------------------> |
          |                               |
          |     RPC over RDMA Reply       |
          | <---------------------------- | Send
An RPC with long reply returned via RDMA Read looks like this:

      RPC Client                      RPC Server
          |          RPC Call             |
     Send | ----------------------------> |
          |                               |
          |     RPC over RDMA Header      |
          | <---------------------------- | Send
          |                               |
          |      Long RPC Reply Msg       |
     Read | ----------------------------+ |
          | <---------------------------v |
          |                               |
          |            Done               |
     Send | ----------------------------> |
It is possible for a single RPC procedure to employ both a long
call for its arguments, and a long reply for its results. However,
such an operation is atypical, as few upper layers define such
exchanges.
5.2. RDMA Write of Long Replies (Reply Chunks)
A superior method of handling long RPC replies is to have the RPC
client post a large buffer into which the server can write a large
RPC reply. This has the advantage that an RDMA Write may be
slightly faster in network latency than an RDMA Read, and does not
require the server to wait for the completion as it must for RDMA
Read. Additionally, for a reply it removes the need for the
RDMA_DONE message that would be required if the large reply were
returned as a Read chunk.
This protocol supports direct return of a large reply via the
inclusion of an OPTIONAL rdma_reply write chunk after the read
chunk list and the write chunk list. The client allocates a buffer
sized to receive a large reply and enters its steering tag, address
and length in the rdma_reply write chunk. If the reply message is
too long to return inline with an RDMA Send (exceeds the size of
the client's posted receive buffer), even with read chunks removed,
then the RPC server performs an RDMA Write of the RPC reply message
into the buffer indicated by the rdma_reply chunk. If the client
doesn't provide an rdma_reply chunk, or if it's too small, then if
the upper layer specification permits, the message MAY be returned
as a Read chunk.
An RPC with long reply returned via RDMA Write looks like this:

      RPC Client                      RPC Server
          |   RPC Call with rdma_reply    |
     Send | ----------------------------> |
          |                               |
          |      Long RPC Reply Msg       |
          | <---------------------------- | Write
          |                               |
eligible RPC operations such as NFS READDIR, which would otherwise
require extensive chunk management within the results or use of
RDMA Read and a Done message. [NFSDDP]
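Summarizing the long-reply alternatives of sections 5.1 and 5.2, a
server's reply-path selection might be sketched as follows; all
sizes and helper names are illustrative assumptions, and the final
fallback applies only where the upper layer permits it.

   #include <stdint.h>

   /* Assumed local send/write primitives and reply state. */
   extern int send_inline_reply(void *r);        /* RDMA Send      */
   extern int write_reply_chunk(void *r);        /* Write + Send   */
   extern int post_reply_as_read_chunk(void *r); /* peer Reads it  */

   int
   rpcrdma_send_reply(void *r,
                      uint32_t reply_len,      /* encoded reply    */
                      uint32_t recv_bufsize,   /* client recv max  */
                      uint32_t rdma_reply_len) /* 0 if no chunk    */
   {
           if (reply_len <= recv_bufsize)
                   return send_inline_reply(r);

           /* Preferred long-reply path: RDMA Write into the
            * client-provided rdma_reply chunk, if large enough. */
           if (rdma_reply_len >= reply_len)
                   return write_reply_chunk(r);

           /* Fallback (if the upper layer permits): return the
            * message as a Read chunk, completed by RDMA_DONE. */
           return post_reply_as_read_chunk(r);
   }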
6. Connection Configuration Protocol
RDMA Send operations require the receiver to post one or more
buffers at the RDMA connection endpoint, each large enough to
receive the largest Send message. Buffers are consumed as Send
messages are received. If a buffer is too small, or if there are
no buffers posted, the RDMA transport MAY return an error and break
the RDMA connection. The receiver MUST post sufficient, adequately
sized buffers to avoid buffer overrun or capacity errors.
The protocol described above includes only a mechanism for managing
the number of such receive buffers, and no explicit features to
allow the RPC client and server to provision or control buffer
sizing, nor any other session parameters.
In the past, this type of connection management has not been
necessary for RPC. RPC over UDP or TCP does not have a protocol to
negotiate the link. The server can get a rough idea of the maximum
size of messages from the server protocol code. However, a
protocol to negotiate transport features on a more dynamic basis is
desirable.
The Connection Configuration Protocol allows the client to pass its
connection requirements to the server, and allows the server to
inform the client of its connection limits.
Use of the Connection Configuration Protocol by an upper layer is
OPTIONAL.
6.1. Initial Connection State
This protocol MAY be used for connection setup prior to the use of
another RPC protocol that uses the RDMA transport. It operates in-
band, i.e. it uses the connection itself to negotiate the
connection parameters. To provide a basis for connection
negotiation, the connection is assumed to provide a basic level of
interoperability: the ability to exchange at least one RPC message
at a time that is at least 1 KB in size. The server MAY exceed
this basic level of configuration, but the client MUST NOT assume
it.
6.2. Protocol Description
Version 1 of the Connection Configuration protocol consists of a
single procedure that allows the client to inform the server of its
connection requirements and the server to return connection
information to the client.
The maxcall_sendsize argument is the maximum size of an RPC call
message that the client will send inline in an RDMA Send message to
the server. The server MAY return a maxcall_sendsize value that is
smaller or larger than the client's request. The client MUST NOT
send an inline call message larger than what the server will
accept. The maxcall_sendsize limits only the size of inline RPC
calls. It does not limit the size of long RPC messages transferred
as an initial chunk in the Read chunk list.
The maxreply_sendsize is the maximum size of an inline RPC message
that the client will accept from the server.
The maxrdmaread is the maximum number of RDMA Reads which may be
active at the peer. This number correlates to the RDMA incoming
RDMA Read count ("IRD") configured into each originating endpoint
by the client or server. If more than this number of RDMA Read
operations by the connected peer are issued simultaneously,
connection loss or suboptimal flow control may result, therefore
the value SHOULD be observed at all times. The peers' values need
not be equal. If zero, the peer MUST NOT issue requests which
require RDMA Read to satisfy, as no transfer will be possible.
The align value is the value recommended by the server for opaque
data values such as strings and counted byte arrays. The client
MAY use this value to compute the number of prepended pad bytes
when XDR encoding opaque values in the RPC call message.
   typedef unsigned int uint32;

   struct config_rdma_req {
           uint32  maxcall_sendsize;
                   /* max size of inline RPC call */
           uint32  maxreply_sendsize;
                   /* max size of inline RPC reply */
           uint32  maxrdmaread;
                   /* max active RDMA Reads at client */
   };

   struct config_rdma_reply {
           uint32  maxcall_sendsize;
                   /* max call size accepted by server */
           uint32  align;
                   /* server's receive buffer alignment */
           uint32  maxrdmaread;
                   /* max active RDMA Reads at server */
   };
   program CONFIG_RDMA_PROG {
           version VERS1 {
                   /*
                    * Config call/reply
                    */
                   config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
           } = 1;
   } = 100400;
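As a usage illustration only, a client might invoke this procedure
through the standard ONC RPC interface as sketched below. The
xdr_config_rdma_req and xdr_config_rdma_reply routines are assumed
to be generated (e.g. by rpcgen) from the definitions above.

   #include <rpc/rpc.h>
   #include <sys/time.h>

   #define CONFIG_RDMA_PROG 100400
   #define VERS1 1
   #define CONF_RDMA 1

   /* Assumed rpcgen output for the structures defined above. */
   extern bool_t xdr_config_rdma_req(XDR *, void *);
   extern bool_t xdr_config_rdma_reply(XDR *, void *);

   int
   negotiate_connection(CLIENT *clnt, void *req, void *reply)
   {
           struct timeval tv = { 25, 0 };
           enum clnt_stat stat;

           stat = clnt_call(clnt, CONF_RDMA,
                            (xdrproc_t)xdr_config_rdma_req,
                            (caddr_t)req,
                            (xdrproc_t)xdr_config_rdma_reply,
                            (caddr_t)reply, tv);
           return stat == RPC_SUCCESS ? 0 : -1;
   }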
7. Memory Registration Overhead
RDMA requires that all data be transferred between registered
memory regions at the source and destination. All protocol headers
as well as separately transferred data chunks use registered
memory. Since the cost of registering and de-registering memory
can be a large proportion of the RDMA transaction cost, it is
important to minimize registration activity. This is easily
achieved within RPC controlled memory by allocating chunk list data
and RPC headers in a reusable way from pre-registered pools.
The data chunks transferred via RDMA MAY occupy memory that
persists outside the bounds of the RPC transaction. Hence, the
default behavior of an RPC over RDMA transport is to register and
de-register these chunks on every transaction. However, this is
not a limitation of the protocol, only of the existing local RPC
API. The API is easily extended through such functions as
rpc_control(3) to change the default behavior so that the
application can assume responsibility for controlling memory
registration through an RPC-provided registered memory allocator.
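A pre-registered pool of the kind described above might be sketched
as follows; rdma_register() stands in for the local registration
call (for example, ibv_reg_mr), and the round-robin allocator is
purely illustrative.

   #include <stdint.h>
   #include <stdlib.h>

   /* Assumed one-time registration call from the RDMA stack. */
   extern uint32_t rdma_register(void *addr, size_t len);

   struct reg_pool {
           void     *base;    /* registered once, reused forever */
           size_t    bufsize; /* size of each header buffer      */
           uint32_t  lkey;    /* registration handle             */
           int       nbufs;
           int       next;    /* trivial round-robin allocator   */
   };

   /* Register a pool of RPC header buffers up front. */
   int
   reg_pool_init(struct reg_pool *p, size_t bufsize, int nbufs)
   {
           p->base = malloc(bufsize * nbufs);
           if (p->base == NULL)
                   return -1;
           p->lkey = rdma_register(p->base, bufsize * nbufs);
           p->bufsize = bufsize;
           p->nbufs = nbufs;
           p->next = 0;
           return 0;
   }

   /* Allocation is now pointer arithmetic, no registration. */
   void *
   reg_pool_get(struct reg_pool *p)
   {
           void *buf = (char *)p->base + p->next * p->bufsize;
           p->next = (p->next + 1) % p->nbufs;
           return buf;
   }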
8. Errors and Error Recovery
9. Node Addressing
In setting up a new RDMA connection, the first action by an RPC
client will be to obtain a transport address for the server. The
mechanism used to obtain this address, and to open an RDMA
connection is dependent on the type of RDMA transport, and is the
responsibility of each RPC protocol binding and its local
implementation.
10. RPC Binding
RPC services normally register with a portmap or rpcbind [RFC1833]
service, which associates an RPC program number with a service
address. (In the case of UDP or TCP, the service address for NFS
is normally port 2049.) This policy is no different with RDMA
interconnects, although it may require the allocation of port
numbers appropriate to each upper layer binding which uses the RPC
framing defined here.
When mapped atop the iWARP [RDDP] transport, which uses IP port
addressing due to its layering on TCP and/or SCTP, port mapping is
trivial and consists merely of issuing the port in the connection
process.
When mapped atop Infiniband [IB], which uses a GID-based service
endpoint naming scheme, a translation MUST be employed. One such
translation is defined in the Infiniband Port Addressing Annex
[IBPORT], which is appropriate for translating IP port addressing
to the Infiniband network. Therefore, in this case, IP port
addressing may be readily employed by the upper layer.
When a mapping standard or convention exists for IP ports on an
RDMA interconnect, there are several possibilities for each upper
layer to consider:

   One possibility is to have an upper layer server register its
   mapped IP port with the rpcbind service, under the netid (or
   netids) defined here. An RPC/RDMA-aware client can then
   resolve its desired service to a mappable port, and proceed to
   connect. This is the most flexible and compatible approach,
   for those upper layers which are defined to use the rpcbind
   service.

   A second possibility is to have the server's portmapper
   register itself on the RDMA interconnect at a "well known"
   service address. (On UDP or TCP, this corresponds to port
   111.) A client could connect to this service address and use
   the portmap protocol to obtain a service address in response
   to a program number, e.g. an iWARP port number, or an
   Infiniband GID.

   Alternatively, the client could simply connect to the mapped
   well-known port for the service itself, if it is appropriately
   defined.
Historically, different RPC protocols have taken different
approaches to their port assignment, therefore the specific method
is left to each RPC/RDMA-enabled upper layer binding, and not
addressed here.
This specification defines a new "netid", to be used for
registration of upper layers atop iWARP [RDDP] and (when a suitable
port translation service is available) Infiniband [IB] in section
12, "IANA Considerations." Additional RDMA-capable networks MAY
define their own netids, or if they provide a port translation, MAY
share the one defined here.
11. Security
ONC RPC provides its own security via the RPCSEC_GSS framework
[RFC2203]. RPCSEC_GSS can provide message authentication,
integrity checking, and privacy. This security mechanism will be
unaffected by the RDMA transport. The data integrity and privacy
features alter the body of the message, presenting it as a single
chunk. For large messages the chunk may be large enough to qualify
for RDMA Read transfer. However, there is much data movement
associated with computation and verification of integrity, or
encryption/decryption, so certain performance advantages may be
lost.
For efficiency, a more appropriate security mechanism for RDMA
links may be link-level protection, such as IPsec, which may be co-
located in the RDMA hardware. The use of link-level protection MAY
be negotiated through the use of a new RPCSEC_GSS mechanism like
the Credential Cache GSS Mechanism [CCM]. Use of such mechanisms
is RECOMMENDED where end-to-end integrity and/or privacy is
desired, and where efficiency is required.
There are no new issues here with exposed addresses. The only
exposed addresses here are in the chunk list and in the transport
packets transferred via RDMA. The data contained in these
addresses continues to be protected by RPCSEC_GSS integrity and
privacy.
12. IANA Considerations
The new RPC transport is to be assigned a new RPC "netid", which is
an rpcbind [RFC1833] string used to describe the underlying
protocol in order for RPC to select the appropriate transport
framing, as well as the format of the service ports.

The following string is to be added to the "nc_proto" registry on
page 5 of [RFC1833]:

   NC_RDMA "rdma"

This netid MAY be used for any RDMA network satisfying the
requirements of section 2, and able to identify service endpoints
using IP port addressing, possibly through use of a translation
service as described above in section 10, RPC Binding.

As a new RPC transport, this protocol has no effect on RPC program
numbers or existing registered port numbers. However, new port
numbers MAY be registered for use by RPC/RDMA-enabled services, as
appropriate to the new networks over which the services will
operate.

The OPTIONAL Connection Configuration protocol described herein
requires an RPC program number assignment. The value "100400" is
assigned:

   rdmaconfig 100400 rpc.rdmaconfig

Currently, these numbers are not assigned by IANA; they are merely
republished [IANA-RPC].
13. Acknowledgements
The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David
Robinson and Mallikarjun Chadalapaka for their contributions to
this document.
14. Normative References

   [RFC2119]
      S. Bradner, "Key words for use in RFCs to Indicate Requirement
      Levels", Best Current Practice, BCP 14, RFC 2119, March 1997.

   [RFC1094]
      Sun Microsystems, "NFS: Network File System Protocol
      Specification", (NFS version 2) Informational RFC,
      http://www.ietf.org/rfc/rfc1094.txt

   [RFC1831]
      R. Srinivasan, "RPC: Remote Procedure Call Protocol
      Specification Version 2", Standards Track RFC,
      http://www.ietf.org/rfc/rfc1831.txt

   [RFC1832]
      R. Srinivasan, "XDR: External Data Representation Standard",
      Standards Track RFC, http://www.ietf.org/rfc/rfc1832.txt

   [RFC1813]
      B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3
      Protocol Specification", Informational RFC,
      http://www.ietf.org/rfc/rfc1813.txt

   [RFC1833]
      R. Srinivasan, "Binding Protocols for ONC RPC Version 2",
      Standards Track RFC, http://www.ietf.org/rfc/rfc1833.txt

   [RFC3530]
      S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame,
      M. Eisler, D. Noveck, "NFS version 4 Protocol", Standards
      Track RFC, http://www.ietf.org/rfc/rfc3530.txt

   [RFC2203]
      M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol
      Specification", Standards Track RFC,
      http://www.ietf.org/rfc/rfc2203.txt
15. Informative References

   [RDMAP]
      R. Recio et al., "A Remote Direct Memory Access Protocol
      Specification", Standards Track RFC, draft-ietf-rddp-rdmap

   [CCM]
      M. Eisler, N. Williams, "CCM: The Credential Cache GSS
      Mechanism", Internet Draft Work in Progress,
      draft-ietf-nfsv4-ccm

   [NFSDDP]
      B. Callaghan, T. Talpey, "NFS Direct Data Placement", Internet
      Draft Work in Progress, draft-ietf-nfsv4-nfsdirect

   [RDDP]
      H. Shah et al., "Direct Data Placement over Reliable
      Transports", Standards Track RFC, draft-ietf-rddp-ddp

   [NFSRDMAPS]
      T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet
      Draft Work in Progress,
      draft-ietf-nfsv4-nfs-rdma-problem-statement

   [NFSv4.1]
      S. Shepler et al., eds., "NFSv4 Minor Version 1", Internet
      Draft Work in Progress, draft-ietf-nfsv4-minorversion1

   [IB]
      Infiniband Architecture Specification, available from
      http://www.infinibandta.org

   [IBPORT]
      Infiniband Trade Association, "IP Addressing Annex", available
      from http://www.infinibandta.org

   [IANA-RPC]
      IANA Sun RPC number statement,
      http://www.iana.org/assignments/sun-rpc-numbers
16. Authors' Addresses

   Tom Talpey
   Network Appliance, Inc.
   375 Totten Pond Road
   Waltham, MA 02451 USA

   Phone: +1 781 768 5329
   EMail: thomas.talpey@netapp.com

   Brent Callaghan
   Apple Computer, Inc.
   MS: 302-4K
   2 Infinite Loop
   Cupertino, CA 95014 USA

   EMail: brentc@apple.com
17. Intellectual Property and Copyright Statements