draft-ietf-nfsv4-rpcrdma-06.txt   draft-ietf-nfsv4-rpcrdma-07.txt 
NFSv4 Working Group Tom Talpey NFSv4 Working Group Tom Talpey
Internet-Draft Network Appliance, Inc. Internet-Draft NetApp
Intended status: Standards Track Brent Callaghan Intended status: Standards Track Brent Callaghan
Expires: January 1, 2008 Apple Computer, Inc. Expires: August 23, 2008 Apple
July 1, 2007 February 22, 2008
RDMA Transport for ONC RPC Remote Direct Memory Access Transport for Remote Procedure Call
draft-ietf-nfsv4-rpcrdma-06 draft-ietf-nfsv4-rpcrdma-07
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 38 skipping to change at page 1, line 38
progress." progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
Copyright Notice Copyright Notice
Copyright (C) The IETF Trust (2007). Copyright (C) The IETF Trust (2008).
Abstract Abstract
A protocol is described providing RDMA as a new transport for ONC A protocol is described providing Remote Direct Memory Access
RPC. The RDMA transport binding conveys the benefits of efficient, (RDMA) as a new transport for Computing Remote Procedure Call
bulk data transport over high speed networks, while providing for (RPC). The RDMA transport binding conveys the benefits of
minimal change to RPC applications and with no required revision of efficient, bulk data transport over high speed networks, while
the application RPC protocol, or the RPC protocol itself. providing for minimal change to RPC applications and with no
required revision of the application RPC protocol, or the RPC
protocol itself.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Abstract RDMA Requirements . . . . . . . . . . . . . . . . . 3 2. Abstract RDMA Requirements . . . . . . . . . . . . . . . . . 3
3. Protocol Outline . . . . . . . . . . . . . . . . . . . . . . 4 3. Protocol Outline . . . . . . . . . . . . . . . . . . . . . . 4
3.1. Short Messages . . . . . . . . . . . . . . . . . . . . . . 5 3.1. Short Messages . . . . . . . . . . . . . . . . . . . . . . 5
3.2. Data Chunks . . . . . . . . . . . . . . . . . . . . . . . 5 3.2. Data Chunks . . . . . . . . . . . . . . . . . . . . . . . 5
3.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . . 6 3.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . . 6
3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 7 3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 7
3.5. XDR Decoding with Read Chunks . . . . . . . . . . . . . 11 3.5. XDR Decoding with Read Chunks . . . . . . . . . . . . . 10
3.6. XDR Decoding with Write Chunks . . . . . . . . . . . . . 11 3.6. XDR Decoding with Write Chunks . . . . . . . . . . . . . 11
3.7. XDR Roundup and Chunks . . . . . . . . . . . . . . . . . 12 3.7. XDR Roundup and Chunks . . . . . . . . . . . . . . . . . 12
3.8. RPC Call and Reply . . . . . . . . . . . . . . . . . . . 13 3.8. RPC Call and Reply . . . . . . . . . . . . . . . . . . . 13
3.9. Padding . . . . . . . . . . . . . . . . . . . . . . . . 16 3.9. Padding . . . . . . . . . . . . . . . . . . . . . . . . 16
4. RPC RDMA Message Layout . . . . . . . . . . . . . . . . . 17 4. RPC RDMA Message Layout . . . . . . . . . . . . . . . . . 17
4.1. RPC over RDMA Header . . . . . . . . . . . . . . . . . . 17 4.1. RPC over RDMA Header . . . . . . . . . . . . . . . . . . 17
4.2. RPC over RDMA header errors . . . . . . . . . . . . . . 19 4.2. RPC over RDMA header errors . . . . . . . . . . . . . . 19
4.3. XDR Language Description . . . . . . . . . . . . . . . . 20 4.3. XDR Language Description . . . . . . . . . . . . . . . . 20
5. Long Messages . . . . . . . . . . . . . . . . . . . . . . 22 5. Long Messages . . . . . . . . . . . . . . . . . . . . . . 22
5.1. Message as an RDMA Read Chunk . . . . . . . . . . . . . 22 5.1. Message as an RDMA Read Chunk . . . . . . . . . . . . . 22
5.2. RDMA Write of Long Replies (Reply Chunks) . . . . . . . 24 5.2. RDMA Write of Long Replies (Reply Chunks) . . . . . . . 24
6. Connection Configuration Protocol . . . . . . . . . . . . 25 6. Connection Configuration Protocol . . . . . . . . . . . . 25
6.1. Initial Connection State . . . . . . . . . . . . . . . . 26 6.1. Initial Connection State . . . . . . . . . . . . . . . . 26
6.2. Protocol Description . . . . . . . . . . . . . . . . . . 26 6.2. Protocol Description . . . . . . . . . . . . . . . . . . 26
7. Memory Registration Overhead . . . . . . . . . . . . . . . 28 7. Memory Registration Overhead . . . . . . . . . . . . . . . 28
8. Errors and Error Recovery . . . . . . . . . . . . . . . . 28 8. Errors and Error Recovery . . . . . . . . . . . . . . . . 28
9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 28 9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 28
10. RPC Binding . . . . . . . . . . . . . . . . . . . . . . . 29 10. RPC Binding . . . . . . . . . . . . . . . . . . . . . . . 29
11. Security . . . . . . . . . . . . . . . . . . . . . . . . 30 11. Security Considerations . . . . . . . . . . . . . . . . . 30
12. IANA Considerations . . . . . . . . . . . . . . . . . . . 30 12. IANA Considerations . . . . . . . . . . . . . . . . . . . 31
13. Acknowledgements . . . . . . . . . . . . . . . . . . . . 31 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . 32
14. Normative References . . . . . . . . . . . . . . . . . . 31 14. Normative References . . . . . . . . . . . . . . . . . . 32
15. Informative References . . . . . . . . . . . . . . . . . 32 15. Informative References . . . . . . . . . . . . . . . . . 33
16. Authors' Addresses . . . . . . . . . . . . . . . . . . . 33 16. Authors' Addresses . . . . . . . . . . . . . . . . . . . 34
17. Intellectual Property and Copyright Statements . . . . . 33 17. Intellectual Property and Copyright Statements . . . . . 35
Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . 34 Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . 36
Requirements Language Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
this document are to be interpreted as described in [RFC2119]. this document are to be interpreted as described in [RFC2119].
1. Introduction 1. Introduction
RDMA is a technique for efficient movement of data between end Remote Direct Memory Access (RDMA) [RFC5040, RFC5041] [IB] is a
nodes, which becomes increasingly compelling over high speed technique for efficient movement of data between end nodes, which
transports. By directing data into destination buffers as it is becomes increasingly compelling over high speed transports. By
sent on a network, and placing it via direct memory access by directing data into destination buffers as it is sent on a network,
hardware, the double benefit of faster transfers and reduced host and placing it via direct memory access by hardware, the double
overhead is obtained. benefit of faster transfers and reduced host overhead is obtained.
ONC RPC [RFC1831bis] is a remote procedure call protocol that has Open Network Computing Remote Procedure Call (ONC RPC, or simply,
been run over a variety of transports. Most RPC implementations RPC) [RFC1831bis] is a remote procedure call protocol that has been
today use UDP or TCP. RPC messages are defined in terms of an run over a variety of transports. Most RPC implementations today
eXternal Data Representation (XDR) [RFC4506] which provides a use UDP or TCP. RPC messages are defined in terms of an eXternal
canonical data representation across a variety of host Data Representation (XDR) [RFC4506] which provides a canonical data
architectures. An XDR data stream is conveyed differently on each representation across a variety of host architectures. An XDR data
type of transport. On UDP, RPC messages are encapsulated inside stream is conveyed differently on each type of transport. On UDP,
datagrams, while on a TCP byte stream, RPC messages are delineated RPC messages are encapsulated inside datagrams, while on a TCP byte
by a record marking protocol. An RDMA transport also conveys RPC stream, RPC messages are delineated by a record marking protocol.
messages in a unique fashion that must be fully described if client An RDMA transport also conveys RPC messages in a unique fashion
and server implementations are to interoperate. that must be fully described if client and server implementations
are to interoperate.
RDMA transports present new semantics unlike the behaviors of RDMA transports present new semantics unlike the behaviors of
either UDP and TCP alone. They retain message delineations like either UDP or TCP alone. They retain message delineations like UDP
UDP while also providing a reliable, sequenced data transfer like while also providing a reliable, sequenced data transfer like TCP.
TCP. And, they provide the new efficient, bulk transfer service of And, they provide the new efficient, bulk transfer service of RDMA.
RDMA. RDMA transports are therefore naturally viewed as a new RDMA transports are therefore naturally viewed as a new transport
transport type by ONC RPC. type by RPC.
RDMA as a transport will benefit the performance of RPC protocols RDMA as a transport will benefit the performance of RPC protocols
that move large "chunks" of data, since RDMA hardware excels at that move large "chunks" of data, since RDMA hardware excels at
moving data efficiently between host memory and a high speed moving data efficiently between host memory and a high speed
network with little or no host CPU involvement. In this context, network with little or no host CPU involvement. In this context,
the NFS protocol, in all its versions [RFC1094] [RFC1813] [RFC3530] the NFS protocol, in all its versions [RFC1094] [RFC1813] [RFC3530]
[NFSv4.1], is an obvious beneficiary of RDMA. A complete problem [NFSv4.1], is an obvious beneficiary of RDMA. A complete problem
statement is discussed in [NFSRDMAPS], and related NFSv4 issues are statement is discussed in [NFSRDMAPS], and related NFSv4 issues are
discussed in [NFSv4.1]. Many other RPC-based protocols will also discussed in [NFSv4.1]. Many other RPC-based protocols will also
benefit. benefit.
skipping to change at page 4, line 12 skipping to change at page 4, line 13
header contains a transaction ID (XID) followed by the program and header contains a transaction ID (XID) followed by the program and
procedure number as well as a security credential. An RPC reply procedure number as well as a security credential. An RPC reply
header begins with an XID that matches that of the RPC call header begins with an XID that matches that of the RPC call
message, followed by a security verifier and results. All data in message, followed by a security verifier and results. All data in
an RPC message is XDR encoded. For a complete description of the an RPC message is XDR encoded. For a complete description of the
RPC protocol and XDR encoding, see [RFC1831bis] and [RFC4506]. RPC protocol and XDR encoding, see [RFC1831bis] and [RFC4506].
This protocol assumes the following abstract model for RDMA This protocol assumes the following abstract model for RDMA
transports. These terms, common in the RDMA lexicon, are used in transports. These terms, common in the RDMA lexicon, are used in
this document. A more complete glossary of RDMA terms can be found this document. A more complete glossary of RDMA terms can be found
in [RDMAP]. in [RFC5040].
o Registered Memory o Registered Memory
All data moved via tagged RDMA operations is resident in All data moved via tagged RDMA operations is resident in
registered memory at its destination. This protocol assumes registered memory at its destination. This protocol assumes
that each segment of registered memory MUST be identified with that each segment of registered memory MUST be identified with
a steering tag of no more than 32 bits and memory addresses of a steering tag of no more than 32 bits and memory addresses of
up to 64 bits in length. up to 64 bits in length.
o RDMA Send o RDMA Send
The RDMA provider supports an RDMA Send operation with The RDMA provider supports an RDMA Send operation with
skipping to change at page 5, line 10 skipping to change at page 5, line 10
receives no notification of RDMA Read completion, there is an receives no notification of RDMA Read completion, there is an
assumption that on receiving the data the receiver will signal assumption that on receiving the data the receiver will signal
completion with an RDMA Send message, so that the peer can completion with an RDMA Send message, so that the peer can
free the source buffers and the associated steering tags. free the source buffers and the associated steering tags.
This protocol is designed to be carried over all RDMA transports This protocol is designed to be carried over all RDMA transports
meeting the stated requirements. This protocol conveys to the RPC meeting the stated requirements. This protocol conveys to the RPC
peer, information sufficient for that RPC peer to direct an RDMA peer, information sufficient for that RPC peer to direct an RDMA
layer to perform transfers containing RPC data, and to communicate layer to perform transfers containing RPC data, and to communicate
their result(s). For example, it is readily carried over RDMA their result(s). For example, it is readily carried over RDMA
transports such as iWARP [RDDP] or Infiniband [IB]. transports such as iWARP [RFC5040, RFC5041] or Infiniband [IB].
3. Protocol Outline 3. Protocol Outline
An RPC message can be conveyed in identical fashion, whether it is An RPC message can be conveyed in identical fashion, whether it is
a call or reply message. In each case, the transmission of the a call or reply message. In each case, the transmission of the
message proper is preceded by transmission of a transport-specific message proper is preceded by transmission of a transport-specific
header for use by RPC over RDMA transports. This header is header for use by RPC over RDMA transports. This header is
analogous to the record marking used for RPC over TCP, but is more analogous to the record marking used for RPC over TCP, but is more
extensive, since RDMA transports support several modes of data extensive, since RDMA transports support several modes of data
transfer and it is important to allow the client and server to use transfer and it is important to allow the client and server to use
skipping to change at page 5, line 49 skipping to change at page 5, line 49
however define an exchange to dynamically enable RPC/RDMA on an however define an exchange to dynamically enable RPC/RDMA on an
existing RPC association. Any such exchange must be carefully existing RPC association. Any such exchange must be carefully
architected so as to prevent any ambiguity as to the framing in use architected so as to prevent any ambiguity as to the framing in use
for each side of the connection. Because RPC/RDMA framing delimits for each side of the connection. Because RPC/RDMA framing delimits
an entire RPC request or reply, any such shift must occur between an entire RPC request or reply, any such shift must occur between
distinct RPC messages. distinct RPC messages.
3.1. Short Messages 3.1. Short Messages
Many RPC messages are quite short. For example, the NFS version 3 Many RPC messages are quite short. For example, the NFS version 3
GETATTR request is only 56 bytes: 20 bytes of RPC header plus a 32 GETATTR request is only 56 bytes: 20 bytes of RPC header, plus a
byte filehandle argument and 4 bytes of length. The reply to this 32 byte file handle argument and 4 bytes of length. The reply to
common request is about 100 bytes. this common request is about 100 bytes.
There is no benefit in transferring such small messages with an There is no benefit in transferring such small messages with an
RDMA Read or Write operation. The overhead in transferring RDMA Read or Write operation. The overhead in transferring
steering tags and memory addresses is justified only by large steering tags and memory addresses is justified only by large
transfers. The critical message size that justifies RDMA transfer transfers. The critical message size that justifies RDMA transfer
will vary depending on the RDMA implementation and network, but is will vary depending on the RDMA implementation and network, but is
typically of the order of a few kilobytes. It is appropriate to typically of the order of a few kilobytes. It is appropriate to
transfer a short message with an RDMA Send to a pre-posted buffer. transfer a short message with an RDMA Send to a pre-posted buffer.
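As an illustration only (not part of either draft), the size reasoning above can be sketched in a few lines. The 4096-byte threshold is an assumed value standing in for "a few kilobytes"; the drafts leave the actual figure to the implementation and network. The names `INLINE_THRESHOLD` and `transfer_mode` are hypothetical.

```python
# Illustrative only: neither draft fixes a threshold. The text says only
# that it is "typically of the order of a few kilobytes".
INLINE_THRESHOLD = 4096  # assumed value, in bytes

# Sizes quoted in the text for an NFS version 3 GETATTR call.
RPC_HEADER = 20
FILEHANDLE = 32
LENGTH_WORD = 4


def transfer_mode(message_size, threshold=INLINE_THRESHOLD):
    """Return "send" for a short message, "rdma" for a larger one."""
    return "send" if message_size <= threshold else "rdma"


getattr_call = RPC_HEADER + FILEHANDLE + LENGTH_WORD
print(getattr_call, transfer_mode(getattr_call))
print(transfer_mode(64 * 1024))
```

With these assumed values the GETATTR call totals 56 bytes and is sent inline with an RDMA Send, while a 64 KB payload is a candidate for an RDMA Read or Write.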
The RPC over RDMA header with the short message (call or reply) The RPC over RDMA header with the short message (call or reply)
immediately following is transferred using a single RDMA Send immediately following is transferred using a single RDMA Send
skipping to change at page 7, line 30 skipping to change at page 7, line 30
up or down at each opportunity to match the server's needs or up or down at each opportunity to match the server's needs or
policies. policies.
The RPC client MUST NOT send unacknowledged requests in excess of The RPC client MUST NOT send unacknowledged requests in excess of
this granted RPC server credit limit. If the limit is exceeded, this granted RPC server credit limit. If the limit is exceeded,
the RDMA layer may signal an error, possibly terminating the the RDMA layer may signal an error, possibly terminating the
connection. Even if an error does not occur, it is OPTIONAL that connection. Even if an error does not occur, it is OPTIONAL that
the server handle the excess request(s), and it MAY return an RPC the server handle the excess request(s), and it MAY return an RPC
error to the client. Also note that the never-zero requirement error to the client. Also note that the never-zero requirement
implies that an RPC server MUST always provide at least one credit implies that an RPC server MUST always provide at least one credit
to each connected RPC client. It is however OPTIONAL that the to each connected RPC client from which no requests are
server always be prepared to receive a request from each client, outstanding. The client would deadlock otherwise, unable to send
for example when the server is busy processing all granted client another request.
requests.
While RPC calls complete in any order, the current flow control While RPC calls complete in any order, the current flow control
limit at the RPC server is known to the RPC client from the Send limit at the RPC server is known to the RPC client from the Send
ordering properties. It is always the most recent server-granted ordering properties. It is always the most recent server-granted
credit value minus the number of requests in flight. credit value minus the number of requests in flight.
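A minimal sketch of the client-side bookkeeping described above, assuming that each reply carries the server's latest grant. The class and method names are hypothetical and do not come from either draft.

```python
class CreditTracker:
    """Client-side view of the server's flow control limit (sketch)."""

    def __init__(self, granted):
        if granted < 1:
            raise ValueError("a server grant is never zero")
        self.granted = granted   # most recent server-granted credit value
        self.in_flight = 0       # requests sent but not yet answered

    def available(self):
        # Most recent grant minus the number of requests in flight.
        return max(self.granted - self.in_flight, 0)

    def send_request(self):
        # The client MUST NOT exceed the granted limit.
        if self.available() == 0:
            raise RuntimeError("would exceed granted credits")
        self.in_flight += 1

    def receive_reply(self, new_grant):
        # Assumption: every reply reports the server's current grant.
        if new_grant < 1:
            raise ValueError("a server grant is never zero")
        self.in_flight -= 1
        self.granted = new_grant


tracker = CreditTracker(granted=2)
tracker.send_request()
tracker.send_request()
print(tracker.available())   # both credits are in use
tracker.receive_reply(new_grant=4)
print(tracker.available())   # grant of 4, one request still in flight
```

Clamping `available()` at zero covers the case where the server lowers its grant below the number of requests already in flight.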
Certain RDMA implementations may impose additional flow control Certain RDMA implementations may impose additional flow control
restrictions, such as limits on RDMA Read operations in progress at restrictions, such as limits on RDMA Read operations in progress at
the responder. Because these operations are outside the scope of the responder. Because these operations are outside the scope of
this protocol, they are not addressed and SHOULD be provided for by this protocol, they are not addressed and SHOULD be provided for by
skipping to change at page 8, line 24 skipping to change at page 8, line 24
encoded as a contiguous sequence of bytes for network transmission encoded as a contiguous sequence of bytes for network transmission
over UDP or TCP. However, in the case of an RDMA transport, local over UDP or TCP. However, in the case of an RDMA transport, local
routines such as XDR encode can determine that (for instance) an routines such as XDR encode can determine that (for instance) an
opaque byte array is large enough to be more efficiently moved via opaque byte array is large enough to be more efficiently moved via
an RDMA data transfer operation like RDMA Read or RDMA Write. an RDMA data transfer operation like RDMA Read or RDMA Write.
Semantically speaking, the protocol has no restriction regarding Semantically speaking, the protocol has no restriction regarding
data types which may or may not be represented by a read or write data types which may or may not be represented by a read or write
chunk. In practice however, efficiency considerations lead to the chunk. In practice however, efficiency considerations lead to the
conclusion that certain data types are not generally "chunkable". conclusion that certain data types are not generally "chunkable".
Typically, only opaque and aggregate data types which may attain Typically, only those opaque and aggregate data types that may
substantial size are considered to be eligible. With today's attain substantial size are considered to be eligible. With
hardware this size may be a kilobyte or more. However any object today's hardware this size may be a kilobyte or more. However any
MAY be chosen for chunking in any given message. object MAY be chosen for chunking in any given message.
The eligibility of XDR data items to be candidates for being moved The eligibility of XDR data items to be candidates for being moved
as data chunks (as opposed to being marshaled inline) is not as data chunks (as opposed to being marshaled inline) is not
specified by the RPC over RDMA protocol. Chunk eligibility specified by the RPC over RDMA protocol. Chunk eligibility
criteria MUST be determined by each upper layer in order to provide criteria MUST be determined by each upper layer in order to provide
for an interoperable specification. One such example with for an interoperable specification. One such example with
rationale, for the NFS protocol family, is provided in [NFSDDP]. rationale, for the NFS protocol family, is provided in [NFSDDP].
The interface by which an upper layer implementation communicates The interface by which an upper layer implementation communicates
the eligibility of a data item locally to RPC for chunking is out the eligibility of a data item locally to RPC for chunking is out
skipping to change at page 13, line 21 skipping to change at page 13, line 21
On the other hand, RPC/RDMA Read chunks carry the XDR position of On the other hand, RPC/RDMA Read chunks carry the XDR position of
each chunked element and length of the Chunk segment, and can be each chunked element and length of the Chunk segment, and can be
placed by the receiver exactly where they belong in the receiver's placed by the receiver exactly where they belong in the receiver's
memory without regard to the alignment of their position in the XDR memory without regard to the alignment of their position in the XDR
stream. Since any rounded-up data is not actually part of the stream. Since any rounded-up data is not actually part of the
upper layer's message, the receiver will not reference it, and upper layer's message, the receiver will not reference it, and
there is no reason to set it to any particular value in the there is no reason to set it to any particular value in the
receiver's memory. receiver's memory.
When roundup is present at the end of a sequence of chunks, the When roundup is present at the end of a sequence of chunks, the
length of the sequence will terminate it at an non-4-byte XDR length of the sequence will terminate it at a non-4-byte XDR
position. When the receiver proceeds to decode the remaining part position. When the receiver proceeds to decode the remaining part
of the XDR stream, it inspects the XDR position indicated by the of the XDR stream, it inspects the XDR position indicated by the
next chunk. Because this position will not match (else roundup next chunk. Because this position will not match (else roundup
would not have occurred), the receiver decoding will fall back to would not have occurred), the receiver decoding will fall back to
inspecting the remaining inline portion. If in turn, no data inspecting the remaining inline portion. If in turn, no data
remains to be decoded from the inline portion, then the receiver remains to be decoded from the inline portion, then the receiver
MUST conclude that roundup is present, and therefore advances the MUST conclude that roundup is present, and therefore advances the
XDR decode position to that indicated by the next chunk (if any). XDR decode position to that indicated by the next chunk (if any).
In this way, roundup is passed without ever actually transferring In this way, roundup is passed without ever actually transferring
additional XDR bytes. additional XDR bytes.
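The decode fallback described above can be sketched as follows. This is an illustration under stated assumptions, not text from either draft: the function names are hypothetical, and the positions in the example (a 13-byte chunk sequence starting at XDR position 100) are invented.

```python
def xdr_pad(length):
    """Bytes of roundup XDR adds to reach the next 4-byte boundary."""
    return -length % 4


def next_decode_step(position, next_chunk_position, inline_remaining):
    """Decide where the receiver continues decoding.

    Returns a (source, position) pair. next_chunk_position is None
    when no further chunk exists.
    """
    if next_chunk_position == position:
        return ("chunk", position)          # next chunk starts right here
    if inline_remaining > 0:
        return ("inline", position)         # fall back to inline data
    if next_chunk_position is not None:
        # No inline data left: roundup is inferred and skipped without
        # transferring any bytes.
        return ("chunk", next_chunk_position)
    return ("end", position)


end_of_sequence = 100 + 13                  # a non-4-byte XDR position
next_chunk = end_of_sequence + xdr_pad(13)  # position of the next chunk
print(next_decode_step(end_of_sequence, next_chunk, inline_remaining=0))
```

In the example the receiver jumps from position 113 to position 116, passing three bytes of roundup that were never sent.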
skipping to change at page 17, line 38 skipping to change at page 17, line 38
RDMA on behalf of RPC requests will be placed into appropriately RDMA on behalf of RPC requests will be placed into appropriately
aligned buffers on the system that receives the transfer. In this aligned buffers on the system that receives the transfer. In this
way, the need for servers to perform RDMA Read to satisfy all but way, the need for servers to perform RDMA Read to satisfy all but
the largest client writes is obviated. the largest client writes is obviated.
The effect of padding is demonstrated below showing prior bytes on The effect of padding is demonstrated below showing prior bytes on
an XDR stream (XXX) followed by an opaque field consisting of four an XDR stream (XXX) followed by an opaque field consisting of four
length bytes (LLLL) followed by data bytes (DDDD). The receiver of length bytes (LLLL) followed by data bytes (DDDD). The receiver of
the RDMA Send has posted two chained receive buffers. Without the RDMA Send has posted two chained receive buffers. Without
padding, the opaque data is split across the two buffers. With the padding, the opaque data is split across the two buffers. With the
addition of padding bytes (ppp) prior to the first data byte, the addition of padding bytes ("ppp" in the figure below) prior to the
data can be forced to align correctly in the second buffer. first data byte, the data can be forced to align correctly in the
second buffer.
Buffer 1 Buffer 2 Buffer 1 Buffer 2
Unpadded -------------- -------------- Unpadded -------------- --------------
XXXXXXXLLLLDDDDDDDDDDDDDD ---> XXXXXXXLLLLDDD DDDDDDDDDDD XXXXXXXLLLLDDDDDDDDDDDDDD ---> XXXXXXXLLLLDDD DDDDDDDDDDD
Padded Padded
XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp DDDDDDDDDDDDDD XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp DDDDDDDDDDDDDD
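The figure can be reproduced with a short sketch. The 14-byte receive buffer size is inferred from the figure, and the helper names are hypothetical; neither draft defines them.

```python
def pad_to_buffer_boundary(prefix_len, buffer_size):
    """Padding bytes needed so the data starts at a buffer boundary."""
    return -prefix_len % buffer_size


def split_into_buffers(stream, buffer_size):
    """Model chained receive buffers as fixed-size slices of the stream."""
    return [stream[i:i + buffer_size]
            for i in range(0, len(stream), buffer_size)]


BUFFER_SIZE = 14                 # inferred from the figure
prefix = "X" * 7 + "L" * 4       # prior bytes plus the four length bytes
data = "D" * 14                  # opaque data bytes

unpadded = split_into_buffers(prefix + data, BUFFER_SIZE)
pad = pad_to_buffer_boundary(len(prefix), BUFFER_SIZE)
padded = split_into_buffers(prefix + "p" * pad + data, BUFFER_SIZE)

print(unpadded)   # data split across both buffers
print(pad)        # number of "p" bytes inserted
print(padded)     # data held entirely by the second buffer
```

With an 11-byte prefix and 14-byte buffers, three padding bytes are needed, which matches the "ppp" shown in the figure.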
skipping to change at page 26, line 29 skipping to change at page 26, line 29
| RPC Call with rdma_reply | | RPC Call with rdma_reply |
Send | ------------------------------> | Send | ------------------------------> |
| | | |
| Long RPC Reply Msg | | Long RPC Reply Msg |
| <------------------------------ | Write | <------------------------------ | Write
| | | |
| RPC over RDMA Header | | RPC over RDMA Header |
| <------------------------------ | Send | <------------------------------ | Send
The use of RDMA Write to return long replies requires that the The use of RDMA Write to return long replies requires that the
client application anticipate a long reply and have some knowledge client applications anticipate a long reply and have some knowledge
of its size so that an adequately sized buffer can be allocated. of its size so that an adequately sized buffer can be allocated.
This is certainly true of NFS READDIR replies, where the client This is certainly true of NFS READDIR replies, where the client
already provides an upper bound on the size of the encoded already provides an upper bound on the size of the encoded
directory fragment to be returned by the server. directory fragment to be returned by the server.
The use of these "reply chunks" is highly efficient and convenient The use of these "reply chunks" is highly efficient and convenient
for both RPC client and server. Their use is encouraged for for both RPC client and server. Their use is encouraged for
eligible RPC operations such as NFS READDIR, which would otherwise eligible RPC operations such as NFS READDIR, which would otherwise
require extensive chunk management within the results or use of require extensive chunk management within the results or use of
RDMA Read and a Done message. [NFSDDP] RDMA Read and a Done message. [NFSDDP]
skipping to change at page 30, line 15 skipping to change at page 30, line 15
10. RPC Binding 10. RPC Binding
RPC services normally register with a portmap or rpcbind [RFC1833] RPC services normally register with a portmap or rpcbind [RFC1833]
service, which associates an RPC program number with a service service, which associates an RPC program number with a service
address. (In the case of UDP or TCP, the service address for NFS address. (In the case of UDP or TCP, the service address for NFS
is normally port 2049.) This policy is no different with RDMA is normally port 2049.) This policy is no different with RDMA
interconnects, although it may require the allocation of port interconnects, although it may require the allocation of port
numbers appropriate to each upper layer binding which uses the RPC numbers appropriate to each upper layer binding which uses the RPC
framing defined here. framing defined here.
When mapped atop the iWARP [RDDP] transport, which uses IP port When mapped atop the iWARP [RFC5040, RFC5041] transport, which uses
addressing due to its layering on TCP and/or SCTP, port mapping is IP port addressing due to its layering on TCP and/or SCTP, port
trivial and consists merely of issuing the port in the connection mapping is trivial and consists merely of issuing the port in the
process. connection process.
When mapped atop Infiniband [IB], which uses a GID-based service When mapped atop Infiniband [IB], which uses a GID-based service
endpoint naming scheme, a translation MUST be employed. One such endpoint naming scheme, a translation MUST be employed. One such
translation is defined in the Infiniband Port Addressing Annex translation is defined in the Infiniband Port Addressing Annex
[IBPORT], which is appropriate for translating IP port addressing [IBPORT], which is appropriate for translating IP port addressing
to the Infiniband network. Therefore, in this case, IP port to the Infiniband network. Therefore, in this case, IP port
addressing may be readily employed by the upper layer. addressing may be readily employed by the upper layer.
When a mapping standard or convention exists for IP ports on an When a mapping standard or convention exists for IP ports on an
RDMA interconnect, there are several possibilities for each upper RDMA interconnect, there are several possibilities for each upper
skipping to change at page 31, line 8 skipping to change at page 31, line 8
Alternatively, the client could simply connect to the mapped Alternatively, the client could simply connect to the mapped
well-known port for the service itself, if it is appropriately well-known port for the service itself, if it is appropriately
defined. defined.
Historically, different RPC protocols have taken different Historically, different RPC protocols have taken different
approaches to their port assignment, therefore the specific method approaches to their port assignment, therefore the specific method
is left to each RPC/RDMA-enabled upper layer binding, and not is left to each RPC/RDMA-enabled upper layer binding, and not
addressed here. addressed here.
This specification defines a new "netid", to be used for This specification defines a new "netid", to be used for
registration of upper layers atop iWARP [RDDP] and (when a suitable registration of upper layers atop iWARP [RFC5040, RFC5041] and
port translation service is available) Infiniband [IB] in section (when a suitable port translation service is available) Infiniband
12, "IANA Considerations." Additional RDMA-capable networks MAY [IB] in section 12, "IANA Considerations." Additional RDMA-capable
define their own netids, or if they provide a port translation, MAY networks MAY define their own netids, or if they provide a port
share the one defined here. translation, MAY share the one defined here.
11. Security 11. Security Considerations
ONC RPC provides its own security via the RPCSEC_GSS framework RPC provides its own security via the RPCSEC_GSS framework
[RFC2203]. RPCSEC_GSS can provide message authentication, [RFC2203]. RPCSEC_GSS can provide message authentication,
integrity checking, and privacy. This security mechanism will be integrity checking, and privacy. This security mechanism will be
unaffected by the RDMA transport. The data integrity and privacy unaffected by the RDMA transport. The data integrity and privacy
features alter the body of the message, presenting it as a single features alter the body of the message, presenting it as a single
chunk. For large messages the chunk may be large enough to qualify chunk. For large messages the chunk may be large enough to qualify
for RDMA Read transfer. However, there is much data movement for RDMA Read transfer. However, there is much data movement
associated with computation and verification of integrity, or associated with computation and verification of integrity, or
encryption/decryption, so certain performance advantages may be encryption/decryption, so certain performance advantages may be
lost. lost.
For efficiency, more appropriate security mechanism for RDMA links For efficiency, a more appropriate security mechanism for RDMA
may be link-level protection, such as certain configurations of links may be link-level protection, such as certain configurations
IPsec, which may be co-located in the RDMA hardware. The use of of IPsec, which may be co-located in the RDMA hardware. The use of
link-level protection MAY be negotiated through the use of a new link-level protection MAY be negotiated through the use of the new
RPCSEC_GSS mechanism like the Credential Cache GSS Mechanism [CCM]. RPCSEC_GSS mechanism defined in [RPCSECGSSV2] in conjunction with
Use of such mechanisms is RECOMMENDED where end-to-end integrity the Channel Binding mechanism [RFC5056] and IPsec Channel
and/or privacy is desired, and where efficiency is required. Connection Latching [BTNSLATCH]. Use of such mechanisms is
REQUIRED where integrity and/or privacy is desired, and where
efficiency is required.
There are no new issues here with exposed addresses. The only An additional consideration is the protection of the integrity and
exposed addresses here are in the chunk list and in the transport privacy of local memory by the RDMA transport itself. The use of
packets transferred via RDMA. The data contained in these RDMA by RPC MUST NOT introduce any vulnerabilities to system memory
addresses continues to be protected by RPCSEC_GSS integrity and contents, or to memory owned by user processes. These protections
privacy. are provided by the RDMA layer specifications, and specifically
their security models. It is REQUIRED that any RDMA provider used
for RPC transport be conformant to the requirements of [RFC5042] in
order to satisfy these protections.
Once delivered securely by the RDMA provider, any RDMA-exposed
addresses will contain only RPC payloads in the chunk lists,
transferred under the protection of RPCSEC_GSS integrity and
privacy. By these means, the data will be protected end-to-end, as
required by the RPC layer security model.
Where results are supplied to the requester via Read chunks, a
server resource deficit can arise if the client does not promptly
acknowledge their status via the RDMA_DONE message. This can
potentially lead to a denial of service situation, with a single
client unfairly (and unnecessarily) consuming server RDMA
resources. Servers MUST protect against this situation,
originating from one or many clients. For example, a time-based
window of buffer availability may be offered, if the client fails
to obtain the data within the window, it will simply retry using
ordinary RPC retry semantics. Or, a more severe method would be
for the server to simply close the client's RDMA connection,
freeing the RDMA resources and allowing the server to reclaim them.
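The time-based window described above can be sketched as follows. This is only an illustration under stated assumptions, not part of the protocol: the class name ReadChunkWindow, keying the exposed buffers by transaction id, and the caller-supplied clock value are all hypothetical choices, and the draft does not specify a window length.

```python
from dataclasses import dataclass, field


@dataclass
class ReadChunkWindow:
    """Hypothetical server-side bookkeeping for buffers exposed via Read chunks."""

    window_seconds: float                          # assumed policy value, not from the draft
    deadlines: dict = field(default_factory=dict)  # xid -> time at which the buffer expires

    def expose(self, xid: int, now: float) -> None:
        # Record when the buffers advertised for this reply stop being available.
        self.deadlines[xid] = now + self.window_seconds

    def rdma_done(self, xid: int) -> bool:
        # The client acknowledged the transfer; returns False if the entry
        # was unknown or had already been reclaimed.
        return self.deadlines.pop(xid, None) is not None

    def reclaim_expired(self, now: float) -> list:
        # Release every buffer whose window has closed. A client that missed
        # the window is expected to fall back to ordinary RPC retry.
        expired = sorted(x for x, d in self.deadlines.items() if d <= now)
        for xid in expired:
            del self.deadlines[xid]
        return expired
```

A server loop might call reclaim_expired() periodically and free the RDMA resources for each returned transaction id.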
A fairer and more useful method is provided by the protocol itself.
The server MAY use the rdma_credit value to limit the number of
outstanding requests for each client. By including the number of
outstanding RDMA_DONE completions in the computation of available
client credits, the server can limit its exposure to each client,
and therefore provide uninterrupted service as its resources
permit.
However, the server must ensure that it does not decrease the
credit count to zero with this method, since the RDMA_DONE message
is not acknowledged. If the credit count were to drop to zero
solely due to outstanding RDMA_DONE messages, the client would
deadlock since it would never obtain a new credit with which to
continue. Therefore, if the server adjusts credits to zero for
outstanding RDMA_DONE, it MUST withhold its reply to at least one
message in order to provide the next credit. The time-based window
(or any other appropriate method) SHOULD be used by the server to
recover resources in the event that the client never returns.
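The credit accounting in the two preceding paragraphs can be sketched as below. This is a minimal illustration, assuming a fixed per-client credit limit; the function name grant_credits and its parameters are hypothetical and do not appear in the draft.

```python
def grant_credits(max_credits: int, pending_done: int) -> tuple:
    """Return (credits to advertise, whether a reply must be withheld).

    max_credits  -- assumed per-client limit chosen by the server
    pending_done -- replies still awaiting an RDMA_DONE from this client
    """
    if max_credits < 1 or pending_done < 0:
        raise ValueError("invalid credit state")

    # Outstanding RDMA_DONE completions count against the client's credits.
    available = max(max_credits - pending_done, 0)

    # RDMA_DONE is not acknowledged, so a grant of zero would leave the
    # client with no message on which to receive a new credit. In that
    # case the server keeps at least one reply outstanding.
    must_withhold_reply = available == 0
    return available, must_withhold_reply
```

For example, with a limit of 8 and 3 outstanding RDMA_DONE completions the server would advertise 5 credits; with a limit of 4 and 4 outstanding it would advertise none and hold back one reply until resources are recovered.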
The "Connection Configuration Protocol", when used, MUST be
protected by an appropriate RPC security flavor, to ensure it is
not attacked in the process of initiating an RPC/RDMA connection.
12. IANA Considerations 12. IANA Considerations
The new RPC transport is to be assigned a new RPC "netid", which is The new RPC transport is to be assigned a new RPC "netid", which is
an rpcbind [RFC1833] string used to describe the underlying an rpcbind [RFC1833] string used to describe the underlying
protocol in order for RPC to select the appropriate transport protocol in order for RPC to select the appropriate transport
framing, as well as the format of the service ports. framing, as well as the format of the service ports.
The following "nc_proto" registry string is hereby defined for this The following "nc_proto" registry string is hereby defined for this
purpose: purpose:
NC_RDMA "rdma" NC_RDMA "rdma"
The mechanism of adding this value to the RPC netid registry is
outside the scope of this document and is an IANA consideration.
This netid MAY be used for any RDMA network satisfying the This netid MAY be used for any RDMA network satisfying the
requirements of section 2, and able to identify service endpoints requirements of section 2, and able to identify service endpoints
using IP port addressing, possibly through use of a translation using IP port addressing, possibly through use of a translation
service as described above in section 10, RPC Binding. service as described above in section 10, RPC Binding.
As a new RPC transport, this protocol has no effect on RPC program As a new RPC transport, this protocol has no effect on RPC program
numbers or existing registered port numbers. However, new port numbers or existing registered port numbers. However, new port
numbers MAY be registered for use by RPC/RDMA-enabled services, as numbers MAY be registered for use by RPC/RDMA-enabled services, as
appropriate to the new networks over which the services will appropriate to the new networks over which the services will
operate. operate.
The OPTIONAL Connection Configuration protocol described herein The OPTIONAL Connection Configuration protocol described herein
requires an RPC program number assignment. The value "100400" is requires an RPC program number assignment. The value "100400" is
hereby assigned: hereby assigned:
rdmaconfig 100400 rpc.rdmaconfig rdmaconfig 100400 rpc.rdmaconfig
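The assignment line above follows the conventional "name number aliases" layout used by RPC program number tables. As a hedged sketch, an implementation could read such an entry and pair it with the netid string as shown below; the helper name parse_rpc_entry is hypothetical, and only the values "rdma", "rdmaconfig", 100400 and "rpc.rdmaconfig" come from this document.

```python
NC_RDMA = "rdma"  # netid string defined in this section


def parse_rpc_entry(line: str) -> tuple:
    """Split one 'name number [aliases...]' entry; '#' starts a comment."""
    fields = line.split("#", 1)[0].split()
    if len(fields) < 2:
        raise ValueError("expected at least a name and a program number")
    name, number, *aliases = fields
    return name, int(number), aliases


# The program number assignment made by this document.
RDMACONFIG = parse_rpc_entry("rdmaconfig 100400 rpc.rdmaconfig")
```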
Currently, these numbers are not assigned by IANA, they are merely Currently, neither the nc_proto netids nor the RPC program numbers
republished [IANA-RPC]. The mechanism of this republishing is are assigned by IANA. The list in [RFC1833] has served as the
outside the scope of this document and is an IANA consideration. netid registry, and the republication declared in [IANA-RPC] has
served as the program number registry. Ideally, IANA will create
explicit registries for these objects. However, in the absence of
new registries, this document would serve as the repository for the
RPC program number assignment, and the protocol netid.
13. Acknowledgements 13. Acknowledgements
The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak, The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David
Robinson and Mallikarjun Chadalapaka for their contributions to Robinson and Mallikarjun Chadalapaka for their contributions to
this document. this document.
14. Normative References 14. Normative References
skipping to change at page 33, line 27 skipping to change at page 34, line 27
[RFC3530] [RFC3530]
S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame,
M. Eisler, D. Noveck, "NFS version 4 Protocol", Standards M. Eisler, D. Noveck, "NFS version 4 Protocol", Standards
Track RFC, http://www.ietf.org/rfc/rfc3530.txt Track RFC, http://www.ietf.org/rfc/rfc3530.txt
[RFC2203] [RFC2203]
M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol
Specification", Standards Track RFC, Specification", Standards Track RFC,
http://www.ietf.org/rfc/rfc2203.txt http://www.ietf.org/rfc/rfc2203.txt
15. Informative References [RPCSECGSSV2]
M. Eisler, "RPCSEC_GSS Version 2", Internet Draft Work in
Progress draft-ietf-nfsv4-rpcsec-gss-v2
[RDMAP] [RFC5056]
R. Recio et al., "A Remote Direct Memory Access Protocol N. Williams, "On the Use of Channel Bindings to Secure
Specification", Standards Track RFC, draft-ietf-rddp-rdmap Channels", Standards Track RFC
[CCM] [BTNSLATCH]
M. Eisler, N. Williams, "CCM: The Credential Cache GSS N. Williams, "IPsec Channels: Connection Latching", Internet
Mechanism", Internet Draft Work in Progress, draft-ietf- Draft Work in Progress draft-ietf-btns-connection-latching
nfsv4-ccm
[RFC5042]
J. Pinkerton, E. Deleganes, "Direct Data Placement Protocol
(DDP) / Remote Direct Memory Access Protocol (RDMAP) Security"
Standards Track RFC
15. Informative References
[NFSDDP] [NFSDDP]
B. Callaghan, T. Talpey, "NFS Direct Data Placement" Internet B. Callaghan, T. Talpey, "NFS Direct Data Placement" Internet
Draft Work in Progress, draft-ietf-nfsv4-nfsdirect Draft Work in Progress, draft-ietf-nfsv4-nfsdirect
[RFC5040]
R. Recio et al., "A Remote Direct Memory Access Protocol
Specification", Standards Track RFC
[RDDP] [RFC5041]
H. Shah et al., "Direct Data Placement over Reliable H. Shah et al., "Direct Data Placement over Reliable
Transports", Standards Track RFC, draft-ietf-rddp-ddp Transports", Standards Track RFC
[NFSRDMAPS] [NFSRDMAPS]
T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet
Draft Work in Progress, draft-ietf-nfsv4-nfs-rdma-problem- Draft Work in Progress, draft-ietf-nfsv4-nfs-rdma-problem-
statement statement
[NFSv4.1] [NFSv4.1]
S. Shepler et al., ed., "NFSv4 Minor Version 1" Internet Draft S. Shepler et al., ed., "NFSv4 Minor Version 1" Internet Draft
Work in Progress, draft-ietf-nfsv4-minorversion1 Work in Progress, draft-ietf-nfsv4-minorversion1
[IB] [IB]
Infiniband Architecture Specification, available from Infiniband Architecture Specification, available from
http://www.infinibandta.org http://www.infinibandta.org
[IBPORT] [IBPORT]
Infiniband Trade Association, "IP Addressing Annex", available Infiniband Trade Association, "IP Addressing Annex", available
skipping to change at page 34, line 24 skipping to change at page 35, line 37
from http://www.infinibandta.org from http://www.infinibandta.org
[IANA-RPC] [IANA-RPC]
IANA Sun RPC number statement, IANA Sun RPC number statement,
http://www.iana.org/assignments/sun-rpc-numbers http://www.iana.org/assignments/sun-rpc-numbers
16. Authors' Addresses 16. Authors' Addresses
Tom Talpey Tom Talpey
Network Appliance, Inc. Network Appliance, Inc.
375 Totten Pond Road 1601 Trapelo Road, #16
Waltham, MA 02451 USA Waltham, MA 02451 USA
Phone: +1 781 768 5329 Phone: +1 781 768 5329
EMail: thomas.talpey@netapp.com EMail: thomas.talpey@netapp.com
Brent Callaghan Brent Callaghan
Apple Computer, Inc. Apple Computer, Inc.
MS: 302-4K MS: 302-4K
2 Infinite Loop 2 Infinite Loop
Cupertino, CA 95014 USA Cupertino, CA 95014 USA
EMail: brentc@apple.com EMail: brentc@apple.com
17. Intellectual Property and Copyright Statements 17. Intellectual Property and Copyright Statements
Full Copyright Statement Full Copyright Statement
Copyright (C) The IETF Trust (2007). Copyright (C) The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors contained in BCP 78, and except as set forth therein, the authors
retain all their rights. retain all their rights.
This document and the information contained herein are provided on This document and the information contained herein are provided on
an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
 End of changes. 36 change blocks. 
101 lines changed or deleted 163 lines changed or added
