draft-ietf-nfsv4-rpcrdma-04.txt   draft-ietf-nfsv4-rpcrdma-05.txt 
Internet-Draft Tom Talpey NFSv4 Working Group Tom Talpey
Expires: April 2007 Brent Callaghan Internet-Draft Network Appliance, Inc.
Intended status: Standards Track Brent Callaghan
Document: draft-ietf-nfsv4-rpcrdma-04 October, 2006 Expires: November 8, 2007 Apple Computer, Inc.
May 7, 2007
RDMA Transport for ONC RPC RDMA Transport for ONC RPC
draft-ietf-nfsv4-rpcrdma-05
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 34 skipping to change at page 1, line 36
documents at any time. It is inappropriate to use Internet-Drafts documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in as reference material or to cite them other than as "work in
progress." progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on November 8, 2007.
Copyright Notice
Copyright (C) The IETF Trust (2007).
Abstract Abstract
A protocol is described providing RDMA as a new transport for ONC A protocol is described providing RDMA as a new transport for ONC
RPC. The RDMA transport binding conveys the benefits of efficient, RPC. The RDMA transport binding conveys the benefits of efficient,
bulk data transport over high speed networks, while providing for bulk data transport over high speed networks, while providing for
minimal change to RPC applications and with no required revision of minimal change to RPC applications and with no required revision of
the application RPC protocol, or the RPC protocol itself. the application RPC protocol, or the RPC protocol itself.
Table of Contents Table of Contents
skipping to change at page 2, line 40 skipping to change at page 2, line 40
8. Errors and Error Recovery . . . . . . . . . . . . . . . . 28 8. Errors and Error Recovery . . . . . . . . . . . . . . . . 28
9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 28 9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 28
10. RPC Binding . . . . . . . . . . . . . . . . . . . . . . . 29 10. RPC Binding . . . . . . . . . . . . . . . . . . . . . . . 29
11. Security . . . . . . . . . . . . . . . . . . . . . . . . 30 11. Security . . . . . . . . . . . . . . . . . . . . . . . . 30
12. IANA Considerations . . . . . . . . . . . . . . . . . . . 30 12. IANA Considerations . . . . . . . . . . . . . . . . . . . 30
13. Acknowledgements . . . . . . . . . . . . . . . . . . . . 31 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . 31
14. Normative References . . . . . . . . . . . . . . . . . . 31 14. Normative References . . . . . . . . . . . . . . . . . . 31
15. Informative References . . . . . . . . . . . . . . . . . 32 15. Informative References . . . . . . . . . . . . . . . . . 32
16. Authors' Addresses . . . . . . . . . . . . . . . . . . . 33 16. Authors' Addresses . . . . . . . . . . . . . . . . . . . 33
17. Intellectual Property and Copyright Statements . . . . . 33 17. Intellectual Property and Copyright Statements . . . . . 33
Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . 34 Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . 34
Requirements Language Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
this document are to be interpreted as described in [RFC2119]. this document are to be interpreted as described in [RFC2119].
1. Introduction 1. Introduction
RDMA is a technique for efficient movement of data between end RDMA is a technique for efficient movement of data between end
nodes, which becomes increasingly compelling over high speed nodes, which becomes increasingly compelling over high speed
transports. By directing data into destination buffers as it is transports. By directing data into destination buffers as it is
sent on a network, and placing it via direct memory access by sent on a network, and placing it via direct memory access by
hardware, the double benefit of faster transfers and reduced host hardware, the double benefit of faster transfers and reduced host
overhead is obtained. overhead is obtained.
ONC RPC [RFC1831] is a remote procedure call protocol that has been ONC RPC [RFC1831] is a remote procedure call protocol that has been
run over a variety of transports. Most RPC implementations today run over a variety of transports. Most RPC implementations today
use UDP or TCP. RPC messages are defined in terms of an eXternal use UDP or TCP. RPC messages are defined in terms of an eXternal
Data Representation (XDR) [RFC1832] which provides a canonical data Data Representation (XDR) [RFC4506] which provides a canonical data
representation across a variety of host architectures. An XDR data representation across a variety of host architectures. An XDR data
stream is conveyed differently on each type of transport. On UDP, stream is conveyed differently on each type of transport. On UDP,
RPC messages are encapsulated inside datagrams, while on a TCP byte RPC messages are encapsulated inside datagrams, while on a TCP byte
stream, RPC messages are delineated by a record marking protocol. stream, RPC messages are delineated by a record marking protocol.
An RDMA transport also conveys RPC messages in a unique fashion An RDMA transport also conveys RPC messages in a unique fashion
that must be fully described if client and server implementations that must be fully described if client and server implementations
are to interoperate. are to interoperate.
RDMA transports present new semantics unlike the behaviors of RDMA transports present new semantics unlike the behaviors of
either UDP and TCP alone. They retain message delineations like either UDP and TCP alone. They retain message delineations like
skipping to change at page 4, line 7 skipping to change at page 4, line 7
sender to a receiver. An RPC message is either an RPC call from a sender to a receiver. An RPC message is either an RPC call from a
client to a server, or an RPC reply from the server back to the client to a server, or an RPC reply from the server back to the
client. An RPC message contains an RPC call header followed by client. An RPC message contains an RPC call header followed by
arguments if the message is an RPC call, or an RPC reply header arguments if the message is an RPC call, or an RPC reply header
followed by results if the message is an RPC reply. The call followed by results if the message is an RPC reply. The call
header contains a transaction ID (XID) followed by the program and header contains a transaction ID (XID) followed by the program and
procedure number as well as a security credential. An RPC reply procedure number as well as a security credential. An RPC reply
header begins with an XID that matches that of the RPC call header begins with an XID that matches that of the RPC call
message, followed by a security verifier and results. All data in message, followed by a security verifier and results. All data in
an RPC message is XDR encoded. For a complete description of the an RPC message is XDR encoded. For a complete description of the
RPC protocol and XDR encoding, see [RFC1831] and [RFC1832]. RPC protocol and XDR encoding, see [RFC1831] and [RFC4506].
This protocol assumes the following abstract model for RDMA This protocol assumes the following abstract model for RDMA
transports. These terms, common in the RDMA lexicon, are used in transports. These terms, common in the RDMA lexicon, are used in
this document. A more complete glossary of RDMA terms can be found this document. A more complete glossary of RDMA terms can be found
in [RDMAP]. in [RDMAP].
o Registered Memory o Registered Memory
All data moved via tagged RDMA operations is resident in All data moved via tagged RDMA operations is resident in
registered memory at its destination. This protocol assumes registered memory at its destination. This protocol assumes
that each segment of registered memory MUST be identified with that each segment of registered memory MUST be identified with
skipping to change at page 6, line 16 skipping to change at page 6, line 16
RDMA Read or Write operation. The overhead in transferring RDMA Read or Write operation. The overhead in transferring
steering tags and memory addresses is justified only by large steering tags and memory addresses is justified only by large
transfers. The critical message size that justifies RDMA transfer transfers. The critical message size that justifies RDMA transfer
will vary depending on the RDMA implementation and network, but is will vary depending on the RDMA implementation and network, but is
typically of the order of a few kilobytes. It is appropriate to typically of the order of a few kilobytes. It is appropriate to
transfer a short message with an RDMA Send to a pre-posted buffer. transfer a short message with an RDMA Send to a pre-posted buffer.
The RPC over RDMA header with the short message (call or reply) The RPC over RDMA header with the short message (call or reply)
immediately following is transferred using a single RDMA Send immediately following is transferred using a single RDMA Send
operation. operation.
Short RPC messages over an RDMA transport look like this: Short RPC messages over an RDMA transport:
RPC Client RPC Server RPC Client RPC Server
| RPC Call | | RPC Call |
Send | ------------------------------> | Send | ------------------------------> |
| | | |
| RPC Reply | | RPC Reply |
| <------------------------------ | Send | <------------------------------ | Send
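The short-message rule above can be sketched as two size checks. This is only an illustration: the function names and the 4096-byte threshold are assumptions made for the example, since the draft says only that the critical size is "typically of the order of a few kilobytes".

```python
# Hypothetical threshold: the draft gives no exact figure, only
# "a few kilobytes", so 4096 is an assumption for illustration.
RDMA_THRESHOLD_BYTES = 4096


def fits_inline(header_len: int, msg_len: int, recv_buf_size: int) -> bool:
    """Return True if the RPC over RDMA header plus the RPC message
    can be carried by a single Send into a pre-posted receive buffer."""
    return header_len + msg_len <= recv_buf_size


def worth_chunking(data_len: int, threshold: int = RDMA_THRESHOLD_BYTES) -> bool:
    """Return True if a data item is large enough that the cost of
    exchanging steering tags and addresses is likely to pay off."""
    return data_len >= threshold
```

In practice the threshold would be tuned per RDMA implementation and network, as the text notes.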
3.2. Data Chunks 3.2. Data Chunks
skipping to change at page 15, line 5 skipping to change at page 15, line 5
call message. call message.
These methods of data movement may occur in combinations within a These methods of data movement may occur in combinations within a
single RPC. For instance, an RPC call may contain some inline data single RPC. For instance, an RPC call may contain some inline data
along with some large chunks to be transferred via RDMA Read to the along with some large chunks to be transferred via RDMA Read to the
server. The reply to that call may have some result chunks that server. The reply to that call may have some result chunks that
the server RDMA Writes back to the client. The following protocol the server RDMA Writes back to the client. The following protocol
interactions illustrate RPC calls that use these methods to move interactions illustrate RPC calls that use these methods to move
RPC message data: RPC message data:
An RPC with write chunks in the call message looks like this: An RPC with write chunks in the call message:
RPC Client RPC Server RPC Client RPC Server
| RPC Call + Write Chunk list | | RPC Call + Write Chunk list |
Send | ------------------------------> | Send | ------------------------------> |
| | | |
| Chunk 1 | | Chunk 1 |
| <------------------------------ | Write | <------------------------------ | Write
| : | | : |
| Chunk n | | Chunk n |
| <------------------------------ | Write | <------------------------------ | Write
| | | |
| RPC Reply | | RPC Reply |
| <------------------------------ | Send | <------------------------------ | Send
In the presence of write chunks, RDMA ordering provides the In the presence of write chunks, RDMA ordering provides the
guarantee that all data in the RDMA Write operations has been guarantee that all data in the RDMA Write operations has been
placed in memory prior to the client's RPC reply processing. placed in memory prior to the client's RPC reply processing.
An RPC with read chunks in the call message looks like this: An RPC with read chunks in the call message:
RPC Client RPC Server RPC Client RPC Server
| RPC Call + Read Chunk list | | RPC Call + Read Chunk list |
Send | ------------------------------> | Send | ------------------------------> |
| | | |
| Chunk 1 | | Chunk 1 |
| +------------------------------ | Read | +------------------------------ | Read
| v-----------------------------> | | v-----------------------------> |
| : | | : |
| Chunk n | | Chunk n |
| +------------------------------ | Read | +------------------------------ | Read
| v-----------------------------> | | v-----------------------------> |
| | | |
| RPC Reply | | RPC Reply |
| <------------------------------ | Send | <------------------------------ | Send
And an RPC with read chunks in the reply message looks like this: An RPC with read chunks in the reply message:
RPC Client RPC Server RPC Client RPC Server
| RPC Call | | RPC Call |
Send | ------------------------------> | Send | ------------------------------> |
| | | |
| RPC Reply + Read Chunk list | | RPC Reply + Read Chunk list |
| <------------------------------ | Send | <------------------------------ | Send
| | | |
| Chunk 1 | | Chunk 1 |
Read | ------------------------------+ | Read | ------------------------------+ |
skipping to change at page 20, line 45 skipping to change at page 20, line 45
MUST be generated. MUST be generated.
Two types of errors are defined, version mismatch and invalid chunk Two types of errors are defined, version mismatch and invalid chunk
format. When the peer detects an RPC over RDMA header version format. When the peer detects an RPC over RDMA header version
which it does not support (currently this draft defines only which it does not support (currently this draft defines only
version 1), it replies with an error code of ERR_VERS, and provides version 1), it replies with an error code of ERR_VERS, and provides
the low and high inclusive version numbers it does, in fact, the low and high inclusive version numbers it does, in fact,
support. The version number in this reply MAY be any value support. The version number in this reply MAY be any value
otherwise valid at the receiver. When other decoding errors are otherwise valid at the receiver. When other decoding errors are
detected in the header or chunks, either an RPC decode error MAY be detected in the header or chunks, either an RPC decode error MAY be
returned, or the ROC/RDMA error code ERR_CHUNK MUST be returned. returned, or the RPC/RDMA error code ERR_CHUNK MUST be returned.
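As a rough sketch of the two error cases described above, a receiver might validate an incoming header as follows. The names check_header and ErrorReply, and the chunks_ok flag, are hypothetical; the enum values are placeholders rather than wire values. The sketch always answers a malformed chunk list with ERR_CHUNK, although the text also allows an RPC decode error in that case.

```python
from enum import Enum, auto
from typing import NamedTuple, Optional


class RdmaError(Enum):
    # Placeholder values, not the on-the-wire encoding.
    ERR_VERS = auto()
    ERR_CHUNK = auto()


# This draft defines only version 1.
SUPPORTED_LOW = 1
SUPPORTED_HIGH = 1


class ErrorReply(NamedTuple):
    code: RdmaError
    vers_low: Optional[int] = None   # set only for ERR_VERS
    vers_high: Optional[int] = None  # set only for ERR_VERS


def check_header(version: int, chunks_ok: bool) -> Optional[ErrorReply]:
    """Return an error reply for a bad header, or None if it is acceptable."""
    if not (SUPPORTED_LOW <= version <= SUPPORTED_HIGH):
        # Report the inclusive range of versions the receiver supports.
        return ErrorReply(RdmaError.ERR_VERS, SUPPORTED_LOW, SUPPORTED_HIGH)
    if not chunks_ok:
        return ErrorReply(RdmaError.ERR_CHUNK)
    return None
```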
4.3. XDR Language Description 4.3. XDR Language Description
Here is the message layout in XDR language. Here is the message layout in XDR language.
struct xdr_rdma_segment { struct xdr_rdma_segment {
uint32 handle; /* Registered memory handle */ uint32 handle; /* Registered memory handle */
uint32 length; /* Length of the chunk in bytes */ uint32 length; /* Length of the chunk in bytes */
uint64 offset; /* Chunk virtual address or offset */ uint64 offset; /* Chunk virtual address or offset */
}; };
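For illustration, the xdr_rdma_segment structure above can be packed and unpacked with Python's standard struct module. XDR encodes integers big-endian, so the two uint32 fields and the uint64 field occupy 16 bytes. The helper names encode_segment and decode_segment are assumptions made for this sketch.

```python
import struct

# handle (uint32), length (uint32), offset (uint64), all big-endian per XDR.
_SEGMENT = struct.Struct(">IIQ")


def encode_segment(handle: int, length: int, offset: int) -> bytes:
    """Encode one xdr_rdma_segment as 16 bytes of XDR."""
    return _SEGMENT.pack(handle, length, offset)


def decode_segment(data: bytes) -> tuple:
    """Decode 16 bytes of XDR into (handle, length, offset)."""
    return _SEGMENT.unpack(data)
```

A round trip such as decode_segment(encode_segment(0x1234, 4096, 0x1000)) should return the original three values.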
skipping to change at page 25, line 5 skipping to change at page 25, line 5
Read of the long RPC message into it. The receiver then proceeds Read of the long RPC message into it. The receiver then proceeds
to XDR decode the RPC message as if it had received it inline with to XDR decode the RPC message as if it had received it inline with
the Send data. Further decoding may issue additional RDMA Reads to the Send data. Further decoding may issue additional RDMA Reads to
bring over additional chunks. bring over additional chunks.
Although the handling of long messages requires one extra network Although the handling of long messages requires one extra network
turnaround, in practice these messages will be rare if the posted turnaround, in practice these messages will be rare if the posted
receive buffers are correctly sized, and of course they will be receive buffers are correctly sized, and of course they will be
non-existent for RDMA-aware upper layers. non-existent for RDMA-aware upper layers.
A long call RPC with request supplied via RDMA Read looks like this: A long call RPC with request supplied via RDMA Read
RPC Client RPC Server RPC Client RPC Server
| RDMA over RPC Header | | RDMA over RPC Header |
Send | ------------------------------> | Send | ------------------------------> |
| | | |
| Long RPC Call Msg | | Long RPC Call Msg |
| +------------------------------ | Read | +------------------------------ | Read
| v-----------------------------> | | v-----------------------------> |
| | | |
| RDMA over RPC Reply | | RDMA over RPC Reply |
| <------------------------------ | Send | <------------------------------ | Send
An RPC with long reply returned via RDMA Read looks like this: An RPC with long reply returned via RDMA Read
RPC Client RPC Server RPC Client RPC Server
| RPC Call | | RPC Call |
Send | ------------------------------> | Send | ------------------------------> |
| | | |
| RDMA over RPC Header | | RDMA over RPC Header |
| <------------------------------ | Send | <------------------------------ | Send
| | | |
| Long RPC Reply Msg | | Long RPC Reply Msg |
Read | ------------------------------+ | Read | ------------------------------+ |
skipping to change at page 26, line 16 skipping to change at page 26, line 16
sized to receive a large reply and enters its steering tag, address sized to receive a large reply and enters its steering tag, address
and length in the rdma_reply write chunk. If the reply message is and length in the rdma_reply write chunk. If the reply message is
too long to return inline with an RDMA Send (exceeds the size of too long to return inline with an RDMA Send (exceeds the size of
the client's posted receive buffer), even with read chunks removed, the client's posted receive buffer), even with read chunks removed,
then the RPC server performs an RDMA Write of the RPC reply message then the RPC server performs an RDMA Write of the RPC reply message
into the buffer indicated by the rdma_reply chunk. If the client into the buffer indicated by the rdma_reply chunk. If the client
doesn't provide an rdma_reply chunk, or if it's too small, then if doesn't provide an rdma_reply chunk, or if it's too small, then if
the upper layer specification permits, the message MAY be returned the upper layer specification permits, the message MAY be returned
as a Read chunk. as a Read chunk.
An RPC with long reply returned via RDMA Write looks like this: An RPC with long reply returned via RDMA Write
RPC Client RPC Server RPC Client RPC Server
| RPC Call with rdma_reply | | RPC Call with rdma_reply |
Send | ------------------------------> | Send | ------------------------------> |
| | | |
| Long RPC Reply Msg | | Long RPC Reply Msg |
| <------------------------------ | Write | <------------------------------ | Write
| | | |
| RDMA over RPC Header | | RDMA over RPC Header |
| <------------------------------ | Send | <------------------------------ | Send
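The server's choice of reply path described above might be sketched like this. The function name, its parameters, and the returned labels are hypothetical. The "no_path" outcome is the sketch's own placeholder, since the text quoted here does not say what happens when none of the three options applies.

```python
from typing import Optional


def choose_reply_path(reply_len: int,
                      client_recv_buf: int,
                      rdma_reply_len: Optional[int],
                      read_chunk_permitted: bool) -> str:
    """Pick how a server returns a reply of reply_len bytes.

    reply_len is the size of the reply with read chunks already removed.
    rdma_reply_len is the size of the client's rdma_reply write chunk,
    or None if the client did not provide one.
    """
    if reply_len <= client_recv_buf:
        return "send_inline"   # fits the client's posted receive buffer
    if rdma_reply_len is not None and reply_len <= rdma_reply_len:
        return "rdma_write"    # write into the rdma_reply chunk
    if read_chunk_permitted:
        return "read_chunk"    # only if the upper layer allows it
    return "no_path"           # placeholder outcome, not from the draft
```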
skipping to change at page 32, line 48 skipping to change at page 32, line 48
[RFC1094] [RFC1094]
Sun Microsystems, "NFS: Network File System Protocol Sun Microsystems, "NFS: Network File System Protocol
Specification", (NFS version 2) Informational RFC, Specification", (NFS version 2) Informational RFC,
http://www.ietf.org/rfc/rfc1094.txt http://www.ietf.org/rfc/rfc1094.txt
[RFC1831] [RFC1831]
R. Srinivasan, "RPC: Remote Procedure Call Protocol R. Srinivasan, "RPC: Remote Procedure Call Protocol
Specification Version 2", Standards Track RFC, Specification Version 2", Standards Track RFC,
http://www.ietf.org/rfc/rfc1831.txt http://www.ietf.org/rfc/rfc1831.txt
[RFC1832] [RFC4506]
R. Srinivasan, "XDR: External Data Representation Standard", M. Eisler Ed., "XDR: External Data Representation Standard",
Standards Track RFC, http://www.ietf.org/rfc/rfc1832.txt Standards Track RFC, http://www.ietf.org/rfc/rfc4506.txt
[RFC1813] [RFC1813]
B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3
Protocol Specification", Informational RFC, Protocol Specification", Informational RFC,
http://www.ietf.org/rfc/rfc1813.txt http://www.ietf.org/rfc/rfc1813.txt
[RFC1833] [RFC1833]
R. Srinivasan, "Binding Protocols for ONC RPC Version 2", R. Srinivasan, "Binding Protocols for ONC RPC Version 2",
Standards Track RFC, http://www.ietf.org/rfc/rfc1833.txt Standards Track RFC, http://www.ietf.org/rfc/rfc1833.txt
[RFC3530] [RFC3530]
skipping to change at page 34, line 36 skipping to change at page 34, line 36
Brent Callaghan Brent Callaghan
Apple Computer, Inc. Apple Computer, Inc.
MS: 302-4K MS: 302-4K
2 Infinite Loop 2 Infinite Loop
Cupertino, CA 95014 USA Cupertino, CA 95014 USA
EMail: brentc@apple.com EMail: brentc@apple.com
17. Intellectual Property and Copyright Statements 17. Intellectual Property and Copyright Statements
Intellectual Property Statement Full Copyright Statement
Copyright (C) The IETF Trust (2007).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on
an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed Intellectual Property Rights or other rights that might be claimed
to pertain to the implementation or use of the technology described to pertain to the implementation or use of the technology described
in this document or the extent to which any license under such in this document or the extent to which any license under such
rights might or might not be available; nor does it represent that rights might or might not be available; nor does it represent that
it has made any independent effort to identify any such rights. it has made any independent effort to identify any such rights.
Information on the procedures with respect to rights in RFC Information on the procedures with respect to rights in RFC
documents can be found in BCP 78 and BCP 79. documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any Copies of IPR disclosures made to the IETF Secretariat and any
skipping to change at page 35, line 16 skipping to change at page 35, line 32
of such proprietary rights by implementers or users of this of such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository specification can be obtained from the IETF on-line IPR repository
at http://www.ietf.org/ipr. at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at ietf- this standard. Please address the information to the IETF at ietf-
ipr@ietf.org. ipr@ietf.org.
Disclaimer of Validity Acknowledgment
Funding for the RFC Editor function is provided by the IETF
This document and the information contained herein are provided on Administrative Support Activity (IASA).
an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.
Copyright Statement
Copyright (C) The Internet Society (2006).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
Acknowledgement
Funding for the RFC Editor function is currently provided by the
Internet Society.
 End of changes. 18 change blocks. 
19 lines changed or deleted 43 lines changed or added
