draft-ietf-nfsv4-nfsdirect-06.txt   draft-ietf-nfsv4-nfsdirect-07.txt 
NFSv4 Working Group Tom Talpey NFSv4 Working Group Tom Talpey
Internet-Draft Network Appliance, Inc. Internet-Draft NetApp
Intended status: Standards Track Brent Callaghan Intended status: Standards Track Brent Callaghan
Expires: January 1, 2008 Apple Computer, Inc. Expires: August 23, 2008 Apple
July 1, 2007 February 22, 2008
NFS Direct Data Placement NFS Direct Data Placement
draft-ietf-nfsv4-nfsdirect-06 draft-ietf-nfsv4-nfsdirect-07
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 38 skipping to change at page 1, line 38
progress." progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
Abstract Abstract
The RDMA transport for ONC RPC provides direct data placement for NFS This draft defines the bindings of the various Network File System
data. Direct data placement not only reduces the amount of data that (NFS) versions to the Remote Direct Memory Access (RDMA) operations
needs to be copied in an NFS call, but allows much of the data supported by the RPC/RDMA transport protocol. It describes the use
movement over the network to be implemented in RDMA hardware. This of direct data placement by means of server-initiated RDMA operations
draft describes the use of direct data placement by means of server- into client-supplied buffers for implementations of NFS versions 2,
initiated RDMA operations into client-supplied buffers in a Chunk 3, 4 and 4.1 over such an RDMA transport.
list for implementations of NFS versions 2, 3, and 4 over an RDMA
transport.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Transfers from NFS Client to NFS Server . . . . . . . . . . 2 2. Transfers from NFS Client to NFS Server . . . . . . . . . . 2
3. Transfers from NFS Server to NFS Client . . . . . . . . . . 3 3. Transfers from NFS Server to NFS Client . . . . . . . . . . 3
4. NFS Versions 2 and 3 Mapping . . . . . . . . . . . . . . . . 4 4. NFS Versions 2 and 3 Mapping . . . . . . . . . . . . . . . . 4
5. NFS Version 4 Mapping . . . . . . . . . . . . . . . . . . . 5 5. NFS Version 4 Mapping . . . . . . . . . . . . . . . . . . . 5
6. Security . . . . . . . . . . . . . . . . . . . . . . . . . . 7 5.1. NFS Version 4 Callbacks . . . . . . . . . . . . . . . . . 7
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . 7 6. Security Considerations . . . . . . . . . . . . . . . . . . 8
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . 8
8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 8 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 8
9. Normative References . . . . . . . . . . . . . . . . . . . . 8 9. Normative References . . . . . . . . . . . . . . . . . . . . 9
10. Informative References . . . . . . . . . . . . . . . . . . 8 10. Informative References . . . . . . . . . . . . . . . . . 10
11. Authors' Addresses . . . . . . . . . . . . . . . . . . . . 9 11. Authors' Addresses . . . . . . . . . . . . . . . . . . . 10
12. Intellectual Property and Copyright Statements . . . . . . 9 12. Intellectual Property and Copyright Statements . . . . . 10
Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . 10 Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . 11
Requirements Language Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. document are to be interpreted as described in [RFC2119].
1. Introduction 1. Introduction
The RDMA Transport for ONC RPC [RPCRDMA] allows an RPC client The Remote Direct Memory Access (RDMA) Transport for Remote Procedure
application to post buffers in a Chunk list for specific arguments Calls (RPC) [RPCRDMA] allows an RPC client application to post
and results from an RPC call. The RDMA transport header conveys this buffers in a Chunk list for specific arguments and results from an
list of client buffer addresses to the server where the application RPC call. The RDMA transport header conveys this list of client
can associate them with client data and use RDMA operations to buffer addresses to the server where the application can associate
transfer the results directly to and from the posted buffers on the them with client data and use RDMA operations to transfer the results
client. The client and server must agree on a consistent mapping of directly to and from the posted buffers on the client. The client
posted buffers to RPC. This document details the mapping for each and server must agree on a consistent mapping of posted buffers to
version of the NFS protocol [RFC1094] [RFC1813] [RFC3530] [NFSv4.1]. RPC. This document details the mapping for each version of the NFS
protocol [RFC1094] [RFC1813] [RFC3530] [NFSv4.1].
2. Transfers from NFS Client to NFS Server 2. Transfers from NFS Client to NFS Server
The RDMA Read list, in the RDMA transport header, allows an RPC The RDMA Read list, in the RDMA transport header, allows an RPC
client to marshal RPC call data selectively. Large chunks of data, client to marshal RPC call data selectively. Large chunks of data,
such as the file data of an NFS WRITE request, MAY be referenced by such as the file data of an NFS WRITE request, MAY be referenced by
an RDMA Read list and be moved efficiently and directly-placed by an an RDMA Read list and be moved efficiently and directly-placed by an
RDMA READ operation initiated by the server. RDMA READ operation initiated by the server.
The process of identifying these chunks for the RDMA Read list can be The process of identifying these chunks for the RDMA Read list can be
skipping to change at page 4, line 8 skipping to change at page 4, line 10
The sum of the segment lengths yields the total size of the buffer, The sum of the segment lengths yields the total size of the buffer,
which MUST be large enough to accept the result. If the buffer is which MUST be large enough to accept the result. If the buffer is
too small, the server MUST return an XDR encode error. The server too small, the server MUST return an XDR encode error. The server
MUST return the result data for a posted buffer by progressively MUST return the result data for a posted buffer by progressively
filling its segments, perhaps leaving some trailing segments unfilled filling its segments, perhaps leaving some trailing segments unfilled
or partially full if the size of the result is less than the total or partially full if the size of the result is less than the total
size of the buffer segments. size of the buffer segments.
The server returns the RDMA Write list to the client with the segment The server returns the RDMA Write list to the client with the segment
length fields overwritten to indicate the amount of data RDMA Written length fields overwritten to indicate the amount of data RDMA Written
to each segment. Results returned by direct placement MUST not be to each segment. Results returned by direct placement MUST NOT be
returned by other methods, e.g., by read chunk list or inline. If no returned by other methods, e.g., by read chunk list or inline. If no
result data at all is returned for the element, the server places no result data at all is returned for the element, the server places no
data in the buffer(s), but does return zeroes in the segment length data in the buffer(s), but does return zeroes in the segment length
fields corresponding to the result. fields corresponding to the result.
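
The segment-filling rule above can be sketched as follows. This is an illustration only, not text from either draft revision: the function name, the list-of-lengths representation of a posted buffer, and the use of ValueError as a stand-in for the XDR encode error are all assumptions made for the example.

```python
def fill_segments(segment_lengths, result_len):
    """Return the per-segment byte counts a server would report back.

    segment_lengths: lengths of the segments of one posted buffer
                     (hypothetical representation, not a wire format).
    result_len:      size in bytes of the result to place.
    """
    if result_len > sum(segment_lengths):
        # The draft requires an XDR encode error in this case;
        # ValueError is only a placeholder for that error here.
        raise ValueError("posted buffer too small for result")

    written = []
    remaining = result_len
    for length in segment_lengths:
        # Fill segments in order; trailing segments may end up
        # partially full or empty (length 0).
        used = min(length, remaining)
        written.append(used)
        remaining -= used
    return written


if __name__ == "__main__":
    # Three 4096-byte segments and a 5000-byte result.
    print(fill_segments([4096, 4096, 4096], 5000))
```

The returned list corresponds to the overwritten segment length fields; a result of zero bytes yields all zeroes, matching the "no result data" case described above.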
The RDMA Write list allows the client to provide multiple result The RDMA Write list allows the client to provide multiple result
buffers - each buffer maps to a specific result in the reply. The NFS buffers - each buffer maps to a specific result in the reply. The NFS
client and server implementations agree by specifying the mapping of client and server implementations agree by specifying the mapping of
results to buffers for each RPC procedure. The following sections results to buffers for each RPC procedure. The following sections
describe this mapping for versions of the NFS protocol. describe this mapping for versions of the NFS protocol.
skipping to change at page 5, line 49 skipping to change at page 6, line 4
padding which may be desirable for certain servers when RDMA Read is padding which may be desirable for certain servers when RDMA Read is
impractical. impractical.
5. NFS Version 4 Mapping 5. NFS Version 4 Mapping
This specification applies to the first minor version of NFS version This specification applies to the first minor version of NFS version
4 (NFSv4.0) and any subsequent minor versions that do not override 4 (NFSv4.0) and any subsequent minor versions that do not override
this mapping. this mapping.
The Write list MUST be considered only for the COMPOUND procedure. The Write list MUST be considered only for the COMPOUND procedure.
This procedure returns results from a sequence of operations. Only This procedure returns results from a sequence of operations. Only
the opaque file data from an NFS READ operation, and the pathname the opaque file data from an NFS READ operation, and the pathname
from a READLINK operation MUST utilize entries from the Write list. from a READLINK operation MUST utilize entries from the Write list.
If there is no Write list, i.e., the list is null, then any READ or If there is no Write list, i.e., the list is null, then any READ or
READLINK operations in the COMPOUND MUST return their data inline. READLINK operations in the COMPOUND MUST return their data inline.
The NFSv4.0 client MUST ensure that any result of its READ and The NFSv4.0 client MUST ensure in this case that any result of its
READLINK requests fits within its receive buffers, lest an RDMA READ and READLINK requests will fit within its receive buffers, in
transport error result upon transfer. order to avoid a resulting RDMA transport error upon transfer. The
server is not required to detect this.
The first entry in the Write list MUST be used by the first READ or The first entry in the Write list MUST be used by the first READ or
READLINK in the COMPOUND request. The next Write list entry is used READLINK in the COMPOUND request. The next Write list entry is used
by the next READ or READLINK, and so on. If there are more READ or by the next READ or READLINK, and so on. If there are more READ or
READLINK operations than Write list entries, then any remaining READLINK operations than Write list entries, then any remaining
operations MUST return their results inline. operations MUST return their results inline.
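
The positional pairing of Write list entries with READ and READLINK operations can be sketched as below. The function name and the representation of a COMPOUND as a list of operation names are assumptions for illustration. The sketch covers only the ordering rule and the inline fallback; it does not model zero-length Write list entries, whose handling is not fully shown in this excerpt.

```python
def map_write_list(ops, write_list_len):
    """Pair each READ/READLINK in a COMPOUND with a Write list index.

    ops:            operation names in COMPOUND order (hypothetical form).
    write_list_len: number of entries in the Write list (0 if null).

    Returns (position, op, entry) tuples, where entry is a Write list
    index, or None when the result must be returned inline.
    """
    mapping = []
    next_entry = 0
    for position, op in enumerate(ops):
        if op not in ("READ", "READLINK"):
            continue  # only READ and READLINK consume Write list entries
        if next_entry < write_list_len:
            mapping.append((position, op, next_entry))
            next_entry += 1
        else:
            # More READ/READLINK operations than entries: inline result.
            mapping.append((position, op, None))
    return mapping


if __name__ == "__main__":
    compound = ["PUTFH", "READ", "READLINK", "READ"]
    for row in map_write_list(compound, 2):
        print(row)
```

With a null Write list (length 0), every READ and READLINK maps to None, which corresponds to the all-inline case described earlier in this section.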
If a Write list entry is presented, then the corresponding READ or If a Write list entry is presented, then the corresponding READ or
READLINK MUST return its data via an RDMA WRITE to the buffer READLINK MUST return its data via an RDMA WRITE to the buffer
indicated by the Write list entry. If the Write list entry has zero indicated by the Write list entry. If the Write list entry has zero
skipping to change at page 7, line 16 skipping to change at page 7, line 20
Non-RDMA (inline) WRITE transfers MAY OPTIONALLY employ the Non-RDMA (inline) WRITE transfers MAY OPTIONALLY employ the
"RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the
appropriate value for the server is known to the client. Padding appropriate value for the server is known to the client. Padding
allows the opaque file data to arrive at the server in an aligned allows the opaque file data to arrive at the server in an aligned
fashion, which may improve server performance. In order to ensure fashion, which may improve server performance. In order to ensure
accurate alignment for all data, it is likely that the client will accurate alignment for all data, it is likely that the client will
restrict its use of OPTIONAL padding to COMPOUND requests containing restrict its use of OPTIONAL padding to COMPOUND requests containing
only a single WRITE operation. only a single WRITE operation.
Unlike NFS versions 2 and 3, the maximum size of an NFS version 4 Unlike NFS versions 2 and 3, the maximum size of an NFS version 4
COMPOUND is unbounded, even when RDMA chunks are in use. While it COMPOUND is not bounded, even when RDMA chunks are in use. While it
might appear that a configuration protocol exchange (such as the one might appear that a configuration protocol exchange (such as the one
described in [RPCRDMA]) would help, in fact the layering issues described in [RPCRDMA]) would help, in fact the layering issues
involved in building COMPOUNDs by NFS make such a mechanism involved in building COMPOUNDs by NFS make such a mechanism
unworkable. unworkable.
However, typical NFS version 4 clients rarely issue such problematic However, typical NFS version 4 clients rarely issue such problematic
requests. In practice, they behave in much more predictable ways; in requests. In practice, they behave in much more predictable ways; in
fact, most still support the traditional rsize/wsize mount parameters. fact, most still support the traditional rsize/wsize mount parameters.
Therefore, most NFS version 4 clients function over RPC/RDMA in the Therefore, most NFS version 4 clients function over RPC/RDMA in the
same way as NFS versions 2 and 3, operationally. same way as NFS versions 2 and 3, operationally.
There are however advantages to allowing both client and server to There are however advantages to allowing both client and server to
operate with prearranged size constraints, for example use of the operate with prearranged size constraints, for example use of the
sizes to better manage the server's response cache. An extension to sizes to better manage the server's response cache. An extension to
NFS version 4 supporting a more comprehensive exchange of upper layer NFS version 4 supporting a more comprehensive exchange of upper layer
parameters is part of [NFSv4.1]. parameters is part of [NFSv4.1].
6. Security 5.1. NFS Version 4 Callbacks
The RDMA transport for ONC RPC supports RPCSEC_GSS security as well The NFS version 4 protocols support server-initiated callbacks to
as link-level security. The use of RDMA Write to return RPC results selected clients, in order to notify them of events such as recalled
does not affect ONC RPC security. delegations, etc. These callbacks present no particular issue to
being framed over RPC/RDMA, since such callbacks do not carry bulk
data such as read or write. They MAY be transmitted inline via
RDMA_MSG, or if the callback message or its reply overflow the
negotiated buffer sizes for a callback connection, they MAY be
transferred via the RDMA_NOMSG method as described above for other
exchanges.
One special case is noteworthy: in NFS version 4.1, the callback
channel is optionally negotiated to be on the same connection as one
used for client requests. In this case, and because the XID is
present in the RPC/RDMA header, the client MUST ascertain whether the
message is in fact an RPC REPLY, and therefore a reply to a prior
request and carrying its XID, before processing it as such. By the
same token, the server MUST ascertain whether an incoming message on
such a callback-eligible connection is an RPC CALL, before optionally
processing the XID.
In the callback case, the XID present in the RPC/RDMA header will
potentially have any value which may (or may not) collide with an XID
used by the client for a previous or future request. The client and
server MUST inspect the RPC component of the message to determine its
potential disposition as either an RPC CALL or RPC REPLY, prior to
processing this XID, and MUST NOT reject or accept it without also
determining the proper context.
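
The requirement to inspect the RPC component before acting on the XID can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the set of pending XIDs, and the returned labels are hypothetical, and the input is assumed to be the start of the RPC message itself (XID followed by the message type, both 32-bit big-endian XDR integers), already separated from the RPC/RDMA header.

```python
import struct

# msg_type values from the ONC RPC message definition.
CALL, REPLY = 0, 1


def classify_rpc_message(rpc_bytes, pending_xids):
    """Classify a message received on a connection shared with callbacks.

    rpc_bytes:    start of the RPC message (hypothetical input form).
    pending_xids: XIDs of this endpoint's outstanding requests.
    """
    if len(rpc_bytes) < 8:
        raise ValueError("RPC message too short")
    xid, msg_type = struct.unpack(">II", rpc_bytes[:8])

    if msg_type == CALL:
        # A callback CALL may carry an XID that collides with one of
        # our own requests; it must not be treated as a reply.
        return ("call", xid)
    if msg_type == REPLY:
        if xid in pending_xids:
            return ("reply", xid)
        return ("unmatched-reply", xid)
    raise ValueError("unknown RPC message type")


if __name__ == "__main__":
    pending = {7}
    print(classify_rpc_message(struct.pack(">II", 7, CALL), pending))
    print(classify_rpc_message(struct.pack(">II", 7, REPLY), pending))
```

The first call in the usage example shows the collision case: the XID matches an outstanding request, yet the message is classified as a CALL because the message type is examined first.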
6. Security Considerations
The RDMA transport for RPC [RPCRDMA] supports all RPC [RFC1831bis]
security models, including RPCSEC_GSS [RFC2203] security and link-
level security. The choice of RDMA Read and RDMA Write to return RPC
argument and results, respectively, does not affect this, since it
only changes the method of data transfer. Specifically, the
requirements of [RPCRDMA] ensure that this choice does not introduce
new vulnerabilities.
Because this document defines only the binding of the NFS protocols
atop [RPCRDMA], all relevant security considerations are therefore to
be described at that layer.
7. IANA Considerations 7. IANA Considerations
NFS use of direct data placement introduces a need for an additional NFS use of direct data placement introduces a need for an additional
NFS port number assignment for networks which share traditional UDP NFS port number assignment for networks which share traditional UDP
and TCP port spaces with RDMA services. The iWARP [DDP] [RDMAP] and TCP port spaces with RDMA services. The iWARP [RFC5041]
protocol is such an example (Infiniband is not). [RFC5040] protocol is such an example (Infiniband is not).
NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally
listen for clients on UDP and TCP port 2049, and additionally, they listen for clients on UDP and TCP port 2049, and additionally, they
register these with the portmapper and/or rpcbind [RFC1833] service. register these with the portmapper and/or rpcbind [RFC1833] service.
However, NFS servers for version 4 [RFC3530] are required by that However, [RFC3530] requires NFS servers for version 4 to listen on
specification to listen on TCP port 2049, and are not required to TCP port 2049, and they are not required to register.
register.
An NFS version 2 or version 3 server supporting RPC/RDMA on such a An NFS version 2 or version 3 server supporting RPC/RDMA on such a
network and registering itself with the RPC portmapper MAY choose an network and registering itself with the RPC portmapper MAY choose an
arbitrary port, or MAY use the alternative well-known port number for arbitrary port, or MAY use the alternative well-known port number for
its RPC/RDMA service by IANA. The chosen port MAY be registered with its RPC/RDMA service. The chosen port MAY be registered with the RPC
the RPC portmapper under the netid assigned by the requirement in portmapper under the netid assigned by the requirement in [RPCRDMA].
[RPCRDMA].
An NFS version 4 server supporting RPC/RDMA on such a network MUST An NFS version 4 server supporting RPC/RDMA on such a network MUST
use the alternative well-known port number for its RPC/RDMA service use the alternative well-known port number for its RPC/RDMA service.
by IANA. Clients SHOULD connect to this well-known port without Clients SHOULD connect to this well-known port without consulting the
consulting the RPC portmapper (as for NFSv4/TCP). The port number RPC portmapper (as for NFSv4/TCP).
assigned to an NFS service over an RPC/RDMA transport is available
from the IANA port registry [RFC3232]. The port number assigned to an NFS service over an RPC/RDMA transport
is available from the IANA port registry [RFC3232].
8. Acknowledgements 8. Acknowledgements
The authors would like to thank Dave Noveck and Chet Juszczak for The authors would like to thank Dave Noveck and Chet Juszczak for
their contributions to this document. their contributions to this document.
9. Normative References 9. Normative References
[RFC2119] [RFC2119]
S. Bradner, "Key words for use in RFCs to Indicate Requirement S. Bradner, "Key words for use in RFCs to Indicate Requirement
Levels", Levels",
Best Current Practice, Best Current Practice,
BCP 14, RFC 2119, March 1997. BCP 14, RFC 2119, March 1997.
[RFC1094] [RFC1094]
"NFS: Network File System Protocol Specification", "NFS: Network File System Protocol Specification",
(NFS version 2) Informational RFC, (NFS version 2) Informational RFC,
http://www.ietf.org/rfc/rfc1094.txt http://www.ietf.org/rfc/rfc1094.txt
[RFC1831bis]
R. Thurlow, Ed., "RPC: Remote Procedure Call Protocol
Specification Version 2",
Standards Track RFC
[RFC1813] [RFC1813]
B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
Specification", Specification",
Informational RFC, Informational RFC,
http://www.ietf.org/rfc/rfc1813.txt http://www.ietf.org/rfc/rfc1813.txt
[RFC1833] [RFC1833]
R. Srinivasan, "Binding Protocols for ONC RPC Version 2", R. Srinivasan, "Binding Protocols for ONC RPC Version 2",
Standards Track RFC, Standards Track RFC,
http://www.ietf.org/rfc/rfc1833.txt http://www.ietf.org/rfc/rfc1833.txt
[RFC3530] [RFC3530]
S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. S. Shepler, et al., "NFS version 4 Protocol",
Eisler, D. Noveck, "NFS version 4 Protocol",
Standards Track RFC, Standards Track RFC,
http://www.ietf.org/rfc/rfc3530.txt http://www.ietf.org/rfc/rfc3530.txt
[NFSv4.1]
S. Shepler et al., ed., "NFSv4 Minor Version 1"
Internet Draft Work in Progress,
draft-ietf-nfsv4-minorversion1
[RFC2203]
M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol Specification",
Standards Track RFC,
http://www.ietf.org/rfc/rfc2203.txt
10. Informative References 10. Informative References
[RFC3232] [RFC3232]
Internet Assigned Numbers Authority (IANA), Internet Assigned Numbers Authority (IANA),
Port Registry database, Port Registry database,
http://www.ietf.org/rfc/rfc3232.txt http://www.ietf.org/rfc/rfc3232.txt
http://www.iana.org/assignments/port-numbers http://www.iana.org/assignments/port-numbers
[RPCRDMA] [RPCRDMA]
T. Talpey, B. Callaghan, "RDMA Transport for ONC RPC" T. Talpey, B. Callaghan, "Remote Direct Memory Access Transport
for Remote Procedure Call"
Internet Draft Work in Progress, Internet Draft Work in Progress,
draft-ietf-nfsv4-rpcrdma draft-ietf-nfsv4-rpcrdma
[NFSv4.1] [RFC5041]
S. Shepler et al., ed., "NFSv4 Minor Version 1"
Internet Draft Work in Progress,
draft-ietf-nfsv4-minorversion1
[DDP]
H. Shah et al., "Direct Data Placement over Reliable Transports", H. Shah et al., "Direct Data Placement over Reliable Transports",
Standards Track RFC, Standards Track RFC
draft-ietf-rddp-ddp
[RDMAP] [RFC5040]
R. Recio et al., "An RDMA Protocol Specification", R. Recio et al., "A Remote Direct Memory Access Protocol
Standards Track RFC, Specification",
draft-ietf-rddp-rdmap Standards Track RFC
11. Authors' Addresses 11. Authors' Addresses
Tom Talpey Tom Talpey
Network Appliance, Inc. Network Appliance, Inc.
375 Totten Pond Road 1601 Trapelo Road, #16
Waltham, MA 02451 USA Waltham, MA 02451 USA
Phone: +1 781 768 5329 Phone: +1 781 768 5329
EMail: thomas.talpey@netapp.com EMail: thomas.talpey@netapp.com
Brent Callaghan Brent Callaghan
Apple Computer, Inc. Apple Computer, Inc.
MS: 302-4K MS: 302-4K
2 Infinite Loop 2 Infinite Loop
Cupertino, CA 95014 USA Cupertino, CA 95014 USA
EMail: brentc@apple.com EMail: brentc@apple.com
12. Intellectual Property and Copyright Statements 12. Intellectual Property and Copyright Statements
Full Copyright Statement Full Copyright Statement
Copyright (C) The IETF Trust (2007). Copyright (C) The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors contained in BCP 78, and except as set forth therein, the authors
retain all their rights. retain all their rights.
This document and the information contained herein are provided on This document and the information contained herein are provided on
an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
 End of changes. 26 change blocks. 
67 lines changed or deleted 115 lines changed or added
