draft-ietf-nfsv4-rfc5666-implementation-experience-00.txt   draft-ietf-nfsv4-rfc5666-implementation-experience-01.txt 
NFSv4 C. Lever Network File System Version 4 C. Lever
Internet-Draft Oracle Internet-Draft Oracle
Intended status: Informational November 2, 2015 Intended status: Informational February 23, 2016
Expires: May 5, 2016 Expires: August 26, 2016
RPC-over-RDMA Version One Implementation Experience RPC-over-RDMA Version One Implementation Experience
draft-ietf-nfsv4-rfc5666-implementation-experience-00 draft-ietf-nfsv4-rfc5666-implementation-experience-01
Abstract Abstract
This document details experiences and challenges implementing the This document details experiences and challenges implementing the
RPC-over-RDMA Version One protocol. Specification changes are RPC-over-RDMA Version One protocol. Specification changes are
recommended to address avoidable interoperability failures. recommended to address avoidable interoperability failures.
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
skipping to change at page 1, line 32 skipping to change at page 1, line 32
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 5, 2016. This Internet-Draft will expire on August 26, 2016.
Copyright Notice Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3
1.2. Purpose Of This Document . . . . . . . . . . . . . . . . 3 1.2. Purpose Of This Document . . . . . . . . . . . . . . . . 3
1.3. Updating RFC 5666 . . . . . . . . . . . . . . . . . . . . 4 1.3. Updating RFC 5666 . . . . . . . . . . . . . . . . . . . . 3
2. RPC-Over-RDMA Essentials . . . . . . . . . . . . . . . . . . 5 2. RPC-Over-RDMA Essentials . . . . . . . . . . . . . . . . . . 4
2.1. Arguments And Results . . . . . . . . . . . . . . . . . . 5 2.1. Arguments And Results . . . . . . . . . . . . . . . . . . 4
2.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 5 2.2. Remote Direct Memory Access . . . . . . . . . . . . . . . 5
2.2.1. Direct Data Placement . . . . . . . . . . . . . . . . 6 2.3. Transfer Models . . . . . . . . . . . . . . . . . . . . . 6
2.2.2. Channel Operation . . . . . . . . . . . . . . . . . . 6 2.4. Upper Layer Binding Specifications . . . . . . . . . . . 7
2.2.3. Explicit RDMA Operation . . . . . . . . . . . . . . . 7
2.3. Transfer Models . . . . . . . . . . . . . . . . . . . . . 7
2.3.1. Read-Read . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2. Write-Write . . . . . . . . . . . . . . . . . . . . . 7
2.3.3. Read-Write . . . . . . . . . . . . . . . . . . . . . 8
2.4. Upper Layer Binding Specifications . . . . . . . . . . . 8
2.5. On-The-Wire Protocol . . . . . . . . . . . . . . . . . . 8 2.5. On-The-Wire Protocol . . . . . . . . . . . . . . . . . . 8
2.5.1. Inline Operation . . . . . . . . . . . . . . . . . . 8 3. Specification Issues . . . . . . . . . . . . . . . . . . . . 14
2.5.2. RDMA Segment . . . . . . . . . . . . . . . . . . . . 11 3.1. Extensibility Considerations . . . . . . . . . . . . . . 14
2.5.3. Chunk . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2. XDR Clarifications . . . . . . . . . . . . . . . . . . . 15
2.5.4. Read Chunk . . . . . . . . . . . . . . . . . . . . . 12 3.3. The Position Zero Read Chunk . . . . . . . . . . . . . . 18
2.5.5. Write Chunk . . . . . . . . . . . . . . . . . . . . . 12 3.4. RDMA_NOMSG Call Messages . . . . . . . . . . . . . . . . 20
2.5.6. Read List . . . . . . . . . . . . . . . . . . . . . . 13 3.5. RDMA_MSG Call with Position Zero Read Chunk . . . . . . . 21
2.5.7. Write List . . . . . . . . . . . . . . . . . . . . . 14 3.6. Padding Inline Content After A Chunk . . . . . . . . . . 22
2.5.8. Position Zero Read Chunk . . . . . . . . . . . . . . 14 3.7. Write Chunk XDR Roundup . . . . . . . . . . . . . . . . . 24
2.5.9. Reply Chunk . . . . . . . . . . . . . . . . . . . . . 15
3. Specification Issues . . . . . . . . . . . . . . . . . . . . 15
3.1. Extensibility Considerations . . . . . . . . . . . . . . 15
3.1.1. Recommendations . . . . . . . . . . . . . . . . . . . 16
3.2. XDR Clarifications . . . . . . . . . . . . . . . . . . . 16
3.2.1. Recommendations . . . . . . . . . . . . . . . . . . . 18
3.3. The Position Zero Read Chunk . . . . . . . . . . . . . . 19
3.3.1. Recommendations . . . . . . . . . . . . . . . . . . . 21
3.4. RDMA_NOMSG Call Messages . . . . . . . . . . . . . . . . 21
3.4.1. Recommendations . . . . . . . . . . . . . . . . . . . 22
3.5. RDMA_MSG Call with Position Zero Read Chunk . . . . . . . 22
3.5.1. Recommendations . . . . . . . . . . . . . . . . . . . 23
3.6. Padding Inline Content After A Chunk . . . . . . . . . . 23
3.6.1. Recommendations . . . . . . . . . . . . . . . . . . . 25
3.7. Write List XDR Roundup . . . . . . . . . . . . . . . . . 25
3.7.1. Recommendations . . . . . . . . . . . . . . . . . . . 26
3.8. Write List Error Cases . . . . . . . . . . . . . . . . . 26 3.8. Write List Error Cases . . . . . . . . . . . . . . . . . 26
3.8.1. Recommendations . . . . . . . . . . . . . . . . . . . 29
4. Operational Considerations . . . . . . . . . . . . . . . . . 29 4. Operational Considerations . . . . . . . . . . . . . . . . . 29
4.1. Computing Request Buffer Requirements . . . . . . . . . . 29 4.1. Computing Request Buffer Requirements . . . . . . . . . . 29
4.1.1. Recommendations . . . . . . . . . . . . . . . . . . . 30
4.2. Default Inline Buffer Size . . . . . . . . . . . . . . . 30 4.2. Default Inline Buffer Size . . . . . . . . . . . . . . . 30
4.2.1. Recommendations . . . . . . . . . . . . . . . . . . . 30
4.3. When To Use Reply Chunks . . . . . . . . . . . . . . . . 30 4.3. When To Use Reply Chunks . . . . . . . . . . . . . . . . 30
4.3.1. Recommendations . . . . . . . . . . . . . . . . . . . 31
4.4. Computing Credit Values . . . . . . . . . . . . . . . . . 31 4.4. Computing Credit Values . . . . . . . . . . . . . . . . . 31
4.4.1. Recommendations . . . . . . . . . . . . . . . . . . . 32
4.5. Race Windows . . . . . . . . . . . . . . . . . . . . . . 32 4.5. Race Windows . . . . . . . . . . . . . . . . . . . . . . 32
4.5.1. Recommendations . . . . . . . . . . . . . . . . . . . 32
5. Pre-requisites For NFSv4 . . . . . . . . . . . . . . . . . . 32 5. Pre-requisites For NFSv4 . . . . . . . . . . . . . . . . . . 32
5.1. Bi-directional Operation . . . . . . . . . . . . . . . . 32 5.1. Bi-directional Operation . . . . . . . . . . . . . . . . 32
5.1.1. Recommendations . . . . . . . . . . . . . . . . . . . 33
6. Considerations For Upper Layer Binding Specifications . . . . 33 6. Considerations For Upper Layer Binding Specifications . . . . 33
6.1. Organization Of Binding Specification Requirements . . . 33 6.1. Organization Of Binding Specification Requirements . . . 33
6.1.1. Recommendations . . . . . . . . . . . . . . . . . . . 34
6.2. RDMA-Eligibility . . . . . . . . . . . . . . . . . . . . 34 6.2. RDMA-Eligibility . . . . . . . . . . . . . . . . . . . . 34
6.2.1. Recommendations . . . . . . . . . . . . . . . . . . . 35 6.3. Inline Threshold Requirements . . . . . . . . . . . . . . 35
6.3. Violations Of Binding Rules . . . . . . . . . . . . . . . 35 6.4. Violations Of Binding Rules . . . . . . . . . . . . . . . 36
6.3.1. Recommendations . . . . . . . . . . . . . . . . . . . 36 6.5. Binding Specification Completion Assessment . . . . . . . 37
6.4. Binding Specification Completion Assessment . . . . . . . 36 7. Unimplemented Protocol Features . . . . . . . . . . . . . . . 38
6.4.1. Recommendations . . . . . . . . . . . . . . . . . . . 37 7.1. Unimplemented Features To Be Removed . . . . . . . . . . 38
7. Removal of Unimplemented Protocol Features . . . . . . . . . 37 7.2. Unimplemented Features To Be Retained . . . . . . . . . . 39
7.1. Read-Read Transfer Model . . . . . . . . . . . . . . . . 37 8. Security Considerations . . . . . . . . . . . . . . . . . . . 41
7.1.1. Recommendations . . . . . . . . . . . . . . . . . . . 37 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 41
7.2. RDMA_MSGP . . . . . . . . . . . . . . . . . . . . . . . . 37 10. Appendix A: XDR Language Description . . . . . . . . . . . . 42
7.2.1. Recommendations . . . . . . . . . . . . . . . . . . . 38 11. Appendix B: Binding Requirement Summary . . . . . . . . . . . 45
8. Security Considerations . . . . . . . . . . . . . . . . . . . 38 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 46
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 38 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 46
10. Appendix A: XDR Language Description . . . . . . . . . . . . 38 13.1. Normative References . . . . . . . . . . . . . . . . . . 46
11. Appendix B: Binding Requirement Summary . . . . . . . . . . . 41 13.2. Informative References . . . . . . . . . . . . . . . . . 48
12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 43 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 48
13. References . . . . . . . . . . . . . . . . . . . . . . . . . 43
13.1. Normative References . . . . . . . . . . . . . . . . . . 43
13.2. Informative References . . . . . . . . . . . . . . . . . 44
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 44
1. Introduction 1. Introduction
1.1. Requirements Language 1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in "OPTIONAL" in this document are to be interpreted as described in
[RFC2119]. [RFC2119].
skipping to change at page 25, line 26 skipping to change at page 24, line 26
restriction applies to inline content interleaved with write chunks. restriction applies to inline content interleaved with write chunks.
Because all XDR objects must start on an XDR alignment boundary, all Because all XDR objects must start on an XDR alignment boundary, all
read and write chunks and all inline XDR objects in any XDR stream read and write chunks and all inline XDR objects in any XDR stream
must start on an XDR alignment boundary. This has implications for must start on an XDR alignment boundary. This has implications for
the values allowed in read chunk Position fields, for how XDR roundup the values allowed in read chunk Position fields, for how XDR roundup
works for chunks, and for how XDR objects are placed in inline works for chunks, and for how XDR objects are placed in inline
buffers. XDR alignment in inline buffers is always relative to buffers. XDR alignment in inline buffers is always relative to
Position Zero (that is, where the RPC header starts). Position Zero (that is, where the RPC header starts).
3.7. Write List XDR Roundup 3.7. Write Chunk XDR Roundup
The final paragraph of RFC 5666 Section 3.7 says this: The final paragraph of RFC 5666 Section 3.7 says:
For RDMA Write Chunks, a simpler encoding method applies. Again, For RDMA Write Chunks, a simpler encoding method applies. Again,
roundup bytes are not transferred, instead the chunk length sent roundup bytes are not transferred, instead the chunk length sent
to the receiver in the reply is simply increased to include any to the receiver in the reply is simply increased to include any
roundup. roundup.
A responder should never write XDR pad bytes, as the requester's A responder should avoid writing XDR pad bytes, as the requester's
upper layers does not reference them. However, for the chunk length upper layer does not reference them, though the language does not
to be rounded up as described, the requester must provide adequate fully prohibit writing these bytes. A requester always provides the
extra space in the chunk for the XDR pad. A requester can provide extra space for XDR padding anyway.
space for the XDR pad using one of two approaches:
A problem arises if the data item written into a Write chunk is
shorter than the chunk and requires an XDR pad. A responder may
write the XDR pad past the end of the data content. For a short
directly-placed write, the pad bytes are then exposed in the RPC
consumer's data buffer.
In addition, for the chunk length to be rounded up as described, the
requester must provide adequate extra space in the chunk for the XDR
pad. A requester can provide space for the XDR pad using one of two
approaches:
1. It can extend the last segment in the chunk. 1. It can extend the last segment in the chunk.
2. It can provide another segment after the segments that receive 2. It can provide another segment after the segments that receive
RDMA Write payloads. RDMA Write payloads.
Case 1 is adequate when there is no danger that the responder's RDMA Case 1 is adequate when there is no danger that the responder's RDMA
Write operations will overwrite existing data on the requester in Write operations will overwrite existing data on the requester in
buffers following the advertised receive buffers. memory following the advertised receive buffers.
In Direct Data Placement scenarios, an extra segment must be provided In Direct Data Placement scenarios, an extra segment must be provided
separately to avoid overwriting existing data that follows the sink separately to avoid overwriting existing data that follows the sink
buffer (case 2). Thus, an extra registration is needed for just a buffer (case 2). Thus, an extra registration is needed for just a
handful of bytes that are not written by the responder. handful of bytes that may not be written by the responder, and are
ignored by the requester. Even so, this does not force the responder
to direct the XDR pad bytes into this extra segment, should the data
item in that chunk be shorter than the chunk itself.
Registering the extra buffer is a needless cost. It would be more Registering the extra buffer is a needless cost. It would be more
efficient if the XDR pad at the end of a write chunk were treated the efficient if the XDR pad at the end of a write chunk were treated the
same as it is for read chunks. Because RPC result data must begin on same as it is for Read chunks. Because RPC result data must begin on
an XDR alignment boundary, the result following the write chunk in an XDR alignment boundary, the result following the write chunk in
the reply's XDR stream must begin on an XDR alignment boundary. the reply's XDR stream must begin on an XDR alignment boundary.
There should be no need for an XDR pad to be present for the receiver There is no need for an XDR pad to be present for the receiver to re-
to re-assemble the RPC reply's XDR stream correctly. assemble the RPC reply's XDR stream properly.
Unfortunately at least one server implementation relies on the One responder implementation requires the requester to provide the
existence of that extra buffer, even though it does not write to it. extra buffer space in the Write chunk, but does not write to it.
Another server implementation does not rely on it (operation proceeds This follows the letter of the last paragraph of Section 3.7 of
if it is missing) but when it is present, this server does write [RFC5666].
zeroes to it.
Therefore the extra buffer for a write chunk's XDR pad, either as a Another responder implementation does not rely on having the extra
separate segment, or as an extension of the segment that represents space (operation proceeds if it is missing) but when the extra space
the data payload buffer, must remain for now. is present, this responder does write zeroes to it. While the
intention of Section 3.7 is that the responder does not write the
pad, it is not strictly forbidden.
Client implementations all appear to provide the extra buffer space
needed to accommodate the XDR pad. However, one implementation does
not register this extra buffer, since the responder is not expected
to write into it, while another implementation does.
These implementations may not be 100% interoperable. The language of
Section 3.7 of [RFC5666] appears to allow all of this behavior (in
particular, it does not prohibit a responder from writing the XDR pad
using RFC2119-style keywords, and does not require that requesters
register the extra space to accommodate the XDR pad).
Note that because the Reply chunk is a write chunk, these roundup Note that because the Reply chunk is a write chunk, these roundup
rules apply to it as well. rules also apply to it.
3.7.1. Recommendations 3.7.1. Recommendations
RFC 5666bis should provide a discussion of the requirements around The current specification allows XDR pad bytes to leak into user
write chunk roundup, with examples. The discussion should be buffers, and none of the current implementations prevent this leak.
separate from the discussion of read chunk roundup. There may be room to adjust the protocol specification independently
of current implementation behavior.
RFC 5666bis should explicitly discuss the requirements around write
chunk roundup separately from the discussion of read chunk roundup.
Explicit RFC2119-style interoperability requirements should be Explicit RFC2119-style interoperability requirements should be
provided in the text. For example, the requester MUST provide buffer provided for write chunks. Responders MUST NOT write XDR pad bytes
space for XDR roundup of write chunks, and the responder SHOULD NOT at the end of a Write chunk.
write into that buffer.
Allocating and registering extra space for XDR pad bytes that are
never written is wasteful. RFC 5666bis should forbid it. Responders
should not expect requesters to provide space for XDR pad bytes.
3.8. Write List Error Cases 3.8. Write List Error Cases
RFC 5666 Section 3.6 says: RFC 5666 Section 3.6 says:
When a write chunk list is provided for the results of the RPC When a write chunk list is provided for the results of the RPC
call, the RPC server MUST provide any corresponding data via RDMA call, the RPC server MUST provide any corresponding data via RDMA
Write to the memory referenced in the chunk list entries. Write to the memory referenced in the chunk list entries.
This requires the responder to use the Write list when it is This requires the responder to use the Write list when it is
skipping to change at page 35, line 45 skipping to change at page 35, line 45
their DDP-eligibility. RFC 5666bis should remind authors of Upper their DDP-eligibility. RFC 5666bis should remind authors of Upper
Layer Bindings that the Reply chunk and Position Zero read chunks are Layer Bindings that the Reply chunk and Position Zero read chunks are
expressly not for performance-critical Upper Layer operations. expressly not for performance-critical Upper Layer operations.
It is the responsibility of the Upper Layer Binding to specify RDMA- It is the responsibility of the Upper Layer Binding to specify RDMA-
eligibility rules so that if an RDMA-eligible XDR object is embedded eligibility rules so that if an RDMA-eligible XDR object is embedded
within another, only one of these two objects is to be represented by within another, only one of these two objects is to be represented by
a chunk. This ensures that the mapping from XDR position to the XDR a chunk. This ensures that the mapping from XDR position to the XDR
object represented is unambiguous. object represented is unambiguous.
6.3. Violations Of Binding Rules 6.3. Inline Threshold Requirements
An RPC-over-RDMA connection has two connection parameters that affect
the operation of Upper Layer Protocols: The credit limit, which is
how many outstanding RPCs are allowed on that connection; and the
inline threshold, which is the maximum payload size of an RDMA Send
on that connection. All ULPs sharing a connection also share the
same credits and inline threshold values.
The inline threshold is set when a connection is established. The
base RPC-over-RDMA protocol does not provide a mechanism for altering
the inline threshold of a connection once it has been established.
[RFC5667] places normative requirements on the inline threshold value
for a connection. There is no guidance provided on how
implementations should behave when two ULPs that have different
inline threshold requirements share the same connection.
Further, current NFS implementations ignore the inline threshold
requirements stated in [RFC5667]. It is unlikely that they would
interoperate successfully with any new implementation that followed
the letter of [RFC5667].
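The connection-sharing problem described above can be illustrated with a short sketch. It assumes, purely for illustration, that each ULP states a minimum inline threshold it requires; the function names, the ULP labels, and the byte values are hypothetical and are not drawn from [RFC5667] or any implementation.

```python
def fits_inline(header_len, rpc_msg_len, inline_threshold):
    """True if the transport header plus RPC message fit in one RDMA Send.

    The inline threshold is treated as the maximum RDMA Send payload size.
    """
    return header_len + rpc_msg_len <= inline_threshold


def unmet_requirements(connection_threshold, ulp_minimums):
    """Return the ULPs whose stated minimum exceeds the connection's threshold.

    ulp_minimums maps a ULP label to the minimum inline threshold that ULP
    is assumed to require. All ULPs on a connection share one threshold,
    fixed when the connection is established.
    """
    return sorted(
        name
        for name, minimum in ulp_minimums.items()
        if minimum > connection_threshold
    )
```

For example, unmet_requirements(1024, {"ulp_a": 1024, "ulp_b": 4096}) reports only "ulp_b". Because the threshold cannot be altered after connection establishment, the second ULP has no remedy other than a separate connection or operating below its stated requirement.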
6.3.1. Recommendations
Upper Layer Protocols should be able to operate no matter what inline
threshold is in use.
An Upper Layer Binding might provide informative guidance about
optimal values of an inline threshold, but normative requirements are
difficult to enforce unless connection sharing is explicitly not
permitted.
6.4. Violations Of Binding Rules
Section 3.4 of RFC 5666 introduces the idea of an Upper Layer Binding Section 3.4 of RFC 5666 introduces the idea of an Upper Layer Binding
specification to state which Upper Layer operations are allowed to specification to state which Upper Layer operations are allowed to
use explicit RDMA to transfer a bulk payload item. use explicit RDMA to transfer a bulk payload item.
The fifth paragraph of this section states: The fifth paragraph of this section states:
The interface by which an upper-layer implementation communicates The interface by which an upper-layer implementation communicates
the eligibility of a data item locally to RPC for chunking is out the eligibility of a data item locally to RPC for chunking is out
of scope for this specification. In many implementations, it is of scope for this specification. In many implementations, it is
skipping to change at page 36, line 25 skipping to change at page 37, line 11
If a violation does occur, RFC 5666 does not define an unambiguous If a violation does occur, RFC 5666 does not define an unambiguous
mechanism for reporting the violation. The violation of Binding mechanism for reporting the violation. The violation of Binding
rules is an Upper Layer Protocol issue, but it is likely that there rules is an Upper Layer Protocol issue, but it is likely that there
is nothing the Upper Layer can do but reply with the equivalent of is nothing the Upper Layer can do but reply with the equivalent of
BAD XDR. BAD XDR.
When an erroneously-constructed reply reaches a requester, there is When an erroneously-constructed reply reaches a requester, there is
no recourse but to drop the reply, and perhaps the transport no recourse but to drop the reply, and perhaps the transport
connection as well. connection as well.
6.3.1. Recommendations 6.4.1. Recommendations
Policing DDP-eligibility must be done in co-operation with the Upper Policing DDP-eligibility must be done in co-operation with the Upper
Layer Protocol by its receive endpoint implementation. Layer Protocol by its receive endpoint implementation.
It is the Upper Layer Binding's responsibility to specify how a It is the Upper Layer Binding's responsibility to specify how a
responder must reply if a requester violates a DDP-eligibility rule. responder must reply if a requester violates a DDP-eligibility rule.
The Binding specification should provide similar guidance for The Binding specification should provide similar guidance for
requesters about handling invalid RPC-over-RDMA replies. requesters about handling invalid RPC-over-RDMA replies.
6.4. Binding Specification Completion Assessment 6.5. Binding Specification Completion Assessment
RFC 5666 Section 3.4 states: RFC 5666 Section 3.4 states:
Typically, only those opaque and aggregate data types that may Typically, only those opaque and aggregate data types that may
attain substantial size are considered to be eligible. However, attain substantial size are considered to be eligible. However,
any object MAY be chosen for chunking in any given message. any object MAY be chosen for chunking in any given message.
Chunk eligibility criteria MUST be determined by each upper-layer Chunk eligibility criteria MUST be determined by each upper-layer
in order to provide for an interoperable specification. in order to provide for an interoperable specification.
skipping to change at page 37, line 8 skipping to change at page 37, line 43
data type in the Upper Layer's XDR definition, in particular compound data type in the Upper Layer's XDR definition, in particular compound
types such as arrays and lists, when restricting what XDR objects are types such as arrays and lists, when restricting what XDR objects are
eligible for Direct Data Placement. eligible for Direct Data Placement.
In addition, there are requirements related to using NFS with RPC- In addition, there are requirements related to using NFS with RPC-
over-RDMA in [RFC5667], and there are some in [RFC5661]. It could be over-RDMA in [RFC5667], and there are some in [RFC5661]. It could be
helpful to have guidance about what kind of requirements belong in an helpful to have guidance about what kind of requirements belong in an
Upper Layer Binding specification versus what belong in the Upper Upper Layer Binding specification versus what belong in the Upper
Layer Protocol specification. Layer Protocol specification.
6.4.1. Recommendations 6.5.1. Recommendations
RFC 5666bis should describe what makes a Binding specification RFC 5666bis should describe what makes a Binding specification
complete (i.e., ready for publication). complete (i.e., ready for publication).
7. Removal of Unimplemented Protocol Features 7. Unimplemented Protocol Features
7.1. Read-Read Transfer Model There are features of RPC-over-RDMA Version One that remain
unimplemented in current implementations. Some are candidates to be
removed from the protocol because they have proven unnecessary or
were not properly specified.
Other features are unimplemented, unspecified, or have only one
implementation (thus interoperability remains unproven). These are
candidates to be retained and properly specified.
7.1. Unimplemented Features To Be Removed
7.1.1. Connection Configuration Protocol
No implementation has seen fit to support the Connection
Configuration Protocol. While a need to exchange pertinent
connection information remains, the preference is to exchange that
information as part of the set up of each connection, rather than as
settings that apply to all connections (and thus all ULPs) between
two peers.
7.1.1.1. Recommendations
CCP should be removed from RFC 5666bis.
7.1.2. Read-Read Transfer Model
All existing RPC-over-RDMA Version One implementations use a Read- All existing RPC-over-RDMA Version One implementations use a Read-
Write data transfer model. The server endpoint is responsible for Write data transfer model. The server endpoint is responsible for
initiating all RDMA data transfers. The Read-Read transfer model has initiating all RDMA data transfers. The Read-Read transfer model has
been deprecated, but because it appears in RFC 5666, implementations been deprecated, but because it appears in RFC 5666, implementations
are still responsible for supporting it. By removing the are still responsible for supporting it. By removing the
specification and discussion of Read-Read, the protocol and specification and discussion of Read-Read, the protocol and
specification can be made simpler and more clear. specification can be made simpler and more clear.
7.1.1. Recommendations 7.1.2.1. Recommendations
Remove Read-Read from RFC 5666bis, in particular from its equivalent
of RFC 5666 Section 3.8. RFC 5666bis should require implementations
not to send RDMA_DONE; an implementation receiving it should ignore
it. The XDR definition should reserve RDMA_DONE.
7.1.3. RDMA_MSGP
It has been observed that the current specification of RDMA_MSGP is
not clear enough to result in interoperable implementations.
Possibly as a result, current receive endpoints do recognize and
process RDMA_MSGP messages, though they do not take advantage of the
passed alignment parameters. Receivers treat RDMA_MSGP messages like
RDMA_MSG messages.
Currently senders do not use RDMA_MSGP messages. RDMA_MSGP depends
on bulk payload occurring at the end of RPC messages, which is often
not true of NFSv4 COMPOUND requests. Most NFSv3 requests are small
enough not to need RDMA_MSGP.
To be effective, RDMA_MSGP depends on getting alignment preferences
in advance via CCP. There are no CCP implementations to date.
Without CCP, there is no way for peers to discover a receiver
endpoint's preferred alignment parameters, unless the implementation
provides an administrative interface for specifying a remote's
alignment parameters. RDMA_MSGP is useless without that knowledge.
7.1.3.1. Recommendations
To maintain backward-compatibility, RDMA_MSGP must remain in the
protocol. RFC 5666bis should require implementations to not send
RDMA_MSGP messages. If an RDMA_MSGP message is seen by a receiver,
it should ignore the alignment parameters and treat RDMA_MSGP
messages as RDMA_MSG messages. The XDR definition should reserve
RDMA_MSGP.
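The receive-side behavior recommended for RDMA_DONE and RDMA_MSGP
can be sketched as a small dispatch routine. This is an
illustration only, not code from any existing implementation. The
rdma_proc values are those of the RFC 5666 XDR definition; the
rx_action type and the classify_rdma_proc function are hypothetical
names introduced here, and rejecting unknown values is an assumption
rather than a stated requirement.

```c
#include <stdint.h>

/* Procedure values from the RFC 5666 XDR definition. */
enum rdma_proc {
    RDMA_MSG   = 0,
    RDMA_NOMSG = 1,
    RDMA_MSGP  = 2,
    RDMA_DONE  = 3,
    RDMA_ERROR = 4
};

/* Hypothetical receiver actions; not part of any specification. */
enum rx_action {
    RX_PROCESS_INLINE,  /* RPC message follows the transport header */
    RX_PROCESS_NOMSG,   /* RPC message is conveyed in chunks */
    RX_IGNORE,          /* drop the message without replying */
    RX_HANDLE_ERROR,    /* peer reported a transport error */
    RX_REJECT           /* unrecognized type (assumed policy) */
};

/* Map the rdma_proc field of a received header to an action. */
enum rx_action classify_rdma_proc(uint32_t proc)
{
    switch (proc) {
    case RDMA_MSG:
        return RX_PROCESS_INLINE;
    case RDMA_MSGP:
        /* Treated like RDMA_MSG: alignment parameters are ignored. */
        return RX_PROCESS_INLINE;
    case RDMA_NOMSG:
        return RX_PROCESS_NOMSG;
    case RDMA_DONE:
        /* Read-Read is removed, so RDMA_DONE carries no meaning. */
        return RX_IGNORE;
    case RDMA_ERROR:
        return RX_HANDLE_ERROR;
    default:
        return RX_REJECT;
    }
}
```

Even when the alignment parameters are ignored, a receiver parsing
an RDMA_MSGP header still has to step over the rdma_align and
rdma_thresh fields before it reaches the chunk lists.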
7.2. Unimplemented Features To Be Retained
7.2.1. RDMA_ERROR Type Messages
Server implementations the author is familiar with can send
RDMA_ERROR type messages, but only when an RPC-over-RDMA version
mismatch occurs. There is no facility to return the ERR_CHUNK error.
These implementations treat messages with unrecognized types, and
messages with other parsing errors, as RDMA_MSG type messages. This
behavior does not comply with RFC 5666, nor is it an improvement
over the specification.
7.2.1.1. Recommendations
RFC 5666bis should provide stronger guidance for error checking, and
in particular, when a connection must be broken.
Implementations that do not adequately check incoming RPC-over-RDMA
headers must be updated.
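As an illustration of the kind of checking meant here, the sketch
below validates the four fixed 32-bit words that begin every
RPC-over-RDMA Version One header (XID, version, credit value, and
procedure). The hdr_check type and the get_be32 and
check_rpcrdma_header functions are hypothetical names. How each
failure maps to an RDMA_ERROR reply or to breaking the connection is
exactly the guidance RFC 5666bis would need to supply; nothing
beyond the ERR_VERS case is implied by existing implementations.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for header validation. */
enum hdr_check {
    HDR_OK,
    HDR_TOO_SHORT,    /* fewer than four XDR words received */
    HDR_BAD_VERSION,  /* candidate for RDMA_ERROR with ERR_VERS */
    HDR_BAD_PROC      /* rdma_proc outside the defined range */
};

/* Decode one big-endian (XDR) 32-bit word. */
static uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Check the fixed part of a received transport header:
 * word 0 = XID, word 1 = version, word 2 = credits, word 3 = proc.
 * Chunk lists are not examined in this sketch.
 */
enum hdr_check check_rpcrdma_header(const unsigned char *buf, size_t len)
{
    uint32_t vers, proc;

    if (len < 16)
        return HDR_TOO_SHORT;

    vers = get_be32(buf + 4);
    proc = get_be32(buf + 12);

    if (vers != 1)
        return HDR_BAD_VERSION;
    if (proc > 4)           /* RDMA_ERROR (4) is the last defined value */
        return HDR_BAD_PROC;
    return HDR_OK;
}
```

A complete check would also walk the Read list, Write list, and
Reply chunk and confirm that each stays within the received buffer.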
7.2.2. RPCSEC_GSS On RPC-over-RDMA
The second paragraph of RFC 5666 Section 11 says:
For efficiency, a more appropriate security mechanism for RDMA
links may be link-level protection, such as certain configurations
of IPsec, which may be co-located in the RDMA hardware. The use
of link-level protection MAY be negotiated through the use of the
new RPCSEC_GSS mechanism defined in [RFC5403] in conjunction with
the Channel Binding mechanism [RFC5056] and IPsec Channel
Connection Latching [RFC5660]. Use of such mechanisms is REQUIRED
where integrity and/or privacy is desired, and where efficiency is
required.
However, consider:
o As of this writing, no implementation of RPCSEC_GSS v2 Channel
Binding or Connection Latching exists. Thus, though it is
sensible, this part of RFC 5666 has never been implemented.
o Not all fabrics and RNICs support a link-layer protection
mechanism that includes a privacy service.
o When multiple users access a storage service from the same client,
it is appropriate to deploy a message authentication service
concurrently with link-layer protection.
Therefore, despite its performance impact, RPCSEC_GSS can add
important function to RPC-over-RDMA deployments.
Currently there is an InfiniBand-only client and server
implementation of RPCSEC_GSS on RPC-over-RDMA that supports the
authentication, integrity, and privacy services. This pair of
implementations was created without the benefit of normative guidance
from RFC 5666. This client and server pair interoperates with each
other, but there are no independent implementations to test with.
RPC-over-RDMA requesters are responsible for providing adequate reply
resources to responders. These resources require special treatment
when an integrity or privacy service is in use. Direct data
placement cannot be used with software integrity checking or
encryption. Thus standards guidance is imperative to ensure that
independent RPCSEC_GSS implementations can interoperate on RPC-over-
RDMA transports.
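The constraint on direct data placement can be stated as a simple
predicate. The sketch below is illustrative and does not describe
the existing InfiniBand implementation. The service values are
those defined for RPCSEC_GSS in RFC 2203; the enum tag and the
ddp_allowed function are hypothetical names.

```c
#include <stdbool.h>

/* RPCSEC_GSS service values, as defined in RFC 2203. */
enum rpc_gss_service {
    rpc_gss_svc_none      = 1,  /* per-message authentication only */
    rpc_gss_svc_integrity = 2,  /* payload covered by a checksum */
    rpc_gss_svc_privacy   = 3   /* payload encrypted */
};

/*
 * Direct data placement delivers payload straight into its final
 * buffer. With integrity or privacy, software must first verify or
 * decrypt the whole RPC message, so the payload cannot be placed
 * directly; only the authentication-only service permits it.
 */
bool ddp_allowed(enum rpc_gss_service svc)
{
    return svc == rpc_gss_svc_none;
}
```

A requester could consult such a predicate when choosing which reply
resources to provide: resources that allow direct placement of the
payload when it returns true, and a buffer large enough to hold the
entire wrapped reply otherwise.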
7.2.2.1. Recommendations
RFC 5666bis should continue to require the use of link layer
protection when facilities are available to support it.
At the least, RPCSEC_GSS per-message authentication is valuable, even
if link layer protection is in use. Integrity and privacy should
also be made available even if they do not perform well, because
there is no link layer protection for some fabrics.
Therefore, RFC 5666bis should provide a specification for RPCSEC_GSS
on RPC-over-RDMA, codifying the one existing implementation so that
others may interoperate with it.
8. Security Considerations
To enable RDMA Read and Write operations, an RPC-over-RDMA Version
One requester exposes some or all of its memory to other hosts. RFC
5666bis should suggest best implementation practices to minimize
exposure to careless or potentially malicious implementations that
share the same fabric. Important considerations include:
o The use of Protection Domains to limit the exposure of memory
regions to a single connection is critical. Any attempt by a host
skipping to change at page 46, line 32
Binding specification should provide guidance for requesters
about handling invalid RPC-over-RDMA replies.
12. Acknowledgements
The author gratefully acknowledges the contributions of Dai Ngo,
Karen Deitke, Chunli Zhang, Mahesh Siddheshwar, Dominique Martinet,
and William Simpson.
The author also wishes to thank Dave Noveck and Bill Baker for their
support of this work. Special thanks go to nfsv4 Working Group Chair
Spencer Shepler and nfsv4 Working Group Secretary Tom Haynes for
their support.
13. References
13.1. Normative References
[RFC0793] Postel, J., "Transmission Control Protocol", STD 7, RFC
793, DOI 10.17487/RFC0793, September 1981,
<http://www.rfc-editor.org/info/rfc793>.
[RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
skipping to change at page 47, line 19
[RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
Garcia, "A Remote Direct Memory Access Protocol
Specification", RFC 5040, DOI 10.17487/RFC5040, October
2007, <http://www.rfc-editor.org/info/rfc5040>.
[RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
Data Placement over Reliable Transports", RFC 5041, DOI
10.17487/RFC5041, October 2007,
<http://www.rfc-editor.org/info/rfc5041>.
[RFC5056] Williams, N., "On the Use of Channel Bindings to Secure
Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007,
<http://www.rfc-editor.org/info/rfc5056>.
[RFC5403] Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, DOI
10.17487/RFC5403, February 2009,
<http://www.rfc-editor.org/info/rfc5403>.
[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol
Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
May 2009, <http://www.rfc-editor.org/info/rfc5531>.
[RFC5660] Williams, N., "IPsec Channels: Connection Latching", RFC
5660, DOI 10.17487/RFC5660, October 2009,
<http://www.rfc-editor.org/info/rfc5660>.
[RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
"Network File System (NFS) Version 4 Minor Version 1
Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010,
<http://www.rfc-editor.org/info/rfc5661>.
[RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access
Transport for Remote Procedure Call", RFC 5666, DOI
10.17487/RFC5666, January 2010,
<http://www.rfc-editor.org/info/rfc5666>.