draft-ietf-nfsv4-rpcrdma-05.txt   draft-ietf-nfsv4-rpcrdma-06.txt 
NFSv4 Working Group Tom Talpey NFSv4 Working Group Tom Talpey
Internet-Draft Network Appliance, Inc. Internet-Draft Network Appliance, Inc.
Intended status: Standards Track Brent Callaghan Intended status: Standards Track Brent Callaghan
Expires: November 8, 2007 Apple Computer, Inc. Expires: January 1, 2008 Apple Computer, Inc.
May 7, 2007 July 1, 2007
RDMA Transport for ONC RPC RDMA Transport for ONC RPC
draft-ietf-nfsv4-rpcrdma-05 draft-ietf-nfsv4-rpcrdma-06
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 36 skipping to change at page 1, line 36
documents at any time. It is inappropriate to use Internet-Drafts documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in as reference material or to cite them other than as "work in
progress." progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on November 8, 2007. This Internet-Draft will expire on January 1, 2008.
Copyright Notice Copyright Notice
Copyright (C) The IETF Trust (2007). Copyright (C) The IETF Trust (2007).
Abstract Abstract
A protocol is described providing RDMA as a new transport for ONC A protocol is described providing RDMA as a new transport for ONC
RPC. The RDMA transport binding conveys the benefits of efficient, RPC. The RDMA transport binding conveys the benefits of efficient,
bulk data transport over high speed networks, while providing for bulk data transport over high speed networks, while providing for
minimal change to RPC applications and with no required revision of minimal change to RPC applications and with no required revision of
skipping to change at page 2, line 22 skipping to change at page 2, line 22
3.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . . 6 3.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . . 6
3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 7 3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 7
3.5. XDR Decoding with Read Chunks . . . . . . . . . . . . . 11 3.5. XDR Decoding with Read Chunks . . . . . . . . . . . . . 11
3.6. XDR Decoding with Write Chunks . . . . . . . . . . . . . 11 3.6. XDR Decoding with Write Chunks . . . . . . . . . . . . . 11
3.7. XDR Roundup and Chunks . . . . . . . . . . . . . . . . . 12 3.7. XDR Roundup and Chunks . . . . . . . . . . . . . . . . . 12
3.8. RPC Call and Reply . . . . . . . . . . . . . . . . . . . 13 3.8. RPC Call and Reply . . . . . . . . . . . . . . . . . . . 13
3.9. Padding . . . . . . . . . . . . . . . . . . . . . . . . 16 3.9. Padding . . . . . . . . . . . . . . . . . . . . . . . . 16
4. RPC RDMA Message Layout . . . . . . . . . . . . . . . . . 17 4. RPC RDMA Message Layout . . . . . . . . . . . . . . . . . 17
4.1. RPC over RDMA Header . . . . . . . . . . . . . . . . . . 17 4.1. RPC over RDMA Header . . . . . . . . . . . . . . . . . . 17
4.2. RPC over RDMA header errors . . . . . . . . . . . . . . 19 4.2. RPC over RDMA header errors . . . . . . . . . . . . . . 19
4.3. XDR Language Description . . . . . . . . . . . . . . . . 19 4.3. XDR Language Description . . . . . . . . . . . . . . . . 20
5. Long Messages . . . . . . . . . . . . . . . . . . . . . . 22 5. Long Messages . . . . . . . . . . . . . . . . . . . . . . 22
5.1. Message as an RDMA Read Chunk . . . . . . . . . . . . . 22 5.1. Message as an RDMA Read Chunk . . . . . . . . . . . . . 22
5.2. RDMA Write of Long Replies (Reply Chunks) . . . . . . . 24 5.2. RDMA Write of Long Replies (Reply Chunks) . . . . . . . 24
6. Connection Configuration Protocol . . . . . . . . . . . . 25 6. Connection Configuration Protocol . . . . . . . . . . . . 25
6.1. Initial Connection State . . . . . . . . . . . . . . . . 26 6.1. Initial Connection State . . . . . . . . . . . . . . . . 26
6.2. Protocol Description . . . . . . . . . . . . . . . . . . 26 6.2. Protocol Description . . . . . . . . . . . . . . . . . . 26
7. Memory Registration Overhead . . . . . . . . . . . . . . . 28 7. Memory Registration Overhead . . . . . . . . . . . . . . . 28
8. Errors and Error Recovery . . . . . . . . . . . . . . . . 28 8. Errors and Error Recovery . . . . . . . . . . . . . . . . 28
9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 28 9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 28
10. RPC Binding . . . . . . . . . . . . . . . . . . . . . . . 29 10. RPC Binding . . . . . . . . . . . . . . . . . . . . . . . 29
skipping to change at page 3, line 8 skipping to change at page 3, line 8
1. Introduction 1. Introduction
RDMA is a technique for efficient movement of data between end RDMA is a technique for efficient movement of data between end
nodes, which becomes increasingly compelling over high speed nodes, which becomes increasingly compelling over high speed
transports. By directing data into destination buffers as it is transports. By directing data into destination buffers as it is
sent on a network, and placing it via direct memory access by sent on a network, and placing it via direct memory access by
hardware, the double benefit of faster transfers and reduced host hardware, the double benefit of faster transfers and reduced host
overhead is obtained. overhead is obtained.
ONC RPC [RFC1831] is a remote procedure call protocol that has been ONC RPC [RFC1831bis] is a remote procedure call protocol that has
run over a variety of transports. Most RPC implementations today been run over a variety of transports. Most RPC implementations
use UDP or TCP. RPC messages are defined in terms of an eXternal today use UDP or TCP. RPC messages are defined in terms of an
Data Representation (XDR) [RFC4506] which provides a canonical data eXternal Data Representation (XDR) [RFC4506] which provides a
representation across a variety of host architectures. An XDR data canonical data representation across a variety of host
stream is conveyed differently on each type of transport. On UDP, architectures. An XDR data stream is conveyed differently on each
RPC messages are encapsulated inside datagrams, while on a TCP byte type of transport. On UDP, RPC messages are encapsulated inside
stream, RPC messages are delineated by a record marking protocol. datagrams, while on a TCP byte stream, RPC messages are delineated
An RDMA transport also conveys RPC messages in a unique fashion by a record marking protocol. An RDMA transport also conveys RPC
that must be fully described if client and server implementations messages in a unique fashion that must be fully described if client
are to interoperate. and server implementations are to interoperate.
RDMA transports present new semantics unlike the behaviors of RDMA transports present new semantics unlike the behaviors of
either UDP or TCP alone. They retain message delineations like either UDP or TCP alone. They retain message delineations like
UDP while also providing a reliable, sequenced data transfer like UDP while also providing a reliable, sequenced data transfer like
TCP. And, they provide the new efficient, bulk transfer service of TCP. And, they provide the new efficient, bulk transfer service of
RDMA. RDMA transports are therefore naturally viewed as a new RDMA. RDMA transports are therefore naturally viewed as a new
transport type by ONC RPC. transport type by ONC RPC.
RDMA as a transport will benefit the performance of RPC protocols RDMA as a transport will benefit the performance of RPC protocols
that move large "chunks" of data, since RDMA hardware excels at that move large "chunks" of data, since RDMA hardware excels at
skipping to change at page 4, line 7 skipping to change at page 4, line 7
sender to a receiver. An RPC message is either an RPC call from a sender to a receiver. An RPC message is either an RPC call from a
client to a server, or an RPC reply from the server back to the client to a server, or an RPC reply from the server back to the
client. An RPC message contains an RPC call header followed by client. An RPC message contains an RPC call header followed by
arguments if the message is an RPC call, or an RPC reply header arguments if the message is an RPC call, or an RPC reply header
followed by results if the message is an RPC reply. The call followed by results if the message is an RPC reply. The call
header contains a transaction ID (XID) followed by the program and header contains a transaction ID (XID) followed by the program and
procedure number as well as a security credential. An RPC reply procedure number as well as a security credential. An RPC reply
header begins with an XID that matches that of the RPC call header begins with an XID that matches that of the RPC call
message, followed by a security verifier and results. All data in message, followed by a security verifier and results. All data in
an RPC message is XDR encoded. For a complete description of the an RPC message is XDR encoded. For a complete description of the
RPC protocol and XDR encoding, see [RFC1831] and [RFC4506]. RPC protocol and XDR encoding, see [RFC1831bis] and [RFC4506].
This protocol assumes the following abstract model for RDMA This protocol assumes the following abstract model for RDMA
transports. These terms, common in the RDMA lexicon, are used in transports. These terms, common in the RDMA lexicon, are used in
this document. A more complete glossary of RDMA terms can be found this document. A more complete glossary of RDMA terms can be found
in [RDMAP]. in [RDMAP].
o Registered Memory o Registered Memory
All data moved via tagged RDMA operations is resident in All data moved via tagged RDMA operations is resident in
registered memory at its destination. This protocol assumes registered memory at its destination. This protocol assumes
that each segment of registered memory MUST be identified with that each segment of registered memory MUST be identified with
skipping to change at page 7, line 17 skipping to change at page 7, line 17
example, NFSv4 callbacks are processed by the NFSv4 client, acting example, NFSv4 callbacks are processed by the NFSv4 client, acting
as an RPC server. The credit discussions apply equally in either as an RPC server. The credit discussions apply equally in either
case.) case.)
Flow control for RDMA Send operations is implemented as a simple Flow control for RDMA Send operations is implemented as a simple
request/grant protocol in the RPC over RDMA header associated with request/grant protocol in the RPC over RDMA header associated with
each RPC message. The RPC over RDMA header for RPC call messages each RPC message. The RPC over RDMA header for RPC call messages
contains a requested credit value for the RPC server, which MAY be contains a requested credit value for the RPC server, which MAY be
dynamically adjusted by the caller to match its expected needs. dynamically adjusted by the caller to match its expected needs.
The RPC over RDMA header for the RPC reply messages provides the The RPC over RDMA header for the RPC reply messages provides the
granted result, which MAY have any value except it MAY NOT be zero granted result, which MAY have any value except it MUST NOT be zero
when no in-progress operations are present at the server, since when no in-progress operations are present at the server, since
such a value would result in deadlock. The value MAY be adjusted such a value would result in deadlock. The value MAY be adjusted
up or down at each opportunity to match the server's needs or up or down at each opportunity to match the server's needs or
policies. policies.
The RPC client MUST NOT send unacknowledged requests in excess of The RPC client MUST NOT send unacknowledged requests in excess of
this granted RPC server credit limit. If the limit is exceeded, this granted RPC server credit limit. If the limit is exceeded,
the RDMA layer may signal an error, possibly terminating the the RDMA layer may signal an error, possibly terminating the
connection. Even if an error does not occur, it is NOT REQUIRED connection. Even if an error does not occur, it is OPTIONAL that
that the server handle the excess request(s), and it MAY return an the server handle the excess request(s), and it MAY return an RPC
RPC error to the client. Also note that the never-zero requirement error to the client. Also note that the never-zero requirement
implies that an RPC server MUST always provide at least one credit implies that an RPC server MUST always provide at least one credit
to each connected RPC client. It does however NOT REQUIRE that the to each connected RPC client. It is however OPTIONAL that the
server always be prepared to receive a request from each client, server always be prepared to receive a request from each client,
for example when the server is busy processing all granted client for example when the server is busy processing all granted client
requests. requests.
While RPC calls complete in any order, the current flow control While RPC calls complete in any order, the current flow control
limit at the RPC server is known to the RPC client from the Send limit at the RPC server is known to the RPC client from the Send
ordering properties. It is always the most recent server-granted ordering properties. It is always the most recent server-granted
credit value minus the number of requests in flight. credit value minus the number of requests in flight.
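The credit rule above can be illustrated with a minimal client-side sketch. The class name CreditTracker, its method names, and the initial grant of one credit are assumptions made for illustration only; none of them appear in the draft.

```python
class CreditTracker:
    """Client-side credit accounting (illustrative, not normative)."""

    def __init__(self, initial_grant=1):
        # Assumption: the connection starts with at least one credit.
        self.granted = initial_grant
        self.in_flight = 0

    def available(self):
        # Most recent server-granted value minus requests in flight.
        return max(self.granted - self.in_flight, 0)

    def on_send(self):
        # The client MUST NOT exceed the granted limit.
        if self.available() == 0:
            raise RuntimeError("no credits available; request must wait")
        self.in_flight += 1

    def on_reply(self, granted):
        # Every reply retires one request and carries a fresh grant.
        self.in_flight -= 1
        self.granted = granted
```

A caller would invoke on_send() before posting each RDMA Send and on_reply() with the credit value taken from each reply header. The grant may move up or down, so available() is recomputed from the latest value rather than accumulated.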
Certain RDMA implementations may impose additional flow control Certain RDMA implementations may impose additional flow control
skipping to change at page 9, line 16 skipping to change at page 9, line 16
separate "read chunk list" encoded within RPC RDMA transport- separate "read chunk list" encoded within RPC RDMA transport-
specific headers. Such chunks will be transferred via RDMA Read specific headers. Such chunks will be transferred via RDMA Read
operations initiated by the receiver. operations initiated by the receiver.
When the read chunks are to be moved via RDMA, the memory for each When the read chunks are to be moved via RDMA, the memory for each
chunk is registered. This registration may take place within XDR chunk is registered. This registration may take place within XDR
itself, providing for full transparency to upper layers, or it may itself, providing for full transparency to upper layers, or it may
be performed by any other specific local implementation. be performed by any other specific local implementation.
Additionally, when making an RPC call that can result in bulk data Additionally, when making an RPC call that can result in bulk data
transferred in the reply, it is desirable to provide chunks to transferred in the reply, write chunks MAY be provided to accept
accept the data directly via RDMA Write. These write chunks will the data directly via RDMA Write. These write chunks will
therefore be pre-filled by the RPC server prior to responding, and therefore be pre-filled by the RPC server prior to responding, and
XDR decode of the data at the client will not be required. These XDR decode of the data at the client will not be required. These
chunks undergo a similar registration and advertisement via "write chunks undergo a similar registration and advertisement via "write
chunk lists" built as a part of XDR encoding. chunk lists" built as a part of XDR encoding.
Some RPC client implementations are not able to determine where an Some RPC client implementations are not able to determine where an
RPC call's results reside during the "encode" phase. This makes it RPC call's results reside during the "encode" phase. This makes it
difficult or impossible for the RPC client layer to encode the difficult or impossible for the RPC client layer to encode the
write chunk list at the time of building the request. In this write chunk list at the time of building the request. In this
case, it is difficult for the RPC implementation to provide case, it is difficult for the RPC implementation to provide
transparency to the RPC consumer, which may require recoding to transparency to the RPC consumer, which may require recoding to
provide result information at this earlier stage. provide result information at this earlier stage.
Therefore if the RPC client does not make a write chunk list Therefore if the RPC client does not make a write chunk list
available to receive the result, then the RPC server MAY return available to receive the result, then the RPC server MAY return
data inline in the reply, or if the upper layer specification data inline in the reply, or if the upper layer specification
permits, it MAY be returned via a read chunk list. It is NOT permits, it MAY be returned via a read chunk list. It is NOT
RECOMMENDED that upper layer RPC client protocol specifcations omit RECOMMENDED that upper layer RPC client protocol specifications
write chunk lists for eligible replies, due to the lower omit write chunk lists for eligible replies, due to the lower
performance of the additional handshaking to perform data transfer, performance of the additional handshaking to perform data transfer,
and the requirement that the RPC server must expose (and preserve) and the requirement that the RPC server must expose (and preserve)
the reply data for a period of time. In the absence of a server- the reply data for a period of time. In the absence of a server-
provided read chunk list in the reply, if the encoded reply provided read chunk list in the reply, if the encoded reply
overflows the posted receive buffer, the RPC will fail with an RDMA overflows the posted receive buffer, the RPC will fail with an RDMA
transport error. transport error.
When any data within a message is provided via either read or write When any data within a message is provided via either read or write
chunks, the chunk itself refers only to the data portion of the XDR chunks, the chunk itself refers only to the data portion of the XDR
stream element. In particular, for counted fields (e.g. a "<>" stream element. In particular, for counted fields (e.g., a "<>"
encoding) the byte count which is encoded as part of the field encoding) the byte count which is encoded as part of the field
remains in the XDR stream, and is also encoded in the chunk list. remains in the XDR stream, and is also encoded in the chunk list.
The data portion is however elided from the encoded XDR stream, and The data portion is however elided from the encoded XDR stream, and
is transferred as part of chunk list processing. This is important is transferred as part of chunk list processing. This is important
to maintain upper layer implementation compatibility - both the to maintain upper layer implementation compatibility - both the
count and the data must be transferred as part of the logical XDR count and the data must be transferred as part of the logical XDR
stream. While the chunk list processing results in the data being stream. While the chunk list processing results in the data being
available to the upper layer peer for XDR decoding, the length available to the upper layer peer for XDR decoding, the length
present in the chunk list entries is not. Any byte count in the present in the chunk list entries is not. Any byte count in the
XDR stream MUST match the sum of the byte counts present in the XDR stream MUST match the sum of the byte counts present in the
skipping to change at page 11, line 32 skipping to change at page 11, line 32
Therefore, read chunks are encoded into a read chunk list as a Therefore, read chunks are encoded into a read chunk list as a
single array, with each entry tagged by its (known) size and its single array, with each entry tagged by its (known) size and its
argument's or result's position in the XDR stream. Write chunks argument's or result's position in the XDR stream. Write chunks
are encoded as a list of arrays of RDMA buffers, with each list are encoded as a list of arrays of RDMA buffers, with each list
element (an array) providing buffers for a separate result. element (an array) providing buffers for a separate result.
Individual write chunk list elements MAY thereby result in being Individual write chunk list elements MAY thereby result in being
partially or fully filled, or in fact not being filled at all. partially or fully filled, or in fact not being filled at all.
Unused write chunks, or unused bytes in write chunk buffer lists, Unused write chunks, or unused bytes in write chunk buffer lists,
are not returned as results, and their memory is returned to the are not returned as results, and their memory is returned to the
upper layer as part of RPC completion. However, the RPC layer upper layer as part of RPC completion. However, the RPC layer MUST
SHOULD NOT assume that the buffers have not been modified. NOT assume that the buffers have not been modified.
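The difference in shape between the two lists can be sketched as follows. The type and function names are hypothetical, and the treatment of several read chunk entries sharing one XDR position as a single datum is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    handle: int   # registered memory handle
    length: int   # length in bytes
    offset: int   # virtual address or offset


@dataclass
class ReadChunk:
    position: int     # XDR stream position of the argument or result
    target: Segment


# Read chunks: one flat array, each entry tagged with its position.
ReadList = List[ReadChunk]

# Write chunks: a list of arrays, one array of buffers per result.
WriteList = List[List[Segment]]


def read_bytes_at(read_list: ReadList, position: int) -> int:
    """Total bytes advertised by read chunk entries at one XDR position."""
    return sum(c.target.length for c in read_list if c.position == position)


def write_capacity(write_chunk: List[Segment]) -> int:
    """Bytes available in one write chunk; it may be only partly filled."""
    return sum(seg.length for seg in write_chunk)
```

Because a write chunk may be filled partially or not at all, write_capacity() gives only an upper bound on the data returned for that result.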
3.5. XDR Decoding with Read Chunks 3.5. XDR Decoding with Read Chunks
The XDR decode process moves data from an XDR stream into a data The XDR decode process moves data from an XDR stream into a data
structure provided by the RPC client or server application. Where structure provided by the RPC client or server application. Where
elements of the destination data structure are buffers or strings, elements of the destination data structure are buffers or strings,
the RPC application can either pre-allocate storage to receive the the RPC application can either pre-allocate storage to receive the
data, or leave the string or buffer fields null and allow the XDR data, or leave the string or buffer fields null and allow the XDR
decode stage of RPC processing to automatically allocate storage of decode stage of RPC processing to automatically allocate storage of
sufficient size. sufficient size.
skipping to change at page 13, line 42 skipping to change at page 13, line 42
Some protocol operations over RPC/RDMA, for instance NFS writes of Some protocol operations over RPC/RDMA, for instance NFS writes of
data encountered at the end of a file or in direct i/o situations, data encountered at the end of a file or in direct i/o situations,
commonly yield these roundups within RDMA Read Chunks. Because any commonly yield these roundups within RDMA Read Chunks. Because any
roundup bytes are not actually present in the data buffers being roundup bytes are not actually present in the data buffers being
written, memory for these bytes would come from noncontiguous written, memory for these bytes would come from noncontiguous
buffers, either as an additional memory registration segment, or as buffers, either as an additional memory registration segment, or as
an additional Chunk. The overhead of these operations can be an additional Chunk. The overhead of these operations can be
significant to both the sender to marshal them, and even higher to significant to both the sender to marshal them, and even higher to
the receiver to transfer them. Senders SHOULD therefore the receiver to transfer them. Senders SHOULD therefore
avoid encoding indivudual RDMA Read Chunks for roundup whenever avoid encoding individual RDMA Read Chunks for roundup whenever
possible. It is acceptable, but not necessary, to include roundup possible. It is acceptable, but not necessary, to include roundup
data in an existing RDMA Read Chunk, but only if it is already data in an existing RDMA Read Chunk, but only if it is already
present in the XDR stream to carry upper layer data. present in the XDR stream to carry upper layer data.
Note that there is no exposure of additional data at the sender due Note that there is no exposure of additional data at the sender due
to eliding roundup data from the XDR stream, since any additional to eliding roundup data from the XDR stream, since any additional
sender buffers are never exposed to the peer. The data is sender buffers are never exposed to the peer. The data is
literally not there to be transferred. literally not there to be transferred.
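As a small worked illustration of the roundup in question: XDR pads variable-length data to a four-byte boundary, so the number of roundup bytes follows directly from the data length. The function names below are hypothetical.

```python
def xdr_roundup(length):
    """Pad bytes XDR appends so the item ends on a 4-byte boundary."""
    return (4 - length % 4) % 4


def read_chunk_length(data_length):
    """Length a sender would advertise when roundup is elided.

    Only the data bytes are covered; no separate chunk or extra
    registration segment is created for the pad bytes.
    """
    return data_length
```

For example, a 1001-byte write at the end of a file has three roundup bytes, and a sender following the recommendation above would advertise a 1001-byte read chunk rather than adding a second chunk for the padding.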
For RDMA Write Chunks, a simpler encoding method applies. Again, For RDMA Write Chunks, a simpler encoding method applies. Again,
skipping to change at page 18, line 33 skipping to change at page 18, line 33
the XDR decode MUST skip over the appropriate padding as indicated the XDR decode MUST skip over the appropriate padding as indicated
by rdma_align and the current XDR stream position. by rdma_align and the current XDR stream position.
4. RPC RDMA Message Layout 4. RPC RDMA Message Layout
RPC call and reply messages are conveyed across an RDMA transport RPC call and reply messages are conveyed across an RDMA transport
with a prepended RPC over RDMA header. The RPC over RDMA header with a prepended RPC over RDMA header. The RPC over RDMA header
includes data for RDMA flow control credits, padding parameters and includes data for RDMA flow control credits, padding parameters and
lists of addresses that provide direct data placement via RDMA Read lists of addresses that provide direct data placement via RDMA Read
and Write operations. The layout of the RPC message itself is and Write operations. The layout of the RPC message itself is
unchanged from that described in [RFC1831] except for the possible unchanged from that described in [RFC1831bis] except for the
exclusion of large data chunks that will be moved by RDMA Read or possible exclusion of large data chunks that will be moved by RDMA
Write operations. If the RPC message (along with the RPC over RDMA Read or Write operations. If the RPC message (along with the RPC
header) is too long for the posted receive buffer (even after any over RDMA header) is too long for the posted receive buffer (even
large chunks are removed), then the entire RPC message MAY be moved after any large chunks are removed), then the entire RPC message
separately as a chunk, leaving just the RPC over RDMA header in the MAY be moved separately as a chunk, leaving just the RPC over RDMA
RDMA Send. header in the RDMA Send.
4.1. RPC over RDMA Header 4.1. RPC over RDMA Header
The RPC over RDMA header begins with four 32-bit fields that are The RPC over RDMA header begins with four 32-bit fields that are
always present and which control the RDMA interaction including always present and which control the RDMA interaction including
RDMA-specific flow control. These are then followed by a number of RDMA-specific flow control. These are then followed by a number of
items such as chunk lists and padding which MAY or MAY NOT be items such as chunk lists and padding which MAY or MUST NOT be
present depending on the type of transmission. The four fields present depending on the type of transmission. The four fields
which are always present are: which are always present are:
1. Transaction ID (XID). 1. Transaction ID (XID).
The XID generated for the RPC call and reply. Having the XID The XID generated for the RPC call and reply. Having the XID
at the beginning of the message makes it easy to establish the at the beginning of the message makes it easy to establish the
message context. This XID mirrors the XID in the RPC header, message context. This XID MUST be the same as the XID in the
and takes precedence. The receiver MAY ignore the XID in the RPC header. The receiver MAY perform its processing based
RPC header, if it so chooses. solely on the XID in the RPC over RDMA header, and thereby
ignore the XID in the RPC header, if it so chooses.
2. Version number. 2. Version number.
This version of the RPC RDMA message protocol is 1. The This version of the RPC RDMA message protocol is 1. The
version number MUST be increased by one whenever the format of version number MUST be increased by one whenever the format of
the RPC RDMA messages is changed. the RPC RDMA messages is changed.
3. Flow control credit value. 3. Flow control credit value.
When sent in an RPC call message, the requested value is When sent in an RPC call message, the requested value is
provided. When sent in an RPC reply message, the granted provided. When sent in an RPC reply message, the granted
value is returned. RPC calls SHOULD NOT be sent in excess of value is returned. RPC calls SHOULD NOT be sent in excess of
skipping to change at page 20, line 42 skipping to change at page 20, line 44
When a peer receives an RPC RDMA message, it MUST perform the When a peer receives an RPC RDMA message, it MUST perform the
following basic validity checks on the header and chunk contents. following basic validity checks on the header and chunk contents.
If such errors are detected in the request, an RDMA_ERROR reply If such errors are detected in the request, an RDMA_ERROR reply
MUST be generated. MUST be generated.
Two types of errors are defined, version mismatch and invalid chunk Two types of errors are defined, version mismatch and invalid chunk
format. When the peer detects an RPC over RDMA header version format. When the peer detects an RPC over RDMA header version
which it does not support (currently this draft defines only which it does not support (currently this draft defines only
version 1), it replies with an error code of ERR_VERS, and provides version 1), it replies with an error code of ERR_VERS, and provides
the low and high inclusive version numbers it does, in fact, the low and high inclusive version numbers it does, in fact,
support. The version number in this reply MAY be any value support. The version number in this reply MUST be any value
otherwise valid at the receiver. When other decoding errors are otherwise valid at the receiver. When other decoding errors are
detected in the header or chunks, either an RPC decode error MAY be detected in the header or chunks, either an RPC decode error MAY be
returned, or the RPC/RDMA error code ERR_CHUNK MUST be returned. returned, or the RPC/RDMA error code ERR_CHUNK MUST be returned.
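The version check can be sketched as below. The error code values and the low/high fields follow the XDR description in Section 4.3. The function name and the SUPPORTED_LOW and SUPPORTED_HIGH constants are assumptions of this sketch. Only the error body is encoded here; the enclosing RDMA_ERROR reply header is not.

```python
import struct

ERR_VERS = 1
ERR_CHUNK = 2

# Assumption: this peer implements only version 1.
SUPPORTED_LOW = 1
SUPPORTED_HIGH = 1


def check_version(vers):
    """Return None if vers is supported, else an encoded ERR_VERS body."""
    if SUPPORTED_LOW <= vers <= SUPPORTED_HIGH:
        return None
    # XDR uint32 values are big-endian: discriminant, low, high.
    return struct.pack(">III", ERR_VERS, SUPPORTED_LOW, SUPPORTED_HIGH)
```

The low and high values are inclusive, so a sender receiving this body can pick any version in that range for its next attempt.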
4.3. XDR Language Description 4.3. XDR Language Description
Here is the message layout in XDR language. Here is the message layout in XDR language.
struct xdr_rdma_segment { struct xdr_rdma_segment {
uint32 handle; /* Registered memory handle */ uint32 handle; /* Registered memory handle */
uint32 length; /* Length of the chunk in bytes */ uint32 length; /* Length of the chunk in bytes */
uint64 offset; /* Chunk virtual address or offset */ uint64 offset; /* Chunk virtual address or offset */
}; };
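As an illustration of how the xdr_rdma_segment structure above appears on the wire, the following sketch packs and unpacks one segment using big-endian XDR integers. The function names are hypothetical.

```python
import struct

# uint32 handle, uint32 length, uint64 offset: 16 bytes in XDR.
_SEGMENT = struct.Struct(">IIQ")


def encode_segment(handle, length, offset):
    """Encode one xdr_rdma_segment as 16 big-endian bytes."""
    return _SEGMENT.pack(handle, length, offset)


def decode_segment(data):
    """Decode 16 bytes into a (handle, length, offset) tuple."""
    return _SEGMENT.unpack(data)
```

All three fields are already multiples of four bytes, so no XDR padding is needed between or after them.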
skipping to change at page 23, line 9 skipping to change at page 23, line 9
struct xdr_read_list *rdma_reads; struct xdr_read_list *rdma_reads;
struct xdr_write_list *rdma_writes; struct xdr_write_list *rdma_writes;
struct xdr_write_chunk *rdma_reply; struct xdr_write_chunk *rdma_reply;
/* rpc body follows */ /* rpc body follows */
}; };
enum rpc_rdma_errcode { enum rpc_rdma_errcode {
ERR_VERS = 1, ERR_VERS = 1,
ERR_CHUNK = 2 ERR_CHUNK = 2
}; };
union rpc_rdma_error switch (rpc_rdma_errcode) { union rpc_rdma_error switch (rpc_rdma_errcode err) {
case ERR_VERS: case ERR_VERS:
uint32 rdma_vers_low; uint32 rdma_vers_low;
uint32 rdma_vers_high; uint32 rdma_vers_high;
case ERR_CHUNK: case ERR_CHUNK:
void; void;
default: default:
uint32 rdma_extra[8]; uint32 rdma_extra[8];
}; };
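The rpc_rdma_error union above can be encoded along these lines. This is a sketch under the usual XDR rules (a 4-byte discriminant followed by the selected arm, and fixed-length arrays carried without a count); the function name encode_rdma_error and its keyword arguments are hypothetical.

```python
import struct

ERR_VERS = 1
ERR_CHUNK = 2


def encode_rdma_error(err, vers_low=None, vers_high=None, extra=None):
    """Encode the discriminant and the arm it selects."""
    out = struct.pack(">i", err)  # enum discriminant, 4 bytes
    if err == ERR_VERS:
        # Arm: rdma_vers_low, rdma_vers_high
        out += struct.pack(">II", vers_low, vers_high)
    elif err == ERR_CHUNK:
        # Arm: void, nothing follows the discriminant
        pass
    else:
        # Default arm: rdma_extra[8], a fixed-length array of uint32
        words = list(extra) if extra is not None else [0] * 8
        if len(words) != 8:
            raise ValueError("rdma_extra must hold exactly 8 words")
        out += struct.pack(">8I", *words)
    return out
```

An ERR_VERS reply therefore carries 12 bytes of error body, an ERR_CHUNK reply 4 bytes, and any other code 36 bytes.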
5. Long Messages 5. Long Messages
skipping to change at page 23, line 44 skipping to change at page 23, line 44
filename strings. Also, the NFS version 4 protocol [RFC3530] filename strings. Also, the NFS version 4 protocol [RFC3530]
features COMPOUND request and reply messages of unbounded length. features COMPOUND request and reply messages of unbounded length.
Ideally, each upper layer will negotiate these limits. However, it Ideally, each upper layer will negotiate these limits. However, it
is frequently necessary to provide a transparent solution. is frequently necessary to provide a transparent solution.
5.1. Message as an RDMA Read Chunk 5.1. Message as an RDMA Read Chunk
One relatively simple method is to have the client identify any RPC One relatively simple method is to have the client identify any RPC
message that exceeds the RPC server's posted buffer size and move message that exceeds the RPC server's posted buffer size and move
it separately as a chunk, i.e. reference it as the first entry in it separately as a chunk, i.e., reference it as the first entry in
the read chunk list with an XDR position of zero. the read chunk list with an XDR position of zero.
Normal Message Normal Message
+--------+---------+---------+------------+-------------+---------- +--------+---------+---------+------------+-------------+----------
| | | | | | RPC Call | | | | | | RPC Call
| XID | Version | Credits | RDMA_MSG | Chunk Lists | or | XID | Version | Credits | RDMA_MSG | Chunk Lists | or
| | | | | | Reply Msg | | | | | | Reply Msg
+--------+---------+---------+------------+-------------+---------- +--------+---------+---------+------------+-------------+----------
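The client-side decision described in section 5.1 might look like the sketch below. The function name plan_call, the dictionary layout, and the parameter server_inline_limit (standing in for the server's posted buffer size) are all hypothetical; the header message type used for a long message is not modeled here.

```python
def plan_call(rpc_message_len, server_inline_limit):
    """Decide whether an RPC call goes inline or as a read chunk.

    A message that fits the server's posted buffer is sent inline.
    Otherwise the whole message is referenced as the first entry in
    the read chunk list, with an XDR position of zero.
    """
    if rpc_message_len <= server_inline_limit:
        return {"inline": True, "read_chunks": []}
    return {
        "inline": False,
        "read_chunks": [{"position": 0, "length": rpc_message_len}],
    }
```

For example, with a 1024-byte limit a 4096-byte call would be described by a single read chunk at position zero rather than sent inline.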
skipping to change at page 27, line 28 skipping to change at page 27, line 28
connection requirements to the server, and allows the server to connection requirements to the server, and allows the server to
inform the client of its connection limits. inform the client of its connection limits.
Use of the Connection Configuration Protocol by an upper layer is Use of the Connection Configuration Protocol by an upper layer is
OPTIONAL. OPTIONAL.
6.1. Initial Connection State 6.1. Initial Connection State
This protocol MAY be used for connection setup prior to the use of This protocol MAY be used for connection setup prior to the use of
another RPC protocol that uses the RDMA transport. It operates in- another RPC protocol that uses the RDMA transport. It operates in-
band, i.e. it uses the connection itself to negotiate the band, i.e., it uses the connection itself to negotiate the
connection parameters. To provide a basis for connection connection parameters. To provide a basis for connection
negotiation, the connection is assumed to provide a basic level of negotiation, the connection is assumed to provide a basic level of
interoperability: the ability to exchange at least one RPC message interoperability: the ability to exchange at least one RPC message
at a time that is at least 1 KB in size. The server MAY exceed at a time that is at least 1 KB in size. The server MAY exceed
this basic level of configuration, but the client MUST NOT assume this basic level of configuration, but the client MUST NOT assume
it. more than one, and MUST receive a valid reply from the server
carrying the actual number of available receive messages, prior to
sending its next request.
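The -06 wording above implies a simple client-side rule: start with a single credit and wait for the server's reply before sending again. A minimal sketch, assuming a hypothetical InitialCredits helper and a reply that reports the number of available receive messages:

```python
class InitialCredits:
    """Track how many requests a client may have outstanding."""

    def __init__(self):
        # The client MUST NOT assume more than one message at first.
        self.available = 1
        self.outstanding = 0

    def can_send(self):
        return self.outstanding < self.available

    def on_send(self):
        if not self.can_send():
            raise RuntimeError("no credit available; wait for a reply")
        self.outstanding += 1

    def on_reply(self, granted):
        # The reply carries the actual number of receive messages
        # the server has made available.
        self.outstanding -= 1
        self.available = granted
```

After the first reply arrives, the client may have up to the granted number of requests in flight.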
6.2. Protocol Description 6.2. Protocol Description
Version 1 of the Connection Configuration protocol consists of a Version 1 of the Connection Configuration protocol consists of a
single procedure that allows the client to inform the server of its single procedure that allows the client to inform the server of its
connection requirements and the server to return connection connection requirements and the server to return connection
information to the client. information to the client.
The maxcall_sendsize argument is the maximum size of an RPC call The maxcall_sendsize argument is the maximum size of an RPC call
message that the client MUST send inline in an RDMA Send message to message that the client MAY send inline in an RDMA Send message to
the server. The server MAY return a maxcall_sendsize value that is the server. The server MAY return a maxcall_sendsize value that is
smaller or larger than the client's request. The client MUST NOT smaller or larger than the client's request. The client MUST NOT
send an inline call message larger than what the server will send an inline call message larger than what the server will
accept. The maxcall_sendsize limits only the size of inline RPC accept. The maxcall_sendsize limits only the size of inline RPC
calls. It does not limit the size of long RPC messages transferred calls. It does not limit the size of long RPC messages transferred
as an initial chunk in the Read chunk list. as an initial chunk in the Read chunk list.
The maxreply_sendsize is the maximum size of an inline RPC message The maxreply_sendsize is the maximum size of an inline RPC message
that the client will accept from the server. that the client will accept from the server.
skipping to change at page 30, line 44 skipping to change at page 30, line 44
resolve its desired service to a mappable port, and proceed to resolve its desired service to a mappable port, and proceed to
connect. This is the most flexible and compatible approach, connect. This is the most flexible and compatible approach,
for those upper layers which are defined to use the rpcbind for those upper layers which are defined to use the rpcbind
service. service.
A second possibility is to have the server's portmapper A second possibility is to have the server's portmapper
register itself on the RDMA interconnect at a "well known" register itself on the RDMA interconnect at a "well known"
service address. (On UDP or TCP, this corresponds to port service address. (On UDP or TCP, this corresponds to port
111.) A client could connect to this service address and use 111.) A client could connect to this service address and use
the portmap protocol to obtain a service address in response the portmap protocol to obtain a service address in response
to a program number, e.g. an iWARP port number, or an to a program number, e.g., an iWARP port number, or an
Infiniband GID. Infiniband GID.
Alternatively, the client could simply connect to the mapped Alternatively, the client could simply connect to the mapped
well-known port for the service itself, if it is appropriately well-known port for the service itself, if it is appropriately
defined. defined.
Historically, different RPC protocols have taken different Historically, different RPC protocols have taken different
approaches to their port assignment, therefore the specific method approaches to their port assignment, therefore the specific method
is left to each RPC/RDMA-enabled upper layer binding, and not is left to each RPC/RDMA-enabled upper layer binding, and not
addressed here. addressed here.
skipping to change at page 31, line 28 skipping to change at page 31, line 28
integrity checking, and privacy. This security mechanism will be integrity checking, and privacy. This security mechanism will be
unaffected by the RDMA transport. The data integrity and privacy unaffected by the RDMA transport. The data integrity and privacy
features alter the body of the message, presenting it as a single features alter the body of the message, presenting it as a single
chunk. For large messages the chunk may be large enough to qualify chunk. For large messages the chunk may be large enough to qualify
for RDMA Read transfer. However, there is much data movement for RDMA Read transfer. However, there is much data movement
associated with computation and verification of integrity, or associated with computation and verification of integrity, or
encryption/decryption, so certain performance advantages may be encryption/decryption, so certain performance advantages may be
lost. lost.
For efficiency, a more appropriate security mechanism for RDMA links For efficiency, a more appropriate security mechanism for RDMA links
may be link-level protection, such as IPSec, which may be co- may be link-level protection, such as certain configurations of
located in the RDMA hardware. The use of link-level protection MAY IPsec, which may be co-located in the RDMA hardware. The use of
be negotiated through the use of a new RPCSEC_GSS mechanism like link-level protection MAY be negotiated through the use of a new
the Credential Cache GSS Mechanism [CCM]. Use of such mechanisms RPCSEC_GSS mechanism like the Credential Cache GSS Mechanism [CCM].
is RECOMMENDED where end-to-end integrity and/or privacy is Use of such mechanisms is RECOMMENDED where end-to-end integrity
desired, and where efficiency is required. and/or privacy is desired, and where efficiency is required.
There are no new issues here with exposed addresses. The only There are no new issues here with exposed addresses. The only
exposed addresses here are in the chunk list and in the transport exposed addresses here are in the chunk list and in the transport
packets transferred via RDMA. The data contained in these packets transferred via RDMA. The data contained in these
addresses continues to be protected by RPCSEC_GSS integrity and addresses continues to be protected by RPCSEC_GSS integrity and
privacy. privacy.
12. IANA Considerations 12. IANA Considerations
The new RPC transport is to be assigned a new RPC "netid", which is The new RPC transport is to be assigned a new RPC "netid", which is
an rpcbind [RFC1833] string used to describe the underlying an rpcbind [RFC1833] string used to describe the underlying
protocol in order for RPC to select the appropriate transport protocol in order for RPC to select the appropriate transport
framing, as well as the format of the service ports. framing, as well as the format of the service ports.
The following string is to be added to the "nc_proto" registry on The following "nc_proto" registry string is hereby defined for this
page 5 of [RFC1833]: purpose:
NC_RDMA "rdma" NC_RDMA "rdma"
The mechanism of adding this value to the RPC netid registry is
outside the scope of this document and is an IANA consideration.
This netid MAY be used for any RDMA network satisfying the This netid MAY be used for any RDMA network satisfying the
requirements of section 2, and able to identify service endpoints requirements of section 2, and able to identify service endpoints
using IP port addressing, possibly through use of a translation using IP port addressing, possibly through use of a translation
service as described above in section 10, RPC Binding. service as described above in section 10, RPC Binding.
As a new RPC transport, this protocol has no effect on RPC program As a new RPC transport, this protocol has no effect on RPC program
numbers or existing registered port numbers. However, new port numbers or existing registered port numbers. However, new port
numbers MAY be registered for use by RPC/RDMA-enabled services, as numbers MAY be registered for use by RPC/RDMA-enabled services, as
appropriate to the new networks over which the services will appropriate to the new networks over which the services will
operate. operate.
The OPTIONAL Connection Configuration protocol described herein The OPTIONAL Connection Configuration protocol described herein
requires an RPC program number assignment. The value "100400" is requires an RPC program number assignment. The value "100400" is
assigned: hereby assigned:
rdmaconfig 100400 rpc.rdmaconfig rdmaconfig 100400 rpc.rdmaconfig
Currently, these numbers are not assigned by IANA, they are merely Currently, these numbers are not assigned by IANA, they are merely
republished [IANA-RPC]. republished [IANA-RPC]. The mechanism of this republishing is
outside the scope of this document and is an IANA consideration.
13. Acknowledgements 13. Acknowledgements
The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak, The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David
Robinson and Mallikarjun Chadalapaka for their contributions to Robinson and Mallikarjun Chadalapaka for their contributions to
this document. this document.
14. Normative References 14. Normative References
[RFC2119] [RFC2119]
S. Bradner, "Key words for use in RFCs to Indicate Requirement S. Bradner, "Key words for use in RFCs to Indicate Requirement
Levels", Best Current Practice, BCP 14, RFC 2119, March 1997. Levels", Best Current Practice, BCP 14, RFC 2119, March 1997.
[RFC1094] [RFC1094]
Sun Microsystems, "NFS: Network File System Protocol Sun Microsystems, "NFS: Network File System Protocol
Specification", (NFS version 2) Informational RFC, Specification", (NFS version 2) Informational RFC,
http://www.ietf.org/rfc/rfc1094.txt http://www.ietf.org/rfc/rfc1094.txt
[RFC1831] [RFC1831bis]
R. Srinivasan, "RPC: Remote Procedure Call Protocol R. Thurlow, Ed., "RPC: Remote Procedure Call Protocol
Specification Version 2", Standards Track RFC, Specification Version 2", Standards Track RFC
http://www.ietf.org/rfc/rfc1831.txt
[RFC4506] [RFC4506]
M. Eisler Ed., "XDR: External Data Representation Standard", M. Eisler Ed., "XDR: External Data Representation Standard",
Standards Track RFC, http://www.ietf.org/rfc/rfc4506.txt Standards Track RFC, http://www.ietf.org/rfc/rfc4506.txt
[RFC1813] [RFC1813]
B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3
Protocol Specification", Informational RFC, Protocol Specification", Informational RFC,
http://www.ietf.org/rfc/rfc1813.txt http://www.ietf.org/rfc/rfc1813.txt
[RFC1833] [RFC1833]
R. Srinivasan, "Binding Protocols for ONC RPC Version 2", R. Srinivasan, "Binding Protocols for ONC RPC Version 2",
Standards Track RFC, http://www.ietf.org/rfc/rfc1833.txt Standards Track RFC, http://www.ietf.org/rfc/rfc1833.txt
[RFC3530] [RFC3530]
skipping to change at page 33, line 26 skipping to change at page 33, line 30
Track RFC, http://www.ietf.org/rfc/rfc3530.txt Track RFC, http://www.ietf.org/rfc/rfc3530.txt
[RFC2203] [RFC2203]
M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol
Specification", Standards Track RFC, Specification", Standards Track RFC,
http://www.ietf.org/rfc/rfc2203.txt http://www.ietf.org/rfc/rfc2203.txt
15. Informative References 15. Informative References
[RDMAP] [RDMAP]
R. Recio et. al., "A Remote Direct Memory Access Protocol R. Recio et al., "A Remote Direct Memory Access Protocol
Specification", Standards Track RFC, draft-ietf-rddp-rdmap Specification", Standards Track RFC, draft-ietf-rddp-rdmap
[CCM] [CCM]
M. Eisler, N. Williams, "CCM: The Credential Cache GSS M. Eisler, N. Williams, "CCM: The Credential Cache GSS
Mechanism", Internet Draft Work in Progress, draft-ietf- Mechanism", Internet Draft Work in Progress, draft-ietf-
nfsv4-ccm nfsv4-ccm
[NFSDDP] [NFSDDP]
B. Callaghan, T. Talpey, "NFS Direct Data Placement" Internet B. Callaghan, T. Talpey, "NFS Direct Data Placement" Internet
Draft Work in Progress, draft-ietf-nfsv4-nfsdirect Draft Work in Progress, draft-ietf-nfsv4-nfsdirect
[RDDP] [RDDP]
H. Shah et. al., "Direct Data Placement over Reliable H. Shah et al., "Direct Data Placement over Reliable
Transports", Standards Track RFC, draft-ietf-rddp-ddp Transports", Standards Track RFC, draft-ietf-rddp-ddp
[NFSRDMAPS] [NFSRDMAPS]
T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet
Draft Work in Progress, draft-ietf-nfsv4-nfs-rdma-problem- Draft Work in Progress, draft-ietf-nfsv4-nfs-rdma-problem-
statement statement
[NFSv4.1] [NFSv4.1]
S. Shepler et. al., ed., "NFSv4 Minor Version 1" Internet S. Shepler et al., ed., "NFSv4 Minor Version 1" Internet Draft
Draft Work in Progress, draft-ietf-nfsv4-minorversion1 Work in Progress, draft-ietf-nfsv4-minorversion1
[IB] [IB]
Infiniband Architecture Specification, available from Infiniband Architecture Specification, available from
http://www.infinibandta.org http://www.infinibandta.org
[IBPORT] [IBPORT]
Infiniband Trade Association, "IP Addressing Annex", available Infiniband Trade Association, "IP Addressing Annex", available
from http://www.infinibandta.org from http://www.infinibandta.org
[IANA-RPC] [IANA-RPC]
IANA Sun RPC number statement, IANA Sun RPC number statement,
 End of changes. 36 change blocks. 
70 lines changed or deleted 74 lines changed or added
