Internet-Draft                                 Brent Callaghan
Expires: April 2006                                 Tom Talpey

Document: draft-ietf-nfsv4-rpcrdma-02            October, 2005


                       RDMA Transport for ONC RPC


Status of this Memo


     By submitting this Internet-Draft, each author represents that any
     applicable patent or other IPR claims of which he or she is aware
     have been or will be disclosed, and any of which he or she becomes
     aware will be disclosed, in accordance with Section 6 of BCP 79.

     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups.  Note that
     other groups may also distribute working documents as Internet-
     Drafts.

     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other
     documents at any time.  It is inappropriate to use Internet-Drafts
     as reference material or to cite them other than as "work in
     progress."

     The list of current Internet-Drafts can be accessed at
         http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
         http://www.ietf.org/shadow.html.

Abstract

     A protocol is described providing RDMA as a new transport for ONC
     RPC.  The RDMA transport binding conveys the benefits of efficient,
     bulk data transport over high speed networks, while providing for
     minimal change to RPC applications and with no required revision of
     the application RPC protocol, or the RPC protocol itself.











Expires: April 2006       Callaghan and Talpey                  [Page 1]

Internet-Draft         RDMA Transport for ONC RPC           October 2005


Table of Contents

     1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
     2.  Abstract RDMA Requirements . . . . . . . . . . . . . . . . . 3
     3.  Protocol Outline . . . . . . . . . . . . . . . . . . . . . . 4
     3.1.  Short Messages . . . . . . . . . . . . . . . . . . . . . . 4
     3.2.  Data Chunks  . . . . . . . . . . . . . . . . . . . . . . . 5
     3.3.  Flow Control . . . . . . . . . . . . . . . . . . . . . . . 5
     3.4.  XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 7
     3.5.  Padding  . . . . . . . . . . . . . . . . . . . . . . . .  10
     3.6.  XDR Decoding with Read Chunks  . . . . . . . . . . . . .  11
     3.7.  XDR Decoding with Write Chunks . . . . . . . . . . . . .  12
     3.8.  RPC Call and Reply . . . . . . . . . . . . . . . . . . .  12
     4.  RPC RDMA Message Layout  . . . . . . . . . . . . . . . . .  15
     4.1.  RPC over RDMA Header . . . . . . . . . . . . . . . . . .  16
     4.2.  RPC over RDMA header errors  . . . . . . . . . . . . . .  17
     4.3.  XDR Language Description . . . . . . . . . . . . . . . .  18
     5.  Long Messages  . . . . . . . . . . . . . . . . . . . . . .  20
     5.1.  Message as an RDMA Read Chunk  . . . . . . . . . . . . .  20
     5.2.  RDMA Write of Long Replies (Reply Chunks)  . . . . . . .  22
     6.  Connection Configuration Protocol  . . . . . . . . . . . .  23
     6.1.  Initial Connection State . . . . . . . . . . . . . . . .  24
     6.2.  Protocol Description . . . . . . . . . . . . . . . . . .  24
     7.  Memory Registration Overhead . . . . . . . . . . . . . . .  25
     8.  Errors and Error Recovery  . . . . . . . . . . . . . . . .  26
     9.  Node Addressing  . . . . . . . . . . . . . . . . . . . . .  26
     10.  RPC Binding . . . . . . . . . . . . . . . . . . . . . . .  26
     11.  Security  . . . . . . . . . . . . . . . . . . . . . . . .  27
     12.  IANA Considerations . . . . . . . . . . . . . . . . . . .  27
     13.  Acknowledgements  . . . . . . . . . . . . . . . . . . . .  27
     14.  Normative References  . . . . . . . . . . . . . . . . . .  27
     15.  Informative References  . . . . . . . . . . . . . . . . .  28
     16.  Authors' Addresses  . . . . . . . . . . . . . . . . . . .  29
     17.  Intellectual Property and Copyright Statements  . . . . .  29
     Acknowledgement  . . . . . . . . . . . . . . . . . . . . . . .  30

1.  Introduction

     RDMA is a technique for efficient movement of data between end
     nodes, which becomes increasingly compelling over high speed
     transports.  By directing data into destination buffers as it is
     sent on a network, and placing it via direct memory access by
     hardware, the double benefit of faster transfers and reduced host
     overhead is obtained.

     ONC RPC [RFC1831] is a remote procedure call protocol that has been
     run over a variety of transports.  Most RPC implementations today
     use UDP or TCP.  RPC messages are defined in terms of an eXternal
     Data Representation (XDR) [RFC1832] which provides a canonical data
     representation across a variety of host architectures.  An XDR data
     stream is conveyed differently on each type of transport.  On UDP,
     RPC messages are encapsulated inside datagrams, while on a TCP byte
     stream, RPC messages are delineated by a record marking protocol.
     An RDMA transport also conveys RPC messages in a unique fashion
     that must be fully described if client and server implementations
     are to interoperate.

     RDMA transports present new semantics unlike the behaviors of
     either UDP or TCP alone.  They retain message delineations like
     UDP while also providing a reliable, sequenced data transfer like
     TCP.  They also provide the new, efficient bulk transfer service of
     RDMA.  RDMA transports are therefore naturally viewed as a new
     transport type by ONC RPC.

     RDMA as a transport will benefit the performance of RPC protocols
     that move large "chunks" of data, since RDMA hardware excels at
     moving data efficiently between host memory and a high speed
     network with little or no host CPU involvement.  In this context,
     the NFS protocol, in all its versions, is an obvious beneficiary of
     RDMA.  A complete problem statement is discussed in [NFSRDMAPS],
     and related NFSv4 issues are discussed in [NFSSESS].  Many other
     RPC-based protocols will also benefit.

     Although the RDMA transport described here provides relatively
     transparent support for any RPC application, the proposal goes
     further in describing mechanisms that can optimize the use of RDMA
     with more active participation by the RPC application.

2.  Abstract RDMA Requirements

     An RPC transport is responsible for conveying an RPC message from a
     sender to a receiver.  An RPC message is either an RPC call from a
     client to a server, or an RPC reply from the server back to the
     client.  An RPC message contains an RPC call header followed by
     arguments if the message is an RPC call, or an RPC reply header
     followed by results if the message is an RPC reply.  The call
     header contains a transaction ID (XID) followed by the program and
     procedure number as well as a security credential.  An RPC reply
     header begins with an XID that matches that of the RPC call
     message, followed by a security verifier and results.  All data in
     an RPC message is XDR encoded.  For a complete description of the
     RPC protocol and XDR encoding, see [RFC1831] and [RFC1832].

     This protocol assumes the following abstract model for RDMA
     transports.  These terms, common in the RDMA lexicon, are used in
     this document.  A more complete glossary of RDMA terms can be found
     in [RDMA].

     o Registered Memory
          All data moved via tagged RDMA operations must be resident in
          registered memory at its destination.  This protocol assumes
          that each segment of registered memory may be identified with
          a steering tag of no more than 32 bits and memory addresses of
          up to 64 bits in length.

     o RDMA Send
          The RDMA provider supports an RDMA Send operation with
          completion signalled at the receiver when data is placed in a
          pre-posted buffer.  The amount of transferred data is limited
          only by the size of the receiver's buffer.  Sends complete at
          the receiver in the order they were issued at the sender.

     o RDMA Write
          The RDMA provider supports an RDMA Write operation to directly
          place data in the receiver's buffer.  An RDMA Write is
          initiated by the sender and completion is signalled at the
          sender.  No completion is signalled at the receiver.  The
          sender uses a steering tag, memory address and length of the
          remote destination buffer.  RDMA Writes are not necessarily
          ordered with respect to one another, but are ordered with
          respect to RDMA Sends; a subsequent RDMA Send completion must
          be obtained at the receiver to notify that prior RDMA Write
          data has been successfully placed in the receiver's memory.

     o RDMA Read
          The RDMA provider supports an RDMA Read operation to directly
          place peer source data in the requester's buffer.  An RDMA
          Read is initiated by the receiver and completion is signalled
          at the receiver.  The receiver provides steering tags, memory
          addresses and a length for the remote source and local
          destination buffers.  Since the peer at the data source
          receives no notification of RDMA Read completion, there is an
          assumption that on receiving the data the receiver will signal
          completion with an RDMA Send message, so that the peer can
          free the source buffers and the associated steering tags.

     This protocol is designed to be carried over all RDMA transports
     meeting the stated requirements.  This protocol conveys to the RPC
     peer, information sufficient for that RPC peer to direct an RDMA
     layer to perform transfers containing RPC data, and to communicate
     their result(s).  For example, it is readily carried over RDMA
     transports such as iWARP [RDDP] or InfiniBand [IB].

3.  Protocol Outline

     An RPC message can be conveyed in identical fashion, whether it is
     a call or reply message.  In each case, the transmission of the
     message proper is preceded by transmission of a transport-specific
     header for use by RPC over RDMA transports.  This header is
     analogous to the record marking used for RPC over TCP, but is more
     extensive, since RDMA transports support several modes of data
     transfer and it is important to allow the client and server to use
     the most efficient mode for any given transfer.  Multiple segments
     of a message may be transferred in different ways to different
     remote memory destinations.

     All transfers of a call or reply begin with an RDMA Send which
     transfers at least the RPC over RDMA header, usually with the call
     or reply message appended, or at least some part thereof.  Because
     the size of what may be transmitted via RDMA Send is limited by the
     size of the receiver's pre-posted buffer, the RPC over RDMA
     transport provides a number of methods to reduce the amount
     transferred by means of the RDMA Send, when necessary, by
     transferring various parts of the message using RDMA Read and RDMA
     Write.

3.1.  Short Messages

     Many RPC messages are quite short.  For example, the NFS version 3
     GETATTR request is only 56 bytes: 20 bytes of RPC header plus a 32
     byte filehandle argument and 4 bytes of length.  The reply to this
     common request is about 100 bytes.

     There is no benefit in transferring such small messages with an
     RDMA Read or Write operation.  The overhead in transferring
     steering tags and memory addresses is justified only by large
     transfers.  The critical message size that justifies RDMA transfer
     will vary depending on the RDMA implementation and network, but is
     typically of the order of a few kilobytes.  It is appropriate to
     transfer a short message with an RDMA Send to a pre-posted buffer.
     The RPC over RDMA header with the short message (call or reply)
     immediately following is transferred using a single RDMA Send
     operation.

     Short RPC messages over an RDMA transport will look like this:


       RPC Client                           RPC Server
           |               RPC Call              |
      Send |   ------------------------------>   |
           |                                     |
           |               RPC Reply             |
           |   <------------------------------   | Send


3.2.  Data Chunks

     Some protocols, like NFS, have RPC procedures that can transfer
     very large "chunks" of data in the RPC call or reply and would
     cause the maximum send size to be exceeded if one tried to transfer
     them as part of the RDMA Send.  These large chunks typically range
     from a kilobyte to a megabyte or more.  An RDMA transport can
     transfer large chunks of data more efficiently via the direct
     placement of an RDMA Read or RDMA Write operation.  Using direct
     placement instead of inline transfer not only avoids expensive data
     copies, but provides correct data alignment at the destination.

3.3.  Flow Control

     It is critical to provide RDMA Send flow control for an RDMA
     connection.  RDMA receive operations will fail if a pre-posted
     receive buffer is not available to accept an incoming RDMA Send,
     and repeated occurrences of such errors can be fatal to the
     connection.  This is a departure from conventional TCP/IP
     networking where buffers are allocated dynamically on an as-needed
     basis, and pre-posting is not required.

     It is not practical to provide for fixed credit limits at the RPC
     server.  Fixed limits scale poorly, since posted buffers are
     dedicated to the associated connection until consumed by receive
     operations.  Additionally for protocol correctness, the RPC server
     must always be able to reply to client requests, whether or not new
     buffers have been posted to accept future receives.  (Note that the
     RPC server may in fact be a client at some other layer.  For
     example, NFSv4 callbacks are processed by the NFSv4 client, acting
     as an RPC server. The credit discussions apply equally in either
     case.)

     Flow control for RDMA Send operations is implemented as a simple
     request/grant protocol in the RPC over RDMA header associated with
     each RPC message.  The RPC over RDMA header for RPC call messages
     contains a requested credit value for the RPC server, which may be
     dynamically adjusted by the caller to match its expected needs.
     The RPC over RDMA header for the RPC reply messages provides the
     granted result, which may have any value, except that it must not
     be zero when no operations are in progress at the server, since
     such a value would result in deadlock.  The value may be adjusted
     up or down at each opportunity to match the server's needs or
     policies.

     The RPC client must not send unacknowledged requests in excess of
     this granted RPC server credit limit.  If the limit is exceeded,
     the RDMA layer may signal an error, possibly terminating the
     connection.  Even if an error does not occur, there is no
     requirement that the server must handle the excess request(s), and
     it may return an RPC error to the client.  Also note that the
     never-zero requirement implies that an RPC server must always
     provide at least one credit to each connected RPC client.  It does
     not however require that the server must always be prepared to
     receive a request from each client, for example when the server is
     busy processing all granted client requests.

     While RPC calls may complete in any order, the current flow control
     limit at the RPC server is known to the RPC client from the Send
     ordering properties.  It is always the most recent server-granted
     credit value minus the number of requests in flight.
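
     The client-side credit accounting described above can be sketched
     as follows.  The class name CreditWindow and its methods are
     hypothetical, and the initial grant of one credit before the first
     reply arrives is an assumption of this sketch, not a value taken
     from this document.

```python
class CreditWindow:
    """Client-side view of the server-granted credit limit (illustrative)."""

    def __init__(self, initial_grant: int = 1) -> None:
        # Assumption: one credit is usable before any reply is seen.
        if initial_grant < 1:
            raise ValueError("a connected client holds at least one credit")
        self.granted = initial_grant
        self.in_flight = 0

    def available(self) -> int:
        # Most recent server grant minus requests still awaiting a reply.
        return max(self.granted - self.in_flight, 0)

    def send_call(self) -> None:
        if self.available() == 0:
            raise RuntimeError("credit limit reached; the call must wait")
        self.in_flight += 1

    def receive_reply(self, new_grant: int) -> None:
        if self.in_flight == 0:
            raise RuntimeError("reply without an outstanding call")
        self.in_flight -= 1
        # Never-zero rule: a zero grant with nothing in progress deadlocks.
        if new_grant == 0 and self.in_flight == 0:
            raise ValueError("zero grant with no operations in progress")
        self.granted = new_grant
```

     Because the grant may move up or down with each reply, available()
     clamps at zero rather than going negative when the server shrinks
     the limit below the number of calls already in flight.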

     Certain RDMA implementations may impose additional flow control
     restrictions, such as limits on RDMA Read operations in progress at
     the responder.  Because these operations are outside the scope of
     this protocol, they are not addressed and must be provided for by
     other layers.  For example, a simple upper layer RPC consumer might
     perform single-issue RDMA Read requests, while a more
     sophisticated, multithreaded RPC consumer may implement its own
     FIFO queue of such operations.  For further discussion of possible
     protocol implementations capable of negotiating these values, see
     Section 6 "Connection Configuration Protocol" of this draft, or
     [NFSSESS].

3.4.  XDR Encoding with Chunks

     The data comprising an RPC call or reply message is marshaled or
     serialized into a contiguous stream by an XDR routine.  XDR data
     types such as integers, strings, arrays and linked lists are
     commonly implemented over two very simple functions that encode
     either an XDR data unit (32 bits) or an array of bytes.

     Normally, the separate data items in an RPC call or reply are
     encoded as a contiguous sequence of bytes for network transmission
     over UDP or TCP.  However, in the case of an RDMA transport, local
     routines such as XDR encode can determine that (for instance) an
     opaque byte array is large enough to be more efficiently moved via
     an RDMA data transfer operation like RDMA Read or RDMA Write.

     Semantically speaking, the protocol has no restriction regarding
     data types which may or may not be represented by a read or write
     chunk.  In practice however, efficiency considerations lead to the
     conclusion that certain data types are not generally "chunkable".
     Typically, only opaque and aggregate data types which may attain
     substantial size are considered to be eligible.  With today's
     hardware this size may be a kilobyte or more.  However any object
     may be chosen for chunking in any given message.

     The eligibility of XDR data items to be candidates for being moved
     as data chunks (as opposed to being marshalled inline) is not
     specified by the RPC over RDMA protocol.  Chunk eligibility
     criteria must be determined by each upper layer in order to provide
     for an interoperable specification.  One such example with
     rationale, for the NFS protocol family, is provided in [NFSDDP].

     The interface by which an upper layer implementation communicates
     the eligibility of a data item locally to RPC for chunking is out
     of scope for this specification.  In many implementations, it is
     possible to implement a transparent RPC chunking facility.
     However, such implementations may lead to inefficiencies, either
     because they require the RPC layer to perform expensive
     registration and deregistration of memory "on the fly", or they may
     require using RDMA chunks in reply messages, along with the
     resulting additional handshaking with the RPC over RDMA peer.
     However, these issues are internal and generally confined to the
     local interface between RPC and its upper layers, one in which
     implementations are free to innovate.  The only requirement is that
     the resulting RPC RDMA protocol sent to the peer is valid for the
     upper layer.  See for example [NFSDDP].

     When sending any message (request or reply) that contains an
     eligible large data chunk, the XDR encoding routine avoids moving
     the data into the XDR stream.  Instead of encoding the data
     portion, it records the address and size of each chunk in a
     separate "read chunk list" encoded within RPC RDMA transport-
     specific headers.  Such chunks will be transferred via RDMA Read
     operations initiated by the receiver.
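
     A minimal sketch of such an encoding routine follows.  All names
     (ChunkingEncoder, ReadChunk, CHUNK_THRESHOLD) are hypothetical, the
     1024-byte eligibility threshold is an assumed value, and a Python
     bytes object stands in for registered memory.  The byte count of
     the opaque field stays in the stream; only the data is elided.

```python
import struct
from dataclasses import dataclass, field

CHUNK_THRESHOLD = 1024  # assumed eligibility threshold, in bytes


@dataclass
class ReadChunk:
    position: int  # XDR stream offset where the data would have appeared
    length: int    # size of the elided data, in bytes
    buffer: bytes  # stands in for registered memory left at the sender


@dataclass
class ChunkingEncoder:
    stream: bytearray = field(default_factory=bytearray)
    read_chunks: list = field(default_factory=list)

    def put_uint32(self, value: int) -> None:
        self.stream += struct.pack(">I", value)

    def put_opaque(self, data: bytes) -> None:
        # The byte count is always encoded in the XDR stream.
        self.put_uint32(len(data))
        if len(data) >= CHUNK_THRESHOLD:
            # Elide the data; record where it belongs and how large it is.
            self.read_chunks.append(
                ReadChunk(len(self.stream), len(data), data))
        else:
            pad = (-len(data)) % 4  # XDR pads opaque data to 4-byte units
            self.stream += data + b"\x00" * pad
```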

     When the read chunks are to be moved via RDMA, the memory for each
     chunk must be registered.  This registration may take place within
     XDR itself, providing for full transparency to upper layers, or it
     may be performed by any other specific local implementation.

     Additionally, when making an RPC call that can result in bulk data
     transferred in the reply, it is desirable to provide chunks to
     accept the data directly via RDMA Write.  These write chunks will
     therefore be pre-filled by the RPC server prior to responding, and
     XDR decode at the client will not be required.  These chunks
     undergo a similar registration and advertisement via "write chunk
     lists" built as a part of XDR encoding.

     Some RPC client implementations are not able to determine where an
     RPC call's results reside during the "encode" phase.  This makes it
     difficult or impossible for the RPC client layer to encode the
     write chunk list at the time of building the request.  In this
     case, it is difficult for the RPC implementation to provide
     transparency to the RPC consumer, which may require recoding to
     provide result information at this earlier stage.

     Therefore if the RPC client does not make a write chunk list
     available to receive the result, then the RPC server must return
     data inline in the reply, or if it so chooses, via a read chunk
     list.  RPC clients are discouraged from omitting write chunk lists
     for eligible replies, due to the lower performance of the
     additional handshaking to perform data transfer, and the
     requirement that the RPC server must expose (and preserve) the
     reply data for a period of time.  In the absence of a server-
     provided read chunk list in the reply, if the encoded reply
     overflows the posted receive buffer, the RPC will fail.

     When any data within a message is provided via either read or write
     chunks, the chunk itself refers only to the data portion of the XDR
     stream element.  In particular, for counted fields (e.g. a "<>"
     encoding) the byte count which is encoded as part of the field
     remains in the XDR stream, and is also encoded in the chunk list.
     The data portion is however elided from the encoded XDR stream, and
     is transferred as part of chunk list processing.  This is important
     to maintain upper layer implementation compatibility - both the
     count and the data must be transferred as part of the logical XDR
     stream.  While the chunk list processing results in the data being
     available to the upper layer peer for XDR decoding, the length
     present in the chunk list entries is not.  Any byte count in the
     XDR stream must match the sum of the byte counts present in the
     corresponding read or write chunk list.  If they do not agree, an
     RPC protocol encoding error results.
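
     The consistency rule above amounts to a single comparison.  The
     function name check_chunk_lengths is hypothetical, and raising
     ValueError is merely this sketch's way of signalling the encoding
     error.

```python
from typing import Sequence


def check_chunk_lengths(xdr_byte_count: int,
                        segment_lengths: Sequence[int]) -> None:
    """Raise if the XDR byte count disagrees with the chunk list total."""
    total = sum(segment_lengths)
    if total != xdr_byte_count:
        raise ValueError(
            f"XDR byte count {xdr_byte_count} does not match "
            f"chunk list total {total}")
```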

     The following items are contained in a chunk list entry.

     Handle
          Steering tag or handle obtained when the chunk memory is
          registered for RDMA.

     Length
          The length of the chunk in bytes.

     Offset
          The offset or beginning memory address of the chunk.  In order
          to support the widest array of RDMA implementations, as well
          as the most general steering tag scheme, this field is
          unconditionally included in each chunk list entry.

          While zero-based offset schemes are available in many RDMA
          implementations, their use by RPC requires individual
          registration of each read or write chunk.  On many such
          implementations this can be a significant overhead.  By
          providing an offset in each chunk, many pre-registration or
          region-based registrations can be readily supported, and by
          using a single, universal chunk representation, the RPC RDMA
          protocol implementation is simplified to its most general
          form.

     Position
          For data which is to be encoded, the position in the XDR
          stream where the chunk would normally reside.  Note that the
          chunk therefore inserts its data into the XDR stream at this
          position, but its transfer is no longer "inline".  Also note
          it is possible that a contiguous sequence of chunks might all
          have the same position.  For data which is to be decoded, no
          "position" is used.

     When XDR marshaling is complete, the chunk list is XDR encoded,
     then sent to the receiver prepended to the RPC message.  Any source
     data for a read chunk, or the destination of a write chunk, remain
     behind in the sender's registered memory and their actual payload
     is not marshalled into the request or reply.

     +----------------+----------------+-------------
     | RPC over RDMA  |                |
     |    header w/   |   RPC Header   | Non-chunk args/results
     |     chunks     |                |
     +----------------+----------------+-------------
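
     The chunk list entry fields listed above (Handle, Length, Offset,
     Position) can be sketched as a fixed-size record.  The 32-bit
     handle and 64-bit offset follow the Registered Memory model of
     Section 2; the 32-bit length and position, the field order, and
     the names ChunkEntry and _ENTRY are assumptions of this sketch
     rather than the normative layout.

```python
import struct
from dataclasses import dataclass

# Big-endian, as in XDR: position, handle, length, offset (assumed order).
_ENTRY = struct.Struct(">IIIQ")


@dataclass(frozen=True)
class ChunkEntry:
    handle: int        # steering tag from memory registration
    length: int        # chunk length in bytes
    offset: int        # starting memory address or offset of the chunk
    position: int = 0  # XDR stream position; unused when decoding

    def encode(self) -> bytes:
        return _ENTRY.pack(self.position, self.handle,
                           self.length, self.offset)

    @classmethod
    def decode(cls, raw: bytes) -> "ChunkEntry":
        position, handle, length, offset = _ENTRY.unpack(raw)
        return cls(handle=handle, length=length,
                   offset=offset, position=position)
```

     Carrying the offset unconditionally costs eight bytes per entry
     but lets one representation serve both zero-based and
     address-based steering tag schemes.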


     Read chunk lists and write chunk lists are structured somewhat
     differently.  This is due to the different usage - read chunks are
     decoded and indexed by their position in the XDR data stream, their
     size is always known, and may be used for both arguments and
     results.  Write chunks on the other hand are used only for results,
     and have neither a preassigned offset in the XDR stream, nor a size
     until the results are produced, since the buffers may not be used
     for results at all, or may be partially filled.  Their presence in
     the XDR stream is therefore not known until the reply is processed.
     The mapping of Write chunks onto designated NFS procedures and
     their results is described in [NFSDDP].

     Therefore, read chunks are encoded into a read chunk list as a
     single array, with each entry tagged by its (known) size and
     position in the XDR stream.  Write chunks are encoded as a list of
     arrays of RDMA buffers, with each list element (an array) providing
     buffers for a separate result.  Individual write chunk list
     elements may therefore end up partially or fully filled, or in
     fact not filled at all.  Unused write chunks, or unused
     bytes in write chunk buffer lists, are not returned as results, and
     their memory is returned to the upper layer as part of RPC
     completion.  However, the RPC layer should not assume that the
     buffers have not been modified.

3.5.  Padding

     Alignment of specific opaque data enables certain scatter/gather
     optimizations.  Padding leverages the useful property that RDMA
     transfers preserve alignment of data, even when they are placed
     into pre-posted receive buffers by Sends.

     Many servers can make good use of such padding.  Padding allows the
     chaining of RDMA receive buffers such that any data transferred by
     RDMA on behalf of RPC requests will be placed into appropriately
     aligned buffers on the system that receives the transfer.  In this
     way, the need for servers to perform RDMA Read to satisfy all but
     the largest client writes is obviated.

     The effect of padding is demonstrated below showing prior bytes on
     an XDR stream (XXX) followed by an opaque field consisting of four
     length bytes (LLLL) followed by data bytes (DDDD).  The receiver of
     the RDMA Send has posted two chained receive buffers.  Without
     padding, the opaque data is split across the two buffers.  With the
     addition of padding bytes (ppp) prior to the first data byte, the
     data can be forced to align correctly in the second buffer.

                                           Buffer 1       Buffer 2
     Unpadded                           --------------  --------------


      XXXXXXXLLLLDDDDDDDDDDDDDD    ---> XXXXXXXLLLLDDD  DDDDDDDDDDD


     Padded


      XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp  DDDDDDDDDDDDDD




     Padding is implemented completely within the RDMA transport
     encoding, flagged with a specific message type.  Where padding is
     applied, two values are passed to the peer:  an "rdma_align" which
     is the padding value used, and "rdma_thresh", which is the opaque
     data size at or above which padding is applied.  For instance, if
     the server is using chained 4 KB receive buffers, then up to (4 KB
     - 1) padding bytes could be used to achieve alignment of the data.
     If padding is to apply only to chunks at least 1 KB in size, then
     the threshold should be set to 1 KB.  The XDR routine at the peer
     will consult these values when decoding opaque values.  Where the
     decoded length meets or exceeds rdma_thresh, XDR decode will skip
     over the appropriate padding as indicated by rdma_align and the
     current XDR stream position.
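
     The padding computation can be sketched as below.  The function
     name padding_before_data is hypothetical, and the sketch assumes
     that the stream position is measured from the start of the first
     receive buffer and that padding applies to opaque data whose size
     is at or above rdma_thresh, per the definition above.

```python
def padding_before_data(stream_pos: int, opaque_len: int,
                        rdma_align: int, rdma_thresh: int) -> int:
    """Return the number of pad bytes to place before the opaque data."""
    if rdma_align <= 0:
        raise ValueError("rdma_align must be positive")
    if opaque_len < rdma_thresh:
        return 0  # small opaques are left unpadded
    # Bytes needed to advance stream_pos to the next multiple of rdma_align.
    return (-stream_pos) % rdma_align
```

     With the figure above (7 prior bytes, 4 length bytes, 14-byte
     buffers), padding_before_data(11, 14, 14, 1) yields 3, matching
     the three "p" bytes shown.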

3.6.  XDR Decoding with Read Chunks

     The XDR decode process moves data from an XDR stream into a data
     structure provided by the RPC client or server application.  Where
     elements of the destination data structure are buffers or strings,
     the RPC application can either pre-allocate storage to receive the
     data, or leave the string or buffer fields null and allow the XDR
     decode stage of RPC processing to automatically allocate storage of
     sufficient size.

     When decoding a message from an RDMA transport, the receiver first
     XDR decodes the chunk lists from the RPC over RDMA header, then
     proceeds to decode the body of the RPC message (arguments or
     results).  Whenever the XDR offset in the decode stream matches
     that of a chunk in the read chunk list, the XDR routine initiates
     an RDMA Read to bring over the chunk data into locally registered
     memory for the destination buffer.
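
     The decode step described above can be sketched for a single
     counted opaque field.  The names decode_opaque and rdma_read are
     hypothetical; rdma_read is a caller-supplied stand-in for the RDMA
     provider.  For brevity the sketch assumes at most one chunk per
     XDR position, although a sequence of chunks may share a position.

```python
import struct
from typing import Callable, Dict, Tuple

# read_chunks maps XDR position -> (handle, offset, length)
ChunkMap = Dict[int, Tuple[int, int, int]]


def decode_opaque(stream: bytes, pos: int, read_chunks: ChunkMap,
                  rdma_read: Callable[[int, int, int], bytes]
                  ) -> Tuple[bytes, int]:
    """Decode one counted opaque at pos; return (data, next position)."""
    (count,) = struct.unpack_from(">I", stream, pos)
    pos += 4
    chunk = read_chunks.get(pos)
    if chunk is not None:
        handle, offset, length = chunk
        if length != count:
            raise ValueError("chunk length disagrees with XDR byte count")
        # The data was elided from the stream: fetch it from the peer.
        return rdma_read(handle, offset, length), pos
    padded = (count + 3) & ~3  # inline opaque data is padded to 4 bytes
    return stream[pos:pos + count], pos + padded
```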

     When processing an RPC request, the RPC receiver (RPC server)
     acknowledges its completion of use of the source buffers by simply
     replying to the RPC sender (client), and the peer may free all
     source buffers advertised by the request.

     When processing an RPC reply, after completing such a transfer the
     RPC receiver (client) must issue an RDMA_DONE message (described in
     Section 3.8) to notify the peer (server) that the source buffers
     can be freed.

     The read chunk list is constructed and used entirely within the
     RPC/XDR layer.  Other than specifying the minimum chunk size, the
     management of the read chunk list is automatic and transparent to
     an RPC application.
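
     The decode step described in this section might be sketched in
     Python as follows.  This is not a definitive implementation.  The
     names decode_opaque and rdma_read are hypothetical; rdma_read
     stands in for whatever local interface performs the RDMA Read.  The
     sketch also assumes that the length word of a chunked value stays
     inline and that a chunk's position is the stream offset at which
     its data would otherwise begin.

```python
from typing import Callable, Dict, Tuple

# (handle, length, offset) as carried in a read chunk list entry
Segment = Tuple[int, int, int]


def decode_opaque(stream: bytes, pos: int,
                  read_chunks: Dict[int, Segment],
                  rdma_read: Callable[[Segment], bytes]) -> Tuple[bytes, int]:
    """Decode one counted opaque value whose length word is at `pos`.

    Returns (data, next_pos).  Assumes the length word stays inline and
    that a chunk's position is the offset where its data would begin.
    """
    length = int.from_bytes(stream[pos:pos + 4], "big")
    data_pos = pos + 4
    if data_pos in read_chunks:
        # The data was moved out of the stream: pull it with an RDMA Read.
        data = rdma_read(read_chunks[data_pos])
        return data[:length], data_pos
    # Ordinary inline opaque: data is padded to a 4-byte boundary.
    padded = (length + 3) & ~3
    return stream[data_pos:data_pos + length], data_pos + padded
```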

3.7.  XDR Decoding with Write Chunks

     When a "write chunk list" is provided for the results of the RPC
     call, the RPC server must provide any corresponding data via RDMA
     Write to the memory referenced in the chunk list entries.  The RPC
     reply conveys this by returning the write chunk list to the client
     with the lengths rewritten to match the actual transfer.  The XDR
     "decode" of the reply therefore performs no local data transfer but
     merely returns the length obtained from the reply.

     Each decoded result consumes one entry in the write chunk list,
     which in turn consists of an array of RDMA segments.  The length is
     therefore the sum of all returned lengths in all segments
     comprising the corresponding list entry.  As each list entry is
     "decoded", the entire entry is consumed.

     The write chunk list is constructed and used by the RPC
     application.  The RPC/XDR layer simply conveys the list between
     client and server and initiates the RDMA Writes back to the client.
     The mapping of write chunk list entries to procedure arguments must
     be determined for each protocol.  An example of a mapping is
     described in [NFSDDP].
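
     The length rule for write chunk list entries given in this section
     can be illustrated with a short Python sketch.  The function name
     and the tuple layout are assumptions made for the example; each
     list entry is modelled as a list of (handle, length, offset)
     segments as returned in the reply.

```python
from typing import Iterator, List, Tuple

Segment = Tuple[int, int, int]  # (handle, length, offset)


def decode_write_results(write_list: List[List[Segment]]) -> Iterator[int]:
    """Yield one result length per write chunk list entry, in order."""
    for entry in write_list:
        # The result length is the sum of the returned segment lengths,
        # and the whole entry is consumed by this one result.
        yield sum(length for _handle, length, _offset in entry)
```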

3.8.  RPC Call and Reply

     The RDMA transport for RPC provides three methods of moving data
     between RPC client and server:

     Inline
          Data are moved between RPC client and server within an RDMA
          Send.

     RDMA Read
          Data are moved between RPC client and server via an RDMA Read
          operation via steering tag, address and offset obtained from a
          read chunk list.

     RDMA Write
          Result data are moved from RPC server to client via an RDMA
          Write operation via steering tag, address and offset obtained
          from a write chunk list or reply chunk in the client's RPC
          call message.

     These methods of data movement may occur in combinations within a
     single RPC.  For instance, an RPC call may contain some inline data
     along with some large chunks to be transferred via RDMA Read to the
     server.  The reply to that call may have some result chunks that
     the server RDMA Writes back to the client.  The following protocol
     interactions illustrate RPC calls that use these methods to move
     RPC message data:

     An RPC with write chunks in the call message looks like this:


       RPC Client                           RPC Server
           |     RPC Call + Write Chunk list     |
      Send |   ------------------------------>   |
           |                                     |
           |               Chunk 1               |
           |   <------------------------------   | Write
           |                  :                  |
           |               Chunk n               |
           |   <------------------------------   | Write
           |                                     |
           |               RPC Reply             |
           |   <------------------------------   | Send


     In the presence of write chunks, RDMA ordering provides the
     guarantee that all data in the RDMA Write operations has been
     placed in memory prior to the client's RPC reply processing.

     An RPC with read chunks in the call message looks like this:


       RPC Client                           RPC Server
           |     RPC Call + Read Chunk list      |
      Send |   ------------------------------>   |
           |                                     |
           |               Chunk 1               |
           |   +------------------------------   | Read
           |   v----------------------------->   |
           |                  :                  |
           |               Chunk n               |
           |   +------------------------------   | Read
           |   v----------------------------->   |
           |                                     |
           |               RPC Reply             |
           |   <------------------------------   | Send


     And an RPC with read chunks in the reply message looks like this:


       RPC Client                           RPC Server
           |               RPC Call              |
      Send |   ------------------------------>   |
           |                                     |
           |     RPC Reply + Read Chunk list     |
           |   <------------------------------   | Send
           |                                     |
           |               Chunk 1               |
      Read |   ------------------------------+   |
           |   <-----------------------------v   |
           |                  :                  |
           |               Chunk n               |
      Read |   ------------------------------+   |
           |   <-----------------------------v   |
           |                                     |
           |                 Done                |
      Send |   ------------------------------>   |


     The final Done message allows the RPC client to signal the server
     that it has received the chunks, so the server can de-register and
     free the memory holding the chunks.  A Done completion is not
     necessary for an RPC call, since the RPC reply Send is itself a
     receive completion notification.  In the event that the client
     fails to return the Done message within some timeout period, the
     server may conclude that a protocol violation has occurred and
     close the RPC connection, or it may proceed with a de-register and
     free its chunk buffers.  This may result in a fatal RDMA error if
     the client later attempts to perform an RDMA Read operation, which
     amounts to the same thing.

     The use of read chunks in RPC reply messages is much less efficient
     than providing write chunks in the originating RPC calls, due to
     the additional message exchanges, the need for the RPC server to
     advertise buffers to the peer, the necessity of the server
     maintaining a timer for the purpose of recovery from misbehaving
     clients, and the need for additional memory registration.  Their
     use is not recommended for upper layers where efficiency is a
     primary concern [NFSDDP].  However, they may be employed by upper
     layer protocol bindings which are primarily concerned with
     transparency, since they can frequently be implemented completely
     within the RPC lower layers.

     It is important to note that the Done message consumes a credit at
     the RPC server.  The RPC server should provide sufficient credits
     to the client to allow the Done message to be sent without deadlock
     (driving the outstanding credit count to zero).  The RPC client
     must account for its required Done messages to the server in its
     accounting of available credits, and the server should replenish
     any credit consumed by its use of such exchanges at its earliest
     opportunity.

     Finally, it is possible to conceive of RPC exchanges that involve
     any or all combinations of write chunks in the RPC call, read
     chunks in the RPC call, and read chunks in the RPC reply.  Support
     for such exchanges is straightforward from a protocol perspective,
     but in practice such exchanges would be quite rare, limited to
     upper layer protocol exchanges which transferred bulk data in both
     the call and corresponding reply.

4.  RPC RDMA Message Layout

     RPC call and reply messages are conveyed across an RDMA transport
     with a prepended RPC over RDMA header.  The RPC over RDMA header
     includes data for RDMA flow control credits, padding parameters and
     lists of addresses that provide direct data placement via RDMA Read
     and Write operations.  The layout of the RPC message itself is
     unchanged from that described in [RFC1831] except for the possible
     exclusion of large data chunks that will be moved by RDMA Read or
     Write operations.  If the RPC message (along with the RPC over RDMA
     header) is too long for the posted receive buffer (even after any
     large chunks are removed), then the entire RPC message can be moved
     separately as a chunk, leaving just the RPC over RDMA header in the
     RDMA Send.

4.1.  RPC over RDMA Header

     The RPC over RDMA header begins with four 32-bit fields that are
     always present and which control the RDMA interaction including
     RDMA-specific flow control.  These are then followed by a number of
     items such as chunk lists and padding which may or may not be
     present depending on the type of transmission.  The four fields
     which are always present are:

     1. Transaction ID (XID).
          The XID generated for the RPC call and reply.  Having the XID
          at the beginning of the message makes it easy to establish the
          message context.  This XID mirrors the XID in the RPC header,
          and takes precedence.  The receiver may ignore the XID in the
          RPC header, if it so chooses.

     2. Version number.
          This version of the RPC RDMA message protocol is 1.  The
          version number must be increased by one whenever the format of
          the RPC RDMA messages is changed.

     3. Flow control credit value.
          When sent in an RPC call message, the requested value is
          provided.  When sent in an RPC reply message, the granted
          value is returned.  RPC calls must not be sent in excess of
          the currently granted limit.

     4. Message type.

          o    RDMA_MSG = 0 indicates that chunk lists and RPC message
               follow.

          o    RDMA_NOMSG = 1 indicates that after the chunk lists there
               is no RPC message.  In this case, the chunk lists provide
               information to allow the message proper to be transferred
               using RDMA Read or Write, and thus it is not appended to
               the RPC over RDMA header.

          o    RDMA_MSGP = 2 indicates that a chunk list and RPC message
               with some padding follow.

          o    RDMA_DONE = 3 indicates that the message signals the
               completion of a chunk transfer via RDMA Read.

          o    RDMA_ERROR = 4 is used to signal any detected error(s) in
               the RPC RDMA chunk encoding.

     Because the version number is encoded as part of this header, and
     the RDMA_ERROR message type is used to indicate errors, these first
     four fields and the start of the following message body must always
     remain aligned at these fixed offsets for all versions of the RPC
     over RDMA header.

     For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write
     chunk lists follow.  If the Read chunk list is null (a 32-bit word
     of zeros), then there are no chunks to be transferred separately
     and the RPC message follows in its entirety.  If non-null, then
     it is the beginning of an XDR encoded sequence of Read chunk list
     entries.  If the Write chunk list is non-null, then an XDR encoded
     sequence of Write chunk entries follows.

     If the message type is RDMA_MSGP, then two additional fields that
     specify the padding alignment and threshold are inserted prior to
     the Read and Write chunk lists.

     A header of message type RDMA_MSG or RDMA_MSGP will be followed by
     the RPC call or RPC reply message body, beginning with the XID.
     The XID in the RDMA_MSG or RDMA_MSGP header must match this.

     +--------+---------+---------+-----------+-------------+----------
     |        |         |         | Message   |   NULLs     | RPC Call
     |  XID   | Version | Credits |  Type     |    or       |    or
     |        |         |         |           | Chunk Lists | Reply Msg
     +--------+---------+---------+-----------+-------------+----------


     Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or
     RPC message follows.  As an implementation hint: a gather operation
     on the Send of the RDMA RPC message can be used to marshal the
     initial header, the chunk list, and the RPC message itself.
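
     As an illustration of this layout, the following Python sketch
     packs and unpacks the four fixed fields as big-endian 32-bit words,
     and builds an RDMA_MSG header with no chunks.  The helper names are
     hypothetical.  The three zero words follow the XDR description in
     Section 4.3, where the read list, the write list and the reply
     chunk are each optional and a null is encoded as a zero word.

```python
import struct

RDMA_MSG, RDMA_NOMSG, RDMA_MSGP, RDMA_DONE, RDMA_ERROR = range(5)
RPCRDMA_VERSION = 1


def pack_fixed_header(xid: int, vers: int, credits: int, msg_type: int) -> bytes:
    """Pack XID, version, credit value and message type (16 bytes)."""
    return struct.pack(">IIII", xid, vers, credits, msg_type)


def unpack_fixed_header(buf: bytes) -> tuple:
    """Return (xid, vers, credits, msg_type) from the start of buf."""
    return struct.unpack_from(">IIII", buf, 0)


def pack_msg_no_chunks(xid: int, credits: int, rpc_message: bytes) -> bytes:
    """Build an RDMA_MSG: fixed header, three null lists, RPC message."""
    nulls = struct.pack(">III", 0, 0, 0)  # reads, writes, reply chunk
    header = pack_fixed_header(xid, RPCRDMA_VERSION, credits, RDMA_MSG)
    return header + nulls + rpc_message
```

     A real sender would typically hand the header, chunk lists and RPC
     message to the transport as separate gather elements rather than
     concatenating them, as the implementation hint above suggests.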

4.2.  RPC over RDMA header errors

     When a peer receives an RPC RDMA message, it must perform certain
     basic validity checks on the header and chunk contents.  If errors
     are detected in an RPC request, an RDMA_ERROR reply should be
     generated.

     Two types of errors are defined: version mismatch and invalid chunk
     format.  When the peer detects an RPC over RDMA header version
     which it does not support (currently this draft defines only
     version 1), it replies with an error code of ERR_VERS, and provides
     the low and high inclusive version numbers it does, in fact,
     support.  The version number in this reply can be any value
     otherwise valid at the receiver.  When other decoding errors are
     detected in the header or chunks, either an RPC decode error or
     the error code ERR_CHUNK may be returned.
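
     The version check might be sketched in Python as follows.  This is
     an illustration only: the function name is hypothetical, the
     receiver is assumed to support only version 1, and the credit
     value placed in the error reply is left to the caller because this
     section does not specify it.

```python
import struct

RDMA_ERROR = 4
ERR_VERS = 1
ERR_CHUNK = 2
SUPPORTED_LOW = 1   # assumption: this receiver supports version 1 only
SUPPORTED_HIGH = 1


def check_version(header: bytes, credits: int):
    """Return None if the version is supported, else an ERR_VERS reply."""
    xid, vers, _credit, _proc = struct.unpack_from(">IIII", header, 0)
    if SUPPORTED_LOW <= vers <= SUPPORTED_HIGH:
        return None
    # Reply: fixed header with RDMA_ERROR, then the error code and the
    # inclusive low and high versions this receiver supports.
    return struct.pack(">IIIIIII", xid, SUPPORTED_LOW, credits,
                       RDMA_ERROR, ERR_VERS, SUPPORTED_LOW, SUPPORTED_HIGH)
```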

4.3.  XDR Language Description

     Here is the message layout in XDR language.

        struct xdr_rdma_segment {
           uint32 handle;          /* Registered memory handle */
           uint32 length;          /* Length of the chunk in bytes */
           uint64 offset;          /* Chunk virtual address or offset */
        };

        struct xdr_read_chunk {
           uint32 position;        /* Position in XDR stream */
           struct xdr_rdma_segment target;
        };

        struct xdr_read_list {
           struct xdr_read_chunk entry;
           struct xdr_read_list  *next;
        };

        struct xdr_write_chunk {
           struct xdr_rdma_segment target<>;
        };

        struct xdr_write_list {
           struct xdr_write_chunk entry;
           struct xdr_write_list  *next;
        };

        struct rdma_msg {
           uint32    rdma_xid;     /* Mirrors the RPC header xid */
           uint32    rdma_vers;    /* Version of this protocol */
           uint32    rdma_credit;  /* Buffers requested/granted */
           rdma_body rdma_body;
        };

        enum rdma_proc {
           RDMA_MSG=0,   /* An RPC call or reply msg */
           RDMA_NOMSG=1, /* An RPC call or reply msg - separate body */
           RDMA_MSGP=2,  /* An RPC call or reply msg with padding */
           RDMA_DONE=3,  /* Client signals reply completion */
           RDMA_ERROR=4  /* An RPC RDMA encoding error */
        };

        union rdma_body switch (rdma_proc proc) {
           case RDMA_MSG:
             rpc_rdma_header rdma_msg;
           case RDMA_NOMSG:
             rpc_rdma_header_nomsg rdma_nomsg;
           case RDMA_MSGP:
             rpc_rdma_header_padded rdma_msgp;
           case RDMA_DONE:
             void;
           case RDMA_ERROR:
             rpc_rdma_error rdma_error;
        };

        struct rpc_rdma_header {
           struct xdr_read_list   *rdma_reads;
           struct xdr_write_list  *rdma_writes;
           struct xdr_write_chunk *rdma_reply;
           /* rpc body follows */
        };

        struct rpc_rdma_header_nomsg {
           struct xdr_read_list   *rdma_reads;
           struct xdr_write_list  *rdma_writes;
           struct xdr_write_chunk *rdma_reply;
        };

        struct rpc_rdma_header_padded {
           uint32                 rdma_align;   /* Padding alignment */
           uint32                 rdma_thresh;  /* Padding threshold */
           struct xdr_read_list   *rdma_reads;
           struct xdr_write_list  *rdma_writes;
           struct xdr_write_chunk *rdma_reply;
           /* rpc body follows */
        };

        enum rpc_rdma_errcode {
           ERR_VERS = 1,
           ERR_CHUNK = 2
        };

        union rpc_rdma_error switch (rpc_rdma_errcode err) {
           case ERR_VERS:
             uint32               rdma_vers_low;
             uint32               rdma_vers_high;
           case ERR_CHUNK:
             void;
           default:
             uint32               rdma_extra[8];
        };
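
     To show how the optional-data lists above appear on the wire, the
     following Python sketch encodes and decodes a read chunk list.
     Each entry is preceded by a non-zero "present" word and the list is
     terminated by a zero word, as standard XDR optional-data encoding
     requires.  The function names and the tuple layout (position,
     handle, length, offset) are assumptions made for the example.

```python
import struct


def encode_read_list(chunks) -> bytes:
    """Encode (position, handle, length, offset) tuples as an XDR list."""
    out = bytearray()
    for position, handle, length, offset in chunks:
        # 1 = entry present, then xdr_read_chunk; offset is a uint64.
        out += struct.pack(">IIIIQ", 1, position, handle, length, offset)
    out += struct.pack(">I", 0)  # null pointer terminates the list
    return bytes(out)


def decode_read_list(buf: bytes, pos: int = 0):
    """Decode a read list starting at pos; return (chunks, next_pos)."""
    chunks = []
    while True:
        (present,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        if not present:
            return chunks, pos
        chunks.append(struct.unpack_from(">IIIQ", buf, pos))
        pos += 20  # three uint32 fields plus one uint64
```

     An empty list is a single zero word, which matches the description
     of a null Read chunk list in Section 4.1.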

5.  Long Messages

     The receiver of RDMA Send messages is required to have previously
     posted one or more adequately sized buffers.  The RPC client can
     inform the server of the maximum size of its RDMA Send messages via
     the Connection Configuration Protocol described later in this
     document.

     Since RPC messages are frequently small, memory savings can be
     achieved by posting small buffers.  Even large messages like NFS
     READ or WRITE will be quite small once the chunks are removed from
     the message.  However, there may be large messages that would
     demand a very large buffer be posted, where the contents of the
     buffer may not be a chunkable XDR element.  A good example is an
     NFS READDIR reply which may contain a large number of small
     filename strings.  Also, the NFS version 4 protocol [RFC3530]
     features COMPOUND request and reply messages of unbounded length.

     Ideally, each upper layer will negotiate these limits.  However, it
     is frequently necessary to provide a transparent solution.

5.1.  Message as an RDMA Read Chunk

     One relatively simple method is to have the client identify any RPC
     message that exceeds the RPC server's posted buffer size and move
     it separately as a chunk, i.e. reference it as the first entry in
     the read chunk list with an XDR position of zero.

     Normal Message


     +--------+---------+---------+------------+-------------+----------
     |        |         |         |            |             | RPC Call
     |  XID   | Version | Credits |  RDMA_MSG  | Chunk Lists |    or
     |        |         |         |            |             | Reply Msg
     +--------+---------+---------+------------+-------------+----------


     Long Message


     +--------+---------+---------+------------+-------------+
     |        |         |         |            |             |
     |  XID   | Version | Credits | RDMA_NOMSG | Chunk Lists |
     |        |         |         |            |             |
     +--------+---------+---------+------------+-------------+
                                                  |
                                                  |  +----------
                                                  |  | Long RPC Call
                                                  +->|    or
                                                     | Reply Message
                                                     +----------


     If the receiver gets an RPC over RDMA header with a message type of
     RDMA_NOMSG and finds an initial read chunk list entry with a zero
     XDR position, it allocates a registered buffer and issues an RDMA
     Read of the long RPC message into it.  The receiver then proceeds
     to XDR decode the RPC message as if it had received it inline with
     the Send data.  Further decoding may issue additional RDMA Reads to
     bring over additional chunks.
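
     The sender's side of this decision can be sketched in Python as
     below.  The function and parameter names are hypothetical, and the
     handle and address are assumed to describe a buffer the sender has
     already registered to hold the encoded RPC message.

```python
from typing import List, Tuple

# (position, handle, length, offset)
ReadChunk = Tuple[int, int, int, int]


def plan_call(rpc_message_len: int, header_len: int, server_recv_size: int,
              handle: int, address: int) -> Tuple[str, List[ReadChunk]]:
    """Choose between an inline message and a long message."""
    if header_len + rpc_message_len <= server_recv_size:
        # Fits in the posted receive buffer: send it inline.
        return "RDMA_MSG", []
    # Too long: advertise the whole message as the first read chunk,
    # using an XDR position of zero.
    return "RDMA_NOMSG", [(0, handle, rpc_message_len, address)]
```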

     Although the handling of long messages requires one extra network
     turnaround, in practice these messages should be rare if the posted
     receive buffers are correctly sized, and of course they will be
     non-existent for RDMA-aware upper layers.

     An RPC with long reply returned via RDMA Read looks like this:


       RPC Client                           RPC Server
           |             RPC Call                |
      Send |   ------------------------------>   |
           |                                     |
           |         RPC over RDMA Header        |
           |   <------------------------------   | Send
           |                                     |
           |          Long RPC Reply Msg         |
      Read |   ------------------------------+   |
           |   <-----------------------------v   |
           |                                     |
           |                Done                 |
      Send |   ------------------------------>   |


5.2.  RDMA Write of Long Replies (Reply Chunks)

     A superior method of handling long RPC replies is to have the RPC
     client post a large buffer into which the server can write a large
     RPC reply.  This has the advantage that an RDMA Write may be
     slightly faster in network latency than an RDMA Read.
     Additionally, it removes the need for the RDMA_DONE message that
     is required when the large reply is returned as a Read chunk.

     This protocol supports direct return of a large reply via the
     inclusion of an optional rdma_reply write chunk after the read
     chunk list and the write chunk list.  The client allocates a buffer
     sized to receive a large reply and enters its steering tag, address
     and length in the rdma_reply write chunk.  If the reply message is
     too long to return inline with an RDMA Send (exceeds the size of
     the client's posted receive buffer), even with read chunks removed,
     then the RPC server performs an RDMA Write of the RPC reply message
     into the buffer indicated by the rdma_reply chunk.  If the client
     doesn't provide an rdma_reply chunk, or if it's too small, then the
     message must be returned as a Read chunk.

     An RPC with long reply returned via RDMA Write looks like this:


       RPC Client                           RPC Server
           |      RPC Call with rdma_reply       |
      Send |   ------------------------------>   |
           |                                     |
           |          Long RPC Reply Msg         |
           |   <------------------------------   | Write
           |                                     |
           |         RPC over RDMA Header        |
           |   <------------------------------   | Send


     The use of RDMA Write to return long replies requires that the
     client application anticipate a long reply and have some knowledge
     of its size so that an adequately sized buffer can be allocated.
     This is certainly true of NFS READDIR replies, where the client
     already provides an upper bound on the size of the encoded
     directory fragment to be returned by the server.

     The use of these "reply chunks" is highly efficient and convenient
     for both RPC client and server.  Their use is encouraged for
     eligible RPC operations such as NFS READDIR, which would otherwise
     require extensive chunk management within the results or use of
     RDMA Read and a Done message. [NFSDDP]
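
     The server's choice among the reply paths described in Sections
     5.1 and 5.2 might be sketched in Python as follows.  The function
     name and return strings are illustrative.  The sketch assumes that
     reply_len is the size of the reply after any chunks are removed,
     and for simplicity it ignores the size of the RPC over RDMA header.

```python
from typing import Optional


def choose_reply_path(reply_len: int, client_recv_size: int,
                      reply_chunk_len: Optional[int]) -> str:
    """Pick how a server returns a reply of reply_len bytes."""
    if reply_len <= client_recv_size:
        return "inline"        # fits in the client's posted buffer
    if reply_chunk_len is not None and reply_len <= reply_chunk_len:
        return "rdma_write"    # write into the client's rdma_reply chunk
    # No usable reply chunk: expose the reply as a Read chunk, which
    # also obliges the client to send a Done message.
    return "read_chunk"
```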

6.  Connection Configuration Protocol

     RDMA Send operations require the receiver to post one or more
     buffers at the RDMA connection endpoint, each large enough to
     receive the largest Send message.  Buffers are consumed as Send
     messages are received.  If a buffer is too small, or if there are
     no buffers posted, the RDMA transport may return an error and break
     the RDMA connection.  The receiver must post sufficient, adequately
     sized buffers to avoid buffer overrun or capacity errors.

     The protocol described above includes only a mechanism for managing
     the number of such receive buffers, and no explicit features to
     allow the RPC client and server to provision or control buffer
     sizing, nor any other session parameters.

     In the past, this type of connection management has not been
     necessary for RPC.  RPC over UDP or TCP does not have a protocol to
     negotiate the link.  The server can get a rough idea of the maximum
     size of messages from the server protocol code.  However, a
     protocol to negotiate transport features on a more dynamic basis is
     desirable.

     The Connection Configuration Protocol allows the client to pass its
     connection requirements to the server, and allows the server to
     inform the client of its connection limits.

6.1.  Initial Connection State

     This protocol will be used for connection setup prior to the use of
     another RPC protocol that uses the RDMA transport.  It operates in-
     band, i.e. it uses the connection itself to negotiate the
     connection parameters.  To provide a basis for connection
     negotiation, the connection is assumed to provide a basic level of
     interoperability: the ability to exchange at least one RPC message
     at a time that is at least 1 KB in size.  The server may exceed
     this basic level of configuration, but the client must not assume
     it.

6.2.  Protocol Description

     Version 1 of the Connection Configuration protocol consists of a
     single procedure that allows the client to inform the server of its
     connection requirements and the server to return connection
     information to the client.

     The maxcall_sendsize argument is the maximum size of an RPC call
     message that the client will send inline in an RDMA Send message to
     the server.  The server may return a maxcall_sendsize value that is
     smaller or larger than the client's request.  The client must not
     send an inline call message larger than what the server will
     accept.  The maxcall_sendsize limits only the size of inline RPC
     calls.  It does not limit the size of long RPC messages transferred
     as an initial chunk in the Read chunk list.

     The maxreply_sendsize is the maximum size of an inline RPC message
     that the client will accept from the server.

     The maxrdmaread is the maximum number of RDMA Reads which may be
     active at the peer.  This number correlates to the incoming RDMA
     Read count ("IRD") configured into each originating endpoint by
     the client or server.  If more than this number of RDMA Read
     operations by the connected peer are issued simultaneously,
     connection loss or suboptimal flow control may result; therefore,
     the value should be observed at all times.  The peers' values need
     not be equal.  If zero, the peer must not issue requests which
     require RDMA Read to satisfy, as no transfer will be possible.

     The align value is the value recommended by the server for opaque
     data values such as strings and counted byte arrays.  The client
     can use this value to compute the number of prepended pad bytes
     when XDR encoding opaque values in the RPC call message.

        typedef unsigned int uint32;

        struct config_rdma_req {
             uint32  maxcall_sendsize;
                         /* max size of inline RPC call */
             uint32  maxreply_sendsize;
                         /* max size of inline RPC reply */
             uint32  maxrdmaread;
                         /* max active RDMA Reads at client */
        };

        struct config_rdma_reply {
             uint32  maxcall_sendsize;
                         /* max call size accepted by server */
             uint32  align;
                         /* server's receive buffer alignment */
             uint32  maxrdmaread;
                         /* max active RDMA Reads at server */
        };

        program CONFIG_RDMA_PROG {
           version VERS1 {
              /*
               * Config call/reply
               */
              config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
           } = 1;
        } = nnnnnn;  /* Need program number assigned */
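
     As a sketch of how a client might use this procedure, the Python
     below packs the config_rdma_req arguments, unpacks the
     config_rdma_reply results, and applies the returned
     maxcall_sendsize as the limit on inline calls.  Only the argument
     and result bodies are shown, not the surrounding RPC call and
     reply headers, and all function names are hypothetical.

```python
import struct


def pack_config_req(maxcall_sendsize: int, maxreply_sendsize: int,
                    maxrdmaread: int) -> bytes:
    """XDR-encode config_rdma_req as three big-endian uint32 values."""
    return struct.pack(">III", maxcall_sendsize, maxreply_sendsize,
                       maxrdmaread)


def unpack_config_reply(buf: bytes) -> dict:
    """XDR-decode config_rdma_reply into a dict keyed by field name."""
    maxcall_sendsize, align, maxrdmaread = struct.unpack_from(">III", buf, 0)
    return {"maxcall_sendsize": maxcall_sendsize, "align": align,
            "maxrdmaread": maxrdmaread}


def can_send_inline(call_len: int, reply: dict) -> bool:
    """A call may go inline only if the server will accept its size."""
    return call_len <= reply["maxcall_sendsize"]
```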

7.  Memory Registration Overhead

     RDMA requires that all data be transferred between registered
     memory regions at the source and destination.  All protocol headers
     as well as separately transferred data chunks must use registered
     memory.  Since the cost of registering and de-registering memory
     can be a large proportion of the RDMA transaction cost, it is
     important to minimize registration activity.  This is easily
     achieved within RPC controlled memory by allocating chunk list data
     and RPC headers in a reusable way from pre-registered pools.
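
     A pre-registered pool of the kind mentioned above could look like
     the following Python sketch.  It is an illustration rather than a
     prescribed design: the class name is invented, and the register
     argument is a hypothetical hook standing in for the local RDMA
     provider's memory registration call.

```python
from typing import Callable, List, Optional


class RegisteredPool:
    """Fixed set of buffers registered once and reused across RPCs."""

    def __init__(self, count: int, size: int,
                 register: Callable[[bytearray], int]) -> None:
        # `register` is a hypothetical hook returning a memory handle.
        self._free: List[tuple] = []
        for _ in range(count):
            buf = bytearray(size)
            self._free.append((register(buf), buf))

    def acquire(self) -> Optional[tuple]:
        """Return a (handle, buffer) pair, or None if the pool is empty."""
        return self._free.pop() if self._free else None

    def release(self, item: tuple) -> None:
        """Hand a buffer back for reuse; no de-registration happens."""
        self._free.append(item)
```

     Registration cost is paid once per buffer at pool creation, so
     acquiring and releasing header or chunk list buffers during an RPC
     involves no registration activity.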

     The data chunks transferred via RDMA may occupy memory that
     persists outside the bounds of the RPC transaction.  Hence, the
     default behavior of an RPC over RDMA transport is to register and
     de-register these chunks on every transaction.  However, this is
     not a limitation of the protocol - only of the existing local RPC
     API.  The API is easily extended through such functions as
     rpc_control(3) to change the default behavior so that the
     application can assume responsibility for controlling memory
     registration through an RPC-provided registered memory allocator.

8.  Errors and Error Recovery

     RPC RDMA protocol errors are described in section 4.  RPC errors
     and RPC error recovery are not affected by the protocol, and
     proceed as for any RPC error condition.  RDMA Transport error
     reporting and recovery are outside the scope of this protocol.

     It is assumed that the link itself will provide some degree of
     error detection and retransmission.  iWARP's MPA layer (when used
     over TCP), SCTP, as well as the Infiniband link layer all provide
     CRC protection of the RDMA payload, and CRC-class protection is a
     general attribute of such transports.  Additionally, the RPC layer
     itself can accept errors from the link level and recover via
     retransmission.  RPC recovery can handle complete loss and re-
     establishment of the link.

     See section 11 for further discussion of the use of RPC-level
     integrity schemes to detect errors, and related efficiency issues.

9.  Node Addressing

     In setting up a new RDMA connection, the first action by an RPC
     client will be to obtain a transport address for the server.  The
     mechanism used to obtain this address, and to open an RDMA
     connection is dependent on the type of RDMA transport, and is the
     responsibility of each RPC protocol binding and its local
     implementation.

10.  RPC Binding

     RPC services normally register with a portmap or rpcbind service,
     which associates an RPC program number with a service address.  In
     the case of UDP or TCP, the service address for NFS is normally
     port 2049.  This policy should be no different with RDMA
     interconnects.

     One possibility is to have the server's portmapper register itself
     on the RDMA interconnect at a "well known" service address.  On UDP
     or TCP, this corresponds to port 111.  A client could connect to
     this service address and use the portmap protocol to obtain a
     service address in response to a program number, e.g., an iWARP
     port number or an InfiniBand GID.
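
     The lookup described above can be modeled with a small in-memory
     table.  This is a sketch under stated assumptions, not the portmap
     or rpcbind protocol itself: the Binder class is a toy stand-in,
     the netid string "rdma" is assumed because this document does not
     assign one, and service addresses are treated as opaque strings.

```python
NFS_PROGRAM = 100003    # ONC RPC program number of NFS


class Binder:
    """Toy in-memory stand-in for a portmap/rpcbind service."""

    def __init__(self):
        self._table = {}

    def register(self, program, version, netid, address):
        """Associate (program, version, netid) with a service address."""
        self._table[(program, version, netid)] = address

    def lookup(self, program, version, netid):
        """Return the service address, or None if nothing is registered."""
        return self._table.get((program, version, netid))


def build_server_bindings(rdma_netid="rdma"):
    """Register NFS at port 2049 on TCP and on the assumed RDMA netid."""
    binder = Binder()
    binder.register(NFS_PROGRAM, 3, "tcp", "192.0.2.1:2049")
    binder.register(NFS_PROGRAM, 3, rdma_netid, "192.0.2.1:2049")
    return binder
```

     A client that queries this table for NFS over the assumed RDMA
     netid receives the same port 2049 address that it would receive
     for TCP, matching the policy stated above.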





11.  Security

     ONC RPC provides its own security via the RPCSEC_GSS framework
     [RFC2203].  RPCSEC_GSS can provide message authentication, integrity
     checking, and privacy.  This security mechanism will be unaffected
     by the RDMA transport.  The data integrity and privacy features
     alter the body of the message, presenting it as a single chunk.
     For large messages the chunk may be large enough to qualify for
     RDMA Read transfer.  However, there is much data movement
     associated with computation and verification of integrity, or
     encryption/decryption, so certain performance advantages may be
     lost.

     For efficiency, a more appropriate security mechanism for RDMA
     links may be link-level protection, such as IPsec, which may be
     co-located in the RDMA hardware.  The use of link-level protection
     may be negotiated through the use of a new RPCSEC_GSS mechanism
     like the Credential Cache GSS Mechanism [CCM].  Use of such
     mechanisms is recommended where end-to-end integrity and/or privacy
     is desired, and where efficiency is required.

     The RDMA transport raises no new issues with exposed addresses.
     The only exposed addresses are those in the chunk list and in the
     transport packets transferred via RDMA.  The data located at these
     addresses continues to be protected by RPCSEC_GSS integrity and
     privacy.

12.  IANA Considerations

     As a new RPC transport, this protocol should have no effect on RPC
     program numbers or registered port numbers.  The new RPC transport
     should be assigned a new RPC "netid".  If adopted, the Connection
     Configuration protocol described herein will require an RPC program
     number assignment.

13.  Acknowledgements

     The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
     Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
     Kleiman, Mike Eisler, Mark Wittle and Shantanu Mehendale for their
     contributions to this document.

14.  Normative References


[RFC1831]
     R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification
     Version 2", Standards Track RFC,
     http://www.ietf.org/rfc/rfc1831.txt

[RFC1832]
     R. Srinivasan, "XDR: External Data Representation Standard",
     Standards Track RFC, http://www.ietf.org/rfc/rfc1832.txt

[RFC1813]
     B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
     Specification", Informational RFC,
     http://www.ietf.org/rfc/rfc1813.txt

[RFC3530]
     S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M.
     Eisler, D. Noveck, "NFS version 4 Protocol", Standards Track RFC,
     http://www.ietf.org/rfc/rfc3530.txt

[RFC2203]
     M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol Specification",
     Standards Track RFC, http://www.ietf.org/rfc/rfc2203.txt

15.  Informative References


[RDMA]
     R. Recio et al., "An RDMA Protocol Specification", Internet Draft
     Work in Progress, draft-ietf-rddp-rdmap

[CCM]
     M. Eisler, N. Williams, "CCM: The Credential Cache GSS Mechanism",
     Internet Draft Work in Progress, draft-ietf-nfsv4-ccm

[NFSDDP]
     B. Callaghan, T. Talpey, "NFS Direct Data Placement", Internet
     Draft Work in Progress, draft-ietf-nfsv4-nfsdirect

[RDDP]
     Remote Direct Data Placement Working Group Charter,
     http://www.ietf.org/html.charters/rddp-charter.html

[NFSRDMAPS]
     T. Talpey, C. Juszczak, "NFS RDMA Problem Statement", Internet
     Draft Work in Progress, draft-ietf-nfsv4-nfs-rdma-problem-statement

[NFSSESS]
     T. Talpey, S. Shepler, J. Bauman, "NFSv4 Session Extensions",
     Internet Draft Work in Progress, draft-ietf-nfsv4-nfs-sess

[IB]
     InfiniBand Architecture Specification, http://www.infinibandta.org

16.  Authors' Addresses


     Brent Callaghan
     1614 Montalto Dr.
     Mountain View, California 94040 USA

     Phone: +1 650 968 2333
     EMail: brent.callaghan@gmail.com


     Tom Talpey
     Network Appliance, Inc.
     375 Totten Pond Road
     Waltham, MA 02451 USA

     Phone: +1 781 768 5329
     EMail: thomas.talpey@netapp.com


17.  Intellectual Property and Copyright Statements


Intellectual Property Statement

     The IETF takes no position regarding the validity or scope of any
     Intellectual Property Rights or other rights that might be claimed
     to pertain to the implementation or use of the technology described
     in this document or the extent to which any license under such
     rights might or might not be available; nor does it represent that
     it has made any independent effort to identify any such rights.
     Information on the procedures with respect to rights in RFC
     documents can be found in BCP 78 and BCP 79.

     Copies of IPR disclosures made to the IETF Secretariat and any
     assurances of licenses to be made available, or the result of an
     attempt made to obtain a general license or permission for the use
     of such proprietary rights by implementers or users of this
     specification can be obtained from the IETF on-line IPR repository
     at http://www.ietf.org/ipr.

     The IETF invites any interested party to bring to its attention any
     copyrights, patents or patent applications, or other proprietary
     rights that may cover technology that may be required to implement
     this standard.  Please address the information to the IETF at
     ietf-ipr@ietf.org.

Disclaimer of Validity

     This document and the information contained herein are provided on
     an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
     REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
     THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
     EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
     THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
     ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
     PARTICULAR PURPOSE.

Copyright Statement

     Copyright (C) The Internet Society (2005).  This document is
     subject to the rights, licenses and restrictions contained in BCP
     78, and except as set forth therein, the authors retain all their
     rights.


Acknowledgement

     Funding for the RFC Editor function is currently provided by the
     Internet Society.