Network File System Version 4                              C. Lever, Ed.
Internet-Draft                                                    Oracle
Intended status: Standards Track                               D. Noveck
Expires: May 20, 2020                                             NetApp
                                                       November 17, 2019


                    RPC-over-RDMA Version 2 Protocol
                draft-ietf-nfsv4-rpcrdma-version-two-00

Abstract

   This document specifies the second version of a protocol that conveys
   Remote Procedure Call (RPC) messages on transports capable of Remote
   Direct Memory Access (RDMA).  This version of the protocol is
   extensible.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 20, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.



Lever & Noveck            Expires May 20, 2020                  [Page 1]


Internet-Draft          RDMA Transport for RPC V2          November 2019


   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   4
   2.  Requirements Language . . . . . . . . . . . . . . . . . . . .   5
   3.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   6
     3.1.  Remote Procedure Calls  . . . . . . . . . . . . . . . . .   6
       3.1.1.  Upper-Layer Protocols . . . . . . . . . . . . . . . .   6
       3.1.2.  Requesters and Responders . . . . . . . . . . . . . .   6
       3.1.3.  RPC Transports  . . . . . . . . . . . . . . . . . . .   7
       3.1.4.  External Data Representation  . . . . . . . . . . . .   8
     3.2.  Remote Direct Memory Access . . . . . . . . . . . . . . .   9
       3.2.1.  Direct Data Placement . . . . . . . . . . . . . . . .   9
       3.2.2.  RDMA Transport Requirements . . . . . . . . . . . . .   9
   4.  RPC-over-RDMA Protocol Framework  . . . . . . . . . . . . . .  11
     4.1.  Transfer Model  . . . . . . . . . . . . . . . . . . . . .  11
     4.2.  Message Framing . . . . . . . . . . . . . . . . . . . . .  11
     4.3.  Managing Receiver Resources . . . . . . . . . . . . . . .  12
       4.3.1.  RPC-over-RDMA Version 2 Flow Control  . . . . . . . .  12
       4.3.2.  Inline Threshold  . . . . . . . . . . . . . . . . . .  14
       4.3.3.  Initial Connection State  . . . . . . . . . . . . . .  14
     4.4.  XDR Encoding with Chunks  . . . . . . . . . . . . . . . .  15
       4.4.1.  Reducing an XDR Stream  . . . . . . . . . . . . . . .  15
       4.4.2.  DDP-Eligibility . . . . . . . . . . . . . . . . . . .  16
       4.4.3.  RDMA Segments . . . . . . . . . . . . . . . . . . . .  16
       4.4.4.  Chunks  . . . . . . . . . . . . . . . . . . . . . . .  17
       4.4.5.  Read Chunks . . . . . . . . . . . . . . . . . . . . .  18
       4.4.6.  Write Chunks  . . . . . . . . . . . . . . . . . . . .  19
     4.5.  Message Transfer Methods  . . . . . . . . . . . . . . . .  20
       4.5.1.  Short Messages  . . . . . . . . . . . . . . . . . . .  21
       4.5.2.  Continued Messages  . . . . . . . . . . . . . . . . .  21
       4.5.3.  Chunked Messages  . . . . . . . . . . . . . . . . . .  22
       4.5.4.  Long Messages . . . . . . . . . . . . . . . . . . . .  23
   5.  Transport Properties  . . . . . . . . . . . . . . . . . . . .  24
     5.1.  Transport Properties Model  . . . . . . . . . . . . . . .  25
     5.2.  Current Transport Properties  . . . . . . . . . . . . . .  26
       5.2.1.  Maximum Send Size . . . . . . . . . . . . . . . . . .  27
       5.2.2.  Receive Buffer Size . . . . . . . . . . . . . . . . .  28
       5.2.3.  Maximum RDMA Segment Size . . . . . . . . . . . . . .  28
       5.2.4.  Maximum RDMA Segment Count  . . . . . . . . . . . . .  28
       5.2.5.  Reverse Request Support . . . . . . . . . . . . . . .  29
       5.2.6.  Host Authentication Message . . . . . . . . . . . . .  30
   6.  RPC-over-RDMA Version 2 Transport Messages  . . . . . . . . .  30
     6.1.  Overall Transport Message Structure . . . . . . . . . . .  30
     6.2.  Transport Header Types  . . . . . . . . . . . . . . . . .  30
     6.3.  RPC-over-RDMA Version 2 Headers and Chunks  . . . . . . .  31
       6.3.1.  Common Transport Header Prefix  . . . . . . . . . . .  31
       6.3.2.  RPC-over-RDMA Version 2 Transport Header Prefix . . .  32
       6.3.3.  Describing External Data Payloads . . . . . . . . . .  35
     6.4.  Header Types Defined in RPC-over-RDMA version 2 . . . . .  36
       6.4.1.  RDMA2_MSG: Convey RPC Message Inline  . . . . . . . .  36
       6.4.2.  RDMA2_NOMSG: Convey External RPC Message  . . . . . .  37
       6.4.3.  RDMA2_ERROR: Report Transport Error . . . . . . . . .  38
       6.4.4.  RDMA2_CONNPROP: Advertise Transport Properties  . . .  41
     6.5.  Choosing a Reply Mechanism  . . . . . . . . . . . . . . .  42
   7.  XDR Protocol Definition . . . . . . . . . . . . . . . . . . .  42
     7.1.  Code Component License  . . . . . . . . . . . . . . . . .  43
     7.2.  Extraction and Use of XDR Definitions . . . . . . . . . .  45
     7.3.  XDR Definition for RPC-over-RDMA Version 2 Core
           Structures  . . . . . . . . . . . . . . . . . . . . . . .  47
     7.4.  XDR Definition for RPC-over-RDMA Version 2 Base Header
           Types . . . . . . . . . . . . . . . . . . . . . . . . . .  49
     7.5.  Use of the XDR Description Files  . . . . . . . . . . . .  50
   8.  RPC Bind Parameters . . . . . . . . . . . . . . . . . . . . .  52
   9.  Implementation Status . . . . . . . . . . . . . . . . . . . .  53
   10. Security Considerations . . . . . . . . . . . . . . . . . . .  54
     10.1.  Memory Protection  . . . . . . . . . . . . . . . . . . .  54
       10.1.1.  Protection Domains . . . . . . . . . . . . . . . . .  54
       10.1.2.  Handle (STag) Predictability . . . . . . . . . . . .  54
       10.1.3.  Memory Protection  . . . . . . . . . . . . . . . . .  54
       10.1.4.  Denial of Service  . . . . . . . . . . . . . . . . .  55
     10.2.  RPC Message Security . . . . . . . . . . . . . . . . . .  55
       10.2.1.  RPC-over-RDMA Protection at Lower Layers . . . . . .  56
       10.2.2.  RPCSEC_GSS on RPC-over-RDMA Transports . . . . . . .  56
     10.3.  Transport Properties . . . . . . . . . . . . . . . . . .  58
     10.4.  Host Authentication  . . . . . . . . . . . . . . . . . .  59
   11. IANA Considerations . . . . . . . . . . . . . . . . . . . . .  59
   12. References  . . . . . . . . . . . . . . . . . . . . . . . . .  59
     12.1.  Normative References . . . . . . . . . . . . . . . . . .  59
     12.2.  Informative References . . . . . . . . . . . . . . . . .  61
   Appendix A.  ULB Specifications . . . . . . . . . . . . . . . . .  63
     A.1.  DDP-Eligibility . . . . . . . . . . . . . . . . . . . . .  63
     A.2.  Maximum Reply Size  . . . . . . . . . . . . . . . . . . .  64
     A.3.  Additional Considerations . . . . . . . . . . . . . . . .  65
     A.4.  ULP Extensions  . . . . . . . . . . . . . . . . . . . . .  65
   Appendix B.  Extending the Version 2 Protocol . . . . . . . . . .  65
     B.1.  Adding New Header Types to RPC-over-RDMA Version 2  . . .  67
     B.2.  Adding New Header Flags to the Protocol . . . . . . . . .  68
     B.3.  Adding New Transport properties to the Protocol . . . . .  69
     B.4.  Adding New Error Codes to the Protocol  . . . . . . . . .  70
   Appendix C.  Differences from the RPC-over-RDMA Version 1
                Protocol . . . . . . . . . . . . . . . . . . . . . .  70
     C.1.  Relationship to the RPC-over-RDMA Version 1 XDR
           Definition  . . . . . . . . . . . . . . . . . . . . . . .  70
     C.2.  Transport Properties  . . . . . . . . . . . . . . . . . .  72
     C.3.  Credit Management Changes . . . . . . . . . . . . . . . .  72
     C.4.  Inline Threshold Changes  . . . . . . . . . . . . . . . .  73
     C.5.  Message Continuation Changes  . . . . . . . . . . . . . .  74
     C.6.  Host Authentication Changes . . . . . . . . . . . . . . .  75
     C.7.  Support for Remote Invalidation . . . . . . . . . . . . .  75
     C.8.  Error Reporting Changes . . . . . . . . . . . . . . . . .  76
   Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . .  76
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  77

1.  Introduction

   Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IBA] is a
   technique for moving data efficiently between network nodes.  By
   directing data into destination buffers as it is sent on a network
   and placing it using direct memory access implemented by hardware,
   the complementary benefits of faster transfers and reduced host
   overhead are obtained.

   Open Network Computing Remote Procedure Call (ONC RPC, often
   shortened in NFSv4 documents to RPC) [RFC5531] is a Remote Procedure
   Call protocol that runs over a variety of transports.  Most RPC
   implementations today use UDP [RFC0768] or TCP [RFC0793].  On UDP,
   RPC messages are encapsulated inside datagrams, while on a TCP byte
   stream, RPC messages are delineated by a record marking protocol.  An
   RDMA transport also conveys RPC messages in a specific fashion that
   must be fully described if RPC implementations are to interoperate
   when using RDMA to transport RPC transactions.

   RDMA transports present semantics that differ from either UDP or TCP.
   They retain message delineations like UDP but provide reliable and
   sequenced data transfer like TCP.  They also provide an offloaded
   bulk transfer service not provided by UDP or TCP.  RDMA transports
   are therefore appropriately treated as a new transport type by RPC.

   Although the RDMA transport described herein can provide relatively
   transparent support for any RPC application, this document also
   describes mechanisms that enable further optimization of data
   transfer, when RPC applications are structured to exploit awareness
   of a transport's RDMA capability.  In this context, the Network File
   System (NFS) protocols, as described in [RFC1094], [RFC1813],
   [RFC7530], [RFC5661], and subsequent NFSv4 minor versions, are all
   potential beneficiaries of RDMA transports.  A complete problem
   statement is presented in [RFC5532].

   The RPC-over-RDMA version 1 protocol specified in [RFC8166] is
   deployed and in use, although there are known shortcomings to this
   protocol:

   o  The protocol's default size of Receive buffers forces the use of
      RDMA Read and Write transfers for small payloads, and limits the
      size of reverse direction messages.

   o  It is difficult to make optimizations or protocol fixes that
      require changes to on-the-wire behavior.

   o  For some RPC procedures, the maximum reply size is difficult or
      impossible for an RPC client to estimate in advance.

   To address these issues in a way that enables interoperation with
   existing RPC-over-RDMA version 1 deployments, a second version of the
   RPC-over-RDMA transport protocol is presented in this document.

   Version 2 of RPC-over-RDMA is extensible, enabling OPTIONAL
   extensions to be added without impacting existing implementations.
   To enable protocol extension, the XDR definition for RPC-over-RDMA
   version 2 is organized differently than the definition for version
   1.  These changes, which are discussed in Appendix C.1, do not alter
   the on-the-wire format.

   In addition, RPC-over-RDMA version 2 contains a set of incremental
   changes that relieve certain performance constraints and enable
   recovery from abnormal corner cases.  These changes are outlined in
   Appendix C and include a larger default inline threshold, the ability
   to convey a single RPC message using multiple RDMA Send operations,
   support for authentication of connection peers, richer error
   reporting, an improved credit-based flow control mechanism, and
   support for Remote Invalidation.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Terminology

3.1.  Remote Procedure Calls

   This section highlights key elements of the RPC protocol [RFC5531]
   and the External Data Representation (XDR) [RFC4506] used by it.
   RPC-over-RDMA version 2 enables the transmission of RPC messages
   built using XDR and also uses XDR internally to describe its own
   header formats.  An understanding of RPC and its use of XDR is
   assumed in this document.

3.1.1.  Upper-Layer Protocols

   RPCs are an abstraction used to implement the operations of an Upper-
   Layer Protocol (ULP).  "ULP" refers to an RPC Program and Version
   tuple, which is a versioned set of procedure calls that comprise a
   single well-defined API.  One example of a ULP is the Network File
   System Version 4.0 [RFC7530].

   In this document, the term "RPC consumer" refers to an implementation
   of a ULP running on an RPC client.

3.1.2.  Requesters and Responders

   Like a local procedure call, every RPC procedure has a set of
   "arguments" and a set of "results".  A calling context invokes a
   procedure, passing arguments to it, and the procedure subsequently
   returns a set of results.  Unlike a local procedure call, the called
   procedure is executed remotely rather than in the local application's
   execution context.

   The RPC protocol as described in [RFC5531] is fundamentally a
   message-passing protocol between one or more clients, where RPC
   consumers are running, and a server, where a remote execution context
   is available to process RPC transactions on behalf of those
   consumers.

   ONC RPC transactions are made up of two types of messages:

   CALL
      A CALL message, or "Call", requests that work be done.  An RPC
      Call message is designated by the value zero (0) in the message's
      msg_type field.  An arbitrary unique value is placed in the
      message's XID field in order to match this RPC Call message to a
      corresponding RPC Reply message.

   REPLY
      A REPLY message, or "Reply", reports the results of work requested
      by an RPC Call message.  An RPC Reply message is designated by the
      value one (1) in the message's msg_type field.  The value
      contained in an RPC Reply message's XID field is copied from the
      RPC Call message whose results are being reported.

   Each RPC client endpoint acts as a "Requester".  It serializes the
   procedure's arguments and conveys them to a server endpoint via an
   RPC Call message.  This message contains an RPC protocol header, a
   header describing the requested upper-layer operation, and all
   arguments.

   An RPC server endpoint acts as a "Responder".  It deserializes the
   arguments and processes the requested operation.  It then serializes
   the operation's results into another byte stream.  This byte stream
   is conveyed back to the Requester via an RPC Reply message.  This
   message contains an RPC protocol header, a header describing the
   upper-layer reply, and all results.

   The Requester deserializes the results and allows the RPC consumer to
   proceed.  At this point, the RPC transaction designated by the XID in
   the RPC Call message is complete, and the XID is retired.

   In summary, Requesters send RPC Call messages to Responders to
   initiate RPC transactions.  Responders send RPC Reply messages to
   Requesters to complete the processing of an RPC transaction.
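
   As an illustration of the XID matching described above, the following
   Python sketch encodes the first two XDR words of an RPC message (the
   XID and msg_type fields of [RFC5531]) and tracks outstanding Calls.
   The class and function names are hypothetical and are not part of
   any RPC implementation; the remainder of the RPC header is ignored.

```python
import struct

CALL = 0   # msg_type value that designates an RPC Call
REPLY = 1  # msg_type value that designates an RPC Reply


def encode_rpc_prefix(xid, msg_type):
    """Encode the XID and msg_type as two big-endian 32-bit XDR words."""
    return struct.pack(">II", xid, msg_type)


def decode_rpc_prefix(data):
    """Return (xid, msg_type) from the first eight octets of a message."""
    return struct.unpack(">II", data[:8])


class PendingCalls:
    """Hypothetical Requester-side table of XIDs awaiting a Reply."""

    def __init__(self):
        self._pending = set()

    def sent_call(self, xid):
        """Record the XID and return the encoded start of the Call."""
        self._pending.add(xid)
        return encode_rpc_prefix(xid, CALL)

    def received(self, data):
        """Retire the XID if data begins a Reply to a pending Call."""
        xid, msg_type = decode_rpc_prefix(data)
        if msg_type != REPLY or xid not in self._pending:
            return False
        self._pending.remove(xid)
        return True
```

   In this sketch, a Requester records each XID when it sends a Call
   and retires the XID when the matching Reply arrives.  A Reply that
   carries an unknown or already retired XID is rejected.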

3.1.3.  RPC Transports

   The role of an "RPC transport" is to mediate the exchange of RPC
   messages between Requesters and Responders.  An RPC transport bridges
   the gap between the RPC message abstraction and the native operations
   of a particular network transport.

   RPC-over-RDMA is a connection-oriented RPC transport.  When a
   connection-oriented transport is used, clients initiate transport
   connections, while servers wait passively to accept incoming
   connection requests.

   Most commonly, the client end of the connection acts in the role of
   Requester, and the server end of the connection acts as a Responder.
   However, RPC transactions can also be sent in the reverse direction.
   In this case, the server end of the connection acts as a Requester
   while the client end acts as a Responder.

3.1.4.  External Data Representation

   One cannot assume that all Requesters and Responders represent data
   objects the same way internally.  RPC uses External Data
   Representation (XDR) to translate native data types and serialize
   arguments and results [RFC4506].

   The XDR protocol encodes data independently of the endianness or size
   of host-native data types, enabling unambiguous decoding of data by
   the receiver.  RPC Programs are specified by writing an XDR
   definition of their procedures, argument data types, and result data
   types.

   XDR assumes only that the number of bits in a byte (octet) and their
   order are the same on both endpoints and on the physical network.
   The smallest indivisible unit of XDR encoding is a group of four
   octets.  XDR can also flatten lists, arrays, and other complex data
   types so they can be conveyed as a stream of bytes.

   A serialized stream of bytes that is the result of XDR encoding is
   referred to as an "XDR stream".  A sending endpoint encodes native
   data into an XDR stream and then transmits that stream to a receiver.
   A receiving endpoint decodes incoming XDR byte streams into its
   native data representation format.

3.1.4.1.  XDR Opaque Data

   Sometimes, a data item is to be transferred as is: without encoding
   or decoding.  The contents of such a data item are referred to as
   "opaque data".  XDR encoding places the content of opaque data items
   directly into an XDR stream without altering it in any way.  ULPs or
   applications perform any needed data translation in this case.
   Examples of opaque data items include the content of files or generic
   byte strings.

3.1.4.2.  XDR Roundup

   The number of octets in a variable-length data item precedes that
   item in an XDR stream.  If the size of an encoded data item is not a
   multiple of four octets, octets containing zero are added after the
   end of the item so that the next encoded data item in the XDR stream
   always starts on a four-octet boundary.  The encoded size of the item
   is not changed by the addition of the extra octets.  These extra
   octets are never exposed to ULPs.

   This technique is referred to as "XDR roundup", and the extra octets
   are referred to as "XDR roundup padding".
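
   The roundup rule can be sketched in Python as follows.  The sketch
   handles only variable-length opaque data, and the function names are
   illustrative assumptions rather than part of any XDR library.

```python
import struct


def xdr_pad_len(n):
    """Number of zero octets needed to round n up to a multiple of four."""
    return (4 - n % 4) % 4


def encode_opaque(data):
    """Encode variable-length opaque data: length, content, roundup padding."""
    pad = b"\x00" * xdr_pad_len(len(data))
    # The length word counts only the content, never the padding.
    return struct.pack(">I", len(data)) + data + pad


def decode_opaque(stream, offset=0):
    """Decode one variable-length opaque item; return (data, next_offset)."""
    (length,) = struct.unpack_from(">I", stream, offset)
    start = offset + 4
    end = start + length
    # Skip the roundup padding so the next item starts on a four-octet
    # boundary; the padding is not returned to the caller.
    return stream[start:end], end + xdr_pad_len(length)
```

   For example, a five-octet item is encoded as a four-octet length
   word, five octets of content, and three octets of roundup padding,
   for twelve octets in total.  The length word still carries the value
   five.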




3.2.  Remote Direct Memory Access

   RPC Requesters and Responders can be made more efficient if large RPC
   messages are transferred by a third party, such as intelligent
   network-interface hardware (data movement offload), and placed in the
   receiver's memory so that no additional adjustment of data alignment
   has to be made (direct data placement or "DDP").  RDMA transports
   enable both optimizations.

   In the current document, "RDMA" refers to the physical mechanism an
   RDMA transport utilizes when moving data.

3.2.1.  Direct Data Placement

   Typically, RPC implementations copy the contents of RPC messages into
   a buffer before sending them.  An efficient RPC implementation sends
   bulk data without copying it into a separate send buffer first.

   However, socket-based RPC implementations are often unable to receive
   data directly into its final place in memory.  Receivers often need
   to copy incoming data to finish an RPC operation: sometimes, only to
   adjust data alignment.

   Although it may not be efficient, before an RDMA transfer, a sender
   may copy data into an intermediate buffer.  After an RDMA transfer, a
   receiver may copy that data again to its final destination.  In this
   document, the term "DDP" refers to any optimized data transfer where
   it is unnecessary for a receiving host's CPU to copy transferred data
   to another location after it has been received.

   RPC-over-RDMA version 2 enables the use of RDMA Read and Write
   operations to achieve both data movement offload and DDP.  However,
   not all RDMA-based data transfer qualifies as DDP, and DDP can be
   achieved using non-RDMA mechanisms.

3.2.2.  RDMA Transport Requirements

   To achieve good performance during receive operations, RDMA
   transports require that RDMA consumers provision resources in advance
   in order to receive incoming messages.

   An RDMA consumer might provide Receive buffers in advance by posting
   an RDMA Receive Work Request for every expected RDMA Send from a
   remote peer.  These buffers are provided before the remote peer posts
   RDMA Send Work Requests.  Thus this is often referred to as "pre-
   posting" buffers.

   An RDMA Receive Work Request remains outstanding until hardware
   matches it to an inbound Send operation.  The resources associated
   with that Receive must be retained in host memory, or "pinned", until
   the Receive completes.

   Given these basic tenets of RDMA transport operation, the RPC-over-
   RDMA version 2 protocol assumes each transport provides the following
   abstract operations.  A more complete discussion of these operations
   can be found in [RFC5040].

3.2.2.1.  Memory Registration

   Memory registration assigns a steering tag to a region of memory,
   permitting the RDMA provider to perform data-transfer operations.
   The RPC-over-RDMA version 2 protocol assumes that each registered
   memory region is identified with a steering tag of no more than 32
   bits and memory addresses of up to 64 bits in length.

3.2.2.2.  RDMA Send

   The RDMA provider supports an RDMA Send operation, with completion
   signaled on the receiving peer after data has been placed in a pre-
   posted buffer.  Sends complete at the receiver in the order they were
   issued at the sender.  The amount of data transferred by a single
   RDMA Send operation is limited by the size of the remote peer's pre-
   posted buffers.

3.2.2.3.  RDMA Receive

   The RDMA provider supports an RDMA Receive operation to receive data
   conveyed by incoming RDMA Send operations.  To reduce the amount of
   memory that must remain pinned awaiting incoming Sends, the amount of
   pre-posted memory is limited.  Flow control to prevent overrunning
   receiver resources is provided by the RDMA consumer (in this case,
   the RPC-over-RDMA version 2 protocol).

3.2.2.4.  RDMA Write

   The RDMA provider supports an RDMA Write operation to place data
   directly into a remote memory region.  The local host initiates an
   RDMA Write, and completion is signaled there.  No completion is
   signaled on the remote peer.  The local host provides a steering tag,
   memory address, and the length of the remote peer's memory region.

   RDMA Writes are not ordered with respect to one another, but are
   ordered with respect to RDMA Sends.  A subsequent RDMA Send
   completion obtained at the write initiator guarantees that prior RDMA
   Write data has been successfully placed in the remote peer's memory.

3.2.2.5.  RDMA Read

   The RDMA provider supports an RDMA Read operation to place peer
   source data directly into the read initiator's memory.  The local
   host initiates an RDMA Read, and completion is signaled there.  No
   completion is signaled on the remote peer.  The local host provides
   steering tags, memory addresses, and a length for the remote source
   and local destination memory region.

   The local host signals Read completion to the remote peer as part of
   a subsequent RDMA Send message.  The remote peer can then invalidate
   steering tags and subsequently free associated source memory regions.

4.  RPC-over-RDMA Protocol Framework

4.1.  Transfer Model

   A "transfer model" designates which endpoint exposes its memory and
   which is responsible for initiating the transfer of data.  To enable
   RDMA Read and Write operations, for example, an endpoint first
   exposes regions of its memory to a remote endpoint, which initiates
   these operations against the exposed memory.

   In RPC-over-RDMA version 2, Requesters expose their memory to the
   Responder, but the Responder does not expose its memory.  The
   Responder pulls RPC arguments or whole RPC calls from each Requester.
   The Responder pushes RPC results or whole RPC replies to each
   Requester.

4.2.  Message Framing

   Each RPC-over-RDMA version 2 message consists of at most two XDR
   streams:

   Transport Stream
      The "Transport stream" contains a header that describes and
      controls the transfer of the Payload stream in this RPC-over-RDMA
      message.  Every RDMA Send message on an RPC-over-RDMA version 2
      connection MUST begin with a Transport stream.

   RPC Payload Stream
      The "Payload stream" contains part or all of a single RPC message.
      The sender MAY divide an RPC message at any convenient boundary,
      but MUST send RPC message fragments in XDR stream order and MUST
      NOT interleave Payload streams from multiple RPC messages.  The
      RPC-over-RDMA version 2 message carrying the final part of an RPC
      message is marked (see Section 6.3.2.2).

   In its simplest form, an RPC-over-RDMA version 2 message conveying an
   RPC message payload consists of a Transport stream followed
   immediately by a Payload stream transmitted together via a single
   RDMA Send.
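
   The division of an RPC message into Payload stream fragments can be
   sketched in Python as follows.  This is a minimal illustration under
   stated assumptions: the "max_fragment" parameter and the boolean
   "is_final" marker are hypothetical stand-ins.  The actual marking of
   the final fragment is described in Section 6.3.2.2, and the Transport
   stream that begins each RDMA Send is omitted here.

```python
def split_payload(rpc_message, max_fragment):
    """Split an RPC message into in-order Payload stream fragments.

    Returns a list of (fragment, is_final) pairs.  Fragments are kept in
    XDR stream order and only the last one is flagged as final.
    """
    if max_fragment <= 0:
        raise ValueError("max_fragment must be positive")
    fragments = []
    for start in range(0, len(rpc_message), max_fragment):
        chunk = rpc_message[start:start + max_fragment]
        # The fragment is final when it reaches the end of the message.
        is_final = start + max_fragment >= len(rpc_message)
        fragments.append((chunk, is_final))
    return fragments
```

   A message that fits within a single fragment yields one pair flagged
   as final, which corresponds to the simplest form described above.
   Concatenating the fragments in list order reproduces the original
   RPC message.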

   RPC-over-RDMA framing replaces all other RPC framing (such as TCP
   record marking) when used atop an RPC-over-RDMA association, even
   when the underlying RDMA protocol may itself be layered atop a
   transport with a defined RPC framing (such as TCP).

   However, it is possible for RPC-over-RDMA to be dynamically enabled
   on a connection in the course of negotiating the use of RDMA via a
   ULP exchange.  Because RPC framing delimits an entire RPC request or
   reply, the resulting shift in framing must occur between distinct RPC
   messages, and in concert with the underlying transport.

4.3.  Managing Receiver Resources

   The longevity of an RDMA connection mandates that sending endpoints
   respect the resource limits of peer receivers.  To ensure messages
   can be sent and received reliably, there are two operational
   parameters for each connection.  It is critical to provide RDMA Send
   flow control for an RDMA connection.  If any pre-posted Receive
   buffer on the connection is not large enough to accept an incoming
   RDMA Send, or if a pre-posted Receive buffer is not available to
   accept an incoming RDMA Send, the RDMA connection can be terminated.

4.3.1.  RPC-over-RDMA Version 2 Flow Control

   Because RPC-over-RDMA requires reliable and in-order delivery of data
   payloads, RPC-over-RDMA transports MUST use the RDMA RC (Reliable
   Connected) Queue Pair (QP) type, which ensures in-transit data
   integrity and handles recovery from packet loss or misordering.

   However, RPC-over-RDMA transports provide their own flow control
   mechanism to prevent a sender from overwhelming receiver resources.
   RPC-over-RDMA transports employ an end-to-end credit-based flow
   control mechanism for this purpose [CBFC].  Credit-based flow control
   was chosen because it is relatively simple and provides robust
   operation in the face of bursty traffic, automated management of
   receive buffer allocation, and excellent buffer utilization.

4.3.1.1.  Granting Credits

   An RPC-over-RDMA version 2 credit is the capability to receive one
   RPC-over-RDMA version 2 message.  This enables RPC-over-RDMA version
   2 to support asymmetrical operation, where a message in one direction
   might be matched by zero, one, or multiple messages in the other
   direction.

   To achieve this, credits are assigned to each connection peer's
   posted Receive buffers.  Each Requester has a set of Receive credits,
   and each Responder has a set of Receive credits.  These credit values
   are managed independently of one another.

   Section 7 of [RFC8166] requires that the 32-bit field containing the
   credit grant is the third word in the transport header.  To conform
   with that requirement, the two independent credit values are encoded
   into a single 32-bit field in the fixed portion of the transport
   header.  After the field is XDR decoded, the receiver takes the low-
   order two bytes as the number of credits that are newly granted by
   the sender, and the high-order two bytes as the maximum number of
   credits that can be outstanding at the sender.
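
   As an informal illustration, the following Python sketch splits a
   decoded credit word into its two values and packs them again.  The
   function names are hypothetical and are not defined by this
   protocol; only the placement of the two 16-bit values follows the
   text above.

```python
from typing import Tuple


def decode_credit_field(field: int) -> Tuple[int, int]:
    """Split an XDR-decoded 32-bit credit word.

    Returns (newly_granted, max_outstanding).
    """
    if not 0 <= field <= 0xFFFFFFFF:
        raise ValueError("credit field must fit in 32 bits")
    newly_granted = field & 0xFFFF            # low-order two bytes
    max_outstanding = (field >> 16) & 0xFFFF  # high-order two bytes
    return newly_granted, max_outstanding


def encode_credit_field(newly_granted: int, max_outstanding: int) -> int:
    """Pack the two 16-bit credit values into one 32-bit word."""
    for value in (newly_granted, max_outstanding):
        if not 0 <= value <= 0xFFFF:
            raise ValueError("each credit value must fit in 16 bits")
    return (max_outstanding << 16) | newly_granted
```

   For instance, a field value of 0x00200008 would be read as 8 newly
   granted credits and a maximum of 32 outstanding credits.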

   In this approach, then, there are requester credits, sent in messages
   from the requester to the responder; and responder credits, sent in
   messages from the responder to the requester.

   A sender MUST NOT send RDMA messages in excess of the receiver's
   granted credit limit.  If the granted value is exceeded, the RDMA
   layer may signal an error, possibly terminating the connection.  The
   granted value MUST NOT be zero, since such a value would result in
   deadlock.
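
   A sender-side view of these two rules might look like the sketch
   below.  It assumes a deliberately simplified accounting model in
   which each Send consumes one credit until the peer retires the
   message; how and when credits are replenished is an assumption of
   this sketch, and the class and method names are hypothetical.

```python
class CreditGate:
    """Tracks how many messages a sender may still send to its peer."""

    def __init__(self, granted: int) -> None:
        self.granted = 0
        self.consumed = 0
        self.update_grant(granted)

    def update_grant(self, granted: int) -> None:
        # A grant of zero would leave the sender unable to make progress.
        if granted <= 0:
            raise ValueError("credit grant must not be zero")
        self.granted = granted

    def try_send(self) -> bool:
        """Consume one credit; return False if the limit is reached."""
        if self.consumed >= self.granted:
            return False
        self.consumed += 1
        return True

    def retire(self, count: int = 1) -> None:
        """Assumed replenishment step: the peer has freed 'count' buffers."""
        self.consumed = max(0, self.consumed - count)
```

   A caller would queue a message locally whenever try_send() returns
   False, rather than posting an RDMA Send that the peer cannot
   receive.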

   The granted credit values MAY be adjusted to match the needs or
   policies in effect on either peer.  For instance, a peer may reduce
   its granted credit value to accommodate the available resources in a
   Shared Receive Queue.

   Certain RDMA implementations may impose additional flow-control
   restrictions, such as limits on RDMA Read operations in progress at
   the Responder.  Accommodation of such restrictions is considered the
   responsibility of each RPC-over-RDMA version 2 implementation.

4.3.1.2.  Asynchronous Credit Grants

   A protocol convention is provided to enable one peer to refresh its
   credit grant to the other peer without sending a data payload.
   Messages of this type can also act as a keep-alive ping.  See
   Section 6.4.2 for information about this convention.

   To prevent transport deadlock, receivers MUST always be in a position
   to receive one such credit grant update message, in addition to
   payload-bearing messages.  One way a receiver can do this is to post
   one extra Receive more than the credit value it granted.

4.3.2.  Inline Threshold

   An "inline threshold" value is the largest message size (in octets)
   that can be conveyed in one direction between peer implementations
   using RDMA Send and Receive operations.  The inline threshold value
   is effectively the smaller of the largest number of bytes the sender
   can post via a single RDMA Send operation and the largest number of
   bytes the receiver can accept via a single RDMA Receive operation.
   Each connection has two inline threshold values: one for messages
   flowing from Requester-to-Responder, referred to as the "call inline
   threshold", and one for messages flowing from Responder-to-Requester,
   referred to as the "reply inline threshold".  Inline threshold values
   can be advertised to peers via Transport Properties.

   Receiver implementations MUST support inline thresholds of 4096
   bytes.  In the absence of an exchange of Transport Properties,
   senders and receivers MUST assume both connection inline thresholds
   are 4096 bytes.
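
   The relationship between the two buffer limits and the default value
   can be sketched as follows.  This is only an illustration; the
   function and parameter names are hypothetical.

```python
from typing import Optional

DEFAULT_INLINE_THRESHOLD = 4096  # octets, used when nothing is exchanged


def effective_inline_threshold(sender_max_send: int,
                               receiver_max_receive: int) -> int:
    """The smaller of what one side can Send and the other can Receive."""
    return min(sender_max_send, receiver_max_receive)


def threshold_in_use(advertised: Optional[int]) -> int:
    """Fall back to the default when the peer advertised no value."""
    if advertised is None:
        return DEFAULT_INLINE_THRESHOLD
    return advertised


def fits_inline(message_length: int, threshold: int) -> bool:
    """True if a message of this size can be conveyed by one RDMA Send."""
    return message_length <= threshold
```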

4.3.3.  Initial Connection State

   When an RPC-over-RDMA version 2 client establishes a connection to a
   server, its first order of business is to determine the server's
   highest supported protocol version.

   Upon connection establishment a client MUST NOT send more than a
   single RPC-over-RDMA message at a time until it receives a valid non-
   error RPC-over-RDMA message from the server that grants client
   credits.

   The second word of each transport header is used to convey the
   transport protocol version.  In the interest of simplicity, we refer
   to that word as rdma_vers even though in the RPC-over-RDMA version 2
   XDR definition it is described as rdma_start.rdma_vers.

   First, the client sends a single valid RPC-over-RDMA message with the
   value two (2) in the rdma_vers field.  Because the server might
   support only RPC-over-RDMA version 1, this initial message MUST NOT
   be larger than the version 1 default inline threshold of 1024 bytes.

4.3.3.1.  Server Does Support RPC-over-RDMA Version 2

   If the server does support RPC-over-RDMA version 2, it sends RPC-
   over-RDMA messages back to the client with the value two (2) in the
   rdma_vers field.  Both peers may use the default inline threshold
   value for RPC-over-RDMA version 2 connections (4096 bytes).

4.3.3.2.  Server Does Not Support RPC-over-RDMA Version 2

   If the server does not support RPC-over-RDMA version 2, it MUST send
   an RPC-over-RDMA message to the client with the same XID, with
   RDMA2_ERROR in the rdma_start.rdma_htype field, and with the error
   code RDMA2_ERR_VERS.  This message also reports a range of protocol
   versions that the server supports.  To continue operation, the client
   selects a protocol version in the range of server-supported versions
   for subsequent messages on this connection.

   If the connection is lost immediately after an RDMA2_ERROR /
   RDMA2_ERR_VERS message is received, a client can avoid a possible
   version negotiation loop when re-establishing another connection by
   assuming that particular server does not support RPC-over-RDMA
   version 2.  A client can assume the same situation (no server support
   for RPC-over-RDMA version 2) if the initial negotiation message is
   lost or dropped.  Once the negotiation exchange is complete, both
   peers may use the default inline threshold value for the transport
   protocol version that has been selected.
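
   A client's handling of an RDMA2_ERR_VERS response might be sketched
   as below.  The representation of the server's version range as a
   (low, high) pair and all names used here are assumptions made for
   illustration; the wire format of the error is defined elsewhere in
   this document.

```python
from typing import Dict, Iterable, Optional


def select_version(client_versions: Iterable[int],
                   server_low: int, server_high: int) -> Optional[int]:
    """Pick the highest client version inside the server's range."""
    usable = [v for v in client_versions
              if server_low <= v <= server_high]
    return max(usable) if usable else None


class VersionMemory:
    """Remembers servers that rejected a version, to avoid a
    negotiation loop after a reconnect."""

    def __init__(self) -> None:
        self._ceiling: Dict[str, int] = {}

    def note_vers_error(self, server: str, server_high: int) -> None:
        self._ceiling[server] = server_high

    def first_version(self, server: str, preferred: int = 2) -> int:
        """Version to place in rdma_vers on a new connection."""
        return min(preferred, self._ceiling.get(server, preferred))
```

   With this approach, a client that reconnects to a server known to
   support only version 1 starts directly with version 1.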

4.3.3.3.  Client Does Not Support RPC-over-RDMA Version 2

   If the server supports the RPC-over-RDMA protocol version used in the
   first RPC-over-RDMA message received from a client, it MUST use that
   protocol version in all subsequent messages it sends on that
   connection.  The client MUST NOT change the protocol version for the
   duration of the connection.

4.4.  XDR Encoding with Chunks

   When a DDP capability is available, the transport places the contents
   of one or more XDR data items directly into the receiver's memory,
   separately from the transfer of other parts of the containing XDR
   stream.

4.4.1.  Reducing an XDR Stream

   RPC-over-RDMA version 2 provides a mechanism for moving part of an
   RPC message via a data transfer distinct from an RDMA Send/Receive
   pair.  The sender removes one or more XDR data items from the Payload
   stream.  These items are conveyed via other mechanisms, such as one
   or more RDMA Read or Write operations.  As the receiver decodes an
   incoming message, it skips over directly placed data items.

   The portion of an XDR stream that is split out and moved separately
   is referred to as a "chunk".  In some contexts, data in an RPC-over-
   RDMA header that describes these split out regions of memory may also
   be referred to as a "chunk".

   A Payload stream after chunks have been removed is referred to as a
   "reduced" Payload stream.  Likewise, a data item that has been
   removed from a Payload stream to be transferred separately is
   referred to as a "reduced" data item.

4.4.2.  DDP-Eligibility

   Not all XDR data items benefit from DDP.  For example, small data
   items or data items that require XDR unmarshaling by the receiver do
   not benefit from DDP.  In addition, it is impractical for receivers
   to prepare for every possible XDR data item in a protocol to be
   transferred in a chunk.

   To maintain practical interoperability on an RPC-over-RDMA transport,
   a determination must be made of which few XDR data items in each ULP
   are allowed to use DDP.

   This is done in additional specifications that describe how ULPs
   employ DDP.  A "ULB specification" identifies which specific
   individual XDR data items in a ULP MAY be transferred via DDP.  Such
   data items are referred to as "DDP-eligible".  All other XDR data
   items MUST NOT be reduced.  Detailed requirements for ULBs are
   provided in Appendix A.

4.4.3.  RDMA Segments

   When encoding a Payload stream that contains a DDP-eligible data
   item, a sender may choose to reduce that data item.  When it chooses
   to do so, the sender does not place the item into the Payload stream.
   Instead, the sender records in the RPC-over-RDMA Transport header the
   location and size of the memory region containing that data item.

   The Requester provides location information for DDP-eligible data
   items in both RPC Call and Reply messages.  The Responder uses this
   information to retrieve arguments contained in the specified region
   of the Requester's memory or place results in that memory region.

   An "RDMA segment", or "plain segment", is an RPC-over-RDMA Transport
   header data object that contains the precise coordinates of a
   contiguous memory region that is to be conveyed separately from the
   Payload stream.  Plain segments contain the following information:

   Handle
      Steering tag (STag) or R_key generated by registering this memory
      with the RDMA provider.

   Length
      The length of the RDMA segment's memory region, in octets.  An
      "empty segment" is an RDMA segment with the value zero (0) in its
      length field.

   Offset
      The offset or beginning memory address of the RDMA segment's
      memory region.

   See [RFC5040] for further discussion.
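
   The three fields of a plain segment can be modeled as in the sketch
   below.  The field widths used here (a 32-bit handle, a 32-bit
   length, and a 64-bit offset, in network byte order) are an
   assumption that follows the segment layout of [RFC8166]; the class
   name and methods are hypothetical.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class PlainSegment:
    handle: int  # STag or R_key from memory registration
    length: int  # size of the memory region, in octets
    offset: int  # offset or starting address of the region

    def encode(self) -> bytes:
        """Big-endian encoding: uint32, uint32, uint64 (assumed widths)."""
        return struct.pack(">IIQ", self.handle, self.length, self.offset)

    @classmethod
    def decode(cls, data: bytes) -> "PlainSegment":
        handle, length, offset = struct.unpack(">IIQ", data[:16])
        return cls(handle, length, offset)

    @property
    def is_empty(self) -> bool:
        """An empty segment has zero in its length field."""
        return self.length == 0
```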

4.4.4.  Chunks

   In RPC-over-RDMA version 2, a "chunk" refers to a portion of the
   Payload stream that is moved independently of the RPC-over-RDMA
   Transport header and Payload stream.  Chunk data is removed from the
   sender's Payload stream, transferred via separate operations, and
   then reinserted into the receiver's Payload stream to form a complete
   RPC message.

   Each chunk consists of RDMA segments.  Each RDMA segment represents a
   single contiguous piece of that chunk.  A Requester MAY divide a
   chunk into RDMA segments using any boundaries that are convenient.
   The length of a chunk is exactly the sum of the lengths of the RDMA
   segments that make it up.

   The RPC-over-RDMA version 2 transport protocol does not place a limit
   on chunk size.  However, each ULP may cap the amount of data that can
   be transferred by a single RPC transaction.  For example, NFS has
   "rsize" and "wsize", which restrict the payload size of NFS READ and
   WRITE operations.  The Responder can use such limits to sanity check
   chunk sizes before using them in RDMA operations.
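
   A Responder-side sanity check of this kind might be sketched as
   follows.  The function names and the way the ULP limit is supplied
   are hypothetical.

```python
from typing import Sequence


def chunk_length(segment_lengths: Sequence[int]) -> int:
    """A chunk's length is the sum of its segments' lengths."""
    return sum(segment_lengths)


def check_chunk_size(segment_lengths: Sequence[int], ulp_limit: int) -> int:
    """Reject a chunk larger than the ULP allows (e.g. NFS rsize/wsize)."""
    total = chunk_length(segment_lengths)
    if total > ulp_limit:
        raise ValueError("chunk of %d octets exceeds ULP limit of %d"
                         % (total, ulp_limit))
    return total
```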

4.4.4.1.  Counted Arrays

   If a chunk contains a counted array data type, the count of array
   elements MUST remain in the Payload stream, while the array elements
   MUST be moved to the chunk.  For example, when encoding an opaque
   byte array as a chunk, the count of bytes stays in the Payload
   stream, while the bytes in the array are removed from the Payload
   stream and transferred within the chunk.

   Individual array elements appear in a chunk in their entirety.  For
   example, when encoding an array of arrays as a chunk, the count of
   items in the enclosing array stays in the Payload stream, but each
   enclosed array, including its item count, is transferred as part of
   the chunk.

4.4.4.2.  Optional-Data

   If a chunk contains an optional-data data type, the "is present"
   field MUST remain in the Payload stream, while the data, if present,
   MUST be moved to the chunk.

4.4.4.3.  XDR Unions

   A union data type MUST NOT be made DDP-eligible, but one or more of
   its arms MAY be DDP-eligible, subject to the other requirements in
   this section.

4.4.4.4.  Chunk Roundup

   Except in special cases (covered in Section 4.5.4), a chunk MUST
   contain exactly one XDR data item.  This makes it straightforward to
   reduce variable-length data items without affecting the XDR alignment
   of data items in the Payload stream.

   When a variable-length XDR data item is reduced, the sender MUST
   remove XDR roundup padding for that data item from the Payload stream
   so that data items remaining in the Payload stream begin on four-byte
   alignment.
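
   The effect of reducing a variable-length opaque data item can be
   illustrated with the sketch below.  It shows the unreduced XDR
   encoding next to the reduced form, in which only the byte count
   stays in the Payload stream.  The helper names are hypothetical.

```python
import struct
from typing import Tuple


def xdr_pad_length(length: int) -> int:
    """Octets of roundup padding needed to reach four-byte alignment."""
    return (4 - length % 4) % 4


def encode_opaque(data: bytes) -> bytes:
    """Unreduced XDR form: count, data, then roundup padding."""
    padding = b"\x00" * xdr_pad_length(len(data))
    return struct.pack(">I", len(data)) + data + padding


def reduce_opaque(data: bytes) -> Tuple[bytes, bytes]:
    """Return (octets left in the Payload stream, chunk contents).

    The count stays in the Payload stream.  The data moves to the
    chunk, and the roundup padding is dropped from the Payload stream.
    """
    return struct.pack(">I", len(data)), data
```

   Because the remaining count is a four-byte word, the data items that
   follow it in the reduced Payload stream stay four-byte aligned.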

4.4.5.  Read Chunks

   A "Read chunk" represents an XDR data item that is to be pulled from
   the Requester to the Responder.  A Read chunk is a list of one or
   more RDMA read segments.  Each RDMA read segment consists of a
   Position field followed by a plain segment.

   Position
      The byte offset in the unreduced Payload stream where the receiver
      reinserts the data item conveyed in a chunk.  The Position value
      MUST be computed from the beginning of the unreduced Payload
      stream, which begins at Position zero.  All RDMA read segments
      belonging to the same Read chunk have the same value in their
      Position field.

   While constructing an RPC Call message, a Requester registers memory
   regions that contain data to be transferred via RDMA Read operations.
   It advertises the coordinates of these regions in the RPC-over-RDMA
   Transport header of the RPC Call message.

   After receiving an RPC Call message sent via an RDMA Send operation,
   a Responder transfers the chunk data from the Requester using RDMA
   Read operations.  The Responder reconstructs the transferred chunk
   data by concatenating the contents of each RDMA segment in list order
   into the received Payload stream at the Position value recorded in
   that RDMA segment.

   Put another way, the Responder inserts the first RDMA segment in a
   Read chunk into the Payload stream at the byte offset indicated by
   its Position field.  RDMA segments whose Position field value matches
   this offset are concatenated afterwards, until there are no more RDMA
   segments at that Position value.

   The Position field in a read segment indicates where the containing
   Read chunk starts in the Payload stream.  The value in this field
   MUST be a multiple of four.  All segments in the same Read chunk
   share the same Position value, even if one or more of the RDMA
   segments have a non-four-byte-aligned length.
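
   The reassembly rule can be sketched as follows.  The sketch assumes
   that the data for each read segment has already been pulled into
   local memory and is presented as (Position, data) pairs in list
   order.  It also assumes that the receiver restores XDR roundup
   padding after each reinserted item so that later Position values,
   which refer to the unreduced stream, remain valid.  All names are
   hypothetical.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, bytes]  # (Position, data already transferred)


def group_read_chunks(segments: Sequence[Segment]) -> List[Segment]:
    """Concatenate consecutive segments that share a Position value."""
    chunks: List[Segment] = []
    for position, data in segments:
        if position % 4 != 0:
            raise ValueError("Position must be a multiple of four")
        if chunks and chunks[-1][0] == position:
            chunks[-1] = (position, chunks[-1][1] + data)
        else:
            chunks.append((position, data))
    return chunks


def reinsert_read_chunks(reduced: bytes,
                         segments: Sequence[Segment]) -> bytes:
    """Rebuild the unreduced Payload stream from a reduced one."""
    stream = bytearray(reduced)
    # Ascending order keeps earlier insertions from shifting later ones
    # away from their Position in the unreduced stream.
    for position, data in sorted(group_read_chunks(segments),
                                 key=lambda chunk: chunk[0]):
        padding = b"\x00" * ((4 - len(data) % 4) % 4)
        stream[position:position] = data + padding
    return bytes(stream)
```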

4.4.5.1.  Decoding Read Chunks

   While decoding a received Payload stream, whenever the XDR offset in
   the Payload stream matches that of a Read chunk, the Responder
   initiates an RDMA Read to pull the chunk's data content into
   registered local memory.

   The Responder signals that it has finished using Read chunk source
   buffers when it sends an RPC Reply message to the Requester.  The
   Requester may then release Read chunks advertised in the request.

4.4.5.2.  Read Chunk Roundup

   When reducing a variable-length argument data item, the Requester
   MUST NOT include the data item's XDR roundup padding in the chunk
   itself.  The chunk's total length MUST be the same as the encoded
   length of the data item.

4.4.6.  Write Chunks

   While constructing an RPC Call message, a Requester prepares memory
   regions in which to receive DDP-eligible result data items.  A "Write
   chunk" represents an XDR data item that is to be pushed from a
   Responder to a Requester.  It is made up of an array of zero or more
   plain segments.

   Write chunks are provisioned by a Requester long before the Responder
   has prepared the reply Payload stream.  A Requester often does not
   know the actual length of the result data items to be returned, since
   the result does not yet exist.  Thus, it MUST register Write chunks
   long enough to accommodate the maximum possible size of each returned
   data item.

   In addition, the XDR position of DDP-eligible data items in the
   reply's Payload stream is not predictable when a Requester constructs
   an RPC Call message.  Therefore, RDMA segments in a Write chunk do
   not have a Position field.

   For each Write chunk provided by a Requester, the Responder pushes
   one data item to the Requester, filling the chunk contiguously and in
   segment array order until that data item has been completely written
   to the Requester.  The Responder MUST copy the segment count and all
   segments from the Requester-provided Write chunk into the RPC Reply
   message's Transport header.  As it does so, the Responder updates
   each segment length field to reflect the actual amount of data that
   is being returned in that segment.  The Responder then sends the RPC
   Reply message via an RDMA Send operation.

   An "empty Write chunk" is a Write chunk with a zero segment count.
   By definition, the length of an empty Write chunk is zero.  An
   "unused Write chunk" has a non-zero segment count, but all of its
   segments are empty segments.
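
   The way a Responder fills a Write chunk and updates the returned
   segment lengths can be sketched as below.  Segments are represented
   only by their lengths, and the function names are hypothetical.

```python
from typing import List, Sequence


def fill_write_chunk(capacities: Sequence[int],
                     result_length: int) -> List[int]:
    """Return the segment lengths to place in the RPC Reply.

    The result is written contiguously in segment array order.  The
    segment count is preserved; segments that receive no data are
    returned with a length of zero.
    """
    if result_length > sum(capacities):
        raise ValueError("result does not fit in the Write chunk")
    remaining = result_length
    updated = []
    for capacity in capacities:
        used = min(capacity, remaining)
        updated.append(used)
        remaining -= used
    return updated


def is_empty_write_chunk(lengths: Sequence[int]) -> bool:
    """An empty Write chunk has a zero segment count."""
    return len(lengths) == 0


def is_unused_write_chunk(lengths: Sequence[int]) -> bool:
    """An unused Write chunk has segments, all of them empty."""
    return len(lengths) > 0 and all(length == 0 for length in lengths)
```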

4.4.6.1.  Decoding Write Chunks

   After receiving the RPC Reply message, the Requester reconstructs the
   transferred data by concatenating the contents of each segment in
   array order into the RPC Reply message's XDR stream at the known XDR
   position of the associated DDP-eligible result data item.

4.4.6.2.  Write Chunk Roundup

   When provisioning a Write chunk for a variable-length result data
   item, the Requester MUST NOT include additional space for XDR roundup
   padding.  A Responder MUST NOT write XDR roundup padding into a Write
   chunk, even if the result is shorter than the available space in the
   chunk.  Therefore, when returning a single variable-length result
   data item, a returned Write chunk's total length MUST be the same as
   the encoded length of the result data item.

4.5.  Message Transfer Methods

   A receiver of RDMA Send operations is required to have previously
   posted one or more adequately sized buffers.  Memory savings are
   achieved on both Requesters and Responders by posting small Receive
   buffers.  However, not all RPC messages are small.  RPC-over-RDMA
   version 2 provides several mechanisms that enable RPC message
   payloads of any size to be conveyed efficiently.

4.5.1.  Short Messages

   RPC message payloads are often smaller than typical inline
   thresholds.  For example, an NFS version 3 GETATTR operation is only
   56 octets: 20 octets of RPC header, a 32-octet file handle argument,
   and 4 octets for its length.  The reply to this common request is
   about 100 octets.

   Since all RPC messages conveyed via RPC-over-RDMA version 2 require
   at least one RDMA Send operation, the most efficient way to send an
   RPC message that is smaller than the inline threshold is to append
   the Payload stream directly to the Transport stream.  An RPC-over-
   RDMA header with a small RPC Call or Reply message immediately
   following is transferred using a single RDMA Send operation.  No
   other operations are needed.

   An RPC-over-RDMA transaction using a Short Message:

           Requester                             Responder
               |        RDMA Send (RDMA_MSG)         |
          Call |   ------------------------------>   |
               |                                     |
               |                                     | Processing
               |                                     |
               |        RDMA Send (RDMA_MSG)         |
               |   <------------------------------   | Reply

4.5.2.  Continued Messages

   If an RPC message is larger than the inline threshold, the sender can
   choose to split that message over multiple RPC-over-RDMA messages.
   The Payload stream of each RPC-over-RDMA message contains a part of
   the RPC message.  The receiver reconstitutes the RPC message by
   concatenating the Payload streams of the sequence of RPC-over-RDMA
   messages together.

   Though the purpose of a Continued Message is to handle large RPC
   messages, senders MAY use a Continued Message at any time to convey
   an RPC message, and MAY split the RPC message payload on any
   convenient boundary.

   An RPC-over-RDMA transaction using a Continued Message:

           Requester                             Responder
               |        RDMA Send (RDMA_MSG)         |
          Call |   ------------------------------>   |
               |        RDMA Send (RDMA_MSG)         |
               |   ------------------------------>   |
               |        RDMA Send (RDMA_MSG)         |
               |   ------------------------------>   |
               |                                     |
               |                                     |
               |                                     | Processing
               |                                     |
               |        RDMA Send (RDMA_MSG)         |
               |   <------------------------------   | Reply
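
   Splitting and reconstituting an RPC message for Message Continuation
   can be sketched as follows.  The parameter max_part stands for the
   room left for the Payload stream in each RPC-over-RDMA message once
   the transport header has been accounted for; that parameter and the
   function names are hypothetical.

```python
from typing import List, Sequence


def split_payload(payload: bytes, max_part: int) -> List[bytes]:
    """Cut an RPC message into parts of at most max_part octets."""
    if max_part <= 0:
        raise ValueError("max_part must be positive")
    parts = [payload[i:i + max_part]
             for i in range(0, len(payload), max_part)]
    return parts or [b""]


def join_payload(parts: Sequence[bytes]) -> bytes:
    """Concatenate the received Payload streams in arrival order."""
    return b"".join(parts)
```

   Because RC Queue Pairs deliver Sends in order, the receiver in this
   sketch needs no sequence numbers to put the parts back together.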

4.5.3.  Chunked Messages

   If DDP-eligible data items are present in a Payload stream, a sender
   MAY reduce some or all of these items by removing them from the
   Payload stream.  The sender then uses a separate mechanism to
   transfer the reduced data items.  The Transport stream with the
   reduced Payload stream immediately following is then transferred
   using a single RDMA Send operation.

   After receiving the Transport and Payload streams of an RPC Call
   message accompanied by Read chunks, the Responder uses RDMA Read
   operations to move reduced data items in Read chunks.  Before sending
   the Transport and Payload streams of an RPC Reply message containing
   Write chunks, the Responder uses RDMA Write operations to move
   reduced data items in Write and Reply chunks.

   An RPC-over-RDMA transaction with a Read chunk:

           Requester                             Responder
               |        RDMA Send (RDMA_MSG)         |
          Call |   ------------------------------>   |
               |        RDMA Read                    |
               |   <------------------------------   |
               |        RDMA Response (arg data)     |
               |   ------------------------------>   |
               |                                     |
               |                                     | Processing
               |                                     |
               |        RDMA Send (RDMA_MSG)         |
               |   <------------------------------   | Reply

   An RPC-over-RDMA transaction with a Write chunk:

           Requester                             Responder
               |        RDMA Send (RDMA_MSG)         |
          Call |   ------------------------------>   |
               |                                     |
               |                                     | Processing
               |                                     |
               |        RDMA Write (result data)     |
               |   <------------------------------   |
               |        RDMA Send (RDMA_MSG)         |
               |   <------------------------------   | Reply

   Chunking and Message Continuation can be combined.  After reduction,
   the sender MAY split the reduced RPC message into multiple Payload
   streams and then send it via a Continued Message.

4.5.4.  Long Messages

   When a Payload stream is larger than the receiver's inline threshold,
   the Payload stream is reduced by removing DDP-eligible data items and
   placing them in chunks to be moved separately.  If there are no DDP-
   eligible data items in the Payload stream, or the Payload stream is
   still too large after it has been reduced, the sender either uses
   Message Continuation or uses RDMA Read or Write operations to convey
   the entire RPC message.  The latter mechanism is referred to as a
   "Long Message".

   To transmit a Long Message, the sender conveys only the Transport
   stream with an RDMA Send operation.  The Payload stream is not
   included in the Send buffer in this instance.  Instead, the Requester
   provides chunks that the Responder uses to move the Payload stream.

   Long Call
      To send a Long Call message, the Requester provides a special Read
      chunk that contains the RPC Call message's Payload stream.  Every
      RDMA read segment in this chunk MUST contain zero in its Position
      field.  This type of chunk is known as a "Position Zero Read
      chunk".

   Long Reply
      To send a Long Reply, the Requester provides a single special
      Write chunk in advance, known as the "Reply chunk", that will
      contain the RPC Reply message's Payload stream.  The Requester
      sizes the Reply chunk to accommodate the maximum expected reply
      size for that upper-layer operation.

   Though the purpose of a Long Message is to handle large RPC messages,
   Requesters MAY use a Long Message at any time to convey an RPC Call
   message.

   A Responder chooses which form of reply to use based on the chunks
   provided by the Requester.  If Write chunks were provided and the
   Responder has a DDP-eligible result, it first reduces the reply
   Payload stream.  If a Reply chunk was provided and the reduced
   Payload stream is larger than the reply inline threshold, the
   Responder MUST use the Requester-provided Reply chunk for the reply.
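
   The Responder's choice can be sketched as a small decision function.
   The sketch compares only Payload stream sizes against the reply
   inline threshold, as the text above does.  The fallback to Message
   Continuation when no Reply chunk was provided is an assumption of
   this sketch, and all names are hypothetical.

```python
from typing import Tuple


def choose_reply_form(payload_length: int,
                      ddp_eligible_length: int,
                      have_write_chunks: bool,
                      have_reply_chunk: bool,
                      reply_inline_threshold: int) -> Tuple[str, bool]:
    """Return (form, reduced) for an RPC Reply.

    form is "inline", "long", or "continued"; reduced tells whether
    DDP-eligible results were moved into Write chunks first.
    """
    reduced = have_write_chunks and ddp_eligible_length > 0
    reduced_length = payload_length
    if reduced:
        reduced_length -= ddp_eligible_length

    if reduced_length <= reply_inline_threshold:
        return "inline", reduced
    if have_reply_chunk:
        # The Requester-provided Reply chunk is used in this case.
        return "long", reduced
    # Assumption: without a Reply chunk, fall back to continuation.
    return "continued", reduced
```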

   XDR data items may appear in these special chunks without regard to
   their DDP-eligibility.  As these chunks contain a Payload stream,
   such chunks MUST include appropriate XDR roundup padding to maintain
   proper XDR alignment of their contents.

   An RPC-over-RDMA transaction using a Long Call:

           Requester                             Responder
               |        RDMA Send (RDMA_NOMSG)       |
          Call |   ------------------------------>   |
               |        RDMA Read                    |
               |   <------------------------------   |
               |        RDMA Response (RPC call)     |
               |   ------------------------------>   |
               |                                     |
               |                                     | Processing
               |                                     |
               |        RDMA Send (RDMA_MSG)         |
               |   <------------------------------   | Reply

   An RPC-over-RDMA transaction using a Long Reply:

           Requester                             Responder
               |        RDMA Send (RDMA_MSG)         |
          Call |   ------------------------------>   |
               |                                     |
               |                                     | Processing
               |                                     |
               |        RDMA Write (RPC reply)       |
               |   <------------------------------   |
               |        RDMA Send (RDMA_NOMSG)       |
               |   <------------------------------   | Reply

5.  Transport Properties

   RPC-over-RDMA version 2 provides a mechanism for connection endpoints
   to communicate information about implementation properties, enabling
   compatible endpoints to optimize data transfer.  Initially only a
   small set of transport properties are defined and a single operation
   is provided to exchange transport properties (see Section 6.4.4).

   Both the set of transport properties and the operations used to
   communicate may be extended.  Within RPC-over-RDMA version 2, all
   such extensions are OPTIONAL.  For information about existing
   transport properties, see Sections 5.1 through 5.2.  For discussion
   of extensions to the set of transport properties, see Appendix B.3.

5.1.  Transport Properties Model

   A basic set of receiver and sender properties is specified in this
   document.  An extensible approach is used, allowing new properties to
   be defined in future Standards Track documents.

   Such properties are specified using:

   o  A code point identifying the particular transport property being
      specified.

   o  A nominally opaque array which contains within it the XDR encoding
      of the specific property indicated by the associated code point.

   The following XDR types are used by operations that deal with
   transport properties:

   <CODE BEGINS>

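   /* Code point that identifies one transport property */
   typedef uint32 rpcrdma2_propid;

   /* One property: its code point and its XDR-encoded value */
   struct rpcrdma2_propval {
           rpcrdma2_propid rdma_which;
           opaque          rdma_data<>;
   };

   /* An unordered set of property values */
   typedef rpcrdma2_propval rpcrdma2_propset<>;

   /* Bit mask selecting elements of an rpcrdma2_propset */
   typedef uint32 rpcrdma2_propsubset<>;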

   <CODE ENDS>

   An rpcrdma2_propid specifies a particular transport property.  In
   order to facilitate XDR extension of the set of properties by
   concatenating XDR definition files, specific properties are defined
   as const values rather than as elements in an enum.

   An rpcrdma2_propval specifies a value of a particular transport
   property with the particular property identified by rdma_which, while
   the associated value of that property is contained within rdma_data.

   An rdma_data field which is of zero length is interpreted as
   indicating the default value of the property indicated by rdma_which.

   While rdma_data is defined as opaque within the XDR, the contents are
   interpreted (except when of length zero) using the XDR typedef
   associated with the property specified by rdma_which.  As a result,
   when rpcrdma2_propval does not conform to that typedef, the receiver
   is REQUIRED to return the error RDMA2_ERR_BAD_XDR using the header
   type RDMA2_ERROR as described in Section 6.4.3.  For example, the
   receiver of a message containing an otherwise valid rpcrdma2_propval
   returns this error if the length of rdma_data is such that it extends
   beyond the bounds of the message being transferred.

   In cases in which the rpcrdma2_propid specified by rdma_which is
   understood by the receiver, the receiver also MUST report the error
   RDMA2_ERR_BAD_XDR if either of the following occurs:

   o  The nominally opaque data within rdma_data is not valid when
      interpreted using the property-associated typedef.

   o  The length of rdma_data is insufficient to contain the data
      represented by the property-associated typedef.

   Note that no error is to be reported if rdma_which is unknown to the
   receiver.  In that case, that rpcrdma2_propval is not processed and
   processing continues using the next rpcrdma2_propval, if any.
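
   The processing rules above can be sketched as follows.  The property
   code point, its uint32 value type, and its default used here are
   placeholders invented for illustration and are not taken from
   Table 1.  The exception stands in for returning RDMA2_ERR_BAD_XDR
   in an RDMA2_ERROR message.

```python
import struct
from typing import Dict, Sequence, Tuple


class BadXdr(Exception):
    """Stands in for an RDMA2_ERROR reply with RDMA2_ERR_BAD_XDR."""


def decode_uint32(data: bytes) -> int:
    if len(data) != 4:
        raise BadXdr("rdma_data does not match the property's typedef")
    return struct.unpack(">I", data)[0]


# Hypothetical property: code point 1, uint32 value, default 4096.
EXAMPLE_PROP = 1
KNOWN_PROPERTIES = {EXAMPLE_PROP: (decode_uint32, 4096)}


def process_propset(propset: Sequence[Tuple[int, bytes]]) -> Dict[int, int]:
    """Interpret (rdma_which, rdma_data) pairs from a property set."""
    values: Dict[int, int] = {}
    for which, data in propset:
        if which not in KNOWN_PROPERTIES:
            continue  # unknown property: no error, move to the next one
        decoder, default = KNOWN_PROPERTIES[which]
        # Zero-length rdma_data selects the property's default value.
        values[which] = default if len(data) == 0 else decoder(data)
    return values
```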

   An rpcrdma2_propset specifies a set of transport properties.  No
   particular ordering of the rpcrdma2_propval items within it is
   imposed.

   An rpcrdma2_propsubset identifies a subset of the properties in a
   previously specified rpcrdma2_propset.  Each bit in the mask denotes
   a particular element in a previously specified rpcrdma2_propset.  If
   a particular rpcrdma2_propval is at position N in the array, then bit
   number N mod 32 in word N div 32 specifies whether that particular
   rpcrdma2_propval is included in the defined subset.  Words beyond the
   last one specified are treated as containing zero.
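
   The bit addressing used by rpcrdma2_propsubset can be illustrated
   with a short sketch.  The helper names are hypothetical, and the
   sketch assumes that bit number 0 is the least significant bit of
   each 32-bit word, which the text above does not state explicitly.

```python
def make_subset(positions):
    """Build the list of 32-bit mask words selecting the given positions."""
    words = []
    for n in positions:
        word, bit = divmod(n, 32)        # word N div 32, bit N mod 32
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit
    return words

def subset_contains(mask_words, n):
    """True if position n of the earlier propset is in the subset."""
    word, bit = divmod(n, 32)
    if word >= len(mask_words):
        return False                     # missing words are treated as zero
    return bool((mask_words[word] >> bit) & 1)
```

   For example, positions 0 and 33 produce two mask words, with bit 0
   set in the first word and bit 1 set in the second.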

5.2.  Current Transport Properties

   Although the set of transport properties may be extended, a basic set
   of transport properties is defined in Table 1.

   In that table, the columns contain the following information:

   o  The column labeled "Property" identifies the transport property
      described by the current row.

   o  The column labeled "Code" specifies the rpcrdma2_propid value used
      to identify this property.

   o  The column labeled "XDR type" gives the XDR type of the data used
      to communicate the value of this property.  This data type
      overlays the data portion of the nominally opaque field rdma_data
      in an rpcrdma2_propval.

   o  The column labeled "Default" gives the default value for the
      property which is to be assumed by those who do not receive, or
      are unable to interpret, information about the actual value of the
      property.

   o  The column labeled "Sec" indicates the section within this
      document that explains the semantics and use of this transport
      property.

   +----------------------------+------+----------+---------+---------+
   | Property                   | Code | XDR type | Default | Sec     |
   +----------------------------+------+----------+---------+---------+
   | Maximum Send Size          | 1    | uint32   | 4096    |  5.2.1  |
   | Receive Buffer Size        | 2    | uint32   | 4096    |  5.2.2  |
   | Maximum RDMA Segment Size  | 3    | uint32   | 1048576 |  5.2.3  |
   | Maximum RDMA Segment Count | 4    | uint32   | 16      |  5.2.4  |
   | Reverse Request Support    | 5    | uint32   | 1       |  5.2.5  |
   | Host Auth Message          | 6    | opaque<> | N/A     |  5.2.6  |
   +----------------------------+------+----------+---------+---------+

                                  Table 1

5.2.1.  Maximum Send Size

   The Maximum Send Size specifies the maximum size, in octets, of Send
   payloads.  The endpoint sending this value ensures that it will not
   transmit a Send WR payload larger than this size, allowing the
   endpoint receiving this value to size its Receive buffers
   appropriately.

   <CODE BEGINS>

   const uint32 RDMA2_PROPID_SBSIZ = 1;
   typedef uint32 rpcrdma2_prop_sbsiz;

   <CODE ENDS>

5.2.2.  Receive Buffer Size

   The Receive Buffer Size specifies the minimum size, in octets, of
   pre-posted receive buffers.  It is the responsibility of the endpoint
   sending this value to ensure that its pre-posted receive buffers are
   at least the size specified, allowing the endpoint receiving this
   value to send messages that are of this size.

   <CODE BEGINS>

   const uint32 RDMA2_PROPID_RBSIZ = 2;
   typedef uint32 rpcrdma2_prop_rbsiz;

   <CODE ENDS>

   A sender may use its knowledge of the receiver's buffer size to
   determine when the message to be sent will fit in the pre-posted
   receive buffers that the receiver has set up.  In particular,

   o  Requesters may use the value to determine when it is necessary to
      provide a Position Zero Read chunk or Message Continuation when
      sending a request.

   o  Requesters may use the value to determine when it is necessary to
      provide a Reply chunk when sending a request, based on the maximum
      possible size of the reply.

   o  Responders may use the value to determine when it is necessary,
      given the actual size of the reply, to actually use a Reply chunk
      provided by the requester.
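
   A sketch of the first and third decisions listed above follows.
   The function names and return labels are hypothetical, and sizes
   are assumed to be total Send payload sizes (transport header plus
   any inline RPC message).  The default of 4096 comes from Table 1.

```python
DEFAULT_RBSIZ = 4096   # Table 1 default when the peer advertised nothing

def fits_inline(total_size, peer_rbsiz=None):
    """True if a message of total_size fits the peer's receive buffers."""
    limit = DEFAULT_RBSIZ if peer_rbsiz is None else peer_rbsiz
    return total_size <= limit

def requester_call_strategy(call_size, responder_rbsiz=None):
    """Decide how a requester conveys a call of the given size."""
    if fits_inline(call_size, responder_rbsiz):
        return "inline"
    return "position-zero-read-chunk-or-continuation"

def responder_uses_reply_chunk(reply_size, requester_rbsiz=None):
    """Decide whether a provided Reply chunk is actually needed."""
    return not fits_inline(reply_size, requester_rbsiz)
```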

5.2.3.  Maximum RDMA Segment Size

   The Maximum RDMA Segment Size specifies the maximum size, in octets,
   of an RDMA segment this endpoint is prepared to send or receive.

   <CODE BEGINS>

   const uint32 RDMA2_PROPID_RSSIZ = 3;
   typedef uint32 rpcrdma2_prop_rssiz;

   <CODE ENDS>

5.2.4.  Maximum RDMA Segment Count

   The Maximum RDMA Segment Count specifies the maximum number of RDMA
   segments that can appear in a requester's transport header.

   <CODE BEGINS>

   const uint32 RDMA2_PROPID_RCSIZ = 4;
   typedef uint32 rpcrdma2_prop_rcsiz;

   <CODE ENDS>

5.2.5.  Reverse Request Support

   The value of this property is used to indicate a client
   implementation's readiness to accept and process messages that are
   part of reverse direction RPC requests.

   <CODE BEGINS>

   const uint32 RDMA2_RVREQSUP_NONE = 0;
   const uint32 RDMA2_RVREQSUP_INLINE = 1;
   const uint32 RDMA2_RVREQSUP_GENL = 2;

   const uint32 RDMA2_PROPID_BRS = 5;
   typedef uint32 rpcrdma2_prop_brs;

   <CODE ENDS>

   Multiple levels of support are distinguished:

   o  The value RDMA2_RVREQSUP_NONE indicates that receipt of reverse
      direction requests and replies is not supported.

   o  The value RDMA2_RVREQSUP_INLINE indicates that receipt of reverse
      direction requests or replies is only supported using inline
      messages and that use of explicit RDMA operations for reverse
      direction messages is not supported.

   o  The value RDMA2_RVREQSUP_GENL indicates that receipt of reverse
      requests or replies is supported in the same ways that forward
      direction requests or replies typically are.

   When information about this property is not provided, the support
   level of servers can be inferred from the reverse direction requests
   that they issue, assuming that issuing a request implicitly indicates
   support for receiving the corresponding reply.  On this basis,
   support for receiving inline replies can be assumed when requests
   without Read chunks, Write chunks, or Reply chunks are issued, while
   requests with any of these elements allow the client to assume that
   general support for reverse direction replies is present on the
   server.
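
   The inference described in the preceding paragraph might be
   expressed as below.  The function name and its boolean parameters
   are hypothetical; the numeric levels are the values defined in this
   section.

```python
RVREQSUP_NONE, RVREQSUP_INLINE, RVREQSUP_GENL = 0, 1, 2

def infer_server_reply_support(has_read_chunks, has_write_chunks,
                               has_reply_chunk):
    """Infer a server's support level from one reverse direction request.

    Issuing a request is taken to imply support for receiving the
    matching reply, so the result is never RVREQSUP_NONE.
    """
    if has_read_chunks or has_write_chunks or has_reply_chunk:
        return RVREQSUP_GENL       # request used explicit RDMA elements
    return RVREQSUP_INLINE         # only inline replies can be assumed
```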




5.2.6.  Host Authentication Message

   The value of this transport property is used as part of an exchange
   of host authentication material.  This property can accommodate
   authentication handshakes that require multiple challenge-response
   interactions, and potentially large amounts of material.

   <CODE BEGINS>

   const uint32 RDMA2_PROPID_HOSTAUTH = 6;
   typedef opaque rpcrdma2_prop_hostauth<>;

   <CODE ENDS>

   When this property is not provided, the peer(s) remain
   unauthenticated.  Local security policy on each peer determines
   whether the connection is permitted to continue.

6.  RPC-over-RDMA Version 2 Transport Messages

6.1.  Overall Transport Message Structure

   Each transport message consists of multiple sections:

   o  A transport header prefix, as defined in Section 6.3.2.  Among
      other things, this structure indicates the header type.

   o  The transport header proper, as defined by one of the sub-sections
      below.  See Section 6.2 for the mapping between header types and
      the corresponding header structure.

   o  Potentially, all or part of an RPC message payload being conveyed
      as an addendum to the transport header.

   This organization differs from that presented in the definition of
   RPC-over-RDMA version 1 [RFC8166], which presented the first and
   second of the items above as a single XDR item.  The new organization
   is more in keeping with RPC-over-RDMA version 2's extensibility model
   in that new header types can be defined without modifying the
   existing set of header types.

6.2.  Transport Header Types

   The new header types within RPC-over-RDMA version 2 are set forth in
   Table 2.  In that table, the columns contain the following
   information:

   o  The column labeled "Operation" specifies the particular operation.

   o  The column labeled "Code" specifies the value of header type for
      this operation.

   o  The column labeled "XDR type" gives the XDR type of the data
      structure used to describe the information in this new message
      type.  This data immediately follows the universal portion of the
      transport header present in every RPC-over-RDMA transport header.

   o  The column labeled "Msg" indicates whether this operation is
      followed (or not) by an RPC message payload.

   o  The column labeled "Sec" indicates the section (within this
      document) that explains the semantics and use of this operation.

   +-------------------------+------+-------------------+-----+--------+
   | Operation               | Code | XDR type          | Msg | Sec    |
   +-------------------------+------+-------------------+-----+--------+
   | Convey Appended RPC     | 0    | rpcrdma2_msg      | Yes |  6.4.1 |
   | Message                 |      |                   |     |        |
   | Convey External RPC     | 1    | rpcrdma2_nomsg    | No  |  6.4.2 |
   | Message                 |      |                   |     |        |
   | Report Transport Error  | 4    | rpcrdma2_err      | No  |  6.4.3 |
   | Specify Properties at   | 5    | rpcrdma2_connprop | No  |  6.4.4 |
   | Connection              |      |                   |     |        |
   +-------------------------+------+-------------------+-----+--------+

                                  Table 2

   Support for the operations in Table 2 is REQUIRED.  Support for
   additional operations will be OPTIONAL.  RPC-over-RDMA version 2
   implementations that receive an OPTIONAL operation that is not
   supported MUST respond with an RDMA2_ERROR message with an error code
   of RDMA2_ERR_INVAL_HTYPE.
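
   As an illustration of the rule above, a receiver might classify the
   header type of an incoming message as follows.  The function name,
   the return shape, and the optional_supported parameter are
   hypothetical; the codes are those of Table 2.

```python
REQUIRED_HEADER_TYPES = {
    0: "RDMA2_MSG",
    1: "RDMA2_NOMSG",
    4: "RDMA2_ERROR",
    5: "RDMA2_CONNPROP",
}

def classify_htype(htype, optional_supported=frozenset()):
    """Return ("ok", name) or ("error", error code name) for rdma_htype."""
    if htype in REQUIRED_HEADER_TYPES:
        return ("ok", REQUIRED_HEADER_TYPES[htype])
    if htype in optional_supported:
        return ("ok", "optional")
    # Unsupported operation: reply with RDMA2_ERROR / RDMA2_ERR_INVAL_HTYPE.
    return ("error", "RDMA2_ERR_INVAL_HTYPE")
```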

6.3.  RPC-over-RDMA Version 2 Headers and Chunks

   Most RPC-over-RDMA version 2 data structures are derived from
   corresponding structures in RPC-over-RDMA version 1.  As is typical
   for new versions of an existing protocol, the XDR data structures
   have new names and there are a few small changes in content.  In some
   cases, there have been structural re-organizations to enable
   protocol extensibility.

6.3.1.  Common Transport Header Prefix

   The rpcrdma_common prefix describes the first part of each RPC-over-
   RDMA transport header for version 2 and subsequent versions.

   <CODE BEGINS>

   struct rpcrdma_common {
                uint32         rdma_xid;
                uint32         rdma_vers;
                uint32         rdma_credit;
                uint32         rdma_htype;
   };

   <CODE ENDS>

   RPC-over-RDMA version 2's use of these first four words matches that
   of version 1 as required by [RFC8166].  However, there are important
   structural differences in the way that these words are described by
   the respective XDR descriptions:

   o  The header type is represented as a uint32 rather than as an enum
      that would need to be modified to reflect additions to the set of
      header types made by later extensions.

   o  The header type field is part of an XDR structure devoted to
      representing the transport header prefix, rather than being part
      of a discriminated union, that includes the body of each transport
      header type.

   o  There is now a prefix structure (see Section 6.3.2) of which the
      rpcrdma_common structure is the initial segment.  This is a newly
      defined XDR object within the protocol description, in contrast
      with RPC-over-RDMA version 1, which limits the common portion of
      all header types to the four words in rpcrdma_common.

   These changes are part of a larger structural change in the XDR
   description of RPC-over-RDMA version 2 that enables a cleaner
   treatment of protocol extension.  The XDR appearing in Section 7
   reflects these changes, which are discussed in further detail in
   Appendix C.1.

6.3.2.  RPC-over-RDMA Version 2 Transport Header Prefix

   The following prefix structure appears at the start of any RPC-over-
   RDMA version 2 transport header.

   <CODE BEGINS>

   const RPCRDMA2_F_RESPONSE           = 0x00000001;
   const RPCRDMA2_F_MORE               = 0x00000002;

   struct rpcrdma2_hdr_prefix {
           struct rpcrdma_common       rdma_start;
           uint32                      rdma_flags;
   };

   <CODE ENDS>

   The rdma_flags field is new to RPC-over-RDMA version 2.  Currently,
   the only flags defined within this word are the RPCRDMA2_F_RESPONSE
   flag and the RPCRDMA2_F_MORE flag.  The other bits are reserved for
   future use as described in Appendix B.2.  The sender MUST set these
   reserved bits to zero.
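
   The on-the-wire layout of the header prefix can be sketched as five
   consecutive XDR uint32 values (each four octets, big-endian).  The
   function names below are hypothetical, and this is an illustration
   of the layout rather than a reference encoder.

```python
import struct

RPCRDMA2_F_RESPONSE = 0x00000001
RPCRDMA2_F_MORE = 0x00000002
DEFINED_FLAGS = RPCRDMA2_F_RESPONSE | RPCRDMA2_F_MORE

# rdma_xid, rdma_vers, rdma_credit, rdma_htype, rdma_flags
PREFIX = struct.Struct(">5I")

def encode_prefix(xid, vers, credit, htype, flags):
    """Encode an rpcrdma2_hdr_prefix; reserved flag bits must be zero."""
    if flags & ~DEFINED_FLAGS:
        raise ValueError("reserved rdma_flags bits must be zero")
    return PREFIX.pack(xid, vers, credit, htype, flags)

def decode_prefix(buf):
    """Decode the first 20 octets of a transport header into a dict."""
    xid, vers, credit, htype, flags = PREFIX.unpack_from(buf)
    return {"xid": xid, "vers": vers, "credit": credit,
            "htype": htype, "flags": flags}
```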

6.3.2.1.  RPCRDMA2_F_RESPONSE Flag

   The RPCRDMA2_F_RESPONSE flag qualifies the value contained in the
   transport header's rdma_start.rdma_xid field.  The
   RPCRDMA2_F_RESPONSE flag enables a receiver to reliably avoid
   performing an XID lookup on incoming reverse direction Call messages.

   In general, when a message carries an XID that was generated by the
   message's receiver (that is, the receiver is acting as a requester),
   the message's sender sets the RPCRDMA2_F_RESPONSE flag.  Otherwise
   that flag is clear.  For example:

   o  When the rdma_start.rdma_htype field has the value RDMA2_MSG or
      RDMA2_NOMSG, the value of the RPCRDMA2_F_RESPONSE flag MUST be the
      same as the value of the associated RPC message's msg_type field.

   o  When the header type is anything else and a whole or partial RPC
      message payload is present, the value of the RPCRDMA2_F_RESPONSE
      flag MUST be the same as the value of the associated RPC message's
      msg_type field.

   o  When no RPC message payload is present, a requester MUST set the
      value of RPCRDMA2_F_RESPONSE to reflect how the receiver is to
      interpret the rdma_start.rdma_xid field.

   o  When the rdma_start.rdma_htype field has the value RDMA2_ERROR,
      the RPCRDMA2_F_RESPONSE flag MUST be set.
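
   The rules in the list above can be condensed into one function.
   This is a sketch: the function name and the xid_from_receiver
   parameter are hypothetical.  CALL and REPLY are the RPC msg_type
   values defined in RFC 5531.

```python
CALL, REPLY = 0, 1     # RPC msg_type values (RFC 5531)
RDMA2_ERROR = 4        # header type code from Table 2

def response_flag(htype, msg_type=None, xid_from_receiver=False):
    """Return True if the sender sets RPCRDMA2_F_RESPONSE.

    msg_type is the msg_type of the RPC message payload, or None when
    no payload is present.  xid_from_receiver says whether rdma_xid
    was generated by the peer that will receive this message.
    """
    if htype == RDMA2_ERROR:
        return True                    # always set for RDMA2_ERROR
    if msg_type is not None:
        return msg_type == REPLY       # flag mirrors msg_type
    return xid_from_receiver           # no payload: describe the XID
```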






6.3.2.2.  RPCRDMA2_F_MORE Flag

   The RPCRDMA2_F_MORE flag signifies that the RPC-over-RDMA message
   payload continues in the next message.  This is referred to as
   Message Continuation, or Send chaining.

   When the RPCRDMA2_F_MORE flag is asserted, the receiver is to
   concatenate the data payload of the next received message to the end
   of the data payload of the current received message.  The sender
   clears the RPCRDMA2_F_MORE flag in the final message in the sequence.

   All RPC-over-RDMA messages in such a sequence MUST have the same
   values in the rdma_start.rdma_xid and rdma_start.rdma_htype fields.
   If this constraint is not met, the receiver MUST respond with an
   RDMA2_ERROR message with the rdma_err field set to
   RDMA2_ERR_INVAL_FLAG.

   If a peer receives an RPC-over-RDMA message where the RPCRDMA2_F_MORE
   flag is set and the rdma_start.rdma_htype field does not contain
   RDMA2_MSG or RDMA2_CONNPROP, the receiver MUST respond with an
   RDMA2_ERROR message with the rdma_err field set to
   RDMA2_ERR_INVAL_FLAG.

   [ dnoveck: Both the above and your error in the existing third
   paragraph raise issues since they could be sent by a responder.  Will
   need to fix RDMA2_ERROR so that this can be done when appropriate. ]

   When the RPCRDMA2_F_MORE flag is set in an individual message, that
   message's chunk lists MUST be empty.  Chunks for a chained message
   may be conveyed in the final message in the sequence, whose
   RPCRDMA2_F_MORE flag is clear.
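
   Receiver-side handling of Message Continuation, as described above,
   might look like the following sketch.  The message representation
   (a dict per received message) and all names are hypothetical.  The
   chunk list rule is not checked here because no specific error
   response for it is named above.

```python
RDMA2_MSG, RDMA2_CONNPROP = 0, 5     # header type codes from Table 2

class InvalFlag(Exception):
    """Stands in for an RDMA2_ERROR reply with RDMA2_ERR_INVAL_FLAG."""

def reassemble(messages):
    """Concatenate the payloads of one continued sequence.

    Each message is a dict with keys xid, htype, more (bool) and
    payload (bytes).  Returns the complete payload, or None if the
    final message (RPCRDMA2_F_MORE clear) has not arrived yet.
    """
    payload = bytearray()
    first = None
    for msg in messages:
        if first is None:
            first = msg
        elif (msg["xid"], msg["htype"]) != (first["xid"], first["htype"]):
            raise InvalFlag("xid or htype changed within a sequence")
        if msg["more"] and msg["htype"] not in (RDMA2_MSG, RDMA2_CONNPROP):
            raise InvalFlag("RPCRDMA2_F_MORE not allowed for this htype")
        payload += msg["payload"]
        if not msg["more"]:
            return bytes(payload)
    return None
```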

   There is no protocol-defined limit on the number of concatenated
   messages in a sequence.  If the sender exhausts the receiver's credit
   grant before the final message is sent, the sender MUST wait for a
   further credit grant from the receiver before continuing to send
   messages.

   Credit exhaustion can occur at the receiver in the middle of a
   sequence of continued messages.  To enable the sender to continue
   sending the remaining messages in the sequence, the receiver can
   grant more credits by sending an RPC message payload or an out-of-
   band credit grant (see Section 4.3.1.2).

6.3.3.  Describing External Data Payloads

   The rpcrdma2_chunk_lists structure specifies how an RPC message is
   conveyed using explicit RDMA operations.

   <CODE BEGINS>

   struct rpcrdma2_chunk_lists {
           uint32                      rdma_inv_handle;
           struct rpcrdma2_read_list   *rdma_reads;
           struct rpcrdma2_write_list  *rdma_writes;
           struct rpcrdma2_write_chunk *rdma_reply;
   };

   <CODE ENDS>

   For the most part this structure parallels its RPC-over-RDMA version
   1 equivalent.  That is, the rdma_reads, rdma_writes, and rdma_reply
   fields provide, respectively, descriptions of the chunks used to read
   a Long message or directly placed data from the requester, to write
   directly placed response data into the requester's memory, and to
   write a Long reply into the requester's memory.

6.3.3.1.  Chunks and Chunk Lists

   The chunks and chunk list structures follow the same rules as in
   Section 3.4 of [RFC8166], with these exceptions:

   o  In RPC-over-RDMA version 1, there were cases where XDR padding was
      allowed to appear in a reduced XDR data item.  However, in RPC-
      over-RDMA version 2, requesters and responders MUST NOT include
      XDR padding in reduced Read and Write chunks, but chunks that make
      up Position Zero Read chunks and Reply chunks MUST include all XDR
      padding.

   o  A responder MUST use Message Continuation if the requester does
      not provide a Reply chunk and the actual size of the reply is
      larger than the connection's inline threshold.  A responder MAY
      use Message Continuation even if the requester has provided
      adequate Reply resources.  This makes it unnecessary for RPC-over-
      RDMA version 2 requesters to have perfect reply size estimation.

6.3.3.2.  Remote Invalidation

   An important addition relative to the corresponding RPC-over-RDMA
   version 1 rdma_header structures is the rdma_inv_handle field.  This
   field supports remote invalidation of requester memory registrations
   via the RDMA Send With Invalidate operation.

   To request Remote Invalidation, a requester sets the value of the
   rdma_inv_handle field in an RPC Call's transport header to a non-zero
   value that matches one of the rdma_handle fields in that header.  If
   none of the rdma_handle values in the header conveying the Call may
   be invalidated by the responder, the requester sets the RPC Call's
   rdma_inv_handle field to the value zero.

   If the responder chooses not to use remote invalidation for this
   particular RPC Reply, or the RPC Call's rdma_inv_handle field
   contains the value zero, the responder uses RDMA Send to transmit the
   matching RPC reply.

   If a requester has provided a non-zero value in the RPC Call's
   rdma_inv_handle field and the responder chooses to use Remote
   Invalidation for the matching RPC Reply, the responder uses RDMA Send
   With Invalidate to transmit that RPC reply, and uses the value in the
   corresponding Call's rdma_inv_handle field to construct the Send With
   Invalidate Work Request.
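
   The requester and responder choices described above are sketched
   below.  The function names, the tuple return values, and the
   notion of an "invalidatable" set are hypothetical; an actual
   implementation would post the corresponding Work Request through
   its RDMA provider.

```python
def choose_inv_handle(rdma_handles, invalidatable):
    """Requester side: pick the rdma_inv_handle value for a Call.

    rdma_handles are the handles present in the Call's transport
    header; invalidatable is the subset the responder may invalidate.
    """
    for handle in rdma_handles:
        if handle != 0 and handle in invalidatable:
            return handle
    return 0                      # zero means "do not invalidate"

def reply_send_op(inv_handle, responder_wants_invalidate):
    """Responder side: choose the operation used to send the Reply."""
    if inv_handle != 0 and responder_wants_invalidate:
        return ("SEND_WITH_INVALIDATE", inv_handle)
    return ("SEND", None)
```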

6.4.  Header Types Defined in RPC-over-RDMA version 2

   The header types defined and used in RPC-over-RDMA version 1 are all
   carried over into RPC-over-RDMA version 2, although there may be
   limited changes in the definition of existing header types.

   In comparison with the header types of RPC-over-RDMA version 1, the
   changes can be summarized as follows:

   o  To simplify interoperability with RPC-over-RDMA version 1, only
      the RDMA2_ERROR header (defined in Section 6.4.3) has an XDR
      definition that differs from that in RPC-over-RDMA version 1, and
      its modifications are all compatible extensions.

   o  RDMA2_MSG and RDMA2_NOMSG (defined in Sections 6.4.1 and 6.4.2)
      have XDR definitions that match the corresponding
      RPC-over-RDMA version 1 header types.  However, because of the
      changes to the header prefix, the version 1 and version 2 header
      types differ in on-the-wire format.

   o  RDMA2_CONNPROP (defined in Section 6.4.4) is a completely new
      header type devoted to enabling connection peers to exchange
      information about their transport properties.

6.4.1.  RDMA2_MSG: Convey RPC Message Inline

   RDMA2_MSG is used to convey an RPC message that immediately follows
   the Transport Header in the Send buffer.  This is either an RPC
   request that has no Position Zero Read chunk or an RPC reply that is
   not sent using a Reply chunk.

   <CODE BEGINS>

   const rpcrdma2_proc RDMA2_MSG = 0;

   struct rpcrdma2_msg {
           struct rpcrdma2_chunk_lists  rdma_chunks;

           /* The rpc message starts here and continues
            * through the end of the transmission. */
           uint32                       rdma_rpc_first_word;
   };

   <CODE ENDS>

6.4.2.  RDMA2_NOMSG: Convey External RPC Message

   RDMA2_NOMSG can convey an entire RPC message payload using explicit
   RDMA operations.  When an RPC message payload is present, this
   message type is also known as a Long message.  In particular, it is a
   Long call when the responder reads the RPC payload from a memory area
   specified by a Position Zero Read chunk; and it is a Long reply when
   the responder writes the RPC payload into a memory area specified by
   a Reply chunk.  In both of these cases, the rdma_xid field is set to
   the same value as the xid of the RPC message payload.

   If all the chunk lists are empty (i.e., three 32-bit zeroes in the
   chunk list fields), the message conveys a credit grant refresh.  The
   header prefix of this message contains a credit grant refresh in the
   rdma_credit field.  In this case, the sender MUST set the rdma_xid
   field to zero.

   <CODE BEGINS>

   const rpcrdma2_proc RDMA2_NOMSG = 1;

   struct rpcrdma2_nomsg {
           struct rpcrdma2_chunk_lists  rdma_chunks;
   };

   <CODE ENDS>
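
   The credit grant refresh form of RDMA2_NOMSG can be sketched as a
   20-octet header prefix followed by rdma_inv_handle and three zero
   words, one per empty chunk list.  The function names are
   hypothetical.  Passing zero for rdma_flags and for rdma_inv_handle
   is an assumption of this sketch, not a rule stated above.

```python
import struct

RDMA2_NOMSG = 1          # header type code from Table 2
RPCRDMA_VERSION = 2      # value carried in rdma_vers for this protocol

def encode_credit_refresh(credits, flags=0):
    """Build an RDMA2_NOMSG message that only refreshes the credit grant."""
    # rdma_xid is zero, as required when all chunk lists are empty.
    prefix = struct.pack(">5I", 0, RPCRDMA_VERSION, credits,
                         RDMA2_NOMSG, flags)
    # rdma_inv_handle, then three absent optional-data items.
    chunk_lists = struct.pack(">4I", 0, 0, 0, 0)
    return prefix + chunk_lists

def is_credit_refresh(msg):
    """True if msg is an RDMA2_NOMSG whose three chunk lists are empty."""
    fields = struct.unpack_from(">9I", msg)
    htype, reads, writes, reply = fields[3], fields[6], fields[7], fields[8]
    return htype == RDMA2_NOMSG and (reads, writes, reply) == (0, 0, 0)
```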

   In RPC-over-RDMA version 2, an alternative to using a Long message is
   to use Message Continuation.

6.4.3.  RDMA2_ERROR: Report Transport Error

   RDMA2_ERROR provides a way of reporting the occurrence of transport
   errors on a previous transmission.  This header type MUST NOT be
   transmitted by a requester.

   <CODE BEGINS>

   const rpcrdma2_proc RDMA2_ERROR = 4;

   struct rpcrdma2_err_vers {
           uint32 rdma_vers_low;
           uint32 rdma_vers_high;
   };

   struct rpcrdma2_err_write {
           uint32 rdma_chunk_index;
           uint32 rdma_length_needed;
   };

   union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
           case RDMA2_ERR_VERS:
             rpcrdma2_err_vers rdma_vrange;
           case RDMA2_ERR_READ_CHUNKS:
             uint32 rdma_max_chunks;
           case RDMA2_ERR_WRITE_CHUNKS:
             uint32 rdma_max_chunks;
           case RDMA2_ERR_SEGMENTS:
             uint32 rdma_max_segments;
           case RDMA2_ERR_WRITE_RESOURCE:
             rpcrdma2_err_write rdma_writeres;
           case RDMA2_ERR_REPLY_RESOURCE:
             uint32 rdma_length_needed;
           default:
             void;
   };

   <CODE ENDS>

   Error reporting is addressed in RPC-over-RDMA version 2 in a fashion
   similar to RPC-over-RDMA version 1.  Several new error codes have
   been added, and error messages never flow from requester to
   responder.  RPC-over-RDMA version 1 error reporting is described in
   Section 5 of [RFC8166].

   Unless otherwise specified, in all cases below, the responder copies
   the values of the rdma_start.rdma_xid and rdma_start.rdma_vers fields
   from the incoming transport header that generated the error to the
   transport header of the error response.  The responder sets the
   rdma_start.rdma_htype field of the transport header prefix to
   RDMA2_ERROR, and the rdma_start.rdma_credit field is set to the
   credit grant value for this connection.  The receiver of this header
   type MUST ignore the value of the rdma_start.rdma_credit field.

   RDMA2_ERR_VERS
      This is the equivalent of ERR_VERS in RPC-over-RDMA version 1.
      The error code value, semantics, and utilization are the same.

   RDMA2_ERR_INVAL_HTYPE
      If a responder recognizes the value in the rdma_start.rdma_vers
      field, but it does not recognize the value in the
      rdma_start.rdma_htype field or does not support that header type,
      it MUST set the rdma_err field to RDMA2_ERR_INVAL_HTYPE.

   RDMA2_ERR_INVAL_FLAG
      If a receiver recognizes the value in the rdma_start.rdma_htype
      field but does not recognize the combination of flags in the
      rdma_flags field, it MUST set the rdma_err field to
      RDMA2_ERR_INVAL_FLAG.

   RDMA2_ERR_BAD_XDR
      If a responder recognizes the values in the rdma_start.rdma_vers
      and rdma_start.rdma_htype fields, but the incoming RPC-over-RDMA
      transport header cannot be parsed, it MUST set the rdma_err field
      to RDMA2_ERR_BAD_XDR.  This includes cases in which a nominally
      opaque property value field cannot be parsed using the XDR typedef
      associated with the transport property definition.  The error code
      value of RDMA2_ERR_BAD_XDR is the same as the error code value of
      ERR_CHUNK in RPC-over-RDMA version 1.  The responder MUST NOT
      process the request in any way except to send an error message.

   RDMA2_ERR_READ_CHUNKS
      If a requester presents more DDP-eligible arguments than the
      responder is prepared to Read, the responder MUST set the rdma_err
      field to RDMA2_ERR_READ_CHUNKS, and set the rdma_max_chunks field
      to the maximum number of Read chunks the responder can receive and
      process.

      If the responder implementation cannot handle any Read chunks for
      a request, it MUST set the rdma_max_chunks to zero in this
      response.  The requester SHOULD resend the request using a
      Position Zero Read chunk.  If this was a request using a Position
      Zero Read chunk, the requester MUST terminate the transaction with
      an error.

   RDMA2_ERR_WRITE_CHUNKS
      If a requester has constructed an RPC Call message with more DDP-
      eligible results than the server is prepared to Write, the
      responder MUST set the rdma_err field to RDMA2_ERR_WRITE_CHUNKS,
      and set the rdma_max_chunks field to the maximum number of Write
      chunks the responder can process and return.

      If the responder implementation cannot handle any Write chunks for
      a request, it MUST return a response of RDMA2_ERR_REPLY_RESOURCE
      (below).  The requester SHOULD resend the request with no Write
      chunks and a Reply chunk of appropriate size.

   RDMA2_ERR_SEGMENTS
      If a requester has constructed an RPC Call message with a chunk
      that contains more segments than the responder supports, the
      responder MUST set the rdma_err field to RDMA2_ERR_SEGMENTS, and
      set the rdma_max_segments field to the maximum number of segments
      the responder can process.

   RDMA2_ERR_WRITE_RESOURCE
      If a requester has provided a Write chunk that is not large enough
      to fully convey a DDP-eligible result, the responder MUST set the
      rdma_err field to RDMA2_ERR_WRITE_RESOURCE.

      The responder MUST set the rdma_chunk_index field to point to the
      first Write chunk in the transport header that is too short, or to
      zero to indicate that it was not possible to determine which chunk
      is too small.  Indexing starts at one (1), which represents the
      first Write chunk.  The responder MUST set the rdma_length_needed
      to the number of bytes needed in that chunk in order to convey the
      result data item.

      Upon receipt of this error code, a requester MAY choose to
      terminate the operation (for instance, if the responder set the
      index and length fields to zero), or it MAY send the request again
      using the same XID and more reply resources.

   RDMA2_ERR_REPLY_RESOURCE
      If an RPC Reply's Payload stream does not fit inline and the
      requester has not provided a large enough Reply chunk to convey
      the stream, the responder MUST set the rdma_err field to
      RDMA2_ERR_REPLY_RESOURCE.  The responder MUST set the
      rdma_length_needed to the number of Reply chunk bytes needed to
      convey the reply.

      Upon receipt of this error code, a requester MAY choose to
      terminate the operation (for instance, if the responder set the
      rdma_length_needed field to zero), or it MAY send the request
      again using the same XID and larger reply resources.

   RDMA2_ERR_SYSTEM
      If some problem occurs on a responder that does not fit into the
      above categories, the responder MAY report it to the sender by
      setting the rdma_err field to RDMA2_ERR_SYSTEM.

      This is a permanent error: a requester that receives this error
      MUST terminate the RPC transaction associated with the XID value
      in the rdma_start.rdma_xid field.
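
   One possible requester policy for the error codes above is
   sketched below.  It is an illustration only: the function name,
   its parameters, and the "resend"/"terminate" labels are
   hypothetical, and where the text permits either choice the sketch
   picks one.

```python
def requester_action(err, used_position_zero=False, length_needed=0):
    """Return "resend" or "terminate" for a received error code name."""
    if err == "RDMA2_ERR_READ_CHUNKS":
        # Resend with a Position Zero Read chunk unless one was
        # already used for this request.
        return "terminate" if used_position_zero else "resend"
    if err in ("RDMA2_ERR_WRITE_RESOURCE", "RDMA2_ERR_REPLY_RESOURCE"):
        # Either choice is permitted; this policy retries only when
        # the responder reported a usable (non-zero) length.
        return "resend" if length_needed > 0 else "terminate"
    # RDMA2_ERR_SYSTEM is permanent.  Treating every other code the
    # same way is a conservative assumption, not a requirement.
    return "terminate"
```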

6.4.4.  RDMA2_CONNPROP: Advertise Transport Properties

   The RDMA2_CONNPROP message type allows an RPC-over-RDMA endpoint,
   whether client or server, to indicate to its partner relevant
   transport properties that the partner might need to be aware of.

   The message definition for this operation is as follows:

   <CODE BEGINS>

   struct rpcrdma2_connprop {
           rpcrdma2_propset rdma_props;
   };

   <CODE ENDS>

   All relevant transport properties that the sender is aware of should
   be included in rdma_props.  Since support of each of the properties
   is OPTIONAL, the sender cannot assume that the receiver will
   necessarily take note of these properties.  The sender should be
   prepared for cases in which the receiver continues to assume that the
   default value for a particular property is still in effect.

   Generally, a participant will send an RDMA2_CONNPROP message as the
   first message after a connection is established.  Given that fact,
   the sender should make sure that the message can be received by peers
   who use the default Receive Buffer Size.  The connection's initial
   receive buffer size is typically 1KB, but it depends on the initial
   connection state of the RPC-over-RDMA version in use.

   Properties not included in rdma_props are to be treated by the peer
   endpoint as having the default value and are not allowed to change
   subsequently.  The peer should not request changes in such
   properties.

   Those receiving an RDMA2_CONNPROP may encounter properties that they
   do not support or are unaware of.  In such cases, these properties
   are simply ignored without any error response being generated.
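
   A receiver's handling of rdma_props might look like the following
   minimal Python sketch.  It assumes the property values have already
   been decoded; the function name apply_connprop and its arguments
   are hypothetical, and the default values shown in the usage note
   are placeholders rather than values taken from this document.

```python
def apply_connprop(defaults, supported, received):
    """Return the property values in effect after one RDMA2_CONNPROP.

    defaults  -- dict mapping each supported propid to its default value
    supported -- set of propids this receiver understands
    received  -- list of (propid, value) pairs taken from rdma_props
    """
    # Properties absent from rdma_props keep their default value.
    in_effect = dict(defaults)
    for propid, value in received:
        if propid not in supported:
            # Unknown or unsupported property: ignore it, no error.
            continue
        in_effect[propid] = value
    return in_effect
```

   With defaults {1: 1024, 2: 1024}, supported propids {1, 2}, and a
   received list [(2, 4096), (99, 7)], the result keeps the default
   for propid 1, adopts 4096 for propid 2, and silently drops
   propid 99.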

6.5.  Choosing a Reply Mechanism

   A requester provides any necessary registered memory resources for
   both an RPC Call message and its matching RPC Reply message.  A
   requester forms each RPC Call itself, thus it can compute the exact
   memory resources needed to send every Call.  However, the requester
   must allocate memory resources to receive the corresponding Reply
   before the responder has formed it.  In some cases it is difficult
   for the requester to know in advance precisely what resources will be
   needed to receive the Reply.

   In RPC-over-RDMA version 2, a requester MAY provide a Reply chunk at
   any time.  The responder MAY use the provided Reply chunk or decide
   to use another means to convey the RPC Reply.  If the combination of
   the provided Write chunk list and Reply chunk is not adequate to
   convey a Reply, the responder SHOULD use Message Continuation (see
   Section 6.3.2.2) to send that Reply.

   If even that is not possible, the responder sends an RDMA2_ERROR
   message to the requester, as described in Section 6.4.3:

   o  The responder MUST send an RDMA2_ERR_WRITE_RESOURCE error if the
      Write chunk list cannot accommodate the ULP's DDP-eligible data
      payload.

   o  The responder MUST send an RDMA2_ERR_REPLY_RESOURCE error if the
      Reply chunk cannot accommodate the non-DDP-eligible parts of the
      Reply.

   When receiving such errors, the requester SHOULD retry the ULP call
   using larger reply resources.  In cases where retrying the ULP
   request is not possible, the requester terminates the RPC request and
   presents an error to the RPC consumer.
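
   The responder's decision sequence in this section can be sketched
   as follows.  This Python fragment is a simplified illustration, not
   a definitive implementation: the function name, its parameters, and
   the returned strings are hypothetical, and preferring an inline
   reply over the Reply chunk is an assumed policy, since the text
   leaves that choice to the responder.

```python
def choose_reply_mechanism(ddp_len, payload_len, inline_max,
                           write_capacity, reply_capacity,
                           continuation_ok):
    """Pick how a responder conveys one RPC Reply (illustrative only).

    ddp_len         -- bytes of DDP-eligible data in the Reply
    payload_len     -- bytes of the remaining (non-DDP-eligible) Payload
    inline_max      -- largest Payload that can be sent inline
    write_capacity  -- bytes available in the provided Write chunk list
    reply_capacity  -- bytes available in the provided Reply chunk
    continuation_ok -- True if Message Continuation can be used
    """
    write_ok = ddp_len <= write_capacity
    if write_ok and payload_len <= inline_max:
        return "inline"
    if write_ok and payload_len <= reply_capacity:
        return "reply_chunk"
    if continuation_ok:
        # Provided resources are inadequate: fall back to continuation.
        return "continuation"
    if not write_ok:
        return "RDMA2_ERR_WRITE_RESOURCE"
    return "RDMA2_ERR_REPLY_RESOURCE"
```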

7.  XDR Protocol Definition

   This section contains a description of the core features of the RPC-
   over-RDMA version 2 protocol expressed in the XDR language [RFC4506].

   Because of the need to provide for protocol extensibility without
   modifying an existing XDR definition, this description has some
   important structural differences from the corresponding XDR
   description for RPC-over-RDMA version 1, which appears in [RFC8166].

   This description is divided into three parts:

   o  A code component license which appears in Section 7.1.

   o  An XDR description of the structures that are generally available
      for use by transport header types including both those defined in
      this document and those that may be defined as extensions.  This
      includes definitions of the chunk-related structures derived from
      RPC-over-RDMA version 1, the transport property model introduced
      in this document, and a definition of the transport header
      prefixes that precede the various transport header types.  This
      appears in Section 7.3.

   o  An XDR description of the transport header types defined in this
      document, including those derived from RPC-over-RDMA version 1 and
      those introduced in RPC-over-RDMA version 2.  This appears in
      Section 7.4.

   This description is provided in a way that makes it simple to extract
   into ready-to-compile form.  To enable the combination of this
   description with the descriptions of subsequent extensions to RPC-
   over-RDMA version 2, the extracted description can be combined with
   similar descriptions published later, or those descriptions can be
   compiled separately.  Refer to Section 7.2 for details.

7.1.  Code Component License

   Code components extracted from this document must include the
   following license text.  When the extracted XDR code is combined with
   other complementary XDR code which itself has an identical license,
   only a single copy of the license text need be preserved.

   <CODE BEGINS>

   /// /*
   ///  * Copyright (c) 2010-2018 IETF Trust and the persons
   ///  * identified as authors of the code.  All rights reserved.
   ///  *
   ///  * The authors of the code are:
   ///  * B. Callaghan, T. Talpey, C. Lever, and D. Noveck.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * - Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * - Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * - Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  */
   ///

   <CODE ENDS>

7.2.  Extraction and Use of XDR Definitions

   The reader can apply the following sed script to this document to
   produce a machine-readable XDR description of the RPC-over-RDMA
   version 2 protocol without any OPTIONAL extensions.

   <CODE BEGINS>

   sed -n -e 's:^ */// ::p' -e 's:^ *///$::p'

   <CODE ENDS>

   That is, if this document is in a file called "spec.txt" then the
   reader can do the following to extract an XDR description file and
   store it in the file rpcrdma-v2.x.

   <CODE BEGINS>

   sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' \
        < spec.txt > rpcrdma-v2.x

   <CODE ENDS>

   Although this file is a usable description of the base protocol, when
   extensions are to be supported, it may be desirable to divide it into
   multiple files.  The following script can be used for that purpose:

   <CODE BEGINS>

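   #!/usr/local/bin/perl
   # Split rpcrdma-v2.x at each "FILE ENDS: <name>;" marker line.
   open(IN,"rpcrdma-v2.x") or die "rpcrdma-v2.x: $!";
   open(OUT,">temp.x") or die "temp.x: $!";
   while(<IN>)
   {
     # Capture only the file name, which ends at the semicolon.
     if (m/FILE ENDS: ([^;\s]+);/)
       {
         close(OUT);
         rename("temp.x", $1);
         open(OUT,">temp.x") or die "temp.x: $!";
       }
       else
       {
         print OUT $_;
       }
   }
   close(IN);
   close(OUT);
   # Discard whatever followed the final marker.
   unlink("temp.x");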

   <CODE ENDS>

   Running the above script will result in two files:

   o  The file common.x, containing the license plus the common XDR
      definitions which need to be made available to both the base
      operations and any subsequent extensions.

   o  The file baseops.x containing the XDR definitions for the base
      operations, defined in this document.
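
   The same extraction and splitting can be done in one pass in other
   languages.  The following Python sketch is offered only as an
   illustration; the function names extract_xdr and split_files are
   hypothetical, and the sketch assumes that each marker line has the
   form "FILE ENDS: <name>;" with the file name ending at the
   semicolon.

```python
import re


def extract_xdr(spec_text):
    """Mimic the sed script: keep only the text after a "///" marker."""
    out = []
    for line in spec_text.splitlines():
        m = re.match(r" */// (.*)$", line)
        if m:
            out.append(m.group(1))
        elif re.match(r" *///$", line):
            out.append("")
    return out


def split_files(xdr_lines):
    """Split extracted lines at each "FILE ENDS: <name>;" marker."""
    files, current = {}, []
    for line in xdr_lines:
        m = re.search(r"FILE ENDS: ([^;\s]+);", line)
        if m:
            # Everything gathered so far belongs to the named file.
            files[m.group(1)] = "\n".join(current) + "\n"
            current = []
        else:
            current.append(line)
    return files
```

   Applied to the text of this document, split_files(extract_xdr(text))
   is expected to return a dictionary whose keys are the names in the
   marker lines, with the file contents as values.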

   Optional extensions to RPC-over-RDMA version 2, published as
   Standards Track documents, will have similar means of providing XDR
   that describes those extensions.  Once XDR for all desired extensions
   is also extracted, it can be appended to the XDR description file
   extracted from this document to produce a consolidated XDR
   description file reflecting all extensions selected for an RPC-over-
   RDMA implementation.

   Alternatively, the XDR descriptions can be compiled separately.  In
   this case, the combination of common.x and baseops.x defines the base
   transport.  Each extension is then described by the XDR from the
   document defining that extension, together with the file common.x
   obtained from this document.

7.3.  XDR Definition for RPC-over-RDMA Version 2 Core Structures

<CODE BEGINS>
/// /*******************************************************************
///  *    Transport Header Prefixes
///  ******************************************************************/
///
/// struct rpcrdma_common {
///         uint32         rdma_xid;
///         uint32         rdma_vers;
///         uint32         rdma_credit;
///         uint32         rdma_htype;
/// };
///
/// const RPCRDMA2_F_RESPONSE = 0x00000001;
/// const RPCRDMA2_F_MORE     = 0x00000002;
///
/// struct rpcrdma2_hdr_prefix {
///         struct rpcrdma_common       rdma_start;
///         uint32                      rdma_flags;
/// };
///
/// /*******************************************************************
///  *    Chunks and Chunk Lists
///  ******************************************************************/
///
/// struct rpcrdma2_segment {
///         uint32 rdma_handle;
///         uint32 rdma_length;
///         uint64 rdma_offset;
/// };
///
/// struct rpcrdma2_read_segment {
///         uint32                  rdma_position;
///         struct rpcrdma2_segment rdma_target;
/// };
///
/// struct rpcrdma2_read_list {
///         struct rpcrdma2_read_segment rdma_entry;
///         struct rpcrdma2_read_list    *rdma_next;
/// };
///
/// struct rpcrdma2_write_chunk {
///         struct rpcrdma2_segment rdma_target<>;
/// };
///
/// struct rpcrdma2_write_list {
///         struct rpcrdma2_write_chunk rdma_entry;
///         struct rpcrdma2_write_list  *rdma_next;
/// };
///
/// struct rpcrdma2_chunk_lists {
///         uint32                      rdma_inv_handle;
///         struct rpcrdma2_read_list   *rdma_reads;
///         struct rpcrdma2_write_list  *rdma_writes;
///         struct rpcrdma2_write_chunk *rdma_reply;
/// };
///
/// /*******************************************************************
///  *    Transport Properties
///  ******************************************************************/
///
/// /*
///  * Types for transport properties model
///  */
/// typedef uint32 rpcrdma2_propid;
///
/// struct rpcrdma2_propval {
///         rpcrdma2_propid rdma_which;
///         opaque          rdma_data<>;
/// };
///
/// typedef rpcrdma2_propval rpcrdma2_propset<>;
/// typedef uint32 rpcrdma2_propsubset<>;
///
/// /*
///  * Transport propid values for basic properties
///  */
/// const RDMA2_PROPID_SBSIZ = 1;
/// const RDMA2_PROPID_RBSIZ = 2;
/// const RDMA2_PROPID_RSSIZ = 3;
/// const RDMA2_PROPID_RCSIZ = 4;
/// const RDMA2_PROPID_BRS = 5;
/// const RDMA2_PROPID_HOSTAUTH = 6;
///
/// /*
///  * Types specific to particular properties
///  */
/// typedef uint32 rpcrdma2_prop_sbsiz;
/// typedef uint32 rpcrdma2_prop_rbsiz;
/// typedef uint32 rpcrdma2_prop_rssiz;
/// typedef uint32 rpcrdma2_prop_rcsiz;
/// typedef uint32 rpcrdma2_prop_brs;
/// typedef opaque rpcrdma2_prop_hostauth<>;
///
/// const RDMA_RVREQSUP_NONE = 0;
/// const RDMA_RVREQSUP_INLINE = 1;
/// const RDMA_RVREQSUP_GENL = 2;
///
/// /* FILE ENDS: common.x; */

<CODE ENDS>
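
   To show how the chunk structures above appear on the wire, the
   following Python sketch encodes a segment and a Read list using the
   standard XDR rules [RFC4506].  It is an illustration only; the
   function names are hypothetical, and the "*" pointers are assumed
   to use the usual XDR optional-data encoding (a 4-byte TRUE before
   each present item and a 4-byte FALSE at the end of the list).

```python
import struct


def encode_segment(handle, length, offset):
    """Encode one rpcrdma2_segment: uint32, uint32, uint64 (16 bytes)."""
    return struct.pack(">IIQ", handle, length, offset)


def encode_read_list(entries):
    """Encode the list that an rpcrdma2_read_list pointer refers to.

    entries is a list of (position, handle, length, offset) tuples.
    """
    out = b""
    for position, handle, length, offset in entries:
        # TRUE: another rpcrdma2_read_segment follows.
        out += struct.pack(">II", 1, position)
        out += encode_segment(handle, length, offset)
    # FALSE: end of list.
    return out + struct.pack(">I", 0)
```

   An empty Read list therefore occupies four zero bytes, and each
   Read segment adds 24 bytes.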

7.4.  XDR Definition for RPC-over-RDMA Version 2 Base Header Types

<CODE BEGINS>
/// /*******************************************************************
///  *    Descriptions of RPC-over-RDMA Header Types
///  ******************************************************************/
///
/// /*
///  * Header Type Codes.
///  */
/// const RDMA2_MSG = 0;
/// const RDMA2_NOMSG = 1;
/// const RDMA2_ERROR = 4;
/// const RDMA2_CONNPROP = 5;
///
/// /*
///  * Header Types to Convey RPC Messages.
///  */
/// struct rpcrdma2_msg {
///         struct rpcrdma2_chunk_lists  rdma_chunks;
///
///         /* The rpc message starts here and continues
///          * through the end of the transmission. */
///         uint32                       rdma_rpc_first_word;
/// };
///
/// struct rpcrdma2_nomsg {
///         struct rpcrdma2_chunk_lists  rdma_chunks;
/// };
///
/// /*
///  * Header Type to Report Errors.
///  */
/// const RDMA2_ERR_VERS = 1;
/// const RDMA2_ERR_BAD_XDR = 2;
/// const RDMA2_ERR_INVAL_HTYPE = 3;
/// const RDMA2_ERR_INVAL_FLAG = 4;
/// const RDMA2_ERR_READ_CHUNKS = 5;
/// const RDMA2_ERR_WRITE_CHUNKS = 6;
/// const RDMA2_ERR_SEGMENTS = 7;
/// const RDMA2_ERR_WRITE_RESOURCE = 8;
/// const RDMA2_ERR_REPLY_RESOURCE = 9;
/// const RDMA2_ERR_SYSTEM = 10;
///
/// struct rpcrdma2_err_vers {
///         uint32 rdma_vers_low;
///         uint32 rdma_vers_high;
/// };
///
/// struct rpcrdma2_err_write {
///         uint32 rdma_chunk_index;
///         uint32 rdma_length_needed;
/// };
///
/// union rpcrdma2_error switch (uint32 rdma_err) {
///         case RDMA2_ERR_VERS:
///           rpcrdma2_err_vers rdma_vrange;
///         case RDMA2_ERR_READ_CHUNKS:
///         case RDMA2_ERR_WRITE_CHUNKS:
///           uint32 rdma_max_chunks;
///         case RDMA2_ERR_SEGMENTS:
///           uint32 rdma_max_segments;
///         case RDMA2_ERR_WRITE_RESOURCE:
///           rpcrdma2_err_write rdma_writeres;
///         case RDMA2_ERR_REPLY_RESOURCE:
///           uint32 rdma_length_needed;
///         default:
///           void;
/// };
///
/// /*
///  * Header Type to Exchange Transport Properties.
///  */
/// struct rpcrdma2_connprop {
///         rpcrdma2_propset rdma_props;
/// };
///
/// /* FILE ENDS: baseops.x; */

<CODE ENDS>
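
   As an illustration of the rpcrdma2_error union above, the following
   Python sketch decodes the discriminant and the words of the
   selected arm.  It is not generated code; the function name
   decode_error and the table _ARM_WORDS are hypothetical, and the
   sketch relies on the fact that every non-void arm consists only of
   uint32 fields.

```python
import struct

RDMA2_ERR_VERS = 1
RDMA2_ERR_READ_CHUNKS = 5
RDMA2_ERR_WRITE_CHUNKS = 6
RDMA2_ERR_SEGMENTS = 7
RDMA2_ERR_WRITE_RESOURCE = 8
RDMA2_ERR_REPLY_RESOURCE = 9

# Number of uint32 words that follow rdma_err for each non-void arm.
_ARM_WORDS = {
    RDMA2_ERR_VERS: 2,             # rdma_vers_low, rdma_vers_high
    RDMA2_ERR_READ_CHUNKS: 1,      # rdma_max_chunks
    RDMA2_ERR_WRITE_CHUNKS: 1,     # rdma_max_chunks
    RDMA2_ERR_SEGMENTS: 1,         # rdma_max_segments
    RDMA2_ERR_WRITE_RESOURCE: 2,   # rdma_chunk_index, rdma_length_needed
    RDMA2_ERR_REPLY_RESOURCE: 1,   # rdma_length_needed
}


def decode_error(body):
    """Return (rdma_err, tuple_of_arm_words) from an encoded union."""
    (err,) = struct.unpack_from(">I", body, 0)
    nwords = _ARM_WORDS.get(err, 0)   # the default arm is void
    if nwords == 0:
        return err, ()
    return err, struct.unpack_from(">%dI" % nwords, body, 4)
```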

7.5.  Use of the XDR Description Files

   The two files common.x and baseops.x, when combined with the XDR
   descriptions for extensions defined later, produce a human-readable
   and compilable description of the RPC-over-RDMA version 2 protocol
   with the included extensions.

   Although this XDR description can be useful in generating code to
   encode and decode the transport and payload streams, there are
   elements of the structure of RPC-over-RDMA version 2 which are not
   expressible within the XDR language as currently defined.  This
   requires implementations that use the output of the XDR processor to
   provide additional code to bridge the gaps.

   o  The values of transport properties are represented within XDR as
      opaque values.  However, the actual structures of each of the
      properties are represented by XDR typedefs, with the selection of
      the appropriate typedef described by text in this document.  The
      determination of the appropriate typedef is not specified by XDR,
      which does not possess the facilities necessary for that
      determination to be specified in an extensible way.

      This is similar to the way in which NFSv4 attributes are handled
      [RFC7530] [RFC5661].  As in that case, implementations that need
      to encode and decode these nominally opaque entities need to use
      the protocol description to determine the actual XDR
      representation that underlies the items described as opaque.

   o  The transport stream is not represented as a single XDR object.
      Instead, the header prefix is described by one XDR object, while
      the rest of the header is described by another.  The mapping
      between the header type in the header prefix and the XDR object
      representing that header type is given by tables contained in
      this document; a later extension document can specify additional
      mappings.

      This situation is similar to that in which RPC message headers
      contain program and procedure numbers, so that the XDR for those
      requests and replies can be used to encode and decode the
      associated messages without requiring that all be present in a
      single XDR specification.  As in that case, implementations need
      to use the header specification to select the appropriate XDR-
      generated code to be used in message processing.

   o  The relationship between the transport stream and the payload
      stream is not specified in the XDR itself, although comments
      within the XDR text make clear where transported messages,
      described by their own XDR, need to appear.  Such data by its
      nature is opaque to the transport, although its form differs from
      that of XDR opaque arrays.

      Potential extensions allowing continuation of RPC messages across
      transport message boundaries will require that message assembly
      facilities, not specifiable within XDR, also be part of transport
      implementations.
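
   The first gap listed above, selecting the real type of a nominally
   opaque property value, might be bridged as in the following Python
   sketch.  The function names and the UINT32_PROPIDS table are
   hypothetical; the table reflects the typedefs in Section 7.3, and
   the sketch deliberately leaves RDMA2_PROPID_HOSTAUTH and unknown
   propids uninterpreted.

```python
import struct

# propids whose value is a single uint32, per the typedefs in
# Section 7.3: SBSIZ, RBSIZ, RSSIZ, RCSIZ, and BRS.
UINT32_PROPIDS = {1, 2, 3, 4, 5}


def encode_propval(propid, data):
    """Encode rpcrdma2_propval: propid, then opaque<> padded to 4 bytes."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">II", propid, len(data)) + data + b"\x00" * pad


def decode_propval(buf, offset=0):
    """Return (propid, value, next_offset).

    value is an int for the uint32 properties and None for any propid
    this sketch does not interpret.
    """
    propid, length = struct.unpack_from(">II", buf, offset)
    start = offset + 8
    data = buf[start:start + length]
    # Skip the XDR padding that follows the opaque bytes.
    next_offset = start + length + (4 - length % 4) % 4
    value = None
    if propid in UINT32_PROPIDS and length == 4:
        (value,) = struct.unpack(">I", data)
    return propid, value, next_offset
```

   Because next_offset is always returned, a decoder can step over a
   property it does not understand and continue with the next one.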

   To summarize, the role of XDR in this specification is more limited
   than for protocols which are themselves XDR programs, where the
   totality of the protocol is expressible within the XDR paradigm
   established for that purpose.  This more limited role reflects the
   fact that XDR lacks facilities to represent the embedding of
   transported material within the transport framework.  In addition,
   the need to cleanly accommodate extensions has meant that those using
   rpcgen in their applications need to take a more active role in
   providing the facilities that cannot be expressed within XDR.
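
   One of the facilities an implementation has to supply itself is the
   selection of the XDR type that follows the header prefix.  The
   following Python sketch shows one way to do that with a table keyed
   by rdma_htype.  The names split_header and HEADER_TYPES are
   hypothetical, and a real implementation would map each header type
   to a generated decoder rather than to a type name.

```python
import struct

# Header type codes from Section 7.4.
RDMA2_MSG, RDMA2_NOMSG, RDMA2_ERROR, RDMA2_CONNPROP = 0, 1, 4, 5

# Hypothetical dispatch table: rdma_htype -> name of the XDR type that
# describes the rest of the header.  An extension would add entries.
HEADER_TYPES = {
    RDMA2_MSG: "rpcrdma2_msg",
    RDMA2_NOMSG: "rpcrdma2_nomsg",
    RDMA2_ERROR: "rpcrdma2_error",
    RDMA2_CONNPROP: "rpcrdma2_connprop",
}

PREFIX_LEN = 20   # five uint32 fields in rpcrdma2_hdr_prefix


def split_header(buf):
    """Decode the header prefix and name the XDR type of what follows."""
    xid, vers, credit, htype, flags = struct.unpack_from(">5I", buf, 0)
    if htype not in HEADER_TYPES:
        raise ValueError("unrecognized rdma_htype %d" % htype)
    prefix = {"xid": xid, "vers": vers, "credit": credit,
              "htype": htype, "flags": flags}
    return prefix, HEADER_TYPES[htype], buf[PREFIX_LEN:]
```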

8.  RPC Bind Parameters

   In setting up a new RDMA connection, the first action by an RPC
   client is to obtain a transport address for the RPC server.  The
   means used to obtain this address and to open an RDMA connection is
   dependent on the type of RDMA transport, and is the responsibility of
   each RPC protocol binding and its local implementation.

   RPC services normally register with a portmap or rpcbind service
   [RFC1833], which associates an RPC Program number with a service
   address.  This policy is no different with RDMA transports.  However,
   a different and distinct service address (port number) might
   sometimes be required for ULP operation with RPC-over-RDMA.

   When mapped atop the iWARP transport [RFC5040] [RFC5041], which uses
   IP port addressing due to its layering on TCP and/or SCTP, port
   mapping is trivial and consists merely of issuing the port in the
   connection process.  The NFS/RDMA protocol service address has been
   assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP
   [RFC8267].

   When mapped atop InfiniBand [IBA], which uses a service endpoint
   naming scheme based on a Group Identifier (GID), a translation MUST
   be employed.  One such translation is described in Annexes A3
   (Application Specific Identifiers), A4 (Sockets Direct Protocol
   (SDP)), and A11 (RDMA IP CM Service) of [IBA], which is appropriate
   for translating IP port addressing to the InfiniBand network.
   Therefore, in this case, IP port addressing may be readily employed
   by the upper layer.

   When a mapping standard or convention exists for IP ports on an RDMA
   interconnect, there are several possibilities for each upper layer to
   consider:

   o  One possibility is to have the server register its mapped IP port
      with the rpcbind service under the netid (or netids) defined in
      [RFC8166].  An RPC-over-RDMA-aware RPC client can then resolve its
      desired service to a mappable port and proceed to connect.  This
      is the most flexible and compatible approach for those upper
      layers that are defined to use the rpcbind service.

   o  A second possibility is to have the RPC server's portmapper
      register itself on the RDMA interconnect at a "well-known" service
      address (on UDP or TCP, this corresponds to port 111).  An RPC
      client could connect to this service address and use the portmap
      protocol to obtain a service address in response to a program
      number; e.g., an iWARP port number or an InfiniBand GID.

   o  Alternately, the RPC client could simply connect to the mapped
      well-known port for the service itself, if it is appropriately
      defined.  By convention, the NFS/RDMA service, when operating atop
      such an InfiniBand fabric, uses the same 20049 assignment as for
      iWARP.

   Historically, different RPC protocols have taken different approaches
   to their port assignment.  Therefore, the specific method is left to
   each RPC-over-RDMA-enabled ULB and is not addressed in this document.

   [RFC8166] defines two new netid values to be used for registration of
   upper layers atop iWARP [RFC5040] [RFC5041] and (when a suitable port
   translation service is available) InfiniBand [IBA].  Additional RDMA-
   capable networks MAY define their own netids, or if they provide a
   port translation, they MAY share the one defined in [RFC8166].

9.  Implementation Status

   This section records the status of known implementations of the
   protocol defined by this specification at the time of posting of this
   Internet-Draft, and is based on a proposal described in [RFC7942].
   The description of implementations in this section is intended to
   assist the IETF in its decision processes in progressing drafts to
   RFCs.

   Please note that the listing of any individual implementation here
   does not imply endorsement by the IETF.  Furthermore, no effort has
   been spent to verify the information presented here that was supplied
   by IETF contributors.  This is not intended as, and must not be
   construed to be, a catalog of available implementations or their
   features.  Readers are advised to note that other implementations may
   exist.

   At this time, no known implementations of the protocol described in
   this document exist.

10.  Security Considerations

10.1.  Memory Protection

   A primary consideration is the protection of the integrity and
   confidentiality of host memory by an RPC-over-RDMA transport.  The
   use of an RPC-over-RDMA transport protocol MUST NOT introduce
   vulnerabilities to system memory contents nor to memory owned by user
   processes.

   It is REQUIRED that any RDMA provider used for RPC transport be
   conformant to the requirements of [RFC5042] in order to satisfy these
   protections.  These protections are provided by the RDMA layer
   specifications, and in particular, their security models.

10.1.1.  Protection Domains

   The use of Protection Domains to limit the exposure of memory regions
   to a single connection is critical.  Any attempt by an endpoint not
   participating in that connection to reuse memory handles needs to
   result in immediate failure of that connection.  Because ULP security
   mechanisms rely on this aspect of Reliable connected behavior, strong
   authentication of remote endpoints is recommended.

10.1.2.  Handle (STag) Predictability

   Unpredictable memory handles should be used for any operation
   requiring advertised memory regions.  Advertising a continuously
   registered memory region allows a remote host to read or write to
   that region even when an RPC involving that memory is not under way.
   Therefore, implementations should avoid advertising persistently
   registered memory.

10.1.3.  Memory Protection

   Requesters should register memory regions for remote access only when
   they are about to be the target of an RPC operation that involves an
   RDMA Read or Write.

   Registered memory regions should be invalidated as soon as related
   RPC operations are complete.  Invalidation and DMA unmapping of
   memory regions should be complete before message integrity checking
   is done and before the RPC consumer is allowed to continue execution
   and use or alter the contents of a memory region.

   An RPC transaction on a Requester might be terminated before a reply
   arrives if the RPC consumer exits unexpectedly (for example, it is
   signaled or a segmentation fault occurs).  When an RPC terminates
   abnormally, memory regions associated with that RPC should be
   invalidated appropriately before the regions are released to be
   reused for other purposes on the Requester.

10.1.4.  Denial of Service

   A detailed discussion of denial-of-service exposures that can result
   from the use of an RDMA transport is found in Section 6.4 of
   [RFC5042].

   A Responder is not obliged to pull Read chunks that are unreasonably
   large.  The Responder can use an RDMA2_ERROR response to terminate
   RPCs with unreadable Read chunks.  If a Responder transmits more data
   than a Requester is prepared to receive in a Write or Reply chunk,
   the RDMA Network Interface Cards (RNICs) typically terminate the
   connection.  For further discussion, see Section 6.4.3.  Such
   repeated chunk errors can deny service to other users sharing the
   connection from the errant Requester.

   An RPC-over-RDMA transport implementation is not responsible for
   throttling the RPC request rate, other than to keep the number of
   concurrent RPC transactions at or under the number of credits granted
   per connection.  This is explained in Section 4.3.1.  A sender can
   trigger a self denial of service by exceeding the credit grant
   repeatedly.
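
   A sender can avoid exceeding the credit grant by counting its
   outstanding RPC transactions, as in the following Python sketch.
   The class CreditGate and its methods are hypothetical, and the
   sketch assumes that each reply carries an updated grant for the
   connection; Section 4.3.1 remains the authoritative description of
   credit accounting.

```python
class CreditGate:
    """Track outstanding RPCs against the peer's credit grant."""

    def __init__(self, granted):
        self.granted = granted
        self.outstanding = 0

    def try_send(self):
        """Reserve a credit; return False if the caller must wait."""
        if self.outstanding >= self.granted:
            return False
        self.outstanding += 1
        return True

    def on_reply(self, new_grant):
        """Release one credit and adopt the grant carried by the reply."""
        self.outstanding -= 1
        self.granted = new_grant
```

   A caller that receives False from try_send queues the RPC until a
   reply arrives, rather than transmitting beyond the grant.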

   When an RPC has been canceled due to a signal or premature exit of an
   application process, a Requester typically invalidates the RPC's
   Write and Reply chunks.  Invalidation prevents the subsequent arrival
   of the Responder's reply from altering the memory regions associated
   with those chunks after the memory has been reused.

   On the Requester, a malfunctioning application or a malicious user
   can create a situation where RPCs are continuously initiated and then
   aborted, resulting in Responder replies that terminate the underlying
   RPC-over-RDMA connection repeatedly.  Such situations can deny
   service to other users sharing the connection from that Requester.

10.2.  RPC Message Security

   ONC RPC provides cryptographic security via the RPCSEC_GSS framework
   [RFC7861].  RPCSEC_GSS implements message authentication
   (rpc_gss_svc_none), per-message integrity checking
   (rpc_gss_svc_integrity), and per-message confidentiality
   (rpc_gss_svc_privacy) in the layer above the RPC-over-RDMA transport.
   The latter two services require significant computation and movement
   of data on each endpoint host.  Some performance benefits enabled by
   RDMA transports can be lost.

10.2.1.  RPC-over-RDMA Protection at Lower Layers

   For any RPC transport, utilizing RPCSEC_GSS integrity or privacy
   services has performance implications.  Protection below the RPC
   transport is often more appropriate in performance-sensitive
   deployments, especially if it, too, can be offloaded.  Certain
   configurations of IPsec can be co-located in RDMA hardware, for
   example, without change to RDMA consumers and little loss of data
   movement efficiency.  Such arrangements can also provide a higher
   degree of privacy by hiding endpoint identity or altering the
   frequency at which messages are exchanged, at a performance cost.

   The use of protection in a lower layer MAY be negotiated through the
   use of an RPCSEC_GSS security flavor defined in [RFC7861] in
   conjunction with the Channel Binding mechanism [RFC5056] and IPsec
   Channel Connection Latching [RFC5660].  Use of such mechanisms is
   REQUIRED where integrity or confidentiality is desired and where
   efficiency is required.

10.2.2.  RPCSEC_GSS on RPC-over-RDMA Transports

   Not all RDMA devices and fabrics support the above protection
   mechanisms.  Also, per-message authentication is still required on
   NFS clients where multiple users access NFS files.  In these cases,
   RPCSEC_GSS can protect NFS traffic conveyed on RPC-over-RDMA
   connections.

   RPCSEC_GSS extends the ONC RPC protocol without changing the format
   of RPC messages.  By observing the conventions described in this
   section, an RPC-over-RDMA transport can convey RPCSEC_GSS-protected
   RPC messages interoperably.

   As part of the ONC RPC protocol, protocol elements of RPCSEC_GSS that
   appear in the Payload stream of an RPC-over-RDMA message (such as
   control messages exchanged as part of establishing or destroying a
   security context or data items that are part of RPCSEC_GSS
   authentication material) MUST NOT be reduced.

10.2.2.1.  RPCSEC_GSS Context Negotiation

   Some NFS client implementations use a separate connection to
   establish a Generic Security Service (GSS) context for NFS operation.
   Such clients use TCP and the standard NFS port (2049) for context
   establishment.  To enable the use of RPCSEC_GSS with NFS/RDMA, an NFS
   server MUST also provide a TCP-based NFS service on port 2049.

10.2.2.2.  RPC-over-RDMA with RPCSEC_GSS Authentication

   The RPCSEC_GSS authentication service has no impact on the DDP-
   eligibility of data items in a ULP.

   However, RPCSEC_GSS authentication material appearing in an RPC
   message header can be larger than, say, an AUTH_SYS authenticator.
   In particular, when an RPCSEC_GSS pseudoflavor is in use, a Requester
   needs to accommodate a larger RPC credential when marshaling RPC Call
   messages and needs to provide for a maximum size RPCSEC_GSS verifier
   when allocating reply buffers and Reply chunks.

   RPC messages, and thus Payload streams, are made larger as a result.
   ULP operations that fit in a Short Message when a simpler form of
   authentication is in use might need to be reduced or conveyed via a
   Long Message when RPCSEC_GSS authentication is in use.  It is more
   likely that a Requester provides both a Read list and a Reply chunk
   in the same RPC-over-RDMA Transport header to convey a Long Call and
   provision a receptacle for a Long Reply.
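
   As a rough illustration of the size effect described above, the
   following Python sketch classifies the same RPC Call as a Short or
   Long Message under two authentication flavors.  The function name,
   the octet counts, and the 4096-octet inline threshold are all
   assumptions chosen for this example; none of them is taken from this
   document.

```python
# Illustrative only: every size below is an assumed value.
AUTH_SYS_OCTETS = 72        # assumed size of AUTH_SYS authentication material
RPCSEC_GSS_OCTETS = 400     # assumed size of RPCSEC_GSS credential + verifier


def call_message_form(transport_hdr_octets: int, rpc_fixed_octets: int,
                      auth_octets: int, args_octets: int,
                      inline_threshold: int) -> str:
    """Classify an RPC Call as "short" or "long" by its total size."""
    total = (transport_hdr_octets + rpc_fixed_octets
             + auth_octets + args_octets)
    return "short" if total <= inline_threshold else "long"


# The same Call, sent with two different authentication flavors.
common = dict(transport_hdr_octets=28, rpc_fixed_octets=24,
              args_octets=3700, inline_threshold=4096)
with_auth_sys = call_message_form(auth_octets=AUTH_SYS_OCTETS, **common)
with_gss = call_message_form(auth_octets=RPCSEC_GSS_OCTETS, **common)
```

   With these assumed sizes, the Call fits inline when the smaller
   authenticator is used but exceeds the inline threshold once the
   larger RPCSEC_GSS material is added.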

   In addition to this cost, the XDR encoding and decoding of each RPC
   message using RPCSEC_GSS authentication requires host compute
   resources to construct the GSS verifier.

10.2.2.3.  RPC-over-RDMA with RPCSEC_GSS Integrity or Privacy

   The RPCSEC_GSS integrity service enables endpoints to detect
   modification of RPC messages in flight.  The RPCSEC_GSS privacy
   service prevents all but the intended recipient from viewing the
   cleartext content of RPC arguments and results.  RPCSEC_GSS integrity
   and privacy services are end-to-end.  They protect RPC arguments and
   results from application to server endpoint, and back.

   The RPCSEC_GSS integrity and encryption services operate on whole RPC
   messages after they have been XDR encoded for transmit, and before
   they have been XDR decoded after receipt.  Both sender and receiver
   endpoints use intermediate buffers to prevent exposure of encrypted
   data or unverified cleartext data to RPC consumers.  After
   verification, encryption, and message wrapping have been performed,
   the transport layer MAY use RDMA data transfer between these
   intermediate buffers.

   The process of reducing a DDP-eligible data item removes the data
   item and its XDR padding from the encoded Payload stream.  XDR
   padding of a reduced data item is not transferred in a normal RPC-
   over-RDMA message.  After reduction, the Payload stream contains
   fewer octets than the whole XDR stream did beforehand.  XDR padding
   octets are often zero bytes, but they don't have to be.  Thus,
   reducing DDP-eligible items affects the result of message integrity
   verification or encryption.
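
   The effect can be sketched in Python.  In this example, SHA-256
   stands in for a GSS integrity checksum, and the reduced form of an
   opaque data item is assumed to leave only its length word in the
   Payload stream.  Both choices, and all names used, are assumptions
   made for illustration only.

```python
import hashlib


def xdr_opaque(data: bytes, pad_octet: bytes = b"\x00") -> bytes:
    """Encode variable-length opaque data: length, content, padding."""
    pad_len = (4 - len(data) % 4) % 4
    return len(data).to_bytes(4, "big") + data + pad_octet * pad_len


def reduced_form(data: bytes) -> bytes:
    """What stays in the Payload stream after reduction (assumed here to
    be only the length word; the content moves in a chunk and the
    padding is not sent)."""
    return len(data).to_bytes(4, "big")


def digest(stream: bytes) -> str:
    """Stand-in for a GSS integrity checksum (an assumption)."""
    return hashlib.sha256(stream).hexdigest()


data = b"abcde"                                # 5 octets, so 3 pad octets
whole_zero_pad = xdr_opaque(data)              # 12 octets
whole_other_pad = xdr_opaque(data, b"\xff")    # same content, other padding
payload_after_reduction = reduced_form(data)   # 4 octets

# The checksum covers padding, which reduction never transfers.
padding_matters = digest(whole_zero_pad) != digest(whole_other_pad)
reduction_matters = (digest(whole_zero_pad)
                     != digest(payload_after_reduction))
```

   A receiver that sees only the reduced Payload stream cannot know
   which padding octets the sender included in its checksum, so it
   cannot reliably reproduce the sender's result.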

   Therefore, a sender MUST NOT reduce a Payload stream when RPCSEC_GSS
   integrity or encryption services are in use.  Effectively, no data
   item is DDP-eligible in this situation, and Chunked Messages cannot
   be used.  In this mode, an RPC-over-RDMA transport operates in the
   same manner as a transport that does not support DDP.

   When an RPCSEC_GSS integrity or privacy service is in use, a
   Requester provides both a Read list and a Reply chunk in the same
   RPC-over-RDMA header to convey a Long Call and provision a receptacle
   for a Long Reply.
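
   A sender-side check for this rule might look like the following
   Python sketch.  The enumeration values follow the rpc_gss_service_t
   numbering of RPCSEC_GSS; the function and type names are
   hypothetical.

```python
from enum import Enum


class GssService(Enum):
    NONE = 1          # authentication only
    INTEGRITY = 2
    PRIVACY = 3


def items_to_reduce(ddp_eligible_items: list, service: GssService) -> list:
    """Return the data items a sender may reduce for this RPC."""
    if service in (GssService.INTEGRITY, GssService.PRIVACY):
        return []     # MUST NOT reduce: treat no item as DDP-eligible
    return list(ddp_eligible_items)
```

   When the returned list is empty, the whole RPC message is conveyed
   without chunks for individual data items, as it would be on a
   transport that does not support DDP.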

10.2.2.4.  Protecting RPC-over-RDMA Transport Headers

   Like the base fields in an ONC RPC message (XID, call direction, and
   so on), the contents of an RPC-over-RDMA message's Transport stream
   are not protected by RPCSEC_GSS.  This exposes XIDs, connection
   credit limits, and chunk lists (but not the content of the data items
   they refer to) to malicious behavior, which could redirect data that
   is transferred by the RPC-over-RDMA message, result in spurious
   retransmits, or trigger connection loss.

   In particular, if an attacker alters the information contained in the
   chunk lists of an RPC-over-RDMA Transport header, data contained in
   those chunks can be redirected to other registered memory regions on
   Requesters.  An attacker might alter the arguments of RDMA Read and
   RDMA Write operations on the wire to similar effect.  If such
   alterations occur, the use of RPCSEC_GSS integrity or privacy
   services enables a Requester to detect unexpected material in a
   received RPC message.

   Encryption at lower layers, as described in Section 10.2.1, protects
   the content of the Transport stream.  To address attacks on RDMA
   protocols themselves, RDMA transport implementations should conform
   to [RFC5042].

10.3.  Transport Properties

   Like other fields that appear in each RPC-over-RDMA header, property
   information is sent in the clear on the fabric with no integrity
   protection, making it vulnerable to man-in-the-middle attacks.

   For example, if a man-in-the-middle were to change the value of the
   Receive buffer size or the Requester Remote Invalidation boolean, it
   could reduce connection performance or trigger loss of connection.
   Repeated connection loss can impact performance or even prevent a new
   connection from being established.  Recourse is to deploy on a
   private network or use link-layer encryption.

10.4.  Host Authentication

   This section uses the relevant sections of [RFC3552] to analyze the
   addition of host authentication to this RPC-over-RDMA transport.

   The authors refer readers to Appendix C of [RFC8446] for information
   on how to design and test a secure authentication handshake
   implementation.

11.  IANA Considerations

   The RPC-over-RDMA family of transports has been assigned RPC netids
   by [RFC8166].  A netid is an rpcbind [RFC1833] string used to
   identify the underlying protocol in order for RPC to select
   appropriate transport framing and the format of the service addresses
   and ports.

   The following netid registry strings are already defined for this
   purpose:

      NC_RDMA "rdma"
      NC_RDMA6 "rdma6"

   The "rdma" netid is to be used when IPv4 addressing is employed by
   the underlying transport, and "rdma6" when IPv6 addressing is
   employed.  The netid assignment policy and registry are defined in
   [RFC5665].  The current document does not alter these netid
   assignments.
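
   The selection rule above can be sketched in Python as follows.  The
   function name is hypothetical; the two netid strings are the ones
   listed above.

```python
import ipaddress


def rdma_netid(address: str) -> str:
    """Return "rdma" for an IPv4 address and "rdma6" for an IPv6 one."""
    version = ipaddress.ip_address(address).version
    return "rdma" if version == 4 else "rdma6"
```

   For example, rdma_netid("192.0.2.10") selects "rdma", and
   rdma_netid("2001:db8::1") selects "rdma6".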

   These netids MAY be used for any RDMA network that satisfies the
   requirements of Section 3.2.2 and that is able to identify service
   endpoints using IP port addressing, possibly through use of a
   translation service as described in Section 8.

12.  References

12.1.  Normative References

   [RFC1833]  Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
              RFC 1833, DOI 10.17487/RFC1833, August 1995,
              <https://www.rfc-editor.org/info/rfc1833>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC4506]  Eisler, M., Ed., "XDR: External Data Representation
              Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May
              2006, <https://www.rfc-editor.org/info/rfc4506>.

   [RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement
              Protocol (DDP) / Remote Direct Memory Access Protocol
              (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
              2007, <https://www.rfc-editor.org/info/rfc5042>.

   [RFC5056]  Williams, N., "On the Use of Channel Bindings to Secure
              Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007,
              <https://www.rfc-editor.org/info/rfc5056>.

   [RFC5280]  Cooper, D., Santesson, S., Farrell, S., Boeyen, S.,
              Housley, R., and W. Polk, "Internet X.509 Public Key
              Infrastructure Certificate and Certificate Revocation List
              (CRL) Profile", RFC 5280, DOI 10.17487/RFC5280, May 2008,
              <https://www.rfc-editor.org/info/rfc5280>.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
              May 2009, <https://www.rfc-editor.org/info/rfc5531>.

   [RFC5660]  Williams, N., "IPsec Channels: Connection Latching",
              RFC 5660, DOI 10.17487/RFC5660, October 2009,
              <https://www.rfc-editor.org/info/rfc5660>.

   [RFC5665]  Eisler, M., "IANA Considerations for Remote Procedure Call
              (RPC) Network Identifiers and Universal Address Formats",
              RFC 5665, DOI 10.17487/RFC5665, January 2010,
              <https://www.rfc-editor.org/info/rfc5665>.

   [RFC6125]  Saint-Andre, P. and J. Hodges, "Representation and
              Verification of Domain-Based Application Service Identity
              within Internet Public Key Infrastructure Using X.509
              (PKIX) Certificates in the Context of Transport Layer
              Security (TLS)", RFC 6125, DOI 10.17487/RFC6125, March
              2011, <https://www.rfc-editor.org/info/rfc6125>.

   [RFC7861]  Adamson, A. and N. Williams, "Remote Procedure Call (RPC)
              Security Version 3", RFC 7861, DOI 10.17487/RFC7861,
              November 2016, <https://www.rfc-editor.org/info/rfc7861>.

   [RFC7942]  Sheffer, Y. and A. Farrel, "Improving Awareness of Running
              Code: The Implementation Status Section", BCP 205,
              RFC 7942, DOI 10.17487/RFC7942, July 2016,
              <https://www.rfc-editor.org/info/rfc7942>.

   [RFC8166]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
              Memory Access Transport for Remote Procedure Call Version
              1", RFC 8166, DOI 10.17487/RFC8166, June 2017,
              <https://www.rfc-editor.org/info/rfc8166>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC8267]  Lever, C., "Network File System (NFS) Upper-Layer Binding
              to RPC-over-RDMA Version 1", RFC 8267,
              DOI 10.17487/RFC8267, October 2017,
              <https://www.rfc-editor.org/info/rfc8267>.

   [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS) Protocol
              Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018,
              <https://www.rfc-editor.org/info/rfc8446>.

12.2.  Informative References

   [CBFC]     Kung, H., Blackwell, T., and A. Chapman, "Credit-Based
              Flow Control for ATM Networks: Credit Update Protocol,
              Adaptive Credit Allocation, and Statistical Multiplexing",
              Proc. ACM SIGCOMM '94 Symposium on Communications
              Architectures, Protocols and Applications, pp. 101-114.,
              August 1994.

   [IBA]      InfiniBand Trade Association, "InfiniBand Architecture
              Specification Volume 1", Release 1.3, March 2015.

              Available from https://www.infinibandta.org/

   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
              DOI 10.17487/RFC0768, August 1980,
              <https://www.rfc-editor.org/info/rfc768>.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981,
              <https://www.rfc-editor.org/info/rfc793>.

   [RFC1094]  Nowicki, B., "NFS: Network File System Protocol
              specification", RFC 1094, DOI 10.17487/RFC1094, March
              1989, <https://www.rfc-editor.org/info/rfc1094>.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813,
              DOI 10.17487/RFC1813, June 1995,
              <https://www.rfc-editor.org/info/rfc1813>.

   [RFC3552]  Rescorla, E. and B. Korver, "Guidelines for Writing RFC
              Text on Security Considerations", BCP 72, RFC 3552,
              DOI 10.17487/RFC3552, July 2003,
              <https://www.rfc-editor.org/info/rfc3552>.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007, <https://www.rfc-editor.org/info/rfc5040>.

   [RFC5041]  Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
              Data Placement over Reliable Transports", RFC 5041,
              DOI 10.17487/RFC5041, October 2007,
              <https://www.rfc-editor.org/info/rfc5041>.

   [RFC5532]  Talpey, T. and C. Juszczak, "Network File System (NFS)
              Remote Direct Memory Access (RDMA) Problem Statement",
              RFC 5532, DOI 10.17487/RFC5532, May 2009,
              <https://www.rfc-editor.org/info/rfc5532>.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010,
              <https://www.rfc-editor.org/info/rfc5661>.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              External Data Representation Standard (XDR) Description",
              RFC 5662, DOI 10.17487/RFC5662, January 2010,
              <https://www.rfc-editor.org/info/rfc5662>.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015, <https://www.rfc-editor.org/info/rfc7530>.

   [RFC8167]  Lever, C., "Bidirectional Remote Procedure Call on RPC-
              over-RDMA Transports", RFC 8167, DOI 10.17487/RFC8167,
              June 2017, <https://www.rfc-editor.org/info/rfc8167>.

   [RFC8178]  Noveck, D., "Rules for NFSv4 Extensions and Minor
              Versions", RFC 8178, DOI 10.17487/RFC8178, July 2017,
              <https://www.rfc-editor.org/info/rfc8178>.

Appendix A.  ULB Specifications

   An Upper-Layer Protocol (ULP) is typically defined independently of
   any particular RPC transport.  An Upper-Layer Binding (ULB)
   specification provides guidance that helps the ULP interoperate
   correctly and efficiently over a particular transport.  For RPC-over-
   RDMA version 2, a ULB may provide:

   o  A taxonomy of XDR data items that are eligible for DDP

   o  Constraints on which upper-layer procedures may be reduced and on
      how many chunks may appear in a single RPC request

   o  A method for determining the maximum size of the reply Payload
      stream for all procedures in the ULP

   o  An rpcbind port assignment for operation of the RPC Program and
      Version on an RPC-over-RDMA transport

   Each RPC Program and Version tuple that utilizes RPC-over-RDMA
   version 2 needs to have a ULB specification.

A.1.  DDP-Eligibility

   A ULB designates some XDR data items as eligible for DDP.  As an
   RPC-over-RDMA message is formed, DDP-eligible data items can be
   removed from the Payload stream and placed directly in the receiver's
   memory.  An XDR data item should be considered for DDP-eligibility if
   there is a clear benefit to moving the contents of the item directly
   from the sender's memory to the receiver's memory.

   Criteria for DDP-eligibility include:

   o  The XDR data item is frequently sent or received, and its size is
      often much larger than typical inline thresholds.

   o  If the XDR data item is a result, its maximum size must be
      predictable in advance by the requester.

   o  Transport-level processing of the XDR data item is not needed.
      For example, the data item is an opaque byte array, which requires
      no XDR encoding and decoding of its content.

   o  The content of the XDR data item is sensitive to address
      alignment.  For example, a data copy operation would be required
      on the receiver to enable the message to be parsed correctly, or
      to enable the data item to be accessed.

   o  The XDR data item does not contain DDP-eligible data items.

   In addition to defining the set of data items that are DDP-eligible,
   a ULB may also limit the use of chunks to particular upper-layer
   procedures.  If more than one data item in a procedure is DDP-
   eligible, the ULB may also limit the number of chunks that a
   requester can provide for a particular upper-layer procedure.

   Senders MUST NOT reduce data items that are not DDP-eligible.  Such
   data items MAY, however, be moved as part of a Position Zero Read
   chunk or a Reply chunk.

   The programming interface by which an upper-layer implementation
   indicates the DDP-eligibility of a data item to the RPC transport is
   not described by this specification.  The only requirements are that
   the receiver can re-assemble the transmitted RPC-over-RDMA message
   into a valid XDR stream, and that DDP-eligibility rules specified by
   the ULB are respected.

   There is no provision to express DDP-eligibility within the XDR
   language.  The only definitive specification of DDP-eligibility is a
   ULB.

   In general, a DDP-eligibility violation occurs when:

   o  A requester reduces a non-DDP-eligible argument data item.  The
      Responder MUST NOT process this RPC Call message and MUST report
      the violation as described in Section 6.4.3.

   o  A Responder reduces a non-DDP-eligible result data item.  The
      requester MUST terminate the pending RPC transaction and report an
      appropriate permanent error to the RPC consumer.

   o  A Responder does not reduce a DDP-eligible result data item into
      an available Write chunk.  The requester MUST terminate the
      pending RPC transaction and report an appropriate permanent error
      to the RPC consumer.
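
   The three cases above can be summarized in a short Python sketch.
   The function name, its parameters, and the returned strings are
   hypothetical; an actual implementation would act on its own internal
   message representation rather than on boolean flags.

```python
def ddp_violation(sender: str, item_is_eligible: bool,
                  item_was_reduced: bool,
                  write_chunk_available: bool = False):
    """Return the required reaction to a violation, or None if the
    handling of this data item does not violate DDP-eligibility."""
    if item_was_reduced and not item_is_eligible:
        if sender == "requester":
            return "responder rejects call"
        return "requester terminates RPC"
    if (sender == "responder" and item_is_eligible
            and write_chunk_available and not item_was_reduced):
        return "requester terminates RPC"
    return None
```

   Note that a Responder leaving a DDP-eligible result unreduced is a
   violation only when the requester supplied a Write chunk for it.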

A.2.  Maximum Reply Size

   When expecting small and moderately-sized Replies, a requester should
   typically rely on Message Continuation rather than provisioning a
   Reply chunk.  For each ULP procedure where there is no clear Reply
   size maximum and the maximum can be large, the ULB should specify a
   dependable means for determining the maximum Reply size.

A.3.  Additional Considerations

   There may be other details provided in a ULB.

   o  A ULB may recommend inline threshold values or other transport-
      related parameters for RPC-over-RDMA version 2 connections bearing
      that ULP.

   o  A ULP may provide a means to communicate these transport-related
      parameters between peers.  Note that RPC-over-RDMA version 2 does
      not specify any mechanism for changing any transport-related
      parameter after a connection has been established and the initial
      transport properties have been exchanged.

   o  Multiple ULPs may share a single RPC-over-RDMA version 2
      connection when their ULBs allow the use of RPC-over-RDMA version
      2 and the rpcbind port assignments for the Protocols allow
      connection sharing.  In this case, the same transport parameters
      (such as inline threshold) apply to all Protocols using that
      connection.

   Each ULB needs to be designed to allow correct interoperation without
   regard to the transport parameters actually in use.  Furthermore,
   implementations of ULPs must be designed to interoperate correctly
   regardless of the connection parameters in effect on a connection.

A.4.  ULP Extensions

   An RPC Program and Version tuple may be extensible.  For instance,
   there may be a minor versioning scheme that is not reflected in the
   RPC version number, or the ULP may allow additional features to be
   specified after the original RPC Program specification was ratified.
   ULBs are provided for interoperable RPC Programs and Versions by
   extending existing ULBs to reflect the changes made necessary by each
   addition to the existing XDR.

Appendix B.  Extending the Version 2 Protocol

   This Appendix is not addressed to protocol implementers, but rather
   to authors of documents that intend to extend the protocol described
   earlier in this document.

   Subsequent RPC-over-RDMA versions are free to change the protocol in
   any way they choose as long as they leave unchanged those fields
   identified as "fixed for all versions" in Section 4.2.1 of [RFC8166].

   Such changes might involve deletion or major re-organization of
   existing transport headers.  However, the need for interoperability
   between adjacent versions will often limit the scope of changes that
   can be made in a single version.

   In some cases it may prove desirable to transition to a new version
   by using the extension features described for use with RPC-over-RDMA
   version 2, by continuing the same basic extension model but allowing
   header types and properties that were OPTIONAL in one version to
   become REQUIRED in the subsequent version.

   RPC-over-RDMA version 2 is designed to be extensible in a way that
   enables the addition of OPTIONAL features that may subsequently be
   converted to REQUIRED status in a future protocol version.  The
   protocol may be extended by Standards Track documents in a way
   analogous to that provided for Network File System Version 4 as
   described in [RFC8178].

   This form of extensibility enables limited extensions to the base
   RPC-over-RDMA version 2 protocol presented in this document so that
   new optional capabilities can be introduced without a protocol
   version change, while maintaining robust interoperability with
   existing RPC-over-RDMA version 2 implementations.  The design allows
   extensions to be defined, including the definition of new protocol
   elements, without requiring modification or recompilation of the
   existing XDR.

   A Standards Track document introduces each set of such protocol
   elements.  Together these elements are considered an OPTIONAL
   feature.  Each implementation is either aware of all the protocol
   elements introduced by that feature or is aware of none of them.

   Documents describing extensions to RPC-over-RDMA version 2 should
   contain:

   o  An explanation of the purpose and use of each new protocol element
      added.

   o  An XDR description including all of the new protocol elements, and
      a script to extract it.

   o  A description of interactions with existing extensions.

      This includes possible requirements of other OPTIONAL features to
      be present for new protocol elements to work, or that a particular
      level of support for an OPTIONAL facility is required for the new
      extension to work.

   Implementers combine the XDR descriptions of the new features they
   intend to use with the XDR description of the base protocol in this
   document.  This may be necessary to create a valid XDR input file
   because extensions are free to use XDR types defined in the base
   protocol, and later extensions may use types defined by earlier
   extensions.

   The XDR description for the RPC-over-RDMA version 2 base protocol
   combined with that for any selected extensions should provide an
   adequate human-readable description of the extended protocol.

   The base protocol specified in this document may be extended within
   RPC-over-RDMA version 2 in two ways:

   o  New OPTIONAL transport header types may be introduced by later
      Standards Track documents.  Such transport header types will be
      documented as described in Appendix B.1.

   o  New OPTIONAL transport properties may be defined in later
      Standards Track documents.  Such transport properties will be
      documented as described in Appendix B.3.

   The following sorts of ancillary protocol elements may be added to
   the protocol to support the addition of new transport properties and
   header types.

   o  New error codes may be created as described in Appendix B.4.

   o  New flags to use within the rdma_flags field may be created as
      described in Appendix B.2.

   New capabilities can be proposed and developed independently of each
   other, and implementers can choose among them.  This makes it
   straightforward to create and document experimental features and then
   bring them through the standards process.

B.1.  Adding New Header Types to RPC-over-RDMA Version 2

   New transport header types are to be defined in a manner similar to
   the way existing ones are described in Sections 6.4.1 through 6.4.4.
   Specifically, what is needed is:

   o  A description of the function and use of the new header type.

   o  A complete XDR description of the new header type including a
      description of the use of all fields within the header.

   o  A description of how errors are reported, including the definition
      of a mechanism for reporting errors when the error is outside the
      choices already available in the base protocol or in other
      existing extensions.

   o  An indication of whether a Payload stream must be present, and a
      description of its contents and how such payload streams are used
      to construct RPC messages for processing.

   In addition, further documentation is needed because of the OPTIONAL
   status of new transport header types.

   o  Information about constraints on support for the new header types
      should be provided.  For example, if support for one header type
      is implied or foreclosed by another one, this needs to be
      documented.

   o  A preferred method by which a sender should determine whether the
      peer supports a particular header type needs to be provided.
      While it is always possible for a sender to send a test invocation
      of a particular header type to see if support is available, when
      more efficient means are available (e.g., the value of a transport
      property), this should be noted.

B.2.  Adding New Header Flags to the Protocol

   New flag bits are to be defined in a manner similar to the way
   existing ones are described in Sections 6.3.2.1 and 6.3.2.2.  Each
   new flag definition should include:

   o  An XDR description of the new flag.

   o  A description of the function and use of the new flag.

   o  An indication for which header types the flag value is meaningful
      and for which header types it is an error to set the flag or to
      leave it unset.

   o  A means to determine whether receivers are prepared to receive
      transport headers with the new flag set.

   In addition, further documentation is needed because of the OPTIONAL
   status of new flags.

   o  Information about constraints on support for the new flags should
      be provided.  For example, if support for one flag is implied or
      foreclosed by another one, this needs to be documented.

B.3.  Adding New Transport properties to the Protocol

   The set of transport properties is designed to be extensible.  As a
   result, once new properties are defined in standards track documents,
   the operations defined in this document may reference these new
   transport properties, as well as the ones described in this document.

   A standards track document defining a new transport property should
   include the following information paralleling that provided in this
   document for the transport properties defined herein.

   o  The rpcrdma2_propid value used to identify this property.

   o  The XDR typedef specifying the form in which the property value is
      communicated.

   o  A description of the transport property that is communicated by
      the sender of RDMA2_CONNPROP.

   o  An explanation of how this knowledge could be used by the peer
      receiving this information.

   The definition of transport property structures is such as to make it
   easy to assign unique values.  There is no requirement that a
   contiguous set of values be used, and implementations should not rely
   on all such values being small integers.  A unique value should be
   selected when the defining document is first published as an
   Internet-Draft.  When the document becomes a Standards Track
   document, the working group should ensure that:

   o  rpcrdma2_propid values specified in the document do not conflict
      with those currently assigned or in use by other pending working
      group documents defining transport properties.

   o  rpcrdma2_propid values specified in the document do not conflict
      with the range reserved for experimental use, as defined in
      Section 8.2.
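
   A review-time check of the two conditions above could be sketched in
   Python as follows.  The function name is hypothetical, and the
   experimental range is passed in as a parameter because its bounds are
   defined elsewhere in this document; the numbers in the usage note are
   made up.

```python
def propid_conflicts(new_ids, assigned_ids, pending_ids,
                     experimental_range):
    """Return the proposed rpcrdma2_propid values that cannot be used
    as proposed, in ascending order."""
    taken = set(assigned_ids) | set(pending_ids)
    return sorted(p for p in set(new_ids)
                  if p in taken or p in experimental_range)
```

   For instance, with assigned values 1 through 4, a pending value of 9,
   and an assumed experimental range of 4096 through 8191, a proposal
   for values 5, 9, and 4100 would have 9 and 4100 flagged.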

   Documents defining new properties fall into a number of categories.

   o  Those defining new properties and explaining (only) how they
      affect use of existing message types.

   o  Those defining new OPTIONAL message types and new properties
      applicable to the operation of those new message types.

   o  Those defining new OPTIONAL message types and new properties
      applicable both to new and existing message types.

   When additional transport properties are proposed, the review of the
   associated standards track document should deal with possible
   security issues raised by those new transport properties.

B.4.  Adding New Error Codes to the Protocol

   New error codes to be returned when using new header types may be
   introduced in the same Standards Track document that defines the new
   header type.  Cases in which a new error code is to be returned by an
   existing header type can be accommodated by defining the new error
   code in the same Standards Track document that defines the new
   transport property.

   For error codes that do not require that additional error information
   be returned with them, the existing RDMA_ERR2 header can be used to
   report the new error.  The new error code is set as the value of
   rdma_err with the result that the default switch arm of the
   rpcrdma2_error union (i.e., void) is selected.
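
   As an informal illustration of the default-arm behavior described
   above, the following Python sketch decodes an XDR-encoded error
   union.  The table of codes that carry additional data, and all
   numeric values shown, are hypothetical placeholders chosen for the
   example; they are not the assignments made by this document.

```python
import struct

# Hypothetical: error codes whose switch arm carries extra data, mapped
# to the number of additional 32-bit XDR words in that arm.
ARMS_WITH_DATA = {1: 2, 4: 1}

def decode_error(body):
    """Return (rdma_err, extra_words) from an XDR-encoded error union."""
    (rdma_err,) = struct.unpack_from(">I", body, 0)
    # Any code without an explicit arm selects the default (void) arm,
    # so no further words are consumed.
    nwords = ARMS_WITH_DATA.get(rdma_err, 0)
    extra = tuple(
        struct.unpack_from(">I", body, 4 + 4 * i)[0] for i in range(nwords)
    )
    return rdma_err, extra

# A newly defined code (hypothetically 100) needs no decoder change.
new_code_report = struct.pack(">I", 100)
known_code_report = struct.pack(">III", 1, 2, 2)
```

   In this sketch, a receiver that has never heard of code 100 can
   still decode the report, because nothing follows the discriminant.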

   For error codes that do require the return of additional error-
   related information together with the error, a new header type should
   be defined for the purpose of returning the error together with
   needed additional information.  It should be documented just like any
   other new header type.

   When a new header type is sent, the sender needs to be prepared to
   accept header types necessary to report associated errors.

Appendix C.  Differences from the RPC-over-RDMA Version 1 Protocol

   This section describes the substantive changes made in RPC-over-RDMA
   version 2.

C.1.  Relationship to the RPC-over-RDMA Version 1 XDR Definition

   There are a number of structural XDR changes whose goal is to enable
   within-version protocol extensibility.

   The RPC-over-RDMA version 1 transport header is defined as a single
   XDR object, with an RPC message proper potentially following it.  In
   RPC-over-RDMA version 2, as described in Section 6.1, there are
   separate XDR definitions of the transport header prefix (see
   Section 6.3.2), which specifies the transport header type to be used,
   and of the specific transport header, defined within one of the
   subsections of Section 6.  This is similar to the way that an RPC
   message consists of an RPC header (defined in [RFC5531]) and an RPC
   request or reply, defined by the Upper-Layer protocol being conveyed.

   As a new version of the RPC-over-RDMA transport protocol, RPC-over-
   RDMA version 2 exists within the versioning rules defined in
   [RFC8166].  In particular, it maintains the first four words of the
   protocol header as sent and received, as specified in Section 4.2 of
   [RFC8166], even though, as explained in Section 6.3.1 of this
   document, the XDR definition of those words is structured
   differently.

   Although each of the first four words retains its semantic function,
   there are important differences of field interpretation, besides the
   fact that the words have different names and different roles within
   the XDR construct of which they are parts.

   o  The first word of the header, previously the rdma_xid field,
      retains the format and function that it had in RPC-over-RDMA
      version 1.  Within RPC-over-RDMA version 2, this word is the
      rdma_xid field of the structure rdma_start.  However, to
      accommodate the use of request-response pairing of non-RPC
      messages and the potential use of message continuation, it cannot
      be assumed that it will always have the same value it would have
      had in RPC-over-RDMA version 1.  As a result, the contents of this
      field should not be used without consideration of the associated
      protocol version identification.

   o  The second word of the header, previously the rdma_vers field,
      retains the format and function that it had in RPC-over-RDMA
      version 1.  Within RPC-over-RDMA version 2, this word is the
      rdma_vers field of the structure rdma_start.  To clearly
      distinguish version 1 and version 2 messages, senders MUST fill in
      the correct version (fixed after version negotiation) and
      receivers MUST check that the content of the rdma_vers field is
      correct before referencing any other header field.

   o  The third word of the header, previously the rdma_credit field,
      retains the size and general purpose that it had in RPC-over-RDMA
      version 1.  Within RPC-over-RDMA version 2, this word is the
      rdma_credit field of the structure rdma_start.

   o  The fourth word of the header, previously the union discriminator
      field rdma_proc, retains its format and general function even
      though the set of valid values has changed.  The value of this
      field is now considered an unsigned 32-bit integer rather than an
      enum.  Within RPC-over-RDMA version 2, this word is the rdma_htype
      field of the structure rdma_start.
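
   The layout of the four fixed words can be sketched as follows.  This
   Python fragment is illustrative only: the function name and the
   dictionary it returns are assumptions made for the example, and the
   fragment decodes nothing beyond the first 16 octets of a message.
   It relies only on the facts that XDR words are 32 bits wide and
   big-endian, and that rdma_vers carries the value 2 in this version.

```python
import struct

RPCRDMA_VERSION_2 = 2  # value of rdma_vers for this protocol version

def parse_rdma_start(buf, negotiated_vers=RPCRDMA_VERSION_2):
    """Decode the four fixed 32-bit words that begin a transport header."""
    if len(buf) < 16:
        raise ValueError("short header")
    xid, vers, credit, htype = struct.unpack_from(">4I", buf, 0)
    # rdma_vers is checked before any other field is interpreted.
    if vers != negotiated_vers:
        raise ValueError("unexpected rdma_vers %d" % vers)
    return {
        "rdma_xid": xid,
        "rdma_vers": vers,
        "rdma_credit": credit,
        "rdma_htype": htype,  # plain unsigned integer, not an enum
    }

sample = struct.pack(">4I", 0x12345678, 2, 32, 0)
header = parse_rdma_start(sample)
```

   Rejecting the message on a version mismatch before touching the
   other three words mirrors the receiver requirement stated for the
   second word above.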

   Beyond conforming to the restrictions specified in [RFC8166], RPC-
   over-RDMA version 2 tightly limits the scope of the changes made in
   order to ensure interoperability.  It makes no major structural
   changes to the protocol, and all existing transport header types used
   in version 1 (as defined in [RFC8166]) are retained in version 2.
   Chunks are expressed using the same on-the-wire format and are used
   in the same way in both versions.

C.2.  Transport Properties

   RPC-over-RDMA version 2 provides a mechanism for exchanging the
   transport's operational properties.  This mechanism allows connection
   endpoints to communicate the properties of their implementation at
   connection setup.  The mechanism could be expanded to enable an
   endpoint to request changes in properties of the other endpoint and
   to notify peer endpoints of changes to properties that occur during
   operation.  Transport properties are described in Section 5.

C.3.  Credit Management Changes

   RPC-over-RDMA transports employ credit-based flow control to ensure
   that a requester does not emit more RDMA Sends than the responder is
   prepared to receive.  Section 3.3.1 of [RFC8166] explains the purpose
   and operation of RPC-over-RDMA version 1 credit management in detail.

   In the RPC-over-RDMA version 1 design, each RDMA Send from a
   requester contains an RPC Call with a credit request, and each RDMA
   Send from a responder contains an RPC Reply with a credit grant.  The
   credit grant implies that enough Receives have been posted on the
   responder to handle the credit grant minus the number of pending RPC
   transactions (the number of remaining Receive buffers might be zero).

   In other words, each RPC Reply acts as an implicit ACK for a previous
   RPC Call from the requester, indicating that the responder has posted
   a Receive to replace the Receive consumed by the requester's RDMA
   Send.  Without an RPC Reply message, the requester has no way to know
   that the responder is properly prepared for subsequent RPC Calls.

   Aside from being a bit of a layering violation, this arrangement is
   inadequate in some basic (but rare) cases:

   o  When a requester retransmits an RPC Call on the same connection as
      an earlier RPC Call for the same transaction.

   o  When a requester transmits an RPC operation that requires no
      reply.

   o  When more than one RPC-over-RDMA message is needed to complete the
      transaction (e.g., RDMA_DONE).

   Typically, the connection must be replaced in these cases.  This
   resets the credit accounting mechanism but has an undesirable impact
   on other ongoing RPC transactions on that connection.
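
   The second case above can be illustrated with a deliberately
   simplified model of the requester's view of version 1 credit
   accounting.  The class and its method names are hypothetical, and
   the model ignores retransmission and bi-directional operation; it
   is a sketch of the accounting problem, not of any implementation.

```python
class V1CreditModel:
    """Toy requester-side view of version 1 credit accounting."""

    def __init__(self, grant):
        self.grant = grant     # most recent credit grant from the responder
        self.outstanding = 0   # Sends not yet matched by an RPC Reply

    def can_send(self):
        return self.outstanding < self.grant

    def send_call(self):
        if not self.can_send():
            raise RuntimeError("no credits available")
        self.outstanding += 1

    def receive_reply(self, grant):
        # The Reply is the only signal that a Receive was re-posted.
        self.outstanding -= 1
        self.grant = grant

model = V1CreditModel(grant=2)
model.send_call()             # ordinary Call; its Reply arrives next
model.receive_reply(grant=2)
model.send_call()             # Call that requires no Reply
model.send_call()             # ordinary Call; its Reply arrives next
model.receive_reply(grant=2)
leaked = model.outstanding    # the no-reply Call still holds a credit
```

   In this model the credit consumed by the Call that has no Reply is
   never returned, which is why replacing the connection becomes the
   only way to reset the accounting.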

   Because credit management accompanies each RPC message, there is a
   strict one-to-one ratio between RDMA Send and RPC message.  There are
   interesting use cases that might be enabled if this relationship were
   more flexible:

   o  RPC-over-RDMA operations which do not carry an RPC message; e.g.,
      control plane operations.

   o  A single RDMA Send that conveys more than one RPC message for the
      purpose of interrupt mitigation.

   o  An RPC message that is conveyed via several sequential RDMA Sends
      to reduce the use of explicit RDMA operations for moderate-sized
      RPC messages.

   o  An RPC transaction that needs multiple exchanges or an odd number
      of RPC-over-RDMA operations to complete.

   Bi-directional RPC operation also introduces an ambiguity.  If the
   RPC-over-RDMA message does not carry an RPC message, then it is not
   possible to determine whether the sender is a requester or a
   responder, and thus whether the rdma_credit field contains a credit
   request or a credit grant.

   A more sophisticated credit accounting mechanism is provided in RPC-
   over-RDMA version 2 in an attempt to address some of these
   shortcomings.  This new mechanism is detailed in Section 4.3.1.

C.4.  Inline Threshold Changes

   The term "inline threshold" is defined in Section 3.3.2 of [RFC8166].
   An "inline threshold" value is the largest message size (in octets)
   that can be conveyed on an RDMA connection using only RDMA Send and
   Receive.  Each connection has two inline threshold values: one for
   messages flowing from client-to-server (referred to as the "client-
   to-server inline threshold") and one for messages flowing from
   server-to-client (referred to as the "server-to-client inline
   threshold").  Note that [RFC8166] uses somewhat different
   terminology.  This is because it was written with only forward-
   direction RPC transactions in mind.

   A connection's inline thresholds determine when RDMA Read or Write
   operations are required because the RPC message to be sent cannot be
   conveyed via a single RDMA Send and Receive pair.  When an RPC
   message does not contain DDP-eligible data items, a requester can
   prepare a Long Call or Reply to convey the whole RPC message using
   RDMA Read or Write operations.

   RDMA Read and Write operations require that each data payload resides
   in a region of memory that is registered with the RNIC.  When an RPC
   is complete, that region is invalidated, fencing it from the
   responder.  Memory registration and invalidation typically have a
   latency cost that is insignificant compared to data handling costs.
   When a data payload is small, however, the cost of registering and
   invalidating the memory where the payload resides becomes a
   relatively significant part of total RPC latency.  Therefore the most
   efficient operation of RPC-over-RDMA occurs when explicit RDMA Read
   and Write operations are used for large payloads, and are avoided for
   small payloads.

   When RPC-over-RDMA version 1 was conceived, the typical size of RPC
   messages that did not involve a significant data payload was under
   500 bytes.  A 1024-byte inline threshold adequately minimized the
   frequency of inefficient Long messages.

   With NFS version 4.1 [RFC5661], the increased size of NFS COMPOUND
   operations resulted in RPC messages that are on average larger and
   more complex than those of previous versions of NFS.  With 1024-byte
   inline thresholds, RDMA Read or Write operations are needed for
   frequent operations that do not bear a data payload, such as GETATTR
   and LOOKUP, reducing the efficiency of the transport.

   To reduce the need to use Long messages, RPC-over-RDMA version 2
   increases the default size of inline thresholds.  This also increases
   the maximum size of reverse-direction RPC messages.
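
   The effect of the inline threshold on message handling can be
   sketched as a simple comparison.  In the Python fragment below, the
   function name, the 1500-octet message size, and the 4096-octet
   threshold are assumptions chosen for illustration; only the
   1024-octet figure is taken from the discussion above.

```python
def needs_long_message(message_size, inline_threshold):
    """True when the message cannot be conveyed by a single RDMA Send."""
    # The threshold is the largest size that still fits, so a message
    # exactly at the threshold is sent inline.
    return message_size > inline_threshold

V1_THRESHOLD = 1024       # octets; the version 1 value discussed above
LARGER_THRESHOLD = 4096   # octets; illustrative assumption only

compound_size = 1500      # hypothetical payload-free COMPOUND, in octets

long_with_v1 = needs_long_message(compound_size, V1_THRESHOLD)
long_with_larger = needs_long_message(compound_size, LARGER_THRESHOLD)
```

   With the smaller threshold the hypothetical COMPOUND requires a Long
   message and therefore memory registration; with the larger one it
   does not.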

C.5.  Message Continuation Changes

   In addition to a larger default inline threshold, RPC-over-RDMA
   version 2 introduces Message Continuation.  Message Continuation is a
   mechanism that enables the transmission of a data payload using more
   than one RDMA Send.  The purpose of Message Continuation is to
   provide relief in several important cases:

   o  If a requester finds that it is inefficient to convey a
      moderately-sized data payload using Read chunks, the requester can
      use Message Continuation to send the RPC Call.

   o  If a requester has provided insufficient Reply chunk space for a
      responder to send an RPC Reply, the responder can use Message
      Continuation to send the RPC Reply.

   o  If a sender has to convey a large non-RPC data payload (e.g., a
      large transport property), the sender can use Message Continuation
      to avoid using registered memory.
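
   The general idea of Message Continuation can be sketched as
   splitting one payload across several Sends, each of which fits
   within the inline threshold.  The Python fragment below is a sketch
   under stated assumptions: the function name is hypothetical, and
   the per-Send header size of 96 octets and the 4096-octet threshold
   are placeholders, not values defined by this document.

```python
def split_for_continuation(payload, inline_threshold, header_size):
    """Split payload into pieces that each fit in one RDMA Send."""
    room = inline_threshold - header_size  # payload octets per Send
    if room <= 0:
        raise ValueError("inline threshold too small for header")
    return [payload[i:i + room] for i in range(0, len(payload), room)]

# A 10000-octet payload with hypothetical threshold and header sizes.
payload = bytes(10000)
pieces = split_for_continuation(payload, inline_threshold=4096,
                                header_size=96)
piece_sizes = [len(p) for p in pieces]
```

   No memory registration is needed for any of the pieces, at the cost
   of consuming more than one Send (and hence more than one posted
   Receive on the peer) for a single message.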

C.6.  Host Authentication Changes

   For general operation of NFS on open networks, we eventually intend
   to rely on RPC-on-TLS [citation needed] to provide cryptographic
   authentication of the two ends of each connection.  In turn, this
   will improve the trustworthiness of AUTH_SYS-style user identities
   that flow on TCP, which are not cryptographic.  We do not have a
   similar solution for RPC-over-RDMA, however.

   Here, the RDMA transport layer already provides a strong guarantee of
   message integrity.  On some network fabrics, IPsec can be used to
   protect the privacy of in-transit data, or TLS itself could be used
   for transporting raw RDMA operations.  However, this is not the case
   for all fabrics (e.g., InfiniBand [IBA]).

   Thus, it is sensible to add a mechanism in the RPC-over-RDMA
   transport itself for authenticating the connection peers.  This
   mechanism is described in Section 5.2.6.  And like GSS channel
   binding, there should also be a way to determine when the use of host
   authentication is superfluous and can be avoided.

C.7.  Support for Remote Invalidation

   An STag that is registered using the FRWR mechanism in a privileged
   execution context or is registered via a Memory Window in an
   unprivileged context may be invalidated remotely [RFC5040].  These
   mechanisms are available when a requester's RNIC supports
   MEM_MGT_EXTENSIONS.

   For the purposes of this discussion, there are two classes of STags.
   Dynamically-registered STags are used in a single RPC, then
   invalidated.  Persistently-registered STags live longer than one RPC.
   They may persist for the life of an RPC-over-RDMA connection, or
   longer.

   An RPC-over-RDMA requester may provide more than one STag in one
   transport header.  It may provide a combination of dynamically- and
   persistently-registered STags in one RPC message, or any combination
   of these in a series of RPCs on the same connection.  Only
   dynamically-registered STags using Memory Windows or FRWR (i.e.,
   registered via MEM_MGT_EXTENSIONS) may be invalidated remotely.

   There is no transport-level mechanism by which a responder can
   determine how a requester-provided STag was registered, nor whether
   it is eligible to be invalidated remotely.  A requester that mixes
   persistently- and dynamically-registered STags in one RPC, or mixes
   them across RPCs on the same connection, must therefore indicate
   which handles may be invalidated via a mechanism provided in the
   Upper-Layer Protocol.  RPC-over-RDMA version 2 provides such a
   mechanism.

   The RDMA Send With Invalidate operation is used to invalidate an STag
   on a remote system.  It is available only when a responder's RNIC
   supports MEM_MGT_EXTENSIONS, and must be utilized only when a
   requester's RNIC supports MEM_MGT_EXTENSIONS (can receive and
   recognize an IETH).
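
   The eligibility conditions described above can be summarized in a
   short sketch.  The STag class, its fields, and the function name in
   this Python fragment are hypothetical; in particular, the "dynamic"
   flag stands in for whatever indication the requester supplies, since
   a responder cannot discover on its own how an STag was registered.

```python
from dataclasses import dataclass

@dataclass
class STag:
    handle: int
    dynamic: bool  # registered for one RPC via FRWR or a Memory Window

def eligible_for_remote_invalidation(stags, requester_ext, responder_ext):
    """Return handles a responder may name in Send With Invalidate."""
    # Both RNICs must support MEM_MGT_EXTENSIONS: the responder to issue
    # the operation, the requester to receive and recognize an IETH.
    if not (requester_ext and responder_ext):
        return []
    # Persistently-registered STags are never invalidated remotely.
    return [s.handle for s in stags if s.dynamic]

stags = [STag(handle=0x10, dynamic=True), STag(handle=0x20, dynamic=False)]
both_support = eligible_for_remote_invalidation(stags, True, True)
responder_lacks = eligible_for_remote_invalidation(stags, True, False)
```

   When either side lacks support, the responder simply uses an
   ordinary Send and the requester invalidates its STags locally.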

   Existing RPC-over-RDMA transport protocol specifications [RFC8166]
   [RFC8167] do not forbid direct data placement in the reverse
   direction, even though there is currently no Upper-Layer Protocol
   that makes data items in reverse direction operations eligible for
   direct data placement.

   When chunks are present in a reverse direction RPC request, Remote
   Invalidation allows the responder to trigger invalidation of a
   requester's STags as part of sending a reply, the same way as is done
   in the forward direction.

   However, in the reverse direction, the server acts as the requester,
   and the client is the responder.  The server's RNIC, therefore, must
   support receiving an IETH, and the server must have registered the
   STags with an appropriate registration mechanism.

C.8.  Error Reporting Changes

   RPC-over-RDMA version 2 expands the repertoire of errors that may be
   reported by connection endpoints.  This change, which is structured
   to enable extensibility, allows a peer to report overruns of specific
   resources and to avoid requester retries when an error is permanent.

Acknowledgments

   The authors gratefully acknowledge the work of Brent Callaghan and
   Tom Talpey on the original RPC-over-RDMA version 1 specification (RFC
   5666).  The authors also wish to thank Bill Baker, Greg Marsden, and
   Matt Benjamin for their support of this work.

   The XDR extraction conventions were first described by the authors of
   the NFS version 4.1 XDR specification [RFC5662].  Herbert van den
   Bergh suggested the replacement sed script used in this document.

   Special thanks go to Transport Area Director Magnus Westerlund, NFSV4
   Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4
   Working Group Secretary Thomas Haynes for their support.

Authors' Addresses

   Charles Lever (editor)
   Oracle Corporation
   United States of America

   Email: chuck.lever@oracle.com


   David Noveck
   NetApp
   1601 Trapelo Road
   Waltham, MA  02451
   United States of America

   Phone: +1 781 572 8038
   Email: davenoveck@gmail.com