Remote Direct Data Placement Work Group                       P. Culley
INTERNET-DRAFT                                  Hewlett-Packard Company
draft-ietf-rddp-mpa-05.txt                                     U. Elzur
                                                   Broadcom Corporation
                                                               R. Recio
                                                        IBM Corporation
                                                              S. Bailey
                                                  Sandburst Corporation
                                                             J. Carrier
                                                              Cray Inc.
Expires: December 2006                                    June 23, 2006

           Marker PDU Aligned Framing for TCP Specification

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

Abstract

MPA (Marker Protocol data unit Aligned framing) is designed to work as an "adaptation layer" between TCP and the Direct Data Placement [DDP] protocol, preserving the reliable, in-order delivery of TCP, while adding the preservation of higher-level protocol record boundaries that DDP requires. MPA is fully compliant with applicable TCP RFCs and can be utilized with existing TCP implementations. MPA also supports integrated implementations that combine TCP, MPA and DDP to reduce buffering requirements in the implementation and improve performance at the system level.
Table of Contents

Status of this Memo
Abstract
1 Glossary
2 Introduction
2.1 Motivation
2.2 Protocol Overview
3 MPA's interactions with DDP
4 MPA Full Operation Mode
4.1 FPDU Format
4.2 Marker Format
4.3 MPA Markers
4.4 CRC Calculation
4.5 FPDU Size Considerations
5 MPA's interactions with TCP
5.1 MPA transmitters with a standard layered TCP
5.2 MPA receivers with a standard layered TCP
5.3 Optimized MPA/TCP transmitters
5.3.1 Effects of Optimized MPA/TCP Segmentation
5.4 Optimized MPA/TCP receivers
6 MPA Receiver FPDU Identification
6.1 Re-segmenting Middle boxes and non optimized MPA/TCP senders
7 Connection Semantics
7.1 Connection setup
7.1.1 MPA Request and Reply Frame Format
7.1.2 Connection Startup Rules
7.1.3 Example Delayed Startup sequence
7.1.4 Use of Private Data
7.1.5 "Dual stack" implementations
7.2 Normal Connection Teardown
8 Error Semantics
9 Security Considerations
9.1 Protocol-specific Security Considerations
9.1.1 Spoofing
9.1.2 Eavesdropping
9.2 Introduction to Security Options
9.3 Using IPsec With MPA
9.4 Requirements for IPsec Encapsulation of MPA/DDP
10 IANA Considerations
11 References
11.1 Normative References
11.2 Informative References
12 Appendix
12.1 Analysis of MPA over TCP Operations
12.1.1 Assumptions
12.1.2 The Value of FPDU Alignment
12.2 Receiver implementation
12.2.1 Network Layer Reassembly Buffers
12.2.2 TCP Reassembly buffers
12.3 IETF Implementation Interoperability with RDMA Consortium Protocols
12.3.1 Negotiated Parameters
12.3.2 RDMAC RNIC and Non-permissive IETF RNIC
12.3.3 RDMAC RNIC and Permissive IETF RNIC
12.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC
13 Author's Addresses
14 Acknowledgments
Full Copyright Statement
Intellectual Property

Table of Figures

Figure 1 ULP MPA TCP Layering
Figure 2 FPDU Format
Figure 3 Marker Format
Figure 4 Example FPDU Format with Marker
Figure 5 Annotated Hex Dump of an FPDU
Figure 6 Annotated Hex Dump of an FPDU with Marker
Figure 7 Fully layered implementation
Figure 8 Optimized MPA/TCP implementation
Figure 9 MPA Request/Reply Frame
Figure 10: Example Delayed Startup negotiation
Figure 11: Example Immediate Startup negotiation
Figure 12: Non-aligned FPDU freely placed in TCP octet stream
Figure 13: Aligned FPDU placed immediately after TCP header
Figure 14. Connection Parameters for the RNIC Types.
Figure 15: MPA negotiation between an RDMAC RNIC and a Non-permissive IETF RNIC.
Figure 16: MPA negotiation between an RDMAC RNIC and a Permissive IETF RNIC.
Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a Permissive IETF RNIC.

Revision history [To be deleted prior to RFC publication]

[draft-ietf-rddp-mpa-05] workgroup draft with following changes:

   Document restructuring to differentiate between fully layered MPA on TCP implementations and optimized MPA/TCP implementations. This involved somewhat blurring the artificial layer between MPA and an MPA-aware TCP. This involved a bit of terminology change.

   Re-wrote the requirement to avoid duplicate segments during TCP passing to MPA; this is now a co-responsibility between MPA/TCP; also explained that the requirement was to avoid data corruption through bypassing MPA CRCs and other checks.

[draft-ietf-rddp-mpa-04] workgroup draft with following changes:

   Numerous capitalization and "" adjustments, tried to make more consistent.
   Added some missing capitalized terms to glossary
   Removed company specific "use as is" boilerplate paragraph
   Fixed up some contact information and cross references.
   Removed reference to expired draft-elzur-iwarp-mpa-tcp-analysis-00.txt
   Suggested MTU to be used to determine EMSS, when otherwise not available; removed technology specific lengths per AD suggestion
   Tweaked text around disabling Nagle so that it is no longer implied that that is all that is necessary to achieve proper segmentation behavior
   Revamped section 5.3.1 for improved clarity

[draft-ietf-rddp-mpa-03] workgroup draft with following changes:

   Tweaked abstract to give a bit more information.
   Tightened definition and usage of "deliver"
   Cleaned up usage of terms "FPDU Alignment" and "Header Alignment"
   Rearranged overview sections with stack and glossary earlier
   Mentioned how non-MPA-Aware TCP MPA receiver deals with out of order (it doesn't have to...)
   Fixed description of TCP out of order segment handling in section 3.1.1
   Added text saying that ordering and completion indications are used to deliver to DDP
   Added redundant text indicating low two bits of FPDUPTR must always be zero and treated as such in Section 4.1
   Added redundant text indicating Markers are always included in a CRC calculation
   Removed indication saying that an implementation can "ignore" an administrative input to not use CRCs; clarified that both ends have to agree to not use CRC (as originally intended).
   Changed example FPDU hex dump format for greater clarity
   Clarified that EMSS shrinking below 128 bytes is the condition (rather than "very small sizes")
   Put connection startup rules after the start frame formats
   Added Initiator Private Data to figure 9
   Removed or Clarified use of RNIC term
   Added intro to IETF/RDMAC interoperability appendix and gave a web reference for docs; also recommended use of "permissive IETF RNIC"
   Numerous minor clarifications
   Updated Boilerplates per current requirements

[draft-ietf-rddp-mpa-02] workgroup draft with following changes:

   Made IPsec must implement, optional to use.
   Updated Marker language to clarify that it points to ULPDU Length even when Marker precedes FPDU.
   Clarified when to start Markers use (in Full Operation mode).
   Added informative text on interoperability with RDMAC RNICs.
   Reduced Private Data to 512 octets max.
   Clarified CRC use description, must be used unless data is at least as well protected by another means.
   Clarified CRC disabled mode; CRC field is always valid.
   Added Security text.
   Changed DDP and RDMAP version numbers in hex dumps (Fig 5, 6) and adjusted CRC accordingly.

[draft-ietf-rddp-mpa-01] workgroup draft with following changes:

   Added the "R" bit (Rejected) in the MPA Reply Frame and described its semantics.
   Added some comments on recent decisions regarding startup.
   Updated RFC3667 boilerplate.

[draft-ietf-rddp-mpa-00] workgroup draft with following changes:

   Changed "Start Key" to two separate startup frames to facilitate identification of incorrect active/active startup.
   Changed Active/Passive nomenclature to Initiator/Responder to reduce confusion with TCP startup and verbs doc (which used opposite sense).
   Added Private Data to the startup key sequences. This also required describing the motivation and expected usage models along with some interface hints.
   Removed the Private Data stuff from appendix.
   Added example "Immediate" startup and explanation.

[draft-culley-iwarp-mpa-03]

   Add option to allow receivers to specify Marker use.
   Add option that allows both sides to agree not to use CRC.
   Added startup declaration "Start Key" with options and larger MPA mode recognition "key".
   Updated MPA/DDP connection startup rules and sequence to deal with "Start Key".
   Added Appendix that provides a more detailed analysis of the effects of MPA on TCP data streams.
   Added appendix that describes a mechanism to deal with "Private Data" prior to full MPA/DDP operation.

[draft-culley-iwarp-mpa-02]

   Enhanced descriptions of how MPA is used over an unmodified TCP.
   Removed "No Packing" text.
   Made MPA an adaptation layer for DDP, instead of a generalized framing solution.
   Added clarifications of the MPA/TCP interaction for optimized implementations and that any such optimizations are to be used only when requested by MPA.

[draft-culley-iwarp-mpa-01] initial draft.

1 Glossary

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

Consumer - the ULPs or applications that lie above MPA and DDP. The Consumer is responsible for making TCP connections, starting MPA and DDP connections, and generally controlling operations.

Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as the process of informing DDP that a particular PDU is ordered for use. A PDU is Delivered in the exact order that it was sent by the original sender; MPA uses TCP's byte stream ordering to determine when Delivery is possible. This is specifically different from "passing the PDU to DDP", which may generally occur in any order, while the order of Delivery is strictly defined.

EMSS - Effective Maximum Segment Size. EMSS is the smaller of the TCP maximum segment size (MSS) as defined in RFC 793 [RFC793], and the current path Maximum Transfer Unit (MTU) [RFC1191].

FPDU - Framed Protocol Data Unit. The unit of data created by an MPA sender.

FPDU Alignment - the property that an FPDU is Header Aligned with the TCP segment, and the TCP segment includes an integer number of FPDUs. A TCP segment with FPDU Alignment allows immediate processing of the contained FPDUs without waiting on other TCP segments to arrive or combining with prior segments.

FPDU Pointer (FPDUPTR) - This field of the Marker is used to indicate the beginning of an FPDU.

Full Operation (Full Operation Phase) - After the completion of the Startup Phase MPA begins exchanging FPDUs.

Header Alignment - the property that a TCP segment begins with an FPDU.
The FPDU is Header Aligned when the FPDU header is exactly at the start of the TCP segment (right behind the TCP headers on the wire).

Initiator - The endpoint of a connection that sends the MPA Request Frame, i.e. the first to actually send data (which may not be the one which sends the TCP SYN).

Marker - A four octet field that is placed in the MPA data stream at fixed octet intervals (every 512 octets).

MPA-aware TCP - a TCP implementation that is aware of the receiver efficiencies of MPA FPDU Alignment and is capable of sending TCP segments that begin with an FPDU.

MPA-enabled - MPA is enabled if the MPA protocol is visible on the wire. When the sender is MPA-enabled, it is inserting framing and Markers. When the receiver is MPA-enabled, it is interpreting framing and Markers.

MPA Request Frame - Data sent from the MPA Initiator to the MPA Responder during the Startup Phase.

MPA Reply Frame - Data sent from the MPA Responder to the MPA Initiator during the Startup Phase.

MPA - Marker-based ULP PDU Aligned Framing for TCP protocol. This document defines the MPA protocol.

MULPDU - Maximum ULPDU. The current maximum size of the record that is acceptable for DDP to pass to MPA for transmission.

Node - A computing device attached to one or more links of a Network. A Node in this context does not refer to a specific application or protocol instantiation running on the computer. A Node may consist of one or more MPA on TCP devices installed in a host computer.

PAD - A 1-3 octet group of zeros used to fill an FPDU to an exact modulo 4 size.

PDU - Protocol Data Unit.

Private Data - A block of data exchanged between MPA endpoints during initial connection setup.

Protection Domain - An RDMA concept (see [VERBS] and [RDMASEC]) that ties the use of various endpoint resources (memory access etc.) to the specific RDMA/DDP/MPA connection.

RDMA - Remote Direct Memory Access; a protocol that uses DDP and MPA to enable applications to transfer data directly from memory buffers.
See [RDMAP].

Remote Peer - The MPA protocol implementation on the opposite end of the connection. Used to refer to the remote entity when describing protocol exchanges or other interactions between two Nodes.

Responder - The connection endpoint which responds to an incoming MPA connection request (the MPA Request Frame). This may not be the endpoint which awaited the TCP SYN.

Startup Phase - The initial exchanges of an MPA connection which serve to more fully identify MPA endpoints to each other and pass connection specific setup information to each other.

ULP - Upper Layer Protocol. The protocol layer above the protocol layer currently being referenced. The ULP for MPA is DDP [DDP].

ULPDU - Upper Layer Protocol Data Unit. The data record defined by the layer above MPA (DDP). ULPDU corresponds to DDP's DDP segment.

ULPDU_Length - a field in the FPDU describing the length of the included ULPDU.

2 Introduction

This section discusses the reason for creating MPA on TCP and a general overview of the protocol. Later sections show the MPA headers (see section 4), and detailed protocol requirements and characteristics (see section 5), as well as Connection Semantics (section 7), Error Semantics (section 8), and Security Considerations (section 9).

2.1 Motivation

The Direct Data Placement protocol [DDP], when used with TCP [RFC793], requires a mechanism to detect record boundaries. The DDP records are referred to as Upper Layer Protocol Data Units by this document. The ability to locate the Upper Layer Protocol Data Unit (ULPDU) boundary is useful to a hardware network adapter that uses DDP to directly place the data in the application buffer based on the control information carried in the ULPDU header. This may be done without requiring that the packets arrive in order.
Potential benefits of this capability are the avoidance of the memory copy overhead and a smaller memory requirement for handling out of order or dropped packets.

Many approaches have been proposed for a generalized framing mechanism. Some are probabilistic in nature and others are deterministic. A probabilistic approach is characterized by a detectable value embedded in the octet stream. It is probabilistic because under some conditions the receiver may incorrectly interpret application data as the detectable value. Under these conditions, the protocol may fail with unacceptable frequency. A deterministic approach is characterized by embedded controls at known locations in the octet stream. Because the receiver can guarantee it will only examine the data stream at locations that are known to contain the embedded control, the protocol can never misinterpret application data as being embedded control data. For unambiguous handling of an out of order packet, the deterministic approach is preferred.

The MPA protocol provides a framing mechanism for DDP running over TCP using the deterministic approach. It allows the location of the ULPDU to be determined in the TCP stream even if the TCP segments arrive out of order.

2.2 Protocol Overview

The layering of PDUs with MPA is shown in Figure 1, below.

            +------------------+
            |     ULP client   |
            +------------------+  <- Consumer messages
            |        DDP       |
            +------------------+  <- ULPDUs
            |        MPA*      |
            +------------------+  <- FPDUs (containing ULPDUs)
            |        TCP*      |
            +------------------+  <- TCP Segments (containing FPDUs)
            |      IP etc.     |
            +------------------+
             * These may be fully layered or optimized together.

                  Figure 1 ULP MPA TCP Layering

MPA is described as an extra layer above TCP and below DDP. The operation sequence is:

1. A TCP connection is established by ULP action. This is done using methods not described by this specification. The ULP may exchange some amount of data in streaming mode prior to starting MPA, but is not required to do so.

2. The Consumer negotiates the use of DDP and MPA at both ends of a connection. The mechanisms to do this are not described in this specification. The negotiation may be done in streaming mode, or by some other mechanism (such as a pre-arranged port number).

3. The ULP activates MPA on each end in the Startup Phase, either as an Initiator or a Responder, as determined by the ULP. This mode verifies the usage of MPA, specifies the use of CRC and Markers, and allows the ULP to communicate some additional data via a Private Data exchange. See section 7.1 Connection setup for more details on the startup process.

4. At the end of the Startup Phase, the ULP puts MPA (and DDP) into Full Operation and begins sending DDP data as further described below. In this document, DDP data chunks are called ULPDUs. For a description of the DDP data, see [DDP].

Following is a description of data transfer when MPA is in Full Operation.

1. DDP determines the Maximum ULPDU (MULPDU) size by querying MPA for this value. MPA derives this information from TCP or IP, when it is available, or chooses a reasonable value.

2. DDP creates ULPDUs of MULPDU size or smaller, and hands them to MPA at the sender.

3. MPA creates a Framed Protocol Data Unit (FPDU) by pre-pending a header, optionally inserting Markers, and appending a CRC field after the ULPDU and PAD (if any). MPA delivers the FPDU to TCP.

4. The TCP sender puts the FPDUs into the TCP stream. If the sender is optimized MPA/TCP, it segments the TCP stream in such a way that a TCP Segment boundary is also the boundary of an FPDU. TCP then passes each segment to the IP layer for transmission.

5. The receiver may or may not be optimized. If it is optimized MPA/TCP, it may separate passing the TCP payload to MPA from passing the TCP payload ordering information to MPA. In either case, RFC compliant TCP wire behavior is observed at both the sender and receiver.

6. The MPA receiver locates and assembles complete FPDUs within the stream, verifies their integrity, and removes MPA Markers (when present), ULPDU_Length, PAD and the CRC field.

7. MPA then provides the complete ULPDUs to DDP. MPA may also separate passing MPA payload to DDP from passing the MPA payload ordering information.

A fully layered MPA on TCP is implemented as a data stream ULP for TCP and is therefore RFC compliant.

An optimized MPA/TCP uses a TCP layer which potentially contains some additional semantics as defined in this document. It is completely interoperable with a fully layered MPA on TCP implementation and is also RFC compliant.

An optimized MPA/TCP sender is able to segment the data stream such that TCP segments begin with FPDUs (FPDU Alignment). This has significant advantages for receivers. When segments arrive with aligned FPDUs the receiver usually need not buffer any portion of the segment, allowing DDP to place it in its destination memory immediately, thus avoiding copies from intermediate buffers (DDP's reason for existence).

An optimized MPA/TCP receiver allows a DDP on MPA implementation to locate the start of ULPDUs that may be received out of order. It also allows the implementation to determine if the entire ULPDU has been received. As a result, MPA can pass out of order ULPDUs to DDP for immediate use. This enables a DDP on MPA implementation to save a significant amount of intermediate storage by placing the ULPDUs in the right locations in the application buffers when they arrive, rather than waiting until full ordering can be restored.

The ability of a receiver to recover out of order ULPDUs is optional and declared to the transmitter during startup. When the receiver declares that it does not support out of order recovery, the transmitter does not add the control information to the data stream needed for out of order recovery.

If the receiver is fully layered, then MPA receives a strictly ordered stream of data and does not deal with out of order ULPDUs. In this case MPA passes each ULPDU to DDP when the last bytes arrive from TCP, along with the indication that they are in order.

MPA implementations that support recovery of out of order ULPDUs MUST support a mechanism to indicate the ordering of ULPDUs as the sender transmitted them and indicate when missing intermediate segments arrive. These mechanisms allow DDP to reestablish record ordering and report Delivery of complete messages (groups of records).

MPA also addresses enhanced data integrity. Some users of TCP have noted that the TCP checksum is not as strong as could be desired (see [CRCTCP]). Studies such as [CRCTCP] have shown that the TCP checksum indicates segments in error at a much higher rate than the underlying link characteristics would indicate. With these higher error rates, the chance that an error will escape detection, when using only the TCP checksum for data integrity, becomes a concern. A stronger integrity check can reduce the chance of data errors being missed.

MPA includes a CRC check to increase the ULPDU data integrity to the level provided by other modern protocols, such as SCTP [RFC2960]. It is possible to disable this CRC check; however, CRCs MUST be enabled unless it is clear that the end to end connection through the network has data integrity at least as good as MPA with CRC enabled (for example when IPsec is implemented end to end). DDP's ULP expects this level of data integrity and therefore the ULP does not have to provide its own duplicate data integrity and error recovery for lost data.
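As an informal illustration of sender step 3 and receiver step 6 above, the following Python sketch frames a ULPDU into an FPDU and recovers it again. It is not part of the protocol definition. The field sizes (16-bit ULPDU_Length, PAD to a multiple of four octets, 32-bit CRC32C) are those given in section 4.1. The function names are hypothetical, Markers are omitted, and network byte order for ULPDU_Length and for the stored CRC value is an assumption of this sketch; section 4.4 CRC Calculation is authoritative for the CRC encoding.

```python
import struct

CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C: slow, but needs no third-party library."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def build_fpdu(ulpdu: bytes) -> bytes:
    """Sender side: ULPDU_Length + ULPDU + PAD + CRC (no Markers)."""
    if len(ulpdu) > 0xFFFF:
        raise ValueError("ULPDU does not fit the 16-bit ULPDU_Length field")
    header = struct.pack("!H", len(ulpdu))
    # Zero PAD so that header + ULPDU + PAD is a multiple of four octets.
    pad = b"\x00" * (-(len(header) + len(ulpdu)) % 4)
    covered = header + ulpdu + pad
    # Assumption: CRC value stored big-endian; see section 4.4 for the rules.
    return covered + struct.pack("!I", crc32c(covered))


def parse_fpdu(stream: bytes):
    """Receiver side: return (ulpdu, remaining_stream) or raise ValueError."""
    if len(stream) < 2:
        raise ValueError("incomplete FPDU header")
    (ulpdu_len,) = struct.unpack_from("!H", stream, 0)
    pad_len = -(2 + ulpdu_len) % 4
    crc_offset = 2 + ulpdu_len + pad_len
    if len(stream) < crc_offset + 4:
        raise ValueError("incomplete FPDU")
    (wire_crc,) = struct.unpack_from("!I", stream, crc_offset)
    if crc32c(stream[:crc_offset]) != wire_crc:
        raise ValueError("CRC mismatch")
    return stream[2:2 + ulpdu_len], stream[crc_offset + 4:]


# Two FPDUs back to back in an ordered octet stream.
stream = build_fpdu(b"hello") + build_fpdu(b"DDP segment")
first, rest = parse_fpdu(stream)
second, rest = parse_fpdu(rest)
```

Because each FPDU carries its own length, the receiver can walk an ordered stream one FPDU at a time; locating FPDUs in out of order segments additionally relies on Markers, which this sketch leaves out.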
3 MPA's interactions with DDP

DDP requires MPA to maintain DDP record boundaries from the sender to the receiver. When using MPA on TCP to send data, DDP provides records (ULPDUs) to MPA. MPA will use the reliable transmission abilities of TCP to transmit the data, and will insert appropriate additional information into the TCP stream to allow the MPA receiver to locate the record boundary information.

As such, MPA accepts complete records (ULPDUs) from DDP at the sender and returns them to DDP at the receiver.

MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU contained in one FPDU.

MPA over a standard TCP stack can usually provide FPDU Alignment with the TCP Header if the FPDU size is equal to TCP's EMSS. An optimized MPA/TCP stack can also maintain alignment as long as the FPDU size is less than or equal to TCP's EMSS.
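The two alignment conditions above can be written down as a small worked example. This is only a sketch: the helper names are hypothetical, the FPDU size uses the Marker-free layout of section 4.1 (2-octet ULPDU_Length, 0-3 PAD octets, 4-octet CRC), and the layered case is a likelihood ("usually"), not a guarantee.

```python
MPA_HEADER_OCTETS = 2  # ULPDU_Length field
MPA_CRC_OCTETS = 4     # CRC field


def fpdu_length(ulpdu_len: int) -> int:
    """Size of an FPDU carrying ulpdu_len octets, ignoring Markers."""
    unpadded = MPA_HEADER_OCTETS + ulpdu_len
    pad = -unpadded % 4
    return unpadded + pad + MPA_CRC_OCTETS


def layered_alignment_likely(ulpdu_len: int, emss: int) -> bool:
    """Standard TCP stack: alignment is usual only when FPDU size == EMSS."""
    return fpdu_length(ulpdu_len) == emss


def optimized_alignment_possible(ulpdu_len: int, emss: int) -> bool:
    """Optimized MPA/TCP stack: alignment holds while FPDU size <= EMSS."""
    return fpdu_length(ulpdu_len) <= emss


# Example with an assumed EMSS of 1460 octets.
emss = 1460
full = fpdu_length(1454)    # 2 + 1454 + 0 + 4
small = fpdu_length(1000)   # 2 + 1000 + 2 + 4
```

With these numbers, a 1454-octet ULPDU fills the segment exactly, so both stacks can keep it aligned, while a 1000-octet ULPDU satisfies only the optimized condition.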
Since FPDU Alignment is enabled above an MPA-aware TCP, it SHOULD specifically enable the segmentation rules described above forgenerally desired by the receiver, DDP segments (FPDUs) posted for transmission. Ifmust cooperate with MPA to ensure FPDUs' lengths do not exceed the transmit side TCP implementationEMSS under normal conditions. This is not able to segmentdone with the TCP stream as indicated above,MULPDU mechanism. MPA SHOULD make a best effort to achieve that result. For example, using the TCP_NODELAY socket optionprovides information to disableDDP on the Nagle algorithm will usually result in manycurrent maximum size of the segments starting with an FPDU. If the transmit side TCP implementationrecord that is not ableacceptable to report the EMSS, MPAsend (MULPDU). DDP SHOULD use the current MTU valuelimit each record size to establish a likely FPDU size, taking into account the various expected header sizes. 3.1.2 TCP Receive side When an MPA receive implementation and the MPA-aware receive side TCP implementation support handling outMULPDU. The range of order ULPDUs, the TCP receive implementation SHOULDMULPDU values MUST be enabled to: * Pass incoming TCP segments to MPA as soon as they have been receivedbetween 128 octets and validated, even if not received in order.64768 octets, inclusive. The TCP layersending DDP MUST have committed to keeping each segment before it can be passedNOT post a ULPDU larger than 64768 octets to theMPA. This means that the segment must have passed the TCP, IP, and lower layer data integrity validation (i.e., checksum), must be in the receive window, must not beDDP MAY post a duplicate, must be partULPDU of the same epoch (if timestamps are used to verify this) andany other checks required by TCP RFCs. The segment MUST NOT be passed tosize between one and 64768 octets, however MPA more than once unless explicitly requested (see Section 7). 
Thisis not REQUIRED to implysupport a ULPDU Length that is greater than the data must be completely ordered before use. An implementation MAY accept out of order segments, SACK them [RFC2018], and pass them to DDP immediately, beforecurrent MULPDU. While the reception ofmaximum theoretical length supported by the segments needed to fill inMPA header ULPDU_Length field is 65535, TCP over IP requires the gaps arrive. Such an implementation MUST "commit"IP datagram maximum length to the data early on, and MUST NOT overwrite it even if (or when) duplicate data arrives.be 65535 octets. To enable MPA expects to utilize this "commit"to allowsupport FPDU Alignment, the passing of ULPDUs to DDP when they arrive, independentmaximum size of ordering. DDP usesthe passed ULPDU to "place"FPDU must fit within an IP datagram. Thus the DDP segments (see [DDP] for more details).ULPDU limit of 64768 octets was derived by taking the maximum IP datagram length, subtracting from it the maximum total length of the sum of the IPv4 header, TCP header, IPv4 options, TCP options, and the worst case MPA overhead, and then rounding the result down to a 128 octet boundary. On receive, MPA MUST pass each ULPDU with its length to DDP when it has been validated. If an MPA implementation supports passing out of order ULPDUs to DDP, the MPA implementation SHOULD: * Pass each ULPDU with its length to DDP as soon as it has been fully received and validated. * Provide a mechanism to indicate the ordering of TCP segmentsULPDUs as the sender transmitted them. One possible mechanism might be attachingproviding the TCP sequence number tofor each segment.ULPDU. * Provide a mechanism to indicate when a given TCP segmentULPDU (and theprior TCP stream) is complete.ULPDUs) are complete (Delivered to DDP). One possible mechanism might be to utilize the leading (left) edge ofallow DDP to see the current outgoing TCP Receive Window. MPA uses the ordering and completion indicationsAck sequence number. 
* Provide an indication to informDDP whenthat the TCP has closed or has begun to close the connection (e.g. received a ULPDU is complete;FIN). MPA DeliversMUST provide the FPDUprotocol version negotiated with its peer to DDP. DDP uses the indicationswill use this version to "deliver"set the version in its messagesheader and to report the DDP consumer (see [DDP] for more details). DDP on MPA MUST utilize these two mechanismsversion to establish[RDMAP]. 4 MPA Full Operation Mode The following sections describe the Delivery semantics that DDP's consumers agree to. Thesemain semantics are described fully in [DDP]. These include requirements on DDP's consumer to respect ownershipof buffers prior tothe time that DDP delivers them to the Consumer. An MPA-aware TCP receive side implementationfull operation mode of MPA. 4.1 FPDU Format MPA senders create FPDUs out of ULPDUs. The format of an FPDU shown below MUST continue to buffer TCP segments until completely ordered and then deliver them as expected by non-MPA applications (and described in TCP RFCs) whenbe used for all MPA isFPDUs. For purposes of clarity, Markers are not enabled on the connection. When MPAshown in Figure 2. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ULPDU_Length | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | ~ ~ ~ ULPDU ~ | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | PAD (0-3 octets) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | CRC | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 2 FPDU Format ULPDU_Length: 16 bits (unsigned integer). This is enabled above an MPA- aware TCP, TCP SHOULD enablethe in and outnumber of order passingoctets of data, andthe separate ordering information as described above. When an MPA receive implementation is coupled with a TCP receive implementation thatcontained ULPDU. 
It does not include the length of the FPDU header itself, the pad, the CRC, or of any Markers that fall within the ULPDU. The 16-bit ULPDU Length field is large enough to support the largest IP datagrams for IPv4 or IPv6.

PAD: The PAD field trails the ULPDU and contains between zero and three octets of data. The pad data MUST be set to zero by the sender and ignored by the receiver (except for CRC checking). The length of the pad is set so as to make the size of the FPDU an integral multiple of four.

CRC: 32 bits. When CRCs are enabled, this field contains a CRC32C check value, which is used to verify the entire contents of the FPDU. See section 4.4 CRC Calculation. When CRCs are not enabled, this field is still present, may contain any value, and MUST NOT be checked.

The FPDU adds a minimum of 6 octets to the length of the ULPDU. In addition, the total length of the FPDU will include the length of any Markers and from 0 to 3 pad octets added to round-up the ULPDU size.

4.2 Marker Format

The format of a Marker MUST be as specified in Figure 3:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           RESERVED            |            FPDUPTR            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 3 Marker Format

RESERVED: The Reserved field MUST be set to zero on transmit and ignored on receive (except for CRC calculation).

FPDUPTR: The FPDU Pointer is a relative pointer, 16 bits long, interpreted as an unsigned integer that indicates the number of octets in the TCP stream from the beginning of the ULPDU Length field to the first octet of the entire Marker. The least significant two bits MUST always be set to zero at the transmitter, and the receivers MUST always treat these as zero for calculations.

4.3 MPA Markers

MPA Markers are used to identify the start of FPDUs when packets are received out of order. This is done by locating the Markers at fixed intervals in the data stream (which is correlated to the TCP sequence number) and using the Marker value to locate the preceding FPDU start.

All MPA Markers are included in the containing FPDU CRC calculation (when both CRCs and Markers are in use).

The MPA receiver's ability to locate out of order FPDUs and pass the ULPDUs to DDP is implementation dependent. MPA/DDP allows those receivers that are able to deal with out of order FPDUs in this way to require the insertion of Markers in the data stream. When the receiver cannot deal with out of order FPDUs in this way, it may disable the insertion of Markers at the sender. All MPA senders MUST be able to generate Markers when their use is declared by the opposing receiver (see section 7.1 Connection setup).

When Markers are enabled, MPA senders MUST insert a Marker into the data stream at a 512 octet periodic interval in the TCP Sequence Number Space. The Marker contains a 16 bit unsigned integer referred to as the FPDUPTR (FPDU Pointer).

If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit relative back-pointer. FPDUPTR MUST contain the number of octets in the TCP stream from the beginning of the ULPDU Length field to the first octet of the Marker, unless the Marker falls between FPDUs. Thus the location of the first octet of the previous FPDU header can be determined by subtracting the value of the given Marker from the current octet-stream sequence number (i.e. TCP sequence number) of the first octet of the Marker. Note that this computation MUST take into account that the TCP sequence number could have wrapped between the Marker and the header.

An FPDUPTR value of 0x0000 is a special case: it is used when the Marker falls exactly between FPDUs (between the preceding FPDU CRC field, and the next FPDU's ULPDU Length field). In this case, the Marker is considered to be contained in the following FPDU; the Marker MUST be included in the CRC calculation of the FPDU following the Marker (if CRCs are being generated or checked). Thus an FPDUPTR value of 0x0000 means that immediately following the Marker is an FPDU header (the ULPDU Length field).

Since all FPDUs are integral multiples of 4 octets, the bottom two bits of the FPDUPTR as calculated by the sender are zero. MPA reserves these bits so they MUST be treated as zero for computation at the receiver.

When Markers are enabled (see section 7.1 Connection setup), the MPA Markers MUST be inserted immediately preceding the first FPDU of Full Operation phase, and at every 512th octet of the TCP octet stream thereafter. As a result, the first Marker has an FPDUPTR value of 0x0000. If the first Marker begins at octet sequence number SeqStart, then Markers are inserted such that the first octet of the Marker is at octet sequence number SeqNum if the remainder of (SeqNum - SeqStart) mod 512 is zero. Note that SeqNum can wrap.

For example, if the TCP sequence number were used to calculate the insertion point of the Marker, the starting TCP sequence number is unlikely to be zero, and 512 octet multiples are unlikely to fall on a modulo 512 of zero. If the MPA connection is started at TCP sequence number 11, then the 1st Marker will begin at 11, and subsequent Markers will begin at 523, 1035, etc.

If an FPDU is large enough to contain multiple Markers, they MUST all point to the same point in the TCP stream: the first octet of the ULPDU Length field for the FPDU.

If a Marker interval contains multiple FPDUs (the FPDUs are small), the Marker MUST point to the start of the ULPDU Length field for the FPDU containing the Marker unless the Marker falls between FPDUs, in which case the Marker MUST be zero.

The following example shows an FPDU containing a Marker.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      ULPDU Length (0x0010)    |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
+                                                               +
|                      ULPDU (octets 0-9)                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           (0x0000)            |       FPDU ptr (0x000C)       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     ULPDU (octets 10-15)                      |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |      PAD (2 octets:0,0)       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              CRC                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 4 Example FPDU Format with Marker

MPA Receivers MUST preserve ULPDU boundaries when passing data to DDP. MPA Receivers MUST pass the ULPDU data and the ULPDU Length to DDP and not the Markers, headers, and CRC.
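The Marker placement rule and the back-pointer arithmetic described in this section can be illustrated with a short sketch. It is not part of the specification: the function names are hypothetical, sequence numbers are modeled as 32-bit TCP sequence numbers, and the value returned by fpdu_header_seq is the sequence number of the ULPDU Length field in both the zero and non-zero FPDUPTR cases.

```python
MARKER_INTERVAL = 512   # octets between Marker starts, per this section
MARKER_SIZE = 4         # a Marker is RESERVED (16 bits) + FPDUPTR (16 bits)
SEQ_MOD = 1 << 32       # TCP sequence numbers wrap modulo 2**32


def is_marker_position(seq_num: int, seq_start: int) -> bool:
    """True if a Marker begins at seq_num, given the first Marker at seq_start."""
    return ((seq_num - seq_start) % SEQ_MOD) % MARKER_INTERVAL == 0


def next_marker_seq(seq_num: int, seq_start: int) -> int:
    """Sequence number of the first Marker at or after seq_num."""
    offset = (seq_num - seq_start) % SEQ_MOD
    gap = (-offset) % MARKER_INTERVAL
    return (seq_num + gap) % SEQ_MOD


def fpdu_header_seq(marker_seq: int, fpduptr: int) -> int:
    """Sequence number of the ULPDU Length field that a Marker refers to."""
    fpduptr &= 0xFFFC  # receivers treat the two low-order bits as zero
    if fpduptr == 0:
        # Marker sits between FPDUs: the header follows the Marker directly.
        return (marker_seq + MARKER_SIZE) % SEQ_MOD
    # Otherwise FPDUPTR is a back-pointer; the modulo handles wraparound.
    return (marker_seq - fpduptr) % SEQ_MOD
```

With a connection whose first Marker is at sequence number 11, next_marker_seq(12, 11) yields 523, matching the example in the text. The modulo on every subtraction is what satisfies the requirement that the computation account for sequence number wrap.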
4.4 CRC Calculation

An MPA implementation MUST implement CRC support and MUST either:

(1) always use CRCs; the MPA provider is not REQUIRED to support an administrator's request that CRCs not be used.

or

(2a) only indicate a preference to not use CRCs on the explicit request of the system administrator, via an interface not defined in this spec. The default configuration for a connection MUST be to use CRCs.

(2b) disable CRC checking (and possibly generation) if both the local and remote endpoints indicate preference to not use CRCs.

The decision for hosts to request CRC suppression MAY be made on an administrative basis for any path that provides equivalent protection from undetected errors as an end-to-end CRC32c.

The process MUST be invisible to the ULP.

After receipt of an MPA startup declaration indicating that its peer requires CRCs, an MPA instance MUST continue generating and checking CRCs until the connection terminates. If an MPA instance has declared that it does not require CRCs, it MUST turn off CRC checking immediately after receipt of an MPA mode declaration indicating that its peer also does not require CRCs. It MAY continue generating CRCs. See section 7.1 Connection setup for details on the MPA startup.

When sending an FPDU, the sender MUST include a CRC field. When CRCs are enabled, the CRC field in the MPA FPDU MUST be computed using the CRC32C polynomial in the manner described in the iSCSI Protocol [iSCSI] document for Header and Data Digests.

The fields which MUST be included in the CRC calculation when sending an FPDU are as follows:

1) If a Marker does not immediately precede the ULPDU Length field, the CRC-32c is calculated from the first octet of the ULPDU Length field, through all the ULPDU and Markers (if present), to the last octet of the PAD (if present), inclusive. If there is a Marker immediately following the PAD, the Marker is included in the CRC calculation for this FPDU.

2) If a Marker immediately precedes the first octet of the ULPDU Length field of the FPDU (i.e. the Marker fell between FPDUs, and thus is required to be included in the second FPDU), the CRC-32c is calculated from the first octet of the Marker, through the ULPDU Length header, through all the ULPDU and Markers (if present), to the last octet of the PAD (if present), inclusive.

3) After calculating the CRC-32c, the resultant value is placed into the CRC field at the end of the FPDU.

When an FPDU is received, and CRC checking is enabled, the receiver MUST first perform the following:

1) Calculate the CRC of the incoming FPDU in the same fashion as defined above.

2) Verify that the calculated CRC-32c value is the same as the received CRC-32c value found in the FPDU CRC field. If not, the receiver MUST treat the FPDU as an invalid FPDU.

The procedure for handling invalid FPDUs is covered in the Error Section (see section 8).

The following is an annotated hex dump of an example FPDU sent as the first FPDU on the stream. As such, it starts with a Marker. The FPDU contains a 42 octet ULPDU (an example DDP segment), which in turn carries a 24 octet data payload that is all zeros. The CRC32c has been correctly calculated and can be used as a reference. See the [DDP] and [RDMAP] specifications for definitions of the DDP Control field, Queue, MSN, MO, and Send Data.

Octet Contents  Annotation
Count
0000    00      Marker: Reserved
0001    00
0002    00      Marker: FPDUPTR
0003    00
0004    00      ULPDU Length
0005    2a
0006    41      DDP Control Field, Send with Last flag set
0007    43
0008    00      Reserved (DDP STag position with no STag)
0009    00
000a    00
000b    00
000c    00      DDP Queue = 0
000d    00
000e    00
000f    00
0010    00      DDP MSN = 1
0011    00
0012    00
0013    01
0014    00      DDP MO = 0
0015    00
0016    00
0017    00
0018    00      DDP Send Data (24 octets of zeros)
...
002f    00
0030    52      CRC32c
0031    23
0032    99
0033    83

Figure 5 Annotated Hex Dump of an FPDU

The following is an example sent as the second FPDU of the stream where the first FPDU (which is not shown here) had a length of 492 octets and was also a Send to Queue 0 with Last Flag set. This example contains a Marker.
Octet Contents  Annotation
Count
01ec    00      Length
01ed    2a
01ee    41      DDP Control Field: Send with Last Flag set
01ef    43
01f0    00      Reserved (DDP STag position with no STag)
01f1    00
01f2    00
01f3    00
01f4    00      DDP Queue = 0
01f5    00
01f6    00
01f7    00
01f8    00      DDP MSN = 2
01f9    00
01fa    00
01fb    02
01fc    00      DDP MO = 0
01fd    00
01fe    00
01ff    00
0200    00      Marker: Reserved
0201    00
0202    00      Marker: FPDUPTR
0203    14
0204    00      DDP Send Data (24 octets of zeros)
...
021b    00
021c    84      CRC32c
021d    92
021e    58
021f    98

Figure 6 Annotated Hex Dump of an FPDU with Marker

4.5 FPDU Size Considerations

MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as the size of the largest ULPDU fitting in an FPDU. For an empty TCP Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus space for Markers and pad octets.

The maximum ULPDU Length for a single ULPDU when Markers are present MUST be computed as:

   MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)

The formula above accounts for the worst-case number of Markers.

The maximum ULPDU Length for a single ULPDU when Markers are NOT present MUST be computed as:

   MULPDU = EMSS - (6 + EMSS mod 4)

As a further optimization of the wire efficiency an MPA implementation MAY dynamically adjust the MULPDU (see section 5 for latency and wire efficiency trade-offs). When one or more FPDUs are already packed into a TCP Segment, MULPDU MAY be reduced accordingly.

DDP SHOULD provide ULPDUs that are as large as possible, but less than or equal to MULPDU.

If the TCP implementation needs to adjust EMSS to support MTU changes or changing TCP options, the MULPDU value is changed accordingly.

In certain rare situations, the EMSS may shrink below 128 octets in size. If this occurs, the MPA on TCP sender MUST NOT shrink the MULPDU below 128 octets and is not REQUIRED to follow the segmentation rules in Sections 5.1 and 5.3.
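As an illustration of how the FPDU layout, the CRC rules, and the MULPDU formulas fit together, the following is a minimal sketch and not a reference implementation. The function names are hypothetical, Markers are not modeled in build_fpdu, and the octet order used for the CRC field (least significant octet of the conventional CRC32C value first) is an assumption taken from the iSCSI digest convention; it should be checked against [iSCSI] before being relied upon.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli), using the reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def build_fpdu(ulpdu: bytes) -> bytes:
    """Frame one ULPDU as an FPDU (length, ULPDU, pad, CRC); no Markers."""
    if len(ulpdu) > 0xFFFF:
        raise ValueError("ULPDU does not fit the 16-bit ULPDU_Length field")
    header = struct.pack("!H", len(ulpdu))        # ULPDU_Length, network order
    pad = bytes(-(len(header) + len(ulpdu)) % 4)  # 0-3 zero octets
    body = header + ulpdu + pad
    # ASSUMPTION: CRC octet order follows the iSCSI digest convention.
    return body + struct.pack("<I", crc32c(body))


def mulpdu(emss: int, markers: bool = True) -> int:
    """MULPDU for an empty TCP segment, never reported below 128 octets."""
    marker_octets = 4 * -(-emss // 512) if markers else 0  # 4 * Ceiling(EMSS/512)
    return max(128, emss - (6 + marker_octets + emss % 4))
```

For an EMSS of 1460 octets the sketch gives a MULPDU of 1442 with Markers and 1454 without. A 1442 octet ULPDU plus the 2 octet length, no pad, the 4 octet CRC, and three Markers fills the segment exactly.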
If one or more FPDUs are already packed into a TCP segment, such that the remaining room is less than 128 octets, MPA MUST NOT provide a MULPDU smaller than 128. In this case, MPA would typically provide a MULPDU for the next full sized segment, but may still pack the next FPDU into the small remaining room, provided that the next FPDU is small enough to fit.

The value 128 is chosen so as to allow DDP designers room for the DDP Header and some user data.

5 MPA's interactions with TCP

The following sections describe MPA's interactions with TCP. We will discuss two significant cases: using a standard layered TCP stack with MPA attached above a TCP socket, and using an optimized MPA-aware TCP with an MPA implementation that takes advantage of the extra optimizations. Other implementations are possible.

+-----------------------------------+
| +-----+       +-----------------+ |
| | MPA |       | Other Protocols | |
| +-----+       +-----------------+ |
|    ||                  ||         |
|  ----- socket API --------------  |
|                ||                 |
|             +-----+               |
|             | TCP |               |
|             +-----+               |
|                ||                 |
|             +-----+               |
|             | IP  |               |
|             +-----+               |
+-----------------------------------+

Figure 7 Fully layered implementation

The Fully layered implementation is described for completeness; however, the user is cautioned that the reduced probability of FPDU alignment when transmitting with this implementation will tend to introduce a higher overhead at optimized receivers. In addition, the lack of out-of-order receive processing will significantly reduce the value of DDP/MPA by imposing higher buffering and copying overhead in the local receiver.
+-----------------------------------+
| +-----------+ +-----------------+ |
| | Optimized | | Other Protocols | |
| |  MPA/TCP  | +-----------------+ |
| +-----------+        ||           |
|         \\     --- socket API --- |
|          \\          ||           |
|           \\      +-----+         |
|            \\     | TCP |         |
|             \\    +-----+         |
|              \\    //             |
|             +-------+             |
|             |  IP   |             |
|             +-------+             |
+-----------------------------------+

Figure 8 Optimized MPA/TCP implementation

The optimized MPA/TCP implementations described below are only applicable to MPA; all other TCP applications continue to use the standard TCP stacks and interfaces.

5.1 MPA transmitters with a standard layered TCP

MPA transmitters SHOULD calculate a MULPDU as described in section 4.5. If the TCP implementation allows EMSS to be determined by MPA, that value should be used. If the transmit side TCP implementation is not able to report the EMSS, MPA SHOULD use the current MTU value to establish a likely FPDU size, taking into account the various expected header sizes.

MPA transmitters SHOULD also use whatever facilities the TCP stack presents to cause the TCP transmitter to start TCP segments at FPDU boundaries. Multiple FPDUs MAY be packed into a single TCP segment as determined by the EMSS calculation as long as they are entirely contained in the TCP segment.

For example, passing FPDU buffers sized to the current EMSS to the TCP socket and using the TCP_NODELAY socket option to disable the Nagle [RFC0896] algorithm will usually result in many of the segments starting with an FPDU.

It is recognized that various effects can cause FPDU alignment to be lost. Following are a few of the effects:

* ULPDUs that are smaller than the MULPDU. If these are sent in a continuous stream, FPDU alignment will be lost. Note that careful use of a dynamic MULPDU can help in this case; the MULPDU for future FPDUs can be adjusted to re-establish alignment with the segments based on the current EMSS.

* Sending enough data that the TCP receive window limit is reached.
TCP may send a smaller segment to exactly fill the receive window.

* Sending data when TCP is operating up against the congestion window. If TCP is not tracking the congestion window in segments, it may transmit a smaller segment to exactly fill the receive window.

* Changes in EMSS due to varying TCP options, or changes in MTU.

If FPDU alignment with TCP segments is lost for any reason, the alignment is regained after a break in transmission where the TCP send buffers are emptied. Many usage models for DDP/MPA will include such breaks.

MPA receivers are REQUIRED to be able to operate correctly even if alignment is lost (see section 6).

5.2 MPA receivers with a standard layered TCP

MPA receivers will get TCP data in the usual ordered stream. The receivers MUST identify FPDU boundaries by using the ULPDU_LENGTH field, as described in section 6. Receivers MAY utilize markers to check for FPDU boundary consistency, but they are NOT required to examine the markers to determine the FPDU boundaries.

5.3 Optimized MPA/TCP transmitters

The various TCP RFCs allow considerable choice in segmenting a TCP stream. In order to optimize FPDU recovery at the MPA receiver, an optimized MPA/TCP implementation uses additional segmentation rules. MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU contained in one FPDU.

To provide optimum performance, an optimized MPA/TCP transmit side implementation SHOULD be enabled to:

* With an EMSS large enough to contain the FPDU(s), segment the outgoing TCP stream such that the first octet of every TCP Segment begins with an FPDU. Multiple FPDUs MAY be packed into a single TCP segment as long as they are entirely contained in the TCP segment.

* Report the current EMSS from the TCP to the MPA transmit layer.

There are exceptions to the above rule. Once a ULPDU is provided to MPA, the MPA/TCP sender MUST transmit it or fail the connection; it cannot be repudiated. As a result, during changes in MTU and EMSS, or when TCP's Receive Window size (RWIN) becomes too small, it may be necessary to send FPDUs that do not conform to the segmentation rule above.

A possible, but less desirable, alternative is to use IP fragmentation on accepted FPDUs to deal with MTU reductions or extremely small EMSS.

The sender MUST still format the FPDU according to the FPDU format as shown in Figure 2.

On a retransmission, TCP does not necessarily preserve original TCP segmentation boundaries. This can lead to the loss of FPDU Alignment and containment within a TCP segment during TCP retransmissions. An optimized MPA/TCP sender SHOULD try to preserve original TCP segmentation boundaries on a retransmission.

5.3.1 Effects of Optimized MPA/TCP Segmentation

Optimized MPA/TCP senders will fill TCP segments to the EMSS with a single FPDU when a DDP message is large enough. Since the DDP message may not exactly fit into TCP segments, a "message tail" often occurs that results in an FPDU that is smaller than a single TCP segment. Additionally some DDP messages may be considerably shorter than the EMSS. If a small FPDU is sent in a single TCP segment the result is a "short" TCP segment.
Applications expected to see strong advantages from Direct Data Placement include transaction-based applications and throughput applications. Request/response protocols typically send one FPDU per TCP segment and then wait for a response. Under these conditions, these "short" TCP segments are an appropriate and expected effect of the segmentation.

Another possibility is that the application might be sending multiple messages (FPDUs) to the same endpoint before waiting for a response. In this case, the segmentation policy would tend to reduce the available connection bandwidth by under-filling the TCP segments.

Standard TCP implementations often utilize the Nagle [RFC0896] algorithm to ensure that segments are filled to the EMSS whenever the round trip latency is large enough that the source stream can fully fill segments before Acks arrive. The algorithm does this by delaying the transmission of TCP segments until a ULP can fill a segment, or until an ACK arrives from the far side. The algorithm thus allows for smaller segments when latencies are shorter to keep the ULP's end to end latency to reasonable levels.

The Nagle algorithm is not mandatory to use [RFC1122].

When used with optimized MPA/TCP stacks, Nagle and similar algorithms can result in the "packing" of multiple FPDUs into TCP segments.

If a "message tail", small DDP messages, or the start of a larger DDP message are available, MPA MAY pack multiple FPDUs into TCP segments. When this is done, the TCP segments can be more fully utilized, but, due to the size constraints of FPDUs, segments may not be filled to the EMSS. A dynamic MULPDU that informs DDP of the size of the remaining TCP segment space makes filling the TCP segment more effective.

Note that MPA receivers must do more processing of a TCP segment that contains multiple FPDUs; this may affect the performance of some receiver implementations.
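On a sockets-based stack, the choice of whether Nagle is applied is made per connection. The following sketch shows one way an application above a standard layered TCP might disable it; the helper name is hypothetical and nothing here is mandated by MPA.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 asks TCP to send small segments without waiting.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

Leaving the option at its default keeps Nagle enabled, which favors wire efficiency over latency.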
It is up to the ULP to decide if Nagle is useful with DDP/MPA. Note that many of the applications expected to take advantage of MPA/DDP prefer to avoid the extra delays caused by Nagle. In such scenarios it is anticipated there will be minimal opportunity for packing at the transmitter, and receivers may choose to optimize their performance for this anticipated behavior.

Therefore, the application is expected to set TCP parameters such that it can trade off latency and wire efficiency. This is accomplished by setting the TCP_NODELAY socket option (which disables Nagle).

When latency is not critical, the application is expected to leave Nagle enabled. In this case the TCP implementation may pack any available FPDUs into TCP segments so that the segments are filled to the EMSS. If the amount of data available is not enough to fill the TCP segment when it is prepared for transmission, TCP can send the segment partly filled, or use the Nagle algorithm to wait for the ULP to post more data.

5.4 Optimized MPA/TCP receivers

When an MPA receive implementation and the MPA-aware receive side TCP implementation support handling out of order ULPDUs, the TCP receive implementation SHOULD be enabled to perform the following functions:

1) The implementation SHOULD pass incoming TCP segments to MPA as soon as they have been received and validated, even if not received in order. The TCP layer MUST have committed to keeping each segment before it can be passed to MPA. This means that the segment must have passed the TCP, IP, and lower layer data integrity validation (i.e., checksum), must be in the receive window, must be part of the same epoch (if timestamps are used to verify this), and must have passed any other checks required by TCP RFCs.
This is not to imply that the data must be completely ordered before use. An implementation MAY accept out of order segments, SACK them [RFC2018], and pass them to MPA immediately, before the segments needed to fill in the gaps arrive. MPA expects to utilize these segments when they are complete FPDUs or can be combined into complete FPDUs, to allow the passing of ULPDUs to DDP when they arrive, independent of ordering. DDP uses the passed ULPDU to "place" the DDP segments (see [DDP] for more details).

Since MPA performs a CRC calculation and other checks on received FPDUs, the MPA/TCP implementation MUST ensure that any TCP segments that duplicate data already received and processed (as can happen during TCP retries) do not overwrite already received and processed FPDUs. This avoids the possibility that duplicate data may corrupt already validated FPDUs.
2) The implementation MUST provide a mechanism to indicate the ordering of TCP segments as the sender transmitted them. One possible mechanism might be attaching the TCP sequence number to each segment.

3) The implementation MUST provide a mechanism to indicate when a given TCP segment (and the prior TCP stream) is complete. One possible mechanism might be to utilize the leading (left) edge of the TCP Receive Window.

MPA uses the ordering and completion indications to inform DDP when a ULPDU is complete; MPA Delivers the FPDU to DDP. DDP uses the indications to "deliver" its messages to the DDP consumer (see [DDP] for more details).

DDP on MPA MUST utilize these two mechanisms to establish the Delivery semantics that DDP's consumers agree to. These semantics are described fully in [DDP]. These include requirements on DDP's consumer to respect ownership of buffers prior to the time that DDP delivers them to the Consumer.

6 MPA Receiver FPDU Identification

An MPA receiver MUST first verify the FPDU before passing the ULPDU to DDP. To do this, the receiver MUST:

* locate the start of the FPDU unambiguously,

* verify its CRC (if CRC checking is enabled).

If the above conditions are true, the MPA receiver passes the ULPDU to DDP.
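As an illustration of how the ordering and completion indications of section 5.4 might be combined, the following minimal sketch separates "placement" (on arrival, in any order) from "delivery" (once the left edge has passed). It assumes that every arriving segment carries exactly one validated FPDU, that the ordering indication is the TCP sequence number, and that sequence numbers do not wrap; the class and method names are hypothetical.

```python
class DeliveryTracker:
    """Tracks validated FPDUs by TCP sequence number (illustrative only)."""

    def __init__(self, initial_seq):
        self.left_edge = initial_seq  # next in-order octet expected
        self.pending = {}             # seq -> length, placed but not delivered
        self.placed = []              # (seq, length) in arrival order

    def on_fpdu(self, seq, length):
        """Record a validated FPDU covering [seq, seq + length).

        Returns the FPDUs that became deliverable, in stream order.
        """
        if seq + length <= self.left_edge or seq in self.pending:
            # Duplicate data (e.g. a TCP retry) must not overwrite FPDUs
            # that were already received and processed.
            return []
        self.placed.append((seq, length))  # placement happens immediately
        self.pending[seq] = length
        deliverable = []
        # Completion: advance the left edge over contiguous data.
        while self.left_edge in self.pending:
            n = self.pending.pop(self.left_edge)
            deliverable.append((self.left_edge, n))
            self.left_edge += n
        return deliverable


tracker = DeliveryTracker(initial_seq=1000)
print(tracker.on_fpdu(1100, 50))   # placed out of order, nothing deliverable yet
print(tracker.on_fpdu(1000, 100))  # fills the gap, both become deliverable
```

A real implementation would also have to handle sequence number wrap, FPDUs that span segments, and segments that carry several FPDUs.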
To detect the start of the FPDU unambiguously one of the following MUST be used:

1: In an ordered TCP stream, the ULPDU Length field in the current FPDU, when the FPDU has a valid CRC, can be used to identify the beginning of the next FPDU.

2: For optimized MPA/TCP receivers that support out of order reception of FPDUs (see section 4.3 MPA Markers) a Marker can always be used to locate the beginning of an FPDU (in FPDUs with valid CRCs). Since the location of the Marker is known in the octet stream (sequence number space), the Marker can always be found.

3: Having found an FPDU by means of a Marker, an optimized MPA/TCP receiver can find following contiguous FPDUs by using the ULPDU Length fields (from FPDUs with valid CRCs) to establish the next FPDU boundary.

The ULPDU Length field (see section 4) MUST be used to determine if the entire FPDU is present before forwarding the ULPDU to DDP.

CRC calculation is discussed in section 4.4 above.

6.1 Re-segmenting Middle boxes and non optimized MPA/TCP senders

Since MPA senders often start FPDUs on TCP segment boundaries, a receiving optimized MPA/TCP implementation may be able to optimize the reception of data in various ways.

However, MPA receivers MUST NOT depend on FPDU Alignment on TCP segment boundaries.

Some MPA senders may be unable to conform to the sender requirements because their implementation of TCP is not designed with MPA in mind. Even for optimized MPA/TCP senders, the network may contain "middle boxes" which modify the TCP stream by changing the segmentation. This is generally interoperable with TCP and its users, and MPA must be no exception.

The presence of Markers in MPA (when enabled) allows an optimized MPA/TCP receiver to recover the FPDUs despite these obstacles, although it may be necessary to utilize additional buffering at the receiver to do so.
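Method 1 above (walking an ordered stream by ULPDU Length) can be sketched as follows. The sketch assumes the FPDU layout of section 4: a 2-octet ULPDU Length in network byte order, the ULPDU, padding to a 4-octet boundary, and a 4-octet CRC field. It further assumes that Markers are disabled and that CRC checking is not in use, so the CRC field is skipped rather than verified. The helper names are hypothetical.

```python
import struct


def fpdu_size(ulpdu_length):
    """Total FPDU octets: length field + ULPDU + pad + CRC field."""
    unpadded = 2 + ulpdu_length
    pad = (-unpadded) % 4          # pad up to a 4-octet boundary
    return unpadded + pad + 4


def make_fpdu(ulpdu):
    """Build an FPDU with a zeroed CRC field (CRC assumed not in use)."""
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    return body + b"\x00" * ((-len(body)) % 4) + b"\x00" * 4


def split_fpdus(stream):
    """Return (ulpdus, leftover) for an in-order octet stream."""
    ulpdus = []
    pos = 0
    while len(stream) - pos >= 2:
        (ulpdu_length,) = struct.unpack_from("!H", stream, pos)
        total = fpdu_size(ulpdu_length)
        if len(stream) - pos < total:
            break                  # incomplete FPDU: buffer until the rest arrives
        ulpdus.append(stream[pos + 2:pos + 2 + ulpdu_length])
        pos += total               # ULPDU Length locates the next FPDU
    return ulpdus, stream[pos:]


stream = make_fpdu(b"hello") + make_fpdu(b"ab")
print(split_fpdus(stream))
print(split_fpdus(stream[:-3]))    # second FPDU truncated, kept as leftover
```

With Markers enabled, the Marker octets inserted into the stream would have to be removed or skipped before this length arithmetic applies.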
Some of the cases that a receiver may have to contend with are listed below as a reminder to the implementer:

* A single Aligned and complete FPDU, either in order, or out of order: This can be passed to DDP as soon as validated, and Delivered when ordering is established.

* Multiple FPDUs in a TCP segment, aligned and fully contained, either in order, or out of order: These can be passed to DDP as soon as validated, and Delivered when ordering is established.

* Incomplete FPDU: The receiver should buffer until the remainder of the FPDU arrives. If the remainder of the FPDU is already available, this can be passed to DDP as soon as validated, and Delivered when ordering is established.

* Unaligned FPDU start: The partial FPDU must be combined with its preceding portion(s). If the preceding parts are already available, and the whole FPDU is present, this can be passed to DDP as soon as validated, and Delivered when ordering is established. If the whole FPDU is not available, the receiver should buffer until the remainder of the FPDU arrives.

* Combinations of Unaligned or incomplete FPDUs (and potentially other complete FPDUs) in the same TCP segment: If any FPDU is present in its entirety, or can be completed with portions already available, it can be passed to DDP as soon as validated, and Delivered when ordering is established.

7 Connection Semantics

7.1 Connection setup

MPA requires that the Consumer MUST activate MPA, and any TCP enhancements for MPA, on a TCP half connection at the same location in the octet stream at both the sender and the receiver. This is required in order for the Marker scheme to correctly locate the Markers (if enabled) and to correctly locate the first FPDU.

MPA, and any TCP enhancements for MPA, are enabled by the ULP in both directions at once at an endpoint.

This can be accomplished in several ways, and is left up to DDP's ULP:

* DDP's ULP MAY require DDP on MPA startup immediately after TCP connection setup.
  This has the advantage that no streaming mode negotiation is needed. An example of such a protocol is shown in Figure 11: Example Immediate Startup negotiation.

  This may be accomplished by using a well-known port, or a service locator protocol to locate an appropriate port on which DDP on MPA is expected to operate.

* DDP's ULP MAY negotiate the start of DDP on MPA sometime after a normal TCP startup, using TCP streaming data exchanges on the same connection. The exchange establishes that DDP on MPA (as well as other ULPs) will be used, and exactly locates the point in the octet stream where MPA is to begin operation. Note that such a negotiation protocol is outside the scope of this specification. A simplified example of such a protocol is shown in Figure 10: Example Delayed Startup negotiation.

An MPA endpoint operates in two distinct phases.

The Startup Phase is used to verify correct MPA setup, exchange CRC and Marker configuration, and optionally pass Private Data between endpoints prior to completing a DDP connection. During this phase, specifically formatted frames are exchanged as TCP byte streams without using CRCs or Markers. During this phase a DDP endpoint need not be "bound" to the MPA connection. In fact, the choice of DDP endpoint and its operating parameters may not be known until the Consumer supplied Private Data (if any) has been examined by the Consumer.

The second distinct phase is Full Operation, during which FPDUs are sent using all the rules that pertain (CRCs, Markers, MULPDU restrictions, etc.). A DDP endpoint MUST be "bound" to the MPA connection at entry to this phase.

When Private Data is passed between ULPs in the Startup Phase, the ULP is responsible for interpreting that data, and then placing MPA into Full Operation.

Note: The following text differentiates the two endpoints by calling them Initiator and Responder. This is quite arbitrary and is NOT related to the TCP startup (SYN, SYN/ACK sequence).
The Initiator is the side that sends first in the MPA startup sequence (the MPA Request Frame). Note: The possibility that both endpoints would be allowed to make a connection at the same time, sometimes called an active/active connection, was considered by the work group and rejected. There were several motivations for this decision. One was that applications needing this facility were few (none other than theoretical at the time of this draft). Another was that the facility created some implementation difficulties, particularly with the "dual stack" designs described later on. A last issue was that dealing with rejected connections at startup would have required at least an additional frame type, and more recovery actions, complicating the protocol. While none of these issues was overwhelming, the group and implementers were not motivated to do the work to resolve these issues. The protocol includes a method of detecting these active/active startup attempts so that they can be rejected and an error reported. The ULP is responsible for determining which side is Initiator or Responder. For client/server type ULPs this is easy. For peer-peer ULPs (which might utilize a TCP style active/active startup), some mechanism (not defined by this specification) must be established, or some streaming mode data exchanged prior to MPA startup to determine the side which starts in Initiator and which starts in Responder MPA mode. 
7.1.1 MPA Request and Reply Frame Format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
0  |                                                               |
   +         Key (16 bytes containing "MPA ID Req Frame")          +
4  |       (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)       |
   +         Or  (16 bytes containing "MPA ID Rep Frame")          +
8  |       (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)       |
   +                                                               +
12 |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |M|C|R| Res     |     Rev       |          PD_Length            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                                                               ~
   ~                          Private Data                         ~
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 9 MPA Request/Reply Frame

Key: This field contains the "key" used to validate that the sender is an MPA sender. Initiator mode senders MUST set this field to the fixed value "MPA ID Req Frame" or (in byte order) 4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal). Responder mode receivers MUST check this field for the same value, and close the connection and report an error locally if any other value is detected. Responder mode senders MUST set this field to the fixed value "MPA ID Rep Frame" or (in byte order) 4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). Initiator mode receivers MUST check this field for the same value, and close the connection and report an error locally if any other value is detected.

M: This bit, when sent in an MPA Request Frame or an MPA Reply Frame, declares a receiver's requirement for Markers. When the value in a received MPA Request Frame or MPA Reply Frame is '0', Markers MUST NOT be added to the data stream by the sender. When the value is '1', Markers MUST be added as described in section 4.3 MPA Markers.

C: This bit declares an endpoint's preferred CRC usage. When this field is '0' in the MPA Request Frame and the MPA Reply Frame, CRCs MUST NOT be checked and need not be generated by either endpoint.
When this bit is '1' in either the MPA Request Frame or MPA Reply Frame, CRCs MUST be generated and checked by both endpoints. Note that even when not in use, the CRC field remains present in the FPDU. When CRCs are not in use, the CRC field MUST be considered valid for FPDU checking regardless of its contents.

R: This bit is set to zero, and not checked on reception, in the MPA Request Frame. In the MPA Reply Frame, this bit is the Rejected Connection bit, set by the Responder's ULP to indicate acceptance '0', or rejection '1', of the connection parameters provided in the Private Data.

Res: This field is reserved for future use. It MUST be set to zero when sending, and not checked on reception.

Rev: This field contains the Revision of MPA. For this version of the specification senders MUST set this field to one. MPA receivers compliant with this version of the specification MUST check this field. If the MPA receiver cannot interoperate with the received version, then it MUST close the connection and report an error locally. Otherwise, the MPA receiver should report the received version to the ULP.

PD_Length: This field MUST contain the length in Octets of the Private Data field. A value of zero indicates that there is no Private Data field present at all. If the receiver detects that the PD_Length field does not match the length of the Private Data field, or if the length of the Private Data field exceeds 512 octets, the receiver MUST close the connection and report an error locally. Otherwise, the MPA receiver should pass the PD_Length value and Private Data to the ULP.

Private Data: This field may contain any value defined by ULPs or may not be present. The Private Data field MUST be between 0 and 512 octets in length. ULPs define how to size, set, and validate this field within these limits.

7.1.2 Connection Startup Rules

The following rules apply to MPA connection Startup Phase:

1.
When MPA is started in the Initiator mode, the MPA implementation MUST send a valid MPA Request Frame. The MPA Request Frame MAY include ULP supplied Private Data.

2. When MPA is started in the Responder mode, the MPA implementation MUST wait until an MPA Request Frame is received and validated before entering full MPA/DDP operation.

If the MPA Request Frame is improperly formatted, the implementation MUST close the TCP connection and exit MPA.

If the MPA Request Frame is properly formatted but the Private Data is not acceptable, the implementation SHOULD return an MPA Reply Frame with the Rejected Connection bit set to '1'; the MPA Reply Frame MAY include ULP supplied Private Data; the implementation MUST exit MPA, leaving the TCP connection open. The ULP may close TCP or use the connection for other purposes.

If the MPA Request Frame is properly formatted and the Private Data is acceptable, the implementation SHOULD return an MPA Reply Frame with the Rejected Connection bit set to '0'; the MPA Reply Frame MAY include ULP supplied Private Data; and the Responder SHOULD prepare to interpret any data received as FPDUs and pass any received ULPDUs to DDP.

Note: Since the receiver's ability to deal with Markers is unknown until the Request and Reply Frames have been received, sending FPDUs before this occurs is not possible.

Note: The requirement to wait on a Request Frame before sending a Reply Frame is a design choice. It makes for a well ordered sequence of events at each end, and avoids having to specify how to deal with situations where both ends start at the same time.

3. MPA Initiator mode implementations MUST receive and validate an MPA Reply Frame.

If the MPA Reply Frame is improperly formatted, the implementation MUST close the TCP connection and exit MPA.

If the MPA Reply Frame is properly formatted but the Private Data is not acceptable, or if the Rejected Connection bit is set to '1', the implementation MUST exit MPA, leaving the TCP connection open.
The ULP may close TCP or use the connection for other purposes.

If the MPA Reply Frame is properly formatted and the Private Data is acceptable, and the Rejected Connection bit is set to '0', the implementation SHOULD enter full MPA/DDP operation mode, interpreting any received data as FPDUs and sending DDP ULPDUs as FPDUs.

4. MPA Responder mode implementations MUST receive and validate at least one FPDU before sending any FPDUs or Markers.

Note: This requirement is present to allow the Initiator time to get its receiver into Full Operation before an FPDU arrives, avoiding potential race conditions at the Initiator. This was also subject to some debate in the work group before rough consensus was reached. Eliminating this requirement would allow faster startup in some types of applications. However, that would also make certain implementations (particularly "dual stack") much harder.

5. If a received "Key" does not match the expected value (see 7.1.1 MPA Request and Reply Frame Format above), the TCP/DDP connection MUST be closed, and an error returned to the ULP.

6. The received Private Data fields may be used by Consumers at either end to further validate the connection, and set up DDP or other ULP parameters. The Initiator ULP MAY close the TCP/MPA/DDP connection as a result of validating the Private Data fields. The Responder SHOULD return an MPA Reply Frame with the Rejected Connection bit set to '1' if the validation of the Private Data is not acceptable to the ULP.

7. When the first FPDU is to be sent, then if Markers are enabled, the first octets sent are the special Marker 0x00000000, followed by the start of the FPDU (the FPDU's ULPDU Length field). If Markers are not enabled, the first octets sent are the start of the FPDU (the FPDU's ULPDU Length field).

8. MPA implementations MUST use the difference between the MPA Request Frame and the MPA Reply Frame to check for incorrect "Initiator/Initiator" startups.
Implementations SHOULD put a timeout on waiting for the MPA Request Frame when started in Responder mode, to detect incorrect "Responder/Responder" startups.

9. MPA implementations MUST validate the PD_Length field. The buffer that receives the Private Data field MUST be large enough to receive that data; the amount of Private Data MUST NOT exceed the PD_Length, or the application buffer. If any of the above fails, the startup frame MUST be considered improperly formatted.

10. MPA implementations SHOULD implement a reasonable timeout while waiting for the entire startup frames; this prevents certain denial of service attacks. ULPs SHOULD implement a reasonable timeout while waiting for FPDUs, ULPDUs and application level messages to guard against application failures and certain denial of service attacks.

7.1.3 Example Delayed Startup sequence

A variety of startup sequences are possible when using MPA on TCP. Following is an example of an MPA/DDP startup that occurs after TCP has been running for a while and has exchanged some amount of streaming data. This example does not use any Private Data (an example that does is shown later in 7.1.4.2 Example Immediate Startup using Private Data), although it is perfectly legal to include the Private Data. Note that since the example does not use any Private Data, there are no ULP interactions shown between receiving "Startup frames" and putting MPA into Full Operation.

         Initiator                               Responder

+---------------------------+
|ULP streaming mode         |
| <Hello> request to        |
| transition to DDP/MPA     |           +--------------------------+
| mode (optional)           | --------> |ULP gets request;         |
+---------------------------+           |enables MPA Responder mode|
                                        |with last (optional)      |
                                        |streaming mode <Hello Ack>|
                                        |for MPA to send.          |
+---------------------------+           |MPA waits for incoming    |
|ULP receives streaming     | <-------- | <MPA Request Frame>      |
| <Hello Ack>;              |           +--------------------------+
|Enters MPA Initiator mode; |
|MPA sends                  |
| <MPA Request Frame>;      |
|MPA waits for incoming     |           +--------------------------+
| <MPA Reply Frame>         | - - - - > |MPA receives              |
+---------------------------+           | <MPA Request Frame>      |
                                        |Consumer binds DDP to MPA,|
                                        |MPA sends the             |
                                        | <MPA Reply Frame>.       |
                                        |DDP/MPA enables FPDU      |
+---------------------------+           |decoding, but does not    |
|MPA receives the           | < - - - - |send any FPDUs.           |
| <MPA Reply Frame>         |           +--------------------------+
|Consumer binds DDP to MPA, |
|DDP/MPA begins full        |
|operation.                 |
|MPA sends first FPDU (as   |           +--------------------------+
|DDP ULPDUs become          | ========> |MPA Receives first FPDU.  |
|available).                |           |MPA sends first FPDU (as  |
+---------------------------+           |DDP ULPDUs become         |
                               <======  |available).               |
                                        +--------------------------+

Figure 10: Example Delayed Startup negotiation

An example Delayed Startup sequence is described below:

* Active and passive sides start up a TCP connection in the usual fashion, probably using sockets APIs. They exchange some amount of streaming mode data. At some point one side (the MPA Initiator) sends streaming mode data that effectively says "Hello, let's go into MPA/DDP mode."

* When the remote side (the MPA Responder) gets this streaming mode message, the Consumer would send a last streaming mode message that effectively says "I acknowledge your Hello, and am now in MPA Responder Mode". The exchange of these messages establishes the exact point in the TCP stream where MPA is enabled. The Responding Consumer enables MPA in the Responder mode and waits for the initial MPA startup message.

* The Initiating Consumer would enable MPA startup in the Initiator mode, which then sends the MPA Request Frame. It is assumed that no Private Data messages are needed for this example, although it is possible to send them.
  The Initiating MPA (and Consumer) would also wait for the MPA connection to be accepted.

* The Responding MPA would receive the initial MPA Request Frame and would inform the Consumer that this message arrived. The Consumer can then accept the MPA/DDP connection or close the TCP connection.

* To accept the connection request, the Responding Consumer would use an appropriate API to bind the TCP/MPA connections to a DDP endpoint, thus enabling MPA/DDP into Full Operation. In the process of going to Full Operation, MPA sends the MPA Reply Frame. MPA/DDP waits for the first incoming FPDU before sending any FPDUs.

* If the initial TCP data was not a properly formatted MPA Request Frame, MPA will close or reset the TCP connection immediately.

* The Initiating MPA would receive the MPA Reply Frame and would report this message to the Consumer. The Consumer can then accept the MPA/DDP connection, or close or reset the TCP connection to abort the process.

* On determining that the Connection is acceptable, the Initiating Consumer would use an appropriate API to bind the TCP/MPA connections to a DDP endpoint, thus enabling MPA/DDP into Full Operation. MPA/DDP would begin sending DDP messages as MPA FPDUs.

7.1.4 Use of Private Data

This section is advisory in nature, in that it suggests a method by which a ULP can deal with pre-DDP connection information exchange.

7.1.4.1 Motivation

Prior RDMA protocols have been developed that provide Private Data via out of band mechanisms. As a result, many applications now expect some form of Private Data to be available for application use prior to setting up the DDP/RDMA connection. Following are some examples of the use of Private Data.

An RDMA Endpoint (referred to as a Queue Pair, or QP, in InfiniBand and the [VERBS]) must be associated with a Protection Domain. No receive operations may be posted to the endpoint before it is associated with a Protection Domain.
Indeed under both the InfiniBand and proposed RDMA/DDP verbs [VERBS] an endpoint/QP is created within a Protection Domain. There are some applications where the choice of Protection Domain is dependent upon the identity of the remote ULP client. For example, if a user session requires multiple connections, it is highly desirable for all of those connections to use a single Protection Domain. Note: use of Protection Domains is further discussed in [RDMASEC]. InfiniBand, the DAT APIs [DAT-API] and the [IT-API] all provide for the active side ULP to provide Private Data when requesting a connection. This data is passed to the ULP to allow it to determine whether to accept the connection, and if so with which endpoint (and implicitly which Protection Domain). The Private Data can also be used to ensure that both ends of the connection have configured their RDMA endpoints compatibly on such matters as the RDMA Read capacity (see [RDMAP]). Further ULP- specific uses are also presumed, such as establishing the identity of the client. Private Data is also allowed for when accepting the connection, to allow completion of any negotiation on RDMA resources and for other ULP reasons. There are several potential ways to exchange this Private Data. For example, the InfiniBand specification includes a connection management protocol that allows a small amount of Private Data to be exchanged using datagrams before actually starting the RDMA connection. This draft allows for small amounts of Private Data to be exchanged as part of the MPA startup sequence. The actual Private Data fields are carried in the MPA Request Frame, and the MPA Reply Frame. If larger amounts of Private Data or more negotiation is necessary, TCP streaming mode messages may be exchanged prior to enabling MPA. 
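To make the carriage of Private Data in the startup frames concrete, here is a minimal sketch that encodes and validates an MPA Request or Reply Frame as laid out in the frame format section (16-octet Key, M/C/R bits, Res, Rev, PD_Length, Private Data). It assumes network byte order for PD_Length, assumes that M, C and R occupy the three most significant bits of the octet at offset 16 (bit 0 in the figure being the most significant), and supports only Rev 1. The function names are hypothetical, and a real receiver would read the 20-octet header from the TCP stream first and then PD_Length further octets, rather than being handed a whole frame.

```python
import struct

REQ_KEY = b"MPA ID Req Frame"   # Key of the MPA Request Frame
REP_KEY = b"MPA ID Rep Frame"   # Key of the MPA Reply Frame
MAX_PRIVATE_DATA = 512
HEADER_LEN = 20                 # Key (16) + flags (1) + Rev (1) + PD_Length (2)


def build_frame(key, markers, crc, reject=False, private_data=b""):
    """Encode a startup frame; Res bits are left at zero, Rev is one."""
    if len(private_data) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data exceeds 512 octets")
    flags = (0x80 if markers else 0) | (0x40 if crc else 0) | (0x20 if reject else 0)
    return key + struct.pack("!BBH", flags, 1, len(private_data)) + private_data


def parse_frame(frame, expected_key):
    """Validate a complete startup frame and return its fields."""
    if len(frame) < HEADER_LEN or frame[:16] != expected_key:
        raise ValueError("not a valid MPA startup frame")
    flags, rev, pd_length = struct.unpack_from("!BBH", frame, 16)
    if rev != 1:
        raise ValueError("unsupported MPA revision")
    if pd_length > MAX_PRIVATE_DATA or len(frame) - HEADER_LEN != pd_length:
        raise ValueError("PD_Length does not match Private Data")
    return {
        "markers": bool(flags & 0x80),
        "crc": bool(flags & 0x40),
        "reject": bool(flags & 0x20),
        "private_data": frame[HEADER_LEN:],
    }


request = build_frame(REQ_KEY, markers=True, crc=False, private_data=b"hi")
print(parse_frame(request, REQ_KEY))
```

Passing REP_KEY as the expected key for this Request Frame raises ValueError, which is how an Initiator that receives a Request instead of a Reply would notice an "Initiator/Initiator" startup in this sketch.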
7.1.4.2 Example Immediate Startup using Private Data

         Initiator                               Responder

+---------------------------+
|TCP SYN sent               |           +--------------------------+
+---------------------------+ --------> |TCP gets SYN packet;      |
+---------------------------+           | Sends SYN-Ack            |
|TCP gets SYN-Ack           | <-------- +--------------------------+
| Sends Ack                 |
+---------------------------+ --------> +--------------------------+
+---------------------------+           |Consumer enables MPA      |
|Consumer enables MPA       |           |Responder Mode, waits for |
|Initiator mode with        |           | <MPA Request Frame>      |
|Private Data; MPA sends    |           +--------------------------+
| <MPA Request Frame>;      |
|MPA waits for incoming     |           +--------------------------+
| <MPA Reply Frame>         | - - - - > |MPA receives              |
+---------------------------+           | <MPA Request Frame>      |
                                        |Consumer examines Private |
                                        |Data, provides MPA with   |
                                        |return Private Data,      |
                                        |binds DDP to MPA, and     |
                                        |enables MPA to send an    |
                                        | <MPA Reply Frame>.       |
                                        |DDP/MPA enables FPDU      |
+---------------------------+           |decoding, but does not    |
|MPA receives the           | < - - - - |send any FPDUs.           |
| <MPA Reply Frame>         |           +--------------------------+
|Consumer examines Private  |
|Data, binds DDP to MPA,    |
|and enables DDP/MPA to     |
|begin Full Operation.      |
|MPA sends first FPDU (as   |           +--------------------------+
|DDP ULPDUs become          | ========> |MPA Receives first FPDU.  |
|available).                |           |MPA sends first FPDU (as  |
+---------------------------+           |DDP ULPDUs become         |
                               <======  |available).               |
                                        +--------------------------+

Figure 11: Example Immediate Startup negotiation

Note: The exact order of when MPA is started in the TCP connection sequence is implementation dependent; the above diagram shows one possible sequence. Also, the Initiator "Ack" to the Responder's "SYN-Ack" may be combined into the same TCP segment containing the MPA Request Frame (as is allowed by TCP RFCs).
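The Request/Reply exchange in the figure above also settles Marker and CRC usage for each direction. Following the M and C bit definitions in the frame format section, the outcome can be sketched as below. The dictionaries with "markers" and "crc" keys stand for already parsed frames and are a hypothetical representation, as is the function name.

```python
def negotiate(request, reply):
    """Derive per-direction settings from parsed Request and Reply Frames.

    request: bits sent by the Initiator; reply: bits sent by the Responder.
    """
    return {
        # CRCs are generated and checked by both ends if either side set C.
        "crc": request["crc"] or reply["crc"],
        # The M bit states the requirement of the side that sent the frame,
        # acting as a receiver, so it governs what the peer transmits.
        "initiator_sends_markers": reply["markers"],
        "responder_sends_markers": request["markers"],
    }


# The Initiator wants Markers on what it receives; only the Responder wants CRCs.
print(negotiate({"markers": True, "crc": False},
                {"markers": False, "crc": True}))
```

In this example the Responder must add Markers to the FPDUs it sends, the Initiator must not, and both ends generate and check CRCs.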
The example immediate startup sequence is described below: * The passive side (Responding Consumer) would listen on the TCP destination port, to indicate its readiness to accept a connection. * The active side (Initiating Consumer) would request a connection from a TCP endpoint (that expected to upgrade to MPA/DDP/RDMA and expected the Private Data) to a destination address and port. * The Initiating Consumer would initiate a TCP connection to the destination port. Acceptance/rejection of the connection would proceed as per normal TCP connection establishment. * The passive side (Responding Consumer) would receive the TCP connection request as usual allowing normal TCP gatekeepers, such as INETD and TCPserver, to exercise their normal safeguard/logging functions. On acceptance of the TCP connection, the Responding Consumer would enable MPA in the Responder mode and wait for the initial MPA startup message. * The Initiating Consumer would enable MPA startup in the Initiator mode to send an initial MPA Request Frame with its included Private Data message to send. The Initiating MPA (and Consumer) would also wait for the MPA connection to be accepted, and any returned Private Data. * The Responding MPA would receive the initial MPA Request Frame with the Private Data message and would pass the Private Data through to the Consumer. The Consumer can then accept the MPA/DDP connection, close the TCP connection, or reject the MPA connection with a return message. * To accept the connection request, the Responding Consumer would use an appropriate API to bind the TCP/MPA connections to a DDP endpoint, thus enabling MPA/DDP into Full Operation. In the process of going to Full Operation, MPA sends the MPA Reply Frame which includes the Consumer supplied Private Data containing any appropriate Consumer response. MPA/DDP waits for the first incoming FPDU before sending any FPDUs. 
* If the initial TCP data was not a properly formatted MPA Request Frame, MPA will close or reset the TCP connection immediately.

* To reject the MPA connection request, the Responding Consumer would send an MPA Reply Frame with any ULP supplied Private Data (with reason for rejection), with the "Rejected Connection" bit set to '1', and may close the TCP connection.

* The Initiating MPA would receive the MPA Reply Frame with the Private Data message and would report this message to the Consumer, including the supplied Private Data. If the "Rejected Connection" bit is set to a '1', MPA will close the TCP connection and exit. If the "Rejected Connection" bit is set to a '0', and on determining from the MPA Reply Frame Private Data that the Connection is acceptable, the Initiating Consumer would use an appropriate API to bind the TCP/MPA connections to a DDP endpoint thus enabling MPA/DDP into Full Operation. MPA/DDP would begin sending DDP messages as MPA FPDUs.

7.1.5 "Dual stack" implementations

MPA/DDP implementations are commonly expected to be implemented as part of a "dual stack" architecture. One "stack" is the traditional TCP stack, usually with a sockets interface API (Application Programming Interface). The second stack is the MPA/DDP "stack" with its own API, and potentially separate code or hardware to deal with the MPA/DDP data. Of course, implementations may vary, so the following comments are of an advisory nature only.

The use of the two "stacks" offers advantages:

TCP connection setup is usually done with the TCP stack. This allows use of the usual naming and addressing mechanisms. It also means that any mechanisms used to "harden" the connection setup against security threats are also used when starting MPA/DDP.

Some applications may have been originally designed for TCP, but are "enhanced" to utilize MPA/DDP after a negotiation reveals the capability to do so.
The negotiation process takes place in TCP's streaming mode, using the usual TCP APIs.

Some new applications, designed for RDMA or DDP, still need to exchange some data prior to starting MPA/DDP. This exchange can be of arbitrary length or complexity, but often consists of only a small amount of Private Data, perhaps only a single message. Using the TCP streaming mode for this exchange allows this to be done using well understood methods.

The main disadvantage of using two stacks is the conversion of an active TCP connection between them. This process must be done with care to prevent loss of data.

To avoid some of the problems when using a "dual stack" architecture the following additional restrictions may be required by the implementation:

1. Enabling the DDP/MPA stack SHOULD be done only when no incoming stream data is expected. This is typically managed by the ULP protocol. When following the recommended startup sequence, the Responder side enters DDP/MPA mode, sends the last streaming mode data, and then waits for the MPA Request Frame. No additional streaming mode data is expected. The Initiator side ULP receives the last streaming mode data, and then enters DDP/MPA mode. Again, no additional streaming mode data is expected.

2. The DDP/MPA MAY provide the ability to send a "last streaming message" as part of its Responder DDP/MPA enable function. This allows the DDP/MPA stack to more easily manage the conversion to DDP/MPA mode (and avoid problems with a very fast return of the MPA Request Frame from the Initiator side).

Note: Regardless of the "stack" architecture used, TCP's rules MUST be followed. For example, if network data is lost, re-segmented or re-ordered, TCP MUST recover appropriately even when this occurs while switching stacks.

7.2 Normal Connection Teardown

Each half connection of MPA terminates when DDP closes the corresponding TCP half connection.
A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware that a graceful close of the LLP connection has been received by the LLP (e.g. FIN is received).

8 Error Semantics

The following errors MUST be detected by MPA and the codes SHOULD be provided to DDP or other Consumer:

Code  Error

1     TCP connection closed, terminated or lost. This includes lost by timeout, too many retries, RST received or FIN received.

2     Received MPA CRC does not match the calculated value for the FPDU.

3     In the event that the CRC is valid, received MPA Marker (if enabled) and ULPDU Length fields do not agree on the start of a FPDU. If the FPDU start determined from previous ULPDU Length fields does not match with the MPA Marker position, MPA SHOULD deliver an error to DDP. It may not be possible to make this check as a segment arrives, but the check SHOULD be made when a gap creating an out of order sequence is closed and any time a Marker points to an already identified FPDU. It is OPTIONAL for a receiver to check each Marker, if multiple Markers are present in an FPDU, or if the segment is received in order.

4     Invalid MPA Request Frame or MPA Response Frame received. In this case, the TCP connection MUST be immediately closed. DDP and other ULPs should treat this similar to code 1, above.

When conditions 2 or 3 above are detected, an optimized MPA/TCP implementation MAY choose to silently drop the TCP segment rather than reporting the error to DDP. In this case, the sending TCP will retry the segment, usually correcting the error, unless the problem was at the source. In that case, the source will usually exceed the number of retries and terminate the connection.

Once MPA delivers an error of any type, it MUST NOT pass or deliver any additional FPDUs on that half connection.

For Error codes 2 and 3, MPA MUST NOT close the TCP connection following a reported error. Closing the connection is the responsibility of DDP's ULP.
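
The following sketch illustrates how a receive half connection might track these rules. It is an illustration under stated assumptions, not an implementation requirement: the names MpaError and ReceiveHalf are hypothetical, and the tcp_open flag is only a stand-in for real TCP state. It shows the two behaviors stated above: no FPDU is delivered after any error, and only codes 1 and 4 are associated with the TCP connection being closed by or beneath MPA.

```python
from enum import IntEnum


class MpaError(IntEnum):
    """Error codes from the table above (symbolic names are hypothetical)."""
    TCP_CONNECTION_LOST = 1
    CRC_MISMATCH = 2
    MARKER_LENGTH_MISMATCH = 3
    INVALID_STARTUP_FRAME = 4


class ReceiveHalf:
    """Tracks one receive half connection of an MPA stream."""

    def __init__(self) -> None:
        self.error = None      # first error reported to DDP, if any
        self.tcp_open = True   # stand-in for the state of the TCP connection
        self.delivered = []    # FPDUs passed up to DDP

    def report_error(self, code: MpaError) -> None:
        if self.error is None:
            self.error = code
        if code in (MpaError.TCP_CONNECTION_LOST, MpaError.INVALID_STARTUP_FRAME):
            # Code 1: TCP is already gone. Code 4: MUST be closed immediately.
            self.tcp_open = False
        # Codes 2 and 3: MPA MUST NOT close TCP; that is left to DDP's ULP.

    def deliver(self, fpdu: bytes) -> bool:
        """Deliver an FPDU to DDP unless an error was already delivered."""
        if self.error is not None:
            return False
        self.delivered.append(fpdu)
        return True
```

After report_error(MpaError.CRC_MISMATCH), deliver() returns False for every later FPDU while tcp_open stays True, leaving the ULP free to send a diagnostic message on the opposite half connection before tearing the connection down.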
Note that since MPA will not Deliver any FPDUs on a half connection following an error detected on the receive side of that connection, DDP's ULP is expected to tear down the connection. This may not occur until after one or more last messages are transmitted on the opposite half connection. This allows a diagnostic error message to be sent.

9 Security Considerations

This section discusses the security considerations for MPA.

9.1 Protocol-specific Security Considerations

The vulnerabilities of MPA to third-party attacks are no greater than any other protocol running over TCP. A third party, by sending packets into the network that are delivered to an MPA receiver, could launch a variety of attacks that take advantage of how MPA operates. For example, a third party could send random packets that are valid for TCP, but contain no FPDU headers. An MPA receiver reports an error to DDP when any packet arrives that cannot be validated as an FPDU when properly located on an FPDU boundary. A third party could also send packets that are valid for TCP, MPA, and DDP, but do not target valid buffers. These types of attacks ultimately result in loss of connection and thus become a type of DOS (Denial Of Service) attack. Communication security mechanisms such as IPsec [RFC2401] may be used to prevent such attacks.

Independent of how MPA operates, a third party could use ICMP messages to reduce the path MTU to such a small size that performance would likewise be severely impacted. Range checking on path MTU sizes in ICMP packets may be used to prevent such attacks.

[RDMAP] and [DDP] are used to control, read and write data buffers over IP networks. Therefore, the control and the data packets of these protocols are vulnerable to the spoofing, tampering and information disclosure attacks listed below. In addition, connection to/from an unauthorized or unauthenticated endpoint is a potential problem with most applications using RDMA, DDP, and MPA.
9.1.1 Spoofing

Spoofing attacks can be launched by the Remote Peer, or by a network based attacker. A network based spoofing attack applies to all Remote Peers. Because the MPA Stream requires a TCP Stream in the ESTABLISHED state, certain traditional forms of wire attacks do not apply -- an end-to-end handshake must have occurred to establish the MPA Stream. So, the only form of spoofing that applies is one when a remote node can both send and receive packets. Yet even with this limitation the Stream is still exposed to the following spoofing attacks.

9.1.1.1 Impersonation

A network based attacker can impersonate a legal MPA/DDP/RDMAP peer (by spoofing a legal IP address), and establish an MPA/DDP/RDMAP Stream with the victim. End to end authentication (i.e. IPsec or ULP authentication) provides protection against this attack.

9.1.1.2 Stream Hijacking

Stream hijacking happens when a network based attacker follows the Stream establishment phase, and waits until the authentication phase (if such a phase exists) is completed successfully. The attacker can then spoof the IP address and re-direct the Stream from the victim to the attacker's own machine. For example, an attacker can wait until an iSCSI authentication is completed successfully, and hijack the iSCSI Stream.

The best protection against this form of attack is end-to-end integrity protection and authentication, such as IPsec to prevent spoofing. Another option is to provide physical security. Discussion of physical security is out of scope for this document.

9.1.1.3 Man in the Middle Attack

If a network based attacker has the ability to delete, inject, replay, or modify packets which will still be accepted by MPA (e.g., TCP sequence number is correct, FPDU is valid etc.) then the Stream can be exposed to a man in the middle attack.
The attacker could potentially use the services of [DDP] and [RDMAP] to read the contents of the associated data buffer, modify the contents of the associated data buffer, or to disable further access to the buffer. The only countermeasure for this form of attack is to either secure the MPA/DDP/RDMAP Stream (i.e. integrity protect) or attempt to provide physical security to prevent man-in-the-middle type attacks.

The best protection against this form of attack is end-to-end integrity protection and authentication, such as IPsec, to prevent spoofing or tampering. If Stream or session level authentication and integrity protection are not used, then a man-in-the-middle attack can occur, enabling spoofing and tampering.

Another approach is to restrict access to only the local subnet/link, and provide some mechanism to limit access, such as physical security or 802.1x. This model is an extremely limited deployment scenario, and will not be further examined here.

9.1.2 Eavesdropping

Generally speaking, Stream confidentiality protects against eavesdropping. Stream and/or session authentication and integrity protection is a countermeasure against various spoofing and tampering attacks. The effectiveness of authentication and integrity against a specific attack depends on whether the authentication is machine level authentication (such as that provided by IPsec), or ULP authentication.

9.2 Introduction to Security Options

The following security services can be applied to an MPA/DDP/RDMAP Stream:

1. Session confidentiality - protects against eavesdropping.

2. Per-packet data source authentication - protects against the following spoofing attacks: network based impersonation, Stream hijacking, and man in the middle.

3. Per-packet integrity - protects against tampering done by network based modification of FPDUs (indirectly affecting buffer content through DDP services).

4. Packet sequencing - protects against replay attacks, which is a special case of the above tampering attack.

If an MPA/DDP/RDMAP Stream may be subject to impersonation attacks, or Stream hijacking attacks, it is recommended that the Stream be authenticated, integrity protected, and protected from replay attacks; it may use confidentiality protection to protect from eavesdropping (in case the MPA/DDP/RDMAP Stream traverses a public network).

IPsec is capable of providing the above security services for IP and TCP traffic.

ULP protocols may be able to provide part of the above security services. See [NFSv4CHANNEL] for additional information on a promising approach called "channel binding". From [NFSv4CHANNEL]:

"The concept of channel bindings allows applications to prove that the end-points of two secure channels at different network layers are the same by binding authentication at one channel to the session protection at the other channel. The use of channel bindings allows applications to delegate session protection to lower layers, which may significantly improve performance for some applications."

9.3 Using IPsec With MPA

IPsec can be used to protect against the packet injection attacks outlined above. Because IPsec is designed to secure individual IP packets, MPA can run above IPsec without change. IPsec packets are processed (e.g., integrity checked and decrypted) in the order they are received, and an MPA receiver will process the decrypted FPDUs contained in these packets in the same manner as FPDUs contained in unsecured IP packets.

MPA Implementations MUST implement IPsec as described in Section 9.4 below. The use of IPsec is up to ULPs and administrators.

9.4 Requirements for IPsec Encapsulation of MPA/DDP

The IP Storage working group has spent significant time and effort to define the normative IPsec requirements for IP Storage [RFC3723].
Portions of that specification are applicable to a wide variety of protocols, including the RDDP protocol suite. In order to not replicate this effort, an MPA on TCP implementation MUST follow the requirements defined in RFC3723 Section 2.3 and Section 5, including the associated normative references for those sections.

Additionally, since IPsec acceleration hardware may only be able to handle a limited number of active IKE Phase 2 SAs, Phase 2 delete messages MAY be sent for idle SAs, as a means of keeping the number of active Phase 2 SAs to a minimum. The receipt of an IKE Phase 2 delete message MUST NOT be interpreted as a reason for tearing down a DDP/RDMA Stream. Rather, it is preferable to leave the Stream up, and if additional traffic is sent on it, to bring up another IKE Phase 2 SA to protect it. This avoids the potential for continually bringing Streams up and down.

Note that there are serious security issues if IPsec is not implemented end-to-end. For example, if IPsec is implemented as a tunnel in the middle of the network, any hosts between the peer and the IPsec tunneling device can freely attack the unprotected Stream.

10 IANA Considerations

No IANA actions are required by this document.

If a well-known port is chosen as the mechanism to identify a DDP on MPA on TCP, the well-known port must be registered with IANA. Because the use of the port is DDP specific, registration of the port with IANA is left to DDP.

11 References

11.1 Normative References

[iSCSI] Satran, J., "Internet Small Computer Systems Interface (iSCSI)", RFC 3720, April 2004.

[RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191, November 1990.

[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP Selective Acknowledgment Options", RFC 2018, October 1996.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC3723] Aboba B., et al, "Securing Block Storage Protocols over IP", RFC 3723, April 2004.

[RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet Program Protocol Specification", RFC 793, September 1981.

[RDMASEC] Pinkerton J., Deleganes E., Bitan S., "DDP/RDMAP Security", draft-ietf-rddp-security-09.txt (work in progress), May 2006.

11.2 Informative References

[CRCTCP] Stone J., Partridge, C., "When the CRC and TCP checksum disagree", ACM Sigcomm, Sept. 2000.

[DAT-API] DAT Collaborative, "kDAPL (Kernel Direct Access Programming Library) and uDAPL (User Direct Access Programming Library)", http://www.datcollaborative.org.

[DDP] H. Shah et al., "Direct Data Placement over Reliable Transports", draft-ietf-rddp-ddp-06.txt (work in progress), May 2006.

[IT-API] The Open Group, "Interconnect Transport API (IT-API)" Version 2.1, http://www.opengroup.org.

[RFC2401] Atkinson, R., Kent, S., "Security Architecture for the Internet Protocol", RFC 2401, November 1998.

[RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC 896, January 1984.

[NagleDAck] Minshall G., Mogul, J., Saito, Y., Verghese, B., "Application performance pitfalls and TCP's Nagle algorithm", Workshop on Internet Server Performance, May 1999.

[NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to Secure Channels", Internet-Draft draft-ietf-nfsv4-channel-bindings-02.txt, July 2004.

[RDMAP] R. Recio et al., "RDMA Protocol Specification", draft-ietf-rddp-rdmap-06.txt, May 2006.

[RFC2960] R. Stewart et al., "Stream Control Transmission Protocol", RFC 2960, October 2000.

[RFC792] Postel, J., "Internet Control Message Protocol", RFC 792, September 1981.

[RFC1122] Braden, R.T., "Requirements for Internet hosts - communication layers", RFC 1122, October 1989.

[VERBS] J. Hilland et al., "RDMA Protocol Verbs Specification", draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf, April 2003, http://www.rdmaconsortium.org.

12 Appendix

This appendix is for information only and is NOT part of the standard.
The appendix covers three topics: Section 12.1 is an analysis of MPA on TCP and why it is useful to integrate MPA with TCP (with modifications to typical TCP implementations) to reduce overall system buffering and overhead. Section 12.2 covers some MPA receiver implementation notes. Section 12.3 covers methods of making MPA implementations interoperate with both IETF and RDMA Consortium versions of the protocols.

12.1 Analysis of MPA over TCP Operations

This appendix analyzes the impact of MPA on the TCP sender, receiver, and wire protocol.

One of MPA's high level goals is to provide enough information, when combined with the Direct Data Placement Protocol [DDP], to enable out-of-order placement of DDP payload into the final Upper Layer Protocol (ULP) buffer. Note that DDP separates the act of placing data into a ULP buffer from that of notifying the ULP that the ULP buffer is available for use. In DDP terminology, the former is defined as "Placement", and the latter is defined as "Delivery". MPA supports in-order Delivery of the data to the ULP, including support for Direct Data Placement in the final ULP buffer location when TCP segments arrive out-of-order. Effectively, the goal is to use the pre-posted ULP buffers as the TCP receive buffer, where the reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and DDP) is done in place, in the ULP buffer, with no data copies.

This Appendix walks through the advantages and disadvantages of the TCP sender modifications proposed by MPA:

1) that MPA prefers that the TCP sender do Header Alignment, where a TCP segment should begin with an MPA Framing Protocol Data Unit (FPDU) (if there is payload present).

2) that there be an integral number of FPDUs in a TCP segment (under conditions where the Path MTU is not changing).
This Appendix concludes that the scaling advantages of FPDU Alignment are strong, based primarily on fairly drastic TCP receive buffer reduction requirements and simplified receive handling. The analysis also shows that there is little effect to TCP wire behavior.

12.1.1 Assumptions

12.1.1.1 MPA is layered beneath DDP [DDP]

MPA is an adaptation layer between DDP and TCP. DDP requires preservation of DDP segment boundaries and a CRC32C digest covering the DDP header and data. MPA adds these features to the TCP stream so that DDP over TCP has the same basic properties as DDP over SCTP.

12.1.1.2 MPA preserves DDP message framing

MPA was designed as a framing layer specifically for DDP and was not intended as a general-purpose framing layer for any other ULP using TCP.

A framing layer allows ULPs using it to receive indications from the transport layer only when complete ULPDUs are present. As a framing layer, MPA is not aware of the content of the DDP PDU, only that it has received and, if necessary, reassembled a complete PDU for Delivery to the DDP.

12.1.1.3 The size of the ULPDU passed to MPA is less than EMSS under normal conditions

To make reception of a complete DDP PDU on every received segment possible, DDP passes to MPA a PDU that is no larger than the EMSS of the underlying fabric. Each FPDU that MPA creates contains sufficient information for the receiver to directly place the ULP payload in the correct location in the correct receive buffer.

Edge cases when this condition does not occur are dealt with, but do not need to be on the fast path.

12.1.1.4 Out-of-order placement but NO out-of-order Delivery

DDP receives complete DDP PDUs from MPA. Each DDP PDU contains the information necessary to place its ULP payload directly in the correct location in host memory.
Because each DDP segment is self-describing, it is possible for DDP segments received out of order to have their ULP payload placed immediately in the ULP receive buffer.

Data delivery to the ULP is guaranteed to be in the order the data was sent. DDP only indicates data delivery to the ULP after TCP has acknowledged the complete byte stream.

12.1.2 The Value of FPDU Alignment

Significant receiver optimizations can be achieved when Header Alignment and complete FPDUs are the common case. The optimizations allow utilizing significantly fewer buffers on the receiver and less computation per FPDU. The net effect is the ability to build a "flow-through" receiver that enables TCP-based solutions to scale to 10G and beyond in an economical way. The optimizations are especially relevant to hardware implementations of receivers that process multiple protocol layers - Data Link Layer (e.g., Ethernet), Network and Transport Layer (e.g., TCP/IP), and even some ULP on top of TCP (e.g., MPA/DDP). As network speed increases, there is an increasing desire to use a hardware based receiver in order to achieve an efficient high performance solution.

A TCP receiver, under worst case conditions, has to allocate buffers (BufferSizeTCP) whose capacities are a function of the bandwidth-delay product. Thus:

   BufferSizeTCP = K * bandwidth [octets/Second] * Delay [Seconds].

Where bandwidth is the end-to-end bandwidth of the connection, delay is the round trip delay of the connection, and K is an implementation dependent constant.

Thus BufferSizeTCP scales with the end-to-end bandwidth (10x more buffers for a 10x increase in end-to-end bandwidth). As this buffering approach may scale poorly for hardware or software implementations alike, several approaches allow reduction in the amount of buffering required for high-speed TCP communication.

The MPA/DDP approach is to enable the ULP's buffer to be used as the TCP receive buffer.
If the application pre-posts a sufficient amount of buffering, and each TCP segment has sufficient information to place the payload into the right application buffer, when an out-of-order TCP segment arrives it could potentially be placed directly in the ULP buffer. However, placement can only be done when a complete FPDU with the placement information is available to the receiver, and the FPDU contents contain enough information to place the data into the correct ULP buffer (e.g., there is a DDP header available).

For the case when the FPDU is not aligned with the TCP segment, it may take, on average, 2 TCP segments to assemble one FPDU. Therefore, the receiver has to allocate BufferSizeNAF (Buffer Size, Non-Aligned FPDU) octets:

   BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS

Where K1 and K2 are implementation dependent constants and EMSS is the effective maximum segment size.

For example, a 1 Gbps link with 10,000 connections and an EMSS of 1500B would require 15 MB of memory (with K1 = 1 and the K2 term neglected). Often the number of connections used scales with the network speed, aggravating the situation for higher speeds.

FPDU Alignment would allow the receiver to allocate BufferSizeAF (Buffer Size, Aligned FPDU) octets:

   BufferSizeAF = K2 * EMSS

for the same conditions. A FPDU Aligned receiver may require memory in the range of ~100s of KB - which is feasible for an on-chip memory and enables a "flow-through" design, in which the data flows through the NIC and is placed directly in the destination buffer. Assuming most of the connections support FPDU Alignment, the receiver buffers no longer scale with number of connections.

Additional optimizations can be achieved in a balanced I/O sub-system -- where the system interface of the network controller provides ample bandwidth as compared with the network bandwidth.
For almost twenty years this has been the case and the trend is expected to continue - while Ethernet speeds have scaled by 1000 (from 10 megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to PCI-X DDR). Under these conditions, the FPDU Alignment approach allows BufferSizeAF to be indifferent to network speed. It is primarily a function of the local processing time for a given frame. Thus when the FPDU Alignment approach is used, receive buffering is expected to scale gracefully (i.e. less than linear scaling) as network speed is increased.

12.1.2.1 Impact of lack of FPDU Alignment on the receiver computational load and complexity

The receiver must perform IP and TCP processing, and then perform FPDU CRC checks, before it can trust the FPDU header placement information. For simplicity of the description, the assumption is that a FPDU is carried in no more than 2 TCP segments. In reality, with no FPDU Alignment, an FPDU can be carried by more than 2 TCP segments (e.g., if the PMTU was reduced).

----++-----------------------------++-----------------------++-----
+---||---------------+    +--------||--------+   +----------||----+
|   TCP Seg X-1      |    |     TCP Seg X    |   |  TCP Seg X+1   |
+---||---------------+    +--------||--------+   +----------||----+
----++-----------------------------++-----------------------++-----
                FPDU #N-1                   FPDU #N

Figure 12: Non-aligned FPDU freely placed in TCP octet stream

The receiver algorithm for processing TCP segments (e.g., TCP segment #X in Figure 12) carrying non-aligned FPDUs (in-order or out-of-order) includes:

1. Data Link Layer processing (whole frame) - typically including a CRC calculation.

2. Network Layer processing (assuming not an IP fragment, the whole Data Link Layer frame contains one IP datagram. IP fragments should be reassembled in a local buffer. This is not a performance optimization goal)

3. Transport Layer processing -- TCP protocol processing, header and checksum checks.

   a. Classify incoming TCP segment using the 5 tuple (IP SRC, IP DST, TCP SRC Port, TCP DST Port, protocol)

4. Find FPDU message boundaries.

   a. Get MPA state information for the connection

      If the TCP segment is in-order, use the receiver managed MPA state information to calculate where the previous FPDU message (#N-1) ends in the current TCP segment X. (Previously, when the MPA receiver processed the first part of FPDU #N-1, it calculated the number of bytes remaining to complete FPDU #N-1 by using the MPA Length field.)

         Get the stored partial CRC for FPDU #N-1
         Complete CRC calculation for FPDU #N-1 data (first portion of TCP segment #X)
         Check CRC calculation for FPDU #N-1
         If no FPDU CRC errors, placement is allowed
         Locate the local buffer for the first portion of FPDU #N-1, CopyData(local buffer of first portion of FPDU #N-1, host buffer address, length)
         Compute host buffer address for second portion of FPDU #N-1
         CopyData(local buffer of second portion of FPDU #N-1, host buffer address for second portion, length)
         Calculate the octet offset into the TCP segment for the next FPDU #N
         Start calculation of CRC for available data for FPDU #N
         Store partial CRC results for FPDU #N
         Store local buffer address of first portion of FPDU #N
         No further action is possible on FPDU #N, before it is completely received

      If the TCP segment is out-of-order, the receiver must buffer the data until at least one complete FPDU is received. Typically buffering for more than one TCP segment per connection is required. Use the MPA based Markers to calculate where FPDU boundaries are.

         When a complete FPDU is available, a similar procedure to the in-order algorithm above is used. There is additional complexity, though, because when the missing segment arrives, this TCP segment must be run through the CRC engine after the CRC is calculated for the missing segment.
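
The in-order branch of the algorithm above, with its stored partial CRC and local buffering, can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from this appendix: the FPDU is simplified to a 2-octet length, the payload, padding to a 4-octet boundary and a 4-octet CRC; Markers and out-of-order segments are omitted; the byte order used for the CRC field is chosen for convenience only; and the names NonAlignedReceiver, build_fpdu and crc32c_update are hypothetical. The bitwise CRC32C is written for clarity, not speed.

```python
import struct


def crc32c_update(reg: int, data: bytes) -> int:
    """Feed octets into a running CRC32C register (reflected poly 0x82F63B78)."""
    for octet in data:
        reg ^= octet
        for _ in range(8):
            reg = (reg >> 1) ^ (0x82F63B78 if reg & 1 else 0)
    return reg


def crc32c(data: bytes) -> int:
    return crc32c_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF


def build_fpdu(ulpdu: bytes) -> bytes:
    """Simplified FPDU: length, payload, pad to 4 octets, CRC (no Markers)."""
    covered = struct.pack(">H", len(ulpdu)) + ulpdu
    covered += b"\x00" * (-len(covered) % 4)
    return covered + struct.pack(">I", crc32c(covered))  # byte order assumed


class NonAlignedReceiver:
    """Per-connection state kept between in-order TCP segments."""

    def __init__(self) -> None:
        self.local = bytearray()        # local buffer for the partial FPDU
        self.partial_crc = 0xFFFFFFFF   # stored partial CRC register
        self.crc_done = 0               # octets of self.local already in the CRC
        self.placed = []                # stands in for copies to host buffers

    def on_segment(self, payload: bytes) -> None:
        self.local += payload
        while len(self.local) >= 2:
            (ulpdu_len,) = struct.unpack_from(">H", self.local)
            covered = 2 + ulpdu_len + (-(2 + ulpdu_len) % 4)
            upto = min(len(self.local), covered)
            # Continue the stored partial CRC over the newly arrived octets.
            self.partial_crc = crc32c_update(
                self.partial_crc, self.local[self.crc_done:upto])
            self.crc_done = upto
            if len(self.local) < covered + 4:
                return  # no further action until the FPDU is complete
            (wire_crc,) = struct.unpack_from(">I", self.local, covered)
            if (self.partial_crc ^ 0xFFFFFFFF) != wire_crc:
                raise ValueError("FPDU CRC mismatch")
            self.placed.append(bytes(self.local[2:2 + ulpdu_len]))
            del self.local[:covered + 4]  # the next FPDU starts here
            self.partial_crc, self.crc_done = 0xFFFFFFFF, 0
```

Feeding the same octet stream in segments of any size yields the same placed payloads, but every connection carries a local buffer and a partial CRC between segments, which is the per-connection state that the aligned case avoids.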
If we assume FPDU Alignment, the following diagram and the algorithm below apply. Note that when using MPA, the receiver is assumed to actively detect presence or loss of FPDU Alignment for every TCP segment received.

   +--------------------------+      +--------------------------+
+--|--------------------------+   +--|--------------------------+
|  |       TCP Seg X          |   |  |         TCP Seg X+1      |
+--|--------------------------+   +--|--------------------------+
   +--------------------------+      +--------------------------+
             FPDU #N                          FPDU #N+1

Figure 13: Aligned FPDU placed immediately after TCP header

The receiver algorithm for FPDU Aligned frames (in-order or out-of-order) includes:

1) Data Link Layer processing (whole frame) - typically including a CRC calculation.

2) Network Layer processing (assuming not an IP fragment, the whole Data Link Layer frame contains one IP datagram. IP fragments should be reassembled in a local buffer. This is not a performance optimization goal)

3) Transport Layer processing -- TCP protocol processing, header and checksum checks.

   a. Classify incoming TCP segment using the 5 tuple (IP SRC, IP DST, TCP SRC Port, TCP DST Port, protocol)

4) Check for Header Alignment. (Described in detail in Section 6). Assuming Header Alignment for the rest of the algorithm below.

   a. If the header is not aligned, see the algorithm defined in the prior section.

5) Whether the TCP segment is in-order or out-of-order, the MPA header is at the beginning of the current TCP payload. Get the FPDU length from the FPDU header.

6) Calculate CRC over FPDU

7) Check CRC calculation for FPDU #N

8) If no FPDU CRC errors, placement is allowed

9) CopyData(TCP segment #X, host buffer address, length)

10) Loop to #5 until all the FPDUs in the TCP segment are consumed in order to handle FPDU packing.

Implementation note: In both cases the receiver has to classify the incoming TCP segment and associate it with one of the flows it maintains.
In the case of no FPDU Alignment, the receiver is forced to classify incoming traffic before it can calculate the FPDU CRC. In the case of FPDU Alignment the operations order is left to the implementer.

The FPDU Aligned receiver algorithm is significantly simpler. There is no need to locally buffer portions of FPDUs. Accessing state information is also substantially simplified - the normal case does not require retrieving information to find out where a FPDU starts and ends or retrieval of a partial CRC before the CRC calculation can commence. This avoids adding internal latencies, having multiple data passes through the CRC machine, or scheduling multiple commands for moving the data to the host buffer.

The aligned FPDU approach is useful for in-order and out-of-order reception. The receiver can use the same mechanisms for data storage in both cases, and only needs to account for when all the TCP segments have arrived to enable Delivery. The Header Alignment, along with the high probability that at least one complete FPDU is found with every TCP segment, allows the receiver to perform data placement for out-of-order TCP segments with no need for intermediate buffering. Essentially the TCP receive buffer has been eliminated and TCP reassembly is done in place within the ULP buffer.

In case FPDU Alignment is not found, the receiver should follow the algorithm for non-aligned FPDU reception which may be slower and less efficient.

12.1.2.2 FPDU Alignment effects on TCP wire protocol

In an optimized MPA/TCP implementation, TCP exposes its EMSS to MPA. MPA uses the EMSS to calculate its MULPDU, which it then exposes to DDP, its ULP. DDP uses the MULPDU to segment its payload so that each FPDU sent by MPA fits completely into one TCP segment. This has no impact on wire protocol and exposing this information is already supported on many TCP implementations, including all modern flavors of BSD networking, through the TCP_MAXSEG socket option.
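
As an illustration of the EMSS to MULPDU step, the sketch below derives a conservative ULPDU size bound from an EMSS value. It is not the normative sizing rule of this specification; it simply assumes the per-FPDU overheads of a 2-octet ULPDU Length, padding to a 4-octet boundary, a 4-octet CRC and, when Markers are enabled, at most one 4-octet Marker per 512 octets of segment. The function names are hypothetical. Reading the segment size through TCP_MAXSEG uses the standard Python socket module, and the value returned is platform dependent.

```python
import math
import socket


def emss_hint(sock: socket.socket) -> int:
    """Segment size that TCP reports for this socket (platform dependent)."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)


def conservative_mulpdu(emss: int, markers: bool = True) -> int:
    """Largest ULPDU whose FPDU, plus worst-case Markers, fits in emss octets."""
    marker_octets = 4 * math.ceil(emss / 512) if markers else 0
    room = emss - 4 - marker_octets  # minus CRC and worst-case Markers
    room -= room % 4                 # length field + ULPDU + pad is 4-aligned
    return room - 2                  # minus the 2-octet ULPDU Length


def fpdu_octets(ulpdu_len: int) -> int:
    """Octets of one FPDU (without Markers) carrying ulpdu_len payload octets."""
    covered = 2 + ulpdu_len
    return covered + (-covered % 4) + 4
```

For example, with an EMSS of 1460 octets and Markers enabled, the bound is 1442 octets: a 1448-octet FPDU plus three Markers fills the segment exactly.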
In the common case, the ULP (i.e., DDP over MPA) messages provided to the TCP layer are segmented to MULPDU size. It is assumed that the ULP message size is bounded by MULPDU, such that a single ULP message can be encapsulated in a single TCP segment. Therefore, in the common case, there is no increase in the number of TCP segments emitted. For smaller ULP messages, the sender can also apply packing, i.e., the sender packs as many complete FPDUs as possible into one TCP segment. The requirement to always have a complete FPDU may increase the number of TCP segments emitted. Typically, a ULP message size varies from a few bytes to multiple EMSS (e.g., 64 Kbytes). In some cases the ULP may post more than one message at a time for transmission, giving the sender an opportunity for packing. When more than one FPDU is available for transmission, the FPDUs are encapsulated into a TCP segment, and there is no room in that TCP segment for the next complete FPDU, another TCP segment is sent. In this corner case some of the TCP segments are not full size. In the worst case scenario, the ULP may choose an FPDU size that is EMSS/2 + 1 and have multiple messages available for transmission. For this poor choice of FPDU size, the average TCP segment size is about 1/2 of the EMSS and the number of TCP segments emitted approaches 2x of what is possible without the requirement to encapsulate an integer number of complete FPDUs in every TCP segment. This is a dynamic situation that lasts only while the sender ULP has multiple non-optimal messages for transmission, and it causes a minor impact on the wire utilization.

However, it is not expected that requiring FPDU Alignment will have a measurable impact on wire behavior of most applications. Throughput applications with large I/Os are expected to take full advantage of the EMSS.
Another class of applications with many small outstanding buffers (as compared to EMSS) is expected to use packing when applicable. Transaction oriented applications are also optimal.

TCP retransmission is another area that can affect sender behavior. TCP supports retransmission of the exact, originally transmitted segment (see [RFC793] section 2.6, [RFC793] section 3.7 "managing the window" and [RFC1122] section 22.214.171.124). In the unlikely event that part of the original segment has been received and acknowledged by the remote peer (e.g., a re-segmenting middle box, as documented in Section 6.1, Re-segmenting Middle boxes and non optimized MPA/TCP senders), a better available bandwidth utilization may be possible by re-transmitting only the missing octets. If an optimized MPA/TCP retransmits complete FPDUs, there may be some marginal bandwidth loss.

Another area where a change in the number of TCP segments may have impact is that of Slow Start and Congestion Avoidance. Slow-start exponential increase is measured in segments per second, as the algorithm focuses on the overhead per segment at the source for congestion that eventually results in dropped segments. Slow-start exponential bandwidth growth for optimized MPA/TCP is similar to any TCP implementation. Congestion Avoidance allows for a linear growth in available bandwidth when recovering after a packet drop. Similar to the analysis for slow-start, optimized MPA/TCP doesn't change the behavior of the algorithm. Therefore the average size of the segment versus EMSS is not a major factor in the assessment of the bandwidth growth for a sender. Both Slow Start and Congestion Avoidance for an optimized MPA/TCP will behave similarly to any TCP sender and allow an optimized MPA/TCP to enjoy the theoretical performance limits of the algorithms.
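The packing arithmetic discussed earlier in this section, including the EMSS/2 + 1 worst case, can be checked with a small sketch. It compares the number of TCP segments needed when every segment must carry an integer number of complete FPDUs against a plain byte stream with no such requirement. The function names are hypothetical, and equal-sized FPDUs are assumed for simplicity.

```python
def aligned_segments(fpdu_size: int, count: int, emss: int) -> int:
    """Segments needed when each segment holds only complete FPDUs."""
    if fpdu_size > emss:
        raise ValueError("an FPDU must fit in one TCP segment")
    per_segment = emss // fpdu_size        # packing: FPDUs per segment
    return -(-count // per_segment)        # ceiling division


def stream_segments(fpdu_size: int, count: int, emss: int) -> int:
    """Segments needed if FPDUs could be split freely across segments."""
    return -(-(fpdu_size * count) // emss)


emss = 1460
worst = emss // 2 + 1                      # 731 octets: only one fits per segment
print(aligned_segments(worst, 100, emss), stream_segments(worst, 100, emss))
```

With an EMSS of 1460 octets and 100 FPDUs of 731 octets each, the aligned sender emits 100 segments where a byte stream would need 51, which is the approach to 2x described above. With 365-octet FPDUs, four pack exactly into each segment and both senders emit 25 segments.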
In summary, the ULP messages generated at the sender (e.g., the number of messages grouped for every transmission request) and the message size distribution have the most significant impact on the number of TCP segments emitted. The worst case effect for certain ULPs (with average message size of EMSS/2+1 to EMSS) is bounded by an increase of up to 2x in the number of TCP segments and acknowledgments. In reality the effect is expected to be marginal.

12.2 Receiver implementation

Transport & Network Layer Reassembly Buffers:

The use of reassembly buffers (either TCP reassembly buffers or IP fragmentation reassembly buffers) is implementation dependent. When MPA is enabled, reassembly buffers are needed if out of order packets arrive and Markers are not enabled. Buffers are also needed if FPDU Alignment is lost or if IP fragmentation occurs. This is because the incoming out of order segment may not contain enough information for MPA to process all of the FPDU. For cases where a re-segmenting middle box is present, or where the TCP sender is not optimized, the presence of Markers significantly reduces the amount of buffering needed.

Recovery from IP Fragmentation must be transparent to the MPA Consumers.

12.2.1 Network Layer Reassembly Buffers

Most IP implementations set the IP Don't Fragment bit. Thus upon a path MTU change, intermediate devices drop the IP datagram if it is too large and reply with an ICMP message which tells the source TCP that the path MTU has changed. This causes TCP to emit segments conformant with the new path MTU size. Thus IP fragments under most conditions should never occur at the receiver. But it is possible.

There are several options for implementation of network layer reassembly buffers:

1. drop any IP fragments, and reply with an ICMP message according to [RFC792] (fragmentation needed and DF set) to tell the Remote Peer to resize its TCP segment

2. support an IP reassembly buffer, but have it of limited size (possibly the same size as the local link's MTU). The end Node would normally never advertise a path MTU larger than the local link MTU. It is recommended that a dropped IP fragment cause an ICMP message to be generated according to [RFC792].

3. multiple IP reassembly buffers, of effectively unlimited size.

4. support an IP reassembly buffer for the largest IP datagram (64 KB).

5. support for a large IP reassembly buffer which could span multiple IP datagrams.

An implementation should support at least 2 or 3 above, to avoid dropping packets that have traversed the entire fabric.

There is no end-to-end ACK for IP reassembly buffers, so there is no flow control on the buffer. The only end-to-end ACK is a TCP ACK, which can only occur when a complete IP datagram is delivered to TCP. Because of this, under worst case, pathological scenarios, the largest IP reassembly buffer is the TCP receive window (to buffer multiple IP datagrams that have all been fragmented).

Note that if the Remote Peer does not implement re-segmentation of the data stream upon receiving the ICMP reply updating the path MTU, it is possible to halt forward progress because the opposite peer would continue to retransmit using a transport segment size that is too large. This deadlock scenario is no different than if the fabric MTU (not last hop MTU) was reduced after connection setup, and the remote Node's behavior is not compliant with [RFC1122].

12.2.2 TCP Reassembly buffers

A TCP reassembly buffer is also needed. TCP reassembly buffers are needed if FPDU Alignment is lost when using TCP with MPA or when the MPA FPDU spans multiple TCP segments. Buffers are also needed if Markers are disabled and out of order packets arrive.
Since lost FPDU Alignment often means that FPDUs are incomplete, an MPA on TCP implementation must have a reassembly buffer large enough to recover an FPDU that is less than or equal to the MTU of the locally attached link (this should be the largest possible advertised TCP path MTU). If the MTU is smaller than 140 octets, the buffer MUST be at least 140 octets long to support the minimum FPDU size. The 140 octets allows for the minimum MULPDU of 128, 2 octets of pad, 2 of ULPDU_Length, 4 of CRC, and space for a possible Marker. As usual, additional buffering may provide better performance.

Note that if the TCP segment were not stored, it is possible to deadlock the MPA algorithm. If the path MTU is reduced, FPDU Alignment requires the source TCP to re-segment the data stream to the new path MTU. The source MPA will detect this condition and reduce the MPA segment size, but any FPDUs already posted to the source TCP will be re-segmented and lose FPDU Alignment. If the destination does not support a TCP reassembly buffer, these segments can never be successfully transmitted and the protocol deadlocks.

When a complete FPDU is received, processing continues normally.

12.3 IETF Implementation Interoperability with RDMA Consortium Protocols

The RDMA Consortium created early specifications of the MPA/DDP/RDMA protocols and some manufacturers created implementations of those protocols before the IETF versions were finalized. These protocols are very similar to the IETF versions, making it possible for implementations to be created or modified to support either set of specifications.

For those interested, the RDMA Consortium protocol documents can be obtained at http://www.rdmaconsortium.org.

In this section, implementations of MPA/DDP/RDMA that conform to the RDMAC specifications are called RDMAC RNICs. Implementations of MPA/DDP/RDMA that conform to the IETF RFCs are called IETF RNICs.
Without the exchange of MPA Request/Reply Frames, there is no standard mechanism for enabling RDMAC RNICs to interoperate with IETF RNICs. Even if a ULP uses a well-known port to start an IETF RNIC immediately in RDMA mode (i.e., without exchanging the MPA Request/Reply messages), there is no reason to believe an IETF RNIC will interoperate with an RDMAC RNIC because of the differences in the version number in the DDP and RDMAP headers on the wire.

Therefore, the ULP or other supporting entity at the RDMAC RNIC must implement MPA Request/Reply Frames on behalf of the RNIC in order to negotiate the connection parameters. The following section describes the results following the exchange of the MPA Request/Reply Frames before the conversion from streaming to RDMA mode.

12.3.1 Negotiated Parameters

Three types of RNICs are considered:

Upgraded RDMAC RNIC - an RNIC implementing the RDMAC protocols which has a ULP or other supporting entity that exchanges the MPA Request/Reply Frames in streaming mode before the conversion to RDMA mode.

Non-permissive IETF RNIC - an RNIC implementing the IETF protocols which is not capable of implementing the RDMAC protocols. Such an RNIC can only interoperate with other IETF RNICs.

Permissive IETF RNIC - an RNIC implementing the IETF protocols which is capable of implementing the RDMAC protocols on a per connection basis.

The Permissive IETF RNIC is recommended for those implementers that want maximum interoperability with other RNIC implementations.

The values used by these three RNIC types for the MPA, DDP, and RDMAP versions as well as MPA Markers and CRC are summarized in Figure 14.
+----------------++-----------+-----------+-----------+-----------+
| RNIC TYPE      || DDP/RDMAP |    MPA    |    MPA    |    MPA    |
|                ||  Version  | Revision  |  Markers  |    CRC    |
+----------------++-----------+-----------+-----------+-----------+
+----------------++-----------+-----------+-----------+-----------+
| RDMAC          ||     0     |     0     |     1     |     1     |
|                ||           |           |           |           |
+----------------++-----------+-----------+-----------+-----------+
| IETF           ||     1     |     1     |  0 or 1   |  0 or 1   |
| Non-permissive ||           |           |           |           |
+----------------++-----------+-----------+-----------+-----------+
| IETF           ||  1 or 0   |  1 or 0   |  0 or 1   |  0 or 1   |
| permissive     ||           |           |           |           |
+----------------++-----------+-----------+-----------+-----------+

Figure 14: Connection Parameters for the RNIC Types. For MPA Markers and MPA CRC, enabled=1, disabled=0.

It is assumed there is no mixing of versions allowed between MPA, DDP and RDMAP. The RNIC either generates the RDMAC protocols on the wire (version is zero) or the IETF protocols (version is one).

During the exchange of the MPA Request/Reply Frames, each peer provides its MPA Revision, Marker preference (M: 0=disabled, 1=enabled), and CRC preference. The MPA Revision provided in the MPA Request Frame and the MPA Reply Frame may differ.

From the information in the MPA Request/Reply Frames, each side sets the Version field (V: 0=RDMAC, 1=IETF) of the DDP/RDMAP protocols as well as the state of the Markers for each half connection. Between DDP and RDMAP, no mixing of versions is allowed. Moreover, the DDP and RDMAP version MUST be identical in the two directions. The RNIC either generates the RDMAC protocols on the wire (version is zero) or the IETF protocols (version is one).

In the following sections, the figures do not discuss CRC negotiation because there is no interoperability issue for CRCs. Since the RDMAC RNIC will always request CRC use, then, according to the IETF MPA specification, both peers MUST generate and check CRCs.
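The parameter rules above, together with the outcomes tabulated in the negotiation figures of the subsections that follow, can be summarized as a small decision function. This is a sketch for illustration only: the Rnic enumeration, the negotiate function and its return convention are hypothetical, and CRC is omitted because, as noted above, it raises no interoperability issue.

```python
from enum import Enum


class Rnic(Enum):
    RDMAC = "RDMAC"
    IETF_NON_PERMISSIVE = "IETF Non-permissive"
    IETF_PERMISSIVE = "IETF Permissive"


def negotiate(initiator, init_m, responder, resp_m):
    """Return None if the connection is closed, else a tuple
    (V, initiator_rx_markers, responder_rx_markers).

    init_m and resp_m are each peer's Marker preference (0 or 1).
    """
    for kind, m in ((initiator, init_m), (responder, resp_m)):
        if m not in (0, 1):
            raise ValueError("Marker preference must be 0 or 1")
        if kind is Rnic.RDMAC and m != 1:
            raise ValueError("an RDMAC RNIC always uses Markers")
    peers = (initiator, responder)
    if Rnic.RDMAC in peers:
        if Rnic.IETF_NON_PERMISSIVE in peers:
            return None               # incompatible Rev field: graceful close
        return (0, 1, 1)              # RDMAC protocols, Markers both ways
    # Two IETF RNICs: version one, Markers per half connection.
    return (1, init_m, resp_m)
```

For example, a Permissive IETF Initiator that prefers no Markers still ends up with V=0 and Markers enabled in both directions when its peer is an RDMAC RNIC, while the same Initiator talking to another IETF RNIC keeps its own preference for its receive direction.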
12.3.2 RDMAC RNIC and Non-permissive IETF RNIC

Figure 15 shows that a Non-permissive IETF RNIC cannot interoperate with an RDMAC RNIC, despite the fact that both peers exchange MPA Request/Reply Frames. For a Non-permissive IETF RNIC, the MPA negotiation has no effect on the DDP/RDMAP version and it is unable to interoperate with the RDMAC RNIC.

The rows in the figure show the state of the Marker field in the MPA Request Frame sent by the MPA Initiator. The columns show the state of the Marker field in the MPA Reply Frame sent by the MPA Responder. Each type of RNIC is shown as an Initiator and a Responder. The connection results are shown in the lower right corner, at the intersection of the different RNIC types, where V=0 is the RDMAC DDP/RDMAP version, V=1 is the IETF DDP/RDMAP version, M=0 means MPA Markers are disabled and M=1 means MPA Markers are enabled. The negotiated Marker state is shown as X/Y, for the receive direction of the Initiator/Responder.

+---------------------------++-----------------------+
|   MPA                     ||          MPA          |
| CONNECT                   ||       Responder       |
|   MODE  +-----------------++-------+---------------+
|         |   RNIC          || RDMAC |     IETF      |
|         |   TYPE          ||       | Non-permissive|
|         |          +------++-------+-------+-------+
|         |          |MARKER|| M=1   | M=0   | M=1   |
+---------+----------+------++-------+-------+-------+
+---------+----------+------++-------+-------+-------+
|         |   RDMAC  | M=1  || V=0   | close | close |
|         |          |      || M=1/1 |       |       |
|         +----------+------++-------+-------+-------+
|   MPA   |          | M=0  || close | V=1   | V=1   |
|Initiator|   IETF   |      ||       | M=0/0 | M=0/1 |
|         |Non-perms.+------++-------+-------+-------+
|         |          | M=1  || close | V=1   | V=1   |
|         |          |      ||       | M=1/0 | M=1/1 |
+---------+----------+------++-------+-------+-------+

Figure 15: MPA negotiation between an RDMAC RNIC and a Non-permissive IETF RNIC.

12.3.2.1 RDMAC RNIC Initiator

If the RDMAC RNIC is the MPA Initiator, its ULP sends an MPA Request Frame with Rev field set to zero and the M and C bits set to one.
Because the Non-permissive IETF RNIC cannot dynamically downgrade the version number it uses for DDP and RDMAP, it would send an MPA Reply Frame with the Rev field equal to one and then gracefully close the connection.

12.3.2.2 Non-Permissive IETF RNIC Initiator

If the Non-permissive IETF RNIC is the MPA Initiator, it sends an MPA Request Frame with Rev field equal to one. The ULP or supporting entity for the RDMAC RNIC responds with an MPA Reply Frame that has the Rev field equal to zero and the M bit set to one. The Non-permissive IETF RNIC will gracefully close the connection after it reads the incompatible Rev field in the MPA Reply Frame.

12.3.3 RDMAC RNIC and Permissive IETF RNIC

Figure 16 shows that a Permissive IETF RNIC can interoperate with an RDMAC RNIC regardless of its Marker preference. The figure uses the same format as shown with the Non-permissive IETF RNIC.

+---------------------------++-----------------------+
|   MPA                     ||          MPA          |
| CONNECT                   ||       Responder       |
|   MODE  +-----------------++-------+---------------+
|         |   RNIC          || RDMAC |     IETF      |
|         |   TYPE          ||       |  Permissive   |
|         |          +------++-------+-------+-------+
|         |          |MARKER|| M=1   | M=0   | M=1   |
+---------+----------+------++-------+-------+-------+
+---------+----------+------++-------+-------+-------+
|         |   RDMAC  | M=1  || V=0   | N/A   | V=0   |
|         |          |      || M=1/1 |       | M=1/1 |
|         +----------+------++-------+-------+-------+
|   MPA   |          | M=0  || V=0   | V=1   | V=1   |
|Initiator|   IETF   |      || M=1/1 | M=0/0 | M=0/1 |
|         |Permissive+------++-------+-------+-------+
|         |          | M=1  || V=0   | V=1   | V=1   |
|         |          |      || M=1/1 | M=1/0 | M=1/1 |
+---------+----------+------++-------+-------+-------+

Figure 16: MPA negotiation between an RDMAC RNIC and a Permissive IETF RNIC.

A truly Permissive IETF RNIC will recognize an RDMAC RNIC from the Rev field of the MPA Req/Rep Frames and then adjust its receive Marker state and DDP/RDMAP version to accommodate the RDMAC RNIC.
As a result, as an MPA Responder, the Permissive IETF RNIC will never return an MPA Reply Frame with the M bit set to zero. This case is shown as not applicable (N/A) in Figure 16.

12.3.3.1 RDMAC RNIC Initiator

When the RDMAC RNIC is the MPA Initiator, its ULP or other supporting entity prepares an MPA Request message and sets the revision to zero and the M bit and C bit to one.

The Permissive IETF Responder receives the MPA Request message and checks the revision field. Since it is capable of generating RDMAC DDP/RDMAP headers, it sends an MPA Reply message with revision set to zero and the M and C bits set to one. The Responder must inform its ULP that it is generating version zero DDP/RDMAP messages.

12.3.3.2 Permissive IETF RNIC Initiator

If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA Request Frame setting the Rev field to one. Regardless of the value of the M bit in the MPA Request Frame, the ULP or other supporting entity for the RDMAC RNIC will create an MPA Reply Frame with Rev equal to zero and the M bit set to one.

When the Initiator reads the Rev field of the MPA Reply Frame and finds that its peer is an RDMAC RNIC, it must inform its ULP that it should generate version zero DDP/RDMAP messages and enable MPA Markers and CRC.

12.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC

For completeness, Figure 17 shows the results of MPA negotiation between a Non-permissive IETF RNIC and a Permissive IETF RNIC. The important point from this figure is that an IETF RNIC cannot detect whether its peer is a Permissive or Non-permissive RNIC.
+---------------------------++-------------------------------+
|   MPA                     ||              MPA              |
| CONNECT                   ||           Responder           |
|   MODE  +-----------------++---------------+---------------+
|         |   RNIC          ||     IETF      |     IETF      |
|         |   TYPE          || Non-permissive|  Permissive   |
|         |          +------++-------+-------+-------+-------+
|         |          |MARKER|| M=0   | M=1   | M=0   | M=1   |
+---------+----------+------++-------+-------+-------+-------+
+---------+----------+------++-------+-------+-------+-------+
|         |          | M=0  || V=1   | V=1   | V=1   | V=1   |
|         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
|         |Non-perms.+------++-------+-------+-------+-------+
|         |          | M=1  || V=1   | V=1   | V=1   | V=1   |
|         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
|   MPA   +----------+------++-------+-------+-------+-------+
|Initiator|          | M=0  || V=1   | V=1   | V=1   | V=1   |
|         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
|         |Permissive+------++-------+-------+-------+-------+
|         |          | M=1  || V=1   | V=1   | V=1   | V=1   |
|         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
+---------+----------+------++-------+-------+-------+-------+

Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a Permissive IETF RNIC.

13 Author's Addresses

Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810 USA
Phone: +1 978 689 1614
Email: firstname.lastname@example.org

Paul R. Culley
Hewlett-Packard Company
20555 SH 249
Houston, Tx.
USA 77070-2698
Phone: 281-514-5543
Email: email@example.com

Uri Elzur
Broadcom
16215 Alton Parkway
CA, 92618
Phone: 949.585.6432
Email: firstname.lastname@example.org

Renato J Recio
IBM
Internal Zip 9043
11400 Burnett Road
Austin, Texas 78759
Phone: 512-838-3685
Email: email@example.com

John Carrier
Cray Inc.
411 First Avenue S, Suite 600
Seattle, WA 98104-2860
Phone: 206-701-2090
Email: firstname.lastname@example.org

14 Acknowledgments

Dwight Barron
Hewlett-Packard Company
20555 SH 249
Houston, Tx.
USA 77070-2698
Phone: 281-514-2769
Email: email@example.com

Jeff Chase
Department of Computer Science
Duke University
Durham, NC 27708-0129 USA
Phone: +1 919 660 6559
Email: firstname.lastname@example.org

Ted Compton
EMC Corporation
Research Triangle Park, NC 27709, USA
Phone: 919-248-6075
Email: email@example.com

Dave Garcia
Hewlett-Packard Company
19333 Vallco Parkway
Cupertino, Ca.
USA 95014
Phone: 408.285.6116
Email: firstname.lastname@example.org

Hari Ghadia
Adaptec, Inc.
691 S. Milpitas Blvd.,
Milpitas, CA 95035 USA
Phone: +1 (408) 957-5608
Email: email@example.com

Howard C. Herbert
Intel Corporation
MS CH7-404
5000 West Chandler Blvd.
Chandler, Arizona 85226
Phone: 480-554-3116
Email: firstname.lastname@example.org

Jeff Hilland
Hewlett-Packard Company
20555 SH 249
Houston, Tx.
USA 77070-2698
Phone: 281-514-9489
Email: email@example.com

Mike Ko
IBM
650 Harry Rd.
San Jose, CA 95120
Phone: (408) 927-2085
Email: firstname.lastname@example.org

Mike Krause
Hewlett-Packard Corporation, 43LN
19410 Homestead Road
Cupertino, CA 95014 USA
Phone: +1 (408) 447-3191
Email: email@example.com

Dave Minturn
Intel Corporation
MS JF1-210
5200 North East Elam Young Parkway
Hillsboro, Oregon 97124
Phone: 503-712-4106
Email: firstname.lastname@example.org

Jim Pinkerton
Microsoft, Inc.
One Microsoft Way
Redmond, WA, USA 98052
Email: email@example.com

Hemal Shah
16215 Alton Parkway
Irvine, California 92619-7013 USA
Phone: +1 949 926-6941
Email: firstname.lastname@example.org

Allyn Romanow
Cisco Systems
170 W Tasman Drive
San Jose, CA 95134 USA
Phone: +1 408 525 8836
Email: email@example.com

Tom Talpey
Network Appliance
375 Totten Pond Road
Waltham, MA 02451 USA
Phone: +1 (781) 768-5329
EMail: firstname.lastname@example.org

Patricia Thaler
Broadcom
16215 Alton Parkway
Irvine, CA 92618
Phone: 916 570 2707
email@example.com

Jim Wendt
Hewlett Packard Corporation
8000 Foothills Boulevard MS 5668
Roseville, CA 95747-5668 USA
Phone: +1 916 785 5198
Email: firstname.lastname@example.org

Jim Williams
Emulex Corporation
580 Main Street
Bolton, MA 01740 USA
Phone: +1 978 779 7224
Email: email@example.com

Full Copyright Statement

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright (C) The Internet Society (2006). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights.
Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79. Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at firstname.lastname@example.org.