INTERNET-DRAFT Jeff Hilland
draft-hilland-rddp-verbs-00.txt Hewlett-Packard Company
Paul Culley
Hewlett-Packard Company
Jim Pinkerton
Microsoft Corporation
Renato Recio
IBM Corporation
Expires: October, 2003
RDMA Protocol Verbs Specification
1 Status of this Memo
This document is an Internet-Draft and is subject to all provisions
of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
2 Abstract
This document describes an abstract interface to an RDMA enabled NIC
(RNIC). This interface is implemented as a combination of the RNIC,
its associated firmware, and host software. It provides access to
the RNIC queuing and memory management resources, as well as the
underlying networking layers.
Hilland, et al. Expires October 2003 [Page 1]
Internet-Draft RDMA Verbs Specification 25 Apr 2003
Table of Contents
1 Status of this Memo.........................................1
2 Abstract....................................................1
3 Introduction................................................7
4 Glossary....................................................9
4.1 Abbreviations..............................................19
5 RNIC Interface.............................................22
5.1 The RNIC...................................................23
5.1.1 RNIC Resources...........................................23
5.1.1.1 Expected Creation Sequence............................24
5.1.1.2 Expected Destruction Sequence.........................25
5.1.2 Opening an RNIC..........................................28
5.1.3 Query RNIC...............................................28
5.1.4 Closing an RNIC..........................................28
5.2 Protection Domains.........................................28
5.2.1 Allocating a PD..........................................29
5.2.2 Deallocating a PD........................................30
5.3 Completion Queues..........................................30
5.3.1 Creating a Completion Queue..............................30
5.3.2 Querying Completion Queue Attributes.....................31
5.3.3 Modifying Completion Queue Attributes....................32
5.3.4 Destroying a Completion Queue............................32
6 Queue Pairs................................................33
6.1 Queue Pair Resource Handling...............................34
6.1.1 Creating a Queue Pair....................................34
6.1.2 Querying Queue Pair Attributes...........................35
6.1.3 Modifying Queue Pair Attributes..........................36
6.1.4 Destroying a Queue Pair..................................39
6.2 Queue Pair Resource States.................................41
6.2.1 Idle State...............................................43
6.2.1.1 Idle to Idle..........................................44
6.2.1.2 Idle to RTS...........................................44
6.2.1.3 Idle to Error.........................................46
6.2.2 RTS (Ready to Send) State................................48
6.2.2.1 RTS to RTS............................................48
6.2.2.2 RTS to Closing........................................49
6.2.2.3 RTS to Terminate......................................49
6.2.2.4 RTS to Error..........................................50
6.2.3 Terminate State..........................................53
6.2.4 Error State..............................................56
6.2.5 Closing State............................................58
6.3 Shared Receive Queue.......................................62
6.3.1 Creating a Shared Receive Queue..........................63
6.3.2 Modifying a Shared Receive Queue.........................63
6.3.3 Destroying a Shared Receive Queue........................63
6.3.4 Associating an S-RQ with a QP............................64
6.3.5 Shared Receive Queue Processing Model....................64
6.3.6 S-RQ Error Semantics.....................................66
6.3.7 S-RQ Resource Sizing.....................................66
6.3.8 S-RQ Limit Checking......................................67
6.4 Stopping QP processing and Sending the Terminate Message...68
6.5 Outstanding RDMA Read Resource Management..................71
6.5.1 Example IRD/ORD Negotiation..............................74
6.6 Connection Management......................................75
6.6.1 Connection Initialization................................75
6.6.1.1 Active Connection Initialization after LLP Startup....76
6.6.1.2 Passive Connection Initialization after LLP Startup...78
6.6.2 Connection Teardown......................................79
6.6.2.1 Normal Close..........................................80
6.6.2.2 ULP Initiated Termination.............................81
6.6.2.3 ULP Initiated Abortive Teardown.......................82
6.6.2.4 Remote Termination....................................83
6.6.2.5 Local Termination, Local Abortive Teardown and Remote
Abortive Teardown...............................................83
7 Memory Management..........................................87
7.1 Memory Management Overview.................................87
7.2 Steering Tag (STag)........................................88
7.2.1 STag of zero.............................................90
7.2.2 Summary of Memory Region STag States.....................91
7.3 Memory Registration........................................93
7.3.1 Memory Regions...........................................94
7.3.1.1 Memory Region Tagged Offset (TO)......................94
7.3.2 Memory Region Creation and Registration..................94
7.3.2.1 Allocate Non-Shared Memory Region STag................95
7.3.2.2 RI-Register Non-Shared Memory Region..................95
7.3.2.3 RI-Reregister Non-Shared Memory Region................96
7.3.2.4 Register Shared Memory Region.........................98
7.3.2.5 Fast-Register Non-Shared Memory Region................99
7.4 Access to Registered Memory...............................100
7.4.1 Local Access to Registered Memory.......................101
7.4.2 Remote Access to Registered Memory......................101
7.4.3 Multiple Registrations of Memory Regions................103
7.5 Memory Access Control.....................................104
7.5.1 Local Access Control....................................105
7.5.2 Remote Access Control...................................106
7.6 Addressing................................................106
7.6.1 Addressing Registered Memory............................106
7.6.1.1 Addressing with VA based TO..........................107
7.6.1.2 Addressing with Zero Based TO........................108
7.6.2 Physical Buffer Lists...................................109
7.6.2.1 Page Lists...........................................109
7.6.2.2 Block Lists..........................................110
7.6.3 Error Checking of Local and Remote Accesses to MRs......110
7.7 Querying Memory Regions...................................111
7.8 Invalidating Memory Regions...............................111
7.9 Deallocation of STag associated with a Memory Region......114
7.10 Memory Windows..........................................115
7.10.1 Allocating Memory Windows..............................115
7.10.2 Binding Memory Windows to Memory Regions...............116
7.10.3 Querying Memory Windows................................120
7.10.4 Invalidating or De-allocating Memory Windows...........120
7.10.4.1 Invalidating or De-allocating Active Windows.........121
7.10.5 Summary of Memory Window STag States...................121
7.10.6 Error Checking during Memory Window Operations.........122
7.10.6.1 Error Checking at Window Bind Time...................122
7.10.6.2 Error Checking at Window Access Time.................123
7.10.6.3 Error Checking at Window Invalidate Time.............123
8 Work Requests and the WR Processing Model.................125
8.1 Work Requests.............................................125
8.1.1 Creating Work Requests..................................125
8.1.2 Work Request Types......................................125
8.1.2.1 Send/Receive.........................................125
8.1.2.2 RDMA.................................................126
8.1.2.3 Memory...............................................129
8.1.3 Work Request Contents...................................130
8.1.3.1 Signaled Completions.................................130
8.1.3.2 Scatter/Gather List..................................131
8.1.3.3 RDMA Data Source & Data Sink.........................132
8.2 Work Request Processing Model.............................133
8.2.1 Submitting Work Request to a Work Queue.................133
8.2.2 Work Request Processing.................................134
8.2.2.1 Memory Management Operation Ordering.................137
8.2.2.2 Read Fence and Local Fence Indicators................140
8.2.3 Completion Processing...................................143
8.2.4 Returning Completed Work Requests.......................144
8.2.5 Asynchronous Completion Notification....................145
8.3 Error Handling............................................147
8.3.1 Immediate Errors........................................148
8.3.2 Work Completion Errors..................................148
8.3.3 Asynchronous Errors.....................................150
9 RNIC Verbs................................................157
9.1 Consumer Accessibility....................................157
9.2 RNIC Resource Management..................................158
9.2.1 RNIC....................................................158
9.2.1.1 Open RNIC............................................158
9.2.1.2 Query RNIC...........................................159
9.2.1.3 Close RNIC...........................................161
9.2.2 Protection Domain.......................................162
9.2.2.1 Allocate PD..........................................162
9.2.2.2 Deallocate PD........................................163
9.2.3 Completion Queue........................................163
9.2.3.1 Create CQ............................................163
9.2.3.2 Query CQ.............................................164
9.2.3.3 Modify CQ............................................165
9.2.3.4 Destroy CQ...........................................166
9.2.4 Shared Receive Queue....................................167
9.2.4.1 Create S-RQ..........................................167
9.2.4.2 Query S-RQ...........................................168
9.2.4.3 Modify S-RQ..........................................169
9.2.4.4 Destroy S-RQ.........................................170
9.2.5 Queue Pair..............................................170
9.2.5.1 Create QP............................................170
9.2.5.2 Query QP.............................................174
9.2.5.3 Modify QP............................................176
9.2.5.4 Destroy QP...........................................178
9.2.6 Memory Management.......................................179
9.2.6.1 Allocate Non-Shared Memory Region STag...............179
9.2.6.2 Register Non-Shared Memory Region (RI-Register)......180
9.2.6.3 Query Memory Region..................................182
9.2.6.4 Deallocate STag......................................183
9.2.6.5 Reregister Non-Shared Memory Region (RI-Reregister)..184
9.2.6.6 Register Shared Memory Region........................187
9.2.6.7 Allocate Memory Window...............................188
9.2.6.8 Query Memory Window..................................189
9.3 Work Request Processing...................................190
9.3.1 QP Operations...........................................190
9.3.1.1 PostSQ...............................................190
9.3.1.2 PostRQ...............................................197
9.3.2 CQ Operations...........................................198
9.3.2.1 Poll for Completion (Poll CQ)........................198
9.3.2.2 Request Completion Notification......................200
9.4 Event Handling............................................200
9.4.1 Set Completion Event Handler............................200
9.4.2 Set Asynchronous Event Handler..........................202
9.5 Result Types..............................................203
9.5.1 Immediate Status Codes..................................203
9.5.1.1 RNIC Management Verb Status..........................204
9.5.1.2 PD Management Verb Status............................204
9.5.1.3 CQ Management Verb Status............................205
9.5.1.4 S-RQ Management Verb Status..........................205
9.5.1.5 QP Management Verb Status............................206
9.5.1.6 Memory Management Verb Status........................207
9.5.1.7 Post Verb Status.....................................208
9.5.1.8 Event Management Verb Status.........................209
9.5.2 Completion Status Codes.................................210
9.5.3 Asynchronous Event Identifiers..........................212
10 Security Considerations...................................217
11 IANA Considerations.......................................218
12 References................................................219
12.1 Normative References....................................219
12.2 Informative References..................................219
13 Appendix..................................................220
13.1 Connection Initialization at LLP Startup................220
13.2 Graceful Receive Overflow Handling......................221
14 Author's Addresses.......................................223
15 Acknowledgments...........................................224
16 Full Copyright Statement..................................227
Table of Figures
Figure 1 - Architectural RNIC & RI Model..........................8
Figure 2 - Resource Creation Dependency Diagram..................25
Figure 3 - Resource Destruction Dependency Diagram...............27
Figure 4 - Allowable QP Attribute Modifications..................37
Figure 5 - Optional QP Attribute Modifications...................38
Figure 6 - QP State Diagram......................................42
Figure 7 - Idle State summary....................................47
Figure 8 - RTS State summary.....................................52
Figure 9 - Terminate State summary...............................55
Figure 10 - Error State summary..................................57
Figure 11 - Closing State summary................................61
Figure 12 - Terminate Control Field Values.......................71
Figure 13 - An example RDMA Read Resource negotiation............75
Figure 14 - Connection Initialization after LLP Startup..........76
Figure 15 - Normal Close on TCP..................................81
Figure 16 - Abortive Teardown example on TCP.....................86
Figure 17 - Memory Region and Window State Diagram...............92
Figure 18 - Valid Combinations of MR Access Rights..............103
Figure 19 - MR to MW Valid Binding Combinations.................117
Figure 20 - Valid Combinations of MW & MR Access Rights.........119
Figure 21 - Valid QP & STag Access Right Combinations...........128
Figure 22 - Fencing on Prior Operations.........................142
Figure 23 - Completion Errors with Resulting Terminate Codes....150
Figure 24 - Affiliated Asynchronous Errors with Terminate Codes.155
Figure 25 - Unaffiliated Asynchronous Errors with Terminate Code156
Figure 26 - Memory Management Verbs.............................179
Figure 27 - PostSQ Input Modifier Validity......................196
Figure 28 - RNIC Management Verb Status.........................204
Figure 29 - PD Management Verb Status...........................204
Figure 30 - CQ Management Verb Status...........................205
Figure 31 - S-RQ Management Verb Status.........................206
Figure 32 - QP Management Verb Status...........................207
Figure 33 - Memory Management Verb Status.......................208
Figure 34 - Post Verb Status....................................209
Figure 35 - Event Management Verb Status........................209
Figure 36 - Completion Status Codes.............................212
Figure 37 - Asynchronous Event Identifiers......................216
Figure 39 - Connection Initialization at LLP Startup (using TCP)220
3 Introduction
This document describes an abstract interface to an RDMA aware NIC
(RNIC). The RNIC implements the RDMA Protocol [RDMAP][DDP] above a
reliable transport, such as [MPA] over TCP. The Verbs provide the
Consumer with a semantic definition of the RNIC Interface.
RDMA provides Verbs Consumers the capability to control data
placement, eliminate data copy operations, and significantly reduce
communications overhead and latencies by allowing one Verbs Consumer
to directly place information in another Verbs Consumer's memory,
while preserving OS and memory protection semantics. Specification
of syntactic definitions (APIs, hardware registers) and
implementation details (hardware, firmware, software tradeoffs) is
beyond the scope of this specification.
Section 5 of this document defines the semantics of the RNIC
Interface (RI). This interface is implemented as a combination of
the RNIC, its associated firmware, and host software. Section 6
describes Queue Pairs, which represent the focus of interaction with
the RNIC for work submission. Section 7 describes Memory Management
and how the RNIC accesses buffers which contain data to be
transferred. Section 8 describes Work Requests and the WR Processing
Model, detailing the processing of the units of work from submission
to completion. Section 9 describes the RNIC Verbs. The Verbs are an
abstract description of the functionality of an RNIC Interface.
Section 10 describes security issues associated with implementing an
RDMA infrastructure.
A concept frequently encountered in this specification is that of
the Verbs Consumer, or simply, the Consumer. The precise meaning of
the phrase varies, as a function of context, but it always means the
executing entity employing the capabilities of the RNIC to
accomplish some objective. In some instances the Verbs Consumer may
be an OS kernel thread, in others a non-privileged application, and
in still others, some special, privileged process. Where the
difference is important to the correct behavior of an
implementation, it is defined explicitly.
Specification of the API used by the Verbs Consumer to access the
capabilities of the RI is outside of the scope of this
specification.
Figure 1 is a conceptual diagram that describes an architectural
model which includes Privileged Mode consumers, Non-Privileged Mode
consumers, RNIC components and the RI.
< Figure 1 did not convert properly from source >
< to be corrected in an upcoming version >
Figure 1 - Architectural RNIC & RI Model
4 Glossary
Access Rights - The Local and Remote Memory Access Rights assigned
to an STag. This includes Local Read, Local Write, Remote Read,
Remote Write, Remote Access Flag, and Bind.
Address List - A list of addresses that represent the physical pages
or blocks referenced by the Physical Buffer List.
Advertisement (Advertised, Advertise, Advertisements, Advertises) -
The act of informing a Remote Peer that a Local Node's Buffer is
available to it. A Node makes a buffer available for incoming
RDMA Read Request Message or incoming RDMA Write Message access
by informing its RDMA/DDP peer of the Tagged Buffer identifiers
(STag, TO, and buffer length). This advertisement of Tagged
Buffer information is not defined by RDMA/DDP and is left to the
ULP. A typical method would be for the Local Peer to embed the
Tagged Buffer's Steering Tag, TO, and length in a Send Message
destined for the Remote Peer.
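As an illustration of the typical method described above, the following is a minimal sketch of a ULP-defined advertisement record carried in the payload of a Send Message. The record layout, the field widths (32-bit STag, 64-bit TO, 32-bit length), the big-endian byte order, and the names buf_advert, advert_encode and advert_decode are all assumptions made for this example; this specification leaves the advertisement format to the ULP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ULP-defined advertisement of one Tagged Buffer.
 * Field widths are assumptions made for this sketch. */
struct buf_advert {
    uint32_t stag;    /* Steering Tag identifying the Tagged Buffer */
    uint64_t to;      /* Tagged Offset of the first advertised byte */
    uint32_t length;  /* number of bytes made available to the peer */
};

/* Size of the assumed wire encoding: 4 + 8 + 4 bytes. */
#define ADVERT_WIRE_SIZE 16u

/* Store the low nbytes bytes of v, most significant byte first. */
static void put_be(uint8_t *p, uint64_t v, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        p[i] = (uint8_t)(v >> (8 * (nbytes - 1 - i)));
}

/* Load nbytes bytes stored most significant byte first. */
static uint64_t get_be(const uint8_t *p, size_t nbytes)
{
    uint64_t v = 0;
    for (size_t i = 0; i < nbytes; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Local Peer: serialize the record into a Send Message payload of
 * ADVERT_WIRE_SIZE bytes. */
void advert_encode(const struct buf_advert *a, uint8_t *wire)
{
    assert(a != NULL && wire != NULL);
    put_be(wire, a->stag, 4);
    put_be(wire + 4, a->to, 8);
    put_be(wire + 12, a->length, 4);
}

/* Remote Peer: recover the record from a received payload. */
void advert_decode(const uint8_t *wire, struct buf_advert *a)
{
    assert(a != NULL && wire != NULL);
    a->stag = (uint32_t)get_be(wire, 4);
    a->to = get_be(wire + 4, 8);
    a->length = (uint32_t)get_be(wire + 12, 4);
}
```

In this sketch the Local Peer would call advert_encode and place the resulting 16 bytes in a Send Message, and the Remote Peer would call advert_decode before issuing RDMA Read or RDMA Write operations against the advertised buffer.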
Affiliated Asynchronous Event - This is an indication from the Verb
layer to the Consumer that an event has occurred related to a
specific identifiable RNIC Resource, such as a Completion Queue
or Queue Pair.
Affiliated Error - An error that can be directly related back to a
specific RNIC Resource, such as a QP, S-RQ or CQ, but that
cannot be returned through a Work Completion.
Associated QP - The QP on the Remote Peer which is directly
accessing the other end of the RDMA Stream.
Asynchronous Error - This is an error that could not be reported
through immediate or completion error-handling mechanisms at the
local end. An asynchronous mechanism is necessary as a single
point of error handling for errors which could not otherwise be
reported through the normal mechanism, either because they are not
associated directly with any single QP, S-RQ or CQ, or because the
QP and/or CQ is in a state where an error cannot be reported.
Asynchronous errors may be Unaffiliated or may be Affiliated
with a specific QP, CQ or S-RQ.
Base Tagged Offset (Base TO) - The offset assigned to the first byte
of a Memory Region or a Memory Window.
Bind, Binding, Bound - The act of associating an STag, TO, and
Length within a previously registered Memory Region in order to
define a Memory Window.
Block List - A list of physical addresses describing a set of memory
blocks, which specifies the block size, list of physical
addresses, and offset to the start of the memory region of the
first block. Each block has the same length and that length can
be any value in the range supported by the RNIC. Each block may
start at a byte granularity address. The starting address for
the entire list may be an offset into the first block and the
entire list may have any length.
Complete (Completed, Completion, Completes) - When the Consumer can
determine that a particular RDMA Operation has performed all
functions specified for the RDMA Operation, including Placement
and Delivery. This can be determined through a Work Completion
for Signaled Work Requests. For Unsignaled Work Requests, this
means that the Completion Rules have been met. Note that this is
a superset of the [RDMAP] definition for RDMA Completion.
Completion Error - A Processing Error reported through the
Completion Queue.
Completion Queue (CQ) - A sharable queue containing one or more
entries which can contain Completion Queue Entries. A CQ is used
to create a single point of completion notification for multiple
Work Queues. The Work Queues associated with a Completion Queue
may be from different QPs and of differing queue types (SQs or
RQs).
Completion Queue Entry (CQE) - The RNIC Interface internal
representation of a Work Completion.
Completion Status - The resultant status of a Work Request returned
as part of a Work Completion.
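The relationship between a Completion Queue, its Completion Queue Entries, and the Work Queues that share it can be sketched as a small ring buffer. This is only an illustration under stated assumptions: the structure names (cq, cqe), the fields of an entry, the fixed depth of four, and the functions cq_push and cq_poll are hypothetical and are not defined by this specification, which leaves the internal representation to the RNIC Interface.

```c
#include <assert.h>
#include <stdint.h>

/* Queue type of the Work Queue that produced a completion. */
enum wq_type { WQ_SEND, WQ_RECV };

/* Hypothetical Completion Queue Entry: the fields are assumptions. */
struct cqe {
    uint32_t     qp_id;   /* QP that owns the originating Work Queue */
    enum wq_type wq;      /* SQ or RQ */
    uint64_t     wr_id;   /* Consumer-chosen Work Request identifier */
    int          status;  /* Completion Status, 0 meaning success here */
};

#define CQ_DEPTH 4u

/* Hypothetical CQ: a fixed-size ring shared by several Work Queues. */
struct cq {
    struct cqe ring[CQ_DEPTH];
    unsigned   head;   /* index of the oldest unpolled entry */
    unsigned   count;  /* number of entries currently queued */
};

/* RI side: record a completion. Returns 0, or -1 if the CQ is full. */
int cq_push(struct cq *q, struct cqe e)
{
    assert(q != 0);
    if (q->count == CQ_DEPTH)
        return -1;
    q->ring[(q->head + q->count) % CQ_DEPTH] = e;
    q->count++;
    return 0;
}

/* Consumer side: retrieve the oldest completion.
 * Returns 1 if an entry was copied to *out, 0 if the CQ is empty. */
int cq_poll(struct cq *q, struct cqe *out)
{
    assert(q != 0 && out != 0);
    if (q->count == 0)
        return 0;
    *out = q->ring[q->head];
    q->head = (q->head + 1) % CQ_DEPTH;
    q->count--;
    return 1;
}
```

Because each entry carries the QP and queue type, completions from the SQ of one QP and the RQ of another can be retrieved from the same CQ in the order they were recorded.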
Consumer, Verbs Consumer - A software process that communicates
using RDMA/DDP Verbs. The Consumer typically consists of an
application program, or an operating system adaptation layer,
which provides some OS specific API.
Direct Data Placement Protocol (DDP) - A wire protocol that supports
Direct Data Placement by associating explicit memory buffer
placement information with the LLP payload units.
Data Delivery (Delivery, Delivered, Delivers) - Delivery is defined
as the process of informing the ULP or Consumer that a
particular Message is available for use. This is specifically
different from Data Placement, which may generally occur in any
order, while the order of Data Delivery is strictly defined.
Data Placement (Placement, Placed, Places) - A mechanism whereby ULP
data contained within RDMA/DDP Segments may be put directly into
its final destination in memory without processing by the ULP.
This may occur even when the RDMA/DDP Segments arrive out of
order. Note that this differs from Data Delivery (see definition
in this section). From the Verbs viewpoint, Data Placement is
only confirmed upon Completion.
Data Sink - The peer receiving a data payload. Note that the Data
Sink can be required to both send and receive RDMA/DDP Messages
to transfer a data payload.
Data Source - The peer sending a data payload. Note that the Data
Source can be required to both send and receive RDMA/DDP
Messages to transfer a data payload.
Event - An indication provided by the RDMAP Layer to the ULP to
indicate a Completion or other condition requiring immediate
attention.
Fabric - The collection of links, switches, and routers that connect
a set of Nodes with RDMA/DDP protocol implementations.
First Byte Offset (FBO) - The offset into the first Physical Buffer
of a Memory Region. The value of the FBO cannot exceed the size
of the Physical Buffer Entry Size associated with the Memory
Region.
Handle - An opaque identifier used to reference an RNIC or an RNIC
Resource. Whether this is an index, object or some other
construct is outside the scope of this specification.
Immediate Error - An error discovered by the RNIC Interface (RI) and
reported through the RI without affecting the RNIC.
Inbound RDMA Read Queue Depth (IRD) - The maximum number of incoming
outstanding RDMA Read Request Messages the RNIC's QP can handle
at the Data Source.
Inbound RDMA Read Request Queue (IRRQ) - The RI internal resource
which handles incoming RDMA Read Request Messages, queues them
for processing by the RI, and then generates the RDMA Read
Response Messages. This corresponds to Queue Number 1 in [DDP].
Invalidate STag (Invalidate, Invalidated, etc.) - A mechanism used
to prevent the Remote Peer from reusing an Advertised STag,
until the Local Peer transitions the STag to the Valid state.
Invalidate Local STag - A Work Request that takes an STag which is
valid within the local RI and performs an Invalidate STag
operation.
iWARP - A suite of wire protocols comprised of [RDMAP] & [DDP]. The
iWARP protocol suite may be layered above [MPA] and [TCP], or it
may be layered over [SCTP] or other transport protocols.
Local Access - The rights used to verify the RNIC's ability to
access the Data Sink for incoming Untagged Messages, the Data
Source for outgoing Untagged Messages and the Data Source for
outgoing RDMA Write Messages.
Local Fence - To block the current operation from executing until
all prior local operations submitted on the same Work Queue have
Completed.
Local Peer - The RDMA/DDP protocol implementation on the local end
of the connection. Used to refer to the local entity when
describing a protocol exchange or other interaction between two
Nodes.
Lower Layer Protocol (LLP) - The protocol layer beneath the protocol
layer currently being referenced. For example, for DDP the LLP
is SCTP, MPA, or other transport protocols. For RDMA, the LLP is
DDP.
LLP Closed (LLP Close) - When the LLP Stream can no longer be used
for data transmission. If there is a single LLP Stream on an LLP
Connection, it may also mean that the LLP Connection has been
torn down. For example, for TCP this could include the states
TIME-WAIT, CLOSING, LAST-ACK, and CLOSED.
LLP Connection - Corresponds to an LLP transport-level connection
between the peer LLP layers on two nodes.
LLP Reset - The abnormal LLP closing mechanism, usually used to
indicate that the LLP Stream (and possibly Connection) was
aborted mid-stream. An example of this would be a TCP connection
being closed due to the reception or transmission of a TCP RST
on the connection.
LLP Stream - Corresponds to a single bi-directional LLP transport-
level association between the peer LLP layers on two Nodes. One
or more LLP Streams may map to a single transport-level LLP
Connection. For transport protocols that support multiple
Streams per connection (e.g. SCTP), an LLP Stream corresponds to
one transport-level Stream.
Memory Region (MR) - An area of memory that the Consumer wants the
RNIC to be able to (locally or locally and remotely) access
directly in a logically contiguous fashion. A Memory Region is
identified by an STag, a Base TO, and a length. A Memory Region
is associated with a Physical Buffer List through the STag.
Memory Registration (Registration, Register) - The mechanism used to
enable direct (local or local and remote) access by the RNIC of
a Consumer Memory Region. The memory registration operation
associates a Physical Buffer List to the Steering Tag (STag)
returned.
Memory Translation and Protection Table(s) (TPT) - The data
structure(s) used by an RNIC to control buffer access and
translate STags and Tagged Offsets into local memory addresses
directly accessible by the local Node.
Memory Window (MW) - A subset of a Memory Region, which can be
remotely accessed in a logically contiguous fashion. A Memory
Window is identified by an STag, a Base TO, and a length, but
also references an underlying Memory Region and has Access
Rights.
Message Sequence Number (MSN) - For the Untagged Buffer Model, it
specifies a sequence number that is increasing with each DDP
Message.
Modifiers - In a Verb definition, the list of input and output
objects that specify how, and on what, the Verb is to be
executed.
Node - A computing device attached to one or more links of a Fabric
(network). A Node in this context does not refer to a specific
application or protocol instantiation running on the computer. A
Node may consist of one or more RNICs installed in a host
computer.
Non-Privileged Mode - An operating mode in which Consumers must rely
on another agent, having a sufficiently high level of privilege,
to manipulate OS data structures.
Non-Shared Memory Region - A Memory Region that solely owns the
Physical Buffer List associated with the Memory Region.
Specifically, the PBL is not shared, and has never been shared,
with another Memory Region.
Outbound RDMA Read Queue Depth (ORD) - The maximum number of
outstanding RDMA Read Request Messages the RNIC can initiate
from the SQ at the Data Sink.
Outstanding - The state of a Work Request after it has been posted
on a Work Queue, but before the retrieval of the Work
Completion, or confirmation that the WR has been completed, by
the Consumer.
Page List - A list of physical addresses describing a set of memory
pages, which specifies the page size, list of physical
addresses, and offset to the start of the memory region of the
first page. The starting physical address of each page is
aligned on a power-of-two address and the size of the page is a
power of two. Note that it is possible for the starting offset
to be an offset into the first page and to be of a byte
granularity and the entire list may have an arbitrary length.
Physical Address - A physical address is used by an RNIC to retrieve
contents from the local host's memory. Physical addresses are
determined via the translation of the STag and Tagged Offset by
the use of the Memory Translation and Protection Table(s).
Physical Buffer - A set of physically contiguous memory locations
that can be directly accessed by the RNIC through Physical
Addresses. A Physical Buffer can either be a block buffer or a
page buffer, depending on its use as part of a Block List or
Page List.
Physical Buffer Entry Size - The size, in bytes, of each Physical
Buffer in the Physical Buffer List. If the Physical Buffer List
references a Page List, the size is a power of two. If the
Physical Buffer List references a Block List, the size can have
any value within the range supported by the RNIC.
Physical Buffer List (PBL) - A list of Physical Buffers. The
Physical Buffer List can either be a Block List or a Page List.
Physical Memory Addresses - The addresses an RNIC uses when
accessing host system memory.
Pinning memory - A function supplied by the OS that forces the
Memory Region to be resident in physical memory and keeps the
virtual-to-physical address translations constant from the
RNIC's point of view.
Place - Also Placed, Placement. See Data Placement.
Post Receive Queue Work Request (PostRQ) - A Verb that posts a Work
Request to the Receive Queue of a Queue Pair. This is done to
indicate the Data Sink Buffers for incoming Send Operation
Types.
Post Send Queue Work Request (PostSQ) - A Verb that posts a Work
Request to the Send Queue of a Queue Pair. This is done to
initiate all data transfer operations as well as Fast-Register,
Bind MW and Local Invalidate operations.
Privileged Mode - A mode in which Consumers have a privilege level
sufficient to access OS internal data structures directly, and
in which they have the responsibility to control access to the
RI.
Processing Error - An error detected below the RNIC Interface during
the processing of a Work Request or an incoming RDMA operation.
Protection Domain (PD) - A mechanism for tracking the association of
Queue Pairs, Memory Windows, and Memory Regions. PDs are
intended to be set by a Privileged Consumer to provide
protection of one process from accessing another's memory
through the use of the RNIC.
Protection Domain ID (PD ID) - The identifier which represents a
Protection Domain. It is passed in as an Input Modifier when
creating QPs, Memory Windows and MRs. The values of PD IDs are
compared during processing of Work Requests.
Queue Pair (QP) - The pair of queues that allow the Consumer to
interact with the RNIC Interface. The two queues are the Send
Queue and the Receive Queue. Each queue stores a Work Queue
Element from the time it is posted until the time it is
completed.
Queue Pair Context - The collection of information needed by the
RNIC Interface to perform the RDMA Operations associated with
the Queue Pair. This includes various pointers to buffers,
queues, and CQs, as well as LLP specific connection and stream
information.
Queue Pair Identifier (QP ID) - An identifier representing a Queue
Pair.
Read Fence - To block the current operation from executing until all
prior RDMA Read Type WRs submitted to the Send Queue have
Completed.
Receive Queue (RQ) - One of the two Work Queues associated with a
Queue Pair. The Receive Queue contains Work Queue Elements that
describe the Buffers into which data from incoming Send
Operation Types is placed.
Remote Access - The Access Rights used to verify the RNIC's ability
to access the Data Sink for incoming DDP Tagged Messages and the
Data Source for RDMA Read Request Messages.
Remote Direct Memory Access (RDMA) - A method of accessing memory on
a remote system in which the local system specifies the remote
location of the data to be transferred. Employing an RNIC in the
remote system allows the access to take place without
interrupting the processing of the CPU(s) on the system. Also
used to indicate the layer implementing the RDMAP wire protocol
semantics.
RDMA Message - The sequence of DDP segments which represents an RDMA
Operation.
RDMA Operation - A sequence of RDMAP Messages, including control
Messages, to transfer data from a Data Source to a Data Sink.
The following RDMA Operations are defined - RDMA Write
Operation, RDMA Read Operation, Send Operation, Send with
Invalidate Operation, Send with Solicited Event Operation, Send
with Solicited Event & Invalidate Operation, and Terminate
Operation. Note that the various forms of Send Operations are
defined in [RDMAP] to be called Send Type Operations.
RDMA Protocol (RDMAP) - A wire protocol that supports RDMA
Operations to transfer ULP data between a Local Peer and the
Remote Peer. See [RDMAP].
RDMA Read Operation - An RDMA Operation that consists of a single
RDMA Read Request Message and a single RDMA Read Response
Message. The Data Sink uses this operation to transfer the
contents of a Data Source buffer from the Remote Peer to the
Local Peer.
RDMA Read Request - An RDMA Message used by the Data Sink to request
the Data Source to transfer the contents of a buffer. The RDMA
Read Request Message describes both the Data Source and Data
Sink buffers.
RDMA Read Response - An RDMA Message used by the Data Source to
respond to an RDMA Read Request Message.
RDMA Read Type Work Request - A PostSQ Work Request which specifies
an operation type of either an RDMA Read or an RDMA Read with
Invalidate Local STag.
RDMA Stream - A single bi-directional association between the peer
RDMA layers on two Nodes over a single LLP Stream.
RDMA Write Operation - An RDMA Operation that transfers the contents
of a source buffer from the Local Peer to a destination buffer
at the Remote Peer using an RDMAP Write Message. The RDMAP Write
Message only describes the Data Sink's buffer.
RDMA Network Interface Controller (RNIC) - A network I/O adapter or
embedded controller with iWARP and Verbs functionality.
Remote Peer - The RDMA protocol implementation on the opposite end
of the connection. Used to refer to the remote entity when
describing protocol exchanges or other interactions between two
Nodes.
Remote RDMA Read Operation - A sequence of events that begins upon
receipt of an incoming RDMA Read Request by the RI and stays
in-process until the corresponding RDMA Read Response Message has
been generated. This includes posting the RDMA Read Request to
the Inbound RDMA Read Request Queue (see Section 6.5 -
Outstanding RDMA Read Resource Management).
RNIC Interface (RI) - The presentation of the RNIC to the Verbs
Consumer as implemented through the combination of the RNIC and
the RNIC device driver.
Scatter/Gather Element (SGE) - An individual entry in a
Scatter/Gather List. Each SGE consists of an STag, Tagged Offset
and Length.
Scatter/Gather List (SGL) - A List of Scatter/Gather Elements. The
list describes one or more ULP Buffers which will have their
data gathered on transmission or scattered upon reception.
Send - An RDMA Operation that transfers the contents of an Untagged
buffer from the Local Peer to an Untagged buffer at the Remote
Peer.
Send Operation Types - The set of Send operations that result in the
consumption of a Receive Queue Work Request at the Data Sink.
Specifically this includes Send, Send with Invalidate, Send with
Solicited Event and Send with Solicited Event & Invalidate.
Send Queue (SQ) - One of the two Work Queues associated with a Queue
Pair. The Send Queue contains PostSQ Work Queue Elements that
have specific operation types, such as Send Type, RDMA Write, or
RDMA Read Type Operations, as well as STag operations such as
Bind and Invalidate.
Shared Memory Region - An MR that currently shares, or at one time
shared, the Physical Buffer List associated with the Memory
Region. Specifically, the PBL is currently shared or was
previously shared with another Memory Region.
Shared Receive Queue - An optional mechanism which allows the
Receive Queues from multiple QPs to retrieve Receive Queue Work
Queue Elements from the same shared queue as needed.
Signaled - A WR which requires that the RNIC generate a Work
Completion.
Solicited Event (SE) - A facility by which an RDMA Operation sender
may cause an Event to be generated at the recipient, if the
recipient is configured to generate such an Event, when a Send
with Solicited Event or Send with Solicited Event & Invalidate
Message is received.
Steering Tag (STag) - An identifier of a Memory Window or Memory
Region. STags are composed of two components: an STag Index and
an STag Key. The Consumer forms the STag by combining the STag
Index with the STag Key. This specification further refines the
definitions of STags contained in [RDMAP] and [DDP].
STag Key - The least significant 8 bit portion of an STag. This
field of an STag can be set to any value by the Consumer when
performing a Memory Registration operation, such as Bind Memory
Window, Fast-Register Memory Region and Register Memory Region.
STag Index - The most significant 24 bits of an STag. This field of
the STag is managed by the RI and is treated as an opaque object
by the Consumer.
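The STag layout described above (a 24-bit STag Index in the most significant bits and an 8-bit STag Key in the least significant bits) can be sketched as follows. The helper names make_stag and split_stag are hypothetical and are not defined by this specification.

```python
STAG_KEY_BITS = 8
STAG_KEY_MAX = (1 << STAG_KEY_BITS) - 1   # 0xFF
STAG_INDEX_MAX = (1 << 24) - 1            # 0xFFFFFF


def make_stag(stag_index, stag_key):
    """Combine an RI-managed STag Index with a Consumer-chosen STag Key."""
    if not 0 <= stag_index <= STAG_INDEX_MAX:
        raise ValueError("STag Index must fit in 24 bits")
    if not 0 <= stag_key <= STAG_KEY_MAX:
        raise ValueError("STag Key must fit in 8 bits")
    return (stag_index << STAG_KEY_BITS) | stag_key


def split_stag(stag):
    """Return the (STag Index, STag Key) pair encoded in a 32-bit STag."""
    return stag >> STAG_KEY_BITS, stag & STAG_KEY_MAX
```

For example, an STag Index of 0x123456 combined with an STag Key of 0xAB yields the STag 0x123456AB.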
Tagged Buffer - A buffer that can be Advertised to a Remote Peer
through exchange of an STag, Tagged Offset, and length.
Tagged Offset (TO) - The offset within a Tagged Buffer.
Terminate - An RDMA Message used by a Node to pass an error
indication to the Remote Peer on an RDMA Stream.
Upper Layer Protocol (ULP) - The protocol layer above the Verb
layer. An example is SDP.
ULP Buffer - A buffer owned above the RI that can be represented
within the RNIC, in whole or in part, by a Memory Window or a
Memory Region.
ULP Message - The ULP data that is handed to a specific protocol
layer for transmission. Data boundaries are preserved as they
are transmitted through iWARP.
ULP Payload - The portion of a ULP Message that is contained within
a single protocol segment or packet (e.g. a DDP Segment).
Unaffiliated Asynchronous Event - This is an indication from the
Verb layer to the Consumer that an event has occurred unrelated
to any single identifiable RNIC Resource.
Unsignaled - A Work Request which only generates a Work Completion
if it encounters an error during processing.
Untagged Buffer - A buffer which is not Advertised to a Remote Peer,
that has Local Access Rights, and that is referenced by an STag,
Tagged Offset, and length.
Verbs - An abstract description of the functionality of an RNIC
Interface. The OS may expose some or all of this functionality
via one or more APIs to applications. The OS will also use some
of the functionality to manage the RNIC Interface.
Virtual Address - An address represented in the address space of a
local process on a node. It is generally used to present
logically contiguous addressability for an underlying and
possibly non-contiguous list of physical pages.
Virtual Address Based Tagged Offset (VA Based TO) - The Base TO of
an MR or MW that starts at a non-zero TO.
Work Completion (WC) - The output modifiers that the Consumer
retrieves from a Completion Queue indicating the results of a
Work Request.
Work Queue (WQ) - One of either a Send Queue or Receive Queue.
Work Queue Element (WQE) - The RNIC Interface's internal
representation of a Work Request.
Work Request (WR) - An elementary object used by Consumers to
enqueue requested operations (WQEs) onto the Send and Receive
Queues of a QP.
Work Request List (WRL) - A list of Work Requests.
Zero Based Tagged Offset (Zero Based TO) - The Base TO of an MR or
MW that starts at TO=0.
4.1 Abbreviations
CQ - Completion Queue
CQE - Completion Queue Entry
DDP - Direct Data Placement Protocol
FBO - First Byte Offset
IRD - Inbound RDMA Read Queue Depth
IRRQ - Inbound RDMA Read Request Queue
LLP - Lower Layer Protocol
MR - Memory Region
MW - Memory Window
ORD - Outbound RDMA Read Queue Depth
PBL - Physical Buffer List
PD - Protection Domain
PD ID - Protection Domain Identifier
QP - Queue Pair
QP ID - Queue Pair Identifier
RQ - Receive Queue
RDMA - Remote Direct Memory Access
RDMAP - Remote Direct Memory Access Protocol
RNIC - RDMA NIC
RI - RNIC Interface
SGE - Scatter-Gather Element
SGL - Scatter-Gather List
SE - Solicited Event
S-RQ - Shared Receive Queue
SQ - Send Queue
STag - Steering Tag
TO - Tagged Offset
TPT - Translation & Protection Table
ULP - Upper Layer Protocol
WC - Work Completion
WQ - Work Queue
WQE - Work Queue Element
WR - Work Request
WRL - Work Request List
5 RNIC Interface
The RNIC Interface (RI) is the locus of interaction between the
Consumer of RNIC services and the RNIC. Semantic behavior of the
RNIC is specified via Verbs, which enable creation and management of
Queue Pairs, management of the RNIC, management of Work Requests,
and transferring error indications from the RI that may be surfaced
via the Verbs. All these activities must be carried out so as to
enable Verbs Consumers to expect the same level of protection and
security as is guaranteed to other entities supported by the host
operating system.
A fundamental function of the RI is management of RNICs. This
includes arranging access to them, accessing and modifying their
attributes, and shutting them down. These activities are described
below, and details of the corresponding Verbs semantics are given in
subsequent sections.
Direct, protected access to Consumer memory is critical to realizing
the performance potential of the RNIC. This specification describes
the semantics of memory access defined in this architecture. It
describes in detail the ideas of Memory Regions and Memory Windows,
how they are created and managed, Access Rights for local and remote
access to registered memory, and the semantics of errors that may
arise.
The RI is assumed to be a traditional software interface, typically
synchronous in behavior, while QP interactions are assumed to be
work requests queued to connection specific, hardware based queues.
The queue processing model and associated memory protection
semantics allow QPs to be safely mapped and utilized by both Non-
Privileged and Privileged routines.
Queue Pairs (QPs) are a key component required for the operation of
the RI. They are the RNIC resource used by Consumers to submit Work
Requests to the RI. A QP is used to interact with an RDMA Stream on
an RNIC which is running the RDMA Protocol. There may be thousands
of QPs per RNIC. Each QP provides the Consumer with a single point
of access to an individual RDMA Stream.
Work Requests (WRs) provide the mechanism for Consumers to enqueue
Work Queue Elements (WQEs) onto the Send and Receive queues of a QP.
The varieties of WRs, and the dynamics of their creation, use, and
disposition are described in the sections to follow, as is the
disposition of errors that may arise as WRs are processed. Details of
the WR contents are discussed as well.
Completion Queues (CQs) provide the mechanism for the Consumer to
retrieve WR status. In addition, there are notification mechanisms
which help a Consumer to efficiently notice when WRs have completed
processing in the RI. There may be thousands of CQs per RNIC.
Event Handlers provide the mechanism for Consumers to be notified of
Asynchronous Events which occur within the RI but which cannot be
reported through the Completion Queues due to their asynchronous
nature or the fact that they are not easily associated with a Work
Completion.
5.1 The RNIC
Consumers gain access to an RNIC through the RNIC Interface. The
Verbs allow the Consumer to open the RNIC, retrieve RNIC attributes,
and close the RNIC.
All resources MUST be in the scope of the RNIC on which they are
created. This means that there is no requirement for resources on
one RNIC to be available, associated with or meaningful to another
RNIC, even if they are managed by the same RNIC driver. This
includes all QPs, STags, PDs, CQs, and multiple Completion Event
Handlers. This also means that any IDs which are created by the RI
are specific to that RNIC and are not guaranteed to be unique across
all RNICs.
An intent of the architecture is to allow an implementation to pass
Work Requests and Work Completions to and from a Non-Privileged Mode
Consumer process directly to and from the RNIC. Another intent of
the architecture is to optimize for a Privileged Mode
implementation, which shares the Work Request and Work Completion
requirements of Non-Privileged Mode Consumers but has slightly
different memory management requirements.
Because the architecture attempts to optimize for both Privileged
Mode and Non-Privileged Mode Consumers, there are some Verbs and
Verb modes which are not allowed to be executed by Non-Privileged
Mode Consumers. Examples are the use of the STag of zero and
the ability to post Fast-Register WRs. In addition, there are some
operations that, while allowed in kernel mode, are intended to
be used by Non-Privileged Mode applications. An example of this is
Memory Windows. Any restrictions are clearly specified in this
document where required.
5.1.1 RNIC Resources
RNIC Resources can be allocated from a variety of places. They can
be allocated in host memory on behalf of the Consumer or allocated
within the RNIC. Where an RNIC allocates resources is implementation
specific. Consequently, values that the RNIC returns as output
modifiers when Querying the RNIC indicate the maximum amount of any
given resource it can allocate, in the absence of other resource
allocations.
For example, an RNIC may allocate QPs and CQs from the same memory
within the RNIC. If a Consumer allocates the maximum number of QPs
before allocating any CQs, it may not be able to allocate any CQs
due to an insufficient resource condition - even though the RNIC
indicates that its maximum number of CQs is much larger than the
number currently allocated.
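The effect described in this example can be reproduced with a small model. The sketch below assumes a hypothetical RNIC whose QPs and CQs draw from one pool of memory units; the class name SharedPool, the unit costs, and the pool size are invented for illustration only.

```python
class SharedPool:
    """Hypothetical RNIC memory pool from which both QPs and CQs are carved."""

    def __init__(self, total_units, qp_cost, cq_cost):
        self.total_units = total_units
        self.free_units = total_units
        self.cost = {"QP": qp_cost, "CQ": cq_cost}

    def max_reported(self, kind):
        # Maximum as a query might report it: assumes no other allocations.
        return self.total_units // self.cost[kind]

    def allocate(self, kind):
        """Return True on success, False on an insufficient resource condition."""
        cost = self.cost[kind]
        if cost > self.free_units:
            return False
        self.free_units -= cost
        return True
```

With a pool of 100 units, a QP cost of 10 and a CQ cost of 5, the model reports maxima of 10 QPs and 20 CQs, yet no CQ can be allocated once all 10 QPs exist.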
The purpose of a handle is to provide a mechanism to look up a
specific resource. Resources that have handles associated with them
are the RNIC, CQ, S-RQ, QP, and Asynchronous Event Handler. Often a
handle is an address in memory. An identifier or index also
references a specific resource. An identifier or index is used when
the value must be used in a comparison operation. The QP ID, PD ID,
Completion Event Handler Identifier and STag Index fall in this
category.
It is expected that a resource manager above the RI will manage RNIC
resources appropriately for the operating environment.
5.1.1.1 Expected Creation Sequence
Due to RI Resource interdependencies, there is an ordering sequence
to the allocation and creation of RNIC resources. The sequence
indicated below, while not strictly required in all cases, may be
helpful to the reader.
1. Open the RNIC and set up an Asynchronous Event Handler.
2. Prior to initiating an LLP Connection, select the opened RNIC on
which you will create the connection and create a Protection
Domain.
3. Create one or more Completion Queues.
4. Set up one or more Completion Event Handlers.
5. Allocate and initialize a Shared Receive Queue, if desired.
6. Allocate and initialize one or more QPs.
7. Register one or more Memory Regions.
8. Allocate Non-Shared Memory Region STags, if desired.
9. Allocate Memory Windows, if desired.
10. Transition the QP through the state diagram to RTS.
11. Initiate Work Request Processing through PostSQ, PostRQ and Poll
CQ.
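A Consumer-side resource manager might check a planned creation order against the resource interdependencies before issuing any Verbs. The sketch below is illustrative only: the PREREQUISITES map is an assumption inferred from the prose of this section and of Section 5.2, not a transcription of the dependency figure, and the function name first_violation is hypothetical.

```python
# Assumed prerequisites, inferred from the surrounding prose. This map is
# illustrative and is not a normative statement of the dependencies.
PREREQUISITES = {
    "RNIC": [],
    "PD": ["RNIC"],
    "CQ": ["RNIC"],
    "S-RQ": ["PD"],
    "QP": ["PD", "CQ"],
    "MR": ["PD"],
    "MW": ["PD"],
}


def first_violation(order):
    """Return (resource, missing prerequisite) for the first error, or None."""
    created = set()
    for resource in order:
        for needed in PREREQUISITES[resource]:
            if needed not in created:
                return (resource, needed)
        created.add(resource)
    return None
```

An order that follows the numbered steps above passes the check, while attempting to create a QP directly after opening the RNIC is reported as missing its PD.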
Below in Figure 2 is a dependency diagram which may also be helpful
when determining the order in which resources are created. The
arrows indicate that the resource the arrow comes from must be
created or allocated before the item the arrow points to can be
created or allocated.
< Figure 2 did not convert properly from source >
< to be corrected in an upcoming version >
Figure 2 - Resource Creation Dependency Diagram
5.1.1.2 Expected Destruction Sequence
Due to RI Resource interdependencies, there is an ordering of de-
allocation and destruction of RNIC resources. The sequence indicated
below, while not strictly required in all cases, may be helpful to
the reader.
1. Invalidate all Memory Windows which are in the Valid state
through a QP WR, if possible.
2. Drain the SQ & RQ of WRs and poll the Work Completions through
the CQ.
3. Transition the QP state to Closing.
4. When the QP is in the Idle state, Destroy the Memory Windows.
5. Destroy the Memory Regions.
6. Destroy the Queue Pair.
7. Destroy the Shared Receive Queue, if created.
8. Destroy the Completion Queues.
9. Destroy the Protection Domain.
10. Close the RNIC.
Below in Figure 3 is a dependency diagram which may also be helpful
when determining the order in which resources are destroyed. The
arrows indicate that the resource the arrow comes from must be
destroyed or deallocated before the item the arrow points to can be
destroyed or deallocated. A dashed line means the action should
occur before the resource can be destroyed. A solid line means the
action must occur before the resource can be destroyed.
< Figure 3 did not convert properly from source >
< to be corrected in an upcoming version >
Figure 3 - Resource Destruction Dependency Diagram
5.1.2 Opening an RNIC
The Open RNIC Verb is used to open an RNIC and returns an opaque
handle to uniquely reference each RNIC so that Consumers can
distinguish between RNICs in the Local Node.
Opening an RNIC prepares it for use by the Consumer. Once opened, an
RNIC cannot be opened again until after it has been closed. At the
time the RNIC is opened, the RI MUST perform any initialization
functions required by the RNIC and the RI.
When the Consumer invokes the Open RNIC Verb, it indicates if this
RNIC is to be opened in Page Mode or Block Mode. The RI MUST
initialize the RNIC in either Page Mode or Block Mode, as indicated
by the Consumer with the input modifier. This will affect all Memory
Registrations and usage as well as resource consumption on the RNIC.
Note that while Page Mode MUST be supported, Block Mode is OPTIONAL.
For more information on Block Mode vs. Page Mode, see Section 7.6.2
- Physical Buffer Lists.
Detailed information on the accompanying Verb can be found in
Section 9.2.1.1 - Open RNIC.
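The open semantics described in this section can be modeled with a small state holder. This is a sketch under stated assumptions: the class Rnic, the exception RnicError, and the mode strings are hypothetical, and rejecting an unsupported Block Mode request with an error is an assumed behavior rather than one stated here.

```python
class RnicError(Exception):
    """Stand-in for an error returned by a Verb."""


class Rnic:
    """Hypothetical model of the open and close state of one RNIC."""

    def __init__(self, supports_block_mode=False):
        self.supports_block_mode = supports_block_mode  # Block Mode is optional
        self.mode = None  # None means the RNIC is closed

    def open(self, mode):
        if self.mode is not None:
            raise RnicError("RNIC is already open; close it first")
        if mode not in ("page", "block"):
            raise RnicError("mode must be 'page' or 'block'")
        if mode == "block" and not self.supports_block_mode:
            # Assumed behavior: an unsupported mode is rejected.
            raise RnicError("Block Mode is not supported by this RNIC")
        self.mode = mode
        return object()  # opaque handle, meaningful only to the RI

    def close(self):
        self.mode = None
```

The opaque handle is modeled as a bare object so that the Consumer cannot derive meaning from its value.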
5.1.3 Query RNIC
Consumers MUST be able to retrieve all of the defined attributes and
characteristics of the RNIC through the Query RNIC Verb. The full
list of RNIC Attributes is defined in Section 9.2.1.2 - Query RNIC.
The maximum values returned when querying the RNIC are values which
the RI will not exceed. This does not imply that a Consumer can
allocate all resources to their maximum levels simultaneously.
5.1.4 Closing an RNIC
Closing the RNIC resets the RNIC and deallocates any resources
allocated during the RNIC open.
The RI MUST track all RNIC resources created on behalf of the
Consumer, such as those allocated within the RI during the creation
of PDs, QPs, CQs, Memory Windows and MRs. When the Close RNIC Verb
returns, the RI MUST have freed all RNIC resources.
Detailed information on the accompanying Verb can be found in
Section 9.2.1.3 - Close RNIC.
5.2 Protection Domains
A Protection Domain (PD) is the mechanism used to associate Queue
Pairs with Memory Regions and Memory Windows as a means of enabling
and controlling RNIC access to host system memory. A Protection
Domain is represented by a unique identifier called a Protection
Domain Identifier (PD ID).
When the Consumer creates a PD, a PD ID is returned. The Consumer
then provides the PD ID to the RI when creating QPs, MRs & Memory
Windows. When a data transfer takes place, if the STag refers to an
MR, then the PD ID of the MR is validated against the PD ID of the
QP. If they do not match, the data transfer generates an error and
no data transfer takes place. If the STag refers to an MW, then the
PD ID of the MW is validated against the PD ID of the QP when the MW
is Bound to the QP. When a data transfer takes place, the QP ID of
the MW is validated against the QP ID of the QP. These rules allow
the Consumer to ensure that any STag being used on that connection,
either locally or remotely, has been specifically allowed by the
Consumer to be used on that connection.
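The validation rules in the preceding paragraph can be summarized in a short sketch. The class and function names are hypothetical, and the model ignores Access Rights, bounds checks, and STag state, all of which a real RI would also verify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueuePair:
    qp_id: int
    pd_id: int


@dataclass
class MemoryRegion:
    pd_id: int


@dataclass
class MemoryWindow:
    pd_id: int
    bound_qp_id: Optional[int] = None  # set when the MW is Bound to a QP


def mr_access_allowed(qp, mr):
    """MR rule: the PD IDs are compared at data transfer time."""
    return mr.pd_id == qp.pd_id


def bind_mw(qp, mw):
    """MW rule, part 1: the PD IDs are compared when the MW is Bound."""
    if mw.pd_id != qp.pd_id:
        return False
    mw.bound_qp_id = qp.qp_id
    return True


def mw_access_allowed(qp, mw):
    """MW rule, part 2: the QP IDs are compared at data transfer time."""
    return mw.bound_qp_id == qp.qp_id
```

In this model an MR is usable by every QP in its PD, whereas a Bound MW is usable only through the single QP it was Bound on.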
Each Queue Pair in an RNIC MUST be associated with a single PD ID.
Multiple Queue Pairs MUST be able to be associated with the same PD
ID.
Each Memory Region MUST be associated with a single PD ID. Multiple
Memory Regions MUST be able to be associated with the same PD ID.
Each Memory Window MUST be associated with a single PD ID when
allocated. Multiple Memory Windows MUST be able to be associated
with the same PD ID.
The RI MUST be able to associate any PD ID with any MW, MR, QP or S-
RQ on the RNIC.
Binding a Memory Window to a Memory Region and Fast-Register are
performed as Send Queue operations. The Bind operation MUST only be
allowed if the PD ID of the QP matches the PD ID of the Memory
Region and the PD ID of the QP matches the PD ID of the Memory
Window. Similarly, the Fast-Register operation MUST only be allowed
if the PD ID of the QP matches the PD ID of the STag used as an
input modifier for the Fast-Register. If the PD ID checks fail for
either operation, the operation MUST NOT take place and a Completion
Error MUST be generated.
Note that S-RQs use PDs as well. PD rules for S-RQs are covered in
Section 6.3.
5.2.1 Allocating a PD
Protection Domains MUST only be allocated through the RI. A PD ID is
required to be supplied as an input modifier when creating a Queue
Pair, registering a Memory Region, or allocating a Memory Window.
The RI MUST assign a unique PD ID to each PD allocated by the RI. PD
IDs MUST be unique per RNIC. PD IDs MAY be unique across multiple
RNICs which share the same RI.
Detailed information on the accompanying Verb can be found in
Section 9.2.2.1 - Allocate PD.
5.2.2 Deallocating a PD
PDs MUST only be deallocated through the RI. A PD MUST NOT be
deallocated if it is still associated with any Queue Pair, Shared
Receive Queue, Memory Region, or Memory Window. If this is
attempted, the Verbs MUST return an Immediate Error and not allow
the PD to be deallocated.
Detailed information on the accompanying Verb can be found in
Section 9.2.2.2 - Deallocate PD.
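The allocation and deallocation rules for PDs can be sketched with a per-RNIC table that records which resources still reference each PD. The names ProtectionDomainTable and ImmediateError, and the use of strings to identify resources, are assumptions made for illustration.

```python
class ImmediateError(Exception):
    """Stand-in for an Immediate Error returned by a Verb."""


class ProtectionDomainTable:
    """Hypothetical per-RNIC bookkeeping of PDs and the resources using them."""

    def __init__(self):
        self._next_id = 1
        self._users = {}  # PD ID -> set of associated resource names

    def allocate_pd(self):
        pd_id = self._next_id  # unique within this RNIC
        self._next_id += 1
        self._users[pd_id] = set()
        return pd_id

    def associate(self, pd_id, resource):
        self._users[pd_id].add(resource)

    def release(self, pd_id, resource):
        self._users[pd_id].discard(resource)

    def deallocate_pd(self, pd_id):
        if self._users[pd_id]:
            # A QP, S-RQ, MR, or MW still refers to this PD.
            raise ImmediateError("PD is still associated with a resource")
        del self._users[pd_id]
```

The PD survives a failed deallocation attempt unchanged, so the Consumer can destroy the remaining resources and retry.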
5.3 Completion Queues
The Completion Queue consists of entries to hold Work Completions.
The RI's internal representations of Work Completions are called
Completion Queue Entries (CQEs). The RI will post a CQE to the CQ
when it completes the operation of a Signaled WR. The Consumer then
Polls the CQ to retrieve the CQE as a Work Completion. When the Work
Completion is retrieved, the CQE is freed from the CQ and the entry
is available for another Work Request's Work Completion information.
For an Unsignaled WR, the RI will not generate a CQE when the WR
completes successfully. The RI will post a CQE to the CQ when an
Unsignaled WR completes in error. For more information on
Signaled and Unsignaled Completions, see Section 8.1.3.1.
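The posting rule for Signaled and Unsignaled WRs can be sketched with a minimal queue model. The class CompletionQueue and its method names are hypothetical, and capacity limits are deliberately left out of this sketch.

```python
from collections import deque


class CompletionQueue:
    """Hypothetical model of when the RI posts a CQE for a finished WR."""

    def __init__(self):
        self._cqes = deque()

    def complete(self, wr_id, signaled, error=False):
        """Called when a WR finishes; posts a CQE only when one is required."""
        # Signaled WRs always produce a CQE; Unsignaled WRs only on error.
        if signaled or error:
            self._cqes.append({"wr_id": wr_id, "error": error})

    def poll(self):
        """Return the oldest Work Completion and free its entry, or None."""
        return self._cqes.popleft() if self._cqes else None
```

A successful Unsignaled WR therefore leaves no trace on the CQ; the Consumer learns of its completion only indirectly.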
A Completion Queue (CQ) MUST be the only mechanism used for the
retrieval of Work Completions.
A single CQ is used to hold CQEs from one or more Work Queues across
one or more Queue Pairs on the same RNIC. A CQ MAY have zero or more
Work Queue associations. Completion Queues MUST be able to service
Send Queues, Receive Queues or both. Work Queues from multiple QPs
MUST be able to be associated with a single CQ.
Completion Queues and Completion Queue Entries are internal to the
RNIC Interface, and are not directly accessible, nor is the format
directly visible, by Verb Consumers.
5.3.1 Creating a Completion Queue
Completion Queues MUST only be created through the RNIC Interface.
The RI MUST verify that the Consumer has specified the number of
CQEs the CQ should hold when creating a Completion Queue. The
Consumer should ensure that this value is the maximum number of
Completions the Consumer expects to be outstanding. The RNIC will
then create the CQ with at least the specified number of entries.
The number of entries allocated for the CQ by the RI MAY be greater
than the number requested. If the CQ can be created, the RI MUST
return the actual number of entries allocated for that CQ to the
Consumer. If the RI is unable to allocate at least as many entries
as the Consumer requested, an Immediate Error MUST be returned and
the CQ MUST NOT be created.
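The sizing contract for CQ creation (at least as many entries as requested, with the actual count returned, or an Immediate Error) can be sketched as follows. Rounding up to a power of two is only an assumed example of why an RI might allocate more than requested; the names create_cq, rnic_max_entries, and ImmediateError are hypothetical.

```python
class ImmediateError(Exception):
    """Stand-in for an Immediate Error returned by a Verb."""


def create_cq(requested_entries, rnic_max_entries):
    """Return the number of entries actually allocated for a new CQ."""
    if requested_entries < 1:
        raise ImmediateError("the number of CQEs must be specified")
    if requested_entries > rnic_max_entries:
        # The request cannot be met, so no CQ is created.
        raise ImmediateError("cannot allocate the requested number of entries")
    # Assumed hardware granularity: round up to the next power of two.
    rounded = 1 << (requested_entries - 1).bit_length()
    # Never exceed the limit; the result is still at least the request.
    return min(rounded, rnic_max_entries)
```

The Consumer should size its bookkeeping from the returned value rather than from the value it requested.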
The RI is NOT REQUIRED to perform CQ overflow detection or
protection. Therefore, the CQ overflow error codes in this document
are OPTIONAL. When an overflow occurs, the results are
indeterminate. Overflow of a CQ MUST NOT affect QPs which do not
report Work Completions to that CQ and MUST NOT affect other CQs.
Consequently, when creating the CQ, the Consumer should request
enough CQ entries so that if every possible
outstanding WR were to complete (such as may happen in an error
case), there would be room for the CQE on the CQ. The RI MUST NOT
enforce that every WQE on every Work Queue associated with the CQ
must have a CQE available for the WQE's Work Completion information.
If the Consumer wishes to have deterministic error behavior, at
Create/Modify QP, the sum of the maximum number of WQEs associated
with a single CQ should be less than or equal to the number of
entries in the CQ. A Consumer can size the CQ smaller, in which case
the error semantics of a CQ overflow are not deterministic, but
possible RNIC behavior includes overwriting previous CQEs in whole
or in part and thus may result in a data integrity issue.
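The sizing guideline for deterministic error behavior reduces to a simple sum. The sketch below assumes the Consumer knows the maximum number of WQEs of every Work Queue associated with the CQ; the function names are hypothetical.

```python
def required_cq_entries(work_queue_depths):
    """Smallest CQ size for which an overflow cannot occur."""
    return sum(work_queue_depths)


def overflow_possible(cq_entries, work_queue_depths):
    """True if the CQ is smaller than the sum of the Work Queue depths."""
    return cq_entries < required_cq_entries(work_queue_depths)
```

For example, a CQ serving two Send Queues of depth 64 and one Receive Queue of depth 128 needs 256 entries to rule out overflow.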
An additional consideration for sizing the CQ is QP Destruction. Any
outstanding WRs which were on a Work Queue when it is destroyed may
occupy entries on the associated CQ. For more information, see
Section 6.1.4 - Destroying a Queue Pair.
Detailed information on the accompanying Verb can be found in
Section 9.2.3.1 - Create CQ.
5.3.2 Querying Completion Queue Attributes
There are two Completion Queue attributes that can be queried
through the RI.
The first of these attributes is the maximum number of entries
allowed on the CQ. This attribute MUST be able to be retrieved
through the Query CQ Verb.
The other attribute is the Completion Event Handler Identifier,
which also MUST be able to be retrieved through the Query CQ Verb.
With one exception, the CQ Verbs do not expose which Work Queues are
associated with a CQ. The exception is that the QP ID is reported by
Poll CQ.
Detailed information on the accompanying Verb can be found in
Section 9.2.3.2 - Query CQ.
5.3.3 Modifying Completion Queue Attributes
An implementation MUST support resizing of a CQ through the RI while
WRs are outstanding. Work Completions MUST NOT be lost due to a CQ
resize. Resizing the CQ MUST NOT directly generate errors beyond
Resize CQ Verb Immediate Errors and must either succeed or fail
atomically. It is understood that resizing a CQ may adversely affect
performance, and MAY result in connection timeouts. Note that this
could ultimately result in the connection being torn down. If the
Consumer wishes to avoid any possibility of a connection being torn
down during the CQ resize operation, it should quiesce operations to
the Work Queues associated with the CQ before resizing the CQ. The
RI MUST NOT allow a CQ to be resized to a size that is smaller than
the number of CQEs currently on the CQ; if this is attempted, an
Immediate Error MUST be returned.
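A minimal sketch of the resize rules, assuming a toy in-memory CQ. The class names and the ImmediateError exception are hypothetical stand-ins, not the interface defined in Section 9.2.3.3.

```python
class ImmediateError(Exception):
    """Hypothetical stand-in for a Verb Immediate Error."""


class CompletionQueue:
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.cqes = []  # CQEs not yet polled by the Consumer

    def resize(self, new_max_entries):
        # Validate before changing anything, so a failed resize leaves
        # the CQ exactly as it was (succeed or fail atomically).
        if new_max_entries < len(self.cqes):
            raise ImmediateError("new size is smaller than the CQEs on the CQ")
        # Existing CQEs are kept, so no Work Completion is lost.
        self.max_entries = new_max_entries
```

Checking the new size before touching any state is what keeps the failure path free of side effects.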
Detailed information on the accompanying Verb can be found in
Section 9.2.3.3 - Modify CQ.
5.3.4 Destroying a Completion Queue
CQs MUST only be destroyed through the RI.
A CQ MUST NOT be destroyed if it is still associated with any Work
Queue. If this is attempted, the Verbs MUST return an Immediate
Error and not allow the CQ to be destroyed.
When the Destroy CQ Verb returns, the RI MUST have returned or
released any host resources allocated below the RNIC Interface on
behalf of the Consumer that are related to the specified CQ. After
the Destroy CQ Verb returns, the RI MUST NOT return any more Work
Completions that are associated with the destroyed CQ.
Detailed information on the accompanying Verb can be found in
Section 9.2.3.4 - Destroy CQ.
6 Queue Pairs
Queue Pairs (QP) are the RNIC resource used by Consumers to submit
operations to the RNIC. A QP consists of a pair of Work Queues (Send
and Receive) as well as a posting mechanism for each queue. The Send
Queue (SQ) and Receive Queue (RQ) are each Work Queues, in that the
Consumer posts Work Requests (WR) to them in order to get the RI to
perform operations. In addition, there are resources that make up
the QP with which the Consumer does not directly interact. These
include the Inbound RDMA Read Request Queue and the Work Queue
Elements (WQEs).
Work Queue Elements are the representation of Work Requests inside
of the RI, once the Work Requests have been posted to the QP.
An internal Inbound RDMA Read Request Queue (IRRQ) MUST be
associated with a Queue Pair when the QP is created or modified to
support greater than zero incoming RDMA Read Request Messages. The
IRRQ enqueues incoming RDMA Read Request Messages and processes them
in order, sending RDMA Read Response Messages as a result. The depth
of this queue MUST be specified when the QP is created and is set
with the IRD Input Modifier.
A QP is created by the RI at the request of a Consumer. The
resources required by the RI to create the Work Queues and enable
them to transmit and receive are allocated at this time. The memory
needed may be allocated from system memory, memory associated with
the RNIC, or any other resources accessible through the Verbs.
Certain QP attributes may be changed after QP creation. A Modify QP
Verb is provided to modify the attributes. The details of this Verb
are defined in Section 6.1.3 - Modifying Queue Pair Attributes.
The Consumer should instruct the RI to destroy a QP that is no
longer in use. The semantics for destruction of a QP are provided in
Section 6.1.4 - Destroying a Queue Pair.
The Verbs Post Send Queue Work Request (Section 9.3.1.1 PostSQ) and
Post Receive Queue Work Request (Section 9.3.1.2 PostRQ) provide a
posting mechanism for the Consumer to indicate to the RI that there
is work for the RI to perform and that there is a new WR,
represented within the RI by a WQE, on the Work Queue. Details of
Work Request handling are defined in Section 8 - Work Requests and
the WR Processing Model.
6.1 Queue Pair Resource Handling
6.1.1 Creating a Queue Pair
Queue Pairs are created through the RI. When a QP is created, the RI
MUST verify that the Consumer has specified a complete set of
initial attributes. The attributes that need to be defined when the
QP is created are specified in Section 9.2.5.1 - Create QP.
Two of the attributes that must be initialized when a QP is created
are the maximum number of Outstanding WRs on the SQ and the maximum
number of Outstanding WRs on the RQ. Each number represents the
maximum number of WRs which have been submitted but which have not
Completed at any given time. That is, each is the maximum depth of
the SQ or RQ, not the number of WRs on the Work Queue at the moment.
The RI MUST support Consumers specifying the maximum number of
outstanding WRs on the SQ and on the RQ and allow the maximum number
of outstanding WRs on the SQ to be different from that on the RQ.
The Consumer requests a maximum number of outstanding WRs on the SQ
and on the RQ. The RI MUST return the maximum number of outstanding
WRs allocated on the SQ and on the RQ, and each of these numbers MAY
be greater than the number requested. For information on determining
when WRs are completed, see Section 8.1.3.1 - Signaled Completions.
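As an illustration of why the Consumer should use the returned depths rather than the requested ones, the sketch below assumes a hypothetical RI that rounds each requested depth up to a power of two. That rounding policy is an assumption made for the example; this specification only says the allocated number MAY be greater than the number requested.

```python
def allocate_depth(requested):
    """Hypothetical RI policy: round the requested depth up to a power of two."""
    depth = 1
    while depth < requested:
        depth *= 2
    return depth


def create_qp(requested_sq, requested_rq):
    """Return the allocated (SQ, RQ) depths, each >= the value requested."""
    return allocate_depth(requested_sq), allocate_depth(requested_rq)


# The SQ and RQ depths are requested, and returned, independently.
sq_depth, rq_depth = create_qp(100, 20)
print(sq_depth, rq_depth)  # 128 32
```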
Note that if the QP uses an S-RQ for incoming Untagged Messages, the
maximum number of Outstanding WRs on the RQ is not needed.
Each Work Queue in a QP MUST be associated with one and only one CQ
when that QP is created.
Since both WQEs and CQEs are implemented below the RI and the
implementations are outside the scope of this specification, they
may be implemented using a variety of mechanisms, including in the
Local Host virtual memory address space. The RI MAY require that the
Work Queues be in the same memory space as the corresponding
Completion Queues or the creation of the QP will fail. Therefore the
Consumer should assume that the CQ & QP share the same address
space. If the RI detects that QP and CQ are inaccessible to each
other, creation of the QP MAY fail.
Other attributes that MUST be initialized when a QP is created are
whether or not this QP will support the Fast-Register Non-Shared
Memory Region operation and whether the QP supports an STag of zero.
These attributes must only be enabled on QPs used by Privileged Mode
Consumers. See Section 7.2.1 - STag of zero for an explanation of
the STag of Zero. For an explanation of the Fast-Register Non-Shared
Memory Region operation, see Section 7.3.2.5 - Fast-Register Non-
Shared Memory Region.
When a QP is created it MUST be associated with a PD. This is done
by specifying the PD ID as an Input Modifier to Create QP.
An attribute that MUST be created within the RI when the Consumer
invokes the Create QP Verb is a Queue Pair Identifier (QP ID). The
QP ID MUST be used by the RI to uniquely identify this QP within
this RNIC to the Consumer. The QP ID is used when trying to
determine if a Memory Window is Bound to the QP, as discussed in
Section 7.10.2 - Binding Memory Windows to Memory Regions. The QP ID
value MUST be returned as part of the Create QP, Query QP and Poll
CQ Verbs.
Create QP MUST NOT associate an LLP Connection or LLP Stream with
the QP. No data will flow until the QP is Associated with another QP
through an LLP Stream and the QP state is changed to RTS. For more
details, see Section 6.6.1 - Connection Initialization.
A QP can exist in one of several states. For the details of the QP
states, see Section 6.2 - Queue Pair Resource States. The following
list summarizes the valid QP states:
* Idle state - No LLP Stream is associated with the QP.
* RTS state - An LLP Stream is associated with the QP and normal
data transfer can occur.
* Closing state - An error free LLP Close has begun but has not
finished. It was initiated by either the Remote Peer or Local
Peer.
* Terminate state - An error occurred. A Terminate Message was
either sent or received, and the QP is waiting for either an LLP
Close or LLP Reset before automatically transitioning the QP to
the Error state.
* Error state - An error occurred. No LLP Stream is associated
with the QP. A Terminate Message will be available through
QueryQP if the QP transitioned through the Terminate state
before entering the Error state. If the transition was from the
Closing state to the Error state, a Terminate Message may be
available.
When the QP is created, it is initialized to the Idle state.
Detailed information on the accompanying Verb can be found in
Section 9.2.5.1 - Create QP.
6.1.2 Querying Queue Pair Attributes
Queue Pairs have attributes that can be retrieved through the Query
QP Verb. The RI MUST support the complete list of QP attributes as
described in Section 9.2.5.2 - Query QP.
6.1.3 Modifying Queue Pair Attributes
Certain QP attributes may be modified after the QP has been created.
If the Consumer invokes Modify QP without specifying all Required
Attributes as defined in Figure 4, the RI MUST NOT modify any of the
QP attributes and MUST return an Immediate Error. The RI MUST allow
the Consumer to request a change for the Allowed Additional
Attributes as described in Figure 4, for the QP state transitions
also shown in the figure. On Consumer request, the RI MAY change the
allowed Additional Attributes as described in Figure 5, for the QP
state transitions shown in the figure, if the RI indicates through
Query RNIC that the attribute in question is allowed to be changed.
The Modify QP Verb output modifiers can be used to determine if the
changes are actually made.
If any of the QP attributes requested to be modified are invalid or
the requested state transition is invalid, the RI MUST NOT modify
any of the QP attributes and an Immediate Error MUST be returned.
Note that the table is heavily dependent upon the QP state. For
further information on the QP state, see Section 6.2 - Queue Pair
Resource States.
Transition     Attributes that the      Attributes that the RI must
               Consumer is Required     Support and the Consumer may
               to Supply for the        Supply for the State
               State Transition         Transition
-------------  -----------------------  ----------------------------
Idle->Idle     Next state               ORD
Idle->RTS      Next state,              Stream message buffer,
               LLP Stream Handle        ORD
Idle->Error    Next state               None
RTS->RTS       Next state               ORD
(Footnote 1)
RTS->CLOSING   Next state               None
RTS->TERM      Next state               None
RTS->Error     Next state               None
Error->Idle    Next state               None
Figure 4 - Allowable QP Attribute Modifications
Footnote 1: Changing these parameters in RTS requires care to avoid
race conditions to prevent errors.
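The all-or-nothing behavior of Modify QP against Figure 4 can be sketched as follows. This is a simplified illustration: the attribute names, the dictionary representation of a QP, and the ImmediateError exception are hypothetical, and the optional attributes of Figure 5 are left out.

```python
class ImmediateError(Exception):
    """Hypothetical stand-in for a Verb Immediate Error."""


# Required attributes per transition, transcribed from Figure 4.
REQUIRED = {
    ("Idle", "Idle"): {"next_state"},
    ("Idle", "RTS"): {"next_state", "llp_stream_handle"},
    ("Idle", "Error"): {"next_state"},
    ("RTS", "RTS"): {"next_state"},
    ("RTS", "Closing"): {"next_state"},
    ("RTS", "Terminate"): {"next_state"},
    ("RTS", "Error"): {"next_state"},
    ("Error", "Idle"): {"next_state"},
}

# Additional attributes the RI must support, also from Figure 4.
ALLOWED_EXTRA = {
    ("Idle", "Idle"): {"ord"},
    ("Idle", "RTS"): {"stream_message_buffer", "ord"},
    ("RTS", "RTS"): {"ord"},
}


def modify_qp(qp, attrs):
    """Apply attrs to the qp dict, or raise ImmediateError and change nothing."""
    if "next_state" not in attrs:
        raise ImmediateError("next state is required")
    key = (qp["state"], attrs["next_state"])
    if key not in REQUIRED:
        raise ImmediateError("invalid state transition")
    supplied = set(attrs)
    missing = REQUIRED[key] - supplied
    unknown = supplied - REQUIRED[key] - ALLOWED_EXTRA.get(key, set())
    if missing or unknown:
        raise ImmediateError("missing or disallowed attributes")
    # All checks passed; only now is the QP modified.
    for name, value in attrs.items():
        if name == "next_state":
            qp["state"] = value
        else:
            qp[name] = value
```

Because every check runs before the first assignment, a rejected request leaves the QP attributes and state exactly as they were.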
Transition     Attributes that the RI Optionally Supports
               and the Consumer may Supply for the State
               Transition
-------------  --------------------------------------------
Idle->Idle     Max Number of SQ WQE,
               Max Number of RQ WQE (Footnote 2),
               IRD,
               QP's RQ Limit,
               QP's RQ Limit Armed
Idle->RTS      Max Number of SQ WQE,
               Max Number of RQ WQE (Footnote 2),
               IRD,
               QP's RQ Limit,
               QP's RQ Limit Armed
Idle->Error    None
RTS->RTS       Max Number of SQ WQE,
(Footnote 3)   Max Number of RQ WQE (Footnote 2),
               IRD
RTS->CLOSING   None
RTS->TERM      None
RTS->Error     None
Error->Idle    None
Figure 5 - Optional QP Attribute Modifications
Footnote 2: Note that changing the Max Number of RQ WQEs has no
effect if the QP uses an S-RQ.
Footnote 3: Changing these parameters in RTS requires care to avoid
race conditions to prevent errors.
It is possible to modify the QP attributes in Figure 4 and Figure 5
with Work Requests outstanding on the QP. Depending on the
modification, any Work Requests outstanding on the specified QP
might not execute properly when the attributes are changed.
An RNIC MAY allow the Consumer to change the maximum number of
outstanding WRs on the SQ and on the RQ. The RNIC MUST indicate to
the Consumer if it supports the ability to change the number of
outstanding WRs on a QP. If the RNIC supports it, it MUST allow the
number of outstanding WRs on both the SQ and the RQ to be changed
while WRs are still outstanding. In addition, the RI MUST support
the ability to change this on every QP if it indicates an ability to
change the outstanding number of WRs.
It is understood that changing the number of WRs that a Work Queue
may have outstanding may adversely affect performance. Resizing the
QP MUST NOT cause Immediate, Completion or Asynchronous Errors, with
the exception of Immediate Errors returned by the Modify Queue Pair
Verb and possible LLP time-outs. It is expected that the resize
operation MAY adversely affect the Associated QP attempting to
communicate with the Local QP during the resize operation in the
form of LLP time-outs and retries which could result in LLP Stream
teardown (which would result in an Asynchronous Error). It is
suggested that the Consumer only perform this resize operation when
activity on the connections has been quiesced to minimize the risk
of transitioning Associated QPs to the Error state as a result of
LLP time-outs.
If the number of requested outstanding WRs is smaller than the
actual number of outstanding WRs currently on the Work Queue(s),
then the modification of the QP MUST fail with an Immediate Error
and the QP MUST remain in the original state.
For information on performing a Modify QP and modifying the value of
IRD and/or ORD, see Section 6.5 - Outstanding RDMA Read Resource
Management.
When the Modify QP Verb completes, any state change requested MUST
have occurred or an Immediate Error MUST be returned, in which case
the QP state and accompanying modifier changes MUST remain as they
were prior to the Modify QP Verb being invoked.
The LLP Stream and the LLP Stream Message Buffer Input Modifiers for
Modify QP are covered in Section 6.6.1.
Detailed information on the accompanying Verb can be found in
Section 9.2.5.3 - Modify QP.
6.1.4 Destroying a Queue Pair
Queue Pairs MUST only be destroyed through the RNIC Interface.
Successful destruction of a QP MUST release all resources allocated
by the RI for the QP on behalf of the Consumer. The RI MUST have
destroyed the QP when the Destroy QP Verb has successfully
completed. If the LLP Stream is still associated with the QP, a
Destroy QP MUST include disassociating the LLP resources from the
QP, and MAY include an LLP Reset. After a Destroy QP finishes, the
QP ID will be immediately available for use on any subsequently
created QP. The QP will cease processing all WRs, and no additional
CQEs resulting from any outstanding WRs on this QP will be posted to
the CQ.
The RI MUST NOT allow a QP to be destroyed if there are still Memory
Windows Bound to the QP. If the Consumer attempts to destroy a QP
with Memory Windows Bound to the QP, an Immediate Error MUST be
returned by the RI.
The RI MUST allow the Destroy QP Verb to succeed regardless of the
QP's state, provided there are no MWs Bound to it. For more
information on the resource destruction and deallocation sequence,
see Section 5.1.1.2 - Expected Destruction Sequence.
It is RECOMMENDED that before a Consumer attempts to destroy a Queue
Pair, it should cleanly complete all outstanding Work Requests and
invalidate all Memory Windows which are Bound to the QP. It is
recommended that ULPs and Consumers provide a graceful termination
mechanism, return all Advertised STags to a known state, submit WRs
to Invalidate all outstanding Memory Windows and then move through
the Closing state. The Consumer should then retrieve all outstanding
Work Completions through the CQ(s) associated with the QP's SQ & RQ.
Only then should the Consumer destroy the QP.
A QP is allowed to have Work Requests outstanding on both Work
Queues when a request to destroy the QP is made.
Any outstanding WRs posted to the QP but not yet processed by the RI
MAY result in CQEs that MAY be retrievable by the Consumer. Note
that even in the case where CQEs were generated it might not be
possible for the Consumer to retrieve them after the QP has been
destroyed. Since it is implementation dependent as to whether CQEs
are consumed for outstanding WRs on a QP after that QP is destroyed,
for the purposes of CQ overflow prevention, the Consumer should
consider each outstanding WR to have consumed an entry in the CQ.
There are three ways to free the CQE consumed within the CQ. Any
method is acceptable and they are not mutually exclusive. The three
methods are:
* the Consumer polls the CQ (See Section 9.3.2.1 - Poll for
Completion (Poll CQ)) until the CQ is empty, or
* the Consumer retrieves a WC for a WR which was submitted to a
Work Queue associated with the same CQ and that WR was submitted
after the previous QP was destroyed, or
* the Consumer polls (See Section 9.3.2.1 - Poll for Completion
(Poll CQ)) a number of Work Completions equal to the total number
of entries that the CQ can hold.
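One way a Consumer might keep this accounting is sketched below. The CqBudget class and its method names are hypothetical, and the "epoch" tag recorded with each posted WR is an assumption of the example rather than something the Verbs provide.

```python
class CqBudget:
    """Consumer-side count of CQ entries possibly held by destroyed QPs."""

    def __init__(self, cq_entries):
        self.cq_entries = cq_entries
        self.phantom = 0       # entries assumed consumed by destroyed QPs
        self.polled_since = 0  # WCs polled since the last QP destruction
        self.epoch = 0         # incremented on every QP destruction

    def qp_destroyed(self, outstanding_wrs):
        # Treat every outstanding WR as having consumed a CQ entry.
        self.phantom += outstanding_wrs
        self.polled_since = 0
        self.epoch += 1

    def post_epoch(self):
        """Tag to save with each WR posted, so its WC can be recognized."""
        return self.epoch

    def polled(self, wc_epoch):
        """Call for each WC retrieved; wc_epoch is the tag saved at post time."""
        self.polled_since += 1
        # Method 2: the WR was submitted after the last destruction.
        # Method 3: as many WCs polled as the CQ can hold.
        if wc_epoch == self.epoch or self.polled_since >= self.cq_entries:
            self.phantom = 0

    def cq_empty(self):
        """Method 1: call when Poll CQ reports that the CQ is empty."""
        self.phantom = 0
```

While phantom is non-zero, the Consumer would subtract it from the CQ capacity it considers available when sizing or posting to other Work Queues on the same CQ.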
Detailed information on the accompanying Verb can be found in
Section 9.2.5.4 - Destroy QP.
6.2 Queue Pair Resource States
The RI MUST restrict the QP to only be in one of the five Resource
States (or just "states") as shown in Figure 6. The RI MUST NOT
support transitions between QP states that are not shown in Figure
6.
During any state in which iWARP processing is done, it is possible
for errors to be detected by the RNIC. When this occurs, the QP
state will eventually transition to the Error state.
State transitions must only be initiated by the Modify QP Verb,
except where otherwise explicitly stated in the state descriptions.
Creation of a QP causes the QP to enter the state diagram in the
Idle state. Destruction of a QP causes the QP to exit the state
diagram.
Below, in Figure 6, is the QP State diagram. It shows the five QP
states and the allowed transitions between states as well as the
events and methods which cause those transitions. The individual
states and transitions are described in the following sections in
detail.
< Figure 6 did not convert properly from source >
< to be corrected in an upcoming version >
Figure 6 - QP State Diagram
6.2.1 Idle State
The QP MUST be in the Idle state following QP creation or when moved
to this state with Modify QP. In this state, Send or Receive WRs MAY
be posted but they MUST NOT be processed and CQEs MUST NOT be
generated.
Note that whether or not the Consumer posts WRs to the Send Queue
when the QP is in the Idle state depends on the method chosen for
connection initialization (see Section 6.6.1 - Connection
Initialization).
While in the Idle state the RI MUST NOT associate an LLP stream to
the QP.
The RI MUST return an Immediate Error if the Consumer attempts to
transition the QP from the Idle state to the Terminate state or to
the Closing state.
A short summary table describing the state changes for Idle state is
shown in Figure 7. The following are detailed descriptions of those
changes.
Note that under certain conditions the Consumer might be required to
flush Work Requests from a prior RDMAP Stream when in the Idle
state. This can be done by transitioning the QP from the Idle to
Error state (the Error state flushes all WRs) and then back to the
Idle state. This may be necessary when the Idle state is reached
automatically (i.e. no Consumer intervention) from the RTS state at
the Local Peer, which will occur if:
* the QP is currently in the RTS state, and the Consumer is
actively posting Work Requests (PostSQ or PostRQ),
* the Remote Peer initiates an LLP Close (e.g. for TCP, it
generates a FIN segment),
* the Local RI receives the LLP Close request, and immediately
transitions to Closing state,
* the RI automatically creates an LLP Close acknowledgement (e.g.
for TCP, it generates a FIN ACK segment), thus finishing the LLP
Close from the Local Peer's perspective,
* the RI flushes all WRs, and no errors occurred during the LLP
Close or flush,
* the RI automatically transitions the QP to the Idle state,
* the Consumer is not aware of the transition to Idle, and posts a
Work Request thinking it can still transmit or receive data.
Note that a normal close should only be done by a ULP after an end-
to-end synchronization to ensure all outstanding Work Requests have
been flushed end-to-end. This is because RDMAP does not provide a
graceful close. Thus if the Consumer performed a PostSQ, it is an
error made by the Consumer. However, if the ULP posted an extra
PostRQ buffer, it is arguable whether this is an error made by the
Consumer or not. In either case, to recover the resources before
reusing the QP, the Consumer should cause the QP to transition to
Error state to flush the WQEs on the SQ and RQ, and then transition
the QP back to the Idle state.
6.2.1.1 Idle to Idle
The Modify QP Verb MUST allow a transition of the QP from the Idle
state to the Idle state. This is to allow certain Queue Pair Context
attributes to be modified in this state before an association with a
Remote Peer's QP has been established.
6.2.1.2 Idle to RTS
The Modify QP Verb MUST allow a transition from the Idle state to
the RTS state. This is to support LLP Stream establishment. For this
transition, the Modify QP Verb requires an LLP Stream Handle, and
allows a Stream Message Buffer as well as other Input Modifiers. In
order to transition from Idle to RTS, the LLP must be in its
"Established" state, able to send and receive data. If not, the
Modify QP Verb MUST return an Immediate Error. For more details on
LLP Stream establishment, see Section 6.6.1 - Connection
Initialization.
The RI performs the following actions in the Idle to RTS transition,
which MAY be performed in order:
1. The RI resets the RDMAP, DDP and MPA layers to the initial
conditions specified in the appropriate specifications. For
example, the DDP Untagged Message Sequence Numbers (MSN) for the
Receive queue & IRRQ, and the MPA marker position must be reset
as described in [RDMAP], [DDP], and [MPA].
2. If the Modify QP Verb includes a Stream Message Buffer to send,
it is RECOMMENDED that the RI performs the following list in
order:
1. The implementation should stop receiving messages from the LLP
Stream.
2. The RI should transmit the specified message buffer to the
Remote Peer in streaming mode.
3. The RI should associate the LLP Stream with the RDMAP, DDP,
and MPA layers and the RI should enable the QP to receive and
transmit iWARP messages.
4. The implementation should resume receiving messages from the
LLP Stream.
3. If the Modify QP Verb does not include a Stream Message Buffer
to send, the RI should associate the LLP Stream with the RDMAP,
DDP, and MPA layers and the RI should enable the QP to receive
and transmit iWARP messages.
4. The RI moves the QP to the RTS state and begins normal
operation.
The RI MAY implement the Verb in other ways, but the end result
MUST:
1. Associate RDMAP and Lower layers with the QP;
2. While in streaming mode, transmit any Stream Message Buffer that
was included in the Modify QP;
3. Ensure that the QP enables reception and transmission of iWARP
messages; and,
4. Ensure that no messages are lost, regardless of how quickly the
remote side returns the first iWARP message.
For example, if the Verb did not stop the LLP receive side, the
following race condition MUST be handled properly:
1. The Associated QP transitions to RTS,
2. It begins transmitting RDMA packets,
3. Then the rapid arrival of an iWARP message from the Remote Peer
occurs while the Local Peer is transitioning, but has not
completed the transition, to the RTS state.
Note that the Modify QP Idle to RTS transition that includes a
Stream Message Buffer to send may take a significant amount of time
to complete. This is due to the requirement to reliably transmit the
stream message.
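The recommended ordering for the Idle to RTS transition can be sketched with a toy model. The classes below are hypothetical; in particular, the model assumes that data arriving while receive is stopped stays queued in the LLP Stream, which is how the example avoids losing an early iWARP message.

```python
from collections import deque


class LlpStream:
    """Toy LLP Stream: data arriving while receive is stopped stays queued."""

    def __init__(self):
        self.inbound = deque()  # data received from the Remote Peer
        self.sent = []          # data transmitted to the Remote Peer
        self.receiving = True


class QueuePair:
    def __init__(self):
        self.state = "Idle"
        self.stream = None
        self.delivered = []     # iWARP messages handed to the QP

    def idle_to_rts(self, stream, stream_message=None):
        if self.state != "Idle":
            raise RuntimeError("QP is not in the Idle state")
        # Step 1: reset RDMAP, DDP and MPA state (modeled as a clear).
        self.delivered.clear()
        if stream_message is not None:
            stream.receiving = False            # 2.1 stop receiving
            stream.sent.append(stream_message)  # 2.2 send in streaming mode
            self.stream = stream                # 2.3 associate, enable iWARP
            stream.receiving = True             # 2.4 resume receiving
        else:
            self.stream = stream                # 3. associate, enable iWARP
        self.state = "RTS"                      # 4. begin normal operation
        self.drain()

    def drain(self):
        """Process queued inbound data once the QP is in RTS."""
        while (self.state == "RTS" and self.stream.receiving
               and self.stream.inbound):
            self.delivered.append(self.stream.inbound.popleft())
```

An iWARP message that is already queued on the stream when the transition runs is delivered only after the QP reaches RTS, so it is neither dropped nor interpreted as streaming-mode data.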
6.2.1.3 Idle to Error
The Modify QP Verb MUST allow the Consumer to modify the QP from the
Idle state to the Error state.
If it becomes necessary to remove WQEs posted to the queues in the
Idle state, the Consumer may Modify the QP to the Error state, and
then back to Idle. Any WQEs on the SQ & RQ will be Completed with a
Flushed status by this procedure. This procedure will not change the
Completion Status of CQEs already Completed on the CQ. The Consumer
can then Poll for Completion on the Completion Queue and examine the
Completion Status to determine which WRs were flushed.
Note there is no effect on the LLP since no LLP Stream has been
associated with the QP at this point.
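The flush procedure can be sketched as follows, assuming a QP modeled as a dictionary and a CQ modeled as a list of (WR identifier, status) pairs. Both representations and the function name are hypothetical.

```python
def flush_idle_qp(qp, cq):
    """Idle -> Error -> Idle; return the WR ids completed as Flushed."""
    if qp["state"] != "Idle":
        raise ValueError("QP is not in the Idle state")
    qp["state"] = "Error"            # Modify QP -> Error flushes all WQEs
    flushed = qp["sq"] + qp["rq"]    # new list, unaffected by clear() below
    # Flushed WQEs are appended; CQEs already on the CQ keep their status.
    cq.extend((wr_id, "Flushed") for wr_id in flushed)
    qp["sq"].clear()
    qp["rq"].clear()
    qp["state"] = "Idle"             # Modify QP -> Idle
    return flushed


cq = [("wr0", "Success")]
qp = {"state": "Idle", "sq": ["wr1"], "rq": ["wr2", "wr3"]}
print(flush_idle_qp(qp, cq))  # ['wr1', 'wr2', 'wr3']
```

After the call, the Consumer would poll the CQ and use the Flushed status to tell which WRs never executed.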
Event                         Action                         Next
                                                             State
----------------------------  -----------------------------  -----
PostSQ, PostRQ                Enqueue WQE                    Idle
WQE is present on or added    WQE is NOT processed           Idle
to the tail of the SQ
Modify QP->Idle (Footnote 4)                                 Idle
Modify QP->RTS and Stream     Reset RDMAP and Lower layers   RTS
Message Buffer included       to their initial conditions.
                              Associate RDMAP and Lower
                              layers with QP. Transmit the
                              specified Stream Message
                              buffer in Streaming mode and
                              enable iWARP mode as
                              described in 6.2.1.2.
Modify QP->RTS with NO        Reset RDMAP and lower layers   RTS
Stream Message Buffer         to their initial conditions.
included                      Associate RDMAP and lower
                              layers with QP. Enable iWARP
                              mode.
Modify QP->Error                                             Error
PostSQ/PostRQ error           Return an Immediate Error      Idle
Modify QP, results in error   Return an Immediate Error      Idle
Figure 7 - Idle State summary
Footnote 4: This transition allows changing QP parameters as defined
in Figure 4.
6.2.2 RTS (Ready to Send) State
The RTS state is the main operational state for iWARP operation. All
normal message processing, both incoming and outgoing, occurs in
this state.
The QP MUST be in the RTS state to begin transmitting and receiving
any messages. Prior to moving to this state, the LLP Connection &
LLP Stream MUST be fully established.
Once in this state, any WQEs already posted on the Send Queue will
begin processing. Any new WQEs posted MUST be added to the tail of
the queue, (and begin processing, if the queue is empty). Once in
this state, valid incoming iWARP Messages MUST be processed, placed
and Completed. In this state, posted Receive WRs will be added to
the Receive Queue (or S-RQ), processed when a Send Operation Type
arrives, and Completed as described in Section 8.2.4 - Completed
Work Requests.
The RTS state MAY be left automatically by any of a variety of
processing Errors, which will cause a transition to either the
Terminate or Error state. See Section 8.3 - Error Handling for
details on which errors result in transitioning to which state.
The RI MUST return an Immediate Error if the Consumer attempts to
transition the QP from the RTS state to the Idle state.
A short summary table describing the state changes for RTS state is
shown in Figure 8. Following are detailed descriptions of those
changes.
6.2.2.1 RTS to RTS
The Modify QP Verb MUST allow the Consumer to modify the QP from the
RTS state to the RTS state. This allows certain QP parameters to be
changed while the QP is Associated with another QP through an LLP
Stream.
Among the parameters that MAY be changed are IRD, ORD, and the
maximum number of WQEs supported by the SQ or RQ. A Consumer should
take care when making changes to these parameters in order to
prevent potential race conditions between the Modify operation, the
posting of operations on the Send and Receive Queue, and incoming
messages. For example, reducing the size of the Send or Receive
Queue can only be done when there are fewer WQEs present on the
queue than the new size. It is the responsibility of the Consumer to
track the number of outstanding WRs on the SQ and RQ if it intends
to modify the size of the SQ or the RQ. For IRD and ORD details, see
Section 6.5 - Outstanding RDMA Read Resource Management.
6.2.2.2 RTS to Closing
If the Remote Peer begins an LLP Close operation that does not
include a Terminate Message (e.g. for TCP a FIN was received), the
RI MUST cause the QP to leave the RTS state automatically. If all
Send Queue Work Requests and Remote RDMA Read Operations (i.e.
incoming RDMA Read Request Messages and associated RDMA Read
Response Messages) are completed, the QP MUST transition to the
Closing state; if this is not true, or a Terminate Message was
received, the QP MUST transition to the Terminate state (see
following section). In all of the above cases the RI MUST create an
Affiliated Asynchronous Event to report the transition.
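The automatic choice between the Closing and Terminate states can be sketched as a single decision function. The function and parameter names are hypothetical.

```python
def on_remote_llp_close(sq_outstanding, remote_reads_outstanding,
                        terminate_received=False):
    """Next QP state when the Remote Peer starts an LLP Close while in RTS.

    sq_outstanding: number of Send Queue WRs not yet completed.
    remote_reads_outstanding: number of Remote RDMA Read Operations
    (incoming RDMA Read Requests and their Responses) not yet finished.
    """
    if terminate_received or sq_outstanding or remote_reads_outstanding:
        return "Terminate"
    return "Closing"


print(on_remote_llp_close(0, 0))  # Closing
print(on_remote_llp_close(2, 0))  # Terminate
```

In either outcome the RI also creates an Affiliated Asynchronous Event to report the transition; the sketch leaves that out.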
The Modify QP Verb MUST allow the Consumer to modify the QP from the
RTS state to the Closing state, to begin an LLP Close operation
(e.g. for TCP a FIN segment is generated), and MUST NOT generate an
Affiliated Asynchronous Event. See Section 6.6.2.1 - Normal Close
for more details. When doing a Modify QP to Closing, all Send Queue
Work Requests should have been previously Completed, any Remote RDMA
Read Operations should have been previously finished, and the
Consumer should have stopped posting PostSQ operations, so that no
work remains for the QP to do. If this is not the case, the RI MUST
ensure that one of the following actions is taken:
* The Modify QP MAY cause a transition to the Closing state which
is immediately followed by a transition to the Error state (due
to the SQ being non-empty).
* The Modify QP MAY cause a transition to the Closing state
followed by a transition to the Idle state (because the SQ was
originally empty, the LLP Close completed, causing the
transition to the Idle state, and yet the Consumer was still
posting SQ operations).
If this Modify QP Verb completes without error, the QP has
successfully transitioned to the Closing state (although it may have
already transitioned out of the Closing state).
6.2.2.3 RTS to Terminate
The Modify QP Verb MUST allow the Consumer to modify the QP from the
RTS state to the Terminate state. This enables the Consumer to
inform the Remote Peer that an Abnormal ULP Termination of the
connected stream is being done. The Modify QP will result in the
Error Code subfield of the Terminate Control Field of the Terminate
Message (See [RDMAP]) having a value of 0x0000: Local Catastrophic
Error. The Terminate Buffer will then be available to the Local node
via Query QP and to the Remote Peer through Query QP (provided the
Terminate Message arrives at and is processed by the Remote Peer).
When this Verb completes, the QP is in the Terminate state. For more
details, see 6.6.2.2 - ULP Initiated Termination.
The RTS to Terminate state transition MUST occur automatically
following: a locally detected error; a Remote Peer beginning an LLP
Close (e.g. for TCP a FIN was received) with either local Send Queue
WQEs incomplete, or local Remote RDMA Read Operations incomplete;
operation error; or any other error that would cause the RI to
generate a Terminate Message. If the transition to the Terminate
state is due to other locally detected errors, the RI MUST create
the appropriate Asynchronous Error Event reporting that error. See
Section 8.3.3 - Asynchronous Errors.
The WR, if any, which caused the QP to enter into the Terminate
state MUST be completed with the correct Completion Error Code for
the error through the CQ associated with the WQ that experienced the
error.
If a remote Terminate Message is received, the Terminate state MUST
be automatically entered and an Asynchronous Error Event MUST be
reported with a status of "Termination Message Received". In this
case, the RI MUST NOT send a Terminate Message back to the Remote
Peer. Note that if TCP is the LLP, depending upon implementation of
LLP Close, the RI may immediately transition to the Error state or
it may wait for a TCP ACK before the transition.
6.2.2.4 RTS to Error
The Modify QP Verb MUST allow the Consumer to modify the QP from the
RTS state to the Error state. This enables the Consumer to perform
an Abnormal ULP initiated Abortive Teardown (for more details, see
Section 6.6.2.3 - ULP Initiated Abortive Teardown).
An LLP failure that prevents further transmissions will also cause
the RTS to Error transition.
When the QP transitions from the RTS state to the Error state, the
LLP stream MUST NOT be associated with the QP.
The following are done prior to entering the Error state:
* The RI MUST stop processing SQ WRs, Remote RDMA Read Operations,
and any incoming iWARP Segments targeting the QP. See Section
6.4 - Stopping QP processing and Sending the Terminate Message
for additional information.
* If the LLP Stream has not closed, an LLP Reset MUST occur.
* The LLP Stream resources MUST no longer be associated with the
QP once the LLP actions, if any, are taken.
* If this transition is due to a failure of the LLP, the RI MUST
create an Asynchronous Error event reporting the error.
When the prior items complete, the QP MUST be transitioned to the
Error state.
Event                         Action                         Next State
----------------------------  -----------------------------  ----------
PostSQ, PostRQ                Enqueue WQE                    RTS

Valid iWARP Segment Arrives   Process Segment                RTS

WQE is present on or added    Process WQE(s) and send        RTS
to Send Queue                 data (as necessary)

Modify QP->Closing            Begin LLP Graceful Close       Closing

Modify QP->RTS                Modify QP parameters as        RTS
                              documented in Section 6.5

Modify QP->Error              Stop QP processing, LLP        Error
                              Reset & LLP Disassociated

Modify QP->Terminate          Generate Terminate Message     Terminate

PostSQ/PostRQ error           Return an Immediate Error      RTS

Modify QP, resulting in an    Return an Immediate Error      RTS
Immediate Error

LLP Failure that prevents     Stop QP processing, LLP        Error
transmission of the           Reset, LLP Disassociated,
Terminate Message             Create Asynchronous Error

LLP Failure that allows       Generate Terminate Message     Terminate
transmission of the
Terminate Message

Local incoming RDMA Message   Generate Terminate Message     Terminate
processing error (RDMA Read
Request, RDMA Read Response,
or RDMA Write handling)

Local incoming Send Type      Generate Terminate Message     Terminate
Message Processing Error

Local WQ processing error     Complete WR as necessary,      Terminate
                              Generate Terminate Message

Received Terminate Message                                   Terminate

LLP Close Received AND (SQ    Generate Terminate Message     Terminate
NOT empty OR IRRQ NOT empty)

LLP Close Received AND SQ                                    Closing
empty AND IRRQ empty

Figure 8 - RTS State summary
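As an informal illustration only (it is not part of the Verbs
definition), the summary in Figure 8 can be read as a lookup from
event to next state. The event labels and function names below are
hypothetical names chosen for this sketch; only the state names come
from the text.

```python
# Informal sketch of the RTS rows in Figure 8. The event labels are
# hypothetical names invented for this example, not Verbs identifiers.
RTS_NEXT_STATE = {
    "post_wr": "RTS",
    "valid_segment": "RTS",
    "modify_qp_closing": "Closing",
    "modify_qp_rts": "RTS",
    "modify_qp_error": "Error",
    "modify_qp_terminate": "Terminate",
    "immediate_error": "RTS",
    "llp_failure_no_terminate": "Error",
    "llp_failure_terminate_possible": "Terminate",
    "local_processing_error": "Terminate",
    "terminate_received": "Terminate",
}


def rts_next_state(event: str) -> str:
    """Look up the next state for an unconditional RTS event."""
    return RTS_NEXT_STATE[event]


def rts_on_llp_close(sq_empty: bool, irrq_empty: bool) -> str:
    """Next state when the Remote Peer begins an LLP Close in RTS."""
    # A graceful close is only possible when nothing is outstanding.
    return "Closing" if sq_empty and irrq_empty else "Terminate"
```

The two conditional LLP Close rows are kept in a separate function
because their outcome depends on queue contents rather than on the
event alone.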
6.2.3 Terminate State
The Terminate state is used to send the final Terminate Message and
begin an LLP Close if an error has occurred, or as a staging ground
to perform an LLP Close if a Terminate Message was received from the
Remote Peer. This state is transitory. The duration is limited by
the time to finish the LLP Close operation or a final timeout in LLP
Close (which would cause an LLP Reset).
When the Terminate state is exited to the Error state, the LLP
Stream MUST no longer be associated with the QP and the LLP Stream
MUST be in either a condition of LLP Closed or LLP Reset.
It is possible to examine the Terminate Message buffer while in this
state by using Query QP (Section 9.2.5.2) to retrieve the Terminate
Message.
A short summary table describing the state changes for the Terminate
state is shown in Figure 9. The following are detailed descriptions
of those changes.
While in the Terminate state, the following are done:
* The RI MUST stop processing SQ WRs, Remote RDMA Read Operations
and any new incoming iWARP Segments targeting the QP. For
additional information, see Section 6.4 - Stopping QP processing
and Sending the Terminate Message.
* The RNIC MUST attempt to send the RDMAP Terminate Message,
indicating the cause of error, except when the Terminate state
is entered due to reception of a remote Terminate Message. Note
that sending the Terminate Message may not be successful if an
LLP Reset occurs.
* The RI MUST begin an LLP Close operation.
* If the current stream is the last (or only) active LLP Stream on
the LLP Connection, or the LLP is in a state where all streams
are unable to operate, the LLP Close MUST cause the LLP
Connection to be closed. (For example, in [TCP] the FIN is sent
and the close sequence is done.)
* If an LLP error occurs during the sending of the Terminate
Message (including reception of an incoming LLP Reset, between
the time the Terminate state is entered and the LLP Close
sequence is completed), or due to an LLP final timeout while the
LLP Close operation is not finished, then an LLP Reset MUST
occur and its resources MUST no longer be associated with the
QP. Note that the LLP MUST use a timeout to detect errors, so
that the QP is in the Terminate state for a bounded time.
* At some point in the Terminate state, the RI MUST begin to
return an Immediate Error for any attempt to post a WR to a Work
Queue; prior to that point, WQEs MUST be enqueued (and
eventually flushed) or result in an Immediate Error.
* The RI MAY begin to flush any incomplete WRs on the SQ or RQ.
See Section 6.2.4 - Error State for further requirements about
flushing incomplete WRs.
* When the prior actions are done:
1. If the transition to the Terminate state is due to the
Modify QP Verb, the RI MUST NOT create an Asynchronous Error
Event reporting "Error State Entered". If the transition to
the Terminate state is due to the Modify QP Verb, but an LLP
error occurred while in the Terminate state, then the RI
MUST generate an Asynchronous Error reporting "Bad Close".
2. If the transition to the Terminate state is due to an error
that is reported in a Work Completion, the RI MUST NOT
create an Asynchronous Error. See Section 8.3.2 - Work
Completion Errors. If the transition to the Terminate state
is due to an error that is reported in a Completion, but an
LLP error occurred while in the Terminate state, then the RI
MUST generate an Asynchronous Error reporting "Bad Close".
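The two numbered cases above can be sketched as a small decision
function. This is a minimal illustration under stated assumptions:
the cause labels "modify_qp" and "work_completion" and the function
name are hypothetical, and causes other than these two are outside
the scope of the sketch.

```python
from typing import Optional


def terminate_exit_async_event(cause: str, llp_error: bool) -> Optional[str]:
    """Asynchronous Error (if any) created for the two cases above.

    `cause` is a hypothetical label: "modify_qp" when the Terminate
    state was entered through the Modify QP Verb, "work_completion"
    when the error is reported in a Work Completion.
    """
    if cause not in ("modify_qp", "work_completion"):
        raise ValueError("cause not covered by this sketch")
    # In both cases only an LLP error during Terminate produces an event.
    return "Bad Close" if llp_error else None
```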
When the actions listed above are complete, and the LLP Close is
finished, the QP state MUST move automatically to the Error state.
When the LLP Close is finished or an LLP Reset occurs, the RI MUST
disassociate the QP from the LLP Stream, including any LLP Stream
context and any resources associated with it. Disassociating the LLP
Stream from the QP means that it becomes possible for the QP to be
transitioned to Idle and to RTS with a new LLP Stream.
Any attempt to perform a Modify QP in the Terminate state MUST
return an Immediate Error.
Event                         Action                         Next State
----------------------------  -----------------------------  ----------
On entry                      Stop QP processing,            Terminate
                              Send & attempt to complete
                              Terminate Message if one
                              wasn't received.
                              LLP Close Initiated

LLP Close complete            Create Asynchronous Event      Error
                              if necessary,
                              LLP Disassociated from QP

LLP Failure that prevents     LLP Reset and create           Error
transmission of the           Asynchronous Event if
Terminate Message             necessary,
                              LLP Disassociated from QP

Valid iWARP Segment Arrives   Ignore Segment                 Terminate

PostSQ/PostRQ error           Return an Immediate Error      Terminate

Modify QP                     Return an Immediate Error      Terminate

WQE is present on or added    WQE is NOT processed and is    Terminate
to a Work Queue               eventually flushed.

Figure 9 - Terminate State summary
6.2.4 Error State
The Error state provides an indication that the QP has experienced
an error (or transitioned to the Error state through the use of a
Modify QP) and has stopped operations. On entry to the Error state,
the LLP Stream MUST NOT be associated with the QP.
The RI MUST return an Immediate Error if the Consumer attempts to
transition the QP from the Error state to the RTS, Terminate, or
Closing state.
The following is done on entry into the Error state:
* The RI MUST flush any incomplete WRs on the SQ or RQ. All WQEs
on the SQ and RQ, except for the WQE that caused the error (if
any), MUST be returned with the Flushed Error Completion Status
through the Completion Queue associated with the WQ. Note that
the WQE which caused the error may not be at the head of the
Work Queue. The Consumer should expect in some cases to retrieve
Work Completions with the Flushed Error Completion Status, as
well as potential successful completions, before retrieving the
WC for the WR which caused the error. The RI MUST NOT return
more than one Work Completion with a Work Completion Status set
to something other than the Flushed Completion Status or the
Success Completion Status.
* At some point in the execution of the flushing operation, the RI
MUST begin to return an Immediate Error for any attempt to post
a WR to a Work Queue; prior to that point, any WQEs posted to a
Work Queue MUST be enqueued and then flushed as described above
(e.g. when the PostSQ is done in Non-Privileged Mode and the
Non-Privileged Mode portion of the RI has not yet been informed
that the QP is in the Error state).
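The flushing rule in the first bullet above can be illustrated with
a minimal sketch. The function name, the use of plain identifiers for
WQEs, and the status strings are assumptions made for this example;
they are not defined by the Verbs.

```python
def flush_work_queue(wqes, failed_id=None, failed_status="Error"):
    """Return (wr_id, status) completions for WQEs left on a Work Queue.

    Every WQE is completed as "Flushed" except the one that caused the
    error, which keeps its own status. Names and status strings are
    illustrative only.
    """
    completions = []
    for wr_id in wqes:
        if wr_id == failed_id:
            completions.append((wr_id, failed_status))
        else:
            completions.append((wr_id, "Flushed"))
    return completions
```

In this sketch the completions are produced in queue order, so a
Consumer polling them may see Flushed completions before the
completion for the WR that caused the error, as the text notes.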
If a Terminate Message was sent or received, the RI MUST allow the
Consumer to retrieve it through the Query QP Verb (Section 9.2.5.2).
Following entry to the Error state, and before Destroying the QP or
restarting the QP by going through Idle to RTS, it may be necessary
to clean up some of the resources associated with the QP.
* Work Completions should be reaped by using Poll for Completion
(Poll CQ) (see Section 9.3.2.1) before destroying the QP,
otherwise they may become inaccessible.
* Memory Window resources MUST be deallocated by using Deallocate
STag (see Section 9.2.6.4). This is necessary since in the Valid
state they are associated with the QP. QP destruction will fail
when Memory Windows which are in the Valid state are still Bound
to the QP.
* Memory Regions can be invalidated by posting an Invalidate Local
STag WR to other SQs in the same PD, or they can be deallocated
by using Deallocate STag. If left in the Valid state, the
associated memory may be at risk of unexpected remote access.
If the QP is transitioning to the Error state, or has not yet
finished flushing the Work Queues, a Modify QP request to transition
to the Idle state MUST fail with an Immediate Error. If none of the
prior conditions are true, a Modify QP to the Idle state MUST take
the QP to the Idle state. No other state transitions out of Error
are supported. Any attempt to transition the QP to a state other
than Idle MUST result in an Immediate Error.
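The rule for leaving the Error state can be sketched as follows. The
exception class, function name, and parameter names are hypothetical
and exist only to make the conditions explicit.

```python
class ImmediateError(Exception):
    """Stands in for an Immediate Error returned by a Verb."""


def modify_qp_from_error(target, entering_error, flush_pending):
    """Return the new QP state for a Modify QP issued in the Error state.

    `entering_error` is True while the QP is still transitioning to
    Error; `flush_pending` is True until the Work Queues are flushed.
    """
    if target != "Idle":
        raise ImmediateError("only Error -> Idle is supported")
    if entering_error or flush_pending:
        raise ImmediateError("QP is still entering Error or flushing")
    return "Idle"
```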
A short summary table describing the state changes for Error state
is shown in Figure 10.
Event                         Action                         Next State
----------------------------  -----------------------------  ----------
On Entry                      Flush any incomplete WQEs

Modify QP->Idle                                              Idle
(no outstanding WRs and
not in transition to Error)

Modify QP->Idle               Return an Immediate Error      Error
(outstanding WRs or
in transition to Error)

Post WR                       Post WQE, and then Flush it,   Error
                              OR
                              Return an Immediate Error

Modify QP, resulting in an    Return an Immediate Error      Error
error

Figure 10 - Error State summary
6.2.5 Closing State
This state is used to wait for the LLP to complete the LLP Close, if
no errors occurred. For some LLPs or some RI implementations, moving
a QP from the RTS state to the Idle state can require an end-to-end
acknowledgement or require the Remote Peer to close their half of
the LLP Stream before the LLP Close is finished. This may take a
significant amount of time. Thus the Closing state is provided so
that these operations are done in a fashion that is visible to the
Consumer. Note that some RI implementations may require the LLP
Stream to be completely closed before transitioning to the Idle
state. This can be on the order of tens of seconds (e.g. an RI
implementation on TCP may require TCP to be in the CLOSED state,
possibly waiting in the TIME-WAIT state for a significant amount of
time).
If the LLP Close operation does not require the LLP to transmit
messages (e.g. for SCTP there is no mechanism to close a single LLP
Stream, thus when one LLP Stream is closed and other LLP Streams
remain active, there is no end-to-end handshake required), then the
RI MAY transition rapidly through this state.
When the Closing state is exited to Idle, the LLP Stream MUST NOT be
associated with the QP.
Any attempt to perform a Modify QP in the Closing state MUST return
an Immediate Error.
Errors detected by the RI when the QP is in the Closing state result
in a transition to the Error state; for LLP failures, this is
indicated with the specific Asynchronous Event "LLP Connection
Lost".
A short summary table describing the state changes for the Closing
state is shown in Figure 11. The following are detailed descriptions
of those changes.
The following are done prior to exiting the Closing state:
* The RI MUST stop processing SQ WRs and Remote RDMA Read
Operations targeting the QP.
* The RI MUST stop processing any incoming segments, though the RI
MAY process any arriving Terminate Messages.
* At some point in the Closing state the RI MUST begin to return
an Immediate Error for any attempt to post a WR to a Work Queue;
prior to that point, WQEs MUST be enqueued or result in an
Immediate Error.
* The RI MUST flush all incomplete WQEs on the RQ. All WQEs on the
RQ MUST be returned with the Flushed Error Completion Status
through the Completion Queue associated with the RQ. If RQ WQEs
are enqueued, the RI MUST flush the WQE with the Flushed Error
Completion Status through the Completion Queue associated with
the RQ.
* If no errors have been detected (see next bullet), an LLP Close
MUST occur. If the LLP Stream is the last or only active stream
for the LLP Connection, the RI MUST attempt to close the LLP
Connection gracefully. (For example, in [TCP] the FIN is sent
and the close sequence is done.)
* The RI MUST generate an Asynchronous Error if:
o Any SQ WQEs were on the SQ at any time during the Closing
state. Note, this condition may happen if the PostSQ is done
in Non-Privileged Mode and the Non-Privileged Mode portion
of the RI has not yet been informed that the QP is in the
Closing state. Also, the Error state will flush all SQ WQEs.
o Any incoming data arrives during the LLP Close. If the
incoming data is a Terminate Message, the RI MAY allow the
Consumer to retrieve the Terminate Message through the Query
QP Verb.
o Any Remote RDMA Read Operations are in process.
o An LLP Stream failure (e.g. LLP Stream is lost) occurs
during the LLP Close. Note that the RI MUST use a timeout
mechanism to detect LLP errors during the LLP Close, so that
the QP is in the Closing state for a bounded time. If the
LLP detects a final timeout, it MUST be considered an error.
* If the RI generates an Asynchronous Error, the following MUST
occur in order:
o An LLP Reset MUST occur and the LLP resources MUST no longer
be associated with the QP.
o The QP MUST be transitioned to the Error state.
o The RI MUST generate an Asynchronous Event.
* If no error occurs during the LLP Close operation:
o When all RQ WRs have been flushed and the LLP Close has
finished, the LLP Stream MUST be disassociated from the QP,
and the RI MUST generate an Asynchronous Event "LLP Close
Complete".
o When the prior items complete, the QP MUST be transitioned
to the Idle state.
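The exit conditions listed above can be summarized in a short
sketch. The function and parameter names are hypothetical. The sketch
also makes two simplifying assumptions that the text does not state:
an LLP failure takes precedence over the other error conditions when
choosing the event name, and arriving data is treated the same
whether or not it is a Terminate Message (for which Figure 11 makes
the event optional).

```python
def closing_outcome(sq_wqe_seen, data_arrived, rdma_read_in_progress,
                    llp_failed):
    """Return (next_state, async_event, llp_action) for the Closing state.

    The arguments are booleans describing what happened while the QP
    was in the Closing state. The event names come from Figure 11.
    """
    if llp_failed:
        # Assumption: an LLP failure is reported in preference to others.
        return ("Error", "LLP Connection Lost", "reset")
    if sq_wqe_seen or data_arrived or rdma_read_in_progress:
        return ("Error", "Bad Close", "reset")
    # No error: the LLP Close finishes and the QP returns to Idle.
    return ("Idle", "LLP Close Complete", "close")
```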
When the LLP Close is finished or an LLP Reset occurs, the RI MUST
disassociate the QP from the LLP Stream, including any LLP Stream
context and any resources associated with it. Disassociating the LLP
Stream from the QP means that it becomes possible for the QP to be
transitioned to Idle and to RTS with a new LLP Stream.
Note that it is possible for the Consumer to post WRs while the
automatic transition from RTS to Closing to Idle is occurring. See
Section 6.2.1 - Idle State for additional details.
Event                         Action                         Next State
----------------------------  -----------------------------  ----------
On Entry                      Stop QP processing, start LLP  Closing
                              Close, and start Flushing any
                              incomplete WQEs on Receive
                              queues.

LLP Close complete, all RQ    Create Asynchronous Event:     Idle
WQEs flushed, and no SQ       "LLP Close Complete", LLP
WQEs on the SQ                Disassociated from QP

At least one SQ WQE on the    Perform LLP Reset, Create      Error
SQ or Remote RDMA Read        Asynchronous Event: "Bad
Operation in progress.        Close", LLP Disassociated
                              from QP

LLP Connection Failure        Perform LLP Reset, Create      Error
                              Asynchronous Event: "LLP
                              Connection Lost", LLP
                              Disassociated from QP

Segment Arrives and is not    Perform LLP Reset, Segment is  Error
a Terminate Message           not processed. Create
                              Asynchronous Event: "Bad
                              Close", LLP Disassociated
                              from QP

Segment Arrives and is a      Perform LLP Reset, MAY create  Error
Terminate Message             Async Event: "Bad Close"; MAY
                              allow examination of
                              Terminate Message, LLP
                              Disassociated from QP

PostSQ/PostRQ with            Return an Immediate Error      Closing
Immediate Error

Modify QP                     Return an Immediate Error      Closing

PostRQ without Immediate      Enqueue and flush              Closing
Error

PostSQ without Immediate      Enqueue & Flush, Perform LLP   Error
Error                         Reset, Create Async Event
                              "Bad Close", LLP
                              Disassociated from QP.

Figure 11 - Closing State summary
6.3 Shared Receive Queue
The Verbs support a Shared Receive Queue (S-RQ). Support for the
Shared Receive Queue is OPTIONAL. The Query RNIC Verb MUST indicate
whether the RNIC supports the Shared Receive Queue.
A Shared Receive Queue is an RNIC resource which allows multiple RQs
to retrieve WQEs from the same shared queue on an as-needed basis.
This allows a Consumer to post WRs to the S-RQ instead of the RQ.
When a message arrives, the RI uses a WQE from the S-RQ and makes it
appear as if the WQE has been copied from the S-RQ to the QP's RQ. A
CQE for an incoming message which results in a WQE being consumed
from an S-RQ MUST be posted to the CQ associated with the QP's RQ.
The RI MUST return the maximum number of S-RQs supported by the RI
as an output modifier of Query RNIC, and the value MUST be zero if
the RI does not support S-RQs.
The RI MUST return the maximum number of outstanding WRs on an S-RQ
as an output modifier of Query RNIC, and the value MUST be zero if
the RI does not support S-RQs.
Each S-RQ MUST be associated with a single PD ID. Multiple S-RQs
MUST be able to be associated with the same PD ID.
The SQ of a QP associated with an S-RQ MUST operate no differently
than the SQ of a QP which is not associated with an S-RQ.
When using an S-RQ, the RI MUST allow Work Requests to be posted to
the S-RQ and MUST NOT allow WRs to be posted to an RQ of a QP
associated with the S-RQ.
If the RI supports an S-RQ, then it MUST:
* support the Create S-RQ Verb (See Section 9.2.4.1),
* support the Query S-RQ Verb (See Section 9.2.4.2),
* support the Modify S-RQ Verb (See Section 9.2.4.3),
* support the Destroy S-RQ Verb (See Section 9.2.4.4),
* support the S-RQ Handle as an Input Modifier for Create QP (See
Section 9.2.5.1),
* support an S-RQ Limit Event and a QP RQ Limit Event (See Section
6.3.8),
* support the S-RQ Handle as an Input Modifier for PostRQ, and
* support the S-RQ Handle as an Asynchronous Event Handler routine
parameter.
6.3.1 Creating a Shared Receive Queue
When the S-RQ is created, it MUST be associated with a PD ID, and
the maximum number of WRs which can be posted at any time must be
provided as an Input Modifier. Note that the number of WQEs on the
S-RQ at any given moment is dependent upon the completion semantics
described below.
6.3.2 Modifying a Shared Receive Queue
The RI MAY allow the Consumer to change the maximum number of
outstanding WRs on the S-RQ. If the RI supports the ability to
change the number of outstanding WRs on a SQ and RQ, and the RI
supports S-RQs, then it MUST:
* allow the maximum number of outstanding WRs on the S-RQ to be
changed;
* allow the maximum number of outstanding WRs to be changed while
WRs are still outstanding; and
* support the ability to change this on every S-RQ.
It is understood that changing the number of WRs that an S-RQ may
have outstanding MAY adversely affect performance. Resizing the S-RQ
MUST NOT cause Immediate, Completion or Asynchronous Errors, with
the exception of Immediate Errors returned by the Modify S-RQ Verb
and possible LLP time-outs. The resize operation MAY adversely
affect the Associated QPs attempting to communicate with the QPs
associated with the S-RQ while the resize is in progress, possibly
resulting in LLP time-outs and retries, which could in turn result
in LLP Stream teardown (and therefore an Asynchronous Error). It is
suggested that the Consumer only perform this resize operation when
activity on the connections has been quiesced, to minimize the risk
of transitioning Associated QPs to the Error state as a result of
LLP time-outs.
If the number of requested outstanding WRs is smaller than the
actual number of outstanding WRs currently on the S-RQ, then the
modification of the S-RQ MUST fail with an Immediate Error and the
S-RQ MUST remain in the original state.
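The resize rule in the preceding paragraph can be sketched as
follows. The class, attribute, and method names are hypothetical and
do not correspond to Verbs identifiers; the sketch models only the
"too small" check and the requirement that a failed resize leaves the
S-RQ unchanged.

```python
class ImmediateError(Exception):
    """Stands in for an Immediate Error returned by Modify S-RQ."""


class SharedReceiveQueue:
    """Toy S-RQ that tracks only its size and outstanding WR count."""

    def __init__(self, max_wrs):
        self.max_wrs = max_wrs
        self.outstanding = 0  # WRs posted and not yet fully completed

    def modify_max_wrs(self, requested):
        """Resize the S-RQ, failing if outstanding WRs would not fit."""
        if requested < self.outstanding:
            # The S-RQ is left in its original state on failure.
            raise ImmediateError("requested size below outstanding WRs")
        self.max_wrs = requested
```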
6.3.3 Destroying a Shared Receive Queue
The Verbs provide a Destroy S-RQ Verb to allow a Consumer to destroy
an S-RQ that is no longer needed. The RI MUST only allow an S-RQ to
be destroyed when all the QPs associated with that S-RQ have been
destroyed. The RI MUST allow an S-RQ to be destroyed when there are
WRs still posted to the S-RQ. Note that it is recommended that a
Consumer drain the S-RQ or track all WRs posted to the S-RQ before
destroying it so that no WRs are lost. For example, a WR which was
Posted to the S-RQ but which was never Completed would still be on
the S-RQ when the S-RQ was destroyed so the Consumer would never be
notified that the buffers associated with the WR were available
again.
After the Destroy S-RQ returns to the Consumer, the RI:
* MUST have freed all RI resources associated with Receive Work
Requests that were not Completed and were posted on that S-RQ,
and
* MUST ensure that it will no longer reference any Consumer
resources associated with Receive Work Requests that were not
Completed and were posted on that S-RQ.
6.3.4 Associating an S-RQ with a QP
A Shared Receive Queue MUST only be associated with a QP when the QP
is created. When a QP is created with an associated S-RQ, the RI
MUST ignore the maximum number of outstanding RQ WRs Input Modifier.
6.3.5 Shared Receive Queue Processing Model
If a QP is associated with an S-RQ, the RI MUST allow WRs to be
posted to the S-RQ using PostRQ, specifying the S-RQ Handle instead
of the QP Handle. If the QP is associated with an S-RQ, the RI MUST
NOT allow WRs to be posted to the Local RQ through PostRQ and MUST
return an Immediate Error if Posting to the Local RQ is attempted by
the Consumer.
The RI MUST ensure that S-RQs follow the rules for Work Queues with
respect to the posting rules and completion rules defined in Section
8.2.1 - Submitting Work Request to a Work Queue and Section 8.2.3 -
Completion Processing. This means the RI MUST prevent a Consumer
from overflowing the S-RQ using the PostRQ.
When an incoming Untagged Message arrives on a QP, the RI determines
if the QP is associated with an S-RQ. If it is, the RI must make it
appear as if the WQE has been dequeued from the S-RQ and queued to
the QP's local RQ. This does not guarantee that the S-RQ WQE is
free. The S-RQ WQE is considered to be part of the S-RQ until the
Work Completion associated with the S-RQ WQE has been retrieved or
the S-RQ is destroyed.
The RI MAY dequeue or use the S-RQ WQEs in any order. Since the WQEs
are in an implementation specific order, the Consumer should not
depend on S-RQ post order in any way. The RI should support one of
the following two models: sequential order or arrival order.
* In sequential ordering, the RI dequeues S-RQ WQEs as messages
arrive. If messages arrive out of order, in addition to
dequeueing the WQE required to place the data for that message,
the RI also dequeues a WQE for each message with an MSN lower
than the out-of-order message that has not arrived and does not
yet have an associated WQE.
* In arrival ordering, the RI dequeues S-RQ WQEs as the messages
arrive. If messages arrive out of order, only the WQE required
to place the out of order message will be dequeued from the S-
RQ. WQEs required to place data for the messages with an MSN
lower than the out of order message will be dequeued from the S-
RQ when those messages arrive.
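The difference between the two models can be illustrated by counting
how many S-RQ WQEs have been dequeued after each arrival on a single
QP. This is a sketch under stated assumptions: the function name and
the model labels are hypothetical, and MSNs are assumed to start at 1
for the stream.

```python
def dequeued_after_each_arrival(arrivals, model):
    """Count S-RQ WQEs dequeued after each message arrival on one QP.

    `arrivals` lists MSNs (assumed to start at 1) in arrival order;
    `model` is "sequential" or "arrival".
    """
    have_wqe = set()  # MSNs that already own a dequeued WQE
    counts = []
    for msn in arrivals:
        if model == "sequential":
            # Also reserve WQEs for every lower MSN still missing one.
            have_wqe.update(range(1, msn + 1))
        elif model == "arrival":
            have_wqe.add(msn)
        else:
            raise ValueError("unknown model")
        counts.append(len(have_wqe))
    return counts
```

For the arrival order 1, 3, 2, 4 the sequential model has dequeued
three WQEs once message 3 arrives (one is reserved for the missing
message 2), while the arrival model has dequeued only two at that
point.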
The RI MUST Complete incoming Send Message Types in the order they
were Posted to the Associated QP's Send Queue. This means Work
Completions retrieved from the CQ for any individual QP will be
retrieved only in Message Sequence Number (MSN) order (see [DDP] for
details). The RI MUST dequeue only one WQE from the S-RQ to place
any message represented by a single MSN. Note that the Work
Completions are not necessarily in the order in which the Send
Message Types arrived, nor in the order the WQEs were posted to the
S-RQ, nor in the order the WQEs were dequeued from the S-RQ.
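The MSN ordering of Work Completions on a single QP can be sketched
as a simple in-order release. The function name and the `next_msn`
parameter are hypothetical, and the sketch assumes MSNs start at 1;
it says nothing about how an RI actually tracks placement.

```python
def deliverable_completions(placed_msns, next_msn=1):
    """Return the MSNs whose Work Completions may be delivered now.

    `placed_msns` holds MSNs of fully placed messages in any order.
    Delivery proceeds in MSN order and stops at the first gap.
    """
    placed = set(placed_msns)
    delivered = []
    while next_msn in placed:
        delivered.append(next_msn)
        next_msn += 1
    return delivered
```

With messages 3, 1, 2 and 5 placed, completions for 1, 2 and 3 can be
delivered, but the completion for 5 waits until message 4 is placed.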
When a Work Completion which represents a WR originally submitted to
an S-RQ has been returned to the Consumer via the Poll for
Completion Verb, the RI MUST allow the Consumer to post another Work
Request to the S-RQ immediately.
All QPs that use an S-RQ MUST be able to consume S-RQ WQEs, as long
as the S-RQ has unconsumed WQEs available. If there are no S-RQ WQEs
when an Untagged Message arrives on a QP which is associated with
that S-RQ, then the LLP Stream MAY be Terminated. If the LLP Stream
is not Terminated, see Section 13.2 - Graceful Receive Overflow
Handling for one implementation option.
Protection Domain checking rules are slightly different for an S-RQ.
An S-RQ MUST have a PD ID assigned as an Input Modifier for Create
S-RQ. When an Untagged Message arrives and the QP has been
determined to use an S-RQ for its incoming Untagged Message WQEs,
then the PD ID of the STags in the WQEs MUST be validated against
the PD ID of the S-RQ and MUST NOT be validated against the PD ID of
the QP.
Note that due to the Protection Domain checking rule above, the
Consumer will not be able to invalidate an STag used by the S-RQ
unless the S-RQ's PD is the same as the QP's PD, even if the QP uses
the S-RQ. This is because the PD used for comparison in Invalidation
operations is that of the QP, not the S-RQ.
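The two Protection Domain comparisons described above can be
contrasted in a minimal sketch. The function and parameter names are
hypothetical, PD IDs are modeled as plain integers, and the behavior
without an S-RQ (checking against the QP's PD) is an assumption of
the sketch rather than something stated in this section.

```python
def placement_pd_ok(stag_pd, qp_pd, srq_pd=None):
    """PD check for an STag in a WQE used to place an Untagged Message.

    When the QP uses an S-RQ (`srq_pd` is given), the S-RQ's PD is
    used and the QP's PD is ignored.
    """
    expected = srq_pd if srq_pd is not None else qp_pd
    return stag_pd == expected


def invalidate_pd_ok(stag_pd, qp_pd):
    """PD check for Invalidation, which always uses the QP's PD."""
    return stag_pd == qp_pd
```

An STag in PD 7 posted to an S-RQ in PD 7 passes the placement check
even when the QP is in PD 3, but the same STag fails the invalidation
check on that QP.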
The use of the STag of zero as part of an SGE in a WR MUST be
validated by the RI based on the QP's attribute which indicates if
it is allowed on the QP. If use of the STag of zero is not permitted
on the QP and a WQE referencing STag zero is processed on the QP,
the RI MUST return a Completion Error. Consequently, if the Consumer
uses the STag of zero in S-RQ Work Requests and the S-RQ is accessed
by QPs that have the use of STag of zero enabled as well as QPs that
do not have the use of STag of zero enabled, then the QPs that do
not have the use of STag of zero enabled will transition to the
Error state as soon as they retrieve a WQE which contains an STag of
zero.
6.3.6 S-RQ Error Semantics
All errors encountered MUST be reported through Work Completions
where possible. This is due to the semantic requiring the WQE to
appear as if it had been on the QP's RQ. The exception is that a
catastrophic S-RQ error MUST be reported as an Affiliated
Asynchronous Error.
Errors related to a connection for a QP associated with an S-RQ MUST
NOT affect the S-RQ. Any WQEs already consumed by the QP from the S-
RQ will be completed in error or flushed in the case of an LLP
Stream error. Any other QPs associated with the S-RQ MUST remain
unaffected by a local QP error.
Errors related to a Work Request on an S-RQ will be posted to the CQ
associated with the QP's RQ if they are processing errors, or
returned as Verb results if they are Immediate Errors.
In the case of a catastrophic S-RQ failure, any QP associated with
the S-RQ will transition to the Terminate state when the QP attempts
to dequeue a WQE from the S-RQ when handling an incoming Send Type
Message. The resource ID returned by the Asynchronous Event Handler
MUST be the QP ID. All outstanding WQEs on the QP will be flushed
and an Affiliated Asynchronous Event: "S-RQ error on a QP" MUST be
generated as part of the Terminate state transition.
The RI MUST NOT flush the WQEs on an S-RQ which have not been used
to Place incoming Untagged Messages when any associated QP
transitions to the Terminate, Error or Closing states.
6.3.7 S-RQ Resource Sizing
The Consumer is responsible for sizing the S-RQ and the CQs
associated with the QP's RQs appropriately. The RI MUST ignore the
sizing information provided for the QP's RQ when the QP uses an S-
RQ. The Consumer should note this fact when invoking the Create QP
Verb using an S-RQ handle. In addition, S-RQs are subject to the
Completion confirmation rules defined in Section 8.2.3 - Completion
Processing. This means that the WR MUST be considered to be in the
scope of the RI, and thus to be using a WQE on the S-RQ, until the
Work Completion has been retrieved. In addition, the RI MUST allow
any single RQ to utilize all of the WQEs posted to an S-RQ. Note
also that the RI is not required to perform CQ overflow detection.
The RQ size Input Modifier is not used when a QP is associated with
an S-RQ. In this case, the RQ has no defined size. It can be up to
the size of the S-RQ. If the S-RQ is resized, any QP MUST be able to
utilize all of the WQEs posted to the S-RQ. It is up to the
implementation to process multiple messages in progress at one time.
Note that the number of messages that can be in progress at once is
limited by the S-RQ size, the LLP receive window, and possibly other
factors.
6.3.8 S-RQ Limit Checking
An RI that supports the S-RQ MUST support an S-RQ Limit
Notification. An RI that supports S-RQ MUST support an S-RQ Limit
input modifier on the Create S-RQ and Modify S-RQ Verbs to establish
the value of the Limit. The S-RQ Limit detection MUST be armed by
the RI upon creation of the S-RQ, if non-zero. This is only used for
generation of the Affiliated Asynchronous Event and MUST NOT
otherwise disrupt the QP operation. When the number of available (or
unused) WQEs posted to the S-RQ drops below the S-RQ Limit, the RI
MUST generate an Asynchronous Event and provide the S-RQ Handle as
the Resource ID. This event will only be triggered once after it is
armed and will not generate another event until the Consumer re-arms
the event. The RI MUST allow the Consumer to re-arm this event
through the use of Modify S-RQ. The RI MUST arm this event when the
S-RQ is created if the S-RQ Limit is greater than zero. The RI MUST
allow an already armed S-RQ Limit to be armed again. If the S-RQ
Limit is armed for an S-RQ and the maximum number of outstanding WRs
on the S-RQ is modified below S-RQ Limit, then the RI MUST return an
Immediate Error indicating that an invalid Input Modifier was
provided.
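The arming rules above can be illustrated with a small model. The
following Python sketch is illustrative only and is not part of the
Verbs definition. The class name, method names, and event string are
hypothetical, and the model assumes that the Limit is checked each
time a WQE is consumed and that setting the Limit to zero disarms it.

```python
class SharedReceiveQueueModel:
    """Toy model of S-RQ Limit arming; every name here is hypothetical."""

    def __init__(self, handle, max_wrs, limit=0):
        if limit > max_wrs:
            raise ValueError("Immediate Error: invalid Input Modifier")
        self.handle = handle
        self.max_wrs = max_wrs
        self.limit = limit
        self.armed = limit > 0   # armed at creation only if non-zero
        self.available = 0       # unused WQEs currently posted
        self.events = []         # Asynchronous Events raised so far

    def post(self, count=1):
        """Post receive WQEs to the S-RQ."""
        if self.available + count > self.max_wrs:
            raise ValueError("Immediate Error: S-RQ is full")
        self.available += count

    def consume(self):
        """An incoming Untagged Message takes one WQE."""
        if self.available == 0:
            raise RuntimeError("no WQE available")
        self.available -= 1
        if self.armed and self.available < self.limit:
            self.armed = False   # one-shot until the Consumer re-arms
            self.events.append(("S-RQ Limit Reached", self.handle))

    def modify(self, limit=None, max_wrs=None):
        """Stand-in for Modify S-RQ; passing a limit (re-)arms the event."""
        new_max = self.max_wrs if max_wrs is None else max_wrs
        new_limit = self.limit if limit is None else limit
        limit_too_big = limit is not None and new_limit > new_max
        shrunk_below_armed = (limit is None and self.armed
                              and new_max < self.limit)
        if limit_too_big or shrunk_below_armed:
            raise ValueError("Immediate Error: invalid Input Modifier")
        self.max_wrs = new_max
        self.limit = new_limit
        if limit is not None:
            self.armed = new_limit > 0   # assumption: zero disarms
```

In this model the event fires once when the count of unused WQEs
first drops below the Limit, and stays silent until modify() is
called with a Limit again.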
An RI that supports the S-RQ MUST support a QP RQ Limit Notification
for QPs associated with an S-RQ. The QP RQ Limit detection MUST be
armed by the RI upon creation of the QP, if non-zero. The Consumer
specifies the QP RQ Limit as part of either Create QP or Modify QP.
This is only used for generation of the Affiliated Asynchronous
Event and MUST NOT otherwise disrupt the QP operation. When the
number of messages in progress on the QP (which is defined as
messages being Placed, and thus have WQEs associated with them, but
which have not yet had CQEs generated for the WQEs and thus have not
been Delivered to the Consumer) exceeds the QP's RQ Limit, the RI
MUST generate an Asynchronous Event and provide the QP ID as the
Resource ID. This event will only be triggered once after it is
armed and will not generate another event until the Consumer re-arms
the event. The RI MUST allow the Consumer to re-arm this event
through the use of Modify QP. The RI MUST arm this event when the QP
is created if the QP's RQ Limit is greater than zero. The RI MUST
allow an already armed QP RQ Limit to be armed again. If the S-RQ
Limit specified in the Create S-RQ or Modify S-RQ is greater than
the maximum number of outstanding WRs on the S-RQ, then the RI MUST
return an Immediate Error indicating that an invalid Input Modifier
was provided.
Note that neither Limit Notification forces Work Completions to be
retrieved by the Consumer. Only retrieving the Work Completions
allows the Consumer to Post additional WQEs to the S-RQ.
Consequently, if separate Consumers are allowed to share an S-RQ,
then one Consumer could consume all or part of the S-RQ entries if
it does not retrieve Work Completions.
6.4 Stopping QP processing and Sending the Terminate Message
Certain conditions require that QP operations be stopped, and a
final Terminate Message be sent. Stopping WR processing on the QP
and transmission of a Terminate Message are associated with QP state
changes; the specific QP state transitions that require this are
described in Section 6.2 - Queue Pair Resource States. When a QP
must be stopped, either by a Modify QP Verb, or by QP state change
due to an error, the following notes apply:
1. For Errors that do not impact the integrity of an outbound DDP
Segment or for Modify QP Verb invocations that require stopping
the QP, outbound processing MUST be stopped only on DDP Segment
boundaries, in the absence of LLP errors. Any Terminate Message
(if required) MUST be filled out as described in [RDMAP] and
MUST be sent after the last complete outbound DDP Segment.
For Errors that impact the integrity of an outbound DDP Segment
that require stopping the QP:
o If the RI has not begun sending the DDP Segment, then
outbound processing MUST be stopped before the DDP Segment
is sent; and the Terminate Message and error code MUST be
sent instead of the erroneous DDP Segment.
o If the RI has begun sending the DDP Segment, then outbound
processing MUST be stopped immediately on the byte that
experienced the error and the LLP Stream MUST be Reset.
2. For Errors or Modify QP Verbs (except for RTS to Closing
transitions) that require stopping the QP, the RI MUST cease to
process inbound DDP Segments, at least by the time that any
currently in-process DDP Segment has completed processing.
The semantics of stopping QP processing and handling incoming
DDP segments for Modify QP Verbs that require the transition
from RTS to Closing are discussed at length in Section 6.2.5.
Subsequent inbound DDP Segments (if any) are ignored and any
inbound DDP Segments that have been Placed but not Delivered are
never Delivered.
3. For Modify QP Verbs that require stopping the QP, the RI SHOULD
stop outbound QP processing prior to sending any current DDP
Segment to the LLP and MUST stop outbound QP processing at least
by the time that any currently in-process outbound message has
completed processing.
4. For Errors detected while creating RDMA Write, Send Type, or
RDMA Read Type Work Requests, the RI MUST stop outbound QP
processing prior to sending the current DDP Segment to the LLP.
The Terminate Message and Error code MUST be sent instead of the
original message (or DDP Segment). In this case, the [RDMAP]
Terminate Message's Terminate Control Field is set to represent
RDMA and the Error Type is set to represent Local Catastrophic
Error.
5. For Errors detected while creating RDMA Read Responses to a
Remote RDMA Read Operation, the RI MUST stop outbound QP
processing prior to sending the erroneous DDP Segment to the
LLP. The Terminate Message and Error code are sent instead of
the erroneous RDMA Read Response Message.
6. For Errors detected while creating CQEs, or other reasons not
directly associated with creating an outbound DDP Segment, the
RI SHOULD stop outbound QP processing prior to sending any
current DDP Segment to the LLP and MUST stop outbound QP
processing at least by the time that any currently in-process
outbound DDP message has completed processing. In this case,
the [RDMAP] Terminate Message's Terminate Control Field's Header
Control Bits are all zero.
7. If an error is detected by an iWARP implementation while
incoming DDP Segment data is being Placed, the error actions
(changing state, stopping the QP, etc.) MUST be delayed until
after the segment is actually delivered by the LLP. If more than
one error is detected on incoming segments, then the first DDP
Segment Delivered with a detected error MUST result in the error
actions. The first detected error MAY have been detected by the
LLP, DDP Layer, or RDMA Layer. If, while waiting for Delivery of
an incoming segment that contains an error, another error is
detected that is not associated with incoming segments (for
example, an LLP error, Send Queue or RDMA Read Response
processing error), then the RI MUST perform the actions for that
error without waiting for Delivery of any other segments.
8. For errors detected on incoming DDP Segments (after they have
been Delivered by the LLP), the Terminate Message MUST include a
copy of the iWARP header from the DDP Segment in error (see
[RDMA]).
Figure 12 below indicates the values for the fields in the
Terminate Control Field of the Terminate Message defined in
[RDMAP].
Condition               Layer  EType    Error  HdrCt  DDP    Term   Term
                                        Code          Seg.   DDP    RDMA
                                                      Lgt    Hdr.   Hdr.

For Modify QP from RTS  No Terminate Message is sent.
to Error

For Modify QP from RTS  RDMA   Local    None   000b   All    All    All
to Terminate            (0x)   Catast.  (0x)          zeros  zeros  zeros
                               (0x)

For Errors detected     RDMA   Local    None   000b   All    All    All
while creating RDMA     (0x)   Catast.  (0x)          zeros  zeros  zeros
Write, Send Type, or           (0x)
RDMA Read Request
Messages

For Errors detected     RDMA   Local    None   000b   All    All    All
while creating          (0x)   Catast.  (0x)          zeros  zeros  zeros
completions, or other          (0x)
reasons not directly
associated with
creating an outbound
DDP Segment

For Errors detected     Depends on error, see [RDMAP] specification
processing or Placing   and/or Sections 8.3.2 & 8.3.3.
incoming Send Type,
RDMA Write, RDMA Read
Request or RDMA Read
Response Messages

For Errors detected     Depends on error, see [RDMAP] specification
while creating RDMA     and/or Sections 8.3.2 & 8.3.3.
Read Response Messages

For LLP layer errors    No Terminate Message is sent.
detected by an iWARP
implementation (e.g.
incoming LLP Reset
while QP in RTS)

Incoming Terminate Msg  No Terminate Message is sent.
Figure 12 - Terminate Control Field Values
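As a reading aid, the rows of Figure 12 can be expressed as a lookup.
The Python sketch below is illustrative only. The condition keys and
the function name are hypothetical, and the numeric encodings, which
appear only as "(0x)" in the figure, are deliberately left out.

```python
# Field values shared by the three rows of Figure 12 in which the RI
# generates a Terminate Message with fixed contents.
LOCAL_CATASTROPHIC = {
    "layer": "RDMA",
    "etype": "Local Catastrophic",
    "error_code": "None",
    "hdrct": "000b",
    "ddp_seg_len": "all zeros",
    "term_ddp_hdr": "all zeros",
    "term_rdma_hdr": "all zeros",
}

# Rows in which no Terminate Message is sent at all.
NO_TERMINATE = None

# Rows whose field values depend on the specific error.
SEE_RDMAP = "depends on error; see [RDMAP] and Sections 8.3.2 and 8.3.3"

# Hypothetical condition keys, one per row of Figure 12.
FIGURE_12 = {
    "modify_qp_rts_to_error": NO_TERMINATE,
    "modify_qp_rts_to_terminate": LOCAL_CATASTROPHIC,
    "error_creating_outbound_request": LOCAL_CATASTROPHIC,
    "error_creating_completion_or_other": LOCAL_CATASTROPHIC,
    "error_on_incoming_message": SEE_RDMAP,
    "error_creating_read_response": SEE_RDMAP,
    "llp_error": NO_TERMINATE,
    "incoming_terminate": NO_TERMINATE,
}


def terminate_fields(condition):
    """Return the Figure 12 row for a condition key (KeyError if unknown)."""
    return FIGURE_12[condition]
```

For example, terminate_fields("llp_error") returns None because no
Terminate Message is sent for LLP layer errors.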
6.5 Outstanding RDMA Read Resource Management
RDMA allows multiple RDMA Read Request Messages to be outstanding on
a single LLP Stream. To enable this feature, the RNIC provides
resources associated with both the inbound and outbound stream. For
each outbound RDMA Read Request Message, the RNIC has some resources
to track the request until a local Completion occurs. Similarly, for
each inbound RDMA Read Request Message, the RNIC has an Inbound RDMA
Read Request Queue (IRRQ) (associated with the DDP Queue Number of
1) to store the state of the request until it has been satisfied by
sending all of the requested data in the RDMA Read Response Message.
The Input Modifier that specifies this value is called the Inbound
RDMA Read Queue Depth (IRD).
The Outbound RDMA Read Queue Depth (ORD) is the allocated number of
outstanding RDMA Read Request Messages the RNIC is allowed to have
outstanding at the Data Sink of an RDMA Read Operation. This is the
resource used to track the request until a local Completion occurs.
The Inbound RDMA Read Queue Depth (IRD) is the allocated number of
incoming RDMA Read Request Messages a QP can support at the Data
Source for an RDMA Read Operation. This is the resource used to
track inbound RDMA Read Request Messages.
An RNIC MUST implement these resources as either per QP resources,
or shared per RNIC resources. Per QP means that the resources are
tied to the QP and are most likely part of the QP Context. Per RNIC
means that the RNIC has a pool of such resources
internally that it assigns to the QP based on the values of IRD and
ORD associated with the QP.
Query RNIC MUST return the type of resources the specified RNIC
supports. The results are returned in the following Output Modifiers
for Query RNIC:
* The maximum number of Inbound RDMA Read Request Queue messages
that can be outstanding per RNIC. This is the per RNIC parameter
that corresponds to IRD. This value is Zero if the resources for
handling Inbound RDMA Read Requests are not shared between QPs.
* The maximum number of Outbound RDMA Read Request messages that
can be outstanding per RNIC. This is the per RNIC parameter that
corresponds to ORD. This value is Zero if the outstanding RDMA
Read Requests are not shared between QPs.
* The maximum number of inbound RDMA Read Request Messages that
the Inbound RDMA Read Request Queue can store per QP
(corresponds to IRD).
* The maximum number of outbound RDMA Read Request Messages that
can be outstanding per QP (corresponds to ORD).
The Consumer is responsible for setting the RDMA Read Data Sink QP's
ORD so that it does not exceed the Associated QP's IRD at the Data
Source.
If the Consumer attempts to set IRD or ORD to one or greater, and
there are not enough resources to allow this, the Create QP or
Modify QP Verb MUST fail with an Immediate Error. This can happen
because the maximum amount of IRD/ORD resources returned by Query
RNIC MAY be affected by consumption of unrelated resources, so that
not all of the reported resources may actually be available
simultaneously.
If the IRD and ORD resources are not shared between QPs (e.g. fixed
per QP instead of allocated out of a pool for the RNIC), then the
ULP need only negotiate the values for IRD and ORD. But if the IRD
and ORD resources are shared across the RNIC, then some function of
the Consumer or Consumer's environment (such as a resource manager)
must determine how to allocate the resources among the QPs in
addition to negotiating the IRD and ORD values.
The RNIC MUST ensure that it does not issue more RDMA Read Request
Messages than is specified by the QP's ORD value. However, the RI
MUST allow the Consumer to post as many RDMA Read Type Work Requests
as it can, within the limit of the total Work Requests the Send
Queue can support. The RI MUST delay processing of an RDMA Read Type
Work Request posted to the SQ which would result in exceeding the
QP's ORD value until a prior RDMA Read Type Work Request Completes.
The rules in Section 8.2.2 enable subsequent Work Requests to be
executed before the RDMA Read Type Work Request Completes. If,
however, a delay in processing occurs due to waiting for a prior
RDMA Read Type Work Completion, this will effectively prevent
subsequent Work Requests from being executed until the delay is over
(i.e. stall Send Queue processing). If the Consumer wants to avoid
this type of delay in Send Queue processing, it can issue up to as
many RDMA Read Work Requests as supported by the value of ORD for
that QP, and when each one Completes, then add an additional RDMA
Read Type Work Request.
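The stall described above can be seen in a small simulation. The
Python sketch below is a simplified model and does not describe any
particular RNIC. The function name and the string labels for Work
Requests are hypothetical, and it assumes in-order Send Queue
processing in which an RDMA Read Type Work Request that would exceed
ORD blocks everything posted behind it.

```python
from collections import deque


def start_work_requests(send_queue, ord_limit, outstanding_reads=0):
    """Start WRs in order until an RDMA Read would exceed ORD.

    send_queue is a list of "read" or "send" labels (hypothetical).
    Returns (started, still_waiting).
    """
    started = []
    pending = deque(send_queue)
    while pending:
        if pending[0] == "read":
            if outstanding_reads >= ord_limit:
                break                    # Send Queue processing stalls here
            outstanding_reads += 1
        started.append(pending.popleft())
    return started, list(pending)


# With ORD = 2, the third Read stalls the queue, including the Send
# that was posted after it.
started, waiting = start_work_requests(
    ["read", "send", "read", "read", "send"], ord_limit=2)
```

Once one of the earlier Reads Completes, calling the function again
on the waiting list with outstanding_reads=1 lets both remaining
Work Requests start.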
The Consumer should manage the number of RDMA Read Request Messages
outstanding, either by correctly setting the QP's ORD value to be
less than or equal to the Associated QP's IRD value, or by limiting
the number of RDMA Read Type Work Requests the Consumer posts on the
Send Queue at any one time to be less than or equal to the
Associated QP's IRD value. If this is not done correctly, the Local
Peer may attempt to send more RDMA Read Request Messages than the
Remote Peer can accept, which will result in an error from the
Remote Peer that Terminates the RDMAP Stream (See Section 6.6.2.4 -
Remote Termination).
The RDMA Read Resources (IRD and ORD) MUST be initialized at QP
creation (Create QP). The RDMA Read Resources MAY be changed while
the QP is in Idle state and when the QP is in the RTS state. If the
Consumer changes the resources while the QP is in the RTS state, the
Consumer should ensure that no RDMA Read Operations are outstanding
for the affected direction (outbound for ORD, inbound for IRD). If
the Consumer modifies the RDMA Read Resources when RDMA Read
Operations are outstanding, the QP state MAY be indeterminate and
the RI MUST NOT adversely affect any other QPs supported by the RI.
Changing RDMA Read Resources when RDMA Read Operations are not
outstanding is easily done if IRD and ORD are set before any RDMA
Read Work Requests are posted by either Peer. If RDMA Read Work
Requests have already been posted, it is up to the Consumer to
ensure that they have all Completed before changing IRD or ORD or
the QP may be in an indeterminate state.
The following semantics are required of the RI:
* All RNICs MUST allow the Consumer to reduce the ORD in the Idle
and RTS states.
* It is OPTIONAL for an RI to allow the Consumer to increase IRD
or ORD after the QP has been created.
* It is OPTIONAL for an RI to accept reductions of IRD from the
Consumer after the QP has been created.
* The RNIC MUST support a total number of inbound RDMA Read
Request Messages and outbound RDMA Read Request Messages, so
that each is at least equal to the total number of QPs supported
by the RNIC. The RNIC thus MUST be able to support at least
IRD=1 and ORD=1 for each QP.
* RNICs that implement shared "per RNIC" RDMA Read Resources for
IRD and ORD, MUST have enough so that all of the QPs can be
assigned a value of one for IRD and one for ORD. It is up to the
resource manager to allocate these resources fairly, so that
applications that need RDMA Read Resources can be assured of
their availability.
Note that the maximum amount of resources returned by Query RNIC may
be adversely affected by consumption of unrelated resources, so that
not all of the reported number may actually be available
simultaneously.
If the Consumer attempts to set either IRD or ORD to one or greater,
and there are not enough resources to allow this, the Create QP or
Modify QP Verb MUST fail with an Immediate Error.
Note that when using "per RNIC" resources, the Create or Modify QP
IRD and ORD values are also limited by the "per QP" resources.
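The interaction between the "per QP" and "per RNIC" limits can be
sketched as follows. This Python model is illustrative only. The
class, attribute, and exception names are hypothetical, and it
follows the Query RNIC convention described above in which a per
RNIC value of zero means the resource is not shared.

```python
class ImmediateError(Exception):
    """Hypothetical stand-in for an Immediate Error from a Verb."""


class ReadResourceModel:
    """Toy allocator for IRD/ORD under per QP and per RNIC limits."""

    def __init__(self, max_ird_per_qp, max_ord_per_qp,
                 max_ird_per_rnic=0, max_ord_per_rnic=0):
        self.max_ird_per_qp = max_ird_per_qp
        self.max_ord_per_qp = max_ord_per_qp
        self.max_ird_per_rnic = max_ird_per_rnic   # 0 means not shared
        self.max_ord_per_rnic = max_ord_per_rnic   # 0 means not shared
        self.ird_in_use = 0
        self.ord_in_use = 0

    def create_qp(self, ird, ord_depth):
        """Reserve IRD/ORD for a new QP or raise ImmediateError."""
        # The per QP limits always apply, even with a shared pool.
        if ird > self.max_ird_per_qp or ord_depth > self.max_ord_per_qp:
            raise ImmediateError("exceeds per QP limit")
        # The shared pool limits apply only when the value is non-zero.
        if (self.max_ird_per_rnic
                and self.ird_in_use + ird > self.max_ird_per_rnic):
            raise ImmediateError("insufficient shared IRD resources")
        if (self.max_ord_per_rnic
                and self.ord_in_use + ord_depth > self.max_ord_per_rnic):
            raise ImmediateError("insufficient shared ORD resources")
        self.ird_in_use += ird
        self.ord_in_use += ord_depth
        return ird, ord_depth
```

A resource manager built on such a model would also need a policy
for keeping at least one IRD and one ORD available for every QP,
which this sketch does not attempt.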
6.5.1 Example IRD/ORD Negotiation
The example in Figure 13 shows one possible negotiation for a single
direction (if the ULP uses RDMA Read Operations in both directions
on the RDMA Stream, it must also do the same thing in reverse). Note
that the last step may be omitted if the ULP is not interested in
reducing the resources used at the left side of the connection when
the right side supports less.
< Figure 13 did not convert properly from source >
< to be corrected in an upcoming version >
Figure 13 - An example RDMA Read Resource negotiation
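The arithmetic behind a single-direction negotiation can be sketched
in a few lines. The Python function below does not reconstruct
Figure 13. It is one possible exchange consistent with the
surrounding text, and the function and parameter names are
hypothetical. It assumes the side issuing RDMA Reads advertises its
ORD, the other side sizes its IRD to at most that value, and the
issuing side then optionally lowers its ORD to match.

```python
def negotiate_read_depth(requester_ord, responder_max_ird,
                         reduce_requester=True):
    """Return (requester ORD, responder IRD) after one exchange.

    requester_ord:     ORD the Data Sink side would like to use.
    responder_max_ird: largest IRD the Data Source side can offer.
    reduce_requester:  whether the optional last step is performed.
    """
    # The responder has no need for more IRD than the requester's ORD.
    responder_ird = min(requester_ord, responder_max_ird)
    if reduce_requester:
        # Optional last step: free requester resources that the
        # responder could never accept.
        requester_ord = min(requester_ord, responder_ird)
    return requester_ord, responder_ird
```

If the last step is skipped, the requester keeps its larger ORD and
the Consumer must instead limit the number of RDMA Read Type Work
Requests it posts to the responder's IRD.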
6.6 Connection Management
6.6.1 Connection Initialization
RDMA Stream initialization can occur as the transport connection is
created or sometime thereafter. In the latter case, the connection
may require a ULP supplied end-to-end handshake before iWARP is
initialized. Either the active or passive side of the connection may
initiate turning on iWARP.
In either case, the ULP must know, before iWARP mode is to begin,
which model of operation is to be used by the ULP.
An RI MUST support RDMA Stream initialization sometime after the
transport connection is established and some streaming mode data has
been sent.
An RI MAY support RDMA Stream startup along with the transport
connection, with no streaming mode data sent. This option is more
completely described in Section 13.1 - Connection Initialization at
LLP Startup.
Once iWARP initialization is complete, the RI MUST allow only iWARP
messages to be sent across the LLP connection until the RDMA Stream
is torn down.
Sections 6.6.1.1 and 6.6.1.2 provide informative examples of methods
for the ULP to transition to RDMA mode. Other implementations are
possible.
6.6.1.1 Active Connection Initialization after LLP Startup
For this discussion, the Active side goes to iWARP mode first. In
the figures below, the thin lines represent TCP Streaming mode and
the thick lines represent iWARP mode.
< Figure 14 did not convert properly from source >
< to be corrected in an upcoming version >
Figure 14 - Connection Initialization after LLP Startup
Below is the sequence for an active side iWARP startup. Note that
the dotted line arrows above indicate messages that may not be
needed for some implementations.
1. The ULP establishes the LLP Connection and LLP Stream.
2. The active side ULP ensures that the passive side is able to
enter iWARP mode via some negotiation or other mechanism, which
is outside the scope of this specification.
3. The active side Consumer creates a QP, setting up the CQ, PD
etc., and registers memory for buffers. Note that in some
instances, this may have been done at some previous time during
the initialization process.
4. The active side Consumer posts receive buffers via PostRQ that
are appropriate for the expected traffic. A first message may
arrive quickly after the transition to RTS.
5. The active side Consumer moves the QP to the RTS state. The
Consumer includes the LLP Stream Handle in the Modify QP Verb,
and a single message buffer which contains the last streaming
mode message to be sent to the Remote Peer. The RI uses the
presence of this message buffer to recognize the Active startup
sequence. For information on implementing this state transition,
see Section 6.2.1.2 - Idle to RTS.
6. When the active side Consumer receives the first RDMA/DDP
Message from the passive side (e.g. a Send type message), the
active side Consumer is free to post additional Work Requests to
the Send Queue. The active side Consumer should not have posted
any SQ WRs while the QP was in the Idle state, or while the QP
is in the RTS state. The active side Consumer should not post
any SQ WRs until the first RDMA/DDP Message is received. If the
Consumer posts SQ WRs during either of these times, the Remote
Peer is likely to improperly synchronize to the LLP Stream and
to Terminate the LLP Stream. One way that the Consumer can
determine that the message has arrived is to have the initial
message sent from the Associated QP have the Solicited Event bit
set, thus generating an event at the Local Peer.
7. If the local Consumer intends to perform RDMA Read Operations,
the local Consumer obtains, by some ULP defined message, the
number of Incoming RDMA Read Request Messages that the Remote
Peer can have outstanding (IRD). If the Remote Peer's IRD is
smaller than the local Peer's ORD, the local Consumer should
also perform a Modify QP Verb with the Remote Peer's IRD placed
into the local ORD prior to posting the first RDMA Read Type WR.
The local Consumer may also transmit, in some ULP defined
message, the number of Outbound RDMA Read Request Messages that
the Local Peer can have outstanding (ORD).
8. If the local ULP intends the QP to be a target of RDMA Read
Operations, the local Consumer provides, in some ULP defined
mechanism, the number of Inbound RDMA Read Request Messages that
the Local Peer can have outstanding (IRD). The Consumer may also
receive, by some ULP defined mechanism, the Number of Outbound
RDMA Read Request Messages that the Remote Peer can have
outstanding (ORD). If the Remote Peer's ORD is smaller than the
Local Peer's IRD and the Local RNIC supports IRD reduction, the
local Consumer could perform a Modify QP Verb with the Remote
Peer's ORD placed into the local IRD prior to posting the first
RDMA Read Type WR.
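The restriction in step 6 can be enforced on the Consumer side with
a small gate in front of the Send Queue. The Python sketch below is
illustrative only. The class and method names are hypothetical, and
post_sq is a caller-supplied callable standing in for the Verb that
posts a Work Request to the Send Queue.

```python
class ActiveStartupGate:
    """Hold Send Queue WRs until the first RDMA/DDP Message arrives."""

    def __init__(self, post_sq):
        self._post_sq = post_sq          # hypothetical Post SQ stand-in
        self._first_message_seen = False
        self._deferred = []

    def submit(self, work_request):
        """Post immediately if safe, otherwise queue for later."""
        if self._first_message_seen:
            self._post_sq(work_request)
        else:
            self._deferred.append(work_request)

    def on_first_message(self):
        """Call when the first receive Completion (or event) is seen."""
        if self._first_message_seen:
            return
        self._first_message_seen = True
        for work_request in self._deferred:
            self._post_sq(work_request)  # release in original order
        self._deferred.clear()
```

A Consumer might call on_first_message() from the handler for the
Solicited Event mentioned in step 6.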
6.6.1.2 Passive Connection Initialization after LLP Startup
Below is the sequence for a passive side iWARP startup:
1. The passive side ULP establishes the LLP Connection and LLP
Stream.
2. The passive side ULP informs the active side that it is able to
enter iWARP mode via some negotiation.
3. The passive side ULP waits for the Active side to send a last
streaming mode message to indicate that it should enter RDMA
mode and that the remote node is in RDMA mode. When that message
arrives, and if it indicates that iWARP mode is desired, the
passive side Consumer continues with the items below.
4. The passive side Consumer creates a QP, setting up the CQ, PD
etc. Note that this may have been done previously.
5. The passive side Consumer posts receive buffers appropriate for
the expected traffic to the RQ.
6. The passive side Consumer posts at least one Send type Work
Request that is used by the active side to complete the
negotiation. The WR may contain any data that the ULP needs to
communicate.
Note: the passive side Consumer may delay the posting of buffers
and Work Requests until after the transition to RTS, described
below.
7. The passive side Consumer moves the QP to RTS state, specifying
the LLP Stream Handle. The passive side Consumer does not
include a last streaming mode message buffer in the Modify QP
Verb; if it does, the Remote Peer is likely to improperly
synchronize to the RDMA Stream and be forced to terminate the
LLP Stream.
8. The passive side Consumer may now begin posting additional Work
Requests.
9. If the local Consumer intends to perform RDMA Read Operations,
the local Consumer obtains, in some ULP defined message, the
number of incoming RDMA Read Request Messages that the Remote
Peer can have outstanding (IRD). If the Remote Peer's IRD is
smaller than the local Peer's ORD, the local Consumer should
also perform a Modify QP Verb with the Remote Peer's IRD placed
into the local ORD prior to posting the first RDMA Read Type WR.
The local Consumer may also transmit, in some ULP defined
message, the number of outgoing RDMA Read Request Messages that
the Local Peer can have outstanding (ORD).
10. If the local Consumer intends the QP to be a target of RDMA Read
Operations, the Consumer provides, in some ULP defined message,
the number of incoming RDMA Read Request Messages that the Local
Peer can have outstanding (IRD). The Consumer may also receive,
in some ULP defined message, the number of outgoing RDMA Read
Request Messages that the Remote Peer can have outstanding
(ORD). If the Remote Peer's ORD is smaller than the Local Peer's
IRD, the local Consumer may also perform a Modify QP Verb with
the Remote Peer's ORD value placed into the local IRD prior to
posting the first RDMA Read Type WR, if the RI supports IRD
reduction.
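The two example sequences differ in what the Consumer passes on the
Idle to RTS transition: the active side supplies a last streaming
mode message buffer, and the passive side does not. The Python
sketch below captures only that difference. The function name, the
dictionary keys, and the role strings are hypothetical and do not
correspond to any defined Input Modifier names.

```python
def build_rts_modifiers(role, llp_stream_handle,
                        last_streaming_message=None):
    """Assemble hypothetical Modify QP inputs for Idle to RTS."""
    if role not in ("active", "passive"):
        raise ValueError("role must be 'active' or 'passive'")
    modifiers = {
        "next_state": "RTS",
        "llp_stream_handle": llp_stream_handle,
    }
    if role == "active":
        # The RI recognizes the Active startup sequence by the
        # presence of this buffer, so it is required here.
        if last_streaming_message is None:
            raise ValueError("active side needs a last streaming message")
        modifiers["last_streaming_message"] = last_streaming_message
    elif last_streaming_message is not None:
        # Supplying one on the passive side would likely cause the
        # Remote Peer to lose synchronization with the RDMA Stream.
        raise ValueError("passive side must not supply a message")
    return modifiers
```

Rejecting the invalid combinations before the Verb is invoked keeps
a ULP programming error from surfacing later as a terminated LLP
Stream.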
6.6.2 Connection Teardown
Five types of iWARP and LLP connection teardown mechanisms are
supported:
* A normal close is an LLP Close that finishes with no errors (see
Section 6.2.5 - Closing State, for a list of possible errors).
This is used when the Consumers on both sides of the connection
have sent their last message and wish to close the LLP Stream
(see Section 6.6.2.1 - Normal Close).
* A ULP initiated Termination is used when the ULP desires to
perform an LLP Close with an error message to the Associated QP
(see Section 6.6.2.2 - ULP Initiated Termination).
* A ULP initiated Abortive Teardown is used when the ULP wishes to
perform an LLP Reset with no error message to the Associated QP
(see Section 6.6.2.3 - ULP Initiated Abortive Teardown).
* Remote Termination occurs when the RI receives a Terminate
Message from the Associated QP, and the LLP Close process has
begun (see Section 6.6.2.4 - Remote Termination).
* Local Termination, Local Abortive Teardown and Remote Abortive
Teardown occur when the RI or LLP Stream detects an error and a
Terminate Message is sent prior to an LLP Close or an LLP Reset
is initiated (see Section 6.6.2.5 - Local Termination, Local
Abortive Teardown and Remote Abortive Teardown).
Sections 6.6.2.1 through 6.6.2.5 provide informative examples of
methods for the ULP to terminate an RDMA Stream. Other
implementations are possible.
6.6.2.1 Normal Close
A normal close is provided as a mechanism for the ULP to cease
activity, flush any receive buffers that have been posted to the RQ,
and disassociate the LLP Stream from the QP. It requires that no
errors occur during the close process. If an error occurs, the close
becomes an abnormal close, which causes the QP to transition to the
Error state.
The Consumer initiates a normal close, either locally or remotely,
when both sides of an LLP Stream agree to the close.
When the Consumer desires a normal close, the following items must
be done:
1. The Consumer waits for all outstanding Work Requests on the Send
Queues on both sides of the LLP Stream to be Completed. Note:
the Completion on the remote WQ can be inferred by the arrival
of a SEND message from the ULP that indicates that it intends to
do no more work.
2. One of the Consumers moves the QP state to Closing with the
Modify QP Verb, resulting in the following actions:
o If any WQEs are present on the Send Queue, or if any RDMA
Read Operations are incomplete on the IRRQ, an error will
result (for more information, see Section 6.2.5 - Closing
State).
o The RI stops QP processing and flushes all incomplete WQEs
on the Receive Queue by Completing them with the Flushed
Completion Status.
o The RI performs an LLP Close. If this QP was using the last
LLP Stream on the LLP Connection, the RI closes the LLP
Connection.
o When the LLP Close actions are complete, the RI
automatically moves the QP to the Idle state and an
Affiliated Asynchronous Event: "LLP Close Complete" is
created.
3. The Consumer may re-use the QP for a new LLP Stream or it may
destroy the QP (see Section 6.1.3 - Modifying Queue Pair
Attributes and Section 6.1.4 - Destroying a Queue Pair).
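The locally initiated sequence above can be modeled as a short state
walk. The Python sketch below is illustrative only. The QpModel type
and its field names are hypothetical, the LLP Close itself is
reduced to a comment, and the error path is simplified to a direct
move to the Error state (Section 6.2.5 gives the actual rules).

```python
from dataclasses import dataclass, field


@dataclass
class QpModel:
    """Hypothetical stand-in for the QP state relevant to a close."""
    state: str = "RTS"
    send_queue: list = field(default_factory=list)
    irrq: list = field(default_factory=list)
    receive_queue: list = field(default_factory=list)
    completions: list = field(default_factory=list)
    events: list = field(default_factory=list)


def normal_close(qp):
    """Model Modify QP to Closing; return True if the close was normal."""
    if qp.send_queue or qp.irrq:
        # Outstanding SQ WQEs or incomplete inbound RDMA Reads make
        # this an abnormal close.
        qp.state = "Error"
        return False
    qp.state = "Closing"
    # Flush every incomplete Receive Queue WQE.
    for wqe in qp.receive_queue:
        qp.completions.append((wqe, "Flushed"))
    qp.receive_queue.clear()
    # The RI would perform the LLP Close here.
    qp.state = "Idle"
    qp.events.append("LLP Close Complete")
    return True
```

After a successful call the QP is in the Idle state with its receive
buffers returned as Flushed Completions, ready for re-use or
destruction as described in step 3.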
The normal close may also be initiated remotely (e.g. for TCP a FIN
segment is received). If the Send Queue is empty and the IRRQ is
empty, the RI moves the QP state to the Closing state and an
Asynchronous Event: "LLP Close Complete" will be generated. If this
is the last LLP Stream, the LLP Connection will be closed.
< Figure 15 did not convert properly from source >
< to be corrected in an upcoming version >
Figure 15 - Normal Close on TCP
6.6.2.2 ULP Initiated Termination
A ULP initiated Termination is usually used when the Consumer (such
as the OS) detects an error. The ULP needs to perform an LLP Close,
but would like to let the Remote Peer know that an error occurred.
Note that a ULP initiated termination may entail loss of data.
When the ULP desires a ULP initiated Termination, the following
items must be done:
1. The Consumer modifies the QP to the Terminate state.
o Before returning from the Modify QP -> Terminate, the RI
stops QP processing, formats a Terminate Message containing
the termination code: "Local Catastrophic Error" and sends
it to the Remote Peer.
o The RI performs an LLP Close. If the LLP cannot deliver the
Terminate Message, an LLP Reset is performed, and the RI
generates an Asynchronous Error Event: "Bad Close".
2. After returning from the Modify QP -> Terminate, the Consumer
waits for the QP to automatically be moved to the Error state.
This is signaled by an Asynchronous Error Event: "Error State
Entered".
3. Once in the Error state, the RI flushes all incomplete WQEs on
both the Send and Receive Queues by completing them with the
Flushed Completion Status. The Consumer would presumably reap
all of the Work Completions to ensure all resources are cleaned
up. Once the Consumer believes all Work Completions have been
reaped, it should attempt to transition the QP to the Idle state
by performing a Modify QP. If the transition is successful, the
Consumer knows it can either re-use the QP for another LLP
Stream or call Destroy QP (see Section 6.1.3 - Modifying Queue
Pair Attributes and Section 6.1.4 - Destroying a Queue Pair). If
the Modify QP returns with an error (presumably because Work
Requests are still being flushed), the Consumer must try at a
later time to transition to the Idle state. The Consumer might
arm a timeout. If the Consumer is unable to transition to the
Idle state after some amount of time, it should destroy the QP
(presumably because the QP cannot recover from an internal
error).
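The reap-and-retry behavior described in step 3 can be sketched as a
loop with a timeout. The Python function below is illustrative only.
Its name and parameters are hypothetical, and the three callables
stand in for polling the CQ, invoking Modify QP toward Idle
(returning True on success), and invoking Destroy QP.

```python
import time


def recover_qp(reap_completions, modify_qp_to_idle, destroy_qp,
               timeout_s=1.0, retry_interval_s=0.01,
               clock=time.monotonic, sleep=time.sleep):
    """Try to return an Error state QP to Idle; destroy it on timeout.

    Returns "idle" if the transition succeeded, "destroyed" otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        reap_completions()          # drain flushed Work Completions
        if modify_qp_to_idle():     # may fail while WRs are flushing
            return "idle"
        if clock() >= deadline:
            destroy_qp()            # QP presumably cannot recover
            return "destroyed"
        sleep(retry_interval_s)
```

The clock and sleep parameters exist only so the loop can be driven
deterministically in a test; a Consumer would normally leave them at
their defaults.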
6.6.2.3 ULP Initiated Abortive Teardown
A ULP initiated Abortive Teardown is usually used when the Consumer
(such as the OS) detects an error, and the ULP needs to tear down
the entire LLP Stream immediately (i.e. perform an LLP Reset). Note
that a ULP initiated abortive teardown may entail loss of data.
When the ULP desires a ULP initiated Abortive Teardown,
the following items must be done:
1. The Consumer modifies the QP to the Error state.
o The RI stops QP processing and performs an LLP Reset.
2. Once in the Error state, the RI flushes all incomplete WQEs on
both the Send and Receive Queues by completing them with the
Flushed Completion Status. The Consumer would presumably reap
all of the Work Completions to ensure all resources are cleaned
up. Once the Consumer believes all Work Completions have been
reaped, it should attempt to transition the QP to the Idle state
by performing a Modify QP. If the transition is successful, the
Consumer knows it can either re-use the QP for another LLP
Stream or it can invoke Destroy QP (see Section 6.1.3 -
Modifying Queue Pair Attributes and Section 6.1.4 - Destroying a
Queue Pair). If the Modify QP returns with an error (presumably
because Work Requests are still being flushed), the Consumer
must try at a later time to transition to the Idle state. The
Consumer might arm a timeout. If the Consumer is unable to
transition to the Idle state after some amount of time, it
should destroy the QP (presumably because the QP can not recover
from an internal error).
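The Consumer-side recovery described in step 2 can be sketched as a
reap-and-retry loop. The Python below is only an illustration under
stated assumptions: FakeQP, poll_cq, modify_to_idle, destroy and
recover_qp are hypothetical names standing in for Poll CQ, Modify QP
and Destroy QP, and the timeout and retry delay are arbitrary values
that this document does not specify.

```python
import time


class FakeQP:
    """Hypothetical stand-in for a QP that is being flushed by the RI."""

    def __init__(self, flushed_completions, busy_attempts):
        self.state = "Error"
        self.pending = list(flushed_completions)
        # Number of Modify QP attempts that fail while flushing continues.
        self.busy_attempts = busy_attempts

    def poll_cq(self):
        # Return one flushed Work Completion, or None if none is pending.
        return self.pending.pop(0) if self.pending else None

    def modify_to_idle(self):
        # Fails (returns False) while Work Requests are still being flushed.
        if self.pending:
            return False
        if self.busy_attempts > 0:
            self.busy_attempts -= 1
            return False
        self.state = "Idle"
        return True

    def destroy(self):
        self.state = "Destroyed"


def recover_qp(qp, timeout_s=1.0, retry_delay_s=0.01):
    """Reap completions, then retry Error -> Idle until a timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        while qp.poll_cq() is not None:
            pass  # a real Consumer would release the WR's resources here
        if qp.modify_to_idle():
            return "idle"  # the QP can now be re-used or destroyed
        if time.monotonic() >= deadline:
            qp.destroy()  # assume the QP cannot recover
            return "destroyed"
        time.sleep(retry_delay_s)
```

The same loop would apply to the final step of the other teardown
sequences in Section 6.6.2, since each ends with the identical flush
and Modify QP procedure.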
6.6.2.4 Remote Termination
Remote Termination occurs when the Associated QP sends a Terminate
Message to the Local Peer. Note that remote termination may entail
loss of data.
When the Remote Peer sends a Terminate Message, and it is locally
received, the following sequence occurs:
1. The RI stops QP processing.
2. The RI moves the QP automatically to the Terminate state. The RI
then generates an Asynchronous Error Event: "Terminate Message
Received".
3. The RI performs an LLP Close, or if an LLP final timeout occurs,
an LLP Reset.
4. The RI moves the QP to the Error state.
5. Once in the Error state, the RI flushes all incomplete WQEs on
both the Send and Receive Queues by completing them with the
Flushed Completion Status. The Consumer would presumably reap
all of the Work Completions to ensure all resources are cleaned
up. Once the Consumer believes all Work Completions have been
reaped, it should attempt to transition the QP to the Idle state
by performing a Modify QP. If the transition is successful, the
Consumer knows it can either re-use the QP for another LLP
Stream or it can invoke Destroy QP (see Section 6.1.3 -
Modifying Queue Pair Attributes and Section 6.1.4 - Destroying a
Queue Pair). If the Modify QP returns with an error (presumably
because Work Requests are still being flushed), the Consumer
must try at a later time to transition to the Idle state. The
Consumer might arm a timeout. If the Consumer is unable to
transition to the Idle state after some amount of time, it
should destroy the QP (presumably because the QP can not recover
from an internal error).
6.6.2.5 Local Termination, Local Abortive Teardown and Remote Abortive
Teardown
iWARP defines an abortive teardown mechanism which is invoked if a
catastrophic iWARP error is encountered locally. iWARP attempts to
send a Terminate Message, but depending upon the condition of the
LLP, it is possible a Terminate Message can not be sent or can not
be successfully delivered to the Associated QP. If an LLP Stream
error occurs, it is possible for the LLP Stream or LLP Connection to
be torn down a) before iWARP is aware of the error, b) before iWARP
is able to send the Terminate Message, or c) after iWARP has posted
the Terminate Message to the LLP, but while it is still in the LLP
send queue. Thus the Consumer at the Remote Peer may or may not be able
to retrieve a valid Terminate reason for some forms of abortive
teardown. The Consumer at the Remote Peer can retrieve the Terminate
Message, if available, using the Query QP when the QP has
transitioned to the Error state. The Consumer at the Local Peer
should always be able to retrieve the Terminate Message that was
sent (if the QP transitioned through the Terminate state),
regardless of whether it was successfully delivered to the Remote
Peer.
Note that an abortive teardown may entail loss of data. The RI will
complete all outstanding (incomplete) iWARP messages in error. In
general, when an abortive teardown occurs it is impossible to tell
for sure what iWARP messages were successfully placed and delivered
at the Remote Peer. Thus even completed messages on the Send Queue
should be treated as incomplete unless a ULP Acknowledge has been
received. Note that Completed RDMA Read Type Work Requests act as a
ULP Acknowledgement, in that any prior RDMA Write Messages, Send
Type Messages, RDMA Read Operations and the RDMA Read Request
Message itself are required to have arrived at the Remote Peer
before the RDMA Read Response Message can be generated at the Remote
Peer to Complete the RDMA Read Type Work Request.
When iWARP detects a local error the following items are done:
1. If the LLP Stream is still functional, the RI moves the QP to
the Terminate state. If the error was not reported in a CQE, the
RI generates an Asynchronous Error Event, with an appropriate
error code (see 8.3.3 - Asynchronous Errors). Then the RI stops
QP processing.
If the LLP Stream is not functional, the RI performs an LLP
Reset and moves the QP to the Error state. If the error was not
reported in a CQE, the RI generates an Asynchronous Error Event,
with an appropriate error code (see 8.3.3 - Asynchronous
Errors). The RI skips steps 2 and 3 below.
2. The RI formats a Terminate Message with an appropriate
termination error code and sends it to the Remote Peer.
3. The RI performs an LLP Close. If the LLP could not successfully
perform the LLP Close (e.g. for TCP, transitioning through the
normal closing states incurred a final timeout), an LLP Reset
occurs. Once either the LLP Close or LLP Reset is finished, the
RI transitions the QP to the Error state.
4. Once in the Error state, the RI flushes all incomplete WQEs on
both the Send and Receive Queues by completing them with the
Flushed Completion Status. The Consumer would presumably reap
all of the Work Completions to ensure all resources are cleaned
up. Once the Consumer believes all Work Completions have been
reaped, it should attempt to transition the QP to the Idle state
by performing a Modify QP. If the transition is successful, the
Consumer knows it can either re-use the QP for another LLP
Stream or it can invoke Destroy QP (see Section 6.1.3 -
Modifying Queue Pair Attributes and Section 6.1.4 - Destroying a
Queue Pair). If the Modify QP returns with an error (presumably
because Work Requests are still being flushed), the Consumer
must try at a later time to transition to the Idle state. The
Consumer might arm a timeout. If the Consumer is unable to
transition to the Idle state after some amount of time, it
should destroy the QP (presumably because the QP can not recover
from an internal error).
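The branching in steps 1 through 4 can be summarized as an ordered
list of RI actions. The following Python is a sketch for illustration
only: the function name, its parameters and the action strings are
hypothetical labels, not identifiers defined by this specification.

```python
def local_error_sequence(llp_functional, reported_in_cqe,
                         close_times_out=False):
    """Return the ordered RI actions for a locally detected iWARP error."""
    actions = []
    if llp_functional:
        # Step 1, functional LLP Stream.
        actions.append("QP -> Terminate")
        if not reported_in_cqe:
            actions.append("Asynchronous Error Event")
        actions.append("stop QP processing")
        # Step 2.
        actions.append("send Terminate Message")
        # Step 3: LLP Close, falling back to LLP Reset on a final timeout.
        actions.append("LLP Close")
        if close_times_out:
            actions.append("LLP Reset")
        actions.append("QP -> Error")
    else:
        # Step 1, non-functional LLP Stream; steps 2 and 3 are skipped.
        actions.append("LLP Reset")
        actions.append("QP -> Error")
        if not reported_in_cqe:
            actions.append("Asynchronous Error Event")
    # Step 4.
    actions.append("flush WQEs with Flushed Completion Status")
    return actions
```

For example, local_error_sequence(False, True) yields only the LLP
Reset, the transition to the Error state and the flush.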
Figure 16 is an example of how the abortive teardown might occur.
Other sequences of events are possible. For example, the TCP FIN
could be sent in a separate TCP segment. As another example, the
Remote Peer RI might not transition from the Terminate state when
the LLP can no longer be used for data transmission (i.e. the TCP
FIN ACK segment is sent). Instead it waits for the TCP finite state
machine to reach the Closed state. If the latter implementation is
used, QP resources may not be able to be recycled until after TCP
finishes transitioning through the TIME-WAIT state, which takes a
considerable amount of time. See Section 10, Security
Considerations, for potential security issues with this approach.
< Figure 16 did not convert properly from source >
< to be corrected in an upcoming version >
Figure 16 - Abortive Teardown example on TCP
7 Memory Management
7.1 Memory Management Overview
There are two basic methods for enabling memory to be accessed by an
RNIC. These are Memory Regions and Memory Windows. Memory Regions
are used to assign an STag to a Physical Buffer List, associate it
with a starting Tagged Offset and length, and assign it Memory
Access Rights. Memory Windows are used to assign an STag to a
portion, or window, of a Memory Region.
Fundamental to Memory Management is the definition of an STag (see
Section 7.2 - Steering Tag (STag)) and the Tagged Offset (TO)
associated with it (see Section 7.3.1.1 - Memory Region Tagged
Offset (TO) and Section 7.6.1 - Addressing Registered Memory). Also
fundamental is the concept of a Physical Buffer List (PBL), which
contains the physical address mappings for the memory used in the
Memory Region, as discussed in Section 7.6.2 - Physical Buffer
Lists.
An STag can be associated with either a Memory Region or a Memory
Window. While both Memory Regions and Memory Windows can be used for
data transfer operations, they differ with respect to the Verbs used
to manipulate them. These distinctions are covered in great detail
in this section.
There are three mechanisms for associating a Memory Region's STag
with a Physical Buffer List. A Consumer can allocate an STag with
the PBL in one step, as is done with RI-Register Non-Shared Memory
Region. A Consumer can also allocate an STag and then use a Fast-
Register WR to associate the PBL with the STag. Finally, a Consumer
can create a new STag that is associated with an existing Memory
Region through the Register Shared Memory Region. For more
information on Memory Region creation, see Section 7.3.2 - Memory
Region Creation and Registration.
There are two types of Memory Regions. These are Non-Shared MR and
Shared MR. A Non-Shared MR has a PBL that is not shared with other
MRs. A Shared MR has a PBL that may be shared with other MRs. A Non-
Shared MR becomes a Shared MR through the Register Shared Memory
Region operation. For more information on Shared Memory Regions, see
Section 7.3.2.4 - Register Shared Memory Region. MR (without any
qualifiers) is used to refer to both Non-Shared MR and Shared MRs.
Before use, Memory Windows must first be allocated and then Bound to
a Memory Region. The allocation is an RI Verbs call, but the Bind
operation is a WR. For more information on Memory Windows, see
Section 7.10 - Memory Windows.
Memory registration enables access to a Memory Region by a specific
RNIC. Binding a Memory Window enables the specific RNIC to access
memory represented by that Memory Window. STags are specific to an
RNIC and the RI is NOT REQUIRED to grant access to the Memory Region
by other local RNICs.
Mechanisms are provided for Re-registering Non-Shared Memory
Regions. These are discussed in Section 7.3.2.3 - RI-Reregister
Non-Shared Memory Region. In addition, the Verbs provide mechanisms
for Registering Memory Regions which share PBL mappings. These are
discussed in Section 7.3.2.4 - Register Shared Memory Region.
Architecturally, only Bind Memory Window and Fast-Register Non-
Shared Memory Region are anticipated to be optimized for
performance. The rest of the Memory Registration mechanisms are not
anticipated to be performance optimized.
All Memory Regions MUST have Access Rights associated with them to
indicate if local read, local write, remote read and remote write
accesses are allowed. This is discussed in Section 7.4 - Access to
Registered Memory. All Memory Windows MUST have Access Rights
associated with them to indicate if remote read and remote write
accesses are allowed. This is discussed in Section 7.4 - Access to
Registered Memory.
Non-Shared Memory Regions and Memory Windows have to be invalidated
before they can have their PBL associations changed. This has other
benefits as well, such as preventing remote accesses using that
STag. This is discussed in Section 7.8 - Invalidating Memory Regions
and 7.10.4 - Invalidating or De-allocating Memory Windows.
The RI also provides Verbs for retrieving STag attributes, as
discussed in Section 7.7 - Querying Memory Regions and 7.10.3 -
Querying Memory Windows. The Verbs also define the destruction and
deallocation of Memory Windows and Memory Regions in Section 7.9 -
Deallocation of STag associated with a Memory Region and in Section
7.10.4 - Invalidating or De-allocating Memory Windows, respectively.
7.2 Steering Tag (STag)
All local and remote memory accesses through the Verbs require the
use of an STag. For local access, the STag, along with a Tagged
Offset (TO) is used by the RI, when processing a Work Request's SGE,
to identify a memory location within a specific Memory Region. For
remote access, the STag, along with a TO, is used by the RI when
handling RDMA operations to identify a memory location within a
specific Memory Region or Memory Window.
An STag is a 32-bit identifier that has two sub-fields: a Consumer
provided STag Key and an RI provided STag Index. The STag Key is the
8 least significant bits of the STag. The STag Index is the 24 most
significant bits of the STag.
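The field layout above can be illustrated with a small pack and
unpack sketch. The function names make_stag and split_stag are
hypothetical; only the bit positions (STag Key in the 8 least
significant bits, STag Index in the 24 most significant bits) come
from the text.

```python
STAG_KEY_BITS = 8
STAG_KEY_MASK = (1 << STAG_KEY_BITS) - 1   # 0xFF
STAG_INDEX_MAX = (1 << 24) - 1             # 24-bit STag Index


def make_stag(stag_index, stag_key):
    """Combine an RI provided index and a Consumer provided key."""
    if not 0 <= stag_index <= STAG_INDEX_MAX:
        raise ValueError("STag Index must fit in 24 bits")
    if not 0 <= stag_key <= STAG_KEY_MASK:
        raise ValueError("STag Key must fit in 8 bits")
    return (stag_index << STAG_KEY_BITS) | stag_key


def split_stag(stag):
    """Return (stag_index, stag_key) for a 32-bit STag."""
    return stag >> STAG_KEY_BITS, stag & STAG_KEY_MASK
```

With these definitions, an STag Index of 0x123456 and an STag Key of
0xAB combine into the 32-bit STag 0x123456AB.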
The 8 bit STag Key is provided by the Consumer. The Consumer can use
the STag Key in any way it desires. For example, it can be used as
an incrementing value to help discover application errors by using a
different value with each registration. As a general rule, the
Consumer provides the STag Key to the RI whenever the consumer
causes the transition of an STag to the Valid state, or when the
STag is being Invalidated. In the Invalid state, only the STag Index
is meaningful.
There is no default value for the STag Key. The RI MUST use the STag
Key provided by the Consumer for the following Verbs:
* Register Non-Shared Memory Region,
* Register Shared Memory Region,
* Reregister Non-Shared Memory Region,
* PostSQ Verb Fast-Register Non-Shared Memory Region operation,
* PostSQ Verb Bind operation, and
* PostSQ Invalidate Local STag.
The RI MUST return the value of the STag Index sub-field on an
invocation of the following:
* Allocate Non-Shared Memory Region STag,
* Allocate Memory Window,
* Register Non-Shared Memory Region,
* Register Shared Memory Region, and
* Reregister Non-Shared Memory Region.
The RI MUST use the same STag Index sub-field as was passed in by
the Consumer, on an invocation of the following:
* Query Memory Region,
* Query Memory Window,
* Register Shared Memory Region,
* Reregister Non-Shared Memory Region,
* PostSQ Fast-Register Non-Shared Memory Region,
* PostSQ Bind Memory Window,
* PostSQ Invalidate Local STag, and
* Deallocate STag.
Implementation Note: To guarantee that the immediately previous STag
is no longer valid, the Consumer may change the STag Key field each
time the STag is bound. The use of a suitable random number with
each binding can provide a valuable interface check and diagnostic
tool.
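One way a Consumer might follow this note is to advance the STag Key
on every binding, so that a stale STag carries a key that no longer
matches. The sketch below uses an incrementing key for determinism;
the class and method names are hypothetical. Because the key is only
8 bits wide, it repeats after 256 bindings, so this should be read as
a diagnostic aid rather than a strong guarantee.

```python
class StagKeyRotator:
    """Produce a fresh 32-bit STag for each binding of one STag Index."""

    def __init__(self, stag_index, start_key=0):
        self.stag_index = stag_index
        self.key = start_key & 0xFF

    def next_stag(self):
        # Advance the 8-bit key, wrapping from 0xFF back to 0x00.
        self.key = (self.key + 1) & 0xFF
        return (self.stag_index << 8) | self.key
```

Two consecutive calls to next_stag return STags with the same STag
Index but different STag Keys.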
7.2.1 STag of zero
The STag of zero (STag with a value of zero) is a special STag. It
has a fixed value for the STag Index and STag Key. The STag Key is
composed of all zeros and the STag Index is composed of all zeros.
It has no PD associated with it and it cannot be used for Remote
Access operations.
The purpose of an STag of zero is to allow Privileged Mode Consumers
to be able to reference a Physical Buffer in a WR without first
registering the buffer with the RI. This approach has the advantage
of reduced overhead. It has the potential disadvantage that each
SGE using the STag of zero must describe a physically contiguous
buffer. A buffer that is not physically contiguous can be
represented by multiple SGEs, one per contiguous piece, but all
SGLs have a finite limit on the number of entries allowed by the
RI. If an SGE describes a buffer that is not physically contiguous,
any access to the non-existent memory may result in an access error.
Using an STag of zero as part of a Scatter/Gather Element tells the
RNIC that it MUST interpret the TO portion of the SGE as a physical
address on the local node. Note that the RI MUST NOT generate an STag
Index of zero. The RI MUST NOT allow the Consumer to associate an
STag Key with the STag of zero.
The STag of zero has the following semantics, which are different
than the semantics of any other STag:
1. The RNIC MUST NOT perform any PD checks on an STag of zero.
2. When accessing an STag of zero on a given QP, the RNIC MUST
assure access to the STag of zero is enabled on that QP. If
allowing an STag of zero is not enabled, then the operation MUST
result in a protection error.
3. The RNIC MUST NOT permit any remote access that references STag
of zero and any attempt to do so MUST result in a protection
error. The RI MUST grant STag of zero Local Read and Local Write
Access Rights.
4. The RNIC MUST NOT allow Memory Windows to be Bound to STag of
zero. Any attempt to do so MUST result in an error.
5. The RNIC MUST NOT allow a Local or Remote Invalidation of the
STag of zero. Any attempt to do so MUST result in an error. The
STag of zero MUST always be in the Valid state.
6. The RNIC MUST NOT allow an STag of zero to be an input modifier
of an RI-Reregister Non-Shared Memory Region, Register Shared
Memory Region, Query Memory Region, Query Memory Window, Bind
Memory Window, Deallocate STag, Invalidate STag or Fast-Register
and MUST return an Immediate Error if a Consumer attempts to do
so.
7. The RI MUST NOT return a value of zero as an STag Index for RI-
Register Non-Shared Memory Region, RI-Reregister Non-Shared
Memory Region, Register Shared Memory Region, Allocate Non-
Shared Memory Region STag and Allocate Memory Window.
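Semantics 1 through 3 above can be sketched as a single access check.
This is an illustrative model only: resolve_sge, ProtectionError and
the parameter names are hypothetical, and the branch for non-zero
STags is reduced to a bare PD comparison as a placeholder for the
full set of checks described elsewhere in this document.

```python
class ProtectionError(Exception):
    """Raised when an access fails a protection check."""


def resolve_sge(stag, to, *, remote, qp_stag_zero_enabled,
                stag_pd=None, qp_pd=None):
    """Classify how the TO of an access would be interpreted."""
    if stag == 0:
        if remote:
            raise ProtectionError("STag of zero cannot be used remotely")
        if not qp_stag_zero_enabled:
            raise ProtectionError("STag of zero not enabled on this QP")
        # No PD check; the TO is a physical address on the local node.
        return ("physical", to)
    # Simplified stand-in for the normal checks on any other STag.
    if stag_pd != qp_pd:
        raise ProtectionError("PD mismatch")
    return ("tagged", to)
```

A local access with the STag of zero on an enabled QP is returned as
a physical address without any PD comparison; a remote access with
the STag of zero always raises ProtectionError.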
7.2.2 Summary of Memory Region STag States
The STag associated with a Non-Shared Memory Region has two states.
They are Invalid and Valid. Memory accesses MUST NOT be allowed if
the STag is in the Invalid state.
Figure 17 below is the Memory Region and Memory Window state
diagram. It indicates the state transitions required to move
Memory Regions and Memory Windows between the Valid state and
the Invalid state. In addition, it denotes the effects of the
Register Shared Memory Region Verb on a Memory Region.
< Figure 17 did not convert properly from source >
< to be corrected in an upcoming version >
Figure 17 - Memory Region and Window State Diagram
For a Non-Shared Memory Region, the following bulleted list
indicates each state, whether memory access is allowed in that
state, and which Verbs are used to enter and exit it.
* Invalid - May not be used to access a memory location.
o Entered through: Allocate Non-Shared Memory Region STag,
PostSQ Invalidate STag, incoming Send with Invalidate STag
Message, incoming Send with Solicited Event and Invalidate
STag Message, or local RDMA Read with Invalidate Local STag
WR.
o Exited through: RI-Register Non-Shared Memory Region, RI-
Reregister Non-Shared Memory Region, Fast-Register Non-
Shared Memory Region WR, or Deallocate STag.
* Valid - May be used to access a memory location.
o Entered through: RI-Register Non-Shared Memory Region, RI-
Reregister Non-Shared Memory Region, Fast-Register Non-
Shared Memory Region WR.
o Exited through: PostSQ Invalidate STag, incoming Send with
Invalidate STag Message, incoming Send with Solicited Event
and Invalidate STag Message, local RDMA Read with Invalidate
Local STag WR, or Deallocate STag.
Note: Deallocate STag exits the state logic captured above, as does
RI-Reregister Non-Shared Memory Region (if a different STag is
returned).
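The Non-Shared Memory Region states listed above can be modeled as a
small transition table. This is a sketch under assumptions: the event
names are hypothetical shorthand ("invalidate" covers PostSQ
Invalidate STag, the incoming Send with Invalidate variants and RDMA
Read with Invalidate Local STag), and "Deallocated" represents
leaving the state logic. Fast-Register is allowed only from the
Invalid state, following Section 7.3.2.5.

```python
TRANSITIONS = {
    ("Invalid", "reregister"): "Valid",
    ("Invalid", "fast_register"): "Valid",
    ("Invalid", "deallocate"): "Deallocated",
    ("Valid", "reregister"): "Valid",
    ("Valid", "invalidate"): "Invalid",
    ("Valid", "deallocate"): "Deallocated",
}

# State in which each creating Verb leaves a new STag.
INITIAL_STATE = {
    "allocate": "Invalid",   # Allocate Non-Shared Memory Region STag
    "register": "Valid",     # RI-Register Non-Shared Memory Region
}


def next_state(state, event):
    """Return the new STag state, or raise if the event is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed in state {state!r}")


def access_allowed(state):
    """Memory accesses are permitted only in the Valid state."""
    return state == "Valid"
```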
The STag associated with a Shared Memory Region MUST always be in
the Valid state. Note that the Register Shared Memory Region Verb
does two things - it returns a new Shared Memory Region STag for an
existing Memory Region's Physical Buffer List (either Shared or Non-
Shared), and if the input STag is for a Non-Shared MR, the Non-
Shared MR is permanently converted into a Shared MR (See Section
7.3.2.4 - Register Shared Memory Region). The following bulleted
list indicates what Verbs are used to enter and exit the Valid state
for a Shared Memory Region.
* Valid - May be used to access a memory location.
o Entered through: Register Shared Memory Region.
o Exited through: Deallocate STag.
Note: Deallocate STag of a Non-Shared MR MUST exit the state logic
captured above.
7.3 Memory Registration
Memory Registration provides mechanisms that allow Consumers to
describe a set of virtually contiguous memory locations or a set of
physically contiguous memory locations to the RI in order to allow
the RNIC to access either as a virtually contiguous buffer using the
STag and Tagged Offset.
Memory Registration provides the RNIC with a mapping between an STag
and Tagged Offset and a Physical Memory Address. It also provides
the RNIC with a description of the access control associated with
the memory location.
Before using a data buffer with the RI, all Consumers MUST
explicitly register with the RI the memory locations associated with
the data buffer, except when using an STag of zero. Local or remote
attempts to access unregistered memory MUST result in a protection
error. Thus every WR simply uses an STag, TO and length to reference
a buffer.
Memory Registration MAY fail due to the RNIC's inability to find
resources to hold information needed by the RNIC to record the
registration. Memory MUST NOT be registered in this case and MUST
NOT consume any RI resources if the Registration fails.
7.3.1 Memory Regions
A set of memory locations that have been registered are referred to
as a Memory Region (MR).
The RNIC uses two values to identify a memory location within a
Memory Region: Steering Tag (STag) and Tagged Offset (TO).
7.3.1.1 Memory Region Tagged Offset (TO)
The base of the TO field is specified by the Consumer when the
Memory Region is registered through RI-Register Non-Shared Memory
Region, RI-Reregister Non-Shared Memory Region, or Fast-Register
Non-Shared Memory Region. Two bases MUST be supported by the RNIC:
Virtual Address (VA) based TO and zero based TO. For a VA based TO,
the TO of the first memory location associated with the Memory
Region equals the VA value passed as an input modifier of the Verb
or WR used to register the Memory Region. For a zero based TO, the
TO of the first memory location associated with the Memory Region
equals zero.
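The difference between the two bases can be shown by converting a TO
into a byte offset from the first memory location of the Memory
Region. The function and parameter names below are hypothetical, and
the bounds check is an added assumption based on a WR referencing a
buffer by STag, TO and length.

```python
def region_offset(to, length, region_length, va_base=None):
    """Map a Tagged Offset to a byte offset from the start of the MR.

    va_base=None models a zero based TO; otherwise the TO is VA based
    and va_base is the VA supplied when the region was registered.
    """
    base = 0 if va_base is None else va_base
    offset = to - base
    if length < 0 or offset < 0 or offset + length > region_length:
        raise ValueError("access falls outside the Memory Region")
    return offset
```

For a region registered with a VA of 0x7F001000, a VA based TO of
0x7F001010 and a zero based TO of 0x10 both refer to the byte at
offset 0x10 in the region.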
7.3.2 Memory Region Creation and Registration
Before the RNIC can use a Memory Region, the resources associated
with a Memory Region must be allocated and the Memory Region must be
registered with the RNIC. The RI defines the following mechanisms
for providing these functions through the Verbs interface: Allocate
Non-Shared Memory Region STag, Register Shared Memory Region, RI-
Register Non-Shared Memory Region, RI-Reregister Non-Shared Memory
Region, and Fast-Register Non-Shared Memory Region.
When registering a Memory Region, the Consumer specifies whether
Memory Windows may be Bound to the Memory Region or not.
7.3.2.1 Allocate Non-Shared Memory Region STag
This Verb allocates memory registration resources in the RI. When
the Verb completes, the STag Index will be allocated as described
below and provided as an output modifier.
When allocating an STag:
* the RI MUST verify the Consumer specified maximum Physical
Buffer List Size is less than or equal to the size allowed by
the RI. The RI MUST return the Physical Buffer List (PBL) size
allocated, which MUST be greater than or equal to the size
requested. The RI MUST also return the allocated STag Index. If
the Consumer specified a maximum PBL Size greater than the size
allowed by the RI, the RI MUST return an Immediate Error.
* the RI MUST verify and use the Consumer specified Input Modifier
called the Remote Access Flag to indicate if Remote Access is
enabled with the STag. If the Remote Access Flag is enabled, the
RI MUST be able to allow remote reads or remote writes that
reference the STag. Otherwise, the RI MUST NOT allow the STag to
be used in remote read or remote write operations.
An STag created through the Allocate Non-Shared Memory Region STag
Verb MUST be able to be used in an RI-Reregister or a Fast-Register
Non-Shared Memory Region.
When the Allocate Non-Shared Memory Region STag Verb returns control
to the Consumer and the Verb has completed successfully, the
returned STag is in the Invalid state. The STag MUST be placed in
the Valid state before it can be used by a local or remote operation
to access a memory location. See Section 7.2.2 - Summary of Memory
Region STag States for the requirements on transitioning the STag to
the Valid state.
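The allocation rules above can be sketched with a toy RI model. The
names FakeRI, allocate_nsmr_stag and ImmediateError are hypothetical,
and the model returns exactly the requested PBL size even though an
RI is permitted to return a larger one.

```python
class ImmediateError(Exception):
    """Hypothetical error type for a Verb that fails immediately."""


class FakeRI:
    """Toy model of the STag allocation bookkeeping in an RI."""

    def __init__(self, max_pbl_entries):
        self.max_pbl_entries = max_pbl_entries
        self.stags = {}
        self._next_index = 1  # an STag Index of zero is never generated

    def allocate_nsmr_stag(self, max_pbl_size, remote_access):
        """Return (stag_index, allocated_pbl_size) for a new STag."""
        if max_pbl_size > self.max_pbl_entries:
            raise ImmediateError("requested PBL size exceeds the RI limit")
        index = self._next_index
        self._next_index += 1
        self.stags[index] = {
            "state": "Invalid",              # not yet usable for access
            "pbl_size": max_pbl_size,        # >= the requested size
            "remote_access": bool(remote_access),
        }
        return index, max_pbl_size
```

The STag recorded by this model stays in the Invalid state until a
later registration operation moves it to the Valid state.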
For a description of the Verb which Allocates an STag, see Section
9.2.6.1 - Allocate Non-Shared Memory Region STag.
7.3.2.2 RI-Register Non-Shared Memory Region
When the RI-Register Non-Shared Memory Region Verb returns, it has
allocated the appropriate memory registration resources on the RNIC
and has registered a Non-Shared Memory Region. When the RI-Register
Non-Shared Memory Region Verb is invoked:
* The RI MUST accept and use any STag Key passed in by the
Consumer for the Memory Registration.
* The RI MUST use the Physical Buffer List passed in by the
Consumer.
* The RI MUST verify and use the Consumer specified modifier which
indicates if Remote Access is enabled with the STag. If Remote
Access is enabled, the RI MUST allow remote reads or remote
writes that reference the STag. Otherwise, the RI MUST NOT allow
the STag to be used in remote read or remote write operations.
When the RI-Register Non-Shared Memory Region Verb completes
successfully:
* the RI MUST have Registered the Non-Shared Memory Region with
the RNIC,
* the RI MUST return the STag Index associated with the Non-Shared
Memory Region to the Consumer,
* the RI MUST return the number of Physical Buffer List Entries in
the allocated Physical Buffer List, which may be larger than the
requested size, and
* the returned STag MUST be in the Valid state.
See Section 9.2.6.2 - Register Non-Shared Memory Region (RI-
Register) for a description of the RI-Register Non-Shared Memory
Region Verb.
7.3.2.3 RI-Reregister Non-Shared Memory Region
This Verb conceptually performs the functional equivalent of
Deallocate STag followed by RI-Register Non-Shared Memory Region.
Where possible, resources below the Verb layer are expected to be
reused instead of deallocated and reallocated. This Verb may be used
to change the Access Rights and/or PD ID of a Region, as well as
changing the memory locations that are registered.
When the RI-Reregister Non-Shared Memory Region Verb is invoked:
* The STag MUST be the STag of a Non-Shared Memory Region.
* The STag MUST be in either the Invalid or Valid state.
* The RI MUST accept and use any STag Key passed in by the
Consumer for the Memory Reregistration.
* The RI MUST ensure that no Memory Windows are Bound to the STag
Index passed in by the Consumer. If any Memory Windows are Bound
to it, an Immediate Error is returned.
* The STag passed in by the Consumer MAY have an original PBL size
that is smaller than the new PBL size to be associated with that
STag. If the PBL passed in by the Consumer is greater than the
PBL associated with the STag, the RI MAY return an error
indicating it had insufficient resources to complete the
request.
If the RI-Reregister Non-Shared Memory Region Verb does not complete
successfully:
* If the RI returns an "Invalid RNIC handle", "Invalid STag Index"
or "One or more Memory Windows is still Bound to the Region"
Immediate Error, the RI MUST make no changes to the current
registration (assuming that it even exists).
* If the RI returns any error other than "Invalid RNIC handle",
"Invalid STag Index" or "One or more Memory Windows is still
Bound to the Region", the RI MUST Deallocate the Memory Region
associated with the STag Index used as an Input Modifier and
ensure that no new Memory Region is registered.
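The two failure cases above differ only in whether the existing
registration survives. A minimal sketch, assuming a hypothetical
dictionary of registrations keyed by STag Index and using the error
strings quoted in the text:

```python
# Errors after which the current registration is left unchanged.
NON_DESTRUCTIVE_ERRORS = frozenset({
    "Invalid RNIC handle",
    "Invalid STag Index",
    "One or more Memory Windows is still Bound to the Region",
})


def apply_reregister_failure(registrations, stag_index, error):
    """Update the registration table after a failed RI-Reregister."""
    if error not in NON_DESTRUCTIVE_ERRORS:
        # Any other error deallocates the Memory Region and registers
        # nothing in its place.
        registrations.pop(stag_index, None)
    return registrations
```

A Consumer therefore cannot assume that the old registration is
still usable after a failed RI-Reregister unless the error is one of
the three listed.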
When the RI-Reregister Non-Shared Memory Region Verb completes
successfully:
* the RI MUST have registered the Non-Shared Memory Region with
the RNIC;
* the RI MAY return a different STag Index than the one passed in
by the Consumer. If a different STag Index is returned, all
resources associated with the prior STag MUST have been
effectively Deallocated (e.g. transition to the Deallocated
state);
* the RI MUST return the number of Physical Buffer List Entries in
the allocated Physical Buffer List, which may be larger than the
requested size,
* the RI MUST use and set the Remote Access Rights and Remote
Access Flag for the STag as indicated with the Input Modifier,
and
* the returned STag MUST be in the Valid state. This STag can be
used to access a memory location.
The Consumer should note that since the STag Index returned MAY be
different than the STag Index provided to the Verb, any attempt to
use the previous STag Index in this case would result in a memory
protection error.
The RI-Reregister Non-Shared Memory Region Verb can be used to
modify the attributes of a Memory Region created through the RI-
Register Non-Shared Memory Region, RI-Reregister Non-Shared Memory
Region, or an Allocate Non-Shared Memory Region STag Verb. A Memory
Region MUST be allowed to be reregistered an arbitrary number of
times provided the PBL length is less than or equal to the original
PBL length.
For the error case where a Remote Peer is accessing a Non-Shared
Memory Region while it is in the process of being reregistered,
implementations MUST present the same semantics as a deallocate or
invalidate operation followed by a separate registration operation.
For information on the Verb to Reregister a Memory Region, see
Section 9.2.6.5 - Reregister Non-Shared Memory Region (RI-
Reregister).
7.3.2.4 Register Shared Memory Region
Shared Memory Regions provide a way for the Consumer to obtain a new
STag Index for a Memory Region that has already been registered.
This allows optimization of RNIC resources because returning a new
STag Index allows the Consumer to assign different Access Rights,
change the VA Base, change whether the Region is VA Based or Zero Based,
assign an STag Key and use a different PD, but use the same Physical
Buffer List as a previously registered Memory Region. Thus an
optimized implementation is possible where the new STag can use the
previous PBL for memory translation but has new STag properties for
Access Rights and Protection Domain checks.
When the Register Shared Memory Region Verb is invoked:
* If the STag Index, passed in by the Consumer, is associated with
a Non-Shared Memory Region, the RI MUST verify that the Memory
Region STag Index passed in is in the Valid state. Note that
Shared Memory Regions are always in the Valid state.
* Any Memory Windows that are currently bound to the MR,
associated with the STag Index passed in by the Consumer, MUST
be unaffected.
* The RI MUST verify that the STag Key of the existing MR matches
the STag Key supplied as an input modifier by the Consumer.
* The RI MUST accept and use any STag Key passed in by the
Consumer for the Shared Memory Registration.
* If the STag Index passed in by the Consumer references a VA
based TO, the RI MUST verify that the VA passed in by the
Consumer produces an FBO that matches the FBO of the PBL that is
associated with the STag Index passed in by the Consumer.
When the Register Shared Memory Region Verb completes successfully:
* the RI MUST have registered the new Shared Memory Region with
the RNIC;
* the RI MUST return a different STag Index that is associated
with the same or identical PBL as the PBL referenced by the STag
Index passed in by the Consumer;
* the RI MUST allow the new Shared Memory Region to have different
Access Rights, a different VA Base, a different TO type (VA
Based or Zero Based), a new STag Key, and a different PD; and
* if the STag Index passed in by the Consumer is associated with a
Non-Shared Memory Region, the RI MUST convert the Non-Shared
Memory Region to a Shared Memory Region but MUST NOT change any
other attributes of the Memory Region being converted.
The returned STag, which references the new, Shared Memory Region,
is in the Valid state. The STag can be used to access a memory
location.
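The checks and results listed above can be illustrated with a minimal
sketch. It is not a definitive implementation: the MemoryRegion
class, the register_shared function, and the STag Index allocator
are hypothetical names introduced for illustration only, and the
sketch assumes that the FBO check is applied whenever the Consumer
supplies a VA.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional


class RegistrationError(Exception):
    """Raised when a Register Shared Memory Region check fails."""


@dataclass
class MemoryRegion:
    stag_index: int
    stag_key: int
    pd_id: int
    access: frozenset
    pbl: List[int]            # Physical Buffer List, shared by reference
    buffer_size: int
    fbo: int                  # First Byte Offset into the first buffer
    va: Optional[int] = None  # None models a Zero Based TO
    shared: bool = False
    valid: bool = True


_stag_indexes = count(0x100)  # hypothetical STag Index allocator


def register_shared(src, src_key, new_key, pd_id, access, va=None):
    """Return a new Shared MR that reuses the PBL of src."""
    # A Non-Shared source must be Valid; Shared MRs are always Valid.
    if not src.shared and not src.valid:
        raise RegistrationError("source MR is not in the Valid state")
    # The Consumer must present the STag Key of the existing MR.
    if src.stag_key != src_key:
        raise RegistrationError("STag Key does not match existing MR")
    # A supplied VA must reproduce the FBO of the shared PBL.
    if va is not None and va % src.buffer_size != src.fbo:
        raise RegistrationError("VA does not match the FBO of the PBL")
    src.shared = True  # conversion changes no other attribute of src
    return MemoryRegion(
        stag_index=next(_stag_indexes), stag_key=new_key, pd_id=pd_id,
        access=frozenset(access), pbl=src.pbl,
        buffer_size=src.buffer_size, fbo=src.fbo, va=va,
        shared=True, valid=True)
```

In this sketch the new region carries its own STag Key, PD, and
Access Rights, while the pbl attribute refers to the same list object
as the source region, which models the shared translation resources.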
7.3.2.5 Fast-Register Non-Shared Memory Region
Fast-Register provides a mechanism for the Consumer to use the
PostSQ Verb to invoke an asynchronous memory registration. Fast-
Register Non-Shared Memory Region MUST support registration using
STags that were created with the Allocate Non-Shared Memory Region
STag, RI-Register Non-Shared Memory Region Verb or RI-Reregister
Non-Shared Memory Region Verb and have not subsequently been
converted to a Shared Memory Region.
When the Fast-Register Non-Shared Memory Region mechanism is
invoked:
* The RI MUST accept and use any STag Key passed in by the
Consumer for the Fast-Register operation.
* The RI MUST use the STag Index passed in by the Consumer to
register a Non-Shared Memory Region with the RNIC.
* The RI MUST verify that the STag Index passed in by the Consumer
is in the same PD as the QP. The RI MUST verify that the STag
Index passed in by the Consumer is not the STag of zero. The RI
MUST verify that the STag Index passed in by the Consumer is not
the STag of a Memory Window. If the STag Index is not in the
same PD as the QP or the STag is that of a Memory Window or the
STag is the STag of zero, the RI MUST return an error.
* The STag MUST be in the Invalid state at the time the Fast-
Register Non-Shared Memory Region is processed. See Section
7.2.2 - Summary of Memory Region STag States for more details.
If the STag is not in the Invalid state at the time the Fast-
Register Non-Shared Memory Region WR is processed, the RI MUST
return an error.
* If the Non-Shared Memory Region referenced by the STag does not
have a maximum PBL size greater than or equal to the PBL size
passed in the Fast-Register Non-Shared Memory Region, the RI
MUST return an error.
* The RI MUST prevent an STag with the Remote Access Flag disabled
from having its Access Rights changed to include remote Access
Rights. The RNIC MUST assure an STag with the Remote Access Flag
enabled can have its Access Rights changed to include remote and
local, or local only Access Rights. Note that the Remote Access
Flag cannot be changed except by the RI-Reregister Non-Shared
Memory Region Verb. If Remote Access Rights are requested and
the Remote Access Flag is not enabled, the RI MUST return an
error.
* The RI MUST verify that Fast-Register access is enabled on the
QP that is processing the Fast-Register Non-Shared Memory Region
operation. Note that this is intended to prevent a Non-
Privileged Mode application from accessing physical memory
without Privileged Mode intervention. If Fast-Register is not
enabled on the QP, the RI MUST return an error.
The Fast-Register operation MUST take place within the RI at some
time after the Work Request is posted and before execution of the
Work Request that immediately follows the Fast-Register operation.
When the Fast-Register Non-Shared Memory Region operation completes
successfully, the associated STag MUST be in the Valid state. The
STag can be used to access a memory location.
For a description of the Fast-Register Non-Shared Memory Region
mechanism, see Section 9.3.1.1 - PostSQ.
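The Fast-Register checks described in this section can be sketched
as a single validation routine. This is an illustrative model only:
the QueuePair and STag classes, the field names, and the access right
strings are assumptions made for the example, and the "STag of zero"
is modeled here as an STag Index equal to zero. The order in which
the checks are made is also a choice of the sketch, not a requirement.

```python
from dataclasses import dataclass, field
from typing import List


class FastRegisterError(Exception):
    """Raised when a Fast-Register check fails."""


REMOTE_RIGHTS = frozenset({"remote_read", "remote_write"})


@dataclass
class QueuePair:
    pd_id: int
    fast_register_enabled: bool


@dataclass
class STag:
    index: int
    pd_id: int
    max_pbl_entries: int
    remote_access_flag: bool = False
    is_memory_window: bool = False
    valid: bool = False
    key: int = 0
    access: frozenset = frozenset()
    pbl: List[int] = field(default_factory=list)


def fast_register(qp, stag, key, pbl, access):
    """Validate and apply a Fast-Register on a Non-Shared MR STag."""
    access = frozenset(access)
    if not qp.fast_register_enabled:
        raise FastRegisterError("Fast-Register not enabled on the QP")
    if stag.index == 0:
        raise FastRegisterError("STag of zero cannot be registered")
    if stag.is_memory_window:
        raise FastRegisterError("STag belongs to a Memory Window")
    if stag.pd_id != qp.pd_id:
        raise FastRegisterError("STag and QP are in different PDs")
    if stag.valid:
        raise FastRegisterError("STag is not in the Invalid state")
    if len(pbl) > stag.max_pbl_entries:
        raise FastRegisterError("PBL exceeds the maximum PBL size")
    if access & REMOTE_RIGHTS and not stag.remote_access_flag:
        raise FastRegisterError("Remote Access Flag is not enabled")
    # Any STag Key supplied by the Consumer is accepted and used.
    stag.key, stag.pbl, stag.access = key, list(pbl), access
    stag.valid = True
```

A failed check leaves the STag untouched in this sketch, so it stays
in the Invalid state and can be used in a later Fast-Register.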
7.4 Access to Registered Memory
The RI MUST support four distinct Memory Region Access Rights: Local
Read, Local Write, Remote Read, and Remote Write. The Access Rights
of the Memory Region MUST apply to each memory location within the
Memory Region. The RI MUST allow changing Access Rights from local
to local and remote only through an RI-Reregister or through a
Deallocate followed by an Allocate or RI-Register.
The RI MUST support a Remote Access Flag. It can be supplied as an
Input Modifier for the Allocate STag, RI-Register and RI-Reregister
Verbs. If the Remote Access Flag is enabled, the RI MUST allow the
remote Access Rights to be set on the STag. If the Remote Access
Flag is disabled, the RI MUST NOT allow the remote Access Rights to
be set on the STag.
When performing local and remote data transfer operations, the RI
MUST validate all 32 bits of the STag used to represent the data
transfer.
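As an illustration of validating all 32 bits, the following sketch
composes an STag from an STag Index and an STag Key and compares the
full value. The bit layout used (a 24-bit STag Index in the upper
bits and an 8-bit STag Key in the lower bits) is an assumption of
this example, as are the function names; the STag definition
elsewhere in this specification is authoritative.

```python
STAG_KEY_BITS = 8     # assumed width of the STag Key
STAG_INDEX_BITS = 24  # assumed width of the STag Index


def make_stag(index, key):
    """Combine an STag Index and STag Key into a 32-bit STag."""
    if not 0 <= index < (1 << STAG_INDEX_BITS):
        raise ValueError("STag Index out of range")
    if not 0 <= key < (1 << STAG_KEY_BITS):
        raise ValueError("STag Key out of range")
    return (index << STAG_KEY_BITS) | key


def stag_matches(presented, registered_index, registered_key):
    """Compare every bit of the presented STag, index and key alike."""
    return presented == make_stag(registered_index, registered_key)
```

Comparing the whole value means that a stale STag Key is rejected
even when the STag Index still refers to a live Memory Region.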
7.4.1 Local Access to Registered Memory
The RI MUST allow the Consumer to assign one or both of the Local
Access Rights to a given Memory Region. If the Consumer does not
assign one of the local Access Rights, the RI MUST return an error.
If the RI assigns Local Read Access to a Memory Region, the RNIC is
allowed to use the STag and Tagged Offset to read any location
within the Memory Region. If the RI assigns Local Write Access to a
Memory Region, the RNIC is allowed to use the STag and Tagged Offset
to write any location within the Memory Region.
Work Requests may require the Consumer to supply a locally
accessible data buffer. Locally accessible data buffers are
described by the STag associated with that Memory Region, a Tagged
Offset that points to a location within a Memory Region, and the
quantity of bytes in the buffer that may be used by the Work
Request.
The RI MUST enforce that Scatter Gather Elements used in Send
Operation Type and RDMA Write Work Requests posted to the SQ have
Local Read Access enabled or a Completion Error will result.
The RI MUST enforce that Scatter Gather Elements used in Receive
Work Requests posted to the Receive Queue or Shared-Receive Queue
have Local Write Access enabled or a Completion Error will result.
The RI MUST use only Local Access Rights when determining the Access
Rights for Scatter/Gather Elements. The RI MUST NOT use Remote
Access Rights when determining the Access Rights for Scatter/Gather
Elements.
7.4.2 Remote Access to Registered Memory
The Consumer may, in addition to the Local Access Rights, request
the RI to assign one or both of the Remote Access Rights to a given
Memory Region. The RI MUST NOT allow the Consumer to assign Remote
Write to an MR that has not been assigned Local Write. The RI MUST
NOT allow the Consumer to assign Remote Read to an MR that has not
been assigned Local Read.
If the Consumer assigns Remote Read Access to a Memory Region, the
RNIC is allowed to use the STag and Tagged Offset to read any subset
of the Memory Region when processing an incoming RDMA Read Request
Message. If the Consumer assigns Remote Write Access to a Memory
Region, the RNIC is allowed to use the STag and Tagged Offset to
write any subset of the Memory Region when processing an incoming
RDMA Write or RDMA Read Response Message. For more information, see
[RDMAP].
The RI MUST enforce that Tagged Buffers at the Data Sink targeted by
incoming RDMA Write Messages have Remote Write Access enabled or an
Asynchronous Error will result at the Data Sink.
The RI MUST enforce that Tagged Buffers whose contents are retrieved
by RDMA Read Request Messages have Remote Read Access enabled or an
Asynchronous Error will result at the Data Source.
The RI MUST enforce that Tagged Buffers consumed by RDMA Read
Response Messages have Remote Write Access enabled or an
Asynchronous Error will result at the Data Sink. The access control
on the Local Address is not verified until a remote access is
attempted through the RDMA Read Response Message.
Remote Access Rights MUST only be used by the RI when determining
the Access Rights for incoming Tagged and remote Invalidation
operations. The RI MUST NOT allow an STag with only Local Access
Rights to be Invalidated by an incoming remote Invalidation
operation or a protection error will result.
Figure 18 summarizes local and remote Access Rights and the validity
of their combinations that the RI MUST enforce:
Local            Remote            Valid Access Combination
---------------  ----------------  ------------------------
None             None              No
None             Read              No
None             Write             No
None             Read and Write    No
Read             None              Yes
Read             Read              Yes
Read             Write             No
Read             Read and Write    No
Write            None              Yes
Write            Read              No
Write            Write             Yes
Write            Read and Write    No
Read and Write   None              Yes
Read and Write   Read              Yes
Read and Write   Write             Yes
Read and Write   Read and Write    Yes
Figure 18 - Valid Combinations of MR Access Rights
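The combinations in Figure 18 follow from two rules stated in
Sections 7.4.1 and 7.4.2: at least one local Access Right must be
assigned, and each remote Access Right requires the corresponding
local Access Right. A minimal sketch of that rule, using hypothetical
names and representing each set of rights as a set of the strings
"read" and "write", is:

```python
from itertools import product

RIGHTS = [frozenset(), frozenset({"read"}), frozenset({"write"}),
          frozenset({"read", "write"})]


def is_valid_combination(local, remote):
    """Return True if the local/remote Access Rights pair is allowed."""
    local, remote = frozenset(local), frozenset(remote)
    # At least one local Access Right must be assigned.
    if not local:
        return False
    # Remote Read needs Local Read; Remote Write needs Local Write.
    return remote <= local


def valid_rows():
    """List every (local, remote) pair that Figure 18 marks as valid."""
    return [(loc, rem) for loc, rem in product(RIGHTS, RIGHTS)
            if is_valid_combination(loc, rem)]
```

Expressing the rule as a subset test reproduces all sixteen rows of
the figure without listing them individually.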
7.4.3 Multiple Registrations of Memory Regions
The same set of memory locations may be registered multiple times,
resulting in multiple STags. There are two methods for doing this in
the architecture. The first is the Shared Memory Region, which is
discussed in Section 7.3.2.4 - Register Shared Memory Region. The
second is to simply register a set of memory locations a second time
using the same, similar or overlapping Physical Buffer List.
Regardless of the method, each resulting STag represents a separate
and distinct Memory Region and may be independently associated with
any PD and have distinct Access Rights.
The RI MUST support registration of Non-Shared Memory Regions that
have partially or completely overlapping Physical Buffer Lists and
return a different STag Index for each.
Where multiple registrations of the same memory locations are
desired, the Register Shared Memory Region Verb provides a way to
optimize the use of RI resources. Its semantics are discussed in
Section 7.3.2.4 - Register Shared Memory Region, and the Verb itself
is specified in Section 9.2.6.6 - Register Shared Memory Region.
Given an existing Non-Shared Memory Region, a Shared Memory Region
Verb creates a new Shared Memory Region associated with the same
Physical Memory Addresses, with the intention that the new Shared
Memory Region shares RNIC mapping resources to the extent possible.
This also turns the existing Non-Shared Memory Region into a Shared
Memory Region. Through repeated calls to the Register Shared Memory
Region Verb, an arbitrary number of Shared Memory Regions can
potentially share the same RNIC mapping resources, all associated
with the same Physical Memory Addresses. The Base TO, VA (if the
input STag Index references a VA Based TO), PD ID, and Access Rights
specified for the new Shared Memory Region need not be the same as
those of the existing Memory Region. For a VA Based TO, the RI MUST
verify that the VA passed in by the Consumer produces an FBO that
matches the FBO of the PBL that is associated with the STag Index
passed in by the Consumer. The lengths are by definition the same.
7.5 Memory Access Control
Only a Privileged Mode Consumer can invoke an RI-Register, RI-
Reregister, or Allocate Non-Shared Memory Region STag Verb. In
general, the OS is responsible for determining and enforcing access
control policy for memory registrations it performs on behalf of Non-
Privileged Mode Consumers. For instance, it is anticipated, but not
required, that operating systems will enforce policies similar to
the following:
* A Non-Privileged Mode Consumer has control over which of its
memory areas can be accessed by local and remote RNIC data
transfer operations.
* A Non-Privileged Mode Consumer can enable any local memory area
it has access to for access by RNIC data transfer operations.
* A Non-Privileged Mode Consumer cannot enable RNIC read access to
memory areas that the Consumer itself doesn't have read access
to.
* A Non-Privileged Mode Consumer cannot enable RNIC write access
to memory areas that the Consumer itself doesn't have write
access to.
When a Consumer creates QPs or CQs (through the appropriate Verbs),
the RI automatically allocates and pins any local memory needed for
the associated RI internal control structures. Access by the RNIC to
these control structures is implicitly enabled. Access by the
Consumer to these control structures is supported only indirectly
through Verbs. Any STags used within the RI that are used for the
control structures (if they exist) MUST NOT be exposed to the
Consumer.
A Consumer controls which Memory Regions and Memory Windows are
accessible by each QP through the use of PDs. Prior to creating any
QPs, registering any Memory Regions, or allocating any Memory
Windows, the Consumer should allocate one or more PDs. When
registering Memory Regions or allocating Memory Windows, the
Consumer specifies the PD ID to associate to each. For information
on the use of PDs, see Section 5.2 - Protection Domains.
7.5.1 Local Access Control
With Send Type, RDMA Write, and Receive Queue WRs, the Consumer
explicitly specifies the data buffers to be accessed through the
local Scatter Gather Elements (SGEs) that the Consumer posts with
the associated Work Requests.
When registering a Memory Region, a Privileged Consumer can
generally specify the following local Access Rights for the Region:
read only, write only, or read and write.
The Consumer can access the Memory Region through the STag. This
STag grants the Consumer local Access Rights for the entire Memory
Region as bounded by the base TO and byte length. Access control is
enforced at byte-level granularity.
The following list defines the local Access Rights requirements for
SGEs used in local operations:
* Local read access MUST be specified for Gather Elements used in
Send Type WRs and RDMA Write WRs,
* Local Write access MUST be specified for Scatter Elements used
in Receive WRs, and
* For RDMA Read Type WRs, Local Access Rights are not used to
verify the Local Address or Remote Address.
7.5.2 Remote Access Control
When a Consumer wants to allow Remote Peers to access its local
memory using RDMA Writes or RDMA Read Operations, the Consumer
should explicitly enable remote access and Advertise an appropriate
STag to the Remote Peer for it to use when initiating these RDMA
Operations targeting the Consumer's (local) memory.
A Consumer can use either of two mechanisms to enable remote access
to its memory. The first mechanism consists of using a Memory Region
that has remote Access Rights. The second mechanism consists of
allocating and binding Memory Windows. Either results in an STag
with associated remote Access Rights for the memory referenced by
the STag.
Two types of remote access - read and write - are supported. RDMA
Write requires Remote Write Access at the Remote Peer. The RDMA
Protocol converts an RDMA Read Type WR into an RDMA Read Operation
that uses two RDMAP Messages: RDMA Read Request and RDMA Read
Response. Remote Read Access MUST be enabled for Memory Regions read
by a remote RDMA Read Request Message. Remote Write Access MUST be
enabled for Memory Regions written by a remote RDMA Read Response
Message. If the Memory Region does not have the appropriate Access
Rights, a protection error occurs.
For RDMA Read Operations, during the processing of an RDMA Read Type
WR, the RNIC is responsible for generating one RDMA Read Request
Message that contains a description of the Local Address and Remote
Address. Local Access Rights are not used to verify the Local
Address or Remote Address. The Remote Access Rights of the Local
Address are not verified until an incoming RDMA Read Response Message
is received. The Remote Access Rights of the Remote Address are
verified when the Remote Peer processes the RDMA Read Request
Message.
In order to set either Remote Access control type in a Fast-
Register operation, the Non-Shared Memory Region STag MUST have been
created with the Remote Access Flag enabled.
7.6 Addressing
The Tagged Offset field is used by local and remote operations to
address registered Memory Regions.
7.6.1 Addressing Registered Memory
The RI MUST support two mechanisms for specifying the offset within
Memory Regions: VA Based TO and Zero Based TO. At the time the
Memory Region is registered, the RI MUST allow the Consumer to
choose between these two mechanisms. A Virtual Address Based Tagged
Offset (VA Based TO) is one that has a Tagged Offset base that
starts at a non-zero Virtual Address. A Zero Based Tagged Offset
(Zero Based TO) is one that has a Tagged Offset base that starts at
zero.
7.6.1.1 Addressing with VA based TO
The Virtual Addresses that Consumers manipulate and pass as input
modifiers are referred to simply as Virtual Addresses in this
specification. The size of the Virtual Addresses used to specify a
Memory Region to be registered is implementation dependent. The size
of the TO MUST be 64 bits. The TO passed in the SGE defines the VA
of the first byte of the SGE.
A Memory Region is specified by a Virtual Address that points to the
first byte, which is specified by the First Byte Offset of the
Physical Buffer List, and by the length of the set in bytes. The
Physical Buffer size that backs the Region depends on the host
system hardware and host operating system.
The RI MUST allow a Consumer to specify an arbitrary alignment and
length of the virtually contiguous buffer to be registered through an
RI-Register Non-Shared Memory Region Verb, RI-Reregister Non-Shared
Memory Region Verb, or Fast-Register Non-Shared Memory Region.
The following operations should be performed before registering a VA
Based TO Non-Shared Memory Region:
* Translate the set of virtually contiguous memory locations that
are associated with the Non-Shared Memory Region into a Physical
Buffer List.
* Pin the Physical Buffers in the Physical Buffer List.
While a Memory Region is Valid, every Physical Buffer within the
Region must be pinned down in physical memory. This guarantees to
the RNIC that the Memory Region is physically resident (not paged
out) and that the virtual to physical address translation remains
fixed while the Region is registered. The RI is NOT REQUIRED to
verify that the Physical Buffers in the Physical Buffer List are
pinned.
When the Consumer registers a Non-Shared Memory Region addressed
through the VA based TO mechanism, the following input modifiers are
passed to the RI (along with additional input modifiers - see
Section 9.2.6):
* Virtual Address - The VA Physical Buffer offset portion of the
VA defines the offset into the first Physical Buffer of the Non-
Shared Memory Region. The RI checks that the VA modulo Physical
Buffer Size equals the FBO.
* Physical Buffer size - Size of all Physical Buffers referenced
by the Non-Shared Memory Region.
* First Byte Offset (FBO) - Offset into the first Physical Buffer
of the Non-Shared Memory Region.
When an RI-Register Non-Shared Memory Region Verb, RI-Reregister Non-
Shared Memory Region Verb, Register Shared Memory Region Verb, or
Fast-Register Non-Shared Memory Region WR is processed, the RI MUST
verify that the Base TO modulo the Physical Buffer Size is equal to
the VA modulo the Physical Buffer Size.
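As a worked illustration of these input modifiers, the sketch below
derives the FBO and the number of Physical Buffers needed for a
virtually contiguous buffer, and then applies the modulo check. The
function names are hypothetical, and the sketch assumes a positive
region length.

```python
def registration_inputs(va, length, buffer_size):
    """Derive the FBO and Physical Buffer count for a VA Based TO MR."""
    if buffer_size <= 0 or length <= 0:
        raise ValueError("buffer size and length must be positive")
    # The FBO is the offset of the VA within its Physical Buffer.
    fbo = va % buffer_size
    # Buffers needed to cover fbo + length bytes, rounded up.
    n_buffers = (fbo + length + buffer_size - 1) // buffer_size
    return fbo, n_buffers


def base_to_matches_va(base_to, va, buffer_size):
    """Check that the Base TO and the VA share the same buffer offset."""
    return base_to % buffer_size == va % buffer_size
```

For example, with 4096-byte Physical Buffers, a 0x300-byte region
starting at VA 0x10000F00 has an FBO of 0xF00 and spans two Physical
Buffers, because the region crosses a buffer boundary.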
7.6.1.2 Addressing with Zero Based TO
A zero based contiguous set of memory locations is specified by the
length of the set in bytes. The RI MUST associate a TO that has a
value of zero with the First Byte Offset in the Physical Buffer
List.
The following operations must be performed before registering a Zero
Based TO Non-Shared Memory Region:
* Translate the set of virtually contiguous memory locations
associated with the Non-Shared Memory Region into a Physical
Buffer List.
* Pin the Physical Buffers in the Physical Buffer List.
While a Memory Region is Valid, every Physical Buffer within the
Region must be pinned down in physical memory. This guarantees to
the RNIC that the Memory Region is physically resident (not paged
out) and that the virtual to physical address translation remains
fixed while the Region is registered. The RI is NOT REQUIRED to
verify that the Physical Buffers in the Physical Buffer List are
pinned.
When the Consumer registers a Non-Shared Memory Region addressed
through the Zero Based TO mechanism, the following input modifiers
are passed to the RI (along with additional input modifiers - see
Section 9.2.6):
* First Byte Offset - Offset into the first Physical Buffer of the
Non-Shared Memory Region.
* Buffer size - Size of all Physical Buffers referenced by the
Non-Shared Memory Region.
When an RI-Register Non-Shared Memory Region Verb, RI-Reregister Non-
Shared Memory Region Verb, Register Shared Memory Region Verb or
Fast-Register Non-Shared Memory Region WR is processed for a Zero
Based TO MR, the Base TO MUST be set to zero.
Note that a Memory Window cannot be bound to a Zero Based TO MR.
7.6.2 Physical Buffer Lists
Two Physical Buffer types are defined in this specification: Page
and Block. The RI MUST support the Page Physical Buffer type.
Support for the Block Physical Buffer type by the RI is OPTIONAL. If
the RI supports Block Mode, the RI MUST support the ability to place
the RNIC into either Block Mode or Page Mode when the RNIC is
opened. The RI MUST support a mechanism for querying the RNIC to
determine if the Block Physical Buffer type is supported.
Memory that is part of a Physical Buffer List should remain pinned
while the RI has any reference to it. It is not safe for the
Consumer to assume that the Physical Buffer can be unpinned when an
STag is deallocated, since another STag may still have a
reference to that resource. It is the responsibility of the Consumer
to determine if and when the Physical Buffers should be unpinned.
7.6.2.1 Page Lists
A Page List is defined by the following attributes:
* Page size - The size, in bytes, of each page in the list.
* Address List - A list of addresses that point to the physical
pages referenced by the Page List. The Address List has the
following attributes:
o All pages in the list have the same size, and that size MUST
be a power of two.
o Page addresses MUST be an integral multiple of the page size.
In other words, each address in the Address List modulo page
size MUST equal zero.
* First Byte Offset (FBO) - Byte offset to start of Memory Region
within the first page.
* Length - Total length in bytes of the Memory Region.
When a Page List is used to register a Non-Shared Memory Region that
has a VA based TO, the RI MUST check that the VA modulo the Page
Size equals the FBO.
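The Page List attributes above are sufficient to translate a Tagged
Offset into a physical address. The following sketch, which uses
hypothetical function names, validates a Page List and performs that
translation. The checks that the FBO lies within the first page and
that the list covers the whole region are assumptions of this example
inferred from the attribute definitions. Bounds checking of the TO
itself is left to the checks in Section 7.6.3.

```python
def validate_page_list(page_size, addresses, fbo, length):
    """Check the Page List constraints; raise ValueError on failure."""
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("page size must be a power of two")
    if any(addr % page_size for addr in addresses):
        raise ValueError("page address is not page aligned")
    if not 0 <= fbo < page_size:
        raise ValueError("FBO must fall within the first page")
    if fbo + length > page_size * len(addresses):
        raise ValueError("Page List does not cover the Memory Region")


def to_physical(to, base_to, page_size, addresses, fbo):
    """Translate a Tagged Offset to a physical address."""
    # Byte position counted from the start of the first page.
    linear = fbo + (to - base_to)
    page, within = divmod(linear, page_size)
    return addresses[page] + within
```

For a Zero Based TO region, base_to is zero; for a VA Based TO
region, base_to is the VA supplied at registration.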
7.6.2.2 Block Lists
A Block List is defined by the following attributes:
* Block size - The size, in bytes, of each block in the list.
* Address List - A list of addresses that point to the physical
blocks referenced by the Block List. The Address List has the
following attributes:
o The RI MUST interpret each block referenced in the Address
List as having the same size.
o The RI MUST allow Block Addresses to have an arbitrary byte
alignment.
* First Byte Offset (FBO) - Byte offset to start of Memory Region
within the first block.
* Length - Total length in bytes of the Memory Region.
When a Block List is used to register a Non-Shared Memory Region
that has a VA based TO, the RI MUST check that the VA modulo the
Block Size equals the FBO.
7.6.3 Error Checking of Local and Remote Accesses to MRs
When a local or remote operation attempts to access a registered
Memory Region, the RI MUST ensure that:
* The Access Rights of the Memory Region allow the type of access
being performed by the operation,
* The Access Rights of the QP allow the type of access being
performed by the operation,
* For a QP not associated with an S-RQ, the PD ID associated with
the Memory Region matches the PD ID associated with the QP that
is processing the operation,
* For a QP that is associated with an S-RQ:
o On an incoming Send Operation Type, the PD ID associated
with the Memory Region matches the PD ID associated with the
S-RQ that is processing the operation, and
o On an outbound Send or RDMA Write, or any incoming RDMA
Message, the PD ID associated with the Memory Region matches
the PD ID associated with the QP that is processing the
operation,
* The memory access as specified by the TO & length is within the
base and bounds of the Memory Region. The RI MUST enforce this
with a byte level granularity.
If the length of the access is zero, the RI MUST NOT perform any of
the above checks on the Memory Region.
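The checks in this section can be sketched as one routine. The
classes, field names, and access right strings below are hypothetical
and exist only for illustration; in particular, modeling the Access
Rights of the QP as a set of the same right names used for the Memory
Region is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


class ProtectionError(Exception):
    """Raised when an access check on a Memory Region fails."""


@dataclass
class Region:
    pd_id: int
    base_to: int
    length: int
    access: frozenset


@dataclass
class QueuePair:
    pd_id: int
    access: frozenset
    srq_pd_id: Optional[int] = None  # set when the QP uses an S-RQ


def check_access(mr, qp, right, to, length, incoming_send=False):
    """Raise ProtectionError unless the access is permitted."""
    # A zero-length access is not checked against the Memory Region.
    if length == 0:
        return
    if right not in mr.access:
        raise ProtectionError("MR Access Rights do not allow access")
    if right not in qp.access:
        raise ProtectionError("QP Access Rights do not allow access")
    # Incoming Sends on a QP with an S-RQ use the PD of the S-RQ.
    use_srq = qp.srq_pd_id is not None and incoming_send
    expected_pd = qp.srq_pd_id if use_srq else qp.pd_id
    if mr.pd_id != expected_pd:
        raise ProtectionError("PD ID mismatch")
    # Base and bounds are enforced with byte granularity.
    if to < mr.base_to or to + length > mr.base_to + mr.length:
        raise ProtectionError("access is outside the Memory Region")
```

Note that the zero-length case returns before any other test, so an
otherwise invalid TO is accepted when no bytes are transferred.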
7.7 Querying Memory Regions
Memory Regions have attributes that can be retrieved through the
Query Memory Region Verb. The RI MUST support the complete list of
Memory Region attributes as described in Section 9.2.6.3 - Query
Memory Region.
7.8 Invalidating Memory Regions
When access to a Non-Shared Memory Region by an RI is no longer
required, but the Consumer wants to retain the STag for use in
future Fast-Register Non-Shared Memory Region and RI-Reregister Non-
Shared Memory Region Verb invocations, the Consumer may directly
invalidate access to the Non-Shared Memory Region through an
Invalidate Local STag WR or an RDMA Read with Invalidate Local STag
WR. Additionally, an STag may be invalidated by a remote Consumer
through the use of a Send with Invalidate Message or a Send with
Solicited Event and Invalidate Message.
Multiple Memory Regions can represent memory locations that have
been registered multiple times. The invalidation of a single STag
prevents RNIC access to those memory locations via the STag
associated with that Memory Region. Access to the memory locations
via STags associated with Memory Regions other than the one being
Invalidated MUST NOT be affected. Invalidating an STag associated
with a Memory Region that partially or completely overlaps other
Memory Regions MUST NOT cause the RI to affect the registration of
those other Memory Regions.
The requirements for unpinning the physical buffers associated with
deallocated Memory Regions are covered in Section 7.6.2 - Physical
Buffer Lists.
Invalidating an STag associated with a Shared Memory Region MUST
result in a Completion Error. Consequently, using an STag
associated with a Shared Memory Region under the following
conditions will cause a Completion Error at the Data Sink that
results in the LLP Stream being torn down after the data transfer
operation takes place:
* As the STag specified in an Invalidate Local STag WR.
* As the Data Sink STag for an RDMA Read with Invalidate Local
STag WR.
* As the STag to be Invalidated for a Send with Invalidate or Send
with SE & Invalidate Message.
When a local Invalidate Local STag WR, a local RDMA Read with
Invalidate Local STag WR, an incoming Send with Invalidate, or an
incoming Send with Solicited Event and Invalidate completes
successfully, the RNIC MUST place the associated STag in the Invalid
state. For more information, see Section 8.2.2.1 - Memory Management
Operation Ordering.
An Invalidated STag retains its associated RI resources, such as the
PD, the Remote Access Flag, and the number of Physical Buffer List
entries, but the contents of the Address List Entries become
indeterminate when the Memory Region is in the Invalid state.
The RI MUST fail, with a protection error, Local Work Requests or
Remote Operations that attempt to access memory locations in a
Non-Shared Memory Region whose STag has been Invalidated. The RNIC
MUST NOT be able to access any memory locations through an STag that
is in the Invalid state.
For Non-Shared Memory Regions created through the RI-Register Non-
Shared Memory Region Verb, when an STag is Invalidated, the RNIC
MUST retain:
* The Maximum Physical Buffer List (PBL) size and entries used:
o When the RI-Register Non-Shared Memory Region was invoked,
if an RI-Reregister Verb has not been invoked on the Non-
Shared Memory Region; or
o On the last RI-Reregister Non-Shared Memory Region that used
the Non-Shared Memory Region.
* The state of the Remote Access Flag.
* The PD associated with the Non-Shared Memory Region.
For Non-Shared Memory Regions created through the Allocate Non-
Shared Memory Region STag Verb, when an STag is Invalidated, the
RNIC MUST retain:
* The Maximum Physical Buffer List size and entries used:
o When the STag was created for a Non-Shared Memory Region, if
an RI-Reregister Verb has not been invoked on the Non-Shared
Memory Region; or
o On the last RI-Reregister Non-Shared Memory Region that used
the Non-Shared Memory Region.
* The state of the Remote Access Flag.
* The PD associated with the Non-Shared Memory Region.
For Memory Regions created through the RI-Reregister Non-Shared
Memory Region Verb, when an STag is Invalidated, the RNIC MUST
retain:
* The Maximum Physical Buffer List (PBL) size and entries used:
o When the RI-Register Non-Shared Memory Region was invoked,
if an RI-Reregister Verb has not been invoked on the Non-
Shared Memory Region; or
o On the last RI-Reregister Non-Shared Memory Region that used
the Non-Shared Memory Region.
* The PD associated with the Non-Shared Memory Region.
If a Fast-Register is invoked after an RI-Register Memory Region,
Allocate Non-Shared Memory Region STag or RI-Reregister Memory
Region, the Consumer is guaranteed that the RNIC can register a Non-
Shared Memory Region with a PBL size that is equal to or smaller
than the original PBL size returned when the Non-Shared Memory
Region was created or allocated.
An STag is allowed to already be in the Invalid state when the RNIC
performs the STag Invalidation.
In order to perform an Invalidation Operation on a given QP, either
through a Local Invalidation operation or an incoming Send with
Invalidate or Send with Solicited Event and Invalidate, the
following checks MUST be performed by the RI:
* The STag MUST be Non-Shared and in the Valid or Invalid state.
* The STag MUST NOT be the STag of zero.
* If the STag is that of a Non-Shared Memory Region, the PD ID of
the STag MUST equal the PD ID of the QP.
* If the STag is that of a Non-Shared Memory Region, there MUST
NOT be any Memory Windows Bound to it.
* The STag Key supplied by the Invalidate Operation must be
validated against the STag Key associated with the Memory Region
when moving the STag to the Invalid state.
* If the Invalidation Operation is due to an Incoming Send with
Invalidate or Send with Solicited Event & Invalidate, the RI
MUST ensure that the QP has either of the remote Access Rights
enabled and the STag has either of the remote Access Rights
enabled.
If any of the above checks fail, a Protection Error MUST result
unless the STag is in the Deallocated state, in which case an
Operation Error MUST result. If the operation was initiated by a
Local Invalidation, a Completion Error MUST result. If the operation
was initiated by an incoming Invalidation operation, a processing
error MUST result and the Queue Pair will enter the Terminate state.
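The Invalidation checks and the resulting error classes can be
sketched as follows. The sketch covers only Memory Region STags; the
classes, field names, state strings, and exception names are
hypothetical, and the "STag of zero" is modeled as an STag Index
equal to zero.

```python
from dataclasses import dataclass


class ProtectionError(Exception):
    """An Invalidation check failed."""


class OperationError(Exception):
    """The STag was in the Deallocated state."""


REMOTE_RIGHTS = frozenset({"remote_read", "remote_write"})


@dataclass
class STag:
    index: int
    key: int
    pd_id: int
    access: frozenset
    state: str = "valid"  # "valid", "invalid", or "deallocated"
    shared: bool = False
    bound_windows: int = 0


@dataclass
class QueuePair:
    pd_id: int
    access: frozenset


def invalidate(stag, qp, key, incoming=False):
    """Move a Non-Shared MR STag to the Invalid state if allowed."""
    if stag.state == "deallocated":
        raise OperationError("STag is in the Deallocated state")
    # Incoming Invalidations need a remote right on the QP and STag.
    remote_ok = (bool(qp.access & REMOTE_RIGHTS)
                 and bool(stag.access & REMOTE_RIGHTS))
    if (stag.shared or stag.index == 0 or stag.pd_id != qp.pd_id
            or stag.bound_windows > 0 or stag.key != key
            or (incoming and not remote_ok)):
        raise ProtectionError("Invalidation check failed")
    # Only the state changes; PD, key and rights are retained here.
    stag.state = "invalid"
```

Invalidating an STag that is already in the Invalid state succeeds in
this sketch, matching the allowance stated above.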
For descriptions of the Work Requests that Invalidate STags
(Invalidate STag, Send with Invalidate, Send with Solicited Event
and Invalidate and RDMA Read with Invalidate Local STag), see
Section 9.3.1.1 - PostSQ.
7.9 Deallocation of STag associated with a Memory Region
The Consumer can reverse the allocation or registration process that
created the STag by invoking the Deallocate STag Verb. The process
of deallocating an STag MUST revoke all RNIC Access Rights
associated with that STag.
The RI MUST verify that the STag Index used as an Input Modifier is
a valid STag on the specified RNIC.
Multiple Memory Regions can represent memory locations that have
been registered multiple times. The deallocation of a single STag
prevents RNIC access to those memory locations via the STag
associated with that Memory Region. Access to memory locations using
STags associated with other Memory Regions MUST NOT be affected.
Deallocating an STag associated with a Memory Region that partially
or completely overlaps other Memory Regions MUST NOT cause the RI to
affect the registration of those other Memory Regions. Deallocating
an STag associated with a Shared Memory Region MUST NOT cause the RI
to affect the registration of any other Shared Memory Region.
The requirements for unpinning the physical buffers associated with
deallocated Memory Regions are covered in Section 7.6.2 - Physical
Buffer Lists.
When the Deallocate STag Verb is invoked, any in-process Local or
Remote Operations that are actively referencing memory locations
through the STag being deallocated MUST fail with a protection error.
Local or Remote Operations attempting to access memory locations in
a Memory Region with a deallocated STag MUST fail with a protection
error.
Before the Deallocate Verb returns, the RI MUST free all resources
associated with the STag and revoke the right to use the STag in
Local or Remote Operations.
When a Deallocate STag is invoked, the RI MUST NOT:
* check the state of the associated STag. That is, an STag
associated with a Non-Shared MR can be in either the Valid or
Invalid state when the Deallocate STag is invoked.
* check the STag Key portion of the STag. Note that the Deallocate
Verb does not have an STag Key Input Modifier.
If any Memory Windows are Bound to the Memory Region and the
Consumer invokes the Deallocate STag Verb, the RI MUST return an
Immediate Error and MUST NOT deallocate the Memory Region. Memory
Windows can reverse the Bind process through deallocation or
invalidation.
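The Deallocate STag rules can be illustrated with a small model. This is a sketch under stated assumptions: Rnic, MemoryRegion and ImmediateError are hypothetical names, and reporting an unknown STag Index as an Immediate Error is an assumption, since the text above only requires that the index be verified.

```python
from dataclasses import dataclass, field


class ImmediateError(Exception):
    """Stands in for an Immediate Error returned by the Verb."""


@dataclass
class MemoryRegion:           # hypothetical model
    stag_index: int
    bound_windows: int = 0    # Memory Windows currently Bound to this MR


@dataclass
class Rnic:                   # hypothetical model
    regions: dict = field(default_factory=dict)   # STag Index -> MemoryRegion

    def deallocate_stag(self, stag_index: int) -> None:
        mr = self.regions.get(stag_index)
        if mr is None:
            # Assumption: an unknown index is reported as an Immediate Error.
            raise ImmediateError("STag Index is not valid on this RNIC")
        if mr.bound_windows > 0:
            raise ImmediateError("Memory Windows are still Bound")
        # Deliberately no STag state check and no STag Key check.
        del self.regions[stag_index]
```

Only the entry for the given STag Index is removed, so other Memory Regions that cover the same memory locations are unaffected.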
For a description of the Deallocate Memory Region mechanism, see
Section 9.2.6.4 - Deallocate STag.
7.10 Memory Windows
When a Consumer needs more flexible control over remote access to
its memory, the Consumer can use Memory Windows. Memory Windows are
intended for situations where:
* A Non-Privileged Mode Consumer wants to grant and revoke remote
Access Rights to a registered Region in a dynamic fashion with
less of a performance penalty than using
deallocation/registration or invalidation/re-registration.
* A Consumer wants to grant different remote Access Rights to
different Remote Peers and/or grant those rights over different
ranges within a registered Region.
To use a Memory Window, the Consumer allocates a Memory Window and
then Binds it to a specified TO range of an existing Memory Region
that is enabled for use with Memory Windows. The range can include
the entire Memory Region or any subset of the Memory Region.
See Section 9.2.6 - Memory Management for a description of the Verbs
used to manage Memory Windows.
7.10.1 Allocating Memory Windows
The Allocate Memory Window Verb is used to allocate a Memory Window.
When the Verb returns, it must have allocated Memory Window
resources on the RNIC, associated the STag with the PD ID supplied
as an Input Modifier by the Consumer, and returned the STag
associated with the allocated Memory Window. The RI MUST ensure that
the returned STag is in the Invalid state. The RI MUST NOT allow the
returned STag to be used with RI-Reregister Non-Shared Memory
Region, Register Shared Memory Region, Query Memory Region or Fast-
Register Non-Shared Memory Region. For allocating a Memory Window,
see Section 9.2.6.7 - Allocate Memory Window.
7.10.2 Binding Memory Windows to Memory Regions
The PostSQ Verb is used to Bind a Memory Window to a previously
registered Memory Region. After the WR that Binds the MW is
processed, the STag associated with the Memory Window is in the
Valid state.
The RI MUST allow a MW to Bind to a Non-Shared Memory Region. The RI
MUST allow a MW to Bind to a Shared Memory Region. The RI MUST allow
all allocated MWs to be Bound to a single MR. The RI MUST allow all
allocated MWs to be Bound to a single QP.
If the STag representing the Memory Region to which the Memory
Window will Bind has an STag of zero, the Verb MUST return either an
Immediate Error or a Completion Error.
During the processing of PostSQ Bind Memory Window Verb, the RNIC
MUST ensure that the PD ID of the Memory Window equals both the PD
ID of the Memory Region and the PD ID of the QP that is processing
the PostSQ Bind Memory Window Verb. If the three PD IDs are equal,
the Memory Window is Bound to the Memory Region and is associated
with the QP that processed the PostSQ Bind Memory Window Verb.
Otherwise an invalid PD Completion Error is returned to the
Consumer. When a Memory Window is Bound to a QP at this point, it is
conceptually equivalent to having the PD ID of the Memory Window
replaced with the QP ID of the QP. Thus, instead of performing a PD
check upon validating the STag for incoming RDMA operations, the QP
ID of the Memory Window MUST be equal to the QP ID of the QP where
the incoming RDMA operation arrived.
The RI MUST check that the QP has the ability to Bind Memory Windows
enabled.
When Binding a Memory Window, the RI MUST ensure that the memory
locations being associated with the Memory Window are within the
base TO and length of the associated Memory Region. The RI MUST
support Memory Windows with a Zero Based TO. The RI MUST support
Memory Windows with a VA Based TO. The RI MUST allow Memory Windows
to bind to Memory Regions with a VA based TO. If the Memory Window
has a VA based TO, the RNIC MUST ensure that the value assigned for
the base of the Memory Window be between the MR's base VA, and the
MR's Base VA plus the MR's length.
When the Bind MW WR completes successfully:
* The RI MUST have Bound the MW to the specified Memory Region.
* The RI MUST have Bound the MW to the QP that processed the Bind
WR, by associating the QP's QP ID to the MW.
* The RI MUST have set the MW STag's access rights as requested by
the Consumer.
* The RI MUST accept and use the STag Key passed in by the
Consumer for the Bind operation.
* The RI MUST have set the MW Address Type as requested by the
Consumer.
* If the Address Type of the MW was requested as VA Based, the RI
MUST have set the Virtual Address as requested by the Consumer.
* The RI MUST have placed the MW STag in the Valid State.
Figure 19 indicates which MR to MW Binding combinations are valid.
Note that the figure is based on the Base TO type of the Memory
Region and Memory Window. If the Consumer attempts to Bind a MW to a
Zero-based TO MR, the RI MUST return an error. The Underlying Memory
Region in this case may be either a Non-Shared Memory Region or a
Shared Memory Region.
Underlying Memory      Memory Window      Valid
Region TO base         TO base            combination
-----------------      -------------      -----------
Zero based             Zero based         No
Zero based             VA based           No
VA based               Zero based         Yes
VA based               VA based           Yes
Figure 19 - MR to MW Valid Binding Combinations
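Figure 19 reduces to a single condition on the Memory Region. A minimal sketch, in which the function name and the "zero" / "va" labels are illustrative choices rather than anything defined by this specification:

```python
def binding_allowed(mr_to_base: str, mw_to_base: str) -> bool:
    """Return True if Figure 19 marks the MR/MW TO base pair as valid."""
    kinds = ("zero", "va")
    if mr_to_base not in kinds or mw_to_base not in kinds:
        raise ValueError("TO base must be 'zero' or 'va'")
    # Only the Memory Region's TO base matters: the MR must be VA based.
    return mr_to_base == "va"
```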
When a remote access references a Bound Memory Window, the RNIC MUST
ensure that the QP ID associated with the Memory Window matches the
QP ID associated with the remote access' RDMA Stream. The RNIC MUST
also ensure that the memory locations being referenced by the remote
access are within the base TO and length of the associated Bound
Memory Window. The RI MUST enforce this with a byte level
granularity.
When Binding a Memory Window, a Consumer can request any combination
of remote Access Rights for the Window. However, if the associated
Memory Region does not have local write access enabled and the
Consumer requests remote write for the Window, implementations MUST
return a Completion Error.
Memory Windows MUST support two distinct remote Access Rights:
Remote Read and Remote Write. Bind Memory Window WRs must specify
one or both of these rights. Memory Windows with Remote Write Access
MUST be bound to Memory Regions that have Local Write Access
Enabled. Memory Windows with Remote Read access MUST be bound to
Memory Regions that have Local Read Access Enabled.
A Consumer is allowed and commonly expected to enable remote Access
Rights when Binding a Window that it may not have enabled when it
registered the underlying Region - provided it does not violate the
above rule regarding local access. For example, a Consumer might
register a Region with no remote Access Rights, and later Bind one
or more Windows to that Region that would grant remote Access
Rights.
Figure 20 summarizes the access right mappings between Memory
Regions and Memory Windows and whether the requested Memory Window
Access Right is allowed. The RI MUST validate Memory Window Access
Right requests according to Figure 20. If the requested Access Right
is not allowed, the Bind operation MUST result in a Completion
Error.
Underlying Memory       Requested Remote         Access Right
Region's Local          Access Rights for        Requested
Access Rights           Memory Window            allowed:
--------------------    ---------------------    ------------
Local Read              Remote Write             No
Local Read              Remote Read              Yes
Local Read              Remote Read and Write    No
Local Write             Remote Write             Yes
Local Write             Remote Read              No
Local Write             Remote Read and Write    No
Local Read and Write    Remote Write             Yes
Local Read and Write    Remote Read              Yes
Local Read and Write    Remote Read and Write    Yes
None                    Any                      No
Any                     None                     No
Figure 20 - Valid Combinations of MW & MR Access Rights
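Every row of Figure 20 follows from one rule: the requested remote rights must be non-empty and each must have the matching local right on the Memory Region. A minimal sketch, with rights modelled as sets of the illustrative labels "read" and "write" (the function name is hypothetical):

```python
def mw_rights_allowed(mr_local: frozenset, mw_remote: frozenset) -> bool:
    """Return True if Figure 20 allows the requested MW remote rights."""
    if not mw_remote:
        return False              # row "Any / None": nothing was requested
    # Remote Read needs Local Read; Remote Write needs Local Write.
    return mw_remote <= mr_local
```

For example, a Region registered with Local Read only can back a Window with Remote Read, but a request for Remote Read and Write on that Region is rejected.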
Allocating or de-allocating a Memory Window requires a Privileged
mode transition for a Non-Privileged Consumer, and thus incurs the
associated software overhead. Binding a Memory Window is performed
with a Work Request posted to a Send Queue, and thus incurs far less
software overhead.
An STag used in a PostSQ Bind Memory Window Verb MUST be in the
Invalid state.
Each time a Memory Window is Bound, the Consumer passes the STag Key
portion of the STag to the RI. The RI MUST use the STag Key provided
by the Consumer. Additionally, the RI MUST NOT change the STag Index
portion of the STag passed in by the Consumer. Note that the Bind
Memory Window WR has unique ordering rules which are detailed in
Section 8.2.2.1 - Memory Management Operation Ordering. Once the
Bind operation has completed processing, RNIC implementations MUST
guarantee that no additional accesses on this Memory Window can be
performed with any STag Key other than the one used in the last Bind
operation.
If the RNIC detects an error with the Bind operation, it MUST put
the QP into the Error state.
Multiple Windows can be Bound to the same Memory Region, each with
arbitrary remote Access Rights, and their associated areas can be
overlapping or disjoint.
For a description of the error conditions checked during MW Bind and
MW access, see Section 7.10.6 - Error Checking during Memory Window
Operations.
For a description of the Bind Memory Window operation, see Section
9.3.1.1 - PostSQ.
7.10.3 Querying Memory Windows
Memory Windows have attributes that can be retrieved through the
Query Memory Window Verb. The RI MUST support the complete list of
MW attributes as described in Section 9.2.6.8 - Query Memory Window.
7.10.4 Invalidating or De-allocating Memory Windows
When access to a Memory Window by the RI is no longer required, but
the Consumer wants to retain the STag for use in future PostSQ Bind
Memory Window Verb invocations, the Consumer may directly invalidate
access to the Memory Window through either an Invalidate Local STag
WR or an RDMA Read with Invalidate Local STag WR. Additionally, an
STag associated with a Memory Window may be invalidated by a remote
Consumer through the use of a Send with Invalidate Message or a Send
with Solicited Event and Invalidate Message. For more information on
these Verbs, see Section 7.8 - Invalidating Memory Regions.
Memory Windows are Deallocated in a fashion similar to Memory
Regions: with the Deallocate STag Verb. For more information, see
Section 7.9 - Deallocation of STag associated with a Memory Region.
When processing an Invalidate operation on an MW STag:
* and the MW is in the Valid state, the RI MUST check and enforce
that the QP ID associated with the MW is equal to the QP ID of
the QP processing the Invalidate Local STag WR. If the QP IDs
match, the RNIC MUST place the specified local STag in the
Invalid state. If the QP IDs do not match, the RI MUST return an
error.
* and the MW is in the Invalid state, the RI MUST check and
enforce that the PD ID associated with the MW is equal to the PD
ID associated with the QP processing the Invalidate Local STag
WR. If the PD IDs do not match, the RI MUST return an error.
When a local Invalidate Local STag WR, local RDMA Read with
Invalidate Local STag WR, an incoming Send with Invalidate Message,
or an incoming Send with Solicited Event and Invalidate Message
completes successfully, the RNIC MUST:
* transition the associated STag to the Invalid state,
* change the association of the newly invalidated STag from the QP
to the PD of the QP that processed the STag Invalidation,
* retain the Memory Window resources associated with the STag,
* remove the association of the Memory Window with the underlying
Memory Region.
An invalidated STag which was either Invalidated as described above,
or in the Invalid state because it was created through the Allocate
Memory Window Verb but never used, can be used as the MW in a PostSQ
Bind Memory Window WR.
Once an STag associated with a MW is successfully Invalidated, the
RI MUST associate the STag with the PD associated with the QP
processing the Invalidate Local STag WR.
For information on Invalidating Memory Windows through the
Invalidate Local STag or RDMA Read with Invalidate Local STag WR,
see Section 9.3.1.1 - PostSQ. For information on Invalidating Memory
Windows through Send with Invalidate or Send with Solicited Event &
Invalidate WR, see Section 9.3.1.1 - PostSQ. For a description of
the Verb to deallocate a Memory Window, see Section 9.2.6.4 -
Deallocate STag.
7.10.4.1 Invalidating or De-allocating Active Windows
Under normal operation, it is improper for a Consumer to deallocate
or Invalidate the STag of the Memory Window while it is being used
in an incoming, remote operation. However, this can occur if the
Remote Consumer misbehaves, or it can occur under error recovery
circumstances.
Any Remote Operations that are in-process and actively using a
Memory Window when its STag is Invalidated MUST fail with a
protection error. Once the Completion of the Invalidate operation
has been determined by the Consumer, the RI MUST guarantee that no
additional accesses can be performed under the previous binding.
Any Remote Operations that are in-process and actively using a
Memory Window when it is deallocated MUST fail with a protection
error. Once the de-allocation Verb completes, RNIC implementations
MUST guarantee that no additional accesses can be performed through
that Memory Window.
An STag is allowed to already be in the Invalid state when the RNIC
performs the STag Invalidation.
7.10.5 Summary of Memory Window STag States
An STag associated with a Memory Window has two states:
* Invalid - May not be used to access a memory location.
o Entered through: Allocate Memory Window, PostSQ Invalidate
STag WR, incoming Send with Invalidate STag Message,
incoming Send with Solicited Event and Invalidate STag
Message, or local RDMA Read with Invalidate Local STag WR.
o Exited through: PostSQ Bind Memory Window WR or Deallocate
STag.
* Valid - May be used to access a memory location.
o Entered through: PostSQ Bind Memory Window WR.
o Exited through: PostSQ Invalidate STag WR, incoming Send
with Invalidate STag Message, incoming Send with Solicited
Event and Invalidate STag Message, local RDMA Read with
Invalidate Local STag WR, or Deallocate STag.
Note: Deallocate STag exits the state logic captured above.
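The state summary above can be written as a transition table. This is an illustrative sketch only: the four invalidation paths are collapsed into one "invalidate" event, and the DEALLOCATED member, the event names and next_state are hypothetical modelling choices.

```python
from enum import Enum


class MwState(Enum):
    INVALID = "invalid"
    VALID = "valid"
    DEALLOCATED = "deallocated"   # outside the two-state logic above


# (current state, event) -> next state
TRANSITIONS = {
    (MwState.INVALID, "bind"): MwState.VALID,
    (MwState.INVALID, "invalidate"): MwState.INVALID,  # already Invalid is OK
    (MwState.INVALID, "deallocate"): MwState.DEALLOCATED,
    (MwState.VALID, "invalidate"): MwState.INVALID,
    (MwState.VALID, "deallocate"): MwState.DEALLOCATED,
}


def allocate() -> MwState:
    """Allocate Memory Window returns an STag in the Invalid state."""
    return MwState.INVALID


def next_state(state: MwState, event: str) -> MwState:
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not permitted in state {state.name}")
```

A Bind on a Valid STag has no entry in the table, which reflects the rule that an STag used in a Bind MUST be in the Invalid state.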
7.10.6 Error Checking during Memory Window Operations
7.10.6.1 Error Checking at Window Bind Time
The RI MUST check for the following error conditions during the
Memory Window Bind operation and, if any error is detected the RI
MUST return a Completion Error.
* The RNIC MUST check and enforce that the MW STag is an MW STag
and is in the Invalid state.
* The RNIC MUST check and enforce that the QP has Memory Window
Binding enabled.
* The RNIC MUST check and enforce that the STag of the MR is an MR
STag and is in the Valid state and is not the STag of zero.
* The RNIC MUST check and enforce that the Memory Window, Memory
Region, and QP belong to the same PD.
* The RNIC MUST check and assure that the Memory Region has Window
binding enabled.
* The RNIC MUST check and enforce that the Memory Window Access
Rights are compatible with the Access Rights of the underlying
Memory Region. (See Figure 20).
* The RNIC MUST check and enforce that the Memory Region is not a
Zero based TO MR.
* The RNIC MUST check and enforce that the Memory Window base TO
and bounds is within the base TO and bounds of the underlying
Memory Region. The RI MUST enforce this with a byte level
granularity.
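The bind-time checks listed above can be gathered into one validation routine. This is a sketch under stated assumptions: Mr, Mw, Qp and bind_errors are hypothetical names, the "is an MW STag" and "is an MR STag" checks are represented by the distinct types, and rights are modelled as sets of the labels "read" and "write".

```python
from dataclasses import dataclass


@dataclass
class Mr:                         # hypothetical Memory Region model
    pd_id: int
    valid: bool
    is_stag_zero: bool
    window_binding_enabled: bool
    zero_based: bool
    base: int                     # base TO of the Region
    length: int
    local_rights: frozenset       # subset of {"read", "write"}


@dataclass
class Mw:                         # hypothetical Memory Window model
    pd_id: int
    valid: bool


@dataclass
class Qp:                         # hypothetical Queue Pair model
    pd_id: int
    bind_enabled: bool


def bind_errors(mw: Mw, mr: Mr, qp: Qp, base: int, length: int,
                remote_rights: frozenset) -> list:
    """Return the failed checks; an empty list means the Bind may proceed."""
    errors = []
    if mw.valid:
        errors.append("MW STag is not in the Invalid state")
    if not qp.bind_enabled:
        errors.append("QP does not have Memory Window Binding enabled")
    if not mr.valid or mr.is_stag_zero:
        errors.append("MR STag is not Valid or is the STag of zero")
    if not (mw.pd_id == mr.pd_id == qp.pd_id):
        errors.append("MW, MR and QP are not in the same PD")
    if not mr.window_binding_enabled:
        errors.append("MR does not have Window binding enabled")
    if not remote_rights or not remote_rights <= mr.local_rights:
        errors.append("MW rights are incompatible with MR rights")
    if mr.zero_based:
        errors.append("MR is a Zero based TO MR")
    # Byte-granular bounds check of the Window against the Region.
    if base < mr.base or base + length > mr.base + mr.length:
        errors.append("MW range is outside the MR range")
    return errors
```

Any non-empty result corresponds to a Completion Error for the Bind WR.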
7.10.6.2 Error Checking at Window Access Time
The following conditions MUST be checked for each incoming RDMAP
Tagged Message targeting an STag that is associated with a Memory
Window:
* The RNIC MUST check and enforce that the MW STag is in the Valid
state.
* The RNIC MUST check and enforce that the QP ID associated with
the Memory Window is equal to the QP ID associated with the
incoming remote operation that is accessing the Memory Window.
* The RNIC MUST check and enforce the incoming memory access as
represented by the TO and length is within the TO base and
bounds of the Memory Window. The RI MUST enforce this with a
byte level granularity.
* The RNIC MUST check and enforce the Access Rights associated
with the Memory Window.
* The RNIC MUST NOT check or enforce the Access Rights associated
with the Memory Region to which the Memory Window is Bound.
* The RI MUST check that the appropriate MW and QP Remote Access
Rights are enabled for the incoming RDMA Message. For example,
if the incoming RDMA Message is an RDMA Write targeting a MW,
the RI must check that the MW and the QP have Remote Write
Access Rights enabled.
If any of the above checks fail, the RI MUST NOT allow the memory
access to take place and a protection error MUST be generated.
If the length of the access is zero, the RI MUST NOT perform any of
the above checks on the Memory Window.
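The access-time rules, including the zero-length exemption, can be sketched as follows. The names BoundWindow and window_access_allowed are hypothetical, and the operation and rights are modelled with the illustrative labels "read" and "write". The rights of the underlying Memory Region are intentionally absent from the model.

```python
from dataclasses import dataclass


@dataclass
class BoundWindow:                # hypothetical model of a Bound MW
    valid: bool
    qp_id: int
    base: int                     # base TO of the Window
    length: int
    remote_rights: frozenset      # subset of {"read", "write"}


def window_access_allowed(mw: BoundWindow, qp_id: int, qp_rights: frozenset,
                          op: str, to: int, length: int) -> bool:
    """Return False where a protection error would be generated."""
    if length == 0:
        return True               # zero-length access: no MW checks at all
    return (
        mw.valid
        and mw.qp_id == qp_id
        # Byte-granular bounds check against the Window, not the Region.
        and mw.base <= to
        and to + length <= mw.base + mw.length
        # Both the MW and the QP must enable the requested remote right.
        and op in mw.remote_rights
        and op in qp_rights
    )
```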
Note that the QP attributes must be verified as well. For more
information, see Section 8.1.2.2.
7.10.6.3 Error Checking at Window Invalidate Time
The following conditions MUST be checked on a PostSQ Invalidate
Local STag WR, RDMA Read with Invalidate Local STag WR, incoming
Send with Invalidate Message, or incoming Send with Solicited Event
and Invalidate Message that accesses a Memory Window:
* If the Memory Window is in the Valid state, the RNIC MUST check
and enforce that the QP ID associated with the Memory Window is
equal to the QP ID associated with the QP processing the
Invalidate Local STag WR.
* If the Memory Window is in the Invalid state, the RNIC MUST
check and enforce that the PD ID associated with the Memory
Window is equal to the PD ID associated with the QP processing
the Invalidate Local STag WR.
If any of the above checks fail, the RI MUST NOT allow the
invalidation to take place and the operation MUST result in an
error.
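The invalidate-time checks, together with the state changes that Section 7.10.4 requires on success, can be sketched as below. Window and invalidate_window are hypothetical names, and the error is reduced to a False return because the text above does not name a specific error type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Window:                     # hypothetical Memory Window model
    valid: bool
    pd_id: int
    qp_id: Optional[int] = None   # set while the Window is Bound
    bound_mr: Optional[int] = None


def invalidate_window(mw: Window, qp_id: int, qp_pd_id: int) -> bool:
    """Apply an Invalidate; return False if the checks reject it."""
    if mw.valid:
        if mw.qp_id != qp_id:     # Valid: QP ID of the MW must match
            return False
    elif mw.pd_id != qp_pd_id:    # Invalid: PD ID of the MW must match
        return False
    mw.valid = False              # STag moves to (or stays in) Invalid
    mw.pd_id = qp_pd_id           # association moves from the QP to its PD
    mw.qp_id = None
    mw.bound_mr = None            # MW resources are kept; the Bind is removed
    return True
```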
8 Work Requests and the WR Processing Model
8.1 Work Requests
A Work Request is the fundamental unit of work used by the Consumer
to indicate to the RNIC that there is data to transfer and control
operations to process on a specific QP. The following sections
describe the creation of Work Requests, types of Work Requests and
Work Request Contents.
8.1.1 Creating Work Requests
Work Requests MUST be the only mechanism available to Consumers to
submit work to the Work Queues. The Work Requests Verbs MUST be used
only to pass operations from the Consumer to the RI. Specifically,
these Verbs are PostSQ (Section 9.3.1.1) and PostRQ (Section
9.3.1.2).
Work Requests can only be posted to the SQ or RQ of a specific QP,
or, if the QP is associated with an S-RQ, to the S-RQ associated
with the QP.
Work Requests are created by the Consumer above the RI and submitted
through the Verbs to the RI for processing. The format of Work
Requests within the RI is not defined. Its structure is opaque to
the Consumer and is not part of this specification. WRs are only
valid during the Posting process. WRs are then represented by WQEs
until Completed.
The RNIC MUST support the submission of multiple WRs to the RI as a
list of individual Work Requests. The intention of this requirement
is to allow for optimizations in the RNIC such that the RI can
inform the RNIC of WQEs in the most efficient manner for that
individual RNIC.
8.1.2 Work Request Types
There are three basic Work Request types: Send/Receive, RDMA, and
Memory.
8.1.2.1 Send/Receive
The Send/Receive model supports the Untagged Buffer Model in the
RDMAP/DDP specifications. The Send/Receive model uses a one-to-one
correspondence between outgoing Send Operation Type WRs and
incoming Receive Queue WRs. Successful Send Type Work Requests MUST
result in the consumption of a Receive Queue Work Request at the
Associated QP. Receive Queue Work Requests should be posted to the
RQ before the incoming Send Message Type arrives. If a WQE is not
available on the RQ to describe the Untagged Buffer for the incoming
Send Message Type, then the LLP Stream MAY be terminated. If the LLP
Stream is not terminated, the reader should see Section 13.2 -
Graceful Receive Overflow Handling for one implementation option.
The RI MUST allow Send Work Requests to be posted only to a Send
Queue. This includes all Send Operation Types, which are: Send, Send
with Solicited Event, Send with Invalidate and Send with Solicited
Event & Invalidate. The RI MUST allow only Receive Work Requests to
be posted to a Receive Queue or Shared Receive Queue.
A Receive Queue Scatter/Gather List Work Request MUST contain at
least enough buffer space to place the incoming Send Message Type.
If it does not, a Completion Error MUST be returned. The length of
the buffer represented by the Scatter/Gather List of a Receive Queue
Work Request MAY be greater than the length of the incoming data.
The length of incoming data MUST be returned by the RI as part of
the Work Completion. In the case of any Completion Error, the value
of the length in the Work Completion MUST be considered
indeterminate.
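The receive buffer rule can be illustrated with a short sketch. The function name and the dictionary result are hypothetical stand-ins for a Work Completion, and None is used here to represent the indeterminate length on error.

```python
def complete_receive(sge_lengths: list, incoming_length: int) -> dict:
    """Model the Work Completion for one Receive Queue Work Request."""
    capacity = sum(sge_lengths)   # total buffer space described by the SGL
    if incoming_length > capacity:
        # Buffer too small: Completion Error, length is indeterminate.
        return {"status": "error", "length": None}
    # The buffer may be larger than the data; report the data length.
    return {"status": "success", "length": incoming_length}
```

A 300 octet message arriving on an SGL of two 256 octet elements completes successfully with a length of 300, while the same message on a single 256 octet element results in a Completion Error.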
Since segmentation and reassembly is provided by DDP, Send Operation
Types and corresponding Receives can be larger than the EMSS (See
[RDMAP][DDP]). The maximum data transfer length supported by the
architecture is 2^32-1 octets of data. Note that for any given
message, the length of the buffers represented by the WRs posted to
the RQ MAY have a total length that is smaller than the maximum data
transfer length. It is up to the Consumer to negotiate the maximum
receive buffer size with the Remote Peer.
The Data Source of Send Operation Types MUST be a local
Scatter/Gather List. See Section 8.1.3.2 for a description of
Scatter/Gather List.
The Data Sink of Receive operations MUST be a local Scatter/Gather
List.
8.1.2.2 RDMA
RDMA Write WRs, RDMA Read WRs, and RDMA Read with Invalidate Local
STag WRs MUST NOT result in the consumption of a Receive Queue Work
Request at the Remote Peer.
The Data Source of an RDMA Write Work Request MUST be a
Scatter/Gather List consisting of local buffers.
The Data Sink used in an RDMA Read Type WR MUST be in the local
node's address space as represented by the TO, STag and Length
contained in the RDMA Read Type WR. The STag MUST be Bound to either
a Memory Region or a Memory Window containing the buffer represented
by the TO and length.
The Data Source for an RDMA Read Type WR and the Data Sink for an
RDMA Write WR MUST be in the Remote Peer's address space as
represented by the TO, STag and Length contained in the Work
Request. The STag MUST represent either a Memory Region or a Memory
Window containing the buffer represented by the STag, TO and length.
Queue Pairs have RDMA Read enable and RDMA Write enable attributes.
Memory Regions and Memory Windows have Remote Read and Remote Write
attributes as well. Memory Regions also have Local Read and Local
Write attributes. RDMA transfers MUST only take place when the
appropriate QP RDMA attribute is enabled and the appropriate STag
attribute is enabled where the STag represents either a Memory
Region or a Memory Window. If the STag is that of a Memory Window,
the attributes of the Memory Region do not apply at memory access
time. These attributes are checked at the node where the target
memory is located. After the STag Access Rights and QP Access Rights
have been verified, the RI MUST verify that the STag Access Rights
match the QP Access Rights. If the RI detects an invalid Access
Rights combination, the operation MUST result in a protection error.
The combinations of QP Access Rights and STag Access Rights which
will allow the data transfer to take place are shown in Figure 21.
STag Used as         QP Attribute            STag Attribute(5)      Access
                                                                    Allowed?
-----------------    --------------------    -------------------    --------
RDMA Read Type       Inbound RDMA Read:      Remote Read Access:
Data Source            Enabled                 Enabled              Yes
                       Disabled                Either               No
                       Either                  Disabled             No

RDMA Write or        Inbound RDMA Write      Remote Write
RDMA Read Type       and inbound RDMA        Access:
Data Sink            Read Response:
                       Enabled                 Enabled              Yes
                       Disabled                Either               No
                       Either                  Disabled             No

RDMA Write or        Either                  Local Read Access:
Send Type Data                                 Enabled              Yes
Source                                         Disabled             No

Receive Data         Either                  Local Write Access:
Sink                                           Enabled              Yes
                                               Disabled             No
Figure 21 - Valid QP & STag Access Right Combinations
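Figure 21 distinguishes two kinds of STag use: those where both the QP attribute and the STag attribute must be enabled, and those where only the STag attribute matters. A minimal sketch, in which the use labels and the function name are illustrative and the two booleans stand for "the QP attribute named in the figure is enabled" and "the STag attribute named in the figure is enabled":

```python
# Uses that depend on both the QP attribute and the STag attribute.
QP_AND_STAG_USES = {"rdma_read_data_source", "rdma_data_sink"}
# Uses where the QP attribute is "Either" and only the STag is checked.
STAG_ONLY_USES = {"send_or_write_data_source", "receive_data_sink"}


def access_allowed(stag_use: str, qp_attr_enabled: bool,
                   stag_attr_enabled: bool) -> bool:
    """Return the 'Access Allowed?' column of Figure 21."""
    if stag_use in QP_AND_STAG_USES:
        return qp_attr_enabled and stag_attr_enabled
    if stag_use in STAG_ONLY_USES:
        return stag_attr_enabled
    raise ValueError(f"unknown STag use: {stag_use!r}")
```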
The RDMA Read with Invalidate Local STag WR behaves similarly to an
RDMA Read Work Request that is immediately followed by an
Invalidate Local STag WR on the STag in the Local Address. The
difference is that, in this case, the Invalidate will not occur
until after the RDMA Read Operation is complete, whereas with two
separate WRs the Invalidate operation could begin processing before
the RDMA Read Type WR Completes. Work Requests subsequent to an RDMA
Read with Invalidate Local STag WR may begin processing before the
RDMA Read with Invalidate Local STag WR Completes. See Section
8.2.2.1 - Memory Management Operation Ordering for more details.
Footnote 5: The STag may have additional Access Rights, but only the
rights listed affect the allowed access.
8.1.2.3 Memory
The following Memory Operations can be posted to the SQ: Bind Memory
Window, Fast-Register Non-Shared Memory Region, Invalidate Local
STag and RDMA Read with Invalidate Local STag.
8.1.2.3.1 Bind Memory Windows
The Bind Memory Window WR associates a previously allocated MW to a
specified Tagged Offset (TO) range within an existing MR, as well as
sets the MW's RDMA remote Access Rights.
Bind operations MUST be posted to the SQ as a Work Request. Binds
only affect local RNIC mapping resources and MUST NOT cause any
segment to be issued to the LLP. No resources at the associated QP
are directly affected.
For more information on the Memory Window Bind operation, see
Section 7.10.2 - Binding Memory Windows to Memory Regions.
8.1.2.3.2 Fast-Register Non-Shared Memory Region
The Fast-Register Non-Shared Memory Region WR associates an MR STag
that is in the Invalid state to a specified Physical Buffer List
(For more information on Invalidating STags, see Section 7.8 -
Invalidating Memory Regions). For information on the STag types
allowed, see Section 7.3.2.5 - Fast-Register Non-Shared Memory
Region.
Fast-Register Non-Shared Memory Region operations MUST be posted to
the Send Queue. Fast-Register Non-Shared Memory Region operations
only affect local RNIC mapping resources and do not cause any data
transfer. No resources at the Associated QP are directly affected.
8.1.2.3.3 Invalidate Local STag
The Invalidate Local STag and RDMA Read with Invalidate Local STag
WRs use the STag supplied as the target for the invalidation and
transition the STag to the Invalid state.
The STag which is the target of an Invalidate Local STag or RDMA
Read with Invalidate Local STag WR MUST be associated with a Non-
Shared Memory Region (i.e. created by Allocate Non-Shared Memory
Region STag, RI-Register Non-Shared Memory Region, RI-Reregister
Non-Shared Memory Region and has not transitioned to a Shared Memory
Region) or MW (i.e. created by Allocate Memory Window).
For information on Invalidating STags associated with a Non-Shared
MR, see Section 7.8 - Invalidating Memory Regions. For information
on Invalidating STags associated with MWs, see Section 7.10.4 -
Invalidating or De-allocating Memory Windows.
Invalidate Local STag operations MUST be posted to the Send Queue as
a Work Request. The Invalidate Local STag operations only affect
local RNIC mapping resources and MUST NOT cause any data transfer.
No resources at the Associated QP are directly affected.
The initiation of an Invalidate Local STag operation must remain
ordered with respect to other Work Requests on the same QP and the
operation must take effect before any subsequent WRs can begin
processing by the RNIC, as defined in the ordering rules in Section
8.2.2.1 and Section 8.2.2.2.
8.1.3 Work Request Contents
Every Work Request submitted through the Verbs contains all of the
information required to perform the requested operation. The exact
WR contents are covered in Section 9.3.1.1 - PostSQ and Section
9.3.1.2 - PostRQ. The characteristics of two of the Post Send
Request Verb
modifiers are discussed below.
8.1.3.1 Signaled Completions
Signaled Completions refer to Work Requests that result in a Work
Completion. Unsignaled Completions provide a mechanism where Work
Requests posted to the Send Queue do not generate a Work Completion
in the associated Completion Queue if the operations complete
successfully. The RI MUST support PostSQ WRs with Unsignaled
Completions on every QP.
Every WR posted to the RQ MUST result in a Work Completion.
Consequently, all RQ WRs are considered Signaled WRs.
The Consumer can indicate that it does not need a Signaled
Completion by setting the Unsignaled Completion indicator in a Work
Request posted to the SQ.
When an error is encountered on an Unsignaled or Signaled WR, a CQE
will be generated for that WR with the appropriate error code. In
addition, the RI MUST Complete all subsequent WRs with a Flushed
Error Completion Status regardless of their signaling type. The
Consumer is safe in assuming that all WRs prior to the one resulting
in an error were completed successfully.
An Unsignaled WR is defined as completed successfully when all of
the following rules are met:
* A Work Completion is retrieved from the CQ associated with the
SQ where the unsignaled Work Request was posted,
* that Work Completion corresponds to a subsequent Work Request on
the same Send Queue as the unsignaled Work Request, and
* the subsequent Work Request is ordered after the unsignaled Work
Request as per the ordering rules. Depending on the Work Request
used, this may require using the Local Fence indicator in order
to guarantee ordering.
When an unsignaled WQE completes successfully:
* The RI MUST free up any resources associated with the Unsignaled
WQE,
* The Consumer MAY consider the WQE as having completed
successfully, and
* The Consumer MAY re-use any resources associated with the
Unsignaled WQE.
The Consumer should ensure that the CQ will not overflow, as stated
in Section 5.3.1, in the event that a WQE with an Unsignaled
Completion indicator results in an error. In that case the failing
WQE generates a CQE, and every WQE after it also generates a CQE
with the Flushed status.
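The signaling rules above can be illustrated with a minimal sketch.
It is not part of the Verbs; the names WorkRequest and
complete_send_queue, the "fails" flag, and the status strings are
hypothetical and chosen only for illustration. The sketch assumes a
single Send Queue whose WRs are processed strictly in order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WorkRequest:
    wr_id: int
    signaled: bool        # False models the Unsignaled Completion indicator
    fails: bool = False   # hypothetical flag: this WR encounters an error


def complete_send_queue(wrs: List[WorkRequest]) -> List[Tuple[int, str]]:
    """Return the (wr_id, status) CQEs generated for an ordered SQ."""
    cqes = []
    errored = False
    for wr in wrs:
        if errored:
            # Every WR after the first error is flushed, Signaled or not.
            cqes.append((wr.wr_id, "FLUSHED"))
        elif wr.fails:
            # An error always produces a CQE, even for an Unsignaled WR.
            cqes.append((wr.wr_id, "ERROR"))
            errored = True
        elif wr.signaled:
            cqes.append((wr.wr_id, "SUCCESS"))
        # An Unsignaled WR that succeeds produces no CQE.
    return cqes
```

Under this model, a Send Queue holding N Unsignaled WRs can produce
up to N CQEs if the first WR fails, which is the case the CQ sizing
advice above guards against.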
8.1.3.2 Scatter/Gather List
The RI MUST allow each Scatter/Gather List (SGL) to contain one or
more Scatter/Gather Elements (SGE). The SGE references a buffer via
an STag, TO, and length. The STag specified in the SGE MUST be
Registered with the RI prior to submission, except for the STag of
zero. These buffers referenced by the STag MUST be considered to be
in the scope of the RI from the time they are submitted to a Work
Queue until Completion of the Work Request has been confirmed.
If a Memory Window STag is used in an SGE of a PostRQ WR, of a PostSQ
Send Operation Type WR, or of the Data Source of an RDMA Write WR,
the RI MUST Complete the Work Request with a Completion Error.
The sum total of all of the buffer lengths in an SGL MUST NOT exceed
the maximum message payload size specified for RDMAP. This is 2^32-1
bytes. If an SGE has a length of zero, the STag MUST NOT be
validated by the RI. For PostSQ WRs, the sum of the Length field in
all of the SGEs MUST be the total length of that RDMAP operation.
The RI MUST allow this total length to be zero.
An RI MAY support more than one Scatter/Gather Element per
Scatter/Gather List. The exact number of Scatter/Gather Elements per
Scatter/Gather List supported by the RNIC MUST be returned via the
Query RNIC Verb (Section 9.2.1.2), which reports two values: one for
the Data Source buffers of Send Operation Type WRs (which also
applies to PostRQ buffers) and one for the Data Source buffers of
RDMA Write WRs.
The Consumer can specify the maximum number of Scatter/Gather
Elements per Scatter/Gather List for each Work Queue as an input
modifier to the Create QP (Section 9.2.5.1). The RI MUST return an
Immediate Error if the value in Create QP exceeds the value
supported by the RNIC.
An RI MUST support at least four Scatter/Gather Elements per
Scatter/Gather List when the Scatter/Gather List refers to the Data
Source of a Send Operation Type or the Data Sink of a Receive
Operation. An RI is NOT REQUIRED to support more than one
Scatter/Gather Element per Scatter/Gather List when the
Scatter/Gather List refers to the Data Source of an RDMA Write.
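The Scatter/Gather List rules of this section can be sketched as a
simple check. This is only an illustration under stated assumptions:
the SGE class, the check_sgl function, and the use of ValueError are
hypothetical, and the sketch does not say whether a real RI reports a
given violation as an Immediate Error or as a Completion Error. The
Privileged Mode restriction on the STag of zero is not modeled.

```python
from dataclasses import dataclass
from typing import List, Set

MAX_RDMAP_PAYLOAD = 2**32 - 1   # maximum RDMAP message payload, in bytes


@dataclass
class SGE:
    stag: int
    tagged_offset: int
    length: int


def check_sgl(sgl: List[SGE], max_sge: int, registered_stags: Set[int]) -> int:
    """Return the total SGL length, or raise ValueError on a violation."""
    if len(sgl) > max_sge:
        raise ValueError("too many SGEs for this Work Queue")
    total = 0
    for sge in sgl:
        if sge.length == 0:
            continue  # the STag of a zero-length SGE is not validated
        if sge.stag != 0 and sge.stag not in registered_stags:
            raise ValueError("STag not registered")
        total += sge.length
    if total > MAX_RDMAP_PAYLOAD:
        raise ValueError("SGL exceeds the maximum RDMAP payload")
    return total
```

A total of zero is a valid result, matching the requirement that a
zero-length operation be allowed.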
8.1.3.2.1 STag of zero Usage
The ability to use the reserved STag of zero MUST NOT be allowed for
Non-Privileged Mode accessible QPs. The RI must generate an
Affiliated Asynchronous Error if an RDMAP Tagged message is received
with an STag of zero. If the STag of zero is used in an outgoing
RDMA Read Type WR or as the Data Sink of an RDMA Write WR, the RI
MUST return a Completion Error. Thus the Consumer should not
Advertise the STag of zero, since an error will result.
8.1.3.3 RDMA Data Source & Data Sink
For RDMA Read Type Work Requests, the RI MUST support the Data
Source Local Address as an input modifier to PostSQ. The structure
representing this information is known as a Data Source Address. A
Data Source Address consists of an STag, Tagged Offset and Length.
An RI MUST support exactly one Data Source Address for RDMA Read
Type Work Requests.
For RDMA Write Work Requests, the RI MUST support the Data Source
Scatter/Gather List as an input modifier to PostSQ.
For RDMA Write and RDMA Read Type Work Requests, the RI MUST support
the Data Sink Remote Address as an input modifier to PostSQ. The
structure representing this information is known as a Data Sink
Address. A Data Sink Address consists of an STag, Tagged Offset and
Length. An RI MUST support exactly one Data Sink Address for RDMA
Read Type Work Requests and RDMA Write Work Requests.
8.2 Work Request Processing Model
The Work Request processing model describes how requests are
submitted, processed by the RNIC, and the results returned to the
Consumer.
8.2.1 Submitting Work Request to a Work Queue
Work Requests are submitted to the RNIC through the Verbs. They are
represented within the RI as Work Queue Elements. Work Queue
Elements are abstract. This means they are not accessible directly
by the Consumer of the RNIC Interface.
Work Requests can be submitted to the RNIC as a list of Work
Requests. Each Work Request in the Work Request List which is
successfully inserted into the Work Queue MUST result in the
consumption of one WQE on the Work Queue, and each Work Request MUST
be submitted to the Work Queue in the order specified in the Work
Request List. When a list of WRs containing more than one WR is
posted on an SQ, RQ, or an S-RQ, the first Immediate Error in
processing a WR MUST stop processing of the Work Request List and
MUST NOT enqueue the subsequent WRs in the list onto the Work Queue.
All Work Requests prior to the Work Request in error MUST be
inserted into the Work Queue. The RI MUST return to the Consumer the
number of successfully posted WRs and the verbs result MUST indicate
the Immediate Error associated with the WR that resulted in the
first error.
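The Work Request List behavior described above can be sketched as
follows. This is a minimal illustration, not an implementation of
PostSQ or PostRQ: the function post_wr_list, the validate callback,
and the error names "QUEUE_FULL" and "INVALID_WR" are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def post_wr_list(
    queue: List[object],
    capacity: int,
    wr_list: List[object],
    validate: Callable[[object], Optional[str]],
) -> Tuple[int, Optional[str]]:
    """Enqueue WRs in list order and stop at the first Immediate Error.

    Returns (number of WRs posted, first Immediate Error or None).
    """
    posted = 0
    for wr in wr_list:
        if len(queue) >= capacity:
            return posted, "QUEUE_FULL"
        error = validate(wr)
        if error is not None:
            # WRs before this one stay queued; later WRs are not enqueued.
            return posted, error
        queue.append(wr)   # each accepted WR consumes one WQE
        posted += 1
    return posted, None
```

For example, posting four WRs of which the third is rejected leaves
the first two on the Work Queue and reports a count of two together
with the error for the third.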
The intent of supporting a WR List is to allow some implementations
to reduce the number of Consumer to RI interactions when the
Consumer has multiple WRs to post, and to reduce the number of
interactions between the RI and RNIC due to alerting the RNIC of
additional work to perform.
One of the intentions of the architecture is to allow an
implementation to pass Work Requests from a Non-Privileged Mode
Consumer directly to the RNIC. Consequently, certain Verbs are
designed to be invoked in either Privileged Mode or Non-Privileged
Mode while others are designed to be invoked only in Privileged
Mode. The Verbs that are intended to be invoked in either Privileged
Mode or Non-Privileged Mode are: PostSQ, PostRQ, Poll for Completion
and Request Completion Notification.
The RI MUST return control to the Consumer immediately after a WR or
WR List has been submitted to the SQ, RQ or S-RQ and the RNIC has
been notified that a new WR or WR List is ready to process.
The RI MUST ensure that the space occupied by a Work Request in
either the Send or Receive Work Queue is not made available for
posting a new Work Request until:
* In the case where the WR was Signaled, the associated Completion
has been reaped.
* In the case where the WR is Unsignaled, one of the following is
true:
o The WR has Completed processing successfully, OR
o The associated Completion has been reaped for the WR if the
Unsignaled WR Completed in error, OR
o A Completion associated with a subsequently posted WR to the
same WQ has been reaped.
If space is not available on a Work Queue, then an RI MUST return an
Immediate Error.
The Unsignaled WR confirmation rules dictate that the Consumer must
post a WR with the Signaled Completion indicator set at an interval
no greater than the maximum number of WQEs on the SQ. In other words,
if X equals the maximum number of WQEs on the SQ, then the Consumer
must post at least one Signaled Completion Work Request every X Work
Requests. In addition, the Consumer must retrieve a Work Completion
of a Signaled Completion at an interval no greater than the maximum
number of WQEs on the SQ. This is
done in order to force confirmation that prior Unsignaled WRs are
Completed. If the Consumer does not follow these rules, a situation
may arise where the Consumer is unable to post WRs to the SQ. A ULP
reply based on the data that was in a SQ WR is insufficient for
determining if the WR has completed, since hardware resources may be
held in use until the WCs are polled from the CQ.
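One way a Consumer might follow the "one Signaled WR every X Work
Requests" rule is sketched below. The helper names should_signal and
plan_signals are hypothetical, and the policy shown (signal only when
the rule forces it) is just one possible choice. The sketch covers
posting only; the Consumer must still retrieve the resulting Work
Completions from the CQ.

```python
from typing import List


def should_signal(unsignaled_run: int, sq_depth: int) -> bool:
    """True if the next WR must carry a Signaled Completion.

    unsignaled_run counts the Unsignaled WRs posted since the last
    Signaled WR; sq_depth is the maximum number of WQEs on the SQ (X).
    """
    return unsignaled_run + 1 >= sq_depth


def plan_signals(num_wrs: int, sq_depth: int) -> List[bool]:
    """Return, for each of num_wrs WRs in order, whether it is Signaled."""
    plan = []
    unsignaled_run = 0
    for _ in range(num_wrs):
        signaled = should_signal(unsignaled_run, sq_depth)
        plan.append(signaled)
        unsignaled_run = 0 if signaled else unsignaled_run + 1
    return plan
```

With a Send Queue depth of four, this plan marks every fourth WR as
Signaled, so any window of four consecutive WRs contains one.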
The QP can accept Work Requests only when the QP is in a state that
allows Work Requests to be submitted.
For details on the Verbs which submit Work Requests, see Sections
9.3.1.1 - PostSQ and 9.3.1.2 - PostRQ.
8.2.2 Work Request Processing
Work Requests submitted to a Work Queue are initiated and processed
according to the rules in this section.
It is important to understand the difference between Placement and
Delivery ordering since RDMAP provides different semantics for the
two.
Note that many current protocols, both as used in the Internet and
elsewhere, assume that data is both Placed and Delivered in order.
This allows applications to take a variety of shortcuts that
depended on in-order Placement and Delivery. For RDMAP, many of
these shortcuts are no longer safe to use, and could cause
application failure. To ensure reliable operation, applications need
to take the rules described below into account.
The following rules apply to implementations of the RDMAP protocol:
1. Send Type, RDMA Write, and RDMA Read Type Work Requests
submitted to a Send Queue MUST be initiated and sent in the
order submitted to the Send Queue.
2. Work Requests submitted to a single Send Queue or Receive Queue
MUST be Completed by the RI in the same order as the Work
Requests were submitted. Note that this does not apply to WRs
posted to S-RQs.
3. Ordering guarantees for processing and Completion notifications
exist only between Work Requests submitted to the same Work
Queue. The RI is NOT REQUIRED to provide ordering guarantees
across multiple local SQ to remote RQ pairs.
4. RDMA Messages MAY be Placed in any order while in the scope of
the RI. If an application uses overlapping buffers (points
different Messages or portions of a single Message at the same
buffer), then it is possible that the last incoming write to the
Data Sink buffer will not be the last outgoing data sent from
the Data Source.
5. For a Send Type Operation, the contents of the Receive Queue
Buffer at the Data Sink MAY be indeterminate until the Receive
Queue Work Request is Completed at the Data Sink.
6. For an RDMA Write Operation, the contents of the buffer at the
Data Sink MUST be considered indeterminate until a subsequent
Send Type Message is Completed by consuming a Receive Queue WQE
at the Data Sink.
7. For an RDMA Read Operation, the contents of the buffer at the
Data Sink MUST be considered indeterminate until the RDMA Read
Type Work Request has been Completed.
Statements 5, 6, and 7 imply no peeking at the data in a buffer
to see if all of the data has arrived. It is possible for some
data to arrive before logically earlier data does, and peeking
may cause unpredictable application failure.
8. Except for Unsignaled WRs that complete successfully, the
resources associated with a Work Request must be considered to
be in the scope of the RI from the time the Work Request is
submitted to a Work Queue until the associated Work Completion has
been returned. For Unsignaled WRs that complete successfully,
refer to Section 8.1.3.1 for a description of when the resources
associated with the Unsignaled WR are freed.
9. If the Consumer or Application modifies the contents of Data
Sink Buffers while the buffers are in the scope of the RI, the
state of the Data Sink Buffers is indeterminate.
10. If the Consumer or Application modifies the contents of Data
Source Buffers while the buffers are in the scope of the RI, the
state of the Data Sink buffers is indeterminate.
11. The RI is NOT REQUIRED to guarantee that the Completion of an
RDMA Write or Send Type WR at the Local Peer means that the ULP
Message has: reached the Remote Peer, reached the Remote Peer
ULP Buffer, or been examined by the Remote Peer ULP.
12. Incoming Untagged RDMAP Messages (sent in FIFO and MSN order)
MUST use RQ or S-RQ Buffers and Complete through the RQ's CQ, in
the same order as the Send Message Type Work Requests are posted
to the Associated QP's Send Queue.
13. Upon local Completion of an incoming Untagged RDMAP Message the
RI MUST guarantee that any prior Send or RDMA Write Messages
from the same Associated QP have also Completed at the Data
Sink.
14. If the Consumer overlaps its Data Sink buffers for different
operations, subsequent Operations MAY cause the RI to overwrite
the data in those buffers before the Consumer receives and
processes the Completion.
15. The RI MAY begin processing subsequent Work Requests posted to
the Send Queue (except for operations which are affected by a
fence - see Section 8.2.2.2), before Completing a prior RDMA
Read Type Work Request (including zero-length RDMA Read Type
Work Requests). Therefore, when an application does an RDMA Read
Type Work Request followed by an RDMA Write or Send Type WR
targeting the same buffer, it MAY return the data from the later
RDMA Write or Send Type WR in the RDMA Read Operation Data Sink
buffer, even though the operations Complete in order on the Send
Queue's Completion Queue. If this behavior is not desired, the
Local Peer Consumer must set the Read Fence indicator on the
later RDMA Write or Send Type Work Request.
16. Before an Inbound RDMA Read Request Message is processed (the
specified buffer is read), the RI MUST have delivered all prior
incoming RDMAP Messages initiated from the same Remote Peer's
Send Queue. Therefore, when an application does an RDMA Write or
Send Type Work Request followed by an RDMA Read Type Work
Request targeting the same remote buffer, the RDMA Read Type WR
MUST return the data as modified by the prior operations.
17. The RI MAY Complete incoming Send Message Types before the RI
has finished generating RDMA Read Response Messages for an
incoming RDMA Read Request Message (initiated from the same
Remote Peer's Send Queue). Therefore, indeterminate results may
occur if an application does an RDMA Read Type Work Request
followed by a Send Type Work Request, and uses the Work
Completion on the Associated QP's RQ Completion Queue (for the
incoming Send Type Message) as an indicator that the inbound
RDMA Read Operation processing has finished. If this behavior is
not desired, the Local Peer Consumer must set the Read Fence
indicator on the later RDMA Write (or Send Type) Work Request.
18. If more RDMA Read Type Work Requests are posted to the Send
Queue than are indicated by the ORD QP Attribute, the RI MUST
pause the processing of the Send Queue until at least one prior
RDMA Read Type WR Completes. If zero outbound RDMA Read Request
Messages are supported on the QP, and the Consumer posts an RDMA
Read Type Work Request, the RI MUST Complete the Work Request in
error.
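Rule 18 above can be summarized by a small decision function. This is
a sketch only; the function name and the returned strings are
hypothetical labels for the three outcomes the rule describes.

```python
def rdma_read_action(outstanding_reads: int, ord_limit: int) -> str:
    """Decide what happens when an RDMA Read Type WR reaches the SQ head.

    outstanding_reads is the number of prior RDMA Read Type WRs that
    have not yet Completed; ord_limit is the ORD QP Attribute.
    """
    if ord_limit == 0:
        return "COMPLETE_IN_ERROR"   # no outbound RDMA Reads are supported
    if outstanding_reads >= ord_limit:
        return "PAUSE"               # wait for a prior RDMA Read to Complete
    return "ISSUE"
```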
Access by the RNIC to Memory Regions or Memory Windows is NOT
REQUIRED to be cache-coherent. If an RNIC caches some portion of
memory buffers during the time that the buffers are being processed
by the RNIC, there is no requirement that updates to these buffers
by any entity be seen by the RNIC. Also, any updates to these
buffers by the RNIC are implementation dependent and may not be
immediately seen by the system processor, other IO devices, or other
RNICs.
8.2.2.1 Memory Management Operation Ordering
This section defines the ordering constraints imposed on Work
Requests. The next section defines additional ordering constraints
that can be placed by using the Read Fence or Local Fence indicator.
Because one of the objectives of DDP is to enable placement of
incoming out-of-order DDP segments into the buffer provided by the
Consumer, ordering semantics can not be guaranteed for certain
operation combinations. If the Work Request sends payload to the
Remote Peer, just because a Work Request Completes locally does not
necessarily mean that the Remote Peer has received the data, or that
subsequent DDP Segment payload can not overwrite the current data if
targeting the same Remote Peer buffer.
Thus, for example, an RDMA Write Message, containing payload1
immediately followed by an RDMA Write Message containing payload2 to
the same Remote Peer buffer location may result in the remote buffer
containing either payload1, payload2, or some combination of
payload1 and payload2. Thus a programming model that does multiple
RDMA Write WRs into the same Remote Peer buffer location without an
end-to-end synchronization mechanism is NOT RECOMMENDED.
1. An Incoming Remote Invalidate (the Invalidate portion of the
Send with Invalidate or Send with Solicited Event & Invalidate
operation) MUST be performed after the Send Message payload is
delivered to the appropriate Receive Queue Entry buffer, and
before the Associated RQ WR Completes.
Note: Send with Invalidate is usually used by Remote Peers to
invalidate STags that were enabled for remote access and
advertised to the Remote Peer. The expected usage is:
a. Local Peer Consumer creates a Send WR containing a command
to be remotely executed and an STag enabled for Remote
access and posts it to the Send Queue.
b. Remote Consumer gets the Send Message through a Completion
of an RQ buffer, and does one or more accesses to the STag's
buffers via RDMA Read Type WRs and/or RDMA Write WRs.
c. Remote Consumer creates a Send with Invalidate or Send with
SE and Invalidate WR with the status from the Consumer's
operation and the original STag to be invalidated as an
input modifier. Note that the Read Fence indicator would
most likely be set on the Send with Invalidate or Send with
SE and Invalidate WR if the remote buffer to be Invalidated
was accessed using an RDMA Read or RDMA Read with Invalidate
Local STag WR.
d. RI at Local Peer gets the Send with Invalidate or Send with
SE and Invalidate Message, places the data according to the
RQ WQE, Invalidates the STag, and creates a CQE on the
Receive Queue's Completion Queue, which also contains the
Invalidated STag as part of the CQE.
e. Local Consumer checks that the Invalidate STag output
modifier from the Work Completion is the same as was
originally sent (as a check on the remote Consumer). If it
was not, and the Consumer wishes to prevent remote access,
the Consumer should post an Invalidate Local STag WR for the
STag.
2. RDMA Read with Invalidate Local STag
The Invalidate portion of the RDMA Read with Invalidate Local
STag Work Request MUST be performed after the RDMA Read Response
Message is delivered to the Data Sink buffers, and before a Work
Completion is retrieved for the RDMA Read with Invalidate Local
STag WR. As with RDMA Read, subsequent operations MUST be
allowed to begin executing before the Invalidate takes place,
unless the subsequent operations have the Read Fence indicator
set.
3. Fast-Register
The RI MUST ensure that the Fast-Register operation takes effect
prior to the execution of any subsequent Work Requests.
4. Bind
The RI MUST ensure that the Bind Memory Window operation takes
effect prior to the execution of any subsequent Work Requests.
5. Invalidate Local STag
The Invalidate Local STag Work Request MUST take effect prior to
the execution of any subsequent Work Requests.
The RI MAY perform Fast-Register WRs, Bind WRs and Invalidate Local
STag WRs at any time between the posting of the Work Request and the
execution of a subsequent Work Request. Consequently, it is up to
the Consumer to ensure that the posting of the Invalidate Work
Request takes place after the STag is no longer in use.
SQ processing of Memory Management Operations (Fast-Register, Bind
and Invalidate Local STag) does not usually require the prior
operation to Complete before the current operation begins execution.
Thus it is possible to have an Invalidate Local STag operation be
applied to an RDMA Write WR Data Source buffer before the RDMA Write
Message payload has been completely sent. To ensure that this does
not occur, the Local Fence indicator may be set to require that all
prior operations Complete first (See Section 8.2.2.2).
Note that performing a Fast-Register on an already registered
region, or a Bind on a Window that is already Bound, will result in
a Completion Error. As such, it is up to the application to ensure
that the STag is in the Invalid state before the Fast-Register or
Bind Memory Window Work Request is posted.
The rules for Invalidate and Fast-Register or Bind Memory Window
above are based on the following usage model:
a. Allocate an STag (through either Allocate Non-Shared Memory
Region STag or Allocate Memory Window).
b. Fast-Register or Bind the STag
c. Use the STag in a manner compatible with its Access Rights.
d. Wait for the Completion of the operations using the STag.
This ensures that the STag and its related buffer are no
longer in use.
e. Invalidate the STag
f. Loop to (b) as long as the STag is still needed; otherwise,
Deallocate the STag.
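The usage model in steps (a) through (f) can be sketched as a toy
state machine. The STag class and reuse_cycle function are
hypothetical. The sketch assumes that a newly Allocated STag starts
in the Invalid state, and it does not model what an RI does when an
already Invalid STag is Invalidated again.

```python
from typing import Callable


class STag:
    """Toy model of an STag that is either "INVALID" or "VALID"."""

    def __init__(self) -> None:
        self.state = "INVALID"   # assumed state after Allocate (step a)

    def fast_register_or_bind(self) -> None:
        # Registering or Binding an STag that is already Valid fails.
        if self.state != "INVALID":
            raise RuntimeError("Completion Error: STag is not Invalid")
        self.state = "VALID"

    def invalidate(self) -> None:
        self.state = "INVALID"


def reuse_cycle(stag: STag, rounds: int, use: Callable[[STag], None]) -> None:
    """Run steps (b) through (e) of the usage model 'rounds' times."""
    for _ in range(rounds):
        stag.fast_register_or_bind()   # step b
        use(stag)                      # steps c and d: use, then wait
        stag.invalidate()              # step e
```

Skipping the Invalidate in step (e) makes the next Fast-Register or
Bind fail in this model, which mirrors the Completion Error described
above.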
8.2.2.2 Read Fence and Local Fence Indicators
Two types of fence indicators are defined in Verbs: a Read Fence
indicator for RDMA Write or Send Type WRs, and a Local Fence
indicator for Invalidate Local STag WRs. The Read Fence ensures that
the current WR does not execute until all prior RDMA Read Type WRs
Complete. The Local Fence indicator ensures that all prior
operations Complete before the Invalidate Local STag WR is executed.
Note that in the Verbs specification, a fence indicates that some
set of prior operations have completed before the current operation
begins. A different concept is operations that are required to
Complete before future operations in the SQ can be executed -
specifically Bind, Fast-Register, and Invalidate Local STag WR. By
default, these operations do not ensure prior operations have
completed before they execute. For Invalidate Local STag, if the
Local Fence indicator is set, it can ensure that all prior SQ
operations Complete before it executes.
Note that RDMAP does not provide any end-to-end acknowledgement
except for an RDMA Read Operation. Thus in general an end-to-end
fence is not possible without using an RDMA Read Operation, unless
an explicit ULP exchange of messages is done. Some operations are
local only operations - specifically PostSQ Invalidate Local STag,
Bind Memory Window and PostSQ Fast-Register. For combinations of
these operations and the local buffers which they operate on (the
Data Source for an RDMA Write and Send Type Operation, or the Data
Sink for an RDMA Read Operation), it is possible to ensure that a
current operation is not executed until prior operations which
operate on the referenced local buffer are Completed.
Figure 22 shows the fencing semantics when one operation is followed
by another: whether the second operation can be held until all prior
operations have Completed, until some prior operations have
Completed, or potentially until no prior operations have Completed.
The rows are the first operation, and the columns are the second
operation. The fields are defined as follows:
* NA-1 - A fence is not applicable. An Invalidate must precede
Bind or Fast-Register. Thus in terms of potential WRs in the SQ,
it is the Invalidate Local STag operation that must be fenced to
ensure proper operation.
* NA-2 - A fence is not applicable. This is because RDMAP allows
RDMA Write Message payloads and Send Type Message payloads to be
Placed out-of-order. Thus a local Completion of prior WRs does
not ensure the payload has been Placed at the Remote Peer.
* Not Needed - A fence is not needed, because RDMAP requires that
the RDMA Read Request Message at the Data Source (i.e. the
Remote Peer) must be executed in order. Note that RDMAP does not
ensure that operations which are sent after the RDMA Read
Request Message occur after the RDMA Read Type WR Completes.
Thus the need for the Read Fence Indicator for RDMA Write and
Send Type WRs.
* Yes, Full - If the Local Fence indicator is set on the
Invalidate Local STag WR, then the operation and subsequent
operations will not be executed until all prior operations
Complete. Note that this can effectively cause a pipeline stall
in transmission of RDMAP Messages, and should be used
judiciously.
* Yes, Partial - If the Read Fence indicator is set on the RDMA
Write or Send Type WR, then all prior RDMA Read Type WRs must
Complete before the current operation can begin execution.
PostSQ Work  Send      RDMA      RDMA        Bind  Fast-     Invalidate
Request      Type      Write     Read              Register
-----------  --------  --------  ----------  ----  --------  ----------
Send Type    NA-2      NA-2      Not Needed  NA-1  NA-1      Yes, Full
RDMA Write   NA-2      NA-2      Not Needed  NA-1  NA-1      Yes, Full
RDMA Read    Yes,      Yes,      Not Needed  NA-1  NA-1      Yes, Full
             Partial   Partial
Bind         NA-2      NA-2      Not Needed  NA-1  NA-1      Yes, Full
Fast-        NA-2      NA-2      Not Needed  NA-1  NA-1      Yes, Full
Register
Invalidate   NA-2      NA-2      Not Needed  NA-1  NA-1      Yes, Full
Figure 22 - Fencing on Prior Operations
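As an illustration, the contents of Figure 22 can be expressed as a
lookup. The function fence_semantics and the operation name strings
are hypothetical; the returned values are the field names defined for
the figure.

```python
OPERATIONS = (
    "Send Type", "RDMA Write", "RDMA Read",
    "Bind", "Fast-Register", "Invalidate",
)


def fence_semantics(first: str, second: str) -> str:
    """Return the Figure 22 entry for 'first' followed by 'second'."""
    if first not in OPERATIONS or second not in OPERATIONS:
        raise KeyError("unknown operation")
    if second == "Invalidate":
        return "Yes, Full"        # Local Fence on Invalidate Local STag
    if second in ("Bind", "Fast-Register"):
        return "NA-1"
    if second == "RDMA Read":
        return "Not Needed"
    # The second operation is a Send Type or an RDMA Write.
    return "Yes, Partial" if first == "RDMA Read" else "NA-2"
```

The only row that differs from the others is the one where the first
operation is an RDMA Read, which is the case the Read Fence indicator
addresses.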
The following paragraphs provide the rules which dictate the above
behavior.
Read Fence - set in RDMA Write or Send Type Work Requests to ensure
all prior RDMA Read Type WRs have been processed by the RI.
The RI MUST provide a Read Fence indicator for Send Type Work
Requests and RDMA Write Work Requests. This indicator MUST cause
the RI to pause before the execution of the Read Fenced Work
Request if all prior RDMA Read Type Work Requests are not
complete. Once all prior RDMA Read Type Work Requests are
complete the RI MUST resume SQ processing.
Local Fence - set in Invalidate Local STag Work Requests to ensure
all prior operations have been processed by the RI.
The RI MUST provide a Local Fence indicator for the Invalidate
Local STag Work Request. This Indicator MUST cause the RI to
wait until all prior Work Requests on the Send Queue Complete.
Once all prior WRs on the SQ complete, the RI MUST resume SQ
processing.
Note: This indicator may be used by the Consumer when there are
insufficient STags available to allow them to remain in use
until the Consumer can process the Completions for Work Requests
using those STags. For example, the following sequence could be
used:
a. Allocate an STag (either Allocate Non-Shared Memory Region
or Allocate Memory Window)
b. Fast-Register or Bind the STag
c. Use the STag in a manner compatible with its Access Rights.
d. Invalidate the STag using an Invalidate Local STag Work
Request with the Local Fence indicator set.
e. Loop to (b) as long as the STag is still needed; otherwise,
Deallocate the STag by invoking the Deallocate STag Verb.
Using this model, the application can reuse an STag multiple
times without having to wait for the prior Work Request to
Complete before posting the next Work Request. Using the Local
Fence indicator may require the RI to stall before processing
the Invalidate Local STag Work Request, reducing the rate of
Send Queue processing.
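The Read Fence and Local Fence rules above can be sketched as a
single stall decision made when a WR reaches the head of the Send
Queue. The PendingWR class, the must_stall function, and the kind
strings are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingWR:
    kind: str         # "SEND", "RDMA_WRITE", "RDMA_READ" or "INVALIDATE"
    completed: bool


def must_stall(
    prior: List[PendingWR],
    kind: str,
    read_fence: bool = False,
    local_fence: bool = False,
) -> bool:
    """True if the WR of the given kind must wait before executing."""
    if local_fence and kind == "INVALIDATE":
        # Local Fence: wait for every prior WR on the SQ to Complete.
        return any(not wr.completed for wr in prior)
    if read_fence and kind in ("SEND", "RDMA_WRITE"):
        # Read Fence: wait only for prior RDMA Read Type WRs.
        return any(wr.kind == "RDMA_READ" and not wr.completed for wr in prior)
    return False
```

The difference in scope is visible here: a Read Fence ignores
incomplete Sends and RDMA Writes, while a Local Fence waits on them
as well.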
Implementation of an end-to-end fence - using an RDMA Write WR
followed by an RDMA Read Type WR.
An end-to-end fence ensures that all outstanding operations have
been flushed from the network fabric prior to the next operation
executing. [RDMAP] enables an application to use an RDMA Read
Operation to ensure that all RDMA Write Operations and Send Type
Operations prior to the RDMA Read Operation on the same RDMAP
Stream have made it to remote memory and can be read back by any
other RDMAP Stream connecting through the same remote RNIC with
access to the remote memory. The RDMA Read Operation need not be
to any of the data written, and can even be a zero length RDMA
Read Operation (which does not even require a valid Data Source
STag) to have this effect. This enables the Consumer to
implement an end-to-end fence by waiting for a RDMA Read WR
Completion to determine that data is up to date at the Remote
Peer.
If the requirement, for example, is to ensure, from the Data
Source, that one RDMA Write Message has been Placed at the
Remote Peer before another RDMA Write Message occurs, the
following sequence can be used by the Consumer:
a. Perform one (or more) RDMA Write WR(s).
b. Perform an RDMA Read Type WR (zero length is acceptable)
c. Perform a second RDMA Write WR with the Read Fence indicator
enabled on the Work Request.
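Steps (a) through (c) can be sketched as the construction of a Work
Request List. The dictionary layout and the function name below are
hypothetical; an actual Consumer would pass the equivalent input
modifiers to PostSQ.

```python
from typing import Dict, List


def end_to_end_fenced_writes(first_payloads: List[bytes],
                             second_payload: bytes) -> List[Dict]:
    """Build the WR sequence: write(s), zero-length read, fenced write."""
    wrs = [
        {"op": "RDMA_WRITE", "payload": p, "read_fence": False}
        for p in first_payloads                      # step a
    ]
    # Step b: a zero-length RDMA Read is sufficient for the fence.
    wrs.append({"op": "RDMA_READ", "length": 0, "read_fence": False})
    # Step c: the Read Fence holds this write until the read Completes.
    wrs.append({"op": "RDMA_WRITE", "payload": second_payload,
                "read_fence": True})
    return wrs
```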
8.2.3 Completion Processing
A CQE is an internal representation of the Work Completion. The
results from a Work Request operation are placed in a Completion
Queue Entry (CQE) on the CQ associated with the Work Queue when the
request has completed. A CQE MUST be generated for each WQE that
results in a Work Completion.
8.2.4 Returning Completed Work Requests
All Work Completions are abstracted through the Verbs. The only
method of retrieving a Work Completion MUST be through the Poll for
Completion Verb. The RI MUST enable the Consumer to
retrieve WCs resulting from WRs posted to QPs which are in any valid
QP state. Note that a destroyed QP is not in a valid QP state. See
Section 6.1.4.
A Work Request is confirmed Complete when the associated Work
Completion is retrieved from its CQ. The RI MUST NOT return a Work
Completion for an Unsignaled Work Request that completed
successfully. When the RI returns a single WC through Poll for
Completion, it MUST free at least one CQE. Note that more than one
CQE may be freed due to Unsignaled Completions. See Section 8.1.3.1,
Signaled Completions, for the rules on determining when Unsignaled
Work Requests have Completed.
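A Consumer might track confirmation of Unsignaled Work Requests as
sketched below. The SendQueueTracker class is hypothetical. The
sketch assumes that every retrieved Work Completion reports success
and that each WR is ordered after the WRs posted before it (which,
per Section 8.1.3.1, may require the Local Fence indicator for some
Work Requests).

```python
from collections import deque
from typing import List


class SendQueueTracker:
    """Consumer-side record of WRs posted to one Send Queue, in order."""

    def __init__(self) -> None:
        self.outstanding = deque()   # wr_ids in posting order

    def post(self, wr_id: int) -> None:
        self.outstanding.append(wr_id)

    def on_work_completion(self, wr_id: int) -> List[int]:
        """Return the wr_ids confirmed Complete by retrieving this WC."""
        if wr_id not in self.outstanding:
            raise ValueError("Work Completion for an unknown WR")
        confirmed = []
        while True:
            head = self.outstanding.popleft()
            confirmed.append(head)   # earlier Unsignaled WRs are confirmed
            if head == wr_id:
                return confirmed
```

Retrieving the Work Completion for the third of four posted WRs
confirms the first three and leaves the fourth outstanding.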
When a Work Request has Completed, any Scatter/Gather Elements or
other information associated with the original WR are no longer in
the domain of the RI. The RI MUST NOT access any memory locations
referenced by the Scatter/Gather Elements, Local Address or Remote
Address for a WR that has Completed. The RI MUST provide Work
Completions through the Poll for Completion Verb no more than once
per Work Request. Note that if Destroy QP is invoked with Work
Requests pending, the Work Completion may be lost.
The Work Completion contents are specified in 9.3.2.1 - Poll for
Completion.
A Consumer is able to find out if a Work Completion is available by
polling or notification.
Work Completions MUST be returned when the Consumer polls the CQ in
the following cases:
* On Completion of a Work Request submitted to a Send Queue with a
Signaled Completion.
* On Completion of a Work Request submitted to a Send Queue that
completed in error.
* On Completion of a Work Request submitted to a Receive Queue.
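The return rules above can be illustrated with a toy model. This is
a minimal sketch, not an RI implementation: the CQE fields and the
poll_for_completion function are hypothetical names, and it assumes
that a successful Unsignaled Send Queue entry occupies a CQE which
is freed silently during a poll.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CQE:
    wr_id: int
    queue: str      # "SQ" or "RQ" (hypothetical field)
    signaled: bool  # only meaningful for SQ Work Requests
    ok: bool        # False if the Work Request completed in error


def poll_for_completion(cq):
    """Return the next Work Completion, or None if the CQ is empty.

    Successful Unsignaled SQ entries are freed but never returned.
    """
    while cq:
        cqe = cq.popleft()  # frees one CQE
        if cqe.queue == "RQ" or cqe.signaled or not cqe.ok:
            return cqe
    return None


cq = deque([
    CQE(1, "SQ", signaled=False, ok=True),   # freed, not returned
    CQE(2, "SQ", signaled=True, ok=True),    # returned: Signaled
    CQE(3, "SQ", signaled=False, ok=False),  # returned: completed in error
    CQE(4, "RQ", signaled=False, ok=True),   # returned: Receive Queue WR
])
returned = []
while (wc := poll_for_completion(cq)) is not None:
    returned.append(wc.wr_id)
print(returned)
```

In this model a single poll can free more than one CQE, which matches
the note that Unsignaled Completions may free additional entries.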
When the Consumer desires to know if a QP has had all of its WRs
retrieved and the Work Queues are empty, but there may be only
Unsignaled Work Requests on the Send Queue, the Consumer can
transition the QP to the Error state (See Section 6.2.4) and then to
the Idle state. This will guarantee that all WRs have been
Completed. In order to ensure that the WQEs have been freed and the
entries on the CQ have been made available, the Consumer should free
any associated CQEs, if any are consumed. There are three methods
for a Consumer to free the CQEs consumed within the CQ:
* the Consumer polls the CQ (See Section 9.3.2.1 - Poll for
Completion (Poll CQ)) until the CQ is empty, or
* the Consumer retrieves a WC for a WR submitted to a Work Queue
associated with the same CQ where the former WR was submitted
and the new WR was submitted after the previous QP was
destroyed, or
* the Consumer polls (See Section 9.3.2.1 - Poll for Completion
(Poll CQ)) a number of Work Completions equal to the total number
of entries that the CQ can hold.
8.2.5 Asynchronous Completion Notification
A Consumer of a CQ may request asynchronous notification of when
CQEs have been added to a Completion Queue by invoking the Request
Completion Notification Verb. The Verbs architecture assumes a
Privileged Mode intermediary will process Asynchronous CQ Events for
CQs. The Verbs architecture allows this intermediary to register one
or more CQ Event Handlers for Asynchronous CQ Events by invoking the
Set Completion Event Handler Verb. It is the responsibility of this
intermediary to create the asynchronous completion notification to
the Consumer that called the Request Completion Notification Verb.
A Completion Event Handler Identifier delineates each Completion
Event Handler. The Set Completion Event Handler is invoked once per
supported Completion Event Handler. Note that the maximum number of
supported Completion Event Handlers is returned by Query RNIC.
Each Set Completion Event Handler invocation can be used to:
* Return a Completion Event Handler Identifier that is used as an
input modifier to Create CQ (to associate a CQ with a Completion
Event Handler).
* Clear a Completion Event Handler associated with the Completion
Event Handler Identifier.
* Modify the address of the Completion Event Handler for the
Completion Event Handler Identifier.
The RI is NOT REQUIRED to disassociate CQs from CQ Event Handlers
when those CQ Event Handlers associated with the Completion Event
Handler Identifiers are cleared. If a CQ Event Handler is cleared
and the Consumer still has CQs associated with that CQ Event Handler
(through the CQ Event Handler Identifier), and a Completion occurs
which would have invoked the CQ Event Handler, behavior of the RI is
indeterminate. The Consumer should keep this in mind before clearing
the association to prevent indeterminate behavior, such as possible
race conditions.
The Request Completion Notification Verb is set on a per CQ basis.
When armed, the RI MUST generate at most one notification until the
notification has been rearmed by invoking Request Completion
Notification Verb. Once Completion Notifications have been enabled,
additional Request Completion Notification calls have no effect. The
Completion Event Handler will be called only once when the next CQE
is added to the CQ. The RI MUST invoke the Completion Event Handler
associated with the CQ Event Handler Identifier which is associated
with the CQ where the CQE was added. Once the Completion Event
Handler routine has been invoked, the Consumer should call Request
Completion Notification again to be notified when a new entry is
added to the CQ, since the notification is a "one shot" mechanism.
Existing CQEs on the CQ at the time the notification is enabled do
not result in a call to the Completion Event Handler. The Completion
Event Handler MUST be called when the next CQE is added to the CQ
after the Request Completion Notification has been set.
The RI MUST provide the ability for the Consumer to specify whether
the Completion Event Handler is invoked for either:
* the next Solicited Completion Event only, or
* the next Completion Event.
If the local Consumer requests the next solicited Completion in the
Request Completion Notification Verb, the RI MUST generate a
Completion Event when:
* an incoming Send with Solicited Event or Send with SE and
Invalidate successfully causes a Receive Queue's WQE to be
consumed, and thus a CQE to be added to a CQ, or
* a Work Completion for a Work Request which Completed in error is
added to a CQ.
If the Consumer requested an event for the next completion in the
Request Completion Notification Verb, the RI MUST generate a
Completion Event when any incoming Send operation type or Signaled
Local SQ WR completes.
If multiple calls to Request Completion Notification have been made
for the same CQ and at least one of the requests set the type to the
next Work Completion, the RI MUST invoke the CQ event handler when
the next CQE is added to that CQ. The CQ Event Handler MUST be
called only once, even if multiple CQ notification requests were
made prior to the Completion Event for the specified CQ.
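The one-shot arming rules above can be sketched as a small state
machine. This is an illustrative model only: NotifyState and its
method names are hypothetical, and the "solicited" and "error" flags
stand in for the conditions listed for the next Solicited Completion
Event.

```python
class NotifyState:
    """Toy model of per-CQ Completion Notification arming."""

    def __init__(self, handler):
        self.handler = handler
        self.armed = None  # None, "solicited", or "next"

    def request_notification(self, kind):
        if kind not in ("solicited", "next"):
            raise ValueError(kind)
        # A "next" request is never downgraded by a later
        # "solicited" request made before the event fires.
        if kind == "next" or self.armed is None:
            self.armed = kind

    def on_cqe_added(self, cq_handle, solicited=False, error=False):
        if self.armed is None:
            return
        if self.armed == "next" or solicited or error:
            self.armed = None         # one shot: disarm first
            self.handler(cq_handle)   # handler receives the CQ handle


calls = []
state = NotifyState(calls.append)
state.request_notification("solicited")
state.on_cqe_added(7)                 # not solicited: no event
state.request_notification("next")
state.request_notification("next")    # repeated request, still one event
state.on_cqe_added(7)                 # fires once
state.on_cqe_added(7)                 # not rearmed: no event
print(calls)
```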
The RI MUST ensure that the following sequence of events will not
result in a Completion Notification being missed. Therefore, the
following sequence of calls should be used by the Consumer when
using Request Completion Notification in order to ensure that a new
CQE is not missed for the specified CQ:
* Call Poll for Completion to dequeue all existing CQ entries.
* Call Request Completion Notification.
* Call Poll for Completion to dequeue all of the CQ entries that
were added between the time the last Poll for Completion was
called and the notification was enabled.
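The three-step sequence above can be sketched as follows. The poll
and request_notification callables are hypothetical stand-ins for
the Poll for Completion and Request Completion Notification Verbs;
the fake CQ below simulates a CQE that arrives just before the
notification is armed.

```python
from collections import deque


def drain(poll):
    """Call poll() until it reports an empty CQ; return the WCs found."""
    found = []
    while (wc := poll()) is not None:
        found.append(wc)
    return found


def drain_and_arm(poll, request_notification):
    wcs = drain(poll)        # step 1: dequeue all existing entries
    request_notification()   # step 2: arm the notification
    wcs += drain(poll)       # step 3: catch entries added in between
    return wcs


cq = deque(["wc-a", "wc-b"])
armed = []


def poll():
    return cq.popleft() if cq else None


def arm():
    cq.append("wc-late")     # simulated CQE landing before arming
    armed.append(True)


result = drain_and_arm(poll, arm)
print(result)
```

Without the third step, the late entry would sit on the CQ with no
event pending, since existing CQEs do not trigger the handler.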
When the Completion Event Handler is invoked, the RI MUST supply the
CQ handle of the CQ which generated the Completion notification.
The Consumer is responsible for polling the CQ to retrieve the Work
Completion. This function MUST NOT be performed automatically by the
RI when the notification occurs.
For details on the Asynchronous Completion Verbs, refer to Section
9.4.1 - Set Completion Event Handler and Section 9.3.2.2 - Request
Completion Notification.
8.3 Error Handling
The following section details many of the errors that can occur when
using the RNIC, and the responsibilities of the RNIC and the
Consumer.
Errors are returned to the Consumer by one of three mechanisms:
Immediate Errors, Work Completions, or Asynchronous Error Events.
Immediate Errors are returned immediately as an Output Modifier of a
Verb. Work Completions are used when the error can be related
directly to the Work Request in progress. Asynchronous Error Events
are used when the error can only be localized to the QP, CQ or RNIC
but is not directly attributable to any single Work Request. Each
of these errors is described below.
8.3.1 Immediate Errors
Immediate Errors are those surfaced as Verb results provided to the
Consumer via Output Modifiers. The individual Immediate Errors are
documented within each Verb in Section 9 - RNIC Verbs. A summary of
all of the Immediate Errors is provided in Section 9.5.1 - Immediate
Status Codes.
When the RI returns an Immediate Error, the RI MUST NOT affect the
RI Resource that is the subject of the verb for which the Immediate
Error is being returned, except for RI-Reregister Non-Shared Memory
Region (which has slightly different rules). That is, for an
Immediate Error returned on any verb that has the:
- RI as the subject, the RI remains unchanged;
- CQ as the subject, the CQ remains unchanged;
- QP as the subject, the QP remains unchanged;
- S-RQ as the subject, the S-RQ remains unchanged;
- STag as the subject, the STag remains unchanged (except certain
rules for RI-ReRegister Memory Region);
- PD as the subject, the PD remains unchanged;
- Asynchronous Event handling as the subject, Asynchronous Events
must not be lost.
8.3.2 Work Completion Errors
The following errors can be associated with a specific Work Request.
The RI MUST return a Completion Error via a Work Completion on the
Completion Queue associated with the Send or Receive Queue on which
the Work Request was posted for the errors defined in Figure 23. The
Work Completion's Completion Status field contains the Error
information. In each case, the QP MUST be moved to the Terminate
state and a Terminate Message is sent with the indicated Terminate
code (see Section 6.6.2.5 - Local Termination, Local Abortive
Teardown and Remote Abortive Teardown). On any Work Completion that
includes the sending of a Terminate Message, the Terminate Message
Buffer MUST be available for examination while the QP is in the
Terminate state or Error state using Query QP. The Terminate Message
may contain useful diagnostic information, depending on the error.
For information on the format of the Terminate Message, see [RDMAP].
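As an aid to reading the Terminate codes in the figures, the sketch
below splits a 16-bit code into three fields. It assumes the values
are packed as Layer (4 bits), Error Type (4 bits) and Error Code
(8 bits), following the Terminate control field described in
[RDMAP]; this document does not state the packing, so treat both the
layout and the layer names as assumptions. The function name is
hypothetical.

```python
# Assumed layer numbering, taken from the RDMAP Terminate format.
LAYERS = {0x0: "RDMAP", 0x1: "DDP", 0x2: "LLP"}


def split_terminate_code(code):
    """Split a 16-bit Terminate code into (layer, etype, error)."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("Terminate code must fit in 16 bits")
    layer = (code >> 12) & 0xF   # top 4 bits
    etype = (code >> 8) & 0xF    # next 4 bits
    error = code & 0xFF          # low 8 bits
    return layer, etype, error


layer, etype, error = split_terminate_code(0x1206)
print(LAYERS.get(layer, "unknown"), etype, error)
```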
Error                               Terminate Action
                                    Code
------------------------------------------------------------------------
Receive Queue Work Request Errors - These errors are probably due to a
local Consumer error.
Invalid WQE format,                 0x0000    The RI Terminates the LLP
Invalid STag in SGE,                          Stream with Local
Base and bounds violation                     Catastrophic Error and the
(including length errors),                    QP transitions to the
Access Rights violation,                      Terminate state.
Invalid PD ID,
Wrap error (TO & Segment Length
caused an address to wrap).
------------------------------------------------------------------------
Receive Queue Remote Protection Errors - These errors may be due to a
Consumer error at either end.
Invalidate STag Invalid.            0x0100    The RI Terminates the LLP
                                              Stream with the indicated
Invalidate STag Access Rights.      0x0102    Error and the QP
                                              transitions to the
Invalidate STag Invalid PD ID.      0x0103    Terminate state.
or STag not Bound to QP.
Invalidate MR STag had Bound MW.    0x0109
------------------------------------------------------------------------
Send Queue Work Request Errors - These errors are probably due to a
local Consumer error.
Invalid WQE format,                 0x0000    The RI Terminates the LLP
Zero ORD.                                     Stream with Local
                                              Catastrophic Error and the
                                              QP transitions to the
                                              Terminate state.
Local SQ Protection Errors -        0x0000    The RI Terminates the LLP
Send Types, RDMA Writes, and                  Stream with Local
RDMA Read Types:                              Catastrophic Error and the
Invalid STag,                                 QP transitions to the
Base and bounds violation                     Terminate state.
(including length errors),
Access Rights violation,
Invalid PD ID,
Wrap error.
SQ Fast-Register errors:            0x0000    The RI Terminates the LLP
QP not in Privileged Mode,                    Stream with Local
Invalid Region STag,                          Catastrophic Error and the
Invalid Physical Buffer Size,                 QP transitions to the
Physical Buffer List too long,                Terminate state.
STag not in Invalid state,
Invalid PD ID,
Invalid Access Rights Specified,
Invalid Virtual Address,
Invalid FBO,
Invalid Length
SQ Bind errors:                     0x0000    The RI Terminates the LLP
Invalid Region STag                           Stream with Local
Invalid Window STag                           Catastrophic Error and the
Base and bounds violation                     QP transitions to the
Access Rights violation                       Terminate state.
STag not in Invalid state
MR not in Valid state
Invalid PD ID
SQ Invalidate errors                0x0000    The RI Terminates the LLP
(Footnote 6):                                 Stream with Local
Invalid STag                                  Catastrophic Error and the
Invalid PD ID (or QP ID)                      QP transitions to the
Invalidate MR STag had Bound MW               Terminate state.
------------------------------------------------------------------------
Figure 23 - Completion Errors with Resulting Terminate Codes
Footnote 6: This includes RDMA Read and Invalidate.
8.3.3 Asynchronous Errors
The Consumer may register an Asynchronous Event Handler to be called
when an Asynchronous Event occurs which is not associated with an
individual CQE by using the Set Asynchronous Event Handler Verb.
An input modifier to the Set Asynchronous Event Handler Verb is the
address of the event handler routine. This is a Consumer routine
that is invoked when an Asynchronous Event is generated. When the
handler routine is invoked, an indication of the origin of the
error, called an Event Record, is provided.
The errors defined in Figure 24 are returned to the Consumer via an
Event Record in the Asynchronous Event Handler.
There is only one Asynchronous Event Handler per RNIC. If Set
Asynchronous Event Handler Verb is called more than once, the new
handler MUST replace the previous handler. The RI MUST turn off
Asynchronous Event Notification if the Asynchronous Event Handler's
address is zero.
After the Asynchronous Event Handler is registered, all subsequent
asynchronous events not associated with a CQE MUST result in a call
to the handler. Until an Asynchronous Event Handler is registered,
asynchronous events will be lost.
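The registration rules above can be modeled in a few lines. This is
a sketch under stated assumptions: the AsyncEvents class is
hypothetical, None stands in for a zero handler address, and the
"lost" counter exists only to make dropped events visible in the
example.

```python
class AsyncEvents:
    """Toy model of the single per-RNIC Asynchronous Event Handler."""

    def __init__(self):
        self.handler = None
        self.lost = 0  # illustration only; a real RI keeps no such count

    def set_async_event_handler(self, handler):
        # A new handler replaces the previous one; None (standing in
        # for a zero address) turns notification off.
        self.handler = handler

    def raise_event(self, record):
        if self.handler is None:
            self.lost += 1        # no handler registered: event is lost
        else:
            self.handler(record)  # handler receives the Event Record


seen = []
ev = AsyncEvents()
ev.raise_event("early")  # lost: nothing registered yet
ev.set_async_event_handler(lambda rec: seen.append(("first", rec)))
ev.set_async_event_handler(lambda rec: seen.append(("second", rec)))
ev.raise_event("qp-error")
ev.set_async_event_handler(None)
ev.raise_event("late")   # lost: notification turned off
print(seen, ev.lost)
```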
For more information, see Section 9.4.2 - Set Asynchronous Event
Handler and Section 9.5.3 - Asynchronous Event Identifiers.
The following table covers the errors that can be associated with a
QP, thus the Event Record should include the QP ID when the error is
associated with a specific QP. On any Asynchronous Error Event that
includes the reception or sending of a Terminate Message, the
Terminate Message Buffer is available for examination while the QP
is in the Terminate or Error state by retrieving it through Query
QP. Note that Terminate Messages generated locally as well as
Terminate Messages received from the Associated QP are available
through Query QP. The Terminate Message may contain useful
diagnostic information, depending on the error. For information on
the format of the Terminate Message, see [RDMAP].
Error                               Terminate Action
                                    Code
------------------------------------------------------------------------
Remotely detected Errors
"Terminate Message Received"        None      QP -> Terminate state. See
An incoming Terminate Message has             6.6.2.4 Remote Termination
arrived.
------------------------------------------------------------------------
LLP Errors - Errors on incoming RDMAP Segments or Messages probably due
to the Remote Peer or fabric corruption.
"LLP Connection Lost" -             None      QP -> Error state. See
Usually caused by Timeout or Too              6.6.2.4 Remote Termination
many Retries at the LLP.
"LLP Connection Reset" -            None      QP -> Error state. See
Caused by an incoming Reset at the            6.6.2.4 Remote Termination
LLP.
"LLP Integrity Error: Segment size  0x1000    If this cannot be
invalid" -                                    corrected by the LLP (drop
The incoming segment is too small             and retry etc.), then
to contain a valid RDMAP header,              QP -> Terminate state.
or larger than supported by this              The RI Terminates the LLP
implementation.                               Stream with the indicated
                                              error. See 6.6.2.5.
"LLP Integrity Error: Invalid       0x0202
CRC" -
The incoming segment had a bad LLP
CRC.
"Bad FPDU" -                        0x0203
The incoming segment's received MPA
marker and 'Length' fields do not
agree on the start of an FPDU.
------------------------------------------------------------------------
Remote Operation Errors - Protocol Errors on incoming RDMAP Segments or
Messages probably due to the Remote Peer.
Invalid DDP version                 0x1206    QP -> Terminate state. The
                                              RI Terminates the LLP
Invalid RDMA version                0x0205    Stream with the indicated
                                              error. See 6.6.2.5.
Unexpected Opcode                   0x0206
Invalid DDP Queue Number            0x1201
Invalid RDMA Read Request - RDMA    0x1201
Read not enabled
No 'L' bit when expected            0x0207
------------------------------------------------------------------------
Remote Protection Errors (not associated with the RQ) - Protection
Errors on incoming DDP Segments or RDMAP Messages that are not RDMA
Read Request Messages, probably due to the Remote Peer's Consumer.
Invalid STag                        0x1100    QP -> Terminate state. The
                                              RI Terminates the LLP
Base and bounds violation           0x1101    Stream with the indicated
                                              error. See 6.6.2.5.
Access Rights violation             0x1102
Invalid PD ID                       0x1102
Wrap error - TO and segment length  0x1103
caused an address wrap past
0xFFFFFFFFFFFFFFFF
------------------------------------------------------------------------
Remote Closing Error - Probably due to Consumer not properly
synchronizing the ULP close operation.
Bad Close - QP in Closing state     None      QP -> Error state.
and: Segment arrives, at least one
SQ WQE on the SQ, or RDMA in
progress.
Bad LLP Close - LLP Close received  0x0207    QP -> Terminate state. The
AND (the Send Queue was NOT empty             RI Terminates LLP Stream
OR the IRRQ was NOT empty)                    with indicated error. See
(Footnote 7)                                  6.6.2.5.
------------------------------------------------------------------------
Remote Protection Errors associated with the Receive Queue - Protection
Errors on incoming RDMAP Segments or Messages probably due to the
Remote Peer's Consumer.
Invalid MSN - MSN range not valid   0x1202    QP -> Terminate state. The
                                              RI Terminates LLP Stream
                                              with indicated error. See
                                              6.6.2.5.
Invalid MSN - gap in MSN            0x1202    QP -> Terminate state. The
                                              RI Terminates LLP Stream
                                              with indicated error. See
                                              6.6.2.5.
------------------------------------------------------------------------
IRRQ Protection Errors - Error processing an incoming RDMA Read Request
and generating the outgoing RDMA Read Response.
Invalid STag                        0x0100    QP -> Terminate state. The
                                              RI Terminates the LLP
Base and bounds violation (includes 0x0101    Stream with the indicated
RDMA Read Request larger than                 error. See 6.6.2.5.
supported by the Data Source STag)
Access Rights violation             0x0102
Invalid PD ID                       0x0103
Wrap error - TO and length caused   0x0104
an address wrap past
0xFFFFFFFFFFFFFFFF
Invalid MSN - too many RDMA Read    0x1203
Request Messages in process
Invalid MSN - gap in MSN (RDMA      0x1203
Messages found missing when LLP
claims a Message is delivered.)
Invalid MSN - MSN range is not      0x1203
valid (MSN is unreasonably beyond
the end of the queue.)
------------------------------------------------------------------------
Local Errors
CQ/SQ error - An error occurred on  0x0207    QP -> Terminate state. The
the CQ during a SQ completion.                CQ number itself must be
CQ Overflow error                             determined by using Query
CQ Operation error                            QP. The RI Terminates the
                                              LLP Stream with the
CQ/RQ error - An error occurred on  0x0207    indicated error. See
the CQ during a RQ completion.                6.6.2.5.
CQ Overflow error
CQ Operation error
S-RQ error on a QP - An error       0x0207    QP -> Terminate state. The
occurred while attempting to pull             S-RQ can be determined by
a WQE from the S-RQ associated                using Query QP. The RI
with the QP.                                  Terminates the LLP Stream
                                              with the indicated error.
                                              See 6.6.2.5.
Local QP Catastrophic Error - An    0x0207    The RI will attempt to
error related to the QP occurred              move the QP to the Error
while processing (probably a                  state. The QP is most
problem with the RNIC).                       likely unusable and should
                                              be destroyed.
------------------------------------------------------------------------
Figure 24 - Affiliated Asynchronous Errors with Terminate Codes
Footnote 7: For TCP this would be a 1/2 close and a Terminate
Message could be sent. For SCTP, no Terminate Message is sent.
Figure 25 indicates errors that cannot be associated with a QP; the
Asynchronous Event Record MUST contain the additional information as
indicated in the table.
Error                             Terminate Action
                                  Code
------------------------------------------------------------------------
Locally detected Catastrophic Errors
CQ Operation Error - An error     None      The Asynchronous Event
occurred on the CQ unrelated to             Record includes the CQ
a specific QP completion.                   handle. All completions on
                                            the CQ are in an undefined
                                            state. It may be necessary
                                            to destroy any QPs
                                            targeting the CQ and
                                            destroy the CQ.
Shared Receive Queue              None      The Asynchronous Event
Catastrophic Failure - A problem            Record includes the S-RQ
occurred with the RNIC or its               handle. All WRs on the S-RQ
driver that renders the RNIC                are in an undefined state.
unable to use the S-RQ.                     It may be necessary to
                                            destroy any QPs using the
                                            S-RQ and destroy the S-RQ.
RNIC Catastrophic failure - A     0x0208    The Asynchronous Event
problem occurred with the RNIC              Record does not include any
or its driver that renders the              additional information. If
RNIC unable to reliably                     possible, the RI Terminates
function. All RNIC/QP/CQ state              all LLP Connections with
is indeterminate. The only                  Global Catastrophic Error.
recovery is to close the RNIC               See 6.6.2.5.
(and reopen it if desired).
------------------------------------------------------------------------
Figure 25 - Unaffiliated Asynchronous Errors with Terminate Codes
9 RNIC Verbs
The Verbs described in this chapter provide an abstract definition
of the functionality provided to a host by a RI. Host RIs that are
compliant with this specification MUST exhibit the semantic behavior
described by the Verbs.
Since the Verbs define the behavior of the host RI, they may
influence the design of software constructs, such as application
programming interfaces (APIs), which provide access to the host RI.
However, this specification explicitly does not define any such API.
In particular, there is no requirement that an API used with a
compliant host RI be semantically identical to, or expose the
semantics of, the Verbs. For example, whether the input modifiers
referenced in the Verbs are pass-by-reference or pass-by-value is
outside the scope of this specification.
It is OPTIONAL for an RI to implement Block Lists. It is OPTIONAL
for an RI to implement S-RQs. Support for S-RQs can be discovered
using Query RNIC. Support for Block Lists can be discovered by
attempting to open the RNIC in Block Mode. If the Verb fails with
the error "Block List Not Supported", the RNIC does not support
Block Mode.
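The discovery procedure for Block Lists can be sketched as a probe
with a fallback. The open_rnic function, the mode strings and the
exception type below are hypothetical stand-ins for the Open RNIC
Verb and its "Block List Not Supported" result; falling back to
Page List mode is a design choice of this example, not a
requirement stated here.

```python
class BlockListNotSupported(Exception):
    """Stands in for the 'Block List Not Supported' Verb result."""


def open_rnic(name, mode, supports_block=False):
    # Hypothetical stand-in for the Open RNIC Verb.
    if mode == "block" and not supports_block:
        raise BlockListNotSupported(name)
    return {"name": name, "mode": mode}


def open_preferring_block_mode(name, open_fn=open_rnic):
    """Try Block List mode first; fall back to Page List mode."""
    try:
        return open_fn(name, "block")
    except BlockListNotSupported:
        return open_fn(name, "page")


handle = open_preferring_block_mode("rnic0")
print(handle["mode"])
```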
The RI MUST use the values and information provided in the Input
Modifiers when processing the requests and operations instantiated
in the Verbs for mandatory features. The RI MUST use the values and
information provided in the Input Modifiers when processing the
requests and operations instantiated in the Verbs for optional
features if the RI supports that optional feature.
9.1 Consumer Accessibility
Verb Consumers are the direct users of the Verbs, and are
subdivided into two classes, Privileged and Non-Privileged.
Privileged Consumers are typically those Consumers that operate at a
privilege level sufficient to access OS internal data structures
directly, and have the responsibility to control access to the RNIC
Interface. All Verbs are available for use by Privileged Consumers.
Non-Privileged Mode Consumers are those Consumers that must rely on
another agent, having a sufficiently high level of privilege, to
manipulate OS data structures. Only those Verbs specifically labeled
as such are available to be used by Non-Privileged Mode Consumers.
Conceptually, the intent is that Non-Privileged Mode Consumers are
not allowed to manipulate RI resources that could affect a QP in a
different Protection Domain. Any manipulation of resources that can
affect another Protection Domain, such as registering physical
memory, is assumed to be done by a trusted intermediary, or
Privileged Consumer.
The Protection Domain provides a mechanism to detect when a Consumer
is posting WRs to QPs with which it is not associated. The RI also
usually provides a mechanism to help prevent posting WRs to QPs not
directly owned by the Consumer (e.g. a multi-Consumer application
which shares the same PD). But it may still be possible to post a WR
to a QP that is not owned by the Consumer in some environments.
Preventing access to memory structures such as QPs not directly
created by that Consumer can be partially provided by the Local
Host's operating environment through the use of the virtual memory
subsystem and mapping of RNIC resources. Since this is
implementation and environment dependent, the mechanism describing
it is outside the scope of the architecture.
All Verbs can be accessed by Privileged Mode Consumers. To maintain
the access control over RI resources, the host environment MUST
provide Non-Privileged Mode Consumers with direct access to only the
following Verbs:
* PostSQ
* PostRQ
* Poll for Completion
* Request Completion Notification
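The access rule above amounts to a simple allow-list check. The
sketch below is illustrative only; the may_invoke function and the
use of Verb names as strings are assumptions of this example, and a
real host environment would enforce the rule through its own
privilege mechanisms.

```python
# Verbs that Non-Privileged Mode Consumers may access directly.
NON_PRIVILEGED_VERBS = frozenset({
    "PostSQ",
    "PostRQ",
    "Poll for Completion",
    "Request Completion Notification",
})


def may_invoke(verb, privileged):
    """Return True if a Consumer of the given class may call the Verb."""
    return privileged or verb in NON_PRIVILEGED_VERBS


print(may_invoke("PostSQ", privileged=False))
print(may_invoke("Create CQ", privileged=False))
```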
9.2 RNIC Resource Management
9.2.1 RNIC
9.2.1.1 Open RNIC
Description:
Opens the specified RNIC.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 5.1.2 - Opening an RNIC.
Input Modifiers:
* The unique identifier for this RNIC. The naming scheme is
implementation dependent.
* The Physical Block List mode of the RNIC. This MUST either be
Block List mode or Page List mode. Block List mode is only valid
if the RNIC supports it.
Output Modifiers:
* If the operation completed successfully:
o RNIC Handle.
* Verb Results:
o Operation completed successfully.
o Insufficient resources to complete request.
o Invalid Modifier (RNIC name).
o Block List mode not supported.
o RNIC in use.
9.2.1.2 Query RNIC
Description:
Returns the attributes for the specified RNIC.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 5.1.3 - Query RNIC.
Input Modifiers:
* RNIC handle.
Output Modifiers:
* RNIC Attributes & Values, if the operation completed
successfully:
o Vendor specific information. This could, but is not required
to, include information such as a vendor identifier, part
number and/or hardware version.
o The maximum number of QPs supported by this RNIC.
o The maximum number of outstanding Work Requests on any Send
Queue or Receive Queue supported by this RNIC.
o The maximum number of outstanding Work Requests on any S-RQ
supported by this RNIC. If S-RQs are not supported by this
RNIC, this number is zero.
o The maximum number of Scatter/Gather Elements per Send
Operation Type Work Request supported by this RNIC. This
value also applies to the maximum number of Scatter/Gather
Elements for WRs posted to Receive Queues as well as those
posted to Shared-Receive Queues.
o The maximum number of Scatter/Gather Elements per RDMA Write
Work Request supported by this RNIC.
o The maximum number of CQs supported by this RNIC.
o The maximum number of entries in each CQ supported by this
RNIC.
o The maximum number of CQ Event Handlers supported by this
RNIC.
o The maximum number of Memory Regions supported by this RNIC.
o The maximum number of Physical Buffer Entries per Physical
Buffer List.
o The maximum number of Protection Domains supported by this
RNIC.
o The maximum number of inbound RDMA Read Request Messages
that can be in the IRRQ per RNIC. This is the per RNIC
parameter that represents the maximum total value of IRD for
all QPs. This value MUST be Zero if the resources used to
handle Inbound RDMA Read Requests are not shared between
QPs. (For more information, see Section 6.5 - Outstanding
RDMA Read Resource Management)
o The maximum number of outbound RDMA Read Request Messages
that can be outstanding per RNIC. This is the per RNIC
parameter that represents the maximum total value of ORD for
all QPs. This value is Zero if the resources used to handle
outstanding Outbound RDMA Read Request Messages are not
shared between QPs.
o The maximum number of inbound RDMA Read Request Messages
that can be in the IRRQ per QP. This represents the maximum
value for IRD for any QP.
o The maximum number of outbound RDMA Read Request Messages
that can be outstanding per QP. This represents the maximum
value for ORD for any QP.
o Ability of this RNIC to support modifying IRD after the QP
has been created.
o Ability of this RNIC to support increasing ORD after the QP
has been created.
o The maximum number of Memory Windows supported by this RNIC.
o The ability of this RNIC to support modifying the maximum
number of outstanding Work Requests per QP. (For more
information, see Section 6.1.3 - Modifying Queue Pair
Attributes)
o The Physical Block List mode of the RNIC. This MUST either
be Block List Mode or Page List Mode.
o If Block List Mode is supported:
+ The Physical Buffer Entry range of sizes supported by
this RNIC.
o If Page List Mode is supported:
+ The List of Page sizes supported by this RNIC.
o The ability of this RNIC to support Shared Receive Queues.
o The ability of this RNIC to perform CQ Overflow detection.
o If Shared Receive Queues are supported:
+ The maximum number of Shared Receive Queues supported by
this RNIC.
+ The dequeuing model the RNIC supports: arrival order or
sequential order.
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
9.2.1.3 Close RNIC
Description:
Closes and resets the specified RNIC.
This Verb is responsible for de-allocating resources allocated
by the RI and for making the RNIC unavailable for use by the
Consumer.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 5.1.4 - Closing an RNIC.
Input Modifiers:
* RNIC handle.
Output Modifiers:
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
9.2.2 Protection Domain
9.2.2.1 Allocate PD
Description:
Allocates an unused Protection Domain.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 5.2.1 - Allocating a PD.
Input Modifiers:
* RNIC Handle.
Output Modifiers:
* If the operation completed successfully:
o PD ID.
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
o Insufficient resources to complete request.
9.2.2.2 Deallocate PD
Description:
Deallocates a previously Allocated Protection Domain.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 5.2.2 - Deallocating a PD.
The Protection Domain MUST NOT be deallocated if it is still
associated with any Queue Pair, Non-Shared Memory Region, Shared
Memory Region, Shared Receive Queue, Bound Memory Window or
Invalidated Memory Window.
Input Modifiers:
* RNIC Handle.
* PD ID.
Output Modifiers:
* Verb Results:
o Operation completed successfully.
o Invalid PD ID.
o Invalid RNIC handle.
o Protection Domain is in use.
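The in-use rule for Deallocate PD can be sketched with a reference
count per Protection Domain. The PdTable class and its associate
and release methods are hypothetical; they stand in for the
creation and destruction of QPs, Memory Regions, S-RQs and Memory
Windows that reference the PD.

```python
class PdTable:
    """Toy model of PD allocation with an in-use check."""

    def __init__(self):
        self._users = {}   # PD ID -> count of associated resources
        self._next_id = 1

    def allocate_pd(self):
        pd_id = self._next_id
        self._next_id += 1
        self._users[pd_id] = 0
        return pd_id

    def associate(self, pd_id):
        self._users[pd_id] += 1   # e.g. a QP created in this PD

    def release(self, pd_id):
        self._users[pd_id] -= 1   # e.g. that QP destroyed

    def deallocate_pd(self, pd_id):
        if pd_id not in self._users:
            return "Invalid PD ID."
        if self._users[pd_id] > 0:
            return "Protection Domain is in use."
        del self._users[pd_id]
        return "Operation completed successfully."


pds = PdTable()
pd = pds.allocate_pd()
pds.associate(pd)
first = pds.deallocate_pd(pd)    # refused: a resource still uses it
pds.release(pd)
second = pds.deallocate_pd(pd)   # succeeds
third = pds.deallocate_pd(pd)    # the PD ID is no longer valid
print(first, second, third)
```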
9.2.3 Completion Queue
9.2.3.1 Create CQ
Description:
Creates a CQ on the specified RNIC. In addition, a Completion
Event Handler may be registered for the created CQ.
The Consumer must specify the minimum number of entries in the
CQ. The number of allocated entries for CQEs on the specified
CQ, which might be different than the number requested, is
returned on successful creation. The number returned differs
only when the number of actual entries is greater than the
number that the Consumer requested. If the maximum number of
entries the RNIC supports is less than the Consumer requested,
an Immediate Error is returned and the CQ is not created.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 5.3.1 - Creating a Completion Queue.
Input Modifiers:
* RNIC handle.
* The minimum number of entries in the CQ.
* Completion Event Handler Identifier - An opaque handle used to
identify a Completion Event Handler. If the identifier is set to
zero, then there is no Completion Event Handler associated with
this CQ. Completion Event Handler Identifiers are obtained via
the Set Completion Event Handler Verb.
Output Modifiers:
* If the operation completed successfully:
o The handle of the newly created CQ.
o The allocated number of entries in the CQ.
* Verb Results:
o Operation completed successfully.
o Insufficient resources to complete request.
o Invalid RNIC handle.
o Number of CQ entries requested exceeds RNIC capability.
o Invalid Completion Event Handler Identifier
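
The sizing behavior of Create CQ can be sketched as follows. This is a non-normative illustration: the function name, the `rnic_max_entries` parameter, and the rounding `granularity` are assumptions, since the specification only requires that the allocated count be at least the requested minimum.

```python
def create_cq(min_entries, rnic_max_entries, granularity=64):
    """Return (result, allocated_entries); granularity is hypothetical."""
    if min_entries > rnic_max_entries:
        # Immediate Error: the CQ is not created.
        return ("Number of CQ entries requested exceeds RNIC capability", None)
    # Round up to the assumed allocation granularity, capped at the RNIC
    # maximum, so the result is never less than min_entries.
    allocated = -(-min_entries // granularity) * granularity
    allocated = min(allocated, rnic_max_entries)
    return ("Operation completed successfully", allocated)
```

For example, under these assumptions a request for 100 entries on an RNIC that supports 1024 would allocate 128 entries.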
9.2.3.2 Query CQ
Description:
Returns the number of entries in the specified CQ.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 5.3.2 - Querying Completion Queue Attributes.
Input Modifiers:
* RNIC handle.
* CQ handle.
Output Modifiers:
* If the operation completed successfully:
o The allocated number of entries in the CQ.
o The Completion Event Handler Identifier.
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
o Invalid CQ handle.
9.2.3.3 Modify CQ
Description:
Resizes the CQ.
A CQ must be able to be resized with outstanding Work
Completions on the CQ and Work Requests on queues associated
with the specified CQ. If the requested minimum number of
entries in the CQ is insufficient to hold the current number of
entries on the CQ, an Immediate Error will result.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 5.3.3 - Modifying Completion Queue Attributes.
Input Modifiers:
* RNIC handle.
* CQ handle.
* The minimum number of entries in the CQ.
Output Modifiers:
* If the operation completed successfully:
o The allocated number of entries in the CQ.
* Verb Results:
o Operation completed successfully.
o Insufficient resources to complete request.
o Invalid RNIC handle.
o Invalid CQ handle.
o Number of CQ entries requested exceeds RNIC capability.
o An attempt to shrink the size of the queue failed because
too many Completion Queue Entries were still present on the
Completion Queue.
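
The resize checks for Modify CQ can be sketched in the same non-normative way. The function and parameter names are assumptions, and the sketch allocates exactly the requested minimum, whereas an RI may allocate more.

```python
def modify_cq(min_entries, cqes_present, rnic_max_entries):
    """Return (result, allocated_entries) for a hypothetical CQ resize."""
    if min_entries > rnic_max_entries:
        return ("Number of CQ entries requested exceeds RNIC capability", None)
    if min_entries < cqes_present:
        # Shrinking below the CQEs still on the queue is an Immediate Error.
        return ("Too many Completion Queue Entries still present", None)
    # This sketch allocates exactly the requested minimum.
    return ("Operation completed successfully", min_entries)
```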
9.2.3.4 Destroy CQ
Description:
Destroys the specified CQ.
The CQ cannot be destroyed if any Work Queue is still associated
with the CQ.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 5.3.4 - Destroying a Completion Queue.
Input Modifiers:
* RNIC handle.
* CQ handle.
Output Modifiers:
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
o Invalid CQ handle.
o One or more Work Queues are still associated with the CQ.
9.2.4 Shared Receive Queue
9.2.4.1 Create S-RQ
Description:
Creates an S-RQ for the specified RNIC.
A set of initial S-RQ attributes must be specified by the
Consumer. If any of the required initial attributes are illegal
or missing, an error is returned and the S-RQ is not created.
The RI MUST support this Verb if the Query RNIC Output Modifier
indicates support for an S-RQ and MUST support all of the Input
& Output Modifiers in this case, except where noted. For more
information, see Section 6.3.1 - Creating a Shared Receive
Queue.
Input Modifiers:
* RNIC handle.
* The maximum number of outstanding Work Requests the Consumer
expects to submit to the Shared Receive Queue.
* The S-RQ Limit. The S-RQ Limit detection is armed by the RI upon
creation of the S-RQ, if the S-RQ Limit is non-zero.
* The maximum number of Scatter/Gather Elements the Consumer can
specify in a Work Request.
* PD ID.
Output Modifiers:
* If the operation completed successfully:
o The S-RQ Handle.
o The allocated number of outstanding Work Requests the
Consumer can submit to the Shared Receive Queue.
o The allocated number of scatter/gather elements that can be
specified in Work Requests. If an error is not returned,
this is guaranteed to be greater than or equal to the number
requested.
* Verb Results:
o Operation completed successfully.
o Insufficient resources to complete request.
o Invalid RNIC handle.
o Maximum number of Work Requests requested exceeds RNIC
capability.
o Maximum number of scatter/gather elements per Receive Queue
Work Request requested exceeds RNIC capability.
o Invalid PD ID.
o S-RQ Limit out of range.
9.2.4.2 Query S-RQ
Description:
Returns the attribute list and current values for the specified
S-RQ.
The RI MUST support this Verb if the Query RNIC Output Modifier
indicates support for an S-RQ and MUST support all of the Input
& Output Modifiers in this case, except where noted.
Input Modifiers:
* RNIC Handle.
* S-RQ Handle.
Output Modifiers:
* The S-RQ attributes, if the operation completed successfully.
The attributes returned by the query are:
o The allocated number of outstanding Work Requests supported
on the Shared Receive Queue.
o The allocated number of Scatter/Gather Elements supported on
Work Requests submitted to the Shared Receive Queue.
o PD ID.
o The S-RQ Limit.
o S-RQ Limit Armed Indicator.
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
o Invalid S-RQ handle.
9.2.4.3 Modify S-RQ
Description:
Modifies the attributes for the specified S-RQ.
The RI MUST support this Verb if the Query RNIC Output Modifier
indicates support for an S-RQ and MUST support all of the Input
& Output Modifiers in this case, except where noted. For more
information, see Section 6.3.2 - Modifying a Shared Receive
Queue.
Input Modifiers:
* RNIC Handle.
* S-RQ Handle.
* The S-RQ attributes to modify and their new values. The S-RQ
attributes that can be modified after the S-RQ has been created
are:
o The maximum number of outstanding Work Requests the Consumer
expects to submit to the Shared Receive Queue (if changing
is supported by the RNIC).
o The S-RQ Limit.
o Re-arm the S-RQ Limit Asynchronous Event.
Output Modifiers:
* If the operation completed successfully:
o The allocated number of outstanding Work Requests supported
on the Shared Receive Queue.
* Verb Results:
o Operation completed successfully.
o Insufficient resources to complete request.
o Invalid RNIC handle.
o Invalid S-RQ handle.
o Maximum number of Shared Receive Queue Work Requests
requested exceeds RNIC capability.
o An attempt to shrink the size of the queue failed because
too many elements were still present.
o S-RQ Limit out of range.
o Invalid Input Modifier.
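
The arming and re-arming of the S-RQ Limit can be sketched as below. This is a non-normative illustration: the class and method names are hypothetical, and the trigger condition (the number of posted Work Requests dropping below the limit while armed) is an assumption, since the exact condition is defined in the S-RQ Limit Checking section rather than here.

```python
class SharedReceiveQueue:
    """Hypothetical model of S-RQ Limit arming; not a real RI interface."""

    def __init__(self, limit):
        self.limit = limit
        # Armed on creation only if the S-RQ Limit is non-zero.
        self.armed = limit != 0
        self.posted = 0
        self.events = []

    def post_receive(self, count=1):
        self.posted += count

    def consume(self):
        self.posted -= 1
        # Assumed trigger: posted WQEs drop below the limit while armed.
        if self.armed and self.posted < self.limit:
            self.armed = False  # the event fires once until re-armed
            self.events.append("S-RQ Limit Reached")

    def modify(self, limit=None, rearm=False):
        if limit is not None:
            self.limit = limit
        if rearm:
            self.armed = True
```

In this model the event fires at most once per arming, so the Consumer re-arms it through the modify operation after replenishing the queue.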
9.2.4.4 Destroy S-RQ
Description:
Destroys the specified S-RQ.
The RI MUST support this Verb if the Query RNIC Output Modifier
indicates support for an S-RQ and MUST support all of the Input
& Output Modifiers in this case, except where noted.
For more information, see Section 6.3.3 - Destroying a Shared
Receive Queue.
Input Modifiers:
* RNIC handle.
* S-RQ handle.
Output Modifiers:
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
o Invalid S-RQ handle.
o QPs still associated with the S-RQ.
9.2.5 Queue Pair
9.2.5.1 Create QP
Description:
Creates a QP for the specified RNIC.
A set of initial QP attributes must be specified by the
Consumer. If any of the required initial attributes are illegal
or missing, an error is returned and the Queue Pair is not
created.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 6.1.1 - Creating a Queue Pair.
Input Modifiers:
* RNIC handle.
* The QP attributes that must be specified at QP create time are:
o The CQ handle of the CQ to be associated with the Send
Queue.
o The CQ handle of the CQ to be associated with the Receive
Queue. (Note that this may be the same CQ that is associated
with the Send Queue, or it may be a different CQ.)
o The maximum number of outstanding Work Requests the Consumer
expects to submit to the Send Queue.
o The maximum number of outstanding Work Requests the Consumer
expects to submit to the Receive Queue. This value is
ignored if the QP is associated with an S-RQ.
o If the QP's RQ will be associated with an S-RQ:
+ S-RQ Handle.
+ QP RQ Limit Indicator, as discussed in Section 6.3.8 -
S-RQ Limit Checking. The QP RQ Limit detection is armed
by the RI upon creation of the QP, if non-zero.
o Inbound RDMA Read enable.
o Inbound RDMA Write and inbound RDMA Read Response enable.
o Bind Memory Windows enable.
o The maximum number of scatter/gather elements the Consumer
can specify in a Send Operation Type Work Request submitted
to the Send Queue.
o The maximum number of scatter/gather elements the Consumer
can specify in a RDMA Write Work Request submitted to the
Send Queue.
o The maximum number of scatter/gather elements the Consumer
can specify in a Work Request submitted to the Receive
Queue. This value is ignored if the QP is associated
with an S-RQ.
o ORD (Requested) - The requested maximum number of
outstanding Outgoing RDMA Read Request Messages the RNIC can
initiate from the SQ.
o IRD (Requested) - The requested maximum number of
outstanding Incoming RDMA Read Request Messages (e.g. IRRQ
depth) the RNIC can handle for this QP.
o PD ID.
o Enable or disable the Use of the STag of zero and Fast-
Register Non-Shared Memory Region Operations. This MUST only
be allowed to be enabled for Privileged Mode Consumers.
Output Modifiers:
* If the operation completed successfully:
o The QP Handle.
o The QP ID.
o The allocated number of outstanding Work Requests supported
on the Send Queue. If an error is not returned, this is
guaranteed to be greater than or equal to the number
requested. (This may require the Consumer to increase the
size of the CQ.)
o The allocated number of outstanding Work Requests supported
on the Receive Queue. If an error is not returned, this is
guaranteed to be greater than or equal to the number
requested. (This may require the Consumer to increase the
size of the CQ.) This value is not returned if the QP is
associated with an S-RQ.
o The allocated number of scatter/gather elements that can be
specified in Work Requests submitted to the Send Queue. If
an error is not returned, this is guaranteed to be greater
than or equal to the number requested.
o The allocated number of Scatter/Gather Elements supported on
RDMA Write Work Requests submitted to the Send Queue. If an
error is not returned, this is guaranteed to be greater than
or equal to the number requested.
o The allocated number of Scatter/Gather Elements that can be
specified in Work Requests submitted to the Receive Queue.
If an error is not returned, this is guaranteed to be
greater than or equal to the number requested. This value is
not returned if the QP is associated with an S-RQ.
o ORD (allocated) - The allocated number of outstanding RDMA
Read Request Messages the RNIC can initiate from the SQ at
the Data Sink. This number MUST be between zero and the
number requested, inclusive. If the Consumer requested a
non-zero number and the RI was unable to provision at least
one then an Immediate Error MUST be returned.
o IRD (allocated) - The allocated number of incoming
outstanding RDMA Read Request Messages (e.g. IRRQ depth) the
RNIC's QP can handle at the Data Source. If the Consumer
requested a non-zero number and the RI was unable to
provision at least one then an Immediate Error MUST be
returned.
* Verb Results:
o Operation completed successfully.
o Insufficient resources to complete request.
o Invalid RNIC handle.
o Invalid CQ handle.
o Invalid S-RQ handle.
o The value requested for ORD exceeds RNIC capability.
o The value requested for IRD exceeds RNIC capability.
o Maximum number of Send Queue Work Requests requested exceeds
RNIC capability.
o Maximum number of Receive Queue Work Requests requested
exceeds RNIC capability
o Maximum number of scatter/gather elements per Send Queue
Work Request requested exceeds RNIC capability.
o Maximum number of scatter/gather elements per Receive Queue
Work Request requested exceeds RNIC capability.
o Invalid Protection Domain.
o QP RQ Limit Out of Range.
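
The allocation rule for ORD stated above (between zero and the number requested, with an Immediate Error if a non-zero request cannot be given at least one) can be sketched as follows. The function name, the `rnic_max` and `available` parameters, and the choice of result strings are assumptions for illustration; the specification does not say which Verb Result reports the provisioning failure.

```python
def allocate_read_depth(requested, rnic_max, available):
    """Return (result, allocated) for an ORD or IRD request (hypothetical)."""
    if requested > rnic_max:
        return ("The value requested exceeds RNIC capability", None)
    # Never allocate more than was requested.
    allocated = min(requested, available)
    if requested > 0 and allocated == 0:
        # A non-zero request that cannot be given at least one entry
        # is an Immediate Error (result string assumed here).
        return ("Insufficient resources to complete request", None)
    return ("Operation completed successfully", allocated)
```

A Consumer should therefore read back the allocated value rather than assume the requested depth was granted.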
9.2.5.2 Query QP
Description:
Returns the attribute list and current values for the specified
QP.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 6.1.2 - Querying Queue Pair Attributes.
Input Modifiers:
* RNIC Handle.
* QP Handle.
Output Modifiers:
* The QP attributes, if the operation completed successfully. The
attributes returned by the query are:
o Handle of the Completion Queue associated with the Send
Queue.
o Handle of the Completion Queue associated with the Receive
Queue.
o Handle of the S-RQ. This value is only returned if the QP is
associated with an S-RQ.
o The allocated number of outstanding Work Requests supported
on the Send Queue.
o The allocated number of outstanding Work Requests supported
on the Receive Queue. This value is not returned if the QP
is associated with an S-RQ.
o The actual number of Scatter/Gather Elements supported on
Send Operation Type Work Requests submitted to the Send
Queue.
o The allocated number of Scatter/Gather Elements supported on
RDMA Write Work Requests submitted to the Send Queue.
o The allocated number of Scatter/Gather Elements supported on
Work Requests submitted to the Receive Queue. This value is
not returned if the QP is associated with an S-RQ.
o ORD - The allocated number of outstanding RDMA Read Request
Messages the RNIC can initiate from the SQ at the Data Sink.
o IRD - The allocated number of outstanding incoming RDMA Read
Request Messages (e.g. IRRQ depth) the RNIC's QP can handle
at the Data Source.
o Current QP state.
o PD ID.
o QP ID.
o Use of the STag of zero and Fast-Register Non-Shared Memory
Region Operations enabled.
o Inbound RDMA Read enable.
o Inbound RDMA Write and inbound RDMA Read Response enable.
o Bind Memory Windows enable.
The following attributes are not defined unless the QP is in the
Terminate or Error states.
o A buffer containing the Terminate Message that was received
or sent (if possible).
o An indicator to state if the Terminate Message was generated
locally or by the Associated QP.
The following attributes are only defined if the QP is
associated with a Shared Receive Queue.
o Current QP's RQ Limit.
o QP's RQ Limit armed indicator.
The following attributes are only defined if the QP is not in
the Idle state.
o LLP Stream Handle.
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
o Invalid QP handle.
9.2.5.3 Modify QP
Description:
Modifies the attributes for the specified QP then causes the QP
to transition to the specified QP state. Only a subset of the QP
attributes can be modified in each of the QP states.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 6.1.3 - Modifying Queue Pair Attributes.
Input Modifiers:
* RNIC Handle.
* QP Handle.
* The QP attributes to modify and their new values. The QP
attributes that can be modified after the QP has been created
are:
o Next QP state. If the current state is specified, only the
QP attributes will be modified.
o ORD - The requested number of outstanding RDMA Read Request
Messages the RNIC can initiate from the SQ at the Data Sink.
o IRD - The requested number of incoming outstanding RDMA Read
Request Messages (e.g. IRRQ depth) the RNIC's QP can handle
at the Data Source.
o The maximum number of outstanding Work Requests the Consumer
expects to submit to the Send Queue (if changing is
supported by the RNIC).
o The maximum number of outstanding Work Requests the Consumer
expects to submit to the Receive Queue (if changing is
supported by the RNIC). This value is not allowed if the QP
is associated with an S-RQ.
The following attributes are only defined if the QP is
associated with a Shared Receive Queue.
o QP's RQ Limit, as described in Section 6.3.8 - S-RQ Limit
Checking.
o Re-arm the QP's RQ Limit, as described in Section 6.3.8 -
S-RQ Limit Checking. The RI MUST allow an already armed S-RQ
limit to be re-armed.
Valid only when moving from Idle to RTS.
o LLP Stream Handle.
o Stream Message Buffer.
Output Modifiers:
* If the operation completed successfully:
o The allocated number of outstanding Work Requests supported
on the Send Queue.
o The allocated number of outstanding Work Requests supported
on the Receive Queue. This value is not returned if the QP
is associated with an S-RQ.
o ORD - The allocated number of outstanding RDMA Read Request
Messages the RNIC can initiate from the SQ at the Data Sink.
This number MUST be between zero and the number requested,
inclusive. If the Consumer requested a non-zero number and
the RI was unable to provision at least one, then an
Immediate Error will be returned.
o IRD - The allocated number of incoming outstanding RDMA Read
Request Messages (e.g. the IRRQ depth) the RNIC's QP can
handle at the Data Source. If the Consumer requested a
non-zero number and the RI was unable to provision at least
one, then an Immediate Error will be returned.
* Verb Results:
o Operation completed successfully.
o Insufficient resources to complete request.
o Invalid RNIC handle.
o Invalid QP handle.
o Cannot change QP attribute.
o Invalid QP state change requested.
o Maximum number of Send Queue Work Requests requested exceeds
RNIC capability.
o Maximum number of Receive Queue Work Requests requested
exceeds RNIC capability.
o The value requested for ORD exceeds RNIC capability.
o The value requested for IRD exceeds RNIC capability.
o An attempt to shrink the size of the queue failed because
too many elements were still present.
o Invalid LLP Stream Handle.
o Invalid Modifier.
o RI still flushing WQEs.
o RQ Limit Out of Range.
9.2.5.4 Destroy QP
Description:
Destroys the specified QP.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. The QP cannot be
destroyed if any Memory Windows are still Bound to the QP.
For more information, see Section 6.1.4 - Destroying a Queue
Pair.
Input Modifiers:
* RNIC handle.
* QP handle.
Output Modifiers:
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
o Invalid QP handle.
o Memory Windows still Bound to QP.
9.2.6 Memory Management
Memory Management Verbs are used to manage Memory Regions and Memory
Windows. The following table describes what each of the Memory
Management Verbs manages and where the Verb appears to be performed:
Verb                                    Used to manage  Performed by
                                        (MR vs. MW)     (RI vs. RNIC)
Allocate Non-Shared Memory Region STag  MR              RI
Register Non-Shared Memory Region       MR              RI
  (RI-Register)
Reregister Non-Shared Memory Region     MR              RI
  (RI-Reregister)
Register Shared Memory Region           MR              RI
Fast-Register Non-Shared Memory Region  MR              RNIC
  (PostSQ)
Query Memory Region                     MR              RI
Invalidate Local STag (PostSQ)          MR or MW        RNIC
Deallocate STag                         MR or MW        RI
Allocate Memory Window                  MW              RI
Query Memory Window                     MW              RI
Bind Memory Window (PostSQ)             MW              RNIC
Figure 26 - Memory Management Verbs
9.2.6.1 Allocate Non-Shared Memory Region STag
Description:
Allocates memory registration resources on the RNIC.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 7.3.2.1 - Allocate Non-Shared Memory Region STag.
Input Modifiers:
* RNIC Handle.
* Requested Physical Buffer List size to be allocated.
* PD ID.
* Remote Access Flag. If set, Local and Remote Access is enabled.
Otherwise only Local access is enabled.
Output Modifiers:
* If the operation completed successfully:
o STag Index - used for local and, if specified by the input
modifiers, remote access.
o The actual number of Physical Buffer List Entries in the
allocated Physical Buffer List. Note that this MAY be
greater than the number requested.
* Verb Results:
o Operation completed successfully.
o Insufficient resources to complete request.
o Invalid RNIC handle.
o Invalid PD ID.
9.2.6.2 Register Non-Shared Memory Region (RI-Register)
Description:
Registers a Non-Shared Memory Region for use by an RNIC.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 7.3.2.2 - RI-Register Non-Shared Memory Region.
Input Modifiers:
* RNIC Handle.
* Physical Buffer Entry size - The size, in bytes, of each
Physical Buffer in the list. Note: If the Physical Buffer List
references a Page List, the size MUST be a power of two. If the
Physical Buffer List references a Block List, the size MAY have
arbitrary byte alignment.
* Address List - A list of addresses that point to the Physical
Buffers referenced by the Physical Buffer List. All Physical
Buffers in the list have the same size.
* Address List Length - the number of entries in the Address list.
* First Byte Offset (FBO) - Offset to start of Non-Shared Memory
Region on first Physical Buffer.
* Length - Total length of the Non-Shared Memory Region (can be of
arbitrary byte-aligned length).
* Addressing type. The Addressing type MUST be one of the
following:
o VA Based TO
o Zero Based TO
* The following input modifier is only valid if the Addressing
type is VA Based TO:
o Virtual Address - The VA address of the first byte in the
Non-Shared Memory Region.
* PD ID.
* STag Key.
* Remote Access Flag.
* Access Control - The following MAY be selected in any
combination except as noted:
o Enable Local Write Access.
o Enable Remote Write Access. Remote Write Access requires
Local Write Access to be enabled.
o Enable Local Read Access.
o Enable Remote Read Access. Remote Read Access requires Local
Read Access to be enabled.
o Enable Memory Window Binding.
Output Modifiers:
* If the operation completed successfully:
o STag Index - used for local and, if specified by the input
modifiers, remote access. Note: the RNIC associates the STag
Key passed in as an input modifier with the STag associated with
the registered Non-Shared Memory Region.
o The actual number of Physical Buffer List Entries in the
allocated Physical Buffer List. Note that this MAY be
greater than the number requested.
* Verb Results:
o Operation completed successfully.
o Insufficient resources to complete request.
o Invalid RNIC handle.
o Invalid PD ID.
o Invalid Virtual Address.
o Invalid length.
o Invalid First Byte Offset.
o Invalid Access Rights requested.
o Invalid Physical Buffer List entry.
o Invalid Physical Buffer size.
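
Two of the input checks described for RI-Register can be sketched as follows: the dependency of remote access rights on the corresponding local rights, and the power-of-two size requirement for a Page List. The `Access` flag names, the `check_registration` function, and the `is_page_list` parameter are assumptions for illustration and are not part of the Verb definition.

```python
from enum import Flag, auto


class Access(Flag):
    """Hypothetical encoding of the Access Control input modifier."""
    LOCAL_WRITE = auto()
    REMOTE_WRITE = auto()
    LOCAL_READ = auto()
    REMOTE_READ = auto()
    MW_BIND = auto()


def check_registration(access, entry_size, is_page_list):
    # Remote Write Access requires Local Write Access to be enabled.
    if Access.REMOTE_WRITE in access and Access.LOCAL_WRITE not in access:
        return "Invalid Access Rights requested"
    # Remote Read Access requires Local Read Access to be enabled.
    if Access.REMOTE_READ in access and Access.LOCAL_READ not in access:
        return "Invalid Access Rights requested"
    # A Page List requires a power-of-two Physical Buffer Entry size.
    if is_page_list and (entry_size <= 0 or entry_size & (entry_size - 1)):
        return "Invalid Physical Buffer size"
    return "Operation completed successfully"
```

A Block List is not subject to the power-of-two check in this sketch, so a 3000-byte entry size would be accepted for a Block List but rejected for a Page List.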
9.2.6.3 Query Memory Region
Description:
Retrieves information about a specific Memory Region.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 7.7 - Querying Memory Regions.
Input Modifiers:
* RNIC Handle.
* STag Index - as originally returned from an Allocate Non-Shared
Memory Region STag, RI-Register Non-Shared Memory Region, RI-
Reregister Non-Shared Memory Region or Register Shared Memory
Region Type Verb.
Output Modifiers:
* If the operation completed successfully:
o STag Key - Current STag Key associated with the Memory
Region, if it is in the Valid state.
o Remote Access Flag.
o PD ID.
o STag State: Valid or Invalid.
o STag Type: Shared or Non-Shared.
o The actual number of Physical Buffer List Entries in the
allocated Physical Buffer List. Note that this MAY be
greater than the number requested.
o Access Control settings for the registered Region. The
following MAY be set in any combination except as noted:
+ Local Write Access Enabled.
+ Remote Write Access Enabled. Remote Write Access
requires Local Write Access to be enabled.
+ Local Read Access Enabled.
+ Remote Read Access Enabled. Remote Read Access requires
Local Read Access to be enabled.
+ Memory Window Binding Enabled.
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
o Invalid STag Index.
9.2.6.4 Deallocate STag
Description:
Removes an STag created through an Allocate Non-Shared Memory
Region STag, RI-Register Non-Shared Memory Region, RI-Reregister
Non-Shared Memory Region, Register Shared Memory Region or
Allocate Memory Window from the RNIC.
Work Requests or Remote Operation requests that are in-process
and actively referencing memory locations associated with the
STag being deallocated must fail with a protection error.
If the STag references a Memory Region which has Memory Windows
Bound to it, an Immediate Error MUST be returned and the Memory
Region must not be destroyed or modified.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 7.9 - Deallocation of STag associated with a Memory
Region and Section 7.10.4 - Invalidating or De-allocating Memory
Windows.
Input Modifiers:
* RNIC Handle.
* STag Index - as originally returned from an Allocate Non-Shared
Memory Region STag, Allocate Memory Window, or RI-Register Non-
Shared Memory Region, RI-Reregister Non-Shared Memory Region or
Register Shared Memory Region Verb.
Output Modifiers:
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
o Invalid STag Index.
o One or more Memory Windows are still Bound to the Memory
Region. Applies only if the STag is associated with a Memory
Region.
9.2.6.5 Reregister Non-Shared Memory Region (RI-Reregister)
Description:
Modifies the attributes of an existing Non-Shared Memory Region.
The STag output modifier from this Verb must be used in place of
any previously issued for this Non-Shared Memory Region.
If the STag references a Non-Shared Memory Region which has
Memory Windows Bound to it, an Immediate Error MUST be returned
and the Non-Shared Memory Region must not be destroyed or
modified.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 7.3.2.3 - RI-Reregister Non-Shared Memory Region.
Input Modifiers:
* RNIC Handle.
* Physical Buffer Entry size - The size, in bytes, of each
Physical Buffer Entry in the list. Note: If the Physical Buffer
List references a Page-List, the size MUST be a power of two. If
the Physical Buffer List references a Block-List, the size MAY
have arbitrary byte alignment.
* Address List - A list of addresses that point to the Physical
Buffers referenced by the Physical Buffer List. All Physical
Buffers in the list MUST have the same size.
* Address List Length - the number of entries in the Address list.
* First Byte Offset (FBO) - Offset to start of Non-Shared Memory
Region on first Physical Buffer.
* Length - Total length of Non-Shared Memory Region (can be of
arbitrary byte-aligned length).
* Addressing type. The addressing type MUST be one of the
following:
o VA Based TO
o Zero Based TO
* The following input modifier is only valid if the Addressing
type is VA Based TO:
o Virtual Address - The VA address of the first byte in the
Non-Shared Memory Region.
* PD ID.
* STag Index.
* STag Key (not the existing STag Key, but the new STag Key).
* Remote Access Flag.
* Access Control - The following MAY be selected in any
combination except as noted:
o Enable Local Write Access.
o Enable Remote Write Access. Remote Write Access requires
Local Write Access to be enabled.
o Enable Local Read Access.
o Enable Remote Read Access. Remote Read Access requires Local
Read Access to be enabled.
o Enable Memory Window Binding.
Output Modifiers:
* If the operation completed successfully:
o STag Index - used for local and, if specified by the input
modifiers, remote access. Note: the RNIC associates the STag
Key passed in as an input modifier with the STag associated with
the registered Non-Shared Memory Region. If the output STag
index differs from the input STag index, the old STag index
was Deallocated.
o The actual number of Physical Buffer List Entries in the
allocated Physical Buffer List. Note that this MAY be
greater than the number requested.
* Verb Results:
o Operation completed successfully.
o Insufficient resources to complete request.
o Invalid RNIC handle.
o Invalid STag Index.
o Invalid Virtual Address.
o Invalid Length.
o Invalid PD ID.
o Invalid First Byte Offset.
o Invalid Access Rights request.
o One or more Memory Windows are still Bound to the Region.
o Invalid Physical Buffer List entry.
o Invalid Physical Buffer size.
9.2.6.6 Register Shared Memory Region
Description:
Registers a new Shared Memory Region which shares RNIC mapping
resources with a previously registered Memory Region, thus
returning a new STag. Note that other than the change of the
original Memory Region to a Shared Memory Region, the original
Memory Region remains unaffected by this operation.
The Base TO, VA (if the input STag Index references a VA Based
TO), PD ID, and Access Rights specified for the new Memory
Region need not be the same as those of the existing Memory
Region. The lengths are by definition the same.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 7.4.3 - Multiple Registrations of Memory Regions.
Input Modifiers:
* RNIC Handle.
* STag Index of the existing Memory Region. If the existing Memory
Region is Non-Shared, successful completion of this verb will
convert the existing Non-Shared Memory Region to a Shared Memory
Region.
* Addressing type. The addressing type MUST be one of the
following:
o VA Based TO
o Zero Based TO
* The following modifier is only valid if the Addressing type of
the existing region is VA Based TO:
o Virtual Address - The VA address of the first byte in the
Memory Region.
* PD ID.
* STag Key of the new STag.
* Remote Access Flag.
* Access Control - The following MAY be selected in any
combination except as noted:
o Enable Local Write Access.
o Enable Remote Write Access. Remote Write Access requires
Local Write Access to be enabled.
o Enable Local Read Access.
o Enable Remote Read Access. Remote Read Access requires Local
Read Access to be enabled.
o Enable Memory Window Binding.
Output Modifiers:
* If the operation completed successfully:
o STag Index - used for local and, if specified by the input
modifiers, remote access. Note: the RNIC associates the STag
Key passed in as an input modifier with the STag of the
registered Shared Memory Region.
* Verb Results:
o Operation completed successfully.
o Insufficient resources to complete request.
o Invalid RNIC handle.
o Invalid STag Index.
o Invalid Virtual Address.
o Invalid PD ID.
o Invalid Access Rights requested.
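The two dependencies among the Access Control modifiers above (Remote Write requires Local Write, Remote Read requires Local Read) can be illustrated with a short sketch. The Python below is illustrative only and is not part of the Verbs interface. The names Access and access_rights_valid are hypothetical, and treating a failed check as the "Invalid Access Rights requested" result is an assumption.

```python
from enum import Flag, auto


class Access(Flag):
    """Hypothetical encoding of the Access Control modifiers."""
    LOCAL_READ = auto()
    LOCAL_WRITE = auto()
    REMOTE_READ = auto()
    REMOTE_WRITE = auto()
    MW_BIND = auto()


def access_rights_valid(rights: Access) -> bool:
    """Return True if the combination satisfies both stated dependencies."""
    # Remote Write Access requires Local Write Access to be enabled.
    if Access.REMOTE_WRITE in rights and Access.LOCAL_WRITE not in rights:
        return False
    # Remote Read Access requires Local Read Access to be enabled.
    if Access.REMOTE_READ in rights and Access.LOCAL_READ not in rights:
        return False
    return True
```

A caller of such a helper would presumably report "Invalid Access Rights requested" when it returns False; all other combinations are accepted.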
9.2.6.7 Allocate Memory Window
Description:
This Verb allocates a memory window and associates it with a
Protection Domain. It is not inherently associated with any
Memory Region when allocated.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 7.10.1 - Allocating Memory Windows.
Input Modifiers:
* RNIC Handle.
* PD ID.
Output Modifiers:
* If the operation completed successfully:
o STag Index - an unbound STag for use in specifying the
Window when invoking a Bind Work Request through the Post
Send Verb.
* Verb Results:
o Operation completed successfully.
o Insufficient resources to complete request.
o Invalid RNIC handle.
o Invalid PD ID.
9.2.6.8 Query Memory Window
Description:
This Verb returns the attributes associated with the specified
memory window.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 7.10.3 - Memory Windows.
Input Modifiers:
* RNIC Handle.
* STag Index - the current STag associated with the Memory Window.
Output Modifiers:
* If the operation completed successfully:
o STag Key - current value of the STag Key, if the STag is in
the Valid state.
o STag State: Valid or Invalid.
o PD ID.
o Access Rights. The following may be set in any combination
except as noted.
+ Remote Write Access Enabled. If set Remote Write Access
is enabled.
+ Remote Read Access Enabled. If set Remote Read Access is
enabled.
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
o Invalid STag Index.
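As a rough illustration of the Query Memory Window outputs listed above, the following Python sketch models a table of Memory Windows keyed by STag Index. It is not a definitive implementation: the MemoryWindow class, the query_memory_window function, and the dictionary-based result are all hypothetical, and modeling a freshly allocated (unbound) window as being in the Invalid state is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MemoryWindow:
    """Hypothetical record kept per Memory Window."""
    stag_index: int
    pd_id: int
    valid: bool = False        # assumed: unbound after Allocate Memory Window
    stag_key: int = 0
    remote_read: bool = False
    remote_write: bool = False


def query_memory_window(windows: dict, stag_index: int) -> dict:
    """Return the Output Modifiers for the window with this STag Index."""
    mw = windows.get(stag_index)
    if mw is None:
        return {"result": "Invalid STag Index"}
    return {
        "result": "Operation completed successfully",
        "stag_state": "Valid" if mw.valid else "Invalid",
        # The STag Key is only reported while the STag is in the Valid state.
        "stag_key": mw.stag_key if mw.valid else None,
        "pd_id": mw.pd_id,
        "remote_read": mw.remote_read,
        "remote_write": mw.remote_write,
    }
```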
9.3 Work Request Processing
9.3.1 QP Operations
9.3.1.1 PostSQ
Description:
Builds a WQE on the Send Queue of the specified QP for each
entry in the Work Request List submitted by the Consumer. This
WQE is added to the end of the Send Queue and the RNIC is
notified that a new WQE is ready to be processed.
Note that not all Input Modifiers are valid for all operations.
If Input Modifiers are specified that are not valid for a
particular operation, they are ignored.
Following the Input Modifiers is a Work Request table (Figure 27)
which lists the Operation Types and the Input Modifiers that are
required for each of those Operation Types.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 8.2.1 - Submitting Work Request to a Work Queue.
Input Modifiers:
* RNIC Handle
* QP Handle.
* A list of Work Requests. Each Work Request MUST contain the
following information:
o A user defined 64-bit Work Request ID
o Operation type. The operation type MUST be one of the
following:
+ Send
+ Send with Solicited Event
+ Send with Invalidate
+ Send with Solicited Event & Invalidate
+ RDMA Write
+ RDMA Read
+ RDMA Read with Invalidate Local STag
+ Bind Memory Window
+ Fast-Register Non-Shared Memory Region
+ Invalidate Local STag
o Completion Notification Type: Signaled or Unsignaled.
o The following list of modifiers are only valid for Send
Operation Types and RDMA Write WRs to represent the Local
Buffer:
+ Scatter/Gather List. The Scatter/Gather List can contain
zero or more Scatter/Gather Elements. This list is
specified only for Send and RDMA type operations.
+ Number of Scatter/Gather Elements.
+ Note that the length is determined by adding up the
Length field in the SGEs of the SGL.
+ Read Fence indicator.
o The following list of modifiers are only valid for RDMA Read
Type operations to represent the Local Buffer:
+ Local Address. This is a contiguous buffer represented
by a TO, an STag, and a Length to be read.
o The following list of modifiers are only valid for RDMA
Write or RDMA Read Type WRs to represent the Remote Buffer:
+ Remote Address. This is a contiguous buffer represented
by a TO and an STag.
o The following modifier is only valid for the Send with
Invalidate and Send with Solicited Event & Invalidate
operations:
+ Remote STag. This is the STag to be Invalidated at the
Remote Peer.
o The following list of modifiers are only valid for Bind
Memory Window operations:
+ STag Index for the Memory Window.
+ STag Key for the Memory Window.
+ STag for the Memory Region that the Memory Window is to
be associated with. This parameter includes both the
STag Index and STag Key.
+ Length or range to be Bound in number of octets.
+ Addressing type. The addressing type MUST be one of the
following:
* VA Based TO
* Zero Based TO
+ Virtual Address - The VA address of the first byte into
the Memory Region. This may be different than the
starting address of the Memory Region.
+ Access Control - either or both of the following must be
selected:
* Enable Remote Write Access. Requires the Memory
Region to have Local Write Access.
* Enable Remote Read Access. Requires the Memory
Region to have Local Read Access.
o The following list of modifiers are only valid for Fast-
Register Non-Shared Memory Region operations:
+ Physical Buffer Entry size - The size, in bytes, of each
Physical Buffer in the list. Note: If the Physical
Buffer List references a Page-List, the size MUST be a
power of two. If the Physical Buffer List references a
Block-List, the size MUST be an RNIC supported size (see
Section 9.2.1.2 - Query RNIC).
+ Address List - A list of addresses that point to the
Physical Buffers referenced by the Physical Buffer List.
All Physical Buffers in the list MUST have the same
size.
+ Address List Length - the number of entries in the
Address list.
+ First Byte Offset (FBO) - Offset to start of Non-Shared
Memory Region on first Physical Buffer.
+ Length - Total length of Non-Shared Memory Region (can
be any value supported by the RNIC).
+ Addressing type. The addressing type MUST be one of the
following:
* VA Based TO
* Zero Based TO
+ The following modifier is only valid if the Addressing
type is VA Based TO:
* Virtual Address - The VA address of the first byte
in the Non-Shared Memory Region
+ STag Index.
+ STag Key.
+ Access Control - The following may be selected in any
combination except as noted:
* Enable Local Write Access.
* Enable Remote Write Access. Remote Write Access
requires Local Write Access to be enabled. The STag
Index MUST have the Remote Access Flag enabled.
* Enable Local Read Access.
* Enable Remote Read Access. Remote Read Access
requires Local Read Access to be enabled. The STag
Index MUST have the Remote Access Flag enabled.
* Enable Memory Window Binding.
o The following list of modifiers are only valid for
Invalidate Local STag operations:
+ STag to be the target of the Invalidate operation.
+ Local Fence indicator.
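The constraints on the Fast-Register modifiers can be sketched as a sequence of checks. The Python below is a minimal illustration under stated assumptions, not the RI's actual behavior: the function names are hypothetical, the status strings are borrowed from the Completion Status Codes in Section 9.5.2, the order of the checks is arbitrary, and the length check (First Byte Offset plus Length must fit in the listed buffers) is one plausible reading of "larger than supported by the buffer list".

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... (hypothetical helper)."""
    return n > 0 and (n & (n - 1)) == 0


def check_fast_register(pble_size: int, address_list: list, fbo: int,
                        length: int, page_list: bool = True) -> str:
    """Return a completion status name for the supplied modifiers."""
    if page_list:
        # For a Page-List, the Physical Buffer Entry size MUST be a power of two.
        if not is_power_of_two(pble_size):
            return "Invalid Physical Buffer Size"
        # For page mode, each entry must start on a page size boundary.
        if any(addr % pble_size != 0 for addr in address_list):
            return "Invalid Physical Buffer List entry"
    # The FBO must not be larger than the physical buffer size.
    if fbo > pble_size:
        return "Invalid FBO"
    # Assumed reading: the region must fit in the buffers that were listed.
    if fbo + length > pble_size * len(address_list):
        return "Invalid length"
    return "Success"
```

For a Block-List (page_list set to False) the sketch skips the power-of-two and alignment checks, since the supported block sizes are RNIC specific and are obtained through Query RNIC.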
Below, in Figure 27, is a matrix of the Input Modifiers for PostSQ
and the Operation Types. The intersection of the matrix indicates
that the Input Modifier is required for that Operation Type by
specifying "Yes".
A "-" indicates that the Input Modifier is not required for that
Operation Type.
Opcode->              Send  Send  Send  Send  RDMA  RDMA  RDMA  Bind  Fast- Inv.
Input                       w/    w/    w/    Write Read  Read  MW    Reg.  Local
Modifier                    SE    Inv.  SE &              w/          NS MR STag
                                        Inv.              Inv.
WR ID                 Yes   Yes   Yes   Yes   Yes   Yes   Yes   Yes   Yes   Yes
Compltn. Notif. Type  Yes   Yes   Yes   Yes   Yes   Yes   Yes   Yes   Yes   Yes
SGL                   Yes   Yes   Yes   Yes   Yes   -     -     -     -     -
SGE No.               Yes   Yes   Yes   Yes   Yes   -     -     -     -     -
Read Fence            Yes   Yes   Yes   Yes   Yes   -     -     -     -     -
Local Fence           -     -     -     -     -     -     -     -     -     Yes
Local Address         -     -     -     -     -     Yes   Yes   -     -     -
Remote Address        -     -     -     -     Yes   Yes   Yes   -     -     -
Remote STag           -     -     Yes   Yes   -     -     -     -     -     -
MW STag Key           -     -     -     -     -     -     -     Yes   -     -
MW STag Index         -     -     -     -     -     -     -     Yes   -     -
MW's MR STag          -     -     -     -     -     -     -     Yes   -     -
MW Length             -     -     -     -     -     -     -     Yes   -     -
Addr Type             -     -     -     -     -     -     -     Yes   Yes   -
VA, if VA Based TO    -     -     -     -     -     -     -     Yes   Yes   -
Acs Ctrl: Local Rd    -     -     -     -     -     -     -     -     Yes   -
Acs Ctrl: Remote Rd   -     -     -     -     -     -     -     Yes   Yes   -
Acs Ctrl: Local Wt    -     -     -     -     -     -     -     -     Yes   -
Acs Ctrl: Remote Wt   -     -     -     -     -     -     -     Yes   Yes   -
Acs Ctrl: Bind Enable -     -     -     -     -     -     -     -     Yes   -
PBLE Size             -     -     -     -     -     -     -     -     Yes   -
PBL                   -     -     -     -     -     -     -     -     Yes   -
FBO                   -     -     -     -     -     -     -     -     Yes   -
STag Index            -     -     -     -     -     -     -     -     Yes   Yes
STag Key              -     -     -     -     -     -     -     -     Yes   Yes
Figure 27 - PostSQ Input Modifier Validity
Output Modifiers:
* Number of WRs posted.
* Verb Results:
o Operation completed successfully
o Invalid RNIC Handle
o Invalid QP Handle
o Too many Work Requests posted.
o Invalid operation type.
o Invalid QP state.
o Invalid Scatter/Gather list format.
o Invalid Scatter/Gather list length.
o Invalid Modifier.
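The per-operation modifier requirements of PostSQ lend themselves to a table-driven check. The Python sketch below is illustrative only: the REQUIRED table, the modifier names used as strings, and the check_work_request function are hypothetical, only a subset of Operation Types is covered (Fast-Register is omitted for brevity), and reporting a missing required modifier as "Invalid Modifier" is an assumption. Modifiers that are supplied but not valid for the operation are ignored, as the Description states.

```python
COMMON = {"WR ID", "Completion Notification Type"}
LOCAL_SGL = {"SGL", "Number of SGEs", "Read Fence"}
SEND_TYPES = [
    "Send",
    "Send with Solicited Event",
    "Send with Invalidate",
    "Send with Solicited Event & Invalidate",
]

# Required Input Modifiers per Operation Type (subset, for illustration).
REQUIRED = {op: COMMON | LOCAL_SGL for op in SEND_TYPES}
for op in ("Send with Invalidate", "Send with Solicited Event & Invalidate"):
    REQUIRED[op] = REQUIRED[op] | {"Remote STag"}
REQUIRED["RDMA Write"] = COMMON | LOCAL_SGL | {"Remote Address"}
REQUIRED["RDMA Read"] = COMMON | {"Local Address", "Remote Address"}
REQUIRED["RDMA Read with Invalidate Local STag"] = REQUIRED["RDMA Read"]
REQUIRED["Bind Memory Window"] = COMMON | {
    "MW STag Index", "MW STag Key", "MR STag", "Length",
    "Addressing Type", "Access Control",
}
REQUIRED["Invalidate Local STag"] = COMMON | {"STag", "Local Fence"}


def check_work_request(op_type: str, supplied: set) -> str:
    """Return a Verb Result name for one Work Request."""
    if op_type not in REQUIRED:
        return "Invalid operation type"
    # Extra modifiers are ignored; only missing required ones are rejected.
    if REQUIRED[op_type] - set(supplied):
        return "Invalid Modifier"
    return "Operation completed successfully"
```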
9.3.1.2 PostRQ
Description:
Builds a WQE on the Receive Queue of the specified QP for each
entry in the Work Request List submitted by the Consumer. This
WQE is added to the end of the Receive Queue and the RNIC is
notified that a new WQE is ready to be processed.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 8.2.1 - Submitting Work Request to a Work Queue.
Input Modifiers:
* RNIC Handle.
* QP Handle, for QPs not associated with an S-RQ.
* S-RQ Handle, for QPs associated with an S-RQ.
* A list of Work Requests. Each Work Request MUST contain the
following information.
o A user defined 64-bit Work Request ID.
o Scatter/Gather List. The scatter/gather list can contain one
or more Data Segments.
o Number of Scatter/Gather List elements.
Output Modifiers:
* Number of WRs posted.
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
o Invalid QP handle.
o Invalid S-RQ handle.
o Too many Work Requests posted.
o Invalid QP state.
o Invalid Scatter/Gather list format.
o Invalid Scatter/Gather list length.
o Invalid Modifier.
o RQ Associated with S-RQ.
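The handle selection rule and a few of the PostRQ results can be sketched as follows. This Python is a simplified illustration, not the RI's behavior: ReceiveQueue and post_rq are hypothetical names, each Work Request is modeled as a (Work Request ID, Scatter/Gather List) pair, and stopping at the first failing Work Request while reporting how many were already posted is an assumption suggested by the "Number of WRs posted" Output Modifier.

```python
from dataclasses import dataclass, field


@dataclass
class ReceiveQueue:
    """Hypothetical Receive Queue (or S-RQ) with a fixed capacity."""
    depth: int
    max_sge: int
    wqes: list = field(default_factory=list)


def post_rq(queue: ReceiveQueue, uses_srq: bool, via_qp_handle: bool,
            work_requests: list) -> tuple:
    """Return (Verb Result, number of WRs posted)."""
    # A QP associated with an S-RQ cannot be posted to through its QP Handle.
    if uses_srq and via_qp_handle:
        return ("RQ Associated with S-RQ", 0)
    posted = 0
    for wr_id, sgl in work_requests:
        if len(queue.wqes) >= queue.depth:
            return ("Too many Work Requests posted", posted)
        if len(sgl) > queue.max_sge:
            return ("Invalid Scatter/Gather list length", posted)
        queue.wqes.append((wr_id, sgl))   # WQE is added to the end of the queue
        posted += 1
    return ("Operation completed successfully", posted)
```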
9.3.2 CQ Operations
9.3.2.1 Poll for Completion (Poll CQ)
Description:
Polls the specified CQ for a Work Completion.
If a CQE is present, the CQE at the head of the CQ MUST be
returned to the Consumer as a Work Completion. Note that the
resources used are expected to be directly accessible by a Non-
Privileged Mode Consumer.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 8.2.4 - Returning Completed Work Requests.
Input Modifiers:
* RNIC Handle
* CQ Handle.
Output Modifiers:
* The Work Completion. If an entry is present on the CQ and if the
operation completed successfully, this contains information
relating to a completed Work Request. If the status of the
operation that generates the Work Completion is anything other
than success, the contents of the Work Completion are undefined
except as noted below. The contents of a Work Completion are:
o The 64-bit Work Request ID set by the Consumer in the
associated Work Request. This is always valid, regardless of the
status of the operation.
o The operation type specified in the completed Work Request.
The valid operation types are:
+ Send (for WRs posted to the Send Queue)
+ Send with Solicited Event (for WRs posted to the Send
Queue)
+ Send with Invalidate (for WRs posted to the Send Queue)
+ Send with Solicited Event & Invalidate (for WRs posted
to the Send Queue)
+ RDMA Write (for WRs posted to the Send Queue)
+ RDMA Read (for WRs posted to the Send Queue)
+ RDMA Read with Invalidate Local STag (for WRs posted to
the Send Queue)
+ Memory Window Bind (for WRs posted to the Send Queue)
+ Fast-Register Non-Shared Memory Region (for WRs posted
to the Send Queue)
+ Invalidate Local STag (for WRs posted to the Send Queue)
+ Receive (for WRs posted to the Receive Queue)
o The number of bytes transferred. This is only valid if the
operation type was a Receive.
o The Completion Status of the operation. This modifier MUST
be as specified in Section 9.5.2 - Completion Status Codes.
o STag Invalidated Indicator. This indicates that the incoming
Untagged Message destined for the RQ was a Send with
Invalidate or Send with Solicited Event & Invalidate, and
thus the STag Invalidated field is valid.
o STag Invalidated. This contains the STag which was
Invalidated. This is only valid when the STag Invalidated
Indicator is set.
o QP ID. This is the QP ID of the QP where the WR which
generated this completion was posted.
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
o Invalid CQ handle.
o CQ empty.
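The polling behavior described above (return the CQE at the head of the CQ, or report that the CQ is empty) can be sketched with a simple FIFO. The Python below is illustrative only: poll_cq is a hypothetical name, the CQ is modeled as a deque of dictionaries, and the dictionary keys are assumptions rather than a defined Work Completion layout.

```python
from collections import deque


def poll_cq(cq: deque) -> tuple:
    """Return (Verb Result, Work Completion or None)."""
    if not cq:
        return ("CQ empty", None)
    # The CQE at the head of the CQ is returned as the Work Completion.
    return ("Operation completed successfully", cq.popleft())
```

A Consumer would typically call such a function in a loop until "CQ empty" is returned, for example after its Completion Event Handler has been invoked.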
9.3.2.2 Request Completion Notification
Description:
Requests the CQ event handler be called when the next CQE of the
specified type is added to the specified CQ.
A CQ event handler must be specified prior to calling this
routine (see Section 9.4.1 - Set Completion Event Handler). If
the CQ event handler has not been registered when the event is
generated, the handler will not be called.
Once the handler routine has been invoked, the Consumer must
call Request Completion Notification again to be notified when a
new entry is added to that CQ.
It is the responsibility of the Consumer to call the Poll for
Completion Verb to retrieve a Work Completion after the handler
is called.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 8.2.5 - Asynchronous Completion Notification.
Input Modifiers:
* RNIC Handle.
* CQ Handle.
* Completion notification type. This MUST be either the next
completion event or the next solicited completion event.
Output Modifiers:
* Verb Results:
o Operation completed successfully.
o Invalid RNIC handle.
o Invalid CQ handle.
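The one-shot nature of Request Completion Notification can be sketched as follows. This Python is a simplified model, not a definitive implementation: the CompletionQueue class and its method names are hypothetical, the "next" and "solicited" strings stand in for the two notification types, a boolean flag stands in for whatever makes a CQE count as solicited, and consuming the request even when no handler is registered is an assumption.

```python
class CompletionQueue:
    """Hypothetical CQ that notifies its handler at most once per request."""

    def __init__(self, handler=None):
        self.handler = handler     # CQ event handler, or None if not set
        self.armed = None          # None, "next", or "solicited"
        self.entries = []

    def request_notification(self, kind: str) -> str:
        if kind not in ("next", "solicited"):
            return "Invalid Notify Type"
        self.armed = kind
        return "Operation completed successfully"

    def add_cqe(self, cqe, solicited: bool = False) -> None:
        self.entries.append(cqe)
        fires = self.armed == "next" or (self.armed == "solicited" and solicited)
        if fires:
            # The Consumer must call request_notification again to be
            # notified about later entries.
            self.armed = None
            if self.handler is not None:
                self.handler(self)
```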
9.4 Event Handling
9.4.1 Set Completion Event Handler
Description:
A RNIC MUST support one CQ Event Handler, and MAY support
additional Completion Event Handlers. Each Completion Event
Handler address is maintained by the RI and delineated by an
opaque handle called a Completion Event Handler Identifier. The
consumer uses the Set Completion Event Handler to register
individual Completion Event Handlers and obtain a unique
Completion Event Handler Identifier. The Completion Event
Handler Identifier is used in Create CQ to associate a CQ with a
specific Completion Event Handler.
This call does not automatically request a notification on a
completion event. The Request Completion Notification Verb must
be called in order to request notification.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 8.2.5 - Asynchronous Completion Notification.
Input Modifiers:
* RNIC Handle
* Completion Event Handler Address. If set to zero, then the Set
Completion Event Handler Verb is being used to clear the associated
Completion Event Handler address identified by the Completion
Event Handler Identifier. The Completion Event Handler will be
invoked when an appropriate Completion occurs with the following
input parameters passed in to it:
o RNIC Handle.
o CQ Handle.
* Completion Event Handler Identifier - An opaque handle used to
identify a Completion Event Handler address.
o If set to zero, the Set Completion Event Handler verb is
being used to register a new Completion Event Handler
address and the verb will return a new Completion Event
Handler Identifier.
o If set to non-zero, then the Set Completion Event Handler is
being used:
+ to clear the associated Completion Event Handler address
for the specified Completion Event Handler Identifier,
if the Completion Event Handler address is zero;
+ to modify the associated Completion Event Handler
address for the specified Completion Event Handler
Identifier, if the Completion Event Handler address is
non-zero.
Output Modifiers:
* Completion Event Handler Identifier - Only returned if the Set
Completion Event Handler verb is being used to register a new
Completion Event Handler address.
* Verb Results:
o Operation completed successfully.
o Invalid RNIC Handle.
o Invalid Completion Event Handler Identifier.
o Insufficient Resources.
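The register, modify, and clear cases selected by the Completion Event Handler Identifier can be sketched as a small table. The Python below is illustrative only: HandlerTable and its fields are hypothetical, None stands in for a zero handler address, identifiers are assumed to be small non-zero integers, and the max_handlers limit stands in for the value that the draft says is returned by Query RNIC.

```python
class HandlerTable:
    """Hypothetical per-RNIC table of Completion Event Handlers."""

    def __init__(self, max_handlers: int = 1):
        self.max_handlers = max_handlers
        self.handlers = {}         # identifier -> handler address (or None)
        self.next_id = 1           # zero is reserved to mean "register new"

    def set_completion_event_handler(self, address, identifier: int = 0):
        """Return (Verb Result, new identifier or None)."""
        if identifier == 0:
            # Register a new handler and return a new identifier.
            if len(self.handlers) >= self.max_handlers:
                return ("Insufficient Resources", None)
            identifier = self.next_id
            self.next_id += 1
            self.handlers[identifier] = address
            return ("Operation completed successfully", identifier)
        if identifier not in self.handlers:
            return ("Invalid Completion Event Handler Identifier", None)
        # Non-zero identifier: a None address clears the handler,
        # any other address replaces it.
        self.handlers[identifier] = address
        return ("Operation completed successfully", None)
```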
9.4.2 Set Asynchronous Event Handler
Description:
Registers the asynchronous event handler. Only one asynchronous
event handler can be registered per RNIC. Additional calls to
this Verb will overwrite the handler routine to be called.
Additional calls will not generate an additional handler
routine. If the new handler address is zero, there will be no
Asynchronous Event Handler associated with the RNIC.
The RI MUST support this Verb and MUST support all of the Input
& Output Modifiers, except where noted. For more information,
see Section 8.3.3 - Asynchronous Errors.
Input Modifiers:
* RNIC Handle
* Asynchronous Event Handler Address. This routine will be invoked
with the following input parameters passed in:
o RNIC Handle.
o Event Record. This contains information which indicates the
resource type and identifier as well as which event
occurred:
+ Resource Indicator. This indicates the type of resource
to which the Resource Identifier refers. This must be
one of the following values:
* QP
* CQ
* RNIC
* S-RQ
+ Resource Identifier. This value is the QP Handle, CQ
Handle, S-RQ Handle or RNIC Handle for the Asynchronous
Event.
+ Event Identifier. This indicates the event which caused
the Asynchronous Event to be generated. The possible
list of Event Identifiers can be found in Section 9.5.3
- Asynchronous Event Identifiers.
Output Modifiers:
* Verb Results:
o Operation completed successfully.
o Invalid RNIC Handle.
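The single-handler rule and the shape of the Event Record can be sketched as follows. This Python is a minimal illustration under stated assumptions: EventRecord, Rnic, and raise_async_event are hypothetical names, None stands in for a zero handler address, and the event delivery function exists only to exercise the handler.

```python
from dataclasses import dataclass


@dataclass
class EventRecord:
    """Hypothetical Event Record passed to the Asynchronous Event Handler."""
    resource_indicator: str    # one of "QP", "CQ", "RNIC", "S-RQ"
    resource_identifier: int   # the matching handle
    event_identifier: str


class Rnic:
    RESOURCES = ("QP", "CQ", "RNIC", "S-RQ")

    def __init__(self, handle: int):
        self.handle = handle
        self.async_handler = None

    def set_async_event_handler(self, handler) -> str:
        # Only one handler exists per RNIC; a later call overwrites it,
        # and None leaves the RNIC without a handler.
        self.async_handler = handler
        return "Operation completed successfully"

    def raise_async_event(self, record: EventRecord) -> bool:
        """Deliver the record; return True if a handler was invoked."""
        if record.resource_indicator not in self.RESOURCES:
            raise ValueError("unknown Resource Indicator")
        if self.async_handler is None:
            return False
        self.async_handler(self.handle, record)
        return True
```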
9.5 Result Types
The following section is a summary of the Verb results detailed in
Sections 9.2 - 9.4.
9.5.1 Immediate Status Codes
Operation completed successfully - The Verb     All Verbs
was executed successfully.
9.5.1.1 RNIC Management Verb Status
Insufficient resources to complete request -    Open RNIC, Query RNIC
An error was detected due to insufficient
resources.

Invalid Modifier - One of the parameters was    Open RNIC
invalid.

Block List mode not supported - The RNIC does   Open RNIC
not support Block List mode and Block List
mode was requested.

RNIC in use - The RNIC was already in use.      Open RNIC

Invalid RNIC handle - An invalid RNIC handle    Query RNIC, Close RNIC
was specified.
Figure 28 - RNIC Management Verb Status
9.5.1.2 PD Management Verb Status
Insufficient resources to complete request -    Allocate PD
An error was detected due to insufficient
resources.

Invalid RNIC handle - An invalid RNIC handle    Allocate PD, Deallocate PD
was specified.

Invalid PD ID - An invalid PD was specified.    Deallocate PD

Protection Domain is in use - The PD was        Deallocate PD
currently in use by a QP, Memory Region, or
Memory Window.
Figure 29 - PD Management Verb Status
9.5.1.3 CQ Management Verb Status
Insufficient resources to complete request -    Create CQ, Modify CQ
An error was detected due to insufficient
resources.

Number of CQE requested exceeds RNIC            Create CQ, Modify CQ
capability - Too many CQ entries for this RNIC
were requested.

An attempt to shrink the size of the queue      Modify CQ
failed because too many elements were still
present.

Invalid RNIC handle - An invalid RNIC handle    Create CQ, Query CQ,
was specified.                                  Modify CQ, Destroy CQ,
                                                Poll CQ

Invalid CQ handle - An invalid CQ handle was    Query CQ, Modify CQ,
specified.                                      Destroy CQ, Poll CQ

CQ In Use - One or more QPs is still tied to    Destroy CQ
the CQ.

CQ empty - There were no Work Completions       Poll CQ
available to be retrieved.

Invalid Completion Event Handler Identifier -   Create CQ
An invalid identifier was specified.
Figure 30 - CQ Management Verb Status
9.5.1.4 S-RQ Management Verb Status
Insufficient resources to complete request -    Create S-RQ, Modify S-RQ
An error was detected due to insufficient
resources.

Invalid RNIC handle - An invalid RNIC handle    Create S-RQ, Query S-RQ,
was specified.                                  Modify S-RQ, Destroy S-RQ

Invalid PD ID - An invalid PD was specified.    Create S-RQ

Maximum number of Work Requests requested       Create S-RQ, Modify S-RQ
exceeds RNIC capability.

Maximum number of scatter/gather elements per   Create S-RQ
Receive Queue Work Request requested exceeds
RNIC capability.

S-RQ Limit out of range                         Create S-RQ, Modify S-RQ

Invalid S-RQ handle                             Query S-RQ, Modify S-RQ,
                                                Destroy S-RQ

An attempt to shrink the size of the queue      Modify S-RQ
failed because too many elements were still
present

QPs still associated with the S-RQ              Modify S-RQ

Invalid Input Modifier                          Modify S-RQ
Figure 31 - S-RQ Management Verb Status
9.5.1.5 QP Management Verb Status
Insufficient resources to complete request -    Create QP, Modify QP
An error was detected due to insufficient
resources.

Invalid RNIC handle - An invalid RNIC handle    Create QP, Query QP,
was specified.                                  Modify QP, Destroy QP

Invalid CQ handle - An invalid CQ handle was    Create QP
specified.

Value requested for ORD exceeds RNIC            Create QP, Modify QP
capability.

Value requested for IRD exceeds RNIC            Create QP, Modify QP
capability.

Maximum number of Work Requests requested       Create QP, Modify QP
exceeds RNIC capability.

Maximum number of scatter/gather elements       Create QP, Modify QP
requested per Work Request exceeds RNIC
capability.

Invalid PD ID - The PD ID provided was not      Create QP
valid.

Invalid QP ID - An invalid QP handle was        Query QP, Modify QP,
specified.                                      Destroy QP

Cannot change QP attribute - An attempt was     Modify QP
made to modify an attribute which is not
allowed by the RNIC (for example, number of
WQEs).

An attempt to shrink the size of the queue      Modify QP
failed because too many elements were still
present.

Invalid state - An invalid QP state was         Modify QP
specified.

Invalid LLP Stream handle                       Modify QP

Invalid Modifier - One of the modifiers was     Modify QP
invalid or was not allowed to be modified in
the current state or state transition.

RI Still flushing WQEs - The QP is in the       Modify QP
Error state and a transition to the Idle state
was requested, but the RI is still flushing
WQEs and therefore cannot transition.

Invalid S-RQ handle                             Create QP

QP RQ Limit Out of Range.                       Create QP, Modify QP

Memory Windows still Bound to QP                Destroy QP
Figure 32 - QP Management Verb Status
9.5.1.6 Memory Management Verb Status
Insufficient resources to complete request -    Allocate NS MR STag,
An error was detected due to insufficient       RI-Register, RI-Reregister,
resources.                                      Register Shared MR, Allocate MW

Invalid RNIC handle - An invalid RNIC handle    Allocate NS MR STag,
was specified.                                  RI-Register, Query MR,
                                                Deallocate STag, RI-Reregister,
                                                Register Shared MR, Allocate MW,
                                                Query MW

Invalid PD ID - An invalid PD ID was            Allocate NS MR STag,
specified.                                      RI-Register, RI-Reregister,
                                                Register Shared MR, Allocate MW

Invalid Virtual Address - An invalid Memory     RI-Register, RI-Reregister,
Address or Offset was specified.                Register Shared MR

Invalid Length - An invalid Length was          RI-Register, RI-Reregister
specified. Too many pages or the MR length was
too long.

Invalid Access Rights requested - An invalid    RI-Register, RI-Reregister,
Access Control specifier was specified.         Register Shared MR

Invalid Physical Buffer List entry.             RI-Register, RI-Reregister

Invalid Physical Buffer size - The Physical     RI-Register, RI-Reregister
Buffer size (Page/Block) requested was not
supported by the RNIC.

Invalid STag Index - An invalid Memory Region   Query MR, RI-Reregister,
STag Index was specified.                       Deallocate STag,
                                                Register Shared MR, Query MW

Invalid FBO - The FBO is larger than the        RI-Register, RI-Reregister
physical buffer size.

One or more Memory Windows is still Bound to    Deallocate STag, RI-Reregister
the Region.
Figure 33 - Memory Management Verb Status
9.5.1.7 Post Verb Status
Invalid RNIC handle - An invalid RNIC handle    PostSQ, PostRQ
was specified.

Invalid QP handle - An invalid QP handle was    PostSQ, PostRQ
specified.

Invalid S-RQ handle - An invalid S-RQ handle    PostRQ
was specified.

Too many Work Requests posted.                  PostSQ, PostRQ

Invalid Operation type                          PostSQ

Invalid QP state.                               PostSQ, PostRQ

Invalid Scatter/Gather list format              PostSQ, PostRQ

Invalid Scatter/Gather list length - The Work   PostSQ, PostRQ
Request specified more Scatter/Gather elements
than the QP can support.

RQ Associated with S-RQ - This QP is            PostRQ
associated with an S-RQ and therefore the QP
Handle cannot be used to post receive Work
Requests. The S-RQ handle should be used
instead.

Invalid Modifier - One of the parameters was    PostSQ, PostRQ
invalid.
Figure 34 - Post Verb Status
9.5.1.8 Event Management Verb Status
Invalid RNIC handle - An invalid RNIC handle    Request Completion Notification,
was specified.                                  Set Completion Event Handler,
                                                Set Asynchronous Event Handler

Invalid CQ handle - An invalid CQ handle was    Request Completion Notification
specified.

Invalid Notify Type - An invalid CQ             Request Completion Notification
Notification type was specified.

Invalid Completion event handler identifier -   Set Completion Event Handler
An invalid identifier was specified while
attempting to clear a Completion Event Handler
address.

Insufficient Resources - The RI did not have    Set Completion Event Handler
sufficient resources to complete the request,
such as when the Consumer requests another
Completion Event Handler Identifier but has
already set an amount equal to the value
returned in Query RNIC.
Figure 35 - Event Management Verb Status
9.5.2 Completion Status Codes
Success - The RNIC Operation was successful.    Send Operation Types, Receive,
                                                RDMA Write, RDMA Read, RDMA Read
                                                with Invalidate Local STag,
                                                Bind, Fast-Register, Invalidate
                                                Local STag

Flushed - The Work Request was incomplete       Send Operation Types, Receive,
when the QP entered the Error state.            RDMA Write, RDMA Read, RDMA Read
                                                with Invalidate Local STag,
                                                Bind, Fast-Register, Invalidate
                                                Local STag

Invalid WQE - The Work Request Element          Send Operation Types, Receive,
contained a format error.                       RDMA Write, RDMA Read, RDMA Read
                                                with Invalidate Local STag,
                                                Bind, Fast-Register, Invalidate
                                                Local STag

Local QP Catastrophic Error - An error related  Send Operation Types, Receive,
to the QP occurred while processing the Work    RDMA Write, RDMA Read, RDMA Read
Request.                                        with Invalidate Local STag,
                                                Bind, Fast-Register, Invalidate
                                                Local STag

Remote Termination Error - A Terminate Message  Send Operation Types,
was received from the Remote Peer that appears  RDMA Write, RDMA Read, RDMA Read
to be related to the execution of this Work     with Invalidate Local STag
Request. The error type can be examined by
looking at the Terminate Message buffer via
Query QP.

Invalid STag - An invalid STag was found in     Send Operation Types, Receive,
the local SGL. The STag was either not found    RDMA Write, RDMA Read, RDMA Read
allocated, bound, or registered in the RI, or   with Invalidate Local STag,
an STag of zero was specified for a QP without  Bind, Fast-Register, Invalidate
Privileged rights, or referred to a Shared      Local STag
Memory Region, or the type of STag supplied
was not allowed to be used in the specified
operation.

Base & Bounds Violation - The local SGL         Send Operation Types, Receive,
referenced an address beyond the limits         RDMA Write, RDMA Read, RDMA Read
specified for the MR or MW. This includes       with Invalidate Local STag, Bind
length errors. For a Bind, the MW was not
wholly contained in the MR.

Access Violation - The RNIC attempted to read   Send Operation Types, Receive,
or write to a local SGL MR or MW that did not   RDMA Write, RDMA Read, RDMA Read
provide appropriate Access Rights. For a Bind,  with Invalidate Local STag, Bind
the MW Access Rights were not compatible with
the MR Access Rights.

Invalid PD ID - For one of the STags specified  Send Operation Types, Receive,
in the Work Request, the PD of the MR STag was  RDMA Write, RDMA Read, RDMA Read
not the same as the PD of the QP, or the QP of  with Invalidate Local STag,
the MW STag was not the same as the QP.         Bind, Fast-Register, Invalidate
                                                Local STag

Wrap Error - The specified Address or offset    Send Operation Types, Receive,
(TO or MO) added to the length of the           RDMA Write, RDMA Read, RDMA Read
operation resulted in a wrap beyond the         with Invalidate Local STag,
machine-supported address.                      Bind, Fast-Register

STag to Invalidate had Invalid PD or Access     Receive
Rights - The Invalidate STag on a Receive did
not have a PD ID that matched the PD ID of the
QP (for a MR) or a QP ID that matched the QP
ID of the QP (for a MW). Or the STag did not
have Access Rights to be invalidated remotely.

Zero RDMA Read Resources - The QP ORD value     RDMA Read, RDMA Read with
was set to zero.                                Invalidate Local STag

QP Not In Privileged Mode - The QP is not       Fast-Register
enabled to perform the Privileged WR.

STag Not In Invalid state - The STag was        Bind, Fast-Register
already registered or bound when attempting to
Register or Bind it.

Invalid Page Size - The page size requested     Fast-Register
was not supported by the RNIC.

Invalid Physical Buffer Size - size not         Fast-Register
supported by the RNIC.

Invalid Physical Buffer List entry - for page   Fast-Register
mode, the entry must start on page size
boundaries.

Invalid FBO - the FBO is larger than the        Fast-Register
physical buffer size.

Invalid length - requested length is larger     Fast-Register
than supported by the buffer list.

Invalid Access Rights specified.                Fast-Register

Physical Buffer List too long.                  Fast-Register

Invalid Virtual Address - VA and FBO are not    Fast-Register
consistent.

Invalid Region - The STag specified for the MR  Bind
in the BIND request was invalid.

Invalid Window - The STag specified for the MW  Bind
in the BIND request was invalid.

Invalid Length - The total size of the data to  Send, Receive, RDMA Write,
be moved, as specified by the sum of the SGL    RDMA Read, RDMA Read with
elements, was larger than that supported by     Invalidate Local STag
the RNIC.
Figure 36 - Completion Status Codes
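As an informal illustration of how a Consumer might use the table above, the following sketch encodes a few of its rows as a lookup from Completion Status to the Work Request types for which that status can be returned, and shows one way the Wrap Error condition could be tested. The Python names (STATUS_APPLIES_TO, status_applies, wraps) and the 64-bit address width are assumptions made for this example; they are not defined by this specification.

```python
# Hypothetical names; only the status strings and WR types come from Figure 36.
RECEIVE = "Receive"
RDMA_READ = "RDMA Read"
RDMA_READ_INV = "RDMA Read with Invalidate Local STag"
BIND = "Bind"
FAST_REGISTER = "Fast-Register"

# Subset of Figure 36: Completion Status -> WR types it may be returned for.
STATUS_APPLIES_TO = {
    "STag to Invalidate had Invalid PD or Access Rights": {RECEIVE},
    "Zero RDMA Read Resources": {RDMA_READ, RDMA_READ_INV},
    "QP Not In Privileged Mode": {FAST_REGISTER},
    "STag Not In Invalid state": {BIND, FAST_REGISTER},
    "Invalid Page Size": {FAST_REGISTER},
    "Invalid Region": {BIND},
    "Invalid Window": {BIND},
}


def status_applies(status, wr_type):
    """Return True if the table subset lists wr_type for the given status."""
    return wr_type in STATUS_APPLIES_TO.get(status, set())


def wraps(start, length, address_bits=64):
    """Return True if the range [start, start + length) runs past the top
    of the address space. The 64-bit default is an assumption."""
    return start + length > (1 << address_bits)
```

A Consumer-side debugging aid could, for example, flag a completion whose status is not listed for the type of the Work Request that produced it.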
9.5.3 Asynchronous Event Identifiers
The following table contains the list of Event Identifiers and
Resource Indicators that the RNIC MUST support as Asynchronous Event
Identifiers to be returned by the Asynchronous Event Handler. Note
that the Resource Indicator dictates that the appropriate Resource
Identifier corresponding to that Resource Indicator MUST be returned
as well. For more information, see Section 9.4.2 - Set Asynchronous
Event Handler.
Event Identifier and Description.                       Resource
                                                        Indicator
LLP Close Complete - The RDMA Stream has completed      QP ID
Closing and no SQ WQEs were flushed.
Terminate Message Received                              QP ID
LLP Connection Reset - An incoming LLP Reset (e.g. RST  QP ID
on TCP) was received.
LLP Connection Lost                                     QP ID
LLP Integrity Error: Segment size invalid               QP ID
LLP Integrity Error: Invalid CRC                        QP ID
LLP Integrity Error: Bad FPDU - Received MPA marker     QP ID
and 'Length' fields do not agree on the start of a
FPDU
Remote Operation Error: Invalid DDP version - caused    QP ID
by an inbound segment.
Remote Operation Error: Invalid RDMA version - caused   QP ID
by an inbound segment.
Remote Operation Error: Unexpected Opcode - caused by   QP ID
an inbound segment.
Remote Operation Error: Invalid DDP Queue Number -      QP ID
caused by an inbound segment.
Remote Operation Error: Invalid RDMA Read Request       QP ID
Message, RDMA Read not enabled - caused by an inbound
segment.
Remote Operation Error: Invalid RDMA Write or RDMA      QP ID
Read Response Message, RDMA Write & RDMA Read Response
not enabled - caused by an inbound segment.
Remote Operation Error: Invalid RDMA Read Request       QP ID
Message, message size too small or Offset non-zero -
caused by an inbound segment.
Remote Operation Error: No 'L' bit when expected -      QP ID
caused by an inbound segment.
Protection Error: Invalid STag - caused by an inbound   QP ID
Tagged DDP segment not valid for this QP. This
includes using the STag of zero, the STag was not
associated with the QP or the STag was in the Invalid
state.
Protection Error: Tagged Base and bounds violation -    QP ID
caused by an inbound Tagged segment attempted to
access memory outside the limits assigned to the STag.
Protection Error: Tagged Access Rights violation -      QP ID
caused by an inbound segment referencing a Tagged
Buffer which did not have the necessary memory Access
Rights for the requested operation.
Protection Error: Tagged Invalid PD - caused by an      QP ID
inbound segment referencing a Tagged Buffer which was
not allowed to be referenced by QP.
Protection Error: Wrap error - caused by an inbound     QP ID
segment not targeting the RQ.
Bad Close - The QP was in the Closing state when a      QP ID
Segment arrived.
Bad LLP Close - An attempt was made to close the RDMA   QP ID
Stream with work in progress.
RQ Protection Error - Invalid MSN - MSN range not       QP ID
valid. Caused by an inbound segment targeting the RQ.
Possibly due to Receive Queue being empty.
RQ Protection Error - Invalid MSN - gap in MSN. Caused  QP ID
by an inbound segment targeting the RQ.
IRRQ Protection Error: Invalid MSN - too many RDMA      QP ID
Read Request Messages in progress - caused by an
inbound segment not targeting the IRRQ.
IRRQ Protection Error: Invalid MSN - gap in MSN -       QP ID
caused by an inbound segment not targeting the RQ.
IRRQ Protection Error: Invalid MSN - range is not       QP ID
valid - caused by an inbound segment not targeting the
RQ.
IRRQ Protection Error: Invalid STag - Data Source STag  QP ID
determined to be invalid during RDMA Read Response
processing.
IRRQ Protection Error: Tagged Base and bounds           QP ID
violation - This includes RDMA Read Request of a
message larger than supported by the RNIC. It is
detected accessing the Data Source during RDMA Read
Response processing.
IRRQ Protection Error: Tagged Access Rights violation   QP ID
- Data Source Access Rights violation detected during
RDMA Read Response processing.
IRRQ Protection Error: Tagged Invalid PD - Data Source  QP ID
PD violation detected during RDMA Read Response
processing.
IRRQ Protection Error: Wrap error - detected during     QP ID
RDMA Read Response processing.
CQ/SQ Error: CQ Overflow Error - An error occurred on   QP ID
the CQ during a SQ completion.
CQ/RQ Error: CQ Operation error - An error occurred on  QP ID
the CQ during a RQ completion.
S-RQ error on a QP - An error occurred while            QP ID
attempting to pull a WQE from the S-RQ associated with
the QP.
Local QP Catastrophic Error - occurred during           QP ID
processing.
CQ Overflow Detected - An overflow of the Completion    CQ Handle
Queue has been detected. This Error Code is OPTIONAL.
CQ Operation Error - An error occurred on the CQ        CQ Handle
unrelated to a specific QP completion.
Shared Receive Queue Limit reached - The Limit value    S-RQ Handle
established for the Shared Receive Queue has been
reached.
QP RQ Limit Reached - The Limit value established for   QP ID
the QP's RQ has been reached.
Shared Receive Queue Catastrophic Failure - A problem   S-RQ Handle
occurred with the RNIC or its driver that renders the
RNIC unable to use the S-RQ.
RNIC Catastrophic Failure - A problem occurred with     RNIC Handle
the RNIC or its driver that renders the RNIC unable to
reliably function.
Figure 37 - Asynchronous Event Identifiers
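The pairing of Event Identifier and Resource Indicator in the table above can be sketched as a small lookup that an Asynchronous Event Handler consults before acting on the Resource Identifier. The sketch below covers only a handful of rows. The AsyncEvent type, the make_event and async_event_handler functions, and the use of integers as Resource Identifiers are assumptions made for illustration and are not defined by this specification.

```python
from dataclasses import dataclass

# Resource Indicators named in Figure 37.
QP_ID = "QP ID"
CQ_HANDLE = "CQ Handle"
SRQ_HANDLE = "S-RQ Handle"
RNIC_HANDLE = "RNIC Handle"

# Subset of Figure 37: Event Identifier -> Resource Indicator.
EVENT_RESOURCE = {
    "LLP Close Complete": QP_ID,
    "Terminate Message Received": QP_ID,
    "QP RQ Limit Reached": QP_ID,
    "CQ Operation Error": CQ_HANDLE,
    "Shared Receive Queue Limit reached": SRQ_HANDLE,
    "RNIC Catastrophic Failure": RNIC_HANDLE,
}


@dataclass(frozen=True)
class AsyncEvent:
    """Hypothetical event record passed to the handler."""
    event_id: str
    resource_indicator: str
    resource_id: int


def make_event(event_id, resource_id):
    """Build an event whose Resource Indicator follows the table subset."""
    return AsyncEvent(event_id, EVENT_RESOURCE[event_id], resource_id)


def async_event_handler(event, seen):
    """Record the event after checking its Resource Indicator."""
    expected = EVENT_RESOURCE.get(event.event_id)
    if expected is None:
        raise ValueError("unknown Event Identifier: " + event.event_id)
    if event.resource_indicator != expected:
        raise ValueError("Resource Indicator does not match Event Identifier")
    seen.append((event.resource_indicator, event.resource_id, event.event_id))
```

Checking the indicator first lets the handler decide whether the accompanying identifier names a QP, a CQ, an S-RQ, or the RNIC itself before using it.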
10 Security Considerations
Security Considerations are necessary for the RDMA Protocols and
this specification. An Internet-Draft addressing them is under
development.
11 IANA Considerations
If DDP were enabled a priori for a ULP by connecting to a well-known
port, that well-known port would be registered for DDP with IANA.
12 References
12.1 Normative References
[RFC2026] Bradner, S., "The Internet Standards Process -- Revision
3", BCP 9, RFC 2026, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[MPA] P. Culley et al., "Markers with PDU Alignment", RDMA
Consortium Draft Specification draft-culley-iwarp-mpa-00.doc,
October 2002.
[DDP] H. Shah et al., "Direct Data Placement over Reliable
Transports", RDMA Consortium Draft Specification
draft-shah-iwarp-ddp-00.txt, October 2002.
[RDMAP] R. Recio et al., "RDMA Protocol Specification", RDMA
Consortium Draft Specification draft-recio-iwarp-00, October
2002.
[SCTP] R. Stewart et al., "Stream Control Transmission Protocol",
RFC 2960, October 2000.
[TCP] Postel, J., "Transmission Control Protocol", STD 7, RFC 793,
September 1981.
12.2 Informative References
[IPSEC] Atkinson, R., Kent, S., "Security Architecture for the
Internet Protocol", RFC 2401, November 1998.
13 Appendix
13.1 Connection Initialization at LLP Startup
The purpose of an initialization at LLP Startup is to enable iWARP
using the minimum number of messages possible. Note that not all
RNIC/OS implementations are required to support this.
< Figure 39 did not convert properly from source >
< to be corrected in an upcoming version >
Figure 39 - Connection Initialization at LLP Startup (using TCP)
Below is an example sequence for an iWARP startup that accomplishes
this (other sequences are possible). The sequence applies equally to
either the active or passive side.
* The Consumer establishes the LLP Connection using a non-Verbs
interface.
* The Consumer creates a QP, setting up the CQ, PD, etc., and
registers memory for buffers.
* The Consumer posts buffers to the RQ appropriate for the
expected traffic.
* If the ULP intends to transmit first, the Consumer could Post
one or more Work Request(s) on the SQ (usually a SEND message)
that will be sent after the QP is placed in the RTS state.
* The Consumer moves the QP state to RTS. The Modify QP Verb for
this includes the LLP Stream Handle, and does not include a
streaming message buffer.
* If the local Consumer intends to perform RDMA Read Type WRs, the
local Consumer obtains, in some ULP defined message, the number
of incoming RDMA Read Request Messages that the Remote Peer can
have outstanding (IRD). If the Remote Peer's IRD is smaller than
the local Peer's ORD, the local Consumer should also perform a
Modify QP Verb with the Remote Peer's IRD value placed into the
local ORD value prior to posting the first RDMA Read Type WR.
The local Consumer may also transmit, in some ULP defined
message, the number of outgoing RDMA Read Request Messages that
the Local Peer can have outstanding (ORD).
* If the local Consumer intends the QP to be the Data Source of
RDMA Read Operations, the Consumer provides, in some ULP defined
message, the number of incoming RDMA Read Request Messages (e.g.
IRRQ depth) that the Local Peer can have outstanding (IRD). The
Consumer may also receive, in some ULP defined message, the
number of outgoing RDMA Read Request Messages that the Remote
Peer can have outstanding (ORD). If the Remote Peer's ORD is
smaller than the Local Peer's IRD, the local Consumer may also
perform a Modify QP Verb with the Remote Peer's ORD value placed
into the local IRD value prior to posting the first RDMA Read
Type WR.
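The IRD/ORD adjustment described in the last two steps reduces to taking the smaller of the local value and the value advertised by the Remote Peer. A minimal sketch follows, assuming the values have already been exchanged in some ULP defined message. The function names and the dictionary returned by plan_modify_qp are hypothetical and are not part of the Modify QP Verb.

```python
def effective_ord(local_ord, remote_ird):
    """ORD to use locally: never more than the Remote Peer's IRD."""
    return min(local_ord, remote_ird)


def effective_ird(local_ird, remote_ord):
    """IRD to use locally: no more than the Remote Peer's ORD."""
    return min(local_ird, remote_ord)


def plan_modify_qp(local_ord, local_ird, remote_ord, remote_ird):
    """Return the attributes a Modify QP call would need to change.

    The returned dict is empty when the local values already fit
    the Remote Peer's advertised limits.
    """
    changes = {}
    new_ord = effective_ord(local_ord, remote_ird)
    if new_ord != local_ord:
        changes["ord"] = new_ord
    new_ird = effective_ird(local_ird, remote_ord)
    if new_ird != local_ird:
        changes["ird"] = new_ird
    return changes
```

Lowering the local ORD keeps the Consumer from issuing more RDMA Read Requests than the Remote Peer can accept. Lowering the local IRD is optional in the text above, since the Remote Peer cannot exceed its own ORD in any case.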
This specification does not define which side of the connection
sends the first message, the active or passive side; the ULP is
responsible for determining this. In addition, this specification
does not preclude the use of Active/Active connections.
RNIC Implementers note: Since there is no integration between the RI
and the LLP Connection startup sequence, as defined above, it is
possible that some data may arrive over the transport before the
RNIC is in iWARP mode. It is the responsibility of the RI to accept
this data and interpret it as iWARP data. Alternatively, the Consumer
(or other service that establishes the LLP Connection) can ensure
that no data will be received prior to moving the QP to RTS state.
If neither of these methods is available, then iWARP startup with
the LLP is not available.
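One way an RI could meet the first option is to hold any bytes that arrive before the QP reaches RTS and then hand them, in order, to iWARP processing. The sketch below models only that ordering behaviour. The EarlyDataStream class and its methods are hypothetical, and buffering is merely one possible approach that this specification does not mandate.

```python
class EarlyDataStream:
    """Hypothetical model of an LLP Stream being moved into iWARP mode."""

    def __init__(self):
        self.rts = False
        self.pending = bytearray()      # bytes received before RTS
        self.iwarp_input = bytearray()  # bytes handed to iWARP processing

    def on_llp_bytes(self, data):
        """Accept bytes from the LLP, buffering them until the QP is in RTS."""
        if self.rts:
            self.iwarp_input += data
        else:
            self.pending += data

    def move_to_rts(self):
        """Enter RTS and interpret any early bytes as iWARP data."""
        self.rts = True
        self.iwarp_input += self.pending
        self.pending.clear()
```

Because the early bytes are delivered before anything received after the transition, the iWARP layer sees the stream in its original order.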
13.2 Graceful Receive Overflow Handling
A valid implementation option is to gracefully handle Receive Queue
or Shared Receive Queue overflow. This may be difficult in a strictly
layered model, but it should be feasible in an RNIC implementation.
In the current architecture, if no Receive Queue Work Queue Elements
are available when an Untagged Message arrives, the connection is
dropped. This is true whether the QP uses a Shared Receive Queue or a
dedicated Receive Queue.
An implementation (RI/RNIC) that does not rely on an external LLP may
choose to handle this case gracefully through LLP mechanisms. Such an
RI does not drop the connection and instead appears to pause receive
queue processing until more WQEs have been posted to the RQ or S-RQ.
How the RNIC performs this function is left to the implementation.
One example mechanism for gracefully handling receive overflow is for
the implementation to drop incoming packets when there are no WQEs on
the RQ or S-RQ. Such a mechanism may have side effects, such as
causing back-off algorithms to be invoked, but it is still a valid
implementation option.
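The difference between the default behaviour and the graceful option can be sketched with a toy receive queue. In the sketch below, a queue without the graceful option terminates the connection when an Untagged Message finds no WQE, while a graceful queue drops the message and relies on the LLP to deliver it again later. The ReceiveQueue class, its method names, and the returned strings are assumptions made for illustration only.

```python
from collections import deque


class ReceiveQueue:
    """Hypothetical model of an RQ (or S-RQ) with optional graceful overflow."""

    def __init__(self, graceful):
        self.graceful = graceful
        self.wqes = deque()     # posted receive buffers
        self.connected = True
        self.delivered = []     # (buffer, message) pairs

    def post_recv(self, buffer_id):
        """Post one receive WQE."""
        self.wqes.append(buffer_id)

    def on_untagged_message(self, message):
        """Return 'delivered', 'dropped', or 'terminated'."""
        if not self.connected:
            return "terminated"
        if self.wqes:
            self.delivered.append((self.wqes.popleft(), message))
            return "delivered"
        if self.graceful:
            # No WQE: drop the packet and leave recovery to the LLP.
            return "dropped"
        # Default behaviour: the connection is dropped.
        self.connected = False
        return "terminated"
```

With the graceful option, a message that was dropped can be delivered once the Consumer posts another WQE and the LLP retransmits it; without it, the connection does not recover.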
14 Author's Addresses
Jeff Hilland
Hewlett-Packard Company
20555 SH 249
Houston, TX 77070-2698 USA
Phone: +1 (281) 514-9489
Email: jeff.hilland@hp.com
Paul R. Culley
Hewlett-Packard Company
20555 SH 249
Houston, TX 77070-2698 USA
Phone: +1 (281) 514-5543
Email: paul.culley@hp.com
James Pinkerton
Microsoft Corporation
One Microsoft Way
Redmond, WA. 98052 USA
Phone: +1 (425) 705-5442
Email: jpink@windows.microsoft.com
Renato Recio
IBM Corporation
11501 Burnett Road
Austin, TX 78758 USA
Phone: +1 (512) 838-1365
Email: recio@us.ibm.com
15 Acknowledgments
John Carrier
Adaptec, Inc.
691 S. Milpitas Blvd.
Milpitas, CA 95035 USA
Phone: +1 (360) 378-8526
Email: john_carrier@adaptec.com
Hari Ghadia
Adaptec, Inc.
691 S. Milpitas Blvd.
Milpitas, CA 95035 USA
Phone: +1 (408) 957-5608
Email: hari_ghadia@adaptec.com
Patricia Thaler
Agilent Technologies, Inc.
1101 Creekside Ridge Drive, #100
M/S-RG10
Roseville, CA 95678
Phone: +1 (916) 788-5662
Email: pat_thaler@agilent.com
Mike Penna
Broadcom Corporation
16215 Alton Parkway
Irvine, California 92619-7013 USA
Phone: +1 (949) 926-7149
Email: MPenna@Broadcom.com
Uri Elzur
Broadcom Corporation
16215 Alton Parkway
Irvine, California 92619-7013 USA
Phone: +1 (949) 585-6432
Email: Uri@Broadcom.com
Ted Compton
EMC Corporation
Research Triangle Park, NC 27709, USA
Phone: +1 (919) 248-6075
Email: compton_ted@emc.com
Dwight Barron
Hewlett-Packard Company
20555 SH 249
Houston, TX 77070-2698 USA
Phone: +1 (281) 514-2769
Email: Dwight.Barron@Hp.com
Mallikarjun Chadalapaka
Hewlett-Packard Company
8000 Foothills Blvd.
Roseville, CA 95747-5668, USA
Phone: +1 (916) 785-5621
Email: cbm@rose.hp.com
Dave Garcia
Hewlett-Packard Company
19333 Vallco Parkway
Cupertino, Ca. 95014 USA
Phone: +1 (408) 285-6116
Email: dave.garcia@hp.com
Mike Krause
Hewlett-Packard Company, 43LN
19410 Homestead Road
Cupertino, CA 95014 USA
Phone: +1 (408) 447-3191
Email: krause@cup.hp.com
Jim Wendt
Hewlett-Packard Company
8000 Foothills Boulevard
Roseville, CA 95747-5668 USA
Phone: +1 (916) 785-5198
Email: jim_wendt@hp.com
John L. Hufferd
IBM Corp.
650 Harry Rd.
San Jose CA
Phone: +1 (408) 256-0403
Email: hufferd@us.ibm.com
Mike Ko
IBM Corp.
650 Harry Rd.
San Jose, CA 95120, USA
Phone: +1 (408) 927-2085
Email: mako@us.ibm.com
Ellen Deleganes
Intel Corporation
MS JF5-355
2111 NE 25th Ave.
Hillsboro, OR 97124 USA
Phone: +1 (503) 712-4173
Email: ellen.m.deleganes@intel.com
Frank Berry
Intel Corporation
2111 NE 25th Ave.
Hillsboro, OR 97124 USA
Phone: +1 (503) 712-3897
Email: frank.berry@intel.com
Howard C. Herbert
Intel Corporation
MS CH7-404
5000 West Chandler Blvd.
Chandler, AZ 85226 USA
Phone: +1 (480) 554-3116
Email: howard.c.herbert@intel.com
Dave Minturn
Intel Corporation
MS JF1-210
5200 North East Elam Young Parkway
Hillsboro, OR 97124 USA
Phone: +1 (503) 712-4106
Email: dave.b.minturn@intel.com
Hemal Shah
Intel Corporation
MS PTL1
1501 South Mopac Expressway, #400
Austin, TX 78746 USA
Phone: +1 (512) 732-3963
Email: hemal.shah@intel.com
James Livingston
NEC Solutions (America), Inc.
7525 166th Ave. N.E., Suite D210
Redmond, WA 98052-7811
Phone: +1 (425) 897-2033
Email: james.livingston@necsam.com
Tom Talpey
Network Appliance
375 Totten Pond Road
Waltham, MA 02451 USA
Phone: +1 (781) 768-5329
Email: thomas.talpey@netapp.com
16 Full Copyright Statement
This document and the information contained herein is provided on an
"AS IS" basis and ADAPTEC INC., AGILENT TECHNOLOGIES INC., BROADCOM
CORPORATION, CISCO SYSTEMS INC., DELL COMPUTER CORPORATION, EMC
CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL BUSINESS
MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT CORPORATION, NEC
SOLUTIONS (AMERICA), INC., NETWORK APPLIANCE INC., THE INTERNET
SOCIETY, AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE.
Copyright (c) 2002, 2003 ADAPTEC INC., BROADCOM CORPORATION, CISCO
SYSTEMS INC., DELL COMPUTER CORPORATION, EMC CORPORATION, HEWLETT-
PACKARD COMPANY, INTERNATIONAL BUSINESS MACHINES CORPORATION, INTEL
CORPORATION, MICROSOFT CORPORATION, NETWORK APPLIANCE INC., All
Rights Reserved.