NFSv4                                                         S. Shepler
Internet-Draft                                                 M. Eisler
Intended status: Standards Track                               D. Noveck
Expires: February 2, 2007                                        Editors
                                                             August 2006

                         NFSv4 Minor Version 1
                  draft-ietf-nfsv4-minorversion1-07.txt
Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."
   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.
   This Internet-Draft will expire on February 2, 2007.

Copyright Notice

   Copyright (C) The Internet Society (2006).
Abstract

   This Internet-Draft describes NFSv4 minor version one, including
   features retained from the base protocol and protocol extensions
   made subsequently.  The current draft includes description of the
   major extensions, Sessions, Directory Delegations, and parallel NFS
   (pNFS).

   This Internet-Draft is an active work item of the NFSv4 working
   group.  Active and resolved issues may be found in the issue tracker
   at: http://www.nfsv4-editor.org/cgi-bin/roundup/nfsv4.  New issues
   related to this document should be raised with the NFSv4 Working
   Group nfsv4@ietf.org and logged in the issue tracker.
Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].
Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . .   9
     1.1.  The NFSv4.1 Protocol . . . . . . . . . . . . . . . . . .   9
     1.2.  NFS Version 4 Goals  . . . . . . . . . . . . . . . . . .   9
     1.3.  Minor Version 1 Goals  . . . . . . . . . . . . . . . . .  10
     1.4.  Inconsistencies of this Document with Section XX . . . .  10
     1.5.  Overview of NFS version 4.1 Features . . . . . . . . . .  10
       1.5.1.  RPC and Security . . . . . . . . . . . . . . . . . .  11
       1.5.2.  Protocol Structure . . . . . . . . . . . . . . . . .  11
       1.5.3.  File System Model  . . . . . . . . . . . . . . . . .  12
       1.5.4.  Locking Facilities . . . . . . . . . . . . . . . . .  13
     1.6.  General Definitions  . . . . . . . . . . . . . . . . . .  14
     1.7.  Differences from NFSv4.0 . . . . . . . . . . . . . . . .  16
   2.  Core Infrastructure  . . . . . . . . . . . . . . . . . . . .  16
     2.1.  Introduction . . . . . . . . . . . . . . . . . . . . . .  16
     2.2.  RPC and XDR  . . . . . . . . . . . . . . . . . . . . . .  16
       2.2.1.  RPC-based Security . . . . . . . . . . . . . . . . .  16
     2.3.  COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . . .  20
     2.4.  Client Identifiers . . . . . . . . . . . . . . . . . . .  20
       2.4.1.  Server Release of Clientid . . . . . . . . . . . . .  24
     2.5.  Security Service Negotiation . . . . . . . . . . . . . .  25
       2.5.1.  NFSv4 Security Tuples  . . . . . . . . . . . . . . .  25
       2.5.2.  SECINFO and SECINFO_NO_NAME  . . . . . . . . . . . .  25
       2.5.3.  Security Error . . . . . . . . . . . . . . . . . . .  26
     2.6.  Minor Versioning . . . . . . . . . . . . . . . . . . . .  28
     2.7.  Non-RPC-based Security Services  . . . . . . . . . . . .  31
       2.7.1.  Authorization  . . . . . . . . . . . . . . . . . . .  31
       2.7.2.  Auditing . . . . . . . . . . . . . . . . . . . . . .  31
       2.7.3.  Intrusion Detection  . . . . . . . . . . . . . . . .  31
     2.8.  Transport Layers . . . . . . . . . . . . . . . . . . . .  32
       2.8.1.  Required and Recommended Properties of Transports  .  32
       2.8.2.  Client and Server Transport Behavior . . . . . . . .  32
       2.8.3.  Ports  . . . . . . . . . . . . . . . . . . . . . . .  34
     2.9.  Session  . . . . . . . . . . . . . . . . . . . . . . . .  34
       2.9.1.  Motivation and Overview  . . . . . . . . . . . . . .  34
       2.9.2.  NFSv4 Integration  . . . . . . . . . . . . . . . . .  35
       2.9.3.  Channels . . . . . . . . . . . . . . . . . . . . . .  36
       2.9.4.  Exactly Once Semantics . . . . . . . . . . . . . . .  38
       2.9.5.  RDMA Considerations  . . . . . . . . . . . . . . . .  46
       2.9.6.  Sessions Security  . . . . . . . . . . . . . . . . .  48
       2.9.7.  Session Mechanics - Steady State . . . . . . . . . .  53
       2.9.8.  Session Mechanics - Recovery . . . . . . . . . . . .  54
   3.  Protocol Data Types  . . . . . . . . . . . . . . . . . . . .  57
     3.1.  Basic Data Types . . . . . . . . . . . . . . . . . . . .  57
     3.2.  Structured Data Types  . . . . . . . . . . . . . . . . .  59
   4.  Filehandles  . . . . . . . . . . . . . . . . . . . . . . . .  68
     4.1.  Obtaining the First Filehandle . . . . . . . . . . . . .  69
       4.1.1.  Root Filehandle  . . . . . . . . . . . . . . . . . .  69
       4.1.2.  Public Filehandle  . . . . . . . . . . . . . . . . .  69
     4.2.  Filehandle Types . . . . . . . . . . . . . . . . . . . .  70
       4.2.1.  General Properties of a Filehandle . . . . . . . . .  70
       4.2.2.  Persistent Filehandle  . . . . . . . . . . . . . . .  71
       4.2.3.  Volatile Filehandle  . . . . . . . . . . . . . . . .  71
     4.3.  One Method of Constructing a Volatile Filehandle . . . .  72
     4.4.  Client Recovery from Filehandle Expiration . . . . . . .  73
   5.  File Attributes  . . . . . . . . . . . . . . . . . . . . . .  74
     5.1.  Mandatory Attributes . . . . . . . . . . . . . . . . . .  75
     5.2.  Recommended Attributes . . . . . . . . . . . . . . . . .  75
     5.3.  Named Attributes . . . . . . . . . . . . . . . . . . . .  76
     5.4.  Classification of Attributes . . . . . . . . . . . . . .  76
     5.5.  Mandatory Attributes - Definitions . . . . . . . . . . .  77
     5.6.  Recommended Attributes - Definitions . . . . . . . . . .  79
     5.7.  Time Access  . . . . . . . . . . . . . . . . . . . . . .  87
     5.8.  Interpreting owner and owner_group . . . . . . . . . . .  87
     5.9.  Character Case Attributes  . . . . . . . . . . . . . . .  89
     5.10. Quota Attributes . . . . . . . . . . . . . . . . . . . .  89
     5.11. mounted_on_fileid  . . . . . . . . . . . . . . . . . . .  90
     5.12. send_impl_id and recv_impl_id  . . . . . . . . . . . . .  91
     5.13. fs_layout_type . . . . . . . . . . . . . . . . . . . . .  92
     5.14. layout_type  . . . . . . . . . . . . . . . . . . . . . .  92
     5.15. layout_hint  . . . . . . . . . . . . . . . . . . . . . .  92
     5.16. mdsthreshold . . . . . . . . . . . . . . . . . . . . . .  92
   6.  Access Control Lists . . . . . . . . . . . . . . . . . . . .  93
     6.1.  Goals  . . . . . . . . . . . . . . . . . . . . . . . . .  93
     6.2.  File Attributes Discussion . . . . . . . . . . . . . . .  94
       6.2.1.  ACL Attribute  . . . . . . . . . . . . . . . . . . .  94
       6.2.2.  mode Attribute . . . . . . . . . . . . . . . . . . . 105
     6.3.  Common Methods . . . . . . . . . . . . . . . . . . . . . 106
       6.3.1.  Interpreting an ACL  . . . . . . . . . . . . . . . . 106
       6.3.2.  Computing a Mode Attribute from an ACL . . . . . . . 107
     6.4.  Requirements . . . . . . . . . . . . . . . . . . . . . . 109
       6.4.1.  Setting the mode and/or ACL Attributes . . . . . . . 109
       6.4.2.  Retrieving the mode and/or ACL Attributes  . . . . . 110
       6.4.3.  Creating New Objects . . . . . . . . . . . . . . . . 111
   7.  Single-server Name Space . . . . . . . . . . . . . . . . . . 112
     7.1.  Server Exports . . . . . . . . . . . . . . . . . . . . . 112
     7.2.  Browsing Exports . . . . . . . . . . . . . . . . . . . . 113
     7.3.  Server Pseudo File System  . . . . . . . . . . . . . . . 113
     7.4.  Multiple Roots . . . . . . . . . . . . . . . . . . . . . 114
     7.5.  Filehandle Volatility  . . . . . . . . . . . . . . . . . 114
     7.6.  Exported Root  . . . . . . . . . . . . . . . . . . . . . 114
     7.7.  Mount Point Crossing . . . . . . . . . . . . . . . . . . 115
     7.8.  Security Policy and Name Space Presentation  . . . . . . 115
   8.  File Locking and Share Reservations  . . . . . . . . . . . . 116
     8.1.  Locking  . . . . . . . . . . . . . . . . . . . . . . . . 116
       8.1.1.  Client and Session ID  . . . . . . . . . . . . . . . 117
       8.1.2.  State-owner and Stateid Definition . . . . . . . . . 117
       8.1.3.  Use of the Stateid and Locking . . . . . . . . . . . 120
     8.2.  Lock Ranges  . . . . . . . . . . . . . . . . . . . . . . 122
     8.3.  Upgrading and Downgrading Locks  . . . . . . . . . . . . 122
     8.4.  Blocking Locks . . . . . . . . . . . . . . . . . . . . . 123
     8.5.  Lease Renewal  . . . . . . . . . . . . . . . . . . . . . 124
     8.6.  Crash Recovery . . . . . . . . . . . . . . . . . . . . . 124
       8.6.1.  Client Failure and Recovery  . . . . . . . . . . . . 124
       8.6.2.  Server Failure and Recovery  . . . . . . . . . . . . 125
       8.6.3.  Network Partitions and Recovery  . . . . . . . . . . 127
     8.7.  Server Revocation of Locks . . . . . . . . . . . . . . . 131
     8.8.  Share Reservations . . . . . . . . . . . . . . . . . . . 132
     8.9.  OPEN/CLOSE Operations  . . . . . . . . . . . . . . . . . 133
     8.10. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 134
     8.11. Short and Long Leases  . . . . . . . . . . . . . . . . . 134
     8.12. Clocks, Propagation Delay, and Calculating Lease
           Expiration . . . . . . . . . . . . . . . . . . . . . . . 135
     8.13. Vestigial Locking Infrastructure From V4.0 . . . . . . . 135
   9.  Client-Side Caching  . . . . . . . . . . . . . . . . . . . . 136
     9.1.  Performance Challenges for Client-Side Caching . . . . . 137
     9.2.  Delegation and Callbacks . . . . . . . . . . . . . . . . 138
       9.2.1.  Delegation Recovery  . . . . . . . . . . . . . . . . 139
     9.3.  Data Caching . . . . . . . . . . . . . . . . . . . . . . 141
       9.3.1.  Data Caching and OPENs . . . . . . . . . . . . . . . 141
       9.3.2.  Data Caching and File Locking  . . . . . . . . . . . 142
       9.3.3.  Data Caching and Mandatory File Locking  . . . . . . 144
       9.3.4.  Data Caching and File Identity . . . . . . . . . . . 144
     9.4.  Open Delegation  . . . . . . . . . . . . . . . . . . . . 145
       9.4.1.  Open Delegation and Data Caching . . . . . . . . . . 148
       9.4.2.  Open Delegation and File Locks . . . . . . . . . . . 149
       9.4.3.  Handling of CB_GETATTR . . . . . . . . . . . . . . . 149
       9.4.4.  Recall of Open Delegation  . . . . . . . . . . . . . 152
       9.4.5.  Clients that Fail to Honor Delegation Recalls  . . . 154
       9.4.6.  Delegation Revocation  . . . . . . . . . . . . . . . 155
     9.5.  Data Caching and Revocation  . . . . . . . . . . . . . . 155
       9.5.1.  Revocation Recovery for Write Open Delegation  . . . 156
     9.6.  Attribute Caching  . . . . . . . . . . . . . . . . . . . 157
     9.7.  Data and Metadata Caching and Memory Mapped Files  . . . 159
     9.8.  Name Caching . . . . . . . . . . . . . . . . . . . . . . 161
     9.9.  Directory Caching  . . . . . . . . . . . . . . . . . . . 162
   10. Multi-server Name Space  . . . . . . . . . . . . . . . . . . 163
     10.1.  Location attributes . . . . . . . . . . . . . . . . . . 163
     10.2.  File System Presence or Absence . . . . . . . . . . . . 163
     10.3.  Getting Attributes for an Absent File System  . . . . . 165
       10.3.1.  GETATTR Within an Absent File System  . . . . . . . 165
       10.3.2.  READDIR and Absent File Systems . . . . . . . . . . 166
     10.4.  Uses of Location Information  . . . . . . . . . . . . . 167
       10.4.1.  File System Replication . . . . . . . . . . . . . . 167
       10.4.2.  File System Migration . . . . . . . . . . . . . . . 168
       10.4.3.  Referrals . . . . . . . . . . . . . . . . . . . . . 169
     10.5.  Additional Client-side Considerations . . . . . . . . . 169
     10.6.  Effecting File System Transitions . . . . . . . . . . . 170
       10.6.1.  Transparent File System Transitions . . . . . . . . 171
       10.6.2.  Filehandles and File System Transitions . . . . . . 173
       10.6.3.  Fileid's and File System Transitions  . . . . . . . 173
       10.6.4.  Fsid's and File System Transitions  . . . . . . . . 174
       10.6.5.  The Change Attribute and File System Transitions  . 174
       10.6.6.  Lock State and File System Transitions  . . . . . . 175
       10.6.7.  Write Verifiers and File System Transitions . . . . 178
     10.7.  Effecting File System Referrals . . . . . . . . . . . . 178
       10.7.1.  Referral Example (LOOKUP) . . . . . . . . . . . . . 179
       10.7.2.  Referral Example (READDIR)  . . . . . . . . . . . . 183
     10.8.  The Attribute fs_absent . . . . . . . . . . . . . . . . 185
     10.9.  The Attribute fs_locations  . . . . . . . . . . . . . . 185
     10.10. The Attribute fs_locations_info . . . . . . . . . . . . 187
     10.11. The Attribute fs_status . . . . . . . . . . . . . . . . 196
   11. Directory Delegations  . . . . . . . . . . . . . . . . . . . 199
     11.1.  Introduction to Directory Delegations . . . . . . . . . 200
     11.2.  Directory Delegation Design (in brief)  . . . . . . . . 201
     11.3.  Recommended Attributes in support of Directory
            Delegations . . . . . . . . . . . . . . . . . . . . . . 202
     11.4.  Delegation Recall . . . . . . . . . . . . . . . . . . . 203
     11.5.  Directory Delegation Recovery . . . . . . . . . . . . . 203
   12. Parallel NFS (pNFS)  . . . . . . . . . . . . . . . . . . . . 203
     12.1.  Introduction  . . . . . . . . . . . . . . . . . . . . . 203
     12.2.  General Definitions . . . . . . . . . . . . . . . . . . 206
       12.2.1.  Metadata Server . . . . . . . . . . . . . . . . . . 206
       12.2.2.  Client  . . . . . . . . . . . . . . . . . . . . . . 206
       12.2.3.  Storage Device  . . . . . . . . . . . . . . . . . . 206
       12.2.4.  Storage Protocol  . . . . . . . . . . . . . . . . . 206
       12.2.5.  Control Protocol  . . . . . . . . . . . . . . . . . 207
       12.2.6.  Metadata  . . . . . . . . . . . . . . . . . . . . . 207
       12.2.7.  Layout  . . . . . . . . . . . . . . . . . . . . . . 207
     12.3.  pNFS protocol semantics . . . . . . . . . . . . . . . . 208
       12.3.1.  Definitions . . . . . . . . . . . . . . . . . . . . 208
       12.3.2.  Guarantees Provided by Layouts  . . . . . . . . . . 211
       12.3.3.  Getting a Layout  . . . . . . . . . . . . . . . . . 212
       12.3.4.  Committing a Layout . . . . . . . . . . . . . . . . 213
       12.3.5.  Recalling a Layout  . . . . . . . . . . . . . . . . 215
       12.3.6.  Metadata Server Write Propagation . . . . . . . . . 221
       12.3.7.  Crash Recovery  . . . . . . . . . . . . . . . . . . 221
       12.3.8.  Security Considerations . . . . . . . . . . . . . . 227
     12.4.  The NFSv4 File Layout Type  . . . . . . . . . . . . . . 228
       12.4.1.  File Striping and Data Access . . . . . . . . . . . 228
       12.4.2.  Global Stateid Requirements . . . . . . . . . . . . 236
       12.4.3.  The Layout Iomode . . . . . . . . . . . . . . . . . 236
       12.4.4.  Storage Device State Propagation  . . . . . . . . . 237
       12.4.5.  Storage Device Component File Size  . . . . . . . . 239
       12.4.6.  Crash Recovery Considerations . . . . . . . . . . . 240
       12.4.7.  Security Considerations for the File Layout Type  . 240
       12.4.8.  Alternate Approaches  . . . . . . . . . . . . . . . 241
   13. Internationalization . . . . . . . . . . . . . . . . . . . . 242
     13.1.  Stringprep profile for the utf8str_cs type  . . . . . . 243
     13.2.  Stringprep profile for the utf8str_cis type . . . . . . 245
     13.3.  Stringprep profile for the utf8str_mixed type . . . . . 246
     13.4.  UTF-8 Related Errors  . . . . . . . . . . . . . . . . . 247
   14. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 248
     14.1.  Error Definitions . . . . . . . . . . . . . . . . . . . 248
     14.2.  Operations and their valid errors . . . . . . . . . . . 262
     14.3.  Callback operations and their valid errors  . . . . . . 275
     14.4.  Errors and the operations that use them . . . . . . . . 276
   15. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 283
     15.1.  Procedure 0: NULL - No Operation  . . . . . . . . . . . 283
     15.2.  Procedure 1: COMPOUND - Compound Operations . . . . . . 284
   16. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 289
     16.1.  Operation 3: ACCESS - Check Access Rights . . . . . . . 289
     16.2.  Operation 4: CLOSE - Close File . . . . . . . . . . . . 291
     16.3.  Operation 5: COMMIT - Commit Cached Data  . . . . . . . 293
     16.4.  Operation 6: CREATE - Create a Non-Regular File Object  295
     16.5.  Operation 7: DELEGPURGE - Purge Delegations Awaiting
            Recovery  . . . . . . . . . . . . . . . . . . . . . . . 298
     16.6.  Operation 8: DELEGRETURN - Return Delegation  . . . . . 299
     16.7.  Operation 9: GETATTR - Get Attributes . . . . . . . . . 299
     16.8.  Operation 10: GETFH - Get Current Filehandle  . . . . . 301
     16.9.  Operation 11: LINK - Create Link to a File  . . . . . . 302
     16.10. Operation 12: LOCK - Create Lock  . . . . . . . . . . . 303
     16.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 307
     16.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 308
     16.13. Operation 15: LOOKUP - Lookup Filename  . . . . . . . . 309
     16.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 311
     16.15. Operation 17: NVERIFY - Verify Difference in
            Attributes  . . . . . . . . . . . . . . . . . . . . . . 312
     16.16. Operation 18: OPEN - Open a Regular File  . . . . . . . 314
     16.17. Operation 19: OPENATTR - Open Named Attribute
            Directory . . . . . . . . . . . . . . . . . . . . . . . 328
     16.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access  329
17.2.5. Control Protocol . . . . . . . . . . . . . . . . . . 223 16.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 330
17.2.6. Metadata . . . . . . . . . . . . . . . . . . . . . . 223 16.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 331
17.2.7. Layout . . . . . . . . . . . . . . . . . . . . . . . 223 16.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 332
17.3. pNFS protocol semantics . . . . . . . . . . . . . . . . 224 16.22. Operation 25: READ - Read from File . . . . . . . . . . 333
17.3.1. Definitions . . . . . . . . . . . . . . . . . . . . . 224 16.23. Operation 26: READDIR - Read Directory . . . . . . . . . 335
17.3.2. Guarantees Provided by Layouts . . . . . . . . . . . 227 16.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 339
17.3.3. Getting a Layout . . . . . . . . . . . . . . . . . . 228 16.25. Operation 28: REMOVE - Remove File System Object . . . . 340
17.3.4. Committing a Layout . . . . . . . . . . . . . . . . . 229 16.26. Operation 29: RENAME - Rename Directory Entry . . . . . 342
17.3.5. Recalling a Layout . . . . . . . . . . . . . . . . . 231 16.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 344
17.3.6. Metadata Server Write Propagation . . . . . . . . . . 237 16.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 345
17.3.7. Crash Recovery . . . . . . . . . . . . . . . . . . . 237 16.29. Operation 33: SECINFO - Obtain Available Security . . . 346
17.3.8. Security Considerations . . . . . . . . . . . . . . . 243 16.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 349
17.4. The NFSv4 File Layout Type . . . . . . . . . . . . . . . 244 16.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 352
17.4.1. File Striping and Data Access . . . . . . . . . . . . 244 16.32. Operation 38: WRITE - Write to File . . . . . . . . . . 353
17.4.2. Global Stateid Requirements . . . . . . . . . . . . . 252 16.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 357
17.4.3. The Layout Iomode . . . . . . . . . . . . . . . . . . 252 16.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 359
17.4.4. Storage Device State Propagation . . . . . . . . . . 253 16.35. Operation 42: CREATE_CLIENTID - Instantiate Clientid . . 363
17.4.5. Storage Device Component File Size . . . . . . . . . 255 16.36. Operation 43: CREATE_SESSION - Create New Session and
17.4.6. Crash Recovery Considerations . . . . . . . . . . . . 256 Confirm Clientid . . . . . . . . . . . . . . . . . . . . 369
17.4.7. Security Considerations for the File Layout Type . . 256 16.37. Operation 44: DESTROY_SESSION - Destroy existing
17.4.8. Alternate Approaches . . . . . . . . . . . . . . . . 257
18. Internationalization . . . . . . . . . . . . . . . . . . . . 258
18.1. Stringprep profile for the utf8str_cs type . . . . . . . 259
18.2. Stringprep profile for the utf8str_cis type . . . . . . 261
18.3. Stringprep profile for the utf8str_mixed type . . . . . 262
18.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 263
19. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 264
19.1. Error Definitions . . . . . . . . . . . . . . . . . . . 264
19.2. Operations and their valid errors . . . . . . . . . . . 276
19.3. Callback operations and their valid errors . . . . . . . 284
19.4. Errors and the operations that use them . . . . . . . . 284
20. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 290
20.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 290
20.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 291
21. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 295
21.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 296
21.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 298
21.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 299
21.4. Operation 6: CREATE - Create a Non-Regular File Object . 302
21.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting
Recovery . . . . . . . . . . . . . . . . . . . . . . . . 304
21.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 305
21.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 306
21.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 307
21.9. Operation 11: LINK - Create Link to a File . . . . . . . 308
21.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 309
21.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 313
21.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 314
21.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 315
21.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 317
21.15. Operation 17: NVERIFY - Verify Difference in
Attributes . . . . . . . . . . . . . . . . . . . . . . . 318
21.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 319
21.17. Operation 19: OPENATTR - Open Named Attribute
Directory . . . . . . . . . . . . . . . . . . . . . . . 333
21.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 334
21.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 335
21.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 336
21.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 338
21.22. Operation 25: READ - Read from File . . . . . . . . . . 338
21.23. Operation 26: READDIR - Read Directory . . . . . . . . . 340
21.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 344
21.25. Operation 28: REMOVE - Remove File System Object . . . . 345
21.26. Operation 29: RENAME - Rename Directory Entry . . . . . 347
21.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 348
21.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 349
21.29. Operation 33: SECINFO - Obtain Available Security . . . 350
21.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 353
21.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 355
21.32. Operation 38: WRITE - Write to File . . . . . . . . . . 357
21.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 361
21.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 361
21.35. Operation 42: CREATE_CLIENTID - Instantiate Clientid . . 365
21.36. Operation 43: CREATE_SESSION - Create New Session and
Confirm Clientid . . . . . . . . . . . . . . . . . . . . 371
21.37. Operation 44: DESTROY_SESSION - Destroy existing
session . . . . . . . . . . . . . . . . . . . . . . . . 379 session . . . . . . . . . . . . . . . . . . . . . . . . 379
21.38. Operation 45: FREE_STATEID - Free stateid with no 16.38. Operation 45: FREE_STATEID - Free stateid with no
locks . . . . . . . . . . . . . . . . . . . . . . . . . 380 locks . . . . . . . . . . . . . . . . . . . . . . . . . 380
21.39. Operation 46: GET_DIR_DELEGATION - Get a directory 16.39. Operation 46: GET_DIR_DELEGATION - Get a directory
delegation . . . . . . . . . . . . . . . . . . . . . . . 381 delegation . . . . . . . . . . . . . . . . . . . . . . . 381
21.40. Operation 47: GETDEVICEINFO - Get Device Information . . 385 16.40. Operation 47: GETDEVICEINFO - Get Device Information . . 385
21.41. Operation 48: GETDEVICELIST . . . . . . . . . . . . . . 386 16.41. Operation 48: GETDEVICELIST . . . . . . . . . . . . . . 386
21.42. Operation 49: LAYOUTCOMMIT - Commit writes made using 16.42. Operation 49: LAYOUTCOMMIT - Commit writes made using
a layout . . . . . . . . . . . . . . . . . . . . . . . . 387 a layout . . . . . . . . . . . . . . . . . . . . . . . . 387
21.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 391 16.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 391
21.44. Operation 51: LAYOUTRETURN - Release Layout 16.44. Operation 51: LAYOUTRETURN - Release Layout
Information . . . . . . . . . . . . . . . . . . . . . . 394 Information . . . . . . . . . . . . . . . . . . . . . . 394
21.45. Operation 52: SECINFO_NO_NAME - Get Security on 16.45. Operation 52: SECINFO_NO_NAME - Get Security on
Unnamed Object . . . . . . . . . . . . . . . . . . . . . 396 Unnamed Object . . . . . . . . . . . . . . . . . . . . . 396
21.46. Operation 53: SEQUENCE - Supply per-procedure 16.46. Operation 53: SEQUENCE - Supply per-procedure
sequencing and control . . . . . . . . . . . . . . . . . 398 sequencing and control . . . . . . . . . . . . . . . . . 397
21.47. Operation 54: SET_SSV . . . . . . . . . . . . . . . . . 401 16.47. Operation 54: SET_SSV . . . . . . . . . . . . . . . . . 401
21.48. Operation 55: TEST_STATEID - Test stateids for 16.48. Operation 55: TEST_STATEID - Test stateids for
validity . . . . . . . . . . . . . . . . . . . . . . . . 402 validity . . . . . . . . . . . . . . . . . . . . . . . . 402
21.49. Operation 56: WANT_DELEGATION . . . . . . . . . . . . . 404 16.49. Operation 56: WANT_DELEGATION . . . . . . . . . . . . . 403
21.50. Operation 10044: ILLEGAL - Illegal operation . . . . . . 407 16.50. Operation 10044: ILLEGAL - Illegal operation . . . . . . 406
22. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 407 17. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 407
22.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 408 17.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 407
22.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 408 17.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 407
23. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 410 18. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 409
23.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 410 18.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 409
23.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 411 18.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 411
23.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 412 18.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 412
23.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 415 18.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 414
23.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 418 18.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 417
23.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 419 18.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 418
23.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 422 18.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 421
23.8. Operation 10: CB_RECALL_CREDIT - change flow control 18.8. Operation 10: CB_RECALL_SLOT - change flow control
limits . . . . . . . . . . . . . . . . . . . . . . . . . 423 limits . . . . . . . . . . . . . . . . . . . . . . . . . 422
23.9. Operation 11: CB_SEQUENCE - Supply callback channel 18.9. Operation 11: CB_SEQUENCE - Supply callback channel
sequencing and control . . . . . . . . . . . . . . . . . 423 sequencing and control . . . . . . . . . . . . . . . . . 423
23.10. Operation 12: CB_WANTS_CANCELLED . . . . . . . . . . . . 425 18.10. Operation 12: CB_WANTS_CANCELLED . . . . . . . . . . . . 425
23.11. Operation 10044: CB_ILLEGAL - Illegal Callback 18.11. Operation 10044: CB_ILLEGAL - Illegal Callback
Operation . . . . . . . . . . . . . . . . . . . . . . . 426 Operation . . . . . . . . . . . . . . . . . . . . . . . 426
24. Security Considerations . . . . . . . . . . . . . . . . . . . 426 19. Security Considerations . . . . . . . . . . . . . . . . . . . 427
25. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 427 20. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 427
25.1. Defining new layout types . . . . . . . . . . . . . . . 427 20.1. Defining new layout types . . . . . . . . . . . . . . . 427
26. References . . . . . . . . . . . . . . . . . . . . . . . . . 427 21. References . . . . . . . . . . . . . . . . . . . . . . . . . 428
26.1. Normative References . . . . . . . . . . . . . . . . . . 427 21.1. Normative References . . . . . . . . . . . . . . . . . . 428
26.2. Informative References . . . . . . . . . . . . . . . . . 429 21.2. Informative References . . . . . . . . . . . . . . . . . 429
Appendix A. ACL Algorithm Examples . . . . . . . . . . . . . . . 430 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 430
A.1. Recomputing mode upon SETATTR of ACL . . . . . . . . . . 430 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 431
A.2. Computing the Inherited ACL . . . . . . . . . . . . . . 433 Intellectual Property and Copyright Statements . . . . . . . . . 432
A.2.1. Discussion . . . . . . . . . . . . . . . . . . . . . 434
A.3. Applying a Mode to an Existing ACL . . . . . . . . . . . 435
Appendix B. Acknowledgments . . . . . . . . . . . . . . . . . . 439
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 440
Intellectual Property and Copyright Statements . . . . . . . . . 441
1. Introduction 1. Introduction
1.1. The NFSv4.1 Protocol 1.1. The NFSv4.1 Protocol
The NFSv4.1 protocol is a minor version of the NFSv4 protocol The NFSv4.1 protocol is a minor version of the NFSv4 protocol
described in [2]. It generally follows the guidelines for the minor described in [2]. It generally follows the guidelines for the minor
versioning model laid out in Section 10 of RFC 3530. However, it versioning model laid out in Section 10 of RFC 3530. However, it
diverges from guidelines 11 ("a client and server that supports minor diverges from guidelines 11 ("a client and server that supports minor
version X must support minor versions 0 through X-1"), and 12 ("no version X must support minor versions 0 through X-1"), and 12 ("no
features may be introduced as mandatory in a minor version"). These features may be introduced as mandatory in a minor version"). These
divergences are due to the introduction of the sessions model for divergences are due to the introduction of the sessions model for
managing non-idempotent operations and the RECLAIM_COMPLETE managing non-idempotent operations and the RECLAIM_COMPLETE
operation. These two new features are infrastructural in nature and operation. These two new features are infrastructural in nature and
simplify implementation of existing and other new features. Making simplify implementation of existing and other new features. Making
them optional would add undue complexity to protocol definition and them optional would add undue complexity to protocol definition and
implementation. NFSv4.1 accordingly updates the Minor Versioning implementation. NFSv4.1 accordingly updates the Minor Versioning
guidelines (Section 7). guidelines (Section 2.6).
NFSv4.1, as a minor version, is consistent with the overall goals for NFSv4.1, as a minor version, is consistent with the overall goals for
NFS Version 4, but extends the protocol so as to better meet those NFS Version 4, but extends the protocol so as to better meet those
goals, based on experiences with NFSv4.0. In addition, NFSv4.1 has goals, based on experiences with NFSv4.0. In addition, NFSv4.1 has
adopted some additional goals, which motivate some of the major adopted some additional goals, which motivate some of the major
extensions in minor version 1. extensions in minor version 1.
1.2. NFS Version 4 Goals 1.2. NFS Version 4 Goals
The NFS version 4 protocol is a further revision of the NFS protocol The NFS version 4 protocol is a further revision of the NFS protocol
skipping to change at page 11, line 21 skipping to change at page 10, line 21
o Designed for protocol extensions. o Designed for protocol extensions.
The protocol is designed to accept standard extensions within a The protocol is designed to accept standard extensions within a
framework that enables and encourages backward compatibility. framework that enables and encourages backward compatibility.
1.3. Minor Version 1 Goals 1.3. Minor Version 1 Goals
Minor version one has the following goals, within the framework Minor version one has the following goals, within the framework
established by the overall version 4 goals. established by the overall version 4 goals.
o To correct significant structtural weaknesses and oversights o To correct significant structural weaknesses and oversights
discovered in the base protocol. discovered in the base protocol.
o To add clarity and specificity to areas left unaddressed or not o To add clarity and specificity to areas left unaddressed or not
addressed in sufficient detail in the base protocol. addressed in sufficient detail in the base protocol.
o To add specific features based on experience with the existing o To add specific features based on experience with the existing
protocol and recent industry developments. protocol and recent industry developments.
o To provide protocol support to take advantage of clustered server o To provide protocol support to take advantage of clustered server
deployments including the ability to provide scalabale parallel deployments including the ability to provide scalable parallel
access to files distributed among multiple servers. access to files distributed among multiple servers.
1.4. Inconsistencies of this Document with Section XX 1.4. Inconsistencies of this Document with Section XX
Section XX, RPC Definition File, contains the definitions in XDR Section XX, RPC Definition File, contains the definitions in XDR
description language of the constructs used by the protocol. Prior description language of the constructs used by the protocol. Prior
to this section, several of the constructs are reproduced for to this section, several of the constructs are reproduced for
purposes of explanation. Although every effort has been made to purposes of explanation. Although every effort has been made to
assure a correct and consistent description, the possibility of assure a correct and consistent description, the possibility of
inconsistencies exists. For any part of the document that is inconsistencies exists. For any part of the document that is
skipping to change at page 12, line 10 skipping to change at page 11, line 10
done to provide an appropriate context for both the reader who is done to provide an appropriate context for both the reader who is
familiar with the previous versions of the NFS protocol and the familiar with the previous versions of the NFS protocol and the
reader that is new to the NFS protocols. For the reader new to the reader that is new to the NFS protocols. For the reader new to the
NFS protocols, there is still a set of fundamental knowledge that is NFS protocols, there is still a set of fundamental knowledge that is
expected. The reader should be familiar with the XDR and RPC expected. The reader should be familiar with the XDR and RPC
protocols as described in [3] and [4]. A basic knowledge of file protocols as described in [3] and [4]. A basic knowledge of file
systems and distributed file systems is expected as well. systems and distributed file systems is expected as well.
This description of version 4.1 features will not distinguish those This description of version 4.1 features will not distinguish those
added in minor version one from those present in the base protocol added in minor version one from those present in the base protocol
but will treat minor version 1 as a unified whole See Section 1.7 for but will treat minor version 1 as a unified whole. See Section 1.7
a description of the differences between the two minor versions. for a description of the differences between the two minor versions.
1.5.1. RPC and Security 1.5.1. RPC and Security
As with previous versions of NFS, the External Data Representation As with previous versions of NFS, the External Data Representation
(XDR) and Remote Procedure Call (RPC) mechanisms used for the NFS (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFS
version 4.1 protocol are those defined in [3] and [4]. To meet end- version 4.1 protocol are those defined in [3] and [4]. To meet end-
to-end security requirements, the RPCSEC_GSS framework [5] will be to-end security requirements, the RPCSEC_GSS framework [5] will be
used to extend the basic RPC security. With the use of RPCSEC_GSS, used to extend the basic RPC security. With the use of RPCSEC_GSS,
various mechanisms can be provided to offer authentication, various mechanisms can be provided to offer authentication,
integrity, and privacy to the NFS version 4 protocol. Kerberos V5 integrity, and privacy to the NFS version 4 protocol. Kerberos V5
will be used as described in [6] to provide one security framework. will be used as described in [6] to provide one security framework.
The LIPKEY GSS-API mechanism described in [7] will be used to provide The LIPKEY and SPKM-3 GSS-API mechanisms described in [7] will be
for the use of user password and server public key by the NFS version used to provide for the use of user password and client/server public
4 protocol. With the use of RPCSEC_GSS, other mechanisms may also be key certificates by the NFS version 4 protocol. With the use of
specified and used for NFS version 4.1 security. RPCSEC_GSS, other mechanisms may also be specified and used for NFS
version 4.1 security.
To enable in-band security negotiation, the NFS version 4.1 protocol To enable in-band security negotiation, the NFS version 4.1 protocol
has operations which provide the client a method of querying the has operations which provide the client a method of querying the
server about its policies regarding which security mechanisms must be server about its policies regarding which security mechanisms must be
used for access to the server's file system resources. With this, used for access to the server's file system resources. With this,
the client can securely match the security mechanism that meets the the client can securely match the security mechanism that meets the
policies specified at both the client and server. policies specified at both the client and server.
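The negotiation described above can be sketched in a few lines. This is a hedged illustration only: the mechanism names and the shape of the server's answer (modeled on what a SECINFO-style query might return) are assumptions for the example, not the protocol's actual wire representation.

```python
# Illustrative sketch of in-band security negotiation: the client queries
# the server's policy for a file system resource (in the style of SECINFO),
# then selects the first mechanism from its own preference list that the
# server also allows. Mechanism names and data shapes are hypothetical.

SERVER_POLICY = ["krb5p", "krb5i", "krb5"]    # as if returned by the server
CLIENT_PREFERENCE = ["krb5i", "krb5", "sys"]  # local client security policy

def negotiate(server_policy, client_preference):
    """Return the first client-preferred mechanism the server permits."""
    for mech in client_preference:
        if mech in server_policy:
            return mech
    raise RuntimeError("no mutually acceptable security mechanism")

chosen = negotiate(SERVER_POLICY, CLIENT_PREFERENCE)  # -> "krb5i"
```

Because the query happens in-band over the same RPC protocol, no separate negotiation service is needed.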
1.5.2. Protocol Structure 1.5.2. Protocol Structure
1.5.2.1. Core Protocol 1.5.2.1. Core Protocol
Unlike NFS Versions 2 and 3, which used a series of ancillary Unlike NFS Versions 2 and 3, which used a series of ancillary
protocols (e.g. NLM, NSM, MOUNT), within all minor versions of NFS protocols (e.g. NLM, NSM, MOUNT), within all minor versions of NFS
version 4 only a single RPC protocol is used to make requests of the version 4 only a single RPC protocol is used to make requests of the
server. Facilties, that had been separate protocols, such as server. Facilities that had been separate protocols, such as
locking, are now intergrated within a single unified protocol. locking, are now integrated within a single unified protocol.
A significant departure from the versions of the NFS protocol before
version 4 is the introduction of the COMPOUND procedure. For the NFS
version 4 protocol, in all minor versions, there are two RPC
procedures, NULL and COMPOUND. The COMPOUND procedure is defined as
a series of individual operations and these operations perform the
sorts of functions performed by traditional NFS procedures.
The operations combined within a COMPOUND request are evaluated in
order by the server, without any atomicity guarantees. A limited set
of facilities exist to pass results from one operation to another.
Once an operation returns a failing result, the evaluation ends and
the results of all evaluated operations are returned to the client.
With the use of the COMPOUND procedure, the client is able to build
simple or complex requests. These COMPOUND requests allow for a
reduction in the number of RPCs needed for logical file system
operations. For example, multi-component lookup requests can be
constructed by combining multiple LOOKUP operations. Those can be
further combined with operations such as GETATTR, READDIR, or OPEN
plus READ to do more complicated sets of operations without incurring
additional latency.
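The evaluation semantics above (in-order, non-atomic, stop at the first failure, return all results so far) can be sketched as follows. The operation names and the dispatch-table representation are illustrative stand-ins for exposition, not the protocol's XDR definitions.

```python
# Minimal sketch of COMPOUND evaluation: operations run in order with no
# atomicity guarantee; evaluation stops at the first failing status and the
# results of all evaluated operations are returned to the client.

NFS4_OK = 0
NFS4ERR_NOENT = 2

def eval_compound(ops, dispatch):
    results = []
    for opname, arg in ops:
        status, res = dispatch[opname](arg)
        results.append((opname, status, res))
        if status != NFS4_OK:
            break                      # later operations are not evaluated
    return results

# Toy handlers: LOOKUP of "missing" fails, ending evaluation early.
dispatch = {
    "PUTROOTFH": lambda _: (NFS4_OK, "root-fh"),
    "LOOKUP": lambda name: (NFS4_OK, "fh:" + name) if name != "missing"
              else (NFS4ERR_NOENT, None),
    "GETFH": lambda _: (NFS4_OK, "current-fh"),
}

results = eval_compound(
    [("PUTROOTFH", None), ("LOOKUP", "missing"), ("GETFH", None)], dispatch)
# Only the first two operations were evaluated; GETFH never ran.
```

A multi-component lookup is then simply several LOOKUP entries in the `ops` list, trading one round trip for what would otherwise be several RPCs.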
NFS Version 4.1 also contains a considerable set of callback
operations in which the server makes an RPC directed at the client.
Callback RPCs have a similar structure to that of the normal server
requests. For the NFS version 4 protocol callbacks in all minor
versions, there are two RPC procedures, NULL and CB_COMPOUND. The
CB_COMPOUND procedure is defined in analogous fashion to that of
COMPOUND with its own set of callback operations.
The addition of new server and callback operations within the COMPOUND
and CB_COMPOUND request framework provides a means of extending the
protocol in subsequent minor versions.
Except for a small number of operations needed for session creation,
server requests and callback requests are performed within the
context of a session. Sessions provide a client context for every
request and support robust replay protection for non-idempotent
requests.
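The replay protection that sessions provide can be sketched as a per-slot reply cache: a retransmitted request (same sequence id on the same slot) receives the cached reply instead of re-executing a non-idempotent operation. This is a deliberate simplification of the SEQUENCE mechanism; the class and field names are hypothetical, not the protocol's actual structures.

```python
# Illustrative per-slot replay cache: execute a request exactly once per
# sequence id, cache the reply, and replay it on retransmission.

class Slot:
    def __init__(self):
        self.seqid = 0
        self.cached_reply = None

def handle(slot, seqid, execute):
    if seqid == slot.seqid:            # retransmission: replay cached reply
        return slot.cached_reply
    if seqid != slot.seqid + 1:        # gap or reuse: sequence error
        raise ValueError("NFS4ERR_SEQ_MISORDERED")
    slot.seqid = seqid
    slot.cached_reply = execute()      # run the non-idempotent op once
    return slot.cached_reply

slot = Slot()
counter = {"n": 0}

def create_file():                     # stand-in for a non-idempotent op
    counter["n"] += 1
    return "created (%d)" % counter["n"]

first = handle(slot, 1, create_file)
replay = handle(slot, 1, create_file)  # retransmit: op is NOT re-executed
```

The key property is that `first` and `replay` are identical and the operation body ran only once, which is what makes retransmission of non-idempotent requests safe.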
1.5.2.2. Parallel Access 1.5.2.2. Parallel Access
Minor version one supports high-performance data access to a Minor version one supports high-performance data access to a
clustered server implementation by enabling a separation of metadata clustered server implementation by enabling a separation of metadata
access and data access, with the latter done to multiple servers in access and data access, with the latter done to multiple servers in
parallel. parallel.
Such parallel data access is controlled by recallable objects known Such parallel data access is controlled by recallable objects known
as "layouts", which are integrated into the protocol locking model. as "layouts", which are integrated into the protocol locking model.
skipping to change at page 14, line 18 skipping to change at page 12, line 24
is the same as previous versions. The server file system is is the same as previous versions. The server file system is
hierarchical with the regular files contained within being treated as hierarchical with the regular files contained within being treated as
opaque byte streams. In a slight departure, file and directory names opaque byte streams. In a slight departure, file and directory names
are encoded with UTF-8 to deal with the basics of are encoded with UTF-8 to deal with the basics of
internationalization. internationalization.
The NFS version 4.1 protocol does not require a separate protocol to The NFS version 4.1 protocol does not require a separate protocol to
provide for the initial mapping between path name and filehandle. provide for the initial mapping between path name and filehandle.
All file systems exported by a server are presented as a tree so that All file systems exported by a server are presented as a tree so that
all file systems are reachable from a special per-server global root all file systems are reachable from a special per-server global root
filefilandle. This allows LOOKUP operations to be used to perform filehandle. This allows LOOKUP operations to be used to perform
functions previously provided by the MOUNT protocol. The server functions previously provided by the MOUNT protocol. The server
provides any necessary pseudo fileystems to bridge any gaps that provides any necessary pseudo filesystems to bridge any gaps that
arise due to unexported gaps between exported file systems. arise due to unexported gaps between exported file systems.
1.5.3.1. Filehandles 1.5.3.1. Filehandles
As in previous versions of the NFS protocol, opaque filehandles are As in previous versions of the NFS protocol, opaque filehandles are
used to identify individual files and directories. Lookup-type and used to identify individual files and directories. Lookup-type and
create operations are used to go from file and directory names to the create operations are used to go from file and directory names to the
filehandle which is then used to identify the object to subsequent filehandle which is then used to identify the object to subsequent
operations. operations.
skipping to change at page 14, line 50 skipping to change at page 13, line 7
structure. Only a small set of the defined attributes are mandatory structure. Only a small set of the defined attributes are mandatory
and must be provided by all server implementations. The other and must be provided by all server implementations. The other
attributes are known as "recommended" attributes. attributes are known as "recommended" attributes.
One significant recommended file attribute is the Access Control List One significant recommended file attribute is the Access Control List
(ACL) attribute. This attribute provides for directory and file (ACL) attribute. This attribute provides for directory and file
access control beyond the model used in NFS Versions 2 and 3. The access control beyond the model used in NFS Versions 2 and 3. The
ACL definition allows for specification of specific sets of permissions ACL definition allows for specification of specific sets of permissions
for individual users and groups. In addition, ACL inheritance allows for individual users and groups. In addition, ACL inheritance allows
propagation of access permissions and restriction down a directory propagation of access permissions and restriction down a directory
tree as fileystsme objects are created. tree as filesystem objects are created.
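The inheritance behavior described above can be sketched as follows. The flag names are modeled loosely on NFSv4 ACE inheritance flags and the tuple layout is purely illustrative; real ACEs carry more structure than shown here.

```python
# Rough sketch of ACL inheritance: access control entries (ACEs) on a
# directory that carry inheritance flags propagate to objects created
# beneath it, so permissions flow down the tree as objects are created.

FILE_INHERIT = "file_inherit"
DIR_INHERIT = "dir_inherit"

def inherited_acl(parent_acl, is_directory):
    """Compute the ACL a newly created child object would inherit."""
    child = []
    for who, perms, flags in parent_acl:
        if is_directory and DIR_INHERIT in flags:
            child.append((who, perms, flags))        # subdirs keep inheriting
        elif not is_directory and FILE_INHERIT in flags:
            child.append((who, perms, frozenset()))  # files end the chain
    return child

parent = [
    ("OWNER@", "rw", frozenset({FILE_INHERIT, DIR_INHERIT})),
    ("GROUP@", "r",  frozenset({FILE_INHERIT})),
    ("EVERYONE@", "r", frozenset()),                 # not inheritable
]
file_acl = inherited_acl(parent, is_directory=False)
dir_acl = inherited_acl(parent, is_directory=True)
```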
One other type of attribute is the named attribute. A named One other type of attribute is the named attribute. A named
attribute is an opaque byte stream that is associated with a attribute is an opaque byte stream that is associated with a
directory or file and referred to by a string name. Named attributes directory or file and referred to by a string name. Named attributes
are meant to be used by client applications as a method to associate are meant to be used by client applications as a method to associate
application specific data with a regular file or directory. application specific data with a regular file or directory.
1.5.3.3. Multi-server Namespace 1.5.3.3. Multi-server Namespace
NFS Version 4.1 contains a number of features to allow implementation NFS Version 4.1 contains a number of features to allow implementation
of namespaces that cross server boundaries and that facilitate a of namespaces that cross server boundaries and that facilitate a
non-disruptive transfer of support for individual file non-disruptive transfer of support for individual file
systems between servers. They are all based upon attributes that systems between servers. They are all based upon attributes that
allow one file system to specify alternate or new locations for that allow one file system to specify alternate or new locations for that
file system. file system.
These attributes may be used together with the concept of absent file These attributes may be used together with the concept of absent file
system which provide specifications for additional locations but no system which provide specifications for additional locations but no
actual file system content. This allows a number of important actual file system content. This allows a number of important
facilties: facilities:
o Location attributes may be used with absent file systems to o Location attributes may be used with absent file systems to
implement referrals whereby one server may direct the client to a implement referrals whereby one server may direct the client to a
file system provided by another server. This allows extensive file system provided by another server. This allows extensive
mult-server namspaces to be constructed. mult-server namespaces to be constructed.
o Location attributes may be provided for present file systems to o Location attributes may be provided for present file systems to
provide the locations alternate file system instances or replicas provide the locations alternate file system instances or replicas
to be used in the event that the current file system instance to be used in the event that the current file system instance
becomes unavailable. becomes unavailable.
o Location attributes may be provided when a previously present file o Location attributes may be provided when a previously present file
system becomes absent. This allows non-disruptive migration of system becomes absent. This allows non-disruptive migration of
file systems to alternate servers. file systems to alternate servers.
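The combination of "absent file system plus location attribute equals
referral" can be sketched as below.  The names (Filesystem, Referral,
read) are hypothetical and the sketch models only the control flow,
not the actual fs_locations attribute encoding.

```python
# Sketch: an absent file system has no content, only location
# attributes naming the server(s) that do provide the file system.

class Referral(Exception):
    def __init__(self, locations):
        self.locations = locations   # alternate servers for the fs

class Filesystem:
    def __init__(self, present, locations=(), content=None):
        self.present = present
        self.locations = list(locations)  # fs_locations-style attribute
        self.content = content or {}

    def read(self, name):
        if not self.present:
            # Absent fs: direct the client elsewhere instead of
            # serving data.
            raise Referral(self.locations)
        return self.content[name]

exported = Filesystem(present=False, locations=["server2:/export/home"])
try:
    exported.read("README")
except Referral as r:
    new_locations = r.locations   # client re-targets its mount here
```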
skipping to change at page 14, line 12
that lock is held.  When circumstances change, the lock is recalled
via a callback request.  The assurances provided by delegations allow
more extensive caching to be done safely when circumstances allow it.

o  Share reservations as established by OPEN operations.

o  Byte-range locks.

o  File delegations which are recallable locks that assure the holder
   that inconsistent opens and file changes cannot occur so long as
   the delegation is held.

o  Directory delegations which are recallable delegations that assure
   the holder that inconsistent directory modifications cannot occur
   so long as the delegation is held.

o  Layouts which are recallable objects that assure the holder that
   direct access to the file data may be performed by the client and
   that no change to the data's location inconsistent with that
   access may be made so long as the layout is held.
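The recallable-lock pattern common to delegations and layouts can be
modeled roughly as follows.  The names are illustrative throughout;
in the real protocol, recalls travel as callback operations from the
server to the client, not as a local method call.

```python
# Sketch: a server grants a delegation; when a conflicting request
# arrives from another client, the server recalls it via a callback.

class Server:
    def __init__(self):
        self.delegations = {}   # filename -> holder client
        self.recalled = []      # (client, filename) callback log

    def open_file(self, client, name, want_delegation=True):
        holder = self.delegations.get(name)
        if holder is not None and holder != client:
            # Conflicting access: recall the outstanding delegation.
            self.recalled.append((holder, name))
            del self.delegations[name]
        elif want_delegation and holder is None:
            self.delegations[name] = client

s = Server()
s.open_file("clientA", "data.txt")   # clientA is granted a delegation
s.open_file("clientB", "data.txt")   # conflicting open triggers recall
```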
All locks for a given client are tied together under a single
client-wide lease.  All requests made on sessions associated with the
client renew that lease.  When leases are not promptly renewed, locks
are subject to revocation.  In the event of server reinitialization,
skipping to change at page 16, line 23
1.7.  Differences from NFSv4.0

The following summarizes the differences between minor version one
and the base protocol:

o  Implementation of the sessions model.

o  Support for parallel access to data.

o  Addition of the RECLAIM_COMPLETE operation to better structure the
   lock reclamation process.

o  Support for delegations on directories and other file types in
   addition to regular files.

o  Operations to re-obtain a delegation.

o  Support for client and server implementation ids.
2.  Core Infrastructure

2.1.  Introduction
NFS version 4.1 (NFSv4.1) relies on core infrastructure common to
nearly every operation.  This core infrastructure is described in the
remainder of this section.
2.2.  RPC and XDR

The NFS version 4.1 (NFSv4.1) protocol is a Remote Procedure Call
(RPC) application that uses RPC version 2 and the corresponding
eXternal Data Representation (XDR) as defined in RFC1831 [4] and
RFC4506 [3].
2.2.1.  RPC-based Security

Previous NFS versions have been thought of as having a host-based
authentication model, where the NFS server authenticates the NFS
client and trusts the client to authenticate all users.  Actually,
NFS has always depended on RPC for authentication.  The first form of
RPC authentication required a host-based authentication approach.
NFSv4 also depends on RPC for basic security services, and mandates
RPC support for a user-based authentication model.  The user-based
authentication model has user principals authenticated by a server,
and in turn the server authenticated by user principals.  RPC
provides some basic security services which are used by NFSv4.
2.2.1.1.  RPC Security Flavors

As described in section 7.2 "Authentication" of [4], RPC security is
encapsulated in the RPC header, via a security or authentication
flavor, and information specific to the specification of the security
flavor.  Every RPC header conveys information used to identify and
authenticate a client and server.  As discussed in Section 2.2.1.1.1,
some security flavors provide additional security services.

NFSv4 clients and servers MUST implement RPCSEC_GSS.  (This
requirement to implement is not a requirement to use.)  Other
flavors, such as AUTH_NONE and AUTH_SYS, MAY be implemented as well.
2.2.1.1.1.  RPCSEC_GSS and Security Services

RPCSEC_GSS ([5]) uses the functionality of GSS-API RFC2743 [8].  This
allows for the use of various security mechanisms by the RPC layer
without the additional implementation overhead of adding RPC security
flavors.
2.2.1.1.1.1.  Identification, Authentication, Integrity, Privacy

Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate
users on clients to servers, and servers to users.  It can also
perform integrity checking on the entire RPC message, including the
RPC header, and the arguments or results.  Finally, privacy, usually
via encryption, is a service available with RPCSEC_GSS.  Privacy is
performed on the arguments and results.  Note that if privacy is
selected, integrity, authentication, and identification are enabled.
If privacy is not selected, but integrity is selected, authentication
and identification are enabled.  If integrity and privacy are not
selected, but authentication is enabled, identification is enabled.
RPCSEC_GSS does not provide identification as a separate service.

Although GSS-API has an authentication service distinct from its
privacy and integrity services, GSS-API's authentication service is
not used for RPCSEC_GSS's authentication service.  Instead, each RPC
request and response header is integrity protected with the GSS-API
integrity service, and this allows RPCSEC_GSS to offer per-RPC
authentication and identity.  See [5] for more information.

NFSv4 clients and servers MUST support RPCSEC_GSS's integrity and
authentication service.  NFSv4.1 servers MUST support RPCSEC_GSS's
privacy service.
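The nesting of services described above (privacy implies integrity,
which implies authentication, which implies identification) can be
written out explicitly.  The table below is a purely illustrative
restatement of that rule, keyed by the RPCSEC_GSS service names.

```python
# Each RPCSEC_GSS service level subsumes the weaker ones:
# privacy > integrity > authentication > identification.

SERVICES_ENABLED = {
    "rpc_gss_svc_privacy":   {"privacy", "integrity",
                              "authentication", "identification"},
    "rpc_gss_svc_integrity": {"integrity", "authentication",
                              "identification"},
    "rpc_gss_svc_none":      {"authentication", "identification"},
}

def enabled(service):
    """Return the set of security services in effect for a given
    RPCSEC_GSS service selection."""
    return SERVICES_ENABLED[service]
```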
2.2.1.1.1.2.  Security mechanisms for NFS version 4

RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide
security services.  Therefore NFSv4 clients and servers MUST support
three security mechanisms: Kerberos V5, SPKM-3, and LIPKEY.

The use of RPCSEC_GSS requires selection of: mechanism, quality of
protection (QOP), and service (authentication, integrity, privacy).
For the mandated security mechanisms, NFSv4 specifies that a QOP of
zero (0) is used, leaving it up to the mechanism or the mechanism's
configuration to use an appropriate level of protection that QOP zero
maps to.  Each mandated mechanism specifies a minimum set of
cryptographic algorithms for implementing integrity and privacy.
NFSv4 clients and servers MUST be implemented on operating
environments that comply with the mandatory cryptographic algorithms
of each mandated mechanism.
2.2.1.1.1.2.1.  Kerberos V5

The Kerberos V5 GSS-API mechanism as described in RFC1964 [6]
( [[Comment.1: need new Kerberos RFC]] ) MUST be implemented with the
RPCSEC_GSS services as specified in the following table:

column descriptions:

1 == number of pseudo flavor
2 == name of pseudo flavor
3 == mechanism's OID
4 == RPCSEC_GSS service
5 == NFSv4.1 clients MUST support
6 == NFSv4.1 servers MUST support

1      2      3                     4                      5    6
------------------------------------------------------------------
390003 krb5   1.2.840.113554.1.2.2  rpc_gss_svc_none       yes  yes
390004 krb5i  1.2.840.113554.1.2.2  rpc_gss_svc_integrity  yes  yes
390005 krb5p  1.2.840.113554.1.2.2  rpc_gss_svc_privacy    no   yes
Note that the number and name of the pseudo flavor are presented here
as a mapping aid to the implementor.  Because the NFSv4 protocol
includes a method to negotiate security and it understands the
GSS-API mechanism, the pseudo flavor is not needed.  The pseudo
flavor is needed for NFS version 3 since the security negotiation is
done via the MOUNT protocol as described in [19].
2.2.1.1.1.2.2.  LIPKEY

The LIPKEY GSS-API mechanism as described in [7] MUST be implemented
with the RPCSEC_GSS services as specified in the following table:

1      2         3              4                      5    6
------------------------------------------------------------------
390006 lipkey    1.3.6.1.5.5.9  rpc_gss_svc_none       yes  yes
390007 lipkey-i  1.3.6.1.5.5.9  rpc_gss_svc_integrity  yes  yes
390008 lipkey-p  1.3.6.1.5.5.9  rpc_gss_svc_privacy    no   yes
2.2.1.1.1.2.3.  SPKM-3

The SPKM-3 GSS-API mechanism as described in [7] MUST be implemented
with the RPCSEC_GSS services as specified in the following table:

1      2       3                4                      5    6
------------------------------------------------------------------
390009 spkm3   1.3.6.1.5.5.1.3  rpc_gss_svc_none       yes  yes
390010 spkm3i  1.3.6.1.5.5.1.3  rpc_gss_svc_integrity  yes  yes
390011 spkm3p  1.3.6.1.5.5.1.3  rpc_gss_svc_privacy    no   yes
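The three tables above can be collapsed into one registry for
implementation convenience.  The dictionary below simply restates the
table contents (pseudo flavor number mapped to name, mechanism OID,
RPCSEC_GSS service, client-MUST, server-MUST); it adds nothing beyond
the tables themselves.

```python
# Pseudo flavor registry restated from the tables above.
# Fields: (name, OID, RPCSEC_GSS service, client MUST, server MUST)
PSEUDO_FLAVORS = {
    390003: ("krb5",     "1.2.840.113554.1.2.2", "rpc_gss_svc_none",      True,  True),
    390004: ("krb5i",    "1.2.840.113554.1.2.2", "rpc_gss_svc_integrity", True,  True),
    390005: ("krb5p",    "1.2.840.113554.1.2.2", "rpc_gss_svc_privacy",   False, True),
    390006: ("lipkey",   "1.3.6.1.5.5.9",        "rpc_gss_svc_none",      True,  True),
    390007: ("lipkey-i", "1.3.6.1.5.5.9",        "rpc_gss_svc_integrity", True,  True),
    390008: ("lipkey-p", "1.3.6.1.5.5.9",        "rpc_gss_svc_privacy",   False, True),
    390009: ("spkm3",    "1.3.6.1.5.5.1.3",      "rpc_gss_svc_none",      True,  True),
    390010: ("spkm3i",   "1.3.6.1.5.5.1.3",      "rpc_gss_svc_integrity", True,  True),
    390011: ("spkm3p",   "1.3.6.1.5.5.1.3",      "rpc_gss_svc_privacy",   False, True),
}

def server_must_support(number):
    """Column 6 of the tables: does an NFSv4.1 server have to
    support this pseudo flavor?"""
    return PSEUDO_FLAVORS[number][4]
```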
2.2.1.1.1.3.  GSS Server Principal

Regardless of what security mechanism under RPCSEC_GSS is being used,
the NFS server MUST identify itself in GSS-API via a
GSS_C_NT_HOSTBASED_SERVICE name type.  GSS_C_NT_HOSTBASED_SERVICE
names are of the form:

   service@hostname

For NFS, the "service" element is

   nfs

Implementations of security mechanisms will convert nfs@hostname to
various different forms.  For Kerberos V5, LIPKEY, and SPKM-3, the
following form is RECOMMENDED:

   nfs/hostname
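The two name forms above can be derived mechanically from a hostname.
The helpers below are illustrative only; real GSS-API implementations
perform this conversion internally when importing the name.

```python
def gss_hostbased_name(hostname, service="nfs"):
    """GSS_C_NT_HOSTBASED_SERVICE form: service@hostname."""
    return "%s@%s" % (service, hostname)

def mechanism_internal_name(hostbased):
    """RECOMMENDED converted form for Kerberos V5, LIPKEY, and
    SPKM-3: service/hostname."""
    service, hostname = hostbased.split("@", 1)
    return "%s/%s" % (service, hostname)
```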
2.3.  COMPOUND and CB_COMPOUND

A significant departure from the versions of the NFS protocol before
version 4 is the introduction of the COMPOUND procedure.  For the
NFSv4 protocol, in all minor versions, there are exactly two RPC
procedures, NULL and COMPOUND.  The COMPOUND procedure is defined as
a series of individual operations and these operations perform the
sorts of functions performed by traditional NFS procedures.
The operations combined within a COMPOUND request are evaluated in
order by the server, without any atomicity guarantees.  A limited set
of facilities exist to pass results from one operation to another.
Once an operation returns a failing result, the evaluation ends and
the results of all evaluated operations are returned to the client.

With the use of the COMPOUND procedure, the client is able to build
simple or complex requests.  These COMPOUND requests allow for a
reduction in the number of RPCs needed for logical file system
operations.  For example, multi-component lookup requests can be
constructed by combining multiple LOOKUP operations.  Those can be
further combined with operations such as GETATTR, READDIR, or OPEN
plus READ to do more complicated sets of operations without incurring
additional latency.

NFSv4 also contains a considerable set of callback operations in
which the server makes an RPC directed at the client.  Callback RPCs
have a similar structure to that of the normal server requests.  For
the NFS version 4 protocol callbacks in all minor versions, there are
two RPC procedures, NULL and CB_COMPOUND.  The CB_COMPOUND procedure
is defined in analogous fashion to that of COMPOUND with its own set
of callback operations.

Addition of new server and callback operations within the COMPOUND
and CB_COMPOUND request framework provides a means of extending the
protocol in subsequent minor versions.

Except for a small number of operations needed for session creation,
server requests and callback requests are performed within the
context of a session.  Sessions provide a client context for every
request and support robust replay protection for non-idempotent
requests.
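The evaluation rule described above (in order, no atomicity, stop at
the first failure, return the results of everything evaluated) can be
sketched as follows.  This is an illustrative model, not an NFS
implementation; the operation callables and the error code used are
hypothetical.

```python
# Illustrative COMPOUND evaluator: operations run in order and
# evaluation stops at the first failing operation; results of all
# evaluated operations (including the failing one) are returned.

NFS4_OK = 0

def compound(operations):
    results = []
    for op in operations:
        status, result = op()
        results.append((status, result))
        if status != NFS4_OK:
            break   # later operations are not evaluated at all
    return results

def ok(value):
    return lambda: (NFS4_OK, value)

def fail(errcode):
    return lambda: (errcode, None)

# Fourth operation is never evaluated because the third one fails.
res = compound([ok("PUTFH"), ok("LOOKUP"), fail(70), ok("GETATTR")])
```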
2.4.  Client Identifiers

For each operation that obtains or depends on locking state, the
specific client must be determinable by the server.  In NFSv4, each
distinct client instance is represented by a clientid, which is a
64-bit identifier that identifies a specific client at a given time
and which is changed whenever the client or the server
re-initializes.  Clientids are used to support lock identification
and crash recovery.

In NFSv4.1, the clientid associated with each operation is derived
from the session (see Section 2.9) on which the operation is issued.
Each session is associated with a specific clientid at session
creation and that clientid then becomes the clientid associated with
all requests issued using it.  Therefore, unlike NFSv4.0, no NFSv4.1
operation is possible until a clientid is established.
A sequence of a CREATE_CLIENTID operation followed by a
CREATE_SESSION operation using that clientid is required to establish
the identification on the server.  Establishment of identification by
a new incarnation of the client also has the effect of immediately
releasing any locking state that a previous incarnation of that same
client might have had on the server.  Such released state would
include all lock, share reservation, and, where the server is not
supporting the CLAIM_DELEGATE_PREV claim type, all delegation state
associated with the same client with the same identity.  For
discussion of delegation state recovery, see Section 9.2.1.

Releasing such state requires that the server be able to determine
that one client instance is the successor of another.  Where this
cannot be done, for any of a number of reasons, the locking state
will remain for a time subject to lease expiration (see Section 8.5)
and the new client will need to wait for such state to be removed, if
it makes conflicting lock requests.
A security sensitive client should be allowed to choose a strong Client identification is encapsulated in the following structure:
flavor when querying a server to determine a file object's permitted
security flavors. The security flavor chosen by the client does not
have to be included in the flavor list of the export. Of course the
server has to be configured for whatever flavor the client selects,
otherwise the request will fail at RPC authentication.
In theory, there is no connection between the security flavor used by struct nfs_client_id4 {
SECINFO and those supported by the export. But in practice, the verifier4 verifier;
client may start looking for strong flavors from those supported by opaque id<NFS4_OPAQUE_LIMIT>;
the export, followed by those in the mandatory set. };
5.4. PUTFH + Anything Else The first field, verifier, is a client incarnation verifier that is
used to detect client reboots. Only if the verifier is different
from the one the server had previously recorded for the client (as
identified by the second field of the structure, id) does the server
start the process of canceling the client's leased state.
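The verifier comparison and the principal check described in this section can be sketched as follows. This is an illustrative model only: the names (ClientRecord, handle_create_clientid) are not from the spec, and it is simplified in that a server must still allow CREATE_CLIENTID under a different principal when the old client holds no unexpired state.

```python
# Hypothetical sketch of server-side CREATE_CLIENTID handling: cancel
# leased state only when the id string matches but the verifier differs,
# and refuse a different principal (NFS4ERR_CLID_INUSE).
from dataclasses import dataclass, field

@dataclass
class ClientRecord:
    verifier: bytes                 # client incarnation verifier
    principal: str                  # principal that established the state
    leased_state: list = field(default_factory=list)

def handle_create_clientid(table, id_string, verifier, principal):
    """Return (record, state_cancelled) for a CREATE_CLIENTID request."""
    rec = table.get(id_string)
    if rec is None:
        # Unknown id string: record the new client; nothing to cancel.
        table[id_string] = ClientRecord(verifier, principal)
        return table[id_string], False
    if rec.principal != principal:
        # Security measure: a different principal must not be able to
        # cancel the original client's leased state.  (Simplified: the
        # spec requires allowing this when no unexpired state is held.)
        raise PermissionError("NFS4ERR_CLID_INUSE")
    if rec.verifier != verifier:
        # Same client, new incarnation: release the old leased state.
        rec.leased_state.clear()
        rec.verifier = verifier
        return rec, True
    # Same incarnation retrying: leave state intact.
    return rec, False
```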
PUTFH must return NFS4ERR_WRONGSEC in case of security mismatch. The second field, id, is a variable-length string that uniquely
This is the most straightforward approach without having to add defines the client so that subsequent instances of the same client
NFS4ERR_WRONGSEC to every other operation. bear the same id with a different verifier.
PUTFH + SECINFO_NO_NAME (style "current_fh") is needed for the client There are several considerations for how the client generates the id
to recover from NFS4ERR_WRONGSEC. string:
6. NFSv4.1 Sessions o The string should be unique so that multiple clients do not
present the same string. The consequences of two clients
presenting the same string range from one client getting an error
to one client having its leased state abruptly and unexpectedly
canceled.
6.1. Sessions Background o The string should be selected so that subsequent incarnations (e.g.
reboots) of the same client cause the client to present the same
string. The implementor is cautioned against an approach that
requires the string to be recorded in a local file because this
precludes the use of the implementation in an environment where
there is no local disk and all file access is from an NFS version
4 server.
6.1.1. Introduction to Sessions o The string should be different for each server network address
that the client accesses, rather than common to all server network
addresses. The reason is that it may not be possible for the
client to tell if the same server is listening on multiple network
addresses. If the client issues CREATE_CLIENTID with the same id
string to each network address of such a server, the server will
think it is the same client, and each successive CREATE_CLIENTID
will cause the server to remove the client's previous leased state.
Regardless, as described in Section 2.9.3.4.1, NFSv4.1 does allow
clients to trunk traffic for a single clientid to one or more of a
server's networking addresses.
[[Comment.1: Noveck: Anyway, I think that trying to hack at the o The algorithm for generating the string should not assume that the
existing text is basically hopeless. I think you have to figure out client's network address will not change. This includes changes
what a new chapter (on sessions or basic protocol structure) should between client incarnations and even changes while the client is
say and then write it, pulling in text from the existing chapter when still running in its current incarnation. This means that if the
appropriate. Apart from the issues you have found, that document was client includes just the client's and server's network address in
written with a whole different purpose in mind. It discusses the the id string, there is a real risk, after the client gives up the
sessions "feature" and justifies it and talks about integrating it the id string, there is a real risk, after the client gives up the
into v4.0, etc. Instead, it is not a feature but is a basic for generating the id string, would generate a conflicting id
underpinning of v4.1 and we just explain what client and server need string.
to do, and some why but it is why this works not why we have made
these design choices vs. others we might have made. It's a totally
different story and I don't think you can get there incrementally.]]
NFSv4.1 adds extensions which allow NFSv4 to support sessions and
endpoint management, and to support operation atop RDMA-capable RPC
over transports such as iWARP. [RDMAP, DDP] These extensions enable
support for exactly-once semantics by NFSv4 servers, multipathing and
trunking of transport connections, and enhanced security. The
ability to operate over RDMA enables greatly enhanced performance.
Operation over existing TCP is enhanced as well.
While discussed here with respect to IETF-chartered transports, the Given the above considerations, an example of a well generated id
intent is NFSv4.1 will function over other standards, such as string is one that includes:
Infiniband. [IB]
The following are the major aspects of the session feature:
o An explicit session is introduced to NFSv4, and new operations are o The server's network address.
added to support it. The session allows for enhanced trunking,
failover and recovery, and support for RDMA. The session is
implemented as operations within NFSv4 COMPOUND and does not
impact layering or interoperability with existing NFSv4
implementations. The NFSv4 callback channel is dynamically
associated and is connected by the client and not the server,
enhancing security and operation through firewalls. [[Comment.2:
XXX is the following true:]] In fact, the callback channel will be
enabled to share the same connection as the operations channel.
o An enhanced RPC layer enables NFSv4 operation atop RDMA. The o The client's network address.
session assists RDMA-mode connection, and additional facilities
are provided for managing RDMA resources at both NFSv4 server and
client. Existing NFSv4 operations continue to function as before,
though certain size limits are negotiated. A companion draft to
this specification, "RDMA Transport for ONC RPC" [RPCRDMA] is to
be referenced for details of RPC RDMA support.
o Support for exactly-once semantics ("EOS") is enabled by the new o For a user level NFS version 4 client, it should contain
session facilities, by providing to the server a way to bound the additional information to distinguish the client from other user
size of the duplicate request cache for a single client, and to level clients running on the same host, such as a process id or
manage its persistent storage. other unique sequence.
Block Diagram o Additional information that tends to be unique, such as one or
more of:
+-----------------+-------------------------------------+ * The client machine's serial number (for privacy reasons, it is
| NFSv4 | NFSv4 + session extensions | best to perform some one way function on the serial number).
+-----------------+------+----------------+-------------+
| Operations | Session | |
+------------------------+----------------+ |
| RPC/XDR | |
+-------------------------------+---------+ |
| Stream Transport | RDMA Transport |
+-------------------------------+-----------------------+
6.1.2. Session Model * A MAC address (again, a one way function should be performed).
A session is a dynamically created, long-lived server object created * The timestamp of when the NFS version 4 software was first
by a client, used over time from one or more transport connections. installed on the client (though this is subject to the
Its function is to maintain the server's state relative to the previously mentioned caution about using information that is
connection(s) belonging to a client instance. This state is entirely stored in a file, because the file might only be accessible
independent of the connection itself. The session in effect becomes over NFS version 4).
the object representing an active client on a connection or set of
connections.
Clients may create multiple sessions for a single clientid, and may * A true random number. However since this number ought to be
wish to do so for optimization of transport resources, buffers, or the same between client incarnations, this shares the same
server behavior. A session could be created by the client to problem as that of using the timestamp of the software
represent a single mount point, for separate read and write installation.
"channels", or for any number of other client-selected parameters.
The session enables several things immediately. Clients may As a security measure, the server MUST NOT cancel a client's leased
disconnect and reconnect (voluntarily or not) without loss of context state if the principal that established the state for a given id string is
at the server. (Of course, locks, delegations and related not the same as the principal issuing the CREATE_CLIENTID.
associations require special handling, and generally expire in the
extended absence of an open connection.) Clients may connect
multiple transport endpoints to this common state. The endpoints may
have all the same attributes, for instance when trunked on multiple
physical network links for bandwidth aggregation or path failover.
Or, the endpoints can have specific, special purpose attributes such
as callback channels.
The NFSv4.0 specification does not provide for any form of flow A server may compare an nfs_client_id4 in a CREATE_CLIENTID with an
control; instead it relies on the windowing provided by TCP to nfs_client_id4 established using SETCLIENTID using NFSv4 minor
throttle requests. This unfortunately does not work with RDMA, which version 0, so that an NFSv4.1 client is not forced to delay until
in general provides no operation flow control and will terminate a lease expiration for locking state established by the earlier client
connection in error when limits are exceeded. Limits are therefore using minor version 0.
exchanged when a session is created; these limits then provide maxima
within which each session's connections must operate; they are
managed within these limits as described in [RPCRDMA]. The limits
may also be modified dynamically at the server's choosing by
manipulating certain parameters present in each NFSv4.1 request.
The presence of a maximum request limit on the session bounds the Once a CREATE_CLIENTID has been done, and the resulting clientid
requirements of the duplicate request cache. This can be used by a established as associated with a session, all requests made on that
server to accurately determine any storage needs, to maintain session implicitly identify that clientid, which in turn designates
duplicate request cache persistence, and to provide reliable exactly- the client specified using the long-form nfs_client_id4 structure.
once semantics. The shorthand client identifier (a clientid) is assigned by the
server and should be chosen so that it will not conflict with a
clientid previously assigned by the server. This applies across
server restarts or reboots.
6.1.3. Connection State In the event of a server restart, a client will find out that its
current clientid is no longer valid when it receives an
NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on
the characteristics of the sessions involved, specifically whether
the session is persistent (see Section 2.9.4.5).
In NFSv4.0, the combination of a connected transport endpoint and a When a session is not persistent, the client will need to create a
clientid forms the basis of connection state. While this has been new session. When the existing clientid is presented to a server as
made to be workable with certain limitations, there are difficulties part of creating a session and that clientid is not recognized, as
in correct and robust implementation. The NFSv4.0 protocol must would happen after a server reboot, the server will reject the
provide a server-initiated connection for the callback channel, and request with the error NFS4ERR_STALE_CLIENTID. When this happens,
must carefully specify the persistence of client state at the server the client must obtain a new clientid by use of the CREATE_CLIENTID
in the face of transport interruptions. The server has only the operation and then use that clientid as the basis of a
client's transport address binding (the IP 4-tuple) to identify the new session and then proceed to any other necessary recovery for the
client RPC transaction stream and to use as a lookup tag on the server reboot case (See Section 8.6.2).
duplicate request cache. (A useful overview of this is in [RW96].)
If the server listens on multiple addresses, and the client connects
to more than one, it must employ different clientid's on each,
negating its ability to aggregate bandwidth and redundancy. In
effect, each transport connection is used as the server's
representation of client state. But, transport connections are
potentially fragile and transitory.
In this specification, a session identifier is assigned by the server In the case of the session being persistent, the client will re-
upon initial session negotiation on each connection. This identifier establish communication using the existing session after the reboot.
is used to associate additional connections, to renegotiate after a This session will be associated with a stale clientid and the client
reconnect, to provide an abstraction for the various session will receive an indication of that fact in the sr_status field
properties, and to address the duplicate request cache. No returned by the SEQUENCE operation (see Section 2.9.2.1). The client
transport-specific information is used in the duplicate request cache can then use the existing session to do whatever operations are
implementation of an NFSv4.1 server, nor in fact the RPC XID itself. necessary to determine the status of requests outstanding at the time
The session identifier is unique within the server's scope and may be of reboot, while avoiding issuing new requests, particularly any
subject to certain server policies such as being bounded in time. involving locking on that session. Such requests would fail with
NFS4ERR_STALE_CLIENTID error or an NFS4ERR_STALE_STATEID error, if
attempted. In any case, the client would create a new clientid using
CREATE_CLIENTID, create a new session based on that clientid, and
proceed to other necessary recovery for the server reboot case.
6.1.4. NFSv4 Channels, Sessions and Connections See the detailed descriptions of CREATE_CLIENTID (Section 16.35) and
CREATE_SESSION (Section 16.36) for a complete specification of these
operations.
There are two types of NFSv4 channels: the "operations" or "fore" 2.4.1. Server Release of Clientid
channel used for ordinary requests from client to server, and the
"back" channel, used for callback requests from server to client.
Different NFSv4 operations on these channels can lead to different If the server determines that the client holds no associated state
resource needs. For example, server callback operations (CB_RECALL) for its clientid, the server may choose to release the clientid. The
are specific, small messages which flow from server to client at server may make this choice for an inactive client so that resources
arbitrary times, while data transfers such as read and write have are not consumed by those intermittently active clients. If the
very different sizes and asymmetric behaviors. It is sometimes client contacts the server after this release, the server must ensure
impractical for the RDMA peers (NFSv4 client and NFSv4 server) to the client receives the appropriate error so that it will use the
post buffers for these various operations on a single connection. CREATE_CLIENTID/CREATE_SESSION sequence to establish a new identity.
Commingling of requests with responses at the client receive queue is It should be clear that the server must be very hesitant to release a
particularly troublesome, due both to the need to manage both clientid since the resulting work on the client to recover from such
solicited and unsolicited completions, and to provision buffers for an event will be the same burden as if the server had failed and
both purposes. Due to the lack of any ordering of callback requests restarted. Typically a server would not release a clientid unless
versus response arrivals, without any other mechanisms, the client there had been no activity from that client for many minutes. Note
would be forced to allocate all buffers sized to the worst case. that "associated state" includes sessions. As long as there are
sessions, the server MUST NOT release the clientid. See
Section 2.9.8.1.4 for discussion on releasing inactive sessions.
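The release policy above reduces to a simple predicate. The function name and the inactivity threshold below are illustrative (the text says only "many minutes"); what is grounded in the spec is that a clientid with sessions or other associated state must not be released.

```python
# Sketch of when a server may release an inactive clientid: never while
# sessions (or other associated state) remain, and otherwise only after
# a long idle period.  The 20-minute threshold is an assumed example.
def may_release_clientid(has_sessions, has_other_state, idle_seconds,
                         threshold=20 * 60):
    """Return True when the server may release the clientid."""
    if has_sessions or has_other_state:
        # "Associated state" includes sessions: MUST NOT release.
        return False
    return idle_seconds >= threshold
```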
The callback requests are likely to be handled by a different task Note that if the id string in a CREATE_CLIENTID request is properly
context from that handling the responses. Significant demultiplexing constructed, and if the client takes care to use the same principal
and thread management may be required if both are received on the for each successive use of CREATE_CLIENTID, then, barring an active
same connection. The client and server have full control as to denial of service attack, NFS4ERR_CLID_INUSE should never be
whether a connection will service one channel or both channels. returned.
[[Comment.3: I think trunking remains an open issue as there is no However, client bugs, server bugs, or perhaps a deliberate change of
way yet for clients to determine whether two different server network the principal owner of the id string (such as the case of a client
addresses refer to the same server]]. Also, the client may wish to that changes security flavors, and under the new flavor, there is no
perform trunking of operations channel requests for performance mapping to the previous owner) will in rare cases result in
reasons, or multipathing for availability. This specification NFS4ERR_CLID_INUSE.
permits both, as well as many other session and connection
possibilities, by permitting each operation to carry session
membership information and to share session (and clientid) state in
order to draw upon the appropriate resources. For example, reads and
writes may be assigned to specific, optimized connections, or sorted
and separated by any or all of size, idempotency, etc.
To address the problems described above, this specification allows In that event, when the server gets a CREATE_CLIENTID for a client id
multiple sessions to share a clientid, as well as for multiple that currently has no state, or it has state, but the lease has
connections to share a session. expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST
allow the CREATE_CLIENTID, and confirm the new clientid if followed
by the appropriate CREATE_SESSION.
Single Connection model: 2.5. Security Service Negotiation
NFSv4.1 Session With the NFS version 4 server potentially offering multiple security
/ \ mechanisms, the client needs a method to determine or negotiate which
Operations_Channel [Back_Channel] mechanism is to be used for its communication with the server. The
\ / NFS server may have multiple points within its file system namespace
Connection that are available for use by NFS clients. These points can be
| considered security policy boundaries, and in some NFS
implementations are tied to NFS export points. In turn the NFS
server may be configured such that each of these security policy
boundaries may have different or multiple security mechanisms in use.
Multi-connection trunked model (2 operations channels shown): The security negotiation between client and server must be done with
a secure channel to eliminate the possibility of a third party
intercepting the negotiation sequence and forcing the client and
server to choose a lower level of security than required or desired.
See section Section 19 for further discussion.
NFSv4.1 Session 2.5.1. NFSv4 Security Tuples
/ \
Operations_Channels [Back_Channel]
| | |
Connection Connection [Connection]
| | |
Multi-connection split-use model (2 mounts shown): An NFS server can assign one or more "security tuples" to each
security policy boundary in its namespace. Each security tuple
consists of a security flavor (see Section 2.2.1.1), and if the
flavor is RPCSEC_GSS, a GSS-API mechanism OID, a GSS-API quality of
protection, and an RPCSEC_GSS service.
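A security tuple as just defined could be modeled as follows. The class and field names are illustrative; the Kerberos V5 mechanism OID shown is the well-known GSS-API value, but the constant name is an assumption.

```python
# One way to model the security tuple: a flavor, plus (for RPCSEC_GSS
# only) a GSS-API mechanism OID, a quality of protection, and an
# RPCSEC_GSS service.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SecurityTuple:
    flavor: str                          # e.g. "AUTH_SYS" or "RPCSEC_GSS"
    mechanism_oid: Optional[str] = None  # GSS-API mechanism (RPCSEC_GSS only)
    qop: Optional[int] = None            # quality of protection
    service: Optional[str] = None        # none / integrity / privacy

# Example: Kerberos V5 under RPCSEC_GSS with privacy protection.
KRB5_PRIVACY = SecurityTuple("RPCSEC_GSS", "1.2.840.113554.1.2.2", 0,
                             "rpc_gss_svc_privacy")
```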
NFSv4.1 Session 2.5.2. SECINFO and SECINFO_NO_NAME
/ \
(/home) (/usr/local - readonly)
/ \ |
Operations_Channel [Back_Channel] |
| | Operations_Channel
Connection [Connection] |
| | Connection
|
In this way, implementation as well as resource management may be The SECINFO and SECINFO_NO_NAME operations allow the client to
optimized. Each session will have its own response caching and determine, on a per filehandle basis, what security tuple is to be
buffering, and each connection or channel will have its own transport used for server access. In general, the client will not have to use
resources, as appropriate. Clients which do not require certain either operation except during initial communication with the server
behaviors may optimize such resources away completely, by using or when the client crosses security policy boundaries at the server.
specific sessions and not even creating the additional channels and It is possible that the server's policies change during the client's
connections. interaction therefore forcing the client to negotiate a new security
tuple.
6.1.5. Reconnection, Trunking and Failover 2.5.3. Security Error
Reconnection after failure references stored state on the server Based on the assumption that each NFS version 4 client and server
associated with lease recovery during the grace period. The session must support a minimum set of security (i.e., LIPKEY, SPKM-3, and
provides a convenient handle for storing and managing information Kerberos-V5 all under RPCSEC_GSS), the NFS client will initiate file
regarding the client's previous state on a per-connection basis, part of creating a session and that clientid is not recognized, as
e.g. to be used upon reconnection. Reconnection to a previously communication with the server, the client may receive an NFS error of
existing session, and its stored resources, are covered in NFS4ERR_WRONGSEC. This error allows the server to notify the client
Section 6.3. that the security tuple currently being used contravenes the
server's security policy. The client is then responsible for
determining (see Section 2.5.3.1) what security tuples are available
at the server and choosing one which is appropriate for the client.
One important aspect of reconnection is that of RPC library support. 2.5.3.1. Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME
Traditionally, an Upper Layer RPC-based Protocol such as NFS leaves
all transport knowledge to the RPC layer implementation below it.
This allows NFS to operate over a wide variety of transports and has
proven to be a highly successful approach. The session, however,
introduces an abstraction which is, in a way, "between" RPC and
NFSv4.1. It is important that the session abstraction not have
ramifications within the RPC layer.
One such issue arises within the reconnection logic of RPC. This section explains the mechanics of NFSv4.1 security
Previously, an explicit session binding operation, which established negotiation. Unless noted otherwise, for any mention of PUTFH in
session context for each new connection, was explored. This however this section, the reader should interpret it as applying to PUTROOTFH
required that the session binding also be performed during reconnect, and PUTPUBFH in addition to PUTFH.
which in turn required an RPC request. This additional request
requires new RPC semantics, both in implementation and the fact that
a new request is inserted into the RPC stream. Also, the binding of
a connection to a session required the upper layer to become "aware"
of connections, something the RPC layer abstraction architecturally
abstracts away. Therefore the session binding is not handled in
connection scope but instead explicitly carried in each request.
For Reliability Availability and Serviceability (RAS) issues such as 2.5.3.1.1. PUTFH + LOOKUP (or OPEN by Name)
bandwidth aggregation and multipathing, clients frequently seek to
make multiple connections through multiple logical or physical
channels. The session is a convenient point to aggregate and manage
these resources.
6.1.6. Server Duplicate Request Cache This situation also applies to a put filehandle operation followed by
an OPEN operation that specifies a component name.
RPC-based server duplicate request caches, while not a part of an NFS In this situation, the client is potentially crossing a security
protocol, have become a de-facto requirement of any NFS policy boundary, and the set of security tuples the parent directory
implementation. First described in [CJ89], the duplicate request supports differ from those of the child. The server implementation
cache was initially found to reduce work at the server by avoiding may decide whether to impose any restrictions on security policy
duplicate processing for retransmitted requests. A second, and in administration. There are at least three approaches
the long run more important benefit, was improved correctness, as the (sec_policy_child is the tuple set of the child export,
cache prevented certain destructive non-idempotent requests from being sec_policy_parent is that of the parent).
reinvoked.
However, RPC-based caches do not provide correctness guarantees; they a) sec_policy_child <= sec_policy_parent (<= for subset). This
cannot be managed in a reliable, persistent fashion. The reason is means that the set of security tuples specified on the security
understandable - their storage requirement is unbounded due to the policy of a child directory is always a subset of that of its
lack of any such bound in the NFS protocol, and they are dependent on parent directory.
transport addresses for request matching.
The session model, the presence of maximum request count limits and b) sec_policy_child ^ sec_policy_parent != {} (^ for intersection,
negotiated maximum sizes allows the size and duration of the cache to {} for the empty set). This means that the security tuples
be bounded, and coupled with a long-lived session identifier, enables specified on the security policy of a child directory always has a
its persistent storage on a per-session basis. non empty intersection with that of the parent.
This provides a single unified mechanism offering the following c) sec_policy_child ^ sec_policy_parent == {}. This means that
guarantees required in the NFSv4 specification, while extending them the set of tuples specified on the security policy of a child
to all requests, rather than limiting them only to a subset of state- directory may not intersect with that of the parent. In other
related requests: words, there are no restrictions on how the system administrator
may set security policy.
"It is critical the server maintain the last response sent to the For a server to support approach (b) (when client chooses a flavor
client to provide a more reliable cache of duplicate non-idempotent that is not a member of sec_policy_parent) and (c), PUTFH must NOT
requests than that of the traditional cache described in [CJ89]..." return NFS4ERR_WRONGSEC in case of security mismatch. Instead, it
RFC3530 [2] should be returned from the LOOKUP (or OPEN by component name) that
follows.
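The three administrative approaches can be distinguished with plain set operations, as this sketch shows. The function name and the placeholder flavor names are illustrative; note that case (a) is technically a special case of (b), so the subset test is checked first.

```python
# Classify the relationship between a child export's security tuple set
# and its parent's, per the three approaches in the text:
#   (a) sec_policy_child <= sec_policy_parent      (subset)
#   (b) sec_policy_child ^ sec_policy_parent != {} (non-empty intersection)
#   (c) sec_policy_child ^ sec_policy_parent == {} (disjoint)
def classify_policy(sec_policy_child, sec_policy_parent):
    """Return 'a', 'b', or 'c' for the given tuple sets."""
    if sec_policy_child <= sec_policy_parent:
        return "a"
    if sec_policy_child & sec_policy_parent:
        return "b"
    return "c"
```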
The maximum request count limit is the count of active operations, Since the above guideline does not contradict approach (a), it should
which bounds the number of entries in the cache. Constraining the be followed in general. Even if approach (a) is implemented, it is
size of operations additionally serves to limit the required storage possible for the security tuple used to be acceptable for the target
to the product of the current maximum request count and the maximum of LOOKUP but not for the filehandles used in PUTFH. The PUTFH could
response size. This storage requirement enables server-side really be a PUTROOTFH or PUTPUBFH, where the client does not know the
efficiencies. security tuples for the root or public filehandle. Or the security
policy for the filehandle used by PUTFH could have changed since the
time the filehandle was obtained.
Session negotiation allows the server to maintain other state. An Therefore, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in
NFSv4.1 client invoking the session destroy operation will cause the response to PUTFH, PUTROOTFH, or PUTPUBFH if the operation is
server to close the session, allowing the server to deallocate cache immediately followed by a LOOKUP or an OPEN by component name.
entries. Clients can potentially specify that such caches not be
kept for appropriate types of sessions (for example, read-only
sessions). This can enable more efficient server operation resulting
in improved response times, and more efficient sizing of buffers and
response caches.
Similarly, it is important for the client to explicitly learn whether 2.5.3.1.2. PUTFH + LOOKUPP
the server is able to implement reliable semantics. Knowledge of
whether these semantics are in force is critical for a highly
reliable client, one which must provide transactional integrity
guarantees. When clients request that the semantics be enabled for a
given session, the session reply must inform the client if the mode
is in fact enabled. In this way the client can confidently proceed
with operations without having to implement consistency facilities of
its own.
6.2. Session Initialization and Transfer Models Since SECINFO only works its way down, there is no way LOOKUPP can
return NFS4ERR_WRONGSEC without SECINFO_NO_NAME. SECINFO_NO_NAME
solves this issue because via style "parent", it works in the
opposite direction as SECINFO. As with Section 2.5.3.1.1, PUTFH must
not return NFS4ERR_WRONGSEC whenever it is followed by LOOKUPP. If
the server does not support SECINFO_NO_NAME, the client's only
recourse is to issue the PUTFH, LOOKUPP, GETFH sequence of operations
with every security tuple it supports.
Session initialization issues, and data transfer models relevant to Regardless of whether SECINFO_NO_NAME is supported, an NFSv4.1 server
both TCP and RDMA are discussed in this section. MUST NOT return NFS4ERR_WRONGSEC in response to PUTFH, PUTROOTFH, or
PUTPUBFH if the operation is immediately followed by a LOOKUPP.
6.2.1. Session Negotiation 2.5.3.1.3. PUTFH + SECINFO or PUTFH + SECINFO_NO_NAME
The following parameters are exchanged between client and server at A security sensitive client is allowed to choose a strong security
session creation time. Their values allow the server to properly tuple when querying a server to determine a file object's permitted
size resources allocated in order to service the client's requests, security tuples. The security tuple chosen by the client does not
and to provide the server with a way to communicate limits to the have to be included in the tuple list of the security policy of the
client for proper and optimal operation. They are exchanged prior to either parent directory indicated in PUTFH, or the child file object
all session-related activity, over any transport type. Discussion of indicated in SECINFO (or any parent directory indicated in
their use is found in their descriptions as well as throughout this SECINFO_NO_NAME). Of course the server has to be configured for
section. whatever security tuple the client selects, otherwise the request
will fail at the RPC layer with an appropriate authentication error.
Maximum Requests In theory, there is no connection between the security flavor used by
SECINFO or SECINFO_NO_NAME and those supported by the security
policy. But in practice, the client may start looking for strong
flavors from those supported by the security policy, followed by
those in the mandatory set.
The client's desired maximum number of concurrent requests is The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to PUTFH whenever
passed, in order to allow the server to size its reply cache it is immediately followed by SECINFO or SECINFO_NO_NAME. The
storage. The server may modify the client's requested limit NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC from SECINFO or
downward (or upward) to match its local policy and/or resources. SECINFO_NO_NAME.
Over RDMA-capable RPC transports, the per-request management of
low-level transport message credits is handled within the RPC
layer. [RPCRDMA]
Maximum Request/Response Sizes 2.5.3.1.4. PUTFH + PUTFH
The maximum request and response sizes are exchanged in order to This is a nonsensical situation, because the first put filehandle
permit allocation of appropriately sized buffers and request cache operation is wasted. The NFSv4.1 server MAY return NFS4ERR_WRONGSEC
entries. The size must allow for certain protocol minima, to the first PUTFH, or it MAY NOT. If it does not, it then processes
allowing the receipt of maximally sized operations (e.g. RENAME the subsequent PUTFH and any operation that follows it according to
requests which contains two name strings). Note the maximum the rules listed in Section 2.5.3.1.
2.5.3.1.5.  PUTFH + Nothing

This too is nonsensical because the PUTFH is wasted.  The NFSv4.1
server MAY or MAY NOT return NFS4ERR_WRONGSEC.

2.5.3.1.6.  PUTFH + Anything Else
"Anything Else" includes OPEN by filehandle.

The security policy enforcement applies to the filehandle specified
in PUTFH.  Therefore PUTFH must return NFS4ERR_WRONGSEC in case of a
security tuple mismatch.  This avoids the complexity of adding
NFS4ERR_WRONGSEC as an allowable error to every other operation.

PUTFH + SECINFO_NO_NAME (style "current_fh") is an efficient way for
the client to recover from NFS4ERR_WRONGSEC.

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to any operation
other than LOOKUP, LOOKUPP, and OPEN (by component name).
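The recovery path described above can be sketched as follows.  This
is a minimal illustration only: the compound() helper and the flavor
names are hypothetical, not a real NFSv4.1 client API; only the error
code value comes from the specification.

```python
# Hypothetical sketch of NFS4ERR_WRONGSEC recovery. compound() and the
# operation/flavor representations are illustrative, not a real client API.

NFS4ERR_WRONGSEC = 10016  # error code from the NFSv4 specification

def recover_wrongsec(compound, fh, op):
    """Retry an operation after NFS4ERR_WRONGSEC by probing SECINFO_NO_NAME."""
    status, results = compound([("PUTFH", fh), op])
    if status != NFS4ERR_WRONGSEC:
        return status, results
    # PUTFH + SECINFO_NO_NAME (style "current_fh") cannot itself draw
    # NFS4ERR_WRONGSEC, so it is a safe recovery probe.
    _, secinfo = compound([("PUTFH", fh), ("SECINFO_NO_NAME", "current_fh")])
    for flavor in secinfo:                    # server's permitted tuples
        status, results = compound([("PUTFH", fh), op], flavor=flavor)
        if status != NFS4ERR_WRONGSEC:
            return status, results
    raise PermissionError("no acceptable security tuple")
```

The probe is efficient because it reuses the already-held filehandle
rather than re-walking the path by component name.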
2.6.  Minor Versioning

To address the requirement of an NFS protocol that can evolve as the
need arises, the NFS version 4 protocol contains the rules and
framework to allow for future minor changes or versioning.

The base assumption with respect to minor versioning is that any
future accepted minor version must follow the IETF process and be
documented in a standards track RFC.  Therefore, each minor version
number will correspond to an RFC.  Minor version zero of the NFS
version 4 protocol is represented by [2], and minor version one is
represented by this document [[Comment.2: change "document" to "RFC"
when we publish]].  The COMPOUND and CB_COMPOUND procedures support
the encoding of the minor version being requested by the client.
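As an illustration of how that encoding is used, a server might gate
an entire COMPOUND on the requested minor version.  The dispatch
below is an invented sketch, though NFS4ERR_MINOR_VERS_MISMATCH is
the real NFSv4 error defined for this case.

```python
# Illustrative sketch only: gating COMPOUND processing on the minorversion
# field. The handler names and dispatch shape are hypothetical.

NFS4_OK = 0
NFS4ERR_MINOR_VERS_MISMATCH = 10021  # real error code from the NFSv4.0 spec

SUPPORTED_MINOR_VERSIONS = {0, 1}    # each corresponds to a standards track RFC

def compound_dispatch(minorversion, operations):
    """Reject the whole COMPOUND if the requested minor version is unknown."""
    if minorversion not in SUPPORTED_MINOR_VERSIONS:
        # No operations are processed; only the status is returned.
        return NFS4ERR_MINOR_VERS_MISMATCH, []
    return NFS4_OK, [process(minorversion, op) for op in operations]

def process(minorversion, op):
    return ("OK", op)  # stand-in for per-operation processing
```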
The following items represent the basic rules for the development of
minor versions.  Note that a future minor version may decide to
modify or add to the following rules as part of the minor version
definition.
1.  Procedures are not added or deleted.

    To maintain the general RPC model, NFS version 4 minor versions
    will not add to or delete procedures from the NFS program.
2.  Minor versions may add operations to the COMPOUND and
    CB_COMPOUND procedures.

    The addition of operations to the COMPOUND and CB_COMPOUND
    procedures does not affect the RPC model.
    *  Minor versions may append attributes to GETATTR4args,
       bitmap4, and GETATTR4res.

       This allows for the expansion of the attribute model to allow
       for future growth or adaptation.

    *  Minor version X must append any new attributes after the last
       documented attribute.

       Since attribute results are specified as an opaque array of
       per-attribute XDR encoded results, the complexity of adding
       new attributes in the midst of the current definitions would
       be too burdensome.
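The append-only discipline falls out of how bitmap4 is encoded: a
variable-length array of 32-bit words in which attribute number n
maps to bit (n mod 32) of word (n div 32), per the NFSv4.0 XDR.  A
minimal sketch of that mapping:

```python
# Sketch of the bitmap4 attribute-mask encoding used by GETATTR.
# Appending a new, higher-numbered attribute only ever grows the word
# array, which is why minor versions can add attributes without
# disturbing the positions of existing ones.

def encode_bitmap4(attr_numbers):
    words = []
    for n in attr_numbers:
        word, bit = divmod(n, 32)
        while len(words) <= word:
            words.append(0)       # extend the array for high-numbered attrs
        words[word] |= 1 << bit
    return words

def decode_bitmap4(words):
    return [w * 32 + b for w, word in enumerate(words)
            for b in range(32) if word & (1 << b)]
```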
3.  Minor versions must not modify the structure of an existing
    operation's arguments or results.

    Again the complexity of handling multiple structure definitions
    for a single operation is too burdensome.  New operations should
    be added instead of modifying existing structures for a minor
    version.

    This rule does not preclude the following adaptations in a minor
    version:
    *  adding bits to flag fields, such as new attributes to
       GETATTR's bitmap4 data type

    *  adding bits to existing attributes like ACLs that have flag
       words

    *  extending enumerated types (including NFS4ERR_*) with new
       values
4.  Minor versions may not modify the structure of existing
    attributes.
5.  Minor versions may not delete operations.
    This prevents the potential reuse of a particular operation
    "slot" in a future minor version.
6.  Minor versions may not delete attributes.
7.  Minor versions may not delete flag bits or enumeration values.
8.  Minor versions may declare an operation as mandatory to NOT
    implement.
    Specifying an operation as "mandatory to not implement" is
    equivalent to obsoleting an operation.  For the client, it means
    that the operation should not be sent to the server.  For the
    server, an NFS error can be returned as opposed to "dropping"
    the request as an XDR decode error.  This approach allows for
    the obsolescence of an operation while maintaining its structure
    so that a future minor version can reintroduce the operation.
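A sketch of the distinction drawn above, with a hypothetical
operation table.  NFS4ERR_NOTSUPP is a real NFSv4 error code; the
choice of OPEN_CONFIRM as the obsoleted operation is purely
illustrative here.

```python
# Illustrative only: a server distinguishing an obsoleted ("mandatory to
# not implement") operation, which gets a clean NFS error, from a truly
# unknown operation, which would fail XDR decoding.

NFS4ERR_NOTSUPP = 10004  # real NFSv4 error code
NFS4_OK = 0

KNOWN_OPS = {"GETATTR", "PUTFH", "OPEN_CONFIRM"}
MANDATORY_TO_NOT_IMPLEMENT = {"OPEN_CONFIRM"}  # hypothetical choice

def handle_op(opname):
    if opname not in KNOWN_OPS:
        # Structure unknown: the request cannot even be decoded.
        raise ValueError("XDR decode error: unknown operation")
    if opname in MANDATORY_TO_NOT_IMPLEMENT:
        # Structure retained, so the op decodes and is rejected cleanly
        # rather than dropped.
        return NFS4ERR_NOTSUPP
    return NFS4_OK
```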
    1.  Minor versions may declare attributes mandatory to NOT
        implement.

    2.  Minor versions may declare flag bits or enumeration values
        as mandatory to NOT implement.
9.  Minor versions may downgrade features from mandatory to
    recommended, or recommended to optional.

10. Minor versions may upgrade features from optional to
    recommended, or recommended to mandatory.
11. A client and server that support minor version X should support
    minor versions 0 (zero) through X-1 as well.

12. Except for infrastructural changes, no new features may be
    introduced as mandatory in a minor version.
    This rule allows for the introduction of new functionality and
    forces the use of implementation experience before designating a
    feature as mandatory.  On the other hand, some classes of
    features are infrastructural and have broad effects.  Allowing
    such features to not be mandatory complicates implementation of
    the minor version.
13. A client MUST NOT attempt to use a stateid, filehandle, or
    similar returned object from the COMPOUND procedure with minor
    version X for another COMPOUND procedure with minor version Y,
    where X != Y.
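One way a client might honor this rule is to partition its cached
state by minor version, so that nothing obtained under minor version
X can be found when operating under minor version Y.  The class below
is an invented illustration, not part of the protocol.

```python
# Invented client-side sketch: returned objects (stateids, filehandles)
# are keyed by the minor version under which they were obtained, so a
# lookup under a different minor version misses and forces re-acquisition.

class PerMinorVersionState:
    def __init__(self):
        self._state = {}            # (minorversion, name) -> object

    def store(self, minorversion, name, obj):
        self._state[(minorversion, name)] = obj

    def lookup(self, minorversion, name):
        # A miss here forces the client to re-obtain the object under
        # the minor version it intends to use, as the rule requires.
        return self._state.get((minorversion, name))
```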
2.7.  Non-RPC-based Security Services

As described in Section 2.2.1.1.1.1, NFSv4 relies on RPC for
identification, authentication, integrity, and privacy.  NFSv4
itself provides additional security services as described in the
next several subsections.
2.7.1.  Authorization

Authorization to access a file object via an NFSv4 operation is
ultimately determined by the NFSv4 server.  A client can predetermine
its access to a file object via the OPEN (Section 16.16) and the
ACCESS (Section 16.1) operations.

Principals with appropriate access rights can modify the
authorization on a file object via the SETATTR (Section 16.30)
operation.  Four attributes that affect access rights are: mode,
owner, owner_group, and acl.  See Section 5.
2.7.2.  Auditing

NFSv4 provides auditing on a per file object basis, via the ACL
attribute as described in Section 6.  It is outside the scope of this
specification to specify audit log formats or management policies.
2.7.3.  Intrusion Detection

NFSv4 provides alarm control on a per file object basis, via the ACL
attribute as described in Section 6.  Alarms may serve as the basis
for intrusion detection.  It is outside the scope of this
specification to specify heuristics for detecting intrusion via
alarms.
2.8.  Transport Layers

2.8.1.  Required and Recommended Properties of Transports

NFSv4 works over RDMA and non-RDMA-based transports with the
following attributes:
o  The transport supports reliable delivery of data, which NFSv4
   requires but neither NFSv4 nor RPC has facilities for ensuring.
   [20]

o  The transport delivers data in the order it was sent.  Ordered
   delivery simplifies detection of transmit errors, and simplifies
   the sending of arbitrary sized requests and responses, via the
   record marking protocol [4].
Where an NFS version 4 implementation supports operation over the IP
network protocol, any transport used between NFS and IP MUST be among
the IETF-approved congestion control transport protocols.  At the
time this document was written, the only two transports that had the
above attributes were TCP and SCTP.  To enhance the possibilities for
interoperability, an NFS version 4 implementation MUST support
operation over the TCP transport protocol.

Even if NFS version 4 is used over a non-IP network protocol, it is
RECOMMENDED that the transport support congestion control.

Note that it is permissible for connectionless transports to be used
under NFSv4.1; however, reliable and in-order delivery of data is
still required.  NFSv4.1 assumes that a client transport address and
server transport address used to send data over a transport together
constitute a connection, even if the underlying transport eschews the
concept of a connection.
2.8.2.  Client and Server Transport Behavior

If a connection-oriented transport (e.g. TCP) is used, the client and
server SHOULD use long lived connections for at least three reasons:

1.  This will prevent the weakening of the transport's congestion
    control mechanisms via short lived connections.

2.  This will improve performance for the WAN environment by
    eliminating the need for connection setup handshakes.

3.  The NFSv4.1 callback model differs from NFSv4.0, and requires
    the client and server to maintain a client-created channel (see
    Section 2.9.3.4) for the server to use.
In order to reduce congestion, if a connection-oriented transport is
used, and the request is not the NULL procedure:

o  A client (or the server, if issuing a callback) MUST NOT retry a
   request unless the connection the request was issued over was
   disconnected before the reply was received.

o  A server (or the client, if receiving a callback) MUST NOT
   silently drop a request, even if the request is a retry.  (The
   silent drop behavior of RPCSEC_GSS [5] does not apply because this
   behavior happens at the RPCSEC_GSS layer, a lower layer in the
   request processing.)  Instead, the server SHOULD return an
   appropriate error (see Section 2.9.4.1) or it MAY disconnect the
   connection.
: ------------------------------> :
: : Prepost
: Clientid reply(clientid, ...) : receive
: <------------------------------ :
Prepost : :
receive : Create Session(clientid, size S, :
: maxreq N, RDMA ...) :
: ------------------------------> :
: : Prepost <=N'
: Session reply(sessionid, size S', : receives of
: maxreq N') : size S'
: <------------------------------ :
: :
: <normal operation> :
: ------------------------------> :
: <------------------------------ :
: : :
6.4. Buffer Management, Transfer, Flow Control When using RDMA transports there are other reasons for not tolerating
retries over the same connection:
Inline operations in NFSv4.1 behave effectively the same as TCP o RDMA transports use "credits" to enforce flow control, where a
sends. Procedure results are passed in a single message, and its credit is a right to a peer to transmit a message. If one peer
completion at the client signals the receiving process to inspect the were to retransmit a request (or reply), it would consume an
message. additional credit. If the server retransmitted a reply, it would
certainly result in an RDMA connection loss, since the client
would typically only post a single receive buffer for each
request. If the client retransmitted a request, the additional
credit consumed on the server might lead to RDMA connection
failure unless the client accounted for it and decreased its
available credit, leading to wasted resources.
RDMA operations are performed solely by the server in NFSv4.1, as o RDMA credits present a new issue to the reply cache in NFSv4.1.
described in Section 6.2.5 RDMA Direct Transfer Model. Since server The reply cache may be used when a connection within a session is
RDMA operations do not result in a completion at the client, and due lost, such as after the client reconnects. Credit information is
to ordering rules in RDMA transports, after all required RDMA a dynamic property of the RDMA connection, and stale values must
operations are complete, a Send (Send with Solicited Event for iWARP) not be replayed from the cache. This implies that the reply cache
containing the procedure results is performed from server to client. contents must not be blindly used when replies are issued from it,
This Send operation will result in a completion which will signal the and credit information appropriate to the channel must be
client to inspect the message. refreshed by the RPC layer.
In the case of client read-type NFSv4 operations, the server will In addition, the sender of an NFSv4.1 request is not allowed to stop
have issued RDMA Writes to transfer the resulting data into client- waiting for a reply, as described in Section 2.9.4.2.
advertised buffers. The subsequent Send operation performs two
necessary functions: finalizing any active or pending DMA at the
client, and signaling the client to inspect the message.
In the case of client write-type NFSv4 operations, the server will 2.8.3. Ports
have issued RDMA Reads to fetch the data from the client-advertised
buffers. No data consistency issues arise at the client, but the
completion of the transfer must be acknowledged, again by a Send from
server to client.
In either case, the client advertises buffers for direct (RDMA style) Historically, NFS version 2 and version 3 servers have resided on
operations. The client may desire certain advertisement limits, and port 2049. The registered port 2049 RFC3232 [21] for the NFS
may wish the server to perform remote invalidation on its behalf when protocol should be the default configuration. NFSv4 clients SHOULD
the server has completed its RDMA. This may be considered in a NOT use the RPC binding protocols as described in RFC1833 [22].
future version of this draft.
In the absence of remote invalidation, the client may perform its 2.9. Session
own, local invalidation after the operation completes. This
invalidation should occur prior to any RPCSEC GSS integrity checking,
since a validly remotely accessible buffer can possibly be modified
by the peer. However, after invalidation and the contents integrity
checked, the contents are locally secure.
Credit updates over RDMA transports are supported at the RPC layer as 2.9.1. Motivation and Overview
described in [RPCRDMA]. In each request, the client requests a
desired number of credits to be made available to the connection on
which it sends the request. The client must not send more requests
than the number which the server has previously advertised, or in the
case of the first request, only one. If the client exceeds its
credit limit, the connection may close with a fatal RDMA error.
The server then executes the request, and replies with an updated Previous versions and minor versions of NFS have suffered from the
credit count accompanying its results. Since replies are sequenced following:
by their RDMA Send order, the most recent results always reflect the
server's limit. In this way the client will always know the maximum
number of requests it may safely post.
Because the client requests an arbitrary credit count in each o Lack of support for exactly once semantics (EOS). This includes
request, it is relatively easy for the client to request more, or lack of support for EOS through server failure and recovery.
fewer, credits to match its expected need. A client that discovered
itself frequently queuing outgoing requests due to lack of server
credits might increase its requested credits proportionately in
response. Or, a client might have a simple, configurable number.
The protocol also provides a per-operation "maxslot" exchange to
assist in dynamic adjustment at the session level, described in a
later section.
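The credit exchange described above can be sketched as a small client-side accounting loop. This is a minimal illustration, not the protocol's XDR; all class and field names here are hypothetical, and the actual credit values are carried by the RPC-over-RDMA layer.

```python
# Sketch of client-side credit accounting over an RDMA transport:
# the client never has more requests outstanding than the limit most
# recently advertised by the server, and only one before the first reply.

class CreditWindow:
    def __init__(self):
        self.granted = 1        # before any reply, only one request allowed
        self.outstanding = 0

    def can_send(self):
        return self.outstanding < self.granted

    def send(self, requested_credits):
        if not self.can_send():
            raise RuntimeError("would exceed server credit grant")
        self.outstanding += 1
        # Each request carries the client's desired credit count.
        return {"credits_requested": requested_credits}

    def reply_received(self, credits_granted):
        self.outstanding -= 1
        # Replies are sequenced by RDMA Send order, so the most recent
        # reply always reflects the server's current limit.
        self.granted = credits_granted

w = CreditWindow()
w.send(requested_credits=32)        # first request: only one permitted
assert not w.can_send()             # a second send must wait for a reply
w.reply_received(credits_granted=32)
for _ in range(32):                 # now up to 32 may be outstanding
    w.send(requested_credits=32)
assert not w.can_send()
```

A client that finds itself frequently blocked in `can_send` would raise its `requested_credits` value, as the text suggests.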
Occasionally, a server may wish to reduce the total number of credits o Limited callback support, including no support for sending
it offers a certain client on a connection. This could be callbacks through firewalls, and races between responses from
encountered if a client were found to be consuming its credits normal requests, and callbacks.
slowly, or not at all. A client might notice this itself, and reduce
its requested credits in advance, for instance requesting only the
count of operations it currently has queued, plus a few as a base for
starting up again. Such mechanisms can, however, be potentially
complicated and are implementation-defined. The protocol does not
require them.
Because of the way in which RDMA fabrics function, it is not possible o Limited trunking over multiple network paths.
for the server (or client back channel) to cancel outstanding receive
operations. Therefore, effectively only one credit can be withdrawn
per receive completion. The server (or client back channel) would
simply not replenish a receive operation when replying. The server
can still reduce the available credit advertisement in its replies to
the target value it desires, as a hint to the client that its credit
target is lower and it should expect it to be reduced accordingly.
Of course, even if the server could cancel outstanding receives, it
cannot do so, since the client may have already sent requests in
expectation of the previous limit.
This brings out an interesting scenario similar to that of client o Requiring machine credentials for fully secure operation.
reconnect discussed in Section 6.3. How does the server reduce the
credits of an inactive client?
One approach is for the server to simply close such a connection and Through the introduction of a session, NFSv4.1 addresses the above
require the client to reconnect at a new credit limit. This is shortfalls with practical solutions:
acceptable, if inefficient, when the connection setup time is short
and where the server supports persistent session semantics.
A better approach is to provide a back channel request to return the o EOS is enabled by a reply cache with a bounded size, making it
operations channel credits. The server may request the client to feasible to keep on persistent storage and enable EOS through
return some number of credits, the client must comply by performing server failure and recovery. One reason that previous revisions
operations on the operations channel, provided of course that the of NFS did not support EOS was because some EOS approaches often
request does not drop the client's credit count to zero (in which limited parallelism. As will be explained in Section 2.9.4),
case the connection would deadlock). If the client finds that it has NFSv4.1 supports both EOS and unlimited parallelism.
no requests with which to consume the credits it was previously
granted, it must send zero-length Send RDMA operations, or NULL NFSv4
operations in order to return the resources to the server. If the
client fails to comply in a timely fashion, the server can recover
the resources by breaking the connection.
While in principle, the back channel credits could be subject to a o The NFSv4.1 client creates transport connections and
similar resource adjustment, in practice this is not an issue, since gives them to the server for sending callbacks, thus solving the
the back channel is used purely for control and is expected to be firewall issue (Section 16.34). Races between responses from
statically provisioned. client requests, and callbacks caused by the requests are detected
via the session's sequencing properties which are a byproduct of
EOS (Section 2.9.4.3).
It is important to note that in addition to maximum request counts, o The NFSv4.1 client can add an arbitrary number of connections to
the sizes of buffers are negotiated per-session. This permits the the session, and thus provide trunking (Section 2.9.3.4.1).
most efficient allocation of resources on both peers. There is an
important requirement on reconnection: the sizes posted by the server
at reconnect must be at least as large as previously used, to allow
recovery. Any replies that are replayed from the server's duplicate
request cache must be able to be received into client buffers. In
the case where a client has received replies to all its retried
requests (and therefore received all its expected responses), then
the client may disconnect and reconnect with different buffers at
will, since no cache replay will be required.
6.5. Retry and Replay o The NFSv4.1 session produces a session key independent of client
and server machine credentials which can be used to compute a
digest for protecting key session management operations
(Section 2.9.6.3).
NFSv4.0 forbids retransmission on active connections over reliable o The NFSv4.1 client can also create secure RPCSEC_GSS contexts for
transports; this includes connected-mode RDMA. This restriction must use by the session's callback channel that do not require the
be maintained in NFSv4.1. server to authenticate to a client machine principal
(Section 2.9.6.2).
If one peer were to retransmit a request (or reply), it would consume A session is a dynamically created, long-lived server object created
an additional credit on the other. If the server retransmitted a by a client, used over time from one or more transport connections.
reply, it would certainly result in an RDMA connection loss, since Its function is to maintain the server's state relative to the
the client would typically only post a single receive buffer for each connection(s) belonging to a client instance. This state is entirely
request. If the client retransmitted a request, the additional independent of the connection itself, and indeed the state exists
credit consumed on the server might lead to RDMA connection failure whether the connection exists or not (though locks, delegations, etc.
unless the client accounted for it and decreased its available generally expire in the extended absence of an open connection).
credit, leading to wasted resources. The session in effect becomes the object representing an active
client on a set of zero or more connections.
RDMA credits present a new issue to the duplicate request cache in 2.9.2. NFSv4 Integration
NFSv4.1. The request cache may be used when a connection within a
session is lost, such as after the client reconnects. Credit
information is a dynamic property of the connection, and stale values
must not be replayed from the cache. This implies that the request
cache contents must not be blindly used when replies are issued from
it, and credit information appropriate to the channel must be
refreshed by the RPC layer.
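The rule above, that cached replies must not be replayed with stale credit information, can be sketched as follows. The function and field names are illustrative only; the real reply cache holds encoded RPC replies, not dictionaries.

```python
# Sketch: when replaying a cached reply after reconnect, substitute
# credit information appropriate to the current connection rather than
# replaying the stale value stored with the cached reply.

def replay_from_cache(reply_cache, slotid, current_credit_limit):
    cached = dict(reply_cache[slotid])   # copy; never mutate the cache itself
    # The cached credit grant belonged to the old connection and must
    # be refreshed by the RPC layer before the reply is reissued.
    cached["credits_granted"] = current_credit_limit
    return cached

cache = {3: {"status": "NFS4_OK", "results": b"...", "credits_granted": 64}}
reply = replay_from_cache(cache, 3, current_credit_limit=16)
assert reply["credits_granted"] == 16     # refreshed for the new connection
assert cache[3]["credits_granted"] == 64  # cache contents left intact
```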
Finally, RDMA fabrics do not guarantee that the memory handles Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major
(Steering Tags) within each RDMA three-tuple are valid on a scope infrastructure change like sessions would require a new major version
outside that of a single connection. Therefore, handles used by the number to an RPC program like NFS. However, because NFSv4
direct operations become invalid after connection loss. The server encapsulates its functionality in a single procedure, COMPOUND, and
must ensure that any RDMA operations which must be replayed from the because COMPOUND can support an arbitrary number of operations,
request cache use the newly provided handle(s) from the most recent sessions are almost trivially added. COMPOUND includes a minor
request. version number field, and for NFSv4.1 this minor version is set to 1.
When the NFSv4 server processes a COMPOUND with the minor version set
to 1, it expects a different set of operations than it does for
NFSv4.0. One operation it expects is the SEQUENCE operation, which
is required for every COMPOUND that operates over an established
session.
6.6. The Back Channel 2.9.2.1. SEQUENCE and CB_SEQUENCE
The NFSv4 callback operations present a significant resource problem In NFSv4.1, when the SEQUENCE operation is present, it is always the
for the RDMA enabled client. Clearly, callbacks must be negotiated first operation in the COMPOUND procedure. The primary purpose of
in the way credits are for the ordinary operations channel for SEQUENCE is to carry the session identifier. The session identifier
requests flowing from client to server. But, for callbacks to arrive associates all other operations in the COMPOUND procedure with a
on the same RDMA endpoint as operation replies would require particular session. SEQUENCE also contains required information for
dedicating additional resources, and specialized demultiplexing and maintaining EOS (see Section 2.9.4). Session-enabled NFSv4.1
event handling. Or, callbacks may not require RDMA service at all is required for every COMPOUND that operates over an established
(they do not normally carry substantial data payloads). It is highly
desirable to streamline this critical path via a second
communications channel.
The session callback channel binding facility is designed for exactly +-----+--------------+-----------+------------+-----------+----
such a situation, by dynamically associating a new connected endpoint | tag | minorversion | numops |SEQUENCE op | op + args | ...
with the session, and separately negotiating sizes and counts for | | (== 1) | (limited) | + args | |
active callback channel operations. The binding operation is +-----+--------------+-----------+------------+-----------+----
firewall-friendly since it does not require the server to initiate
the connection.
This same method serves as well for ordinary TCP connection mode. It and the reply's structure is:
is expected that all NFSv4.1 clients may make use of the session
facility to streamline their design.
The back channel functions exactly the same as the operations channel +------------+-----+--------+-------------------------------+--//
except that no RDMA operations are required to perform transfers, |last status | tag | numres |status + SEQUENCE op + results | //
instead the sizes are required to be sufficiently large to carry all +------------+-----+--------+-------------------------------+--//
data inline, and of course the client and server reverse their roles //-----------------------+----
with respect to which is in control of credit management. The same // status + op + results | ...
rules apply for all transfers, with the server being required to flow //-----------------------+----
control its callback requests.
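The COMPOUND layout shown above can be illustrated with a minimal validity check: within a session, SEQUENCE must be the first operation of a minor version 1 COMPOUND. This sketch uses dictionaries in place of the protocol's XDR encoding, and the error names chosen for the failure cases are illustrative.

```python
# Sketch: a session-enabled NFSv4.1 COMPOUND carries minorversion 1
# and SEQUENCE as its first operation, as in the diagram above.

def check_compound(compound):
    if compound["minorversion"] != 1:
        return "NFS4ERR_MINOR_VERS_MISMATCH"
    ops = compound["ops"]
    if not ops or ops[0]["op"] != "SEQUENCE":
        # Within a session, SEQUENCE must come first.
        return "NFS4ERR_OP_NOT_IN_SESSION"   # illustrative error choice
    return "NFS4_OK"

good = {"tag": "", "minorversion": 1,
        "ops": [{"op": "SEQUENCE", "sessionid": b"\x01" * 16,
                 "slotid": 0, "sequenceid": 1},
                {"op": "PUTFH"}, {"op": "READ"}]}
bad = {"tag": "", "minorversion": 1, "ops": [{"op": "PUTFH"}]}
assert check_compound(good) == "NFS4_OK"
assert check_compound(bad) == "NFS4ERR_OP_NOT_IN_SESSION"
```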
The back channel is optional. If not bound on a given session, the A CB_COMPOUND procedure request and reply has a similar form, but
server must not issue callback operations to the client. This in instead of a SEQUENCE operation, there is a CB_SEQUENCE operation,
turn implies that such a client must never put itself in the and there is an additional field called "callback_ident", which is
situation where the server will need to do so, lest the client lose superfluous in NFSv4.1. CB_SEQUENCE has the same information as
its connection by force, or its operation be incorrect. For the same SEQUENCE, but includes other information needed to solve callback
reason, if a back channel is bound, the client is subject to races (Section 2.9.4.3).
revocation of its delegations if the back channel is lost. Any
connection loss should be corrected by the client as soon as
possible.
This can be convenient for the NFSv4.1 client; if the client expects 2.9.2.2. Clientid and Session Association
to make no use of back channel facilities such as delegations, then
there is no need to create it. This may save significant resources
and complexity at the client.
For these reasons, if the client wishes to use the back channel, that Sessions are subordinate to the clientid (Section 2.4). Each
channel must be bound first, before using the operations channel. In clientid can have zero or more active sessions. A clientid, and a
this way, the server will not find itself in a position where it will session bound to it are required to do anything useful in NFSv4.1.
send callbacks on the operations channel when the client is not Each time a session is used, the state leased to its associated
prepared for them. clientid is automatically renewed.
[[Comment.4: [XXX - do we want to support this?]]] There is one State such as share reservations, locks, delegations, and layouts
special case, that where the back channel is bound in fact to the (Section 1.5.4) is tied to the clientid, not the sessions of the
operations channel's connection. This configuration would be used clientid. Successive state changing operations from a given state
normally over a TCP stream connection to exactly implement the owner can go over different sessions, as long as each session is
NFSv4.0 behavior, but over RDMA would require complex resource and associated with the same clientid. Callbacks can arrive over a
event management at both sides of the connection. The server is not different session than the session that sent the operation the
required to accept such a bind request on an RDMA connection for this different session than the session that sent the operation that
reason, though it is recommended. A is used to acquire a delegation, a request to recall the delegation
can arrive over session B.
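The subordination of sessions to the clientid described above, where use of any session renews the one lease of its clientid, can be sketched with a toy data model. The class and names are hypothetical; lease durations are server-defined.

```python
# Sketch: each clientid has zero or more sessions, and using any of
# them renews the single state lease held by the clientid.

import time

class Client:
    def __init__(self, clientid, lease_seconds=90):
        self.clientid = clientid
        self.lease_seconds = lease_seconds
        self.lease_expiry = time.monotonic() + lease_seconds
        self.sessions = {}             # sessionid -> per-session state

    def use_session(self, sessionid):
        assert sessionid in self.sessions
        # State (locks, delegations, ...) hangs off the clientid, so a
        # request over any associated session renews the same lease.
        self.lease_expiry = time.monotonic() + self.lease_seconds

c = Client(clientid=0xABCD)
c.sessions["A"] = {}                   # e.g. session that acquired a delegation
c.sessions["B"] = {}                   # e.g. session that receives the recall
before = c.lease_expiry
time.sleep(0.01)
c.use_session("B")                     # either session renews the lease
assert c.lease_expiry > before
```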
6.7. COMPOUND Sizing Issues 2.9.3. Channels
Very large responses may pose duplicate request cache issues. Since Each session has one or two channels: the "operation" or "fore"
servers will want to bound the storage required for such a cache, the channel used for ordinary requests from client to server, and the
unlimited size of response data in COMPOUND may be troublesome. If "back" channel, used for callback requests from server to client.
COMPOUND is used in all its generality, then the inclusion of certain The session allocates resources for each channel, including separate
non-idempotent operations within a single COMPOUND request may render reply caches (see Section 2.9.4.1). These resources are for the most
the entire request non-idempotent. (For example, a single COMPOUND part specified at the time the session is created.
request which read a file or symbolic link, then removed it, would be
obliged to cache the data in order to allow identical replay).
Therefore, many requests might include operations that return any
amount of data.
It is not satisfactory for the server to reject COMPOUNDs at will 2.9.3.1. Operation Channel
with NFS4ERR_RESOURCE when they pose such difficulties for the
server, as this results in serious interoperability problems.
Instead, any such limits must be explicitly exposed as attributes of
the session, ensuring that the server can explicitly support any
duplicate request cache needs at all times.
6.8. Data Alignment The operation channel carries COMPOUND requests and responses. A
session always has an operation channel.
A negotiated data alignment enables certain scatter/gather 2.9.3.2. Backchannel
optimizations. A facility for this is supported by [RPCRDMA]. Where
NFS file data is the payload, specific optimizations become highly
attractive.
Header padding is requested by each peer at session initiation, and The backchannel carries CB_COMPOUND requests and responses. Whether
may be zero (no padding). Padding leverages the useful property that there is a backchannel or not is a decision of the client; NFSv4.1
RDMA receives preserve alignment of data, even when they are placed servers MUST support backchannels.
into anonymous (untagged) buffers. If requested, client inline
writes will insert appropriate pad bytes within the request header to
align the data payload on the specified boundary. The client is
encouraged to be optimistic and simply pad all WRITEs within the RPC
layer to the negotiated size, in the expectation that the server can
use them efficiently.
It is highly recommended that clients offer to pad headers to an 2.9.3.3. Session and Channel Association
appropriate size. Most servers can make good use of such padding,
which allows them to chain receive buffers in such a way that any
data carried by client requests will be placed into appropriate
buffers at the server, ready for file system processing. The
receiver's RPC layer encounters no overhead from skipping over pad
bytes, and the RDMA layer's high performance makes the insertion and
transmission of padding on the sender a significant optimization. In
this way, the need for servers to perform RDMA Read to satisfy all
but the largest client writes is obviated. An added benefit is the
reduction of message roundtrips on the network - a potentially good
trade, where latency is present.
The value to choose for padding is subject to a number of criteria. Because there are at most two channels per session, and because each
A primary source of variable-length data in the RPC header is the channel has a distinct purpose, channels are not assigned
authentication information, the form of which is client-determined, identifiers. The operation and backchannel are implicitly created
possibly in response to server specification. The contents of and associated when the session is created.
COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all
go into the determination of a maximal NFSv4 request size and
therefore minimal buffer size. The client must select its offered
value carefully, so as not to overburden the server, and vice versa.
The payoff of an appropriate padding value is higher performance.
Sender gather: 2.9.3.4. Connection and Channel Association
|RPC Request|Pad bytes|Length| -> |User data...|
\------+---------------------/ \
\ \
\ Receiver scatter: \-----------+- ...
/-----+----------------\ \ \
|RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->...
In the above case, the server may recycle unused buffers to the next Each channel is associated with zero or more transport connections.
posted receive if unused by the actual received request, or may pass A connection can be bound to one channel or both channels of a
the now-complete buffers by reference for normal write processing. session; the client and server negotiate whether a connection will
For a server which can make use of it, this removes any need for data carry traffic for one channel or both channel via the the
copies of incoming data, without resorting to complicated end-to-end CREATE_SESSION (Section 16.36) and the BIND_CONN_TO_SESSION
buffer advertisement and management. This includes most kernel-based (Section 16.34) operations. When a session is created via
and integrated server designs, among many others. The client may CREATE_SESSION, it is automatically bound to the operation channel,
perform similar optimizations, if desired. and optionally the backchannel. If the client does not specify
connection binding enforcement when the session is created, then
additional connections are automatically bound to the operation
channel when they are used with a SEQUENCE operation that has the
session's sessionid.
Padding is negotiated by the session creation operation, and A connection MAY be bound to the channels of other sessions. The
subsequently used by the RPC RDMA layer, as described in [RPCRDMA]. client decides, and the NFSv4.1 server MUST allow it. A connection
MAY be bound to the channels of other sessions of other
clientids. Again, the client decides, and the server MUST allow it.
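The header-padding scheme of Section 6.8 amounts to a simple alignment computation: given the negotiated boundary, insert enough pad bytes after the RPC header so the data payload starts on that boundary. The header length and boundary below are hypothetical values for illustration.

```python
# Sketch of the negotiated header padding described above: pad bytes
# are inserted between the RPC request header and the WRITE payload so
# the payload lands on the agreed alignment boundary at the receiver.

def pad_for_alignment(header_len, align):
    if align == 0:                     # zero means no padding negotiated
        return 0
    return (align - header_len % align) % align

header = b"\x00" * 212                 # hypothetical RPC + NFS header bytes
align = 512                            # hypothetical negotiated boundary
pad = pad_for_alignment(len(header), align)
message = header + b"\x00" * pad + b"payload"
assert (len(header) + pad) % align == 0   # payload starts on the boundary
assert pad < align                        # never pad a full extra block
```

Because the authentication flavor and COMPOUND contents make the header length variable, a client would pad to the negotiated size optimistically, as the text recommends.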
6.9. NFSv4 Integration It is permissible for connections of multiple types to be bound to
the same channel. For example, a TCP and RDMA connection can be bound
to the operation channel. In the event an RDMA and non-RDMA
connection are bound to the same channel, the maximum number of slots
must be at least one more than the total number of credits. This way
if all RDMA credits are used, the non-RDMA connection can have at
least one outstanding request.
The following section discusses the integration of the session It is permissible for a connection of one type to be bound to the
infrastructure into NFSv4.1 operation channel, and another type bound to the backchannel.
6.9.1. Minor Versioning 2.9.3.4.1. Trunking
Minor versioning of NFSv4 is relatively restrictive, and allows for Since multiple connections can be bound to a session's channel, this
tightly limited changes only. In particular, it does not permit means that traffic between an NFSv4.1 client and server channel goes
adding new "procedures" (it permits adding only new "operations"). over all connections. If the connections are over different network
Interoperability concerns make it impossible to consider additional paths, this is trunking. NFSv4.1 allows trunking, thus allowing the
layering to be a minor revision. This somewhat limits the changes bandwidth capacity to scale with the number of connections.
that can be introduced when considering extensions.
To support the duplicate request cache integrated with sessions and At issue is how NFSv4.1 clients and servers discover and verify
request control, it is desirable to tag each request with an multiple paths. On the client side, each client should be aware of
identifier to be called a Slotid. This identifier must be passed by the network interfaces it has available from which to create
NFSv4.1 when running atop any transport, including traditional TCP. connections. However, the client cannot always be certain whether a
Therefore it is not desirable to add the Slotid to a new RPC server's multitide of network interfaces in fact belong to the same
transport, even though such a transport is indicated for support of server, or even if they do, whether the server is prepared to share a
RDMA. This specification and [RPCRDMA] do not specify such an clientid or sessionid across all its interfaces. NFSv4.1 provides no
approach. discovery protocol for allowing servers to advertise multiple network
interfaces; such a protocol is problematic because network address
translation (NAT) may be occurring between the client and server, and
so, unless the NAT devices are inspecting NFSv4.1 traffic, the
network addresses the server offers to the client would be
meaningless. At best, short of manual configuration, an NFSv4.1
client could use a host name to network address directory (e.g. DNS)
to enumerate a server's network interfaces. This then leaves the
problem of verification.
Instead, this specification conforms to the requirements of NFSv4 NFSv4.1 provides a way for clients and servers to reliably verify if
minor versioning, through the use of a new operation within NFSv4 connections between different network paths are in fact bound to the
COMPOUND procedures as detailed below. same NFSv4.1 server. The SET_SSV (Section 16.47) operation allows a
client and server to establish a unique, shared key value (the SSV).
When a new connection is bound to the session (via the
BIND_CONN_TO_SESSION operation, see Section 16.34), the client must
offer a digest that based on the SSV. If the client mistakenly tries
to bind a connection to a session of a wrong server, the server will
either reject the attempt because it is not aware of the session
identifier of the BIND_CONN_TO_SESSION arguments, or it will reject
the attempt because the digest for the SSV does not match what the
server expects. Even if the server mistakenly or maliciously accept
the connection bind attempt, the digest it computes in the response
will not be verified by the client, the client will know it cannot
use the connection for trunking the specified channel.
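The SSV verification above can be sketched with a keyed digest: both peers hold the shared SSV, and each proves knowledge of it over the bind arguments. HMAC-SHA256 is an illustrative choice here; the protocol's actual digest computation and argument encoding are defined by SET_SSV and BIND_CONN_TO_SESSION, not by this sketch.

```python
# Sketch: verifying that a connection reaches the same server, using a
# digest keyed by the SSV established via SET_SSV. Illustrative only.

import hmac, hashlib, os

ssv = os.urandom(32)                   # shared key value known to both peers

def bind_digest(ssv, sessionid, nonce):
    return hmac.new(ssv, sessionid + nonce, hashlib.sha256).digest()

sessionid = os.urandom(16)
nonce = os.urandom(8)

# Client offers its digest with BIND_CONN_TO_SESSION.
client_digest = bind_digest(ssv, sessionid, nonce)

# The genuine server, holding the same SSV, computes a matching value.
assert hmac.compare_digest(client_digest, bind_digest(ssv, sessionid, nonce))

# A wrong server (different SSV) cannot produce a matching digest, so
# the client knows not to use that connection for trunking.
wrong = bind_digest(os.urandom(32), sessionid, nonce)
assert not hmac.compare_digest(client_digest, wrong)
```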
If sessions are in use for a given clientid, this same clientid 2.9.4. Exactly Once Semantics
cannot be used for non-session NFSv4 operation, including NFSv4.0.
Because the server will have allocated session-specific state to the
active clientid, it would be an unnecessary burden on the server
implementor to support and account for additional, non- session
traffic, in addition to being of no benefit. Therefore this
specification prohibits a single clientid from doing this.
Nevertheless, employing a new clientid for such traffic is supported.
6.9.2. Slot Identifiers and Server Duplicate Request Cache Via the session, NFSv4.1 offers exactly once semantics (EOS) for
requests sent over a channel. EOS is supported on both the operation
and back channels.
The presence of deterministic maximum request limits on a session 2.9.4.1. Slot Identifiers and Reply Cache
enables in-progress requests to be assigned unique values with useful
properties.
The RPC layer provides a transaction ID (xid), which, while required The RPC layer provides a transaction ID (xid), which, while required
to be unique, is not especially convenient for tracking requests. to be unique, is not especially convenient for tracking requests.
The transaction ID is only meaningful to the issuer (client), it The xid is only meaningful to the requester; it cannot be interpreted
cannot be interpreted at the server except to test for equality with at the replier except to test for equality with previously issued
previously issued requests. Because RPC operations may be completed requests. Because RPC operations may be completed by the replier in
by the server in any order, many transaction IDs may be outstanding any order, many transaction IDs may be outstanding at any time. The
at any time. The client may therefore perform a computationally requester may therefore perform a computationally expensive lookup
expensive lookup operation in the process of demultiplexing each operation in the process of demultiplexing each reply.
reply.
In the specification, there is a limit to the number of active In NFSv4.1, there is a limit to the number of active requests.
requests. This immediately enables a convenient, computationally This immediately enables a computationally efficient index for each
efficient index for each request which is designated as a Slot request which is designated as a Slot Identifier, or slotid.
Identifier, or slotid.
When the requester issues a new request, it selects a slotid in the
range 0..N-1, where N is the replier's current "totalrequests" limit
granted to the requester on the session over which the request is to
be issued.  The slotid must be unused by any of the requests which
the requester has already active on the session.  "Unused" here means
the requester has no outstanding request for that slotid.  Because
the slotid is always an integer in the range 0..N-1, requester
implementations can use the slotid from a replier response to
efficiently match responses with outstanding requests, such as, for
example, by using the slotid to index into an outstanding request
array.  This can be used to avoid expensive hashing and lookup
functions in the performance-critical receive path.
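The indexing scheme above can be sketched as follows.  This is an
illustrative model only, not from the specification; the class and
method names are hypothetical.

```python
# Sketch: matching replies to outstanding requests by indexing a
# fixed-size array with the slotid, instead of hashing on the RPC xid.
class SlotTable:
    def __init__(self, total_requests):
        # One entry per slotid in the range 0..N-1 granted by the replier.
        self.outstanding = [None] * total_requests

    def send(self, slotid, request):
        assert self.outstanding[slotid] is None, "slot already in use"
        self.outstanding[slotid] = request

    def match_reply(self, slotid):
        # O(1) demultiplexing: the slotid in the reply indexes directly
        # into the outstanding-request array.
        request = self.outstanding[slotid]
        self.outstanding[slotid] = None
        return request

table = SlotTable(4)
table.send(2, "OPEN compound")
assert table.match_reply(2) == "OPEN compound"
```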
The sequenceid, which accompanies the slotid in each request, is
important for a second check at the replier: it must be possible to
determine efficiently whether a request using a certain slotid is a
retransmit or a new, never-before-seen request.  It is not feasible
for the client to assert that it is retransmitting to implement this,
because for any given request the client cannot know the server has
seen it unless the server actually replies.  Of course, if the client
has seen the server's reply, the client would not retransmit.
The sequenceid MUST increase monotonically for each new transmit of a
given slotid, and MUST remain unchanged for any retransmission.  The
server must in turn compare each newly received request's sequenceid
with the last one previously received for that slotid, to see if the
new request is:
o  A new request, in which the sequenceid is one greater than that
   previously seen in the slot (accounting for sequence wraparound).
   The replier proceeds to execute the new request.
o  A retransmitted request, in which the sequenceid is equal to that
   last seen in the slot.  Note that this request may be either
   complete, or in progress.  The replier performs replay processing
   in these cases.
o  A misordered replay, in which the sequenceid is less than
   (accounting for sequence wraparound) that previously seen in the
   slot.  The replier MUST return NFS4ERR_SEQ_MISORDERED (as the
   result from SEQUENCE or CB_SEQUENCE).
o  A misordered new request, in which the sequenceid is two or more
   greater than (accounting for sequence wraparound) that previously
   seen in the slot.  Note that because the sequenceid must wrap
   around once it reaches 0xFFFFFFFF, a misordered new request and a
   misordered replay cannot be distinguished.  Thus, the replier MUST
   return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or
   CB_SEQUENCE).
Unlike the XID, the slotid is always within a specific range; this
has two implications.  The first implication is that for a given
session, the replier need only cache the results of a limited number
of COMPOUND requests.  The second implication derives from the first:
unlike XID-indexed reply caches (also known as duplicate request
caches - DRCs), the slotid-based reply cache cannot be overflowed.
Through use of the sequenceid to identify retransmitted requests, the
replier does not need to actually cache the request itself, reducing
the storage requirements of the reply cache further.  These new
facilities make it practical to maintain all the required entries for
an effective reply cache.
The slotid and sequenceid therefore take over the traditional role of
the XID and port number in the replier reply cache implementation,
and the session replaces the IP address.  This approach is
considerably more portable and completely robust - it is not subject
to the frequent reassignment of ports as clients reconnect over IP
networks.  In addition, the RPC XID is not used in the reply cache,
enhancing robustness of the cache in the face of any rapid reuse of
XIDs by the client.  [[Comment.3: We need to discuss the requirements
of the client for changing the XID.]]
It is required to encode the slotid information into each request in
a way that does not violate the minor versioning rules of the NFSv4.0
specification.  This is accomplished here by encoding it in the
SEQUENCE operation within each NFSv4.1 COMPOUND and CB_COMPOUND
procedure.  The operation easily piggybacks within existing messages.
[[Comment.4: Need a better term than piggyback]]
In general, the receipt of a new sequenced request arriving on any
valid slot is an indication that the previous reply cache contents of
that slot may be discarded.  In order to further assist the replier
in slot management, the requester is required to use the lowest
available slot when issuing a new request.  In this way, the replier
may be able to retire additional entries.
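The lowest-available-slot rule above amounts to a simple linear scan
on the requester's side.  A minimal sketch (the function name is
hypothetical):

```python
# Choose the lowest available slot, as required of the requester, so
# the replier can retire higher-numbered cache entries.
def lowest_free_slot(in_use):
    # in_use is the set of slotids with outstanding requests.
    slotid = 0
    while slotid in in_use:
        slotid += 1
    return slotid

assert lowest_free_slot({0, 1, 3}) == 2
assert lowest_free_slot(set()) == 0
```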
However, in the case where the replier is actively adjusting its
granted maximum request count to the requester, it may not be able to
use receipt of the slotid to retire cache entries.  The slotid used
in an incoming request may not reflect the server's current idea of
the requester's session limit, because the request may have been sent
from the requester before the update was received.  Therefore, in the
downward adjustment case, the replier may have to retain a number of
reply cache entries at least as large as the old value, until
operation sequencing rules allow it to infer that the requester has
seen its reply.
The SEQUENCE (and CB_SEQUENCE) operation also carries a "maxslot"
value which carries additional client slot usage information.  The
requester must always provide its highest-numbered outstanding slot
value in the maxslot argument, and the replier may reply with a new
recognized value.  The requester should in all cases provide the most
conservative value possible, although it can be increased somewhat
above the actual instantaneous usage to maintain some minimum or
optimal level.  This provides a way for the requester to yield unused
request slots back to the replier, which in turn can use the
information to reallocate resources.  Obviously, maxslot can never be
zero, or the session would deadlock.
The replier also provides a target maxslot value to the requester,
which is an indication to the requester of the maxslot the replier
wishes the requester to be using.  This permits the server to
withdraw (or add) resources from a requester that has been found to
not be using them, in order to more fairly share resources among a
varying level of demand from other requesters.  The requester must
always comply with the replier's value updates, since they indicate
newly established hard limits on the requester's access to session
resources.  However, because of request pipelining, the requester may
have active requests in flight reflecting prior values; therefore,
the replier must not immediately require the requester to comply.
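One way to model the maxslot exchange above is sketched below.  All
names are hypothetical; this illustrates only the idea that the
replier's target is a hint the requester converges to, while requests
already in flight may still reflect the prior, larger limit.

```python
# Requester-side maxslot bookkeeping for the SEQUENCE exchange.
def requester_maxslot(outstanding_slotids):
    # Most conservative value: the highest slot actually in use.
    return max(outstanding_slotids) if outstanding_slotids else 0

def next_usable_limit(target_maxslot, outstanding_slotids):
    # Shrink toward the replier's target, but never below a slot that
    # still has a request in flight, and never to zero (deadlock).
    floor = requester_maxslot(outstanding_slotids)
    return max(1, target_maxslot, floor)

assert requester_maxslot({0, 3, 5}) == 5
# A target of 2 cannot be honored yet while slot 5 is outstanding:
assert next_usable_limit(2, {0, 3, 5}) == 5
# Once the high slots drain, the requester complies with the target:
assert next_usable_limit(2, {0, 1}) == 2
```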
2.9.4.1.1.  Errors from SEQUENCE and CB_SEQUENCE

Any time SEQUENCE or CB_SEQUENCE returns an error, the sequenceid of
the slot MUST NOT change.  The replier MUST NOT modify the reply
cache entry for the slot whenever an error is returned from SEQUENCE
or CB_SEQUENCE.
2.9.4.1.2. Optional Reply Caching
On a per-request basis the requester can choose to direct the replier
to cache the reply to all operations after the first operation
(SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis
fields of the arguments to SEQUENCE or CB_SEQUENCE. The reason it
would not direct the replier to cache the entire reply is that the
request is composed of all idempotent operations [20]. Caching the
reply may offer little benefit, and if the reply is too large (see
Section 2.9.4.4), it may not be cacheable anyway.
Whether the requester requests the reply to be cached or not has no
effect on the slot processing. If the results of SEQUENCE or
CB_SEQUENCE are NFS4_OK, then the slot's sequenceid MUST be
incremented by one.  If a requester does not direct the replier to
cache the reply, the replier MUST do one of the following:
o The replier can cache the entire original reply. Even though
sa_cachethis or csa_cachethis are FALSE, the replier is always
free to cache. It may choose this approach in order to simplify
implementation.
o The replier enters into its reply cache a reply consisting of the
original results to the SEQUENCE or CB_SEQUENCE operation,
followed by the error NFS4ERR_RETRY_UNCACHED_REP. Thus when the
requester later retries the request, it will get
NFS4ERR_RETRY_UNCACHED_REP.
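The replier's two options above can be sketched as follows.  The
function and structures are illustrative only, not from the protocol
XDR:

```python
# Build the reply cache entry for a slot when cachethis is FALSE.
def cache_entry(sequence_result, full_reply, cachethis,
                always_cache=False):
    if cachethis or always_cache:
        # Option 1: cache the entire reply (the replier is always free
        # to do this, e.g. to simplify its implementation).
        return full_reply
    # Option 2: cache only the SEQUENCE/CB_SEQUENCE result, followed by
    # NFS4ERR_RETRY_UNCACHED_REP for the remaining operations.
    return [sequence_result, "NFS4ERR_RETRY_UNCACHED_REP"]

reply = ["SEQUENCE ok", "PUTFH ok", "READ ok"]
assert cache_entry("SEQUENCE ok", reply, cachethis=True) == reply
assert cache_entry("SEQUENCE ok", reply, cachethis=False) == \
    ["SEQUENCE ok", "NFS4ERR_RETRY_UNCACHED_REP"]
```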
2.9.4.1.3. Multiple Connections and Sharing the Reply Cache
Multiple connections can be bound to a session's channel, hence the
connections share the same table of slotids. For connections over
non-RDMA transports like TCP, there are no particular considerations.
Considerations for multiple RDMA connections sharing a slot table are
discussed in Section 2.9.5.1. [[Comment.5: Also need to discuss when
RDMA and non-RDMA share a slot table.]]
2.9.4.2. Retry and Replay
A client MUST NOT retry a request, unless the connection it used to
send the request disconnects. The client can then reconnect and
resend the request, or it can resend the request over a different
connection. In the case of the server resending over the
backchannel, it cannot reconnect, and either resends the request over
another connection that the client has bound to the backchannel, or
if there is no other backchannel connection, waits for the client to
bind a connection to the backchannel.
A client MUST wait for a reply to a request before using the slot for
another request. If it does not wait for a reply, then the client
does not know what sequenceid to use for the slot on its next
request. For example, suppose a client sends a request with
sequenceid 1, and does not wait for the response. The next time it
uses the slot, it sends the new request with sequenceid 2. If the
server has not seen the request with sequenceid 1, then the server is
not expecting sequenceid 2, and rejects the client's new request with
NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).
RDMA fabrics do not guarantee that the memory handles (Steering Tags)
within each RDMA three-tuple are valid on a scope [[Comment.6: What
is a three-tuple?]] outside that of a single connection. Therefore,
handles used by the direct operations become invalid after connection
loss. The server must ensure that any RDMA operations which must be
replayed from the reply cache use the newly provided handle(s) from
the most recent request.
2.9.4.3. Resolving server callback races with sessions
It is possible for server callbacks to arrive at the client before
the reply from related forward channel operations.  For example, a
client may have been granted a delegation to a file it has opened,
but the reply to the OPEN (informing the client of the granting of
the delegation) may be delayed in the network.  If a conflicting
operation arrives at the server, it will recall the delegation using
the callback channel, which may be on a different transport
connection, perhaps even a different network.  In NFSv4.0, if the
callback request arrives before the related reply, the client may
reply to the server with an error.
The presence of a session between client and server alleviates this
issue.  When a session is in place, each client request is uniquely
identified by its { slotid, sequenceid } pair.  By the rules under
which slot entries (reply cache entries) are retired, the server has
knowledge whether the client has "seen" each of the server's replies.
The server can therefore provide sufficient information to the client
to allow it to disambiguate between an erroneous or conflicting
callback and a race condition.
For each client operation which might result in some sort of server
callback, the server should "remember" the { slotid, sequenceid }
pair of the client request until the slotid retirement rules allow
the server to determine that the client has, in fact, seen the
server's reply.  Until the time the { slotid, sequenceid } request
pair can be retired, any recalls of the associated object MUST carry
an array of these referring identifiers (in the CB_SEQUENCE
operation's arguments), for the benefit of the client.  After this
time, it is not necessary for the server to provide this information
in related callbacks, since it is certain that a race condition can
no longer occur.
The CB_SEQUENCE operation which begins each server callback carries a
list of "referring" { slotid, sequenceid } tuples.  If the client
finds the request corresponding to the referring slotid and
sequenceid to be currently outstanding (i.e. the server's reply has
not been seen by the client), it can determine that the callback has
raced the reply, and act accordingly.
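The client-side race check described above reduces to a membership
test against the outstanding-request table.  An illustrative sketch
(names are hypothetical):

```python
# If a referring { slotid, sequenceid } pair in CB_SEQUENCE matches a
# request still outstanding, the callback has raced the reply.
def callback_races_reply(referring_pairs, outstanding):
    # outstanding maps slotid -> sequenceid of the request in flight.
    return any(outstanding.get(slotid) == seqid
               for slotid, seqid in referring_pairs)

outstanding = {0: 7, 2: 41}          # e.g. the OPEN is still in flight
assert callback_races_reply([(2, 41)], outstanding) is True
assert callback_races_reply([(1, 9)], outstanding) is False
```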
The client must not simply wait forever for the expected server reply
to arrive on any of the session's operations channels, because it is
possible that they will be delayed indefinitely.  However, it should
wait for a period of time, and if the time expires it can provide a
more meaningful error such as NFS4ERR_DELAY.
[[Comment.7: We need to consider the clients' options here, and
describe them...  NFS4ERR_DELAY has been discussed as a legal reply
to CB_RECALL?]]
There are other scenarios under which callbacks may race replies,
among them pnfs layout recalls, described in Section 12.3.5.3.
[[Comment.8: fill in the blanks w/others, etc...]]
2.9.4.4.  COMPOUND and CB_COMPOUND Construction Issues

Very large requests and replies may pose both buffer management
issues (especially with RDMA) and reply cache issues.  When the
session is created (Section 16.36), the client and server negotiate
the maximum sized request they will send or process
(ca_maxrequestsize), the maximum sized reply they will return or
process (ca_maxresponsesize), and the maximum sized reply they will
store in the reply cache (ca_maxresponsesize_cached).
If a request exceeds ca_maxrequestsize, the reply will have the
status NFS4ERR_REQ_TOO_BIG.  A replier may return NFS4ERR_REQ_TOO_BIG
as the status for the first operation (SEQUENCE or CB_SEQUENCE) in
the request, or it may choose to return it on a subsequent operation.
If a reply exceeds ca_maxresponsesize, the reply will have the status
NFS4ERR_REP_TOO_BIG.  A replier may return NFS4ERR_REP_TOO_BIG as the
status for the first operation (SEQUENCE or CB_SEQUENCE) in the
request, or it may choose to return it on a subsequent operation.
If sa_cachethis or csa_cachethis are TRUE, then the replier MUST
cache a reply except if an error is returned by the SEQUENCE or
CB_SEQUENCE operation (see Section 2.9.4.1.1).  If the reply exceeds
ca_maxresponsesize_cached (and sa_cachethis or csa_cachethis are
TRUE), then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE.
Even if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that
matter) is returned on an operation other than the first operation
(SEQUENCE or CB_SEQUENCE), the reply MUST be cached if sa_cachethis
or csa_cachethis are TRUE.  For example, if a COMPOUND has eleven
operations, including SEQUENCE, the fifth operation is a RENAME, and
the tenth operation is a READ for one million bytes, the server may
return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation.  Since
the server executed several operations, especially the non-idempotent
RENAME, the client's request to cache the reply needs to be honored
in order for correct operation of exactly once semantics.  If the
client retries the request, the server will have cached a reply that
contains results for ten of the eleven requested operations, with the
tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE.
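The negotiated size checks above can be summarized in a short
decision function.  This is a sketch under the assumption of simple
byte counts; the function and parameter names are illustrative:

```python
# Decide the status for a reply given the session's negotiated limits.
def check_reply(reply_size, cachethis,
                maxresponsesize, maxresponsesize_cached):
    if reply_size > maxresponsesize:
        return "NFS4ERR_REP_TOO_BIG"
    if cachethis and reply_size > maxresponsesize_cached:
        # The reply is still cached, with this status on the offending
        # operation, so a retry still sees a cached result.
        return "NFS4ERR_REP_TOO_BIG_TO_CACHE"
    return "NFS4_OK"

assert check_reply(900, False, 1024, 512) == "NFS4_OK"
assert check_reply(900, True, 1024, 512) == "NFS4ERR_REP_TOO_BIG_TO_CACHE"
assert check_reply(2048, True, 1024, 512) == "NFS4ERR_REP_TOO_BIG"
```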
A client needs to take care that when sending operations that change
the current filehandle (except for PUTFH, PUTPUBFH, and PUTROOTFH)
it not exceed the maximum reply buffer before the GETFH operation.
Otherwise the client will have to retry the operation that changed
the current filehandle, in order to obtain the desired filehandle.
For the OPEN operation (see Section 16.16), retry is not always
available as an option.  The following guidelines for the handling of
filehandle changing operations are advised:
o  A client SHOULD issue GETFH immediately after a current filehandle
   changing operation.  This is especially important after any
   current filehandle changing non-idempotent operation.  It is
   critical to issue GETFH immediately after OPEN.

o  A server MAY return NFS4ERR_REP_TOO_BIG or
   NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a
   filehandle changing operation if the reply would be too large on
   the next operation.
o  A server SHOULD return NFS4ERR_REP_TOO_BIG or
   NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a
   filehandle changing non-idempotent operation if the reply would be
   too large on the next operation, especially if the operation is
   OPEN.
o  A server MAY return NFS4ERR_UNSAFE_COMPOUND if it looks at the
   next operation after a non-idempotent current filehandle changing
   operation, and finds it is not GETFH.  The server would do this if
   it is unable to determine in advance whether the total response
   size would exceed ca_maxresponsesize_cached or ca_maxresponsesize.
2.9.4.5.  Persistence

Since the reply cache is bounded, it is practical for the server
reply cache to persist across server reboots, and to be kept in
stable storage (a client's reply cache for callbacks need not persist
across client reboots unless the client intends for its session and
other state to persist across reboots).
A persistent reply cache needs to retain the following information:

o  The slot table including the sequenceid and cached reply for each
   slot.

o  The sessionid.

o  The clientid.

o  The SSV (see Section 2.9.6.3).
The CREATE_SESSION (see Section 16.36) operation determines the
persistence of the reply cache.
2.9.5. RDMA Considerations
A complete discussion of the operation of RPC-based protocols atop
RDMA transports is in [RPCRDMA]. A discussion of the operation of
NFSv4, including NFSv4.1 over RDMA is in [NFSDDP]. Where RDMA is
considered, this specification assumes the use of such a layering; it
addresses only the upper layer issues relevant to making best use of
RPC/RDMA.
2.9.5.1. RDMA Connection Resources
RDMA requires its consumers to register memory and post buffers of a
specific size and number for receive operations.
Registration of memory can be a relatively high-overhead operation,
since it requires pinning of buffers, assignment of attributes (e.g.
readable/writable), and initialization of hardware translation.
Preregistration is desirable to reduce overhead. These registrations
are specific to hardware interfaces and even to RDMA connection
endpoints; therefore, negotiation of their limits is desirable to
manage resources effectively.
Following the basic registration, these buffers must be posted by the
RPC layer to handle receives. These buffers remain in use by the
RPC/NFSv4 implementation; the size and number of them must be known
to the remote peer in order to avoid RDMA errors which would cause a
fatal error on the RDMA connection.
NFSv4.1 manages slots as resources on a per session basis (see
Section 2.9), while RDMA connections manage credits on a per
connection basis. This means that in order for a peer to send data
over RDMA to a remote buffer, it has to have both an NFSv4.1 slot,
and an RDMA credit.
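The dual-resource rule above can be illustrated with a small sketch
(class and field names are hypothetical, not protocol-defined): a
request may be transmitted only when both a session slot and an RDMA
credit are available.

```python
class Sender:
    """Sketch: a request needs BOTH an NFSv4.1 session slot (per
    session) and an RDMA credit (per connection) before it can be
    sent.  Counts here are illustrative, not normative values."""

    def __init__(self, session_slots, rdma_credits):
        self.session_slots = session_slots  # per-session resource
        self.rdma_credits = rdma_credits    # per-connection resource

    def try_send(self):
        # Both resources must be available; otherwise the request waits.
        if self.session_slots > 0 and self.rdma_credits > 0:
            self.session_slots -= 1
            self.rdma_credits -= 1
            return True
        return False

sender = Sender(session_slots=2, rdma_credits=1)
print(sender.try_send())  # True: one slot and one credit consumed
print(sender.try_send())  # False: a slot remains, but no RDMA credit
```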
2.9.5.2. Flow Control
NFSv4.0 and all previous versions do not provide for any form of flow
control; instead they rely on the windowing provided by transports
like TCP to throttle requests. This does not work with RDMA, which
provides no operation flow control and will terminate a connection in
error when limits are exceeded. Limits such as maximum number of
requests outstanding are therefore negotiated when a session is
created (see the ca_maxrequests field in Section 16.36). These
limits then provide the maxima each session's channels' connections
must operate within. RDMA connections are managed within these
limits as described in section 3.3 of [RPCRDMA]; if there are
multiple RDMA connections, then the maximum requests for a channel
will be divided among the RDMA connections. The limits may also be
modified dynamically at the server's choosing by manipulating certain
parameters present in each NFSv4.1 request. In addition, the
CB_RECALL_SLOT callback operation (see Section 18.8) can be issued by
a server to a client to return RDMA credits to the server, thereby
lowering the maximum number of requests a client can have outstanding
to the server.
2.9.5.3. Padding
Header padding is requested by each peer at session initiation (see
the csa_headerpadsize argument to CREATE_SESSION in Section 16.36),
and subsequently used by the RPC RDMA layer, as described in
[RPCRDMA]. Zero padding is permitted.
Padding leverages the useful property that RDMA receives preserve
alignment of data, even when they are placed into anonymous
(untagged) buffers. If requested, client inline writes will insert
appropriate pad bytes within the request header to align the data
payload on the specified boundary. The client is encouraged to add
sufficient padding (up to the negotiated size) so that the "data"
field of the NFSv4.1 WRITE operation is aligned. Most servers can
make good use of such padding, which allows them to chain receive
buffers in such a way that any data carried by client requests will
be placed into appropriate buffers at the server, ready for file
system processing. The receiver's RPC layer encounters no overhead
from skipping over pad bytes, and the RDMA layer's high performance
makes the insertion and transmission of padding on the sender a
significant optimization. In this way, the need for servers to
perform RDMA Read to satisfy all but the largest client writes is
obviated. An added benefit is the reduction of message round trips
on the network - a potentially good trade, where latency is present.
The value to choose for padding is subject to a number of criteria.
A primary source of variable-length data in the RPC header is the
authentication information, the form of which is client-determined,
possibly in response to server specification. The contents of
COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all
go into the determination of a maximal NFSv4 request size and
therefore minimal buffer size. The client must select its offered
value carefully, so as not to overburden the server, and vice versa.
The payoff of an appropriate padding value is higher performance.
Sender gather:
|RPC Request|Pad bytes|Length| -> |User data...|
\------+---------------------/ \
\ \
\ Receiver scatter: \-----------+- ...
/-----+----------------\ \ \
|RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->...
In the above case, the server may recycle buffers left unused by the
actual received request to the next posted receive, or may pass the
now-complete buffers by reference for normal write processing.
For a server which can make use of it, this removes any need for data
copies of incoming data, without resorting to complicated end-to-end
buffer advertisement and management. This includes most kernel-based
and integrated server designs, among many others. The client may
perform similar optimizations, if desired.
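The padding arithmetic implied above can be sketched as follows (a
hedged illustration: the function name and example header lengths are
assumptions, and the boundary corresponds to a negotiated
csa_headerpadsize-style value, where zero padding is permitted).

```python
def pad_bytes(header_len, boundary):
    """Pad bytes needed so the WRITE "data" field that follows a
    variable-length RPC header starts on the negotiated alignment
    boundary.  A boundary of 0 means no padding is used."""
    if boundary == 0:
        return 0
    return (boundary - header_len % boundary) % boundary

print(pad_bytes(388, 512))  # 124 pad bytes align the payload at 512
print(pad_bytes(512, 512))  # 0: the payload is already aligned
```

Since the header length varies with authentication data and COMPOUND
contents, the client computes this per request against the negotiated
boundary.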
2.9.5.4. Dual RDMA and Non-RDMA Transports
Some RDMA transports (for example, see [RDDP] [[Comment.9: need
xref]]) require a "streaming" (non-RDMA) phase, where ordinary traffic
might flow before "stepping" up to RDMA mode, commencing RDMA
traffic. Some RDMA transports start connections always in RDMA mode.
NFSv4.1 allows, but does not assume, a streaming phase before RDMA
mode. When a connection is bound to a session, the client and server
negotiate whether the connection is used in RDMA or non-RDMA mode
(see Section 16.36 and Section 16.34).
2.9.6. Sessions Security
2.9.6.1. Session Callback Security
The session connection binding improves security over that provided
by NFSv4.0 for the callback channel. The connection is client-
initiated (see Section 16.34), and subject to the same firewall and
routing checks as the operations channel. The connection cannot be
hijacked by an attacker who connects to the client port prior to the
intended server.  At the client's option (see Section 16.36), the
binding is fully authenticated before being activated (see
Section 16.34).
Traffic from the server over the callback channel is authenticated
exactly as the client specifies (see Section 2.9.6.2).
2.9.6.2. Backchannel RPC Security
When the NFSv4.1 client establishes the backchannel, it informs the
server what security flavors and principals it must use when sending
requests over the backchannel. If the security flavor is RPCSEC_GSS,
the client expresses the principal in the form of an established
RPCSEC_GSS context. The server is free to use any flavor/principal
combination the server offers, but MUST NOT use unoffered
combinations.
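A server-side check of this rule might look like the following sketch
(the data shapes are hypothetical; the protocol expresses RPCSEC_GSS
principals as established contexts, abbreviated here to strings):

```python
# Sketch: the server may use any flavor/principal combination the
# client offered for the backchannel, but MUST NOT use others.
offered = {("RPCSEC_GSS", "ctx-1"), ("AUTH_SYS", "root")}  # from client

def may_use(flavor, principal):
    # Reject any combination the client did not offer.
    return (flavor, principal) in offered

print(may_use("RPCSEC_GSS", "ctx-1"))  # True: offered by the client
print(may_use("RPCSEC_GSS", "ctx-9"))  # False: unoffered combination
```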
This way, the client does not have to provide a target GSS principal
as it did with NFSv4.0, and the server does not have to implement an
RPCSEC_GSS initiator as it did with NFSv4.0. [[Comment.10: xrefs]]
The CREATE_SESSION (Section 16.36) and BACKCHANNEL_CTL
(Section 16.33) operations allow the client to specify flavor/
principal combinations.
2.9.6.3. Protection from Unauthorized State Changes
Under some conditions, NFSv4.0 is vulnerable to a denial of service
issue with respect to its state management.

The attack works via an unauthorized client faking an open_owner4, an
open_owner/lock_owner pair, or stateid, combined with a seqid.  The
operation is sent to the NFSv4 server.  The NFSv4 server accepts the
state information, and as long as any status code from the result of
this operation is not NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID,
NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR,
skipping to change at page 50, line 9
denial of service attack.

If the client uses RPCSEC_GSS authentication and integrity, and every
client maps each open_owner and lock_owner to one and only one
principal, and the server enforces this binding, then the conditions
leading to vulnerability to the denial of service do not exist.  One
should keep in mind that if AUTH_SYS is being used, far simpler
denial of service and other attacks are possible.

With NFSv4.1 sessions, the per-operation sequence number is ignored
(see Section 8.13); therefore the NFSv4.0 denial of service
vulnerability described above does not apply.  However, as described
to this point in the specification, an attacker could forge the
sessionid and issue a SEQUENCE with a slotid that he expects the
legitimate client to use next.  The legitimate client could then use
the slotid with the same sequence number, and the server returns the
attacker's result from the replay cache, thereby disrupting the
legitimate client.

If we give each NFSv4.1 user their own session, and each user uses
RPCSEC_GSS authentication and integrity, then the denial of service
issue is solved, at the cost of additional per-session state.  The
alternative NFSv4.1 specifies is described as follows.

Transport connections MUST be bound to a session by the client.  The
server MUST return an error to an operation (other than the operation
that binds the connection to the session) that uses an unbound
connection.  As a simplification, the transport connection used by
CREATE_SESSION (see Section 16.36) is automatically bound to the
session.  Additional connections are bound to a session via
BIND_CONN_TO_SESSION (see Section 16.34).
To prevent attackers from issuing BIND_CONN_TO_SESSION operations,
the arguments to BIND_CONN_TO_SESSION include a digest of a shared
secret called the secret session verifier (SSV) that only the client
and server know.  The digest is created via a one-way,
collision-resistant hash function, making it intractable for the
attacker to forge.

The SSV is sent to the server via SET_SSV (see Section 16.47).  To
prevent eavesdropping, a SET_SSV for the SSV SHOULD be protected via
RPCSEC_GSS with the privacy service.  The SSV can be changed by the
client at any time, by any principal.  However, several aspects of
SSV changing prevent an attacker from engaging in a successful denial
of service attack:

o  A SET_SSV on the SSV does not replace the SSV with the argument to
   SET_SSV.  Instead, the current SSV on the server is logically
   exclusive ORed (XORed) with the argument to SET_SSV.  SET_SSV MUST
   NOT be called with an SSV value that is zero.

o  The arguments to and results of SET_SSV include digests of the old
   and new SSV, respectively.

o  Because the initial value of the SSV is zero and therefore known,
   a client that opts for connection binding enforcement MUST issue
   at least one SET_SSV operation before the first
   BIND_CONN_TO_SESSION operation.  A client SHOULD issue SET_SSV as
   soon as a session is created.
If a connection is disconnected, BIND_CONN_TO_SESSION is required to
bind a connection to the session, even if the connection that was
disconnected was the one over which CREATE_SESSION was issued.

If a client is assigned a machine principal, then the client SHOULD
use the machine principal's RPCSEC_GSS context to privacy protect the
SSV from eavesdropping during the SET_SSV operation.  If a machine
principal is not being used, then the client MAY use the non-machine
principal's RPCSEC_GSS context to privacy protect the SSV.  The
server MUST accept either type of principal.  A client SHOULD change
the SSV each time a new principal uses the session.

Here are the types of attacks that can be attempted by an attacker
named Eve, and how the connection to session binding approach
addresses each attack:

o  If Eve creates a connection after the legitimate client
   establishes an SSV via privacy protection from a machine
   principal's RPCSEC_GSS session, she does not know the SSV and so
   cannot compute a digest that BIND_CONN_TO_SESSION will accept.
   Users on the legitimate client cannot be disrupted by Eve.
o  If Eve is the first one to log into the legitimate client, and the
   client does not use machine principals, then Eve can cause an SSV
   to be created via the legitimate client's NFSv4.1 implementation,
   protected by the RPCSEC_GSS context created by the legitimate
   client (which uses Eve's GSS principal and credentials).  Eve can
   then eavesdrop on the network, and because she knows her
   credentials, she can decrypt the SSV.  Eve can compute a digest
   BIND_CONN_TO_SESSION will accept, and so bind a new connection to
   the session.  Eve can change the slotid, sequence state, and/or
   the SSV state in such a way that when Bob accesses the server via
   the legitimate client, the legitimate client will be unable to use
   the session.

   The client's only recourse is to create a new session, which will
   cause any state Eve created on the legitimate client over the old
   (but hijacked) session to be lost.  This disrupts Eve, but because
   she is the attacker, this is acceptable.

   Once the legitimate client establishes an SSV over the new session
   using Bob's RPCSEC_GSS context, Eve can use the new session via
   the legitimate client, but she cannot disrupt Bob.  Moreover,
   because the client SHOULD have modified the SSV due to Eve using
   the new session, Bob cannot get revenge on Eve by binding a rogue
   connection to the session.

   The question is how does the legitimate client detect that Eve has
   hijacked the old session?  When the client detects that a new
   principal, Bob, wants to use the session, it SHOULD have issued a
   SET_SSV.
   *  Let us suppose that from the rogue connection, Eve issued a
      SET_SSV with the same slotid and sequence that the legitimate
      client later uses.  The server will assume this is a replay,
      and return to the legitimate client the reply it sent Eve.
      However, unless Eve can correctly guess the SSV the legitimate
      client will use, the digest verification checks in the SET_SSV
      response will fail.  That is the clue to the client that the
      session has been hijacked.
skipping to change at page 53, line 10
   and therefore known, Eve can issue a SET_SSV that will pass the
   digest verification check.  However, because the new connection has
   not been bound to the session, the SET_SSV is rejected for that
   reason.

o  The connection to session binding model does not prevent
   connection hijacking.  However, if an attacker can perform
   connection hijacking, it can issue denial of service attacks that
   are less difficult than attacks based on forging sessions.
2.9.7.  Session Mechanics - Steady State

2.9.7.1.  Obligations of the Server

The server has the primary obligation to monitor the state of
backchannel resources that the client has created for the server
(RPCSEC_GSS contexts and back channel connections).  When these
resources go away, the server takes action as specified in
Section 2.9.8.2.

2.9.7.2.  Obligations of the Client
The client has the following obligations in order to utilize the
session:

o  Keep a necessary session from going idle on the server.  A client
   that requires a session, but nonetheless is not sending operations
   risks having the session be destroyed by the server.  This is
   because sessions consume resources, and resource limitations may
   force the server to cull the least recently used session.
skipping to change at page 53, line 48
   BACKCHANNEL_CTL are unexpired.  A good practice is to keep at
   least two contexts outstanding, where the expiration time of the
   newest context at the time it was created, is N times that of the
   oldest context, where N is the number of contexts available for
   callbacks.

o  Maintain an active connection.  The server requires a callback
   path in order to gracefully recall recallable state, or notify the
   client of certain events.

2.9.7.3.  Steps the Client Takes To Establish a Session
The client issues CREATE_CLIENTID to establish a clientid.

The client uses the clientid to issue a CREATE_SESSION on a
connection to the server.  The results of CREATE_SESSION indicate
whether the server will persist the session replay cache through a
server reboot or not, and the client notes this for future reference.
The client SHOULD have specified connection binding enforcement when
the session was created.  If so, the client SHOULD issue SET_SSV in
the first COMPOUND after the session is created.  If it is not using
machine credentials, then each time a new principal goes to use the
session, it SHOULD issue a SET_SSV again.

If the client wants to use delegations, layouts, directory
notifications, or any other state that requires a callback channel,
then it MUST add a connection to the backchannel if CREATE_SESSION
did not already do so.  The client creates a connection, and calls
BIND_CONN_TO_SESSION to bind the connection to the session and the
session's backchannel.  If CREATE_SESSION did not already do so, the
client MUST tell the server what security is required in order for
the client to accept callbacks.  The client does this via
BACKCHANNEL_CTL.

If the client wants to use additional connections for the
backchannel, then it MUST call BIND_CONN_TO_SESSION on each
connection it wants to use with the session.  If the client wants to
use additional connections for the operation channel and it specified
connection binding enforcement, then it MUST call
BIND_CONN_TO_SESSION on each such connection before using it.
At this point the client has reached a steady state as far as
session use is concerned.

2.9.8.  Session Mechanics - Recovery

This section discusses session-related events that require recovery.
2.9.8.1.  Events Requiring Client Action

The following events require client action to recover.

2.9.8.1.1.  RPCSEC_GSS Context Loss by Callback Path

If all RPCSEC_GSS contexts granted by the client to the server for
callback use have expired, the client MUST establish a new context
via BACKCHANNEL_CTL.  The sr_status field of the SEQUENCE results
indicates when callback contexts are nearly expired, or fully
expired (see Section 16.46.4).
2.9.8.1.2.  Connection Disconnect

If the client loses the last connection of the session, then it MUST
create a new connection, and if connection binding enforcement was
specified when the session was created, bind it to the session via
BIND_CONN_TO_SESSION.

If there were requests outstanding at the time of the connection
disconnect, then the client MUST retry those requests, as described
in Section 2.9.4.2.  Note that it is not necessary to retry requests
over a connection with the same source network address or the same
destination network address as the disconnected connection.  As long
as the sessionid, slotid, and sequenceid in the retry match those of
the original request, the server will recognize the request as a
retry if it did see the request prior to the disconnect.
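The retry-recognition rule can be sketched as a replay-cache lookup
keyed only on (sessionid, slotid, sequenceid), independent of the
connection's network addresses.  The cache shape and names below are
illustrative assumptions, not the server's actual data structures.

```python
# Sketch: replies are cached per (sessionid, slotid); a retry is any
# request whose sequenceid matches the cached entry, regardless of
# which connection (source/destination address) carries it.
replay_cache = {}

def handle(sessionid, slotid, sequenceid, execute):
    key = (sessionid, slotid)
    cached = replay_cache.get(key)
    if cached is not None and cached[0] == sequenceid:
        return cached[1]          # recognized retry: replay cached reply
    reply = execute()             # new request: execute and cache it
    replay_cache[key] = (sequenceid, reply)
    return reply

calls = []
result1 = handle("sess", 0, 7, lambda: calls.append(1) or "WRITE-OK")
result2 = handle("sess", 0, 7, lambda: calls.append(1) or "WRITE-OK")
print(result1, result2, len(calls))  # WRITE-OK WRITE-OK 1
```

The retried request executed only once; the second call was satisfied
entirely from the replay cache.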
If the connection that was bound to the backchannel is lost, the
client may need to reconnect, and use BIND_CONN_TO_SESSION to assign
the new connection to the backchannel.  If the connection that was
lost was the last one bound to the backchannel, then the client MUST
reconnect, and bind the connection to the session and backchannel.
The server should indicate when it has no callback connection via the
sr_status result from SEQUENCE.
2.9.8.1.3. Backchannel GSS Context Loss
Via the sr_status result of the SEQUENCE operation or other means,
the client will learn if some or all of the RPCSEC_GSS contexts it
assigned to the backchannel have been lost. The client may need to
use BACKCHANNEL_CTL to assign new contexts. It MUST assign new
contexts if there are no more contexts.
2.9.8.1.4. Loss of Session
The server may lose a record of the session.  Causes include:

o  Server crash and reboot

o  A catastrophe that causes the cache to be corrupted or lost on the
   media it was stored on.  This applies even if the server indicated
   in the CREATE_SESSION results that it would persist the cache.

o  The server purges the session of a client that has been inactive
   for a very extended period of time.  [[Comment.11: XXX - Should we
   add a value to the CREATE_SESSION results that tells a client how
   long he can let a session stay idle before losing it?]]

Loss of replay cache is equivalent to loss of session.  The server
indicates loss of session to the client by returning
NFS4ERR_BADSESSION on the next operation that uses the sessionid
associated with the lost session.
After an event like a server reboot, the client may have lost its
connections.  The client assumes for the moment that the session has
not been lost.  It reconnects, and if it specified connection binding
enforcement when the session was created, it invokes
BIND_CONN_TO_SESSION using the sessionid.  Otherwise, it invokes
SEQUENCE.  If BIND_CONN_TO_SESSION or SEQUENCE returns
NFS4ERR_BADSESSION, the client knows the session was lost.  If the
connection survives session loss, then the next SEQUENCE operation
the client issues over the connection will get back
NFS4ERR_BADSESSION.  The client again knows the session was lost.
When the client detects session loss, it must call CREATE_SESSION to
recover.  Any non-idempotent operations that were in progress may
have been performed on the server at the time of session loss.  The
client has no general way to recover from this.
Note that loss of session does not imply loss of lock, open,
delegation, or layout state.  Nor does loss of lock, open,
delegation, or layout state imply loss of session state.
[[Comment.12: Add reference to lock recovery section]]  A session
can survive a server reboot, but lock recovery may still be needed.
The converse is also true.
It is possible CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID
(for example, the server reboots and does not preserve clientid
state).  If so, the client needs to call CREATE_CLIENTID, followed
by CREATE_SESSION.
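The recovery steps above can be sketched as a small client-side decision function.  This is only a sketch: the operation and error names follow this document, but the numeric status values and the helper itself are illustrative, not normative.

```python
# Illustrative status codes: the symbolic names come from this
# specification; the numeric values here are placeholders.
NFS4_OK = 0
NFS4ERR_BADSESSION = 10052
NFS4ERR_STALE_CLIENTID = 10022

def next_recovery_step(last_op, status):
    """Return the next operation a client should issue while
    recovering from session loss, or None when recovery is done."""
    if last_op == "CREATE_CLIENTID" and status == NFS4_OK:
        # Clientid re-established; now recreate the session.
        return "CREATE_SESSION"
    if status == NFS4_OK:
        return None
    if status == NFS4ERR_BADSESSION:
        # Session loss detected (e.g. on SEQUENCE or
        # BIND_CONN_TO_SESSION): rebuild the session.
        return "CREATE_SESSION"
    if last_op == "CREATE_SESSION" and status == NFS4ERR_STALE_CLIENTID:
        # Server did not preserve clientid state across reboot.
        return "CREATE_CLIENTID"
    return "UNRECOVERABLE"
```

Note that even after this sequence succeeds, lock, open, delegation, and layout state may still require separate recovery.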
2.9.8.1.5.  Failover

[[Comment.13: Dave Noveck requested this section; not sure what is
needed here if this refers to failover to a replica.  What are the
session ramifications?]]

2.9.8.2.  Events Requiring Server Action

The following events require server action to recover.
2.9.8.2.1.  Client Crash and Reboot

As described in Section 16.35, a rebooted client causes the server
to delete any sessions it had.
2.9.8.2.2.  Client Crash with No Reboot

If a client crashes and never comes back, it will never issue
CREATE_CLIENTID with its old clientid.  Thus the server has session
state that will never be used again.  After an extended period of
time and if the server has resource constraints, it MAY destroy the
old session.
2.9.8.2.3.  Extended Network Partition

To the server, the extended network partition may be no different
than a client crash with no reboot (see Section 2.9.8.2.2).  Unless
the server can discern that there is a network partition, it is free
to treat the situation as if the client has crashed for good.
2.9.8.2.4.  Backchannel Connection Loss

If there were callback requests outstanding at the time of a
connection disconnect, then the server MUST retry the request, as
described in Section 2.9.4.2.  Note that it is not necessary to
retry requests over a connection with the same source network
address or the same destination network address as the disconnected
connection.  As long as the sessionid, slotid, and sequenceid in the
retry match that of the original request, the callback target will
recognize the request as a retry if it did see the request prior to
disconnect.

If the connection lost is the last one bound to the backchannel,
then the server MUST indicate that in the sr_status field of the
next SEQUENCE reply.

2.9.8.2.5.  GSS Context Loss

The server SHOULD monitor when the last RPCSEC_GSS context assigned
to the backchannel is near expiry (i.e., between one and two periods
of lease time), and indicate so in the sr_status field of the next
SEQUENCE reply.  The server MUST indicate when the backchannel's
last RPCSEC_GSS context has expired in the sr_status field of the
next SEQUENCE reply.
3.  Protocol Data Types

The syntax and semantics to describe the data types of the NFS
version 4 protocol are defined in the XDR RFC4506 [3] and RPC
RFC1831 [4] documents.  The next sections build upon the XDR data
types to define types and structures specific to this protocol.

3.1.  Basic Data Types

These are the base NFSv4 data types.
+---------------+---------------------------------------------------+
| Data Type     | Definition                                        |
+---------------+---------------------------------------------------+
| int32_t       | typedef int int32_t;                              |
| uint32_t      | typedef unsigned int uint32_t;                    |
| int64_t       | typedef hyper int64_t;                            |
| uint64_t      | typedef unsigned hyper uint64_t;                  |

skipping to change at page 59, line 25

|               | Verifier used for various operations (COMMIT,     |
|               | CREATE, OPEN, READDIR, SETCLIENTID,               |
|               | SETCLIENTID_CONFIRM, WRITE) NFS4_VERIFIER_SIZE is |
|               | defined as 8.                                     |
+---------------+---------------------------------------------------+

End of Base Data Types

Table 1
3.2.  Structured Data Types

3.2.1.  nfstime4

   struct nfstime4 {
           int64_t         seconds;
           uint32_t        nseconds;
   }
The nfstime4 structure gives the number of seconds and nanoseconds
since midnight or 0 hour January 1, 1970 Coordinated Universal Time
(UTC).  Values greater than zero for the seconds field denote dates
after the 0 hour January 1, 1970.  Values less than zero for the
skipping to change at page 60, line 5
nseconds fields would have a value of one-half second (500000000).
Values greater than 999,999,999 for nseconds are considered invalid.

This data type is used to pass time and date information.  A server
converts to and from its local representation of time when
processing time values, preserving as much accuracy as possible.  If
the precision of timestamps stored for a file system object is less
than defined, loss of precision can occur.  An adjunct time
maintenance protocol is recommended to reduce client and server time
skew.
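As a sketch of the encoding (the helper name is ours, not part of the protocol): a signed fractional Unix timestamp splits into the seconds/nseconds pair such that nseconds always counts forward from the (possibly negative) seconds value, matching the one-half-second example above.

```python
NSEC_PER_SEC = 1_000_000_000

def to_nfstime4(unix_time):
    """Split a fractional Unix timestamp into the (seconds, nseconds)
    pair carried in nfstime4.  nseconds is always non-negative and
    below one billion; seconds carries the sign."""
    seconds = int(unix_time // 1)                     # floor: -0.5 -> -1
    nseconds = round((unix_time - seconds) * NSEC_PER_SEC)
    if nseconds == NSEC_PER_SEC:                      # guard rounding up
        seconds += 1
        nseconds = 0
    return seconds, nseconds
```

For instance, a time of one-half second before the epoch yields seconds = -1 and nseconds = 500000000, as in the text above.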
3.2.2.  time_how4

   enum time_how4 {
           SET_TO_SERVER_TIME4 = 0,
           SET_TO_CLIENT_TIME4 = 1
   };

3.2.3.  settime4

   union settime4 switch (time_how4 set_it) {
    case SET_TO_CLIENT_TIME4:
           nfstime4        time;
    default:
           void;
   };

The above definitions are used as the attribute definitions to set
time values.  If set_it is SET_TO_SERVER_TIME4, then the server uses
its local representation of time for the time value.
3.2.4.  specdata4

   struct specdata4 {
           uint32_t specdata1; /* major device number */
           uint32_t specdata2; /* minor device number */
   };

This data type represents additional information for the device file
types NF4CHR and NF4BLK.
3.2.5.  fsid4

   struct fsid4 {
           uint64_t        major;
           uint64_t        minor;
   };

3.2.6.  fs_location4

   struct fs_location4 {
           utf8str_cis     server<>;
           pathname4       rootpath;
   };

3.2.7.  fs_locations4

   struct fs_locations4 {
           pathname4       fs_root;
           fs_location4    locations<>;
   };

The fs_location4 and fs_locations4 data types are used for the
fs_locations recommended attribute which is used for migration and
replication support.
3.2.8.  fattr4

   struct fattr4 {
           bitmap4         attrmask;
           attrlist4       attr_vals;
   };

The fattr4 structure is used to represent file and directory
attributes.

The bitmap is a counted array of 32 bit integers used to contain bit
values.  The position of the integer in the array that contains bit
n can be computed from the expression (n / 32) and its bit within
that integer is (n mod 32).
                    0           1
   +-----------+-----------+-----------+--
   |  count    | 31  ..  0 | 63  .. 32 |
   +-----------+-----------+-----------+--
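The (n / 32, n mod 32) arithmetic above can be sketched as follows; the helper names are ours, not part of the protocol.

```python
def set_attr_bit(bitmap, n):
    """Set bit n in a bitmap4 represented as a list of 32-bit words,
    growing the counted array as needed (word n // 32, bit n % 32)."""
    word = n // 32
    while len(bitmap) <= word:
        bitmap.append(0)
    bitmap[word] |= 1 << (n % 32)
    return bitmap

def attr_bit_is_set(bitmap, n):
    """Test bit n; bits beyond the counted array are unset."""
    word = n // 32
    return word < len(bitmap) and bool(bitmap[word] & (1 << (n % 32)))
```

For example, setting bit 33 produces a two-word array whose second word has bit 1 set.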
3.2.9.  change_info4

   struct change_info4 {
           bool            atomic;
           changeid4       before;
           changeid4       after;
   };

This structure is used with the CREATE, LINK, REMOVE, RENAME
operations to let the client know the value of the change attribute
for the directory in which the target file system object resides.
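One common client-side use of change_info4 (a sketch; this caching behavior is implied here rather than mandated) is to keep a cached directory change attribute valid across the client's own modifications: if the server performed the operation atomically with respect to the change attribute and the cached value matches "before", the cache can advance to "after" without a re-fetch.

```python
def update_cached_change(cached, atomic, before, after):
    """Return the new cached change attribute for a directory after
    an operation reported change_info4 (atomic, before, after).
    None means the cache must be invalidated and re-fetched."""
    if atomic and cached == before:
        return after       # no intervening change; cache stays valid
    return None            # someone else changed the directory
```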
3.2.10.  netaddr4

   struct netaddr4 {
           /* see struct rpcb in RFC1833 */
           string r_netid<>;    /* network id */
           string r_addr<>;     /* universal address */
   };

The netaddr4 structure is used to identify TCP/IP based endpoints.
The r_netid and r_addr fields are specified in RFC1833 [22], but
they are underspecified in RFC1833 [22] as far as what they should
look like for specific protocols.
For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the
US-ASCII string:

   h1.h2.h3.h4.p1.p2

The prefix, "h1.h2.h3.h4", is the standard textual form for
representing an IPv4 address, which is always four octets long.
Assuming big-endian ordering, h1, h2, h3, and h4, are respectively,
skipping to change at page 63, line 5
The suffix "p1.p2" is the service port, and is computed the same way
as with universal addresses for TCP and UDP over IPv4.  The prefix,
"x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for
representing an IPv6 address as defined in Section 2.2 of RFC1884
[9].  Additionally, the two alternative forms specified in Section
2.2 of RFC1884 [9] are also acceptable.

For TCP over IPv6 the value of r_netid is the string "tcp6".  For
UDP over IPv6 the value of r_netid is the string "udp6".
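A sketch of the IPv4 universal-address form above, with helper names of our own choosing: p1 is the high-order and p2 the low-order octet of the port, per the universal address convention.

```python
def ipv4_universal_addr(host, port):
    """Encode a dotted-quad IPv4 host and a TCP/UDP port as the
    r_addr form h1.h2.h3.h4.p1.p2."""
    return "%s.%d.%d" % (host, port >> 8, port & 0xFF)

def parse_ipv4_universal_addr(r_addr):
    """Split an r_addr of the form h1.h2.h3.h4.p1.p2 back into
    (host, port)."""
    parts = r_addr.split(".")
    host = ".".join(parts[:4])
    p1, p2 = int(parts[4]), int(parts[5])
    return host, (p1 << 8) | p2
```

For example, NFS's well-known port 2049 on host 192.0.2.1 encodes as "192.0.2.1.8.1", since 8 * 256 + 1 = 2049.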
3.2.11.  clientaddr4

   typedef netaddr4 clientaddr4;

The clientaddr4 structure is used as part of the SETCLIENTID
operation, either to specify the address of the client that is using
a clientid or as part of the callback registration.

3.2.12.  cb_client4

   struct cb_client4 {
           unsigned int    cb_program;
           netaddr4        cb_location;
   };

This structure is used by the client to inform the server of its
callback address; it includes the program number and client address.
3.2.13.  nfs_client_id4

   struct nfs_client_id4 {
           verifier4       verifier;
           opaque          id<NFS4_OPAQUE_LIMIT>;
   };

This structure is part of the arguments to the SETCLIENTID
operation.  NFS4_OPAQUE_LIMIT is defined as 1024.

3.2.14.  open_owner4

   struct open_owner4 {
           clientid4       clientid;
           opaque          owner<NFS4_OPAQUE_LIMIT>;
   };

This structure is used to identify the owner of open state.
NFS4_OPAQUE_LIMIT is defined as 1024.

3.2.15.  lock_owner4

   struct lock_owner4 {
           clientid4       clientid;
           opaque          owner<NFS4_OPAQUE_LIMIT>;
   };

This structure is used to identify the owner of file locking state.
NFS4_OPAQUE_LIMIT is defined as 1024.
3.2.16.  open_to_lock_owner4

   struct open_to_lock_owner4 {
           seqid4          open_seqid;
           stateid4        open_stateid;
           seqid4          lock_seqid;
           lock_owner4     lock_owner;
   };

This structure is used for the first LOCK operation done for an
open_owner4.  It provides both the open_stateid and lock_owner such
that the transition is made from a valid open_stateid sequence to
that of the new lock_stateid sequence.  Using this mechanism avoids
the confirmation of the lock_owner/lock_seqid pair since it is tied
to established state in the form of the open_stateid/open_seqid.
3.2.17.  stateid4

   struct stateid4 {
           uint32_t        seqid;
           opaque          other[12];
   };

This structure is used for the various state sharing mechanisms
between the client and server.  For the client, this data structure
is read-only.  The starting value of the seqid field is undefined.
The server is required to increment the seqid field monotonically at
each transition of the stateid.  This is important since the client
will inspect the seqid in OPEN stateids to determine the order of
OPEN processing done by the server.
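Because the server increments seqid monotonically, a client can order two OPEN stateids for the same state as sketched below (helper name ours; this ignores 32-bit seqid wraparound, which a real client must also consider).

```python
def newer_open_stateid(a, b):
    """Pick, of two stateids modeled as (seqid, other) pairs that
    refer to the same state (equal 'other' fields), the one that
    reflects later OPEN processing by the server.  Assumes seqid
    has not wrapped around 2**32."""
    assert a[1] == b[1], "stateids refer to different state"
    return a if a[0] >= b[0] else b
```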
3.2.18.  layouttype4

   enum layouttype4 {
           LAYOUT_NFSV4_FILES  = 1,
           LAYOUT_OSD2_OBJECTS = 2,
           LAYOUT_BLOCK_VOLUME = 3
   };

A layout type specifies the layout being used.  The implication is
that clients have "layout drivers" that support one or more layout
types.  The file server advertises the layout types it supports
through the LAYOUT_TYPES file system attribute.  A client asks for
layouts of a particular type in LAYOUTGET, and passes those layouts
to its layout driver.
The layouttype4 structure is 32 bits in length.  The range
represented by the layout type is split into two parts.  Types
within the range 0x00000000-0x7FFFFFFF are globally unique and are
assigned according to the description in Section 20.1; they are
maintained by IANA.  Types within the range 0x80000000-0xFFFFFFFF
are site specific and for "private use" only.
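The split of the 32-bit layout type space can be checked as sketched below (helper name ours):

```python
LAYOUT_TYPE_PRIVATE_MIN = 0x80000000
LAYOUT_TYPE_MAX = 0xFFFFFFFF

def layout_type_is_private(t):
    """True for site-specific 'private use' layout types
    (0x80000000-0xFFFFFFFF); False for the IANA-maintained range
    (0x00000000-0x7FFFFFFF)."""
    return LAYOUT_TYPE_PRIVATE_MIN <= t <= LAYOUT_TYPE_MAX
```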
The LAYOUT_NFSV4_FILES enumeration specifies that the NFSv4 file
layout type is to be used.  The LAYOUT_OSD2_OBJECTS enumeration
specifies that the object layout, as defined in [23], is to be used.
Similarly, the LAYOUT_BLOCK_VOLUME enumeration specifies that the
block/volume layout, as defined in [24], is to be used.
3.2.19.  deviceid4

   typedef uint32_t  deviceid4;  /* 32-bit device ID */

Layout information includes device IDs that specify a storage device
through a compact handle.  Addressing and type information is
obtained with the GETDEVICEINFO operation.  A client must not assume
that device IDs are valid across metadata server reboots.  The
device ID is qualified by the layout type and is unique per file
system (FSID).  This allows different layout drivers to generate
device IDs without the need for co-ordination.  See Section 12.3.1.4
for more details.
3.2.20.  devlist_item4

   struct devlist_item4 {
           deviceid4       dli_id;
           opaque          dli_device_addr<>;
   };

An array of these values is returned by the GETDEVICELIST operation.
They define the set of devices associated with a file system for the
layout type specified in the GETDEVICELIST4args.

The device address is used to set up a communication channel with
the storage device.  Different layout types will require different
types of structures to define how they communicate with storage
devices.  The opaque dli_device_addr field must be interpreted based
on the specified layout type.

This document defines the device address for the NFSv4 file layout
(struct netaddr4 (Section 3.2.10)), which identifies a storage
device by network IP address and port number.  This is sufficient
for the clients to communicate with the NFSv4 storage devices, and
may be sufficient for other layout types as well.  Device types for
object storage devices and block storage devices (e.g., SCSI volume
labels) will be defined by their respective layout specifications.
3.2.21.  layout4

   struct layout4 {
           offset4         lo_offset;
           length4