draft-ietf-nfsv4-minorversion1-20.txt   draft-ietf-nfsv4-minorversion1-21.txt 
NFSv4 S. Shepler NFSv4 S. Shepler
Internet-Draft M. Eisler Internet-Draft M. Eisler
Intended status: Standards Track D. Noveck Intended status: Standards Track D. Noveck
Expires: August 28, 2008 Editors Expires: August 28, 2008 Editors
February 25, 2008 February 25, 2008
NFS Version 4 Minor Version 1 NFS Version 4 Minor Version 1
draft-ietf-nfsv4-minorversion1-20.txt draft-ietf-nfsv4-minorversion1-21.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 3, line 6 skipping to change at page 3, line 6
2.9. Transport Layers . . . . . . . . . . . . . . . . . . . . 37 2.9. Transport Layers . . . . . . . . . . . . . . . . . . . . 37
2.9.1. REQUIRED and RECOMMENDED Properties of Transports . 37 2.9.1. REQUIRED and RECOMMENDED Properties of Transports . 37
2.9.2. Client and Server Transport Behavior . . . . . . . . 37 2.9.2. Client and Server Transport Behavior . . . . . . . . 37
2.9.3. Ports . . . . . . . . . . . . . . . . . . . . . . . 39 2.9.3. Ports . . . . . . . . . . . . . . . . . . . . . . . 39
2.10. Session . . . . . . . . . . . . . . . . . . . . . . . . 39 2.10. Session . . . . . . . . . . . . . . . . . . . . . . . . 39
2.10.1. Motivation and Overview . . . . . . . . . . . . . . 39 2.10.1. Motivation and Overview . . . . . . . . . . . . . . 39
2.10.2. NFSv4 Integration . . . . . . . . . . . . . . . . . 40 2.10.2. NFSv4 Integration . . . . . . . . . . . . . . . . . 40
2.10.3. Channels . . . . . . . . . . . . . . . . . . . . . . 42 2.10.3. Channels . . . . . . . . . . . . . . . . . . . . . . 42
2.10.4. Trunking . . . . . . . . . . . . . . . . . . . . . . 43 2.10.4. Trunking . . . . . . . . . . . . . . . . . . . . . . 43
2.10.5. Exactly Once Semantics . . . . . . . . . . . . . . . 46 2.10.5. Exactly Once Semantics . . . . . . . . . . . . . . . 46
2.10.6. RDMA Considerations . . . . . . . . . . . . . . . . 58 2.10.6. RDMA Considerations . . . . . . . . . . . . . . . . 59
2.10.7. Sessions Security . . . . . . . . . . . . . . . . . 61 2.10.7. Sessions Security . . . . . . . . . . . . . . . . . 62
2.10.8. The SSV GSS Mechanism . . . . . . . . . . . . . . . 66 2.10.8. The SSV GSS Mechanism . . . . . . . . . . . . . . . 67
2.10.9. Session Mechanics - Steady State . . . . . . . . . . 71 2.10.9. Session Mechanics - Steady State . . . . . . . . . . 71
2.10.10. Session Mechanics - Recovery . . . . . . . . . . . . 72 2.10.10. Session Inactivity Timer . . . . . . . . . . . . . . 73
2.10.11. Parallel NFS and Sessions . . . . . . . . . . . . . 76 2.10.11. Session Mechanics - Recovery . . . . . . . . . . . . 73
3. Protocol Constants and Data Types . . . . . . . . . . . . . . 76 2.10.12. Parallel NFS and Sessions . . . . . . . . . . . . . 77
3.1. Basic Constants . . . . . . . . . . . . . . . . . . . . 76 3. Protocol Constants and Data Types . . . . . . . . . . . . . . 77
3.2. Basic Data Types . . . . . . . . . . . . . . . . . . . . 77 3.1. Basic Constants . . . . . . . . . . . . . . . . . . . . 77
3.3. Structured Data Types . . . . . . . . . . . . . . . . . 79 3.2. Basic Data Types . . . . . . . . . . . . . . . . . . . . 78
4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 88 3.3. Structured Data Types . . . . . . . . . . . . . . . . . 80
4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 88 4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 89 4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 89
4.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 89 4.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 90
4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 89 4.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 90
4.2.1. General Properties of a Filehandle . . . . . . . . . 90 4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 90
4.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 91 4.2.1. General Properties of a Filehandle . . . . . . . . . 91
4.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 91 4.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 92
4.3. One Method of Constructing a Volatile Filehandle . . . . 92 4.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 92
4.4. Client Recovery from Filehandle Expiration . . . . . . . 93 4.3. One Method of Constructing a Volatile Filehandle . . . . 93
5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 94 4.4. Client Recovery from Filehandle Expiration . . . . . . . 94
5.1. REQUIRED Attributes . . . . . . . . . . . . . . . . . . 95 5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 95
5.2. RECOMMENDED Attributes . . . . . . . . . . . . . . . . . 95 5.1. REQUIRED Attributes . . . . . . . . . . . . . . . . . . 96
5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 96 5.2. RECOMMENDED Attributes . . . . . . . . . . . . . . . . . 96
5.4. Classification of Attributes . . . . . . . . . . . . . . 97 5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 97
5.5. REQUIRED Attributes - List and Definition References . . 99 5.4. Classification of Attributes . . . . . . . . . . . . . . 98
5.5. REQUIRED Attributes - List and Definition References . . 100
5.6. RECOMMENDED Attributes - List and Definition 5.6. RECOMMENDED Attributes - List and Definition
References . . . . . . . . . . . . . . . . . . . . . . . 99 References . . . . . . . . . . . . . . . . . . . . . . . 100
5.7. Attribute Definitions . . . . . . . . . . . . . . . . . 101 5.7. Attribute Definitions . . . . . . . . . . . . . . . . . 102
5.7.1. Definitions of REQUIRED Attributes . . . . . . . . . 101 5.7.1. Definitions of REQUIRED Attributes . . . . . . . . . 102
5.7.2. Definitions of Uncategorized RECOMMENDED 5.7.2. Definitions of Uncategorized RECOMMENDED
Attributes . . . . . . . . . . . . . . . . . . . . . 103 Attributes . . . . . . . . . . . . . . . . . . . . . 104
5.8. Interpreting owner and owner_group . . . . . . . . . . . 109 5.8. Interpreting owner and owner_group . . . . . . . . . . . 110
5.9. Character Case Attributes . . . . . . . . . . . . . . . 111 5.9. Character Case Attributes . . . . . . . . . . . . . . . 112
5.10. Directory Notification Attributes . . . . . . . . . . . 111 5.10. Directory Notification Attributes . . . . . . . . . . . 112
5.11. pNFS Attribute Definitions . . . . . . . . . . . . . . . 112 5.11. pNFS Attribute Definitions . . . . . . . . . . . . . . . 113
5.12. Retention Attributes . . . . . . . . . . . . . . . . . . 114 5.12. Retention Attributes . . . . . . . . . . . . . . . . . . 115
6. Security Related Attributes . . . . . . . . . . . . . . . . . 116 6. Access Control Attributes . . . . . . . . . . . . . . . . . . 117
6.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . 116 6.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2. File Attributes Discussion . . . . . . . . . . . . . . . 117 6.2. File Attributes Discussion . . . . . . . . . . . . . . . 118
6.2.1. Attribute 12: acl . . . . . . . . . . . . . . . . . 117 6.2.1. Attribute 12: acl . . . . . . . . . . . . . . . . . 118
6.2.2. Attribute 58: dacl . . . . . . . . . . . . . . . . . 132 6.2.2. Attribute 58: dacl . . . . . . . . . . . . . . . . . 133
6.2.3. Attribute 59: sacl . . . . . . . . . . . . . . . . . 132 6.2.3. Attribute 59: sacl . . . . . . . . . . . . . . . . . 133
6.2.4. Attribute 33: mode . . . . . . . . . . . . . . . . . 132 6.2.4. Attribute 33: mode . . . . . . . . . . . . . . . . . 133
6.2.5. Attribute 74: mode_set_masked . . . . . . . . . . . 133 6.2.5. Attribute 74: mode_set_masked . . . . . . . . . . . 134
6.3. Common Methods . . . . . . . . . . . . . . . . . . . . . 134 6.3. Common Methods . . . . . . . . . . . . . . . . . . . . . 135
6.3.1. Interpreting an ACL . . . . . . . . . . . . . . . . 134 6.3.1. Interpreting an ACL . . . . . . . . . . . . . . . . 135
6.3.2. Computing a Mode Attribute from an ACL . . . . . . . 135 6.3.2. Computing a Mode Attribute from an ACL . . . . . . . 136
6.4. Requirements . . . . . . . . . . . . . . . . . . . . . . 136 6.4. Requirements . . . . . . . . . . . . . . . . . . . . . . 137
6.4.1. Setting the mode and/or ACL Attributes . . . . . . . 136 6.4.1. Setting the mode and/or ACL Attributes . . . . . . . 137
6.4.2. Retrieving the mode and/or ACL Attributes . . . . . 138 6.4.2. Retrieving the mode and/or ACL Attributes . . . . . 139
6.4.3. Creating New Objects . . . . . . . . . . . . . . . . 138 6.4.3. Creating New Objects . . . . . . . . . . . . . . . . 139
7. Single-server Namespace . . . . . . . . . . . . . . . . . . . 142 7. Single-server Namespace . . . . . . . . . . . . . . . . . . . 143
7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 142 7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 143
7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 143 7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 144
7.3. Server Pseudo File System . . . . . . . . . . . . . . . 143 7.3. Server Pseudo File System . . . . . . . . . . . . . . . 144
7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 144 7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 145
7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 144 7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 145
7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 144 7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 145
7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 145 7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 146
7.8. Security Policy and Namespace Presentation . . . . . . . 145 7.8. Security Policy and Namespace Presentation . . . . . . . 146
8. State Management . . . . . . . . . . . . . . . . . . . . . . 146 8. State Management . . . . . . . . . . . . . . . . . . . . . . 147
8.1. Client and Session ID . . . . . . . . . . . . . . . . . 147 8.1. Client and Session ID . . . . . . . . . . . . . . . . . 148
8.2. Stateid Definition . . . . . . . . . . . . . . . . . . . 147 8.2. Stateid Definition . . . . . . . . . . . . . . . . . . . 148
8.2.1. Stateid Types . . . . . . . . . . . . . . . . . . . 148 8.2.1. Stateid Types . . . . . . . . . . . . . . . . . . . 149
8.2.2. Stateid Structure . . . . . . . . . . . . . . . . . 149 8.2.2. Stateid Structure . . . . . . . . . . . . . . . . . 150
8.2.3. Special Stateids . . . . . . . . . . . . . . . . . . 150 8.2.3. Special Stateids . . . . . . . . . . . . . . . . . . 151
8.2.4. Stateid Lifetime and Validation . . . . . . . . . . 152 8.2.4. Stateid Lifetime and Validation . . . . . . . . . . 152
8.2.5. Stateid Use for I/O Operations . . . . . . . . . . . 155 8.2.5. Stateid Use for I/O Operations . . . . . . . . . . . 155
8.3. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 155 8.3. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 156
8.4. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 157 8.4. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 158
8.4.1. Client Failure and Recovery . . . . . . . . . . . . 158 8.4.1. Client Failure and Recovery . . . . . . . . . . . . 158
8.4.2. Server Failure and Recovery . . . . . . . . . . . . 158 8.4.2. Server Failure and Recovery . . . . . . . . . . . . 159
8.4.3. Network Partitions and Recovery . . . . . . . . . . 162 8.4.3. Network Partitions and Recovery . . . . . . . . . . 163
8.5. Server Revocation of Locks . . . . . . . . . . . . . . . 166 8.5. Server Revocation of Locks . . . . . . . . . . . . . . . 167
8.6. Short and Long Leases . . . . . . . . . . . . . . . . . 167 8.6. Short and Long Leases . . . . . . . . . . . . . . . . . 168
8.7. Clocks, Propagation Delay, and Calculating Lease 8.7. Clocks, Propagation Delay, and Calculating Lease
Expiration . . . . . . . . . . . . . . . . . . . . . . . 168 Expiration . . . . . . . . . . . . . . . . . . . . . . . 169
8.8. Vestigial Locking Infrastructure From V4.0 . . . . . . . 168 8.8. Obsolete Locking Infrastructure From NFSv4.0 . . . . . . 169
9. File Locking and Share Reservations . . . . . . . . . . . . . 169 9. File Locking and Share Reservations . . . . . . . . . . . . . 170
9.1. Opens and Byte-range Locks . . . . . . . . . . . . . . . 170 9.1. Opens and Byte-range Locks . . . . . . . . . . . . . . . 171
9.1.1. State-owner Definition . . . . . . . . . . . . . . . 170 9.1.1. State-owner Definition . . . . . . . . . . . . . . . 171
9.1.2. Use of the Stateid and Locking . . . . . . . . . . . 170 9.1.2. Use of the Stateid and Locking . . . . . . . . . . . 171
9.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 173 9.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 174
9.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 174 9.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 175
9.4. Stateid Seqid Values and Byte-range Locks . . . . . . . 174 9.4. Stateid Seqid Values and Byte-range Locks . . . . . . . 175
9.5. Issues with Multiple Open-owners . . . . . . . . . . . . 175 9.5. Issues with Multiple Open-owners . . . . . . . . . . . . 175
9.6. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 175 9.6. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 176
9.7. Share Reservations . . . . . . . . . . . . . . . . . . . 176 9.7. Share Reservations . . . . . . . . . . . . . . . . . . . 177
9.8. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 177 9.8. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 178
9.9. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 178 9.9. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 179
9.10. Parallel OPENs . . . . . . . . . . . . . . . . . . . . . 179 9.10. Parallel OPENs . . . . . . . . . . . . . . . . . . . . . 179
9.11. Reclaim of Open and Byte-range Locks . . . . . . . . . . 179 9.11. Reclaim of Open and Byte-range Locks . . . . . . . . . . 180
10. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 180 10. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 180
10.1. Performance Challenges for Client-Side Caching . . . . . 180 10.1. Performance Challenges for Client-Side Caching . . . . . 181
10.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 181 10.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 182
10.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 183 10.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 184
10.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 186 10.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 186
10.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 186 10.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 187
10.3.2. Data Caching and File Locking . . . . . . . . . . . 187 10.3.2. Data Caching and File Locking . . . . . . . . . . . 188
10.3.3. Data Caching and Mandatory File Locking . . . . . . 189 10.3.3. Data Caching and Mandatory File Locking . . . . . . 189
10.3.4. Data Caching and File Identity . . . . . . . . . . . 189 10.3.4. Data Caching and File Identity . . . . . . . . . . . 190
10.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 190 10.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 191
10.4.1. Open Delegation and Data Caching . . . . . . . . . . 192 10.4.1. Open Delegation and Data Caching . . . . . . . . . . 193
10.4.2. Open Delegation and File Locks . . . . . . . . . . . 194 10.4.2. Open Delegation and File Locks . . . . . . . . . . . 195
10.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 194 10.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 195
10.4.4. Recall of Open Delegation . . . . . . . . . . . . . 197 10.4.4. Recall of Open Delegation . . . . . . . . . . . . . 198
10.4.5. Clients that Fail to Honor Delegation Recalls . . . 199 10.4.5. Clients that Fail to Honor Delegation Recalls . . . 200
10.4.6. Delegation Revocation . . . . . . . . . . . . . . . 200 10.4.6. Delegation Revocation . . . . . . . . . . . . . . . 200
10.4.7. Delegations via WANT_DELEGATION . . . . . . . . . . 200 10.4.7. Delegations via WANT_DELEGATION . . . . . . . . . . 201
10.5. Data Caching and Revocation . . . . . . . . . . . . . . 201 10.5. Data Caching and Revocation . . . . . . . . . . . . . . 202
10.5.1. Revocation Recovery for Write Open Delegation . . . 202 10.5.1. Revocation Recovery for Write Open Delegation . . . 202
10.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 202 10.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 203
10.7. Data and Metadata Caching and Memory Mapped Files . . . 204 10.7. Data and Metadata Caching and Memory Mapped Files . . . 205
10.8. Name and Directory Caching without Directory 10.8. Name and Directory Caching without Directory
Delegations . . . . . . . . . . . . . . . . . . . . . . 207 Delegations . . . . . . . . . . . . . . . . . . . . . . 207
10.8.1. Name Caching . . . . . . . . . . . . . . . . . . . . 207 10.8.1. Name Caching . . . . . . . . . . . . . . . . . . . . 207
10.8.2. Directory Caching . . . . . . . . . . . . . . . . . 208 10.8.2. Directory Caching . . . . . . . . . . . . . . . . . 209
10.9. Directory Delegations . . . . . . . . . . . . . . . . . 209 10.9. Directory Delegations . . . . . . . . . . . . . . . . . 210
10.9.1. Introduction to Directory Delegations . . . . . . . 209 10.9.1. Introduction to Directory Delegations . . . . . . . 210
10.9.2. Directory Delegation Design . . . . . . . . . . . . 210 10.9.2. Directory Delegation Design . . . . . . . . . . . . 211
10.9.3. Attributes in Support of Directory Notifications . . 211 10.9.3. Attributes in Support of Directory Notifications . . 212
10.9.4. Directory Delegation Recall . . . . . . . . . . . . 211 10.9.4. Directory Delegation Recall . . . . . . . . . . . . 212
10.9.5. Directory Delegation Recovery . . . . . . . . . . . 212 10.9.5. Directory Delegation Recovery . . . . . . . . . . . 213
11. Multi-Server Namespace . . . . . . . . . . . . . . . . . . . 212 11. Multi-Server Namespace . . . . . . . . . . . . . . . . . . . 213
11.1. Location Attributes . . . . . . . . . . . . . . . . . . 213 11.1. Location Attributes . . . . . . . . . . . . . . . . . . 213
11.2. File System Presence or Absence . . . . . . . . . . . . 213 11.2. File System Presence or Absence . . . . . . . . . . . . 214
11.3. Getting Attributes for an Absent File System . . . . . . 214 11.3. Getting Attributes for an Absent File System . . . . . . 215
11.3.1. GETATTR Within an Absent File System . . . . . . . . 214 11.3.1. GETATTR Within an Absent File System . . . . . . . . 215
11.3.2. READDIR and Absent File Systems . . . . . . . . . . 216 11.3.2. READDIR and Absent File Systems . . . . . . . . . . 216
11.4. Uses of Location Information . . . . . . . . . . . . . . 216 11.4. Uses of Location Information . . . . . . . . . . . . . . 217
11.4.1. File System Replication . . . . . . . . . . . . . . 217 11.4.1. File System Replication . . . . . . . . . . . . . . 218
11.4.2. File System Migration . . . . . . . . . . . . . . . 218 11.4.2. File System Migration . . . . . . . . . . . . . . . 219
11.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 219 11.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 220
11.5. Location Entries and Server Identity . . . . . . . . . . 220 11.5. Location Entries and Server Identity . . . . . . . . . . 221
11.6. Additional Client-side Considerations . . . . . . . . . 221 11.6. Additional Client-side Considerations . . . . . . . . . 222
11.7. Effecting File System Transitions . . . . . . . . . . . 222 11.7. Effecting File System Transitions . . . . . . . . . . . 223
11.7.1. File System Transitions and Simultaneous Access . . 223 11.7.1. File System Transitions and Simultaneous Access . . 224
11.7.2. Simultaneous Use and Transparent Transitions . . . . 224 11.7.2. Simultaneous Use and Transparent Transitions . . . . 224
11.7.3. Filehandles and File System Transitions . . . . . . 226 11.7.3. Filehandles and File System Transitions . . . . . . 227
11.7.4. Fileids and File System Transitions . . . . . . . . 227 11.7.4. Fileids and File System Transitions . . . . . . . . 227
11.7.5. Fsids and File System Transitions . . . . . . . . . 228 11.7.5. Fsids and File System Transitions . . . . . . . . . 229
11.7.6. The Change Attribute and File System Transitions . . 229 11.7.6. The Change Attribute and File System Transitions . . 229
11.7.7. Lock State and File System Transitions . . . . . . . 229 11.7.7. Lock State and File System Transitions . . . . . . . 230
11.7.8. Write Verifiers and File System Transitions . . . . 233 11.7.8. Write Verifiers and File System Transitions . . . . 234
11.7.9. Readdir Cookies and Verifiers and File System 11.7.9. Readdir Cookies and Verifiers and File System
Transitions . . . . . . . . . . . . . . . . . . . . 233 Transitions . . . . . . . . . . . . . . . . . . . . 234
11.7.10. File System Data and File System Transitions . . . . 234 11.7.10. File System Data and File System Transitions . . . . 234
11.8. Effecting File System Referrals . . . . . . . . . . . . 235 11.8. Effecting File System Referrals . . . . . . . . . . . . 236
11.8.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 235 11.8.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 236
11.8.2. Referral Example (READDIR) . . . . . . . . . . . . . 239 11.8.2. Referral Example (READDIR) . . . . . . . . . . . . . 240
11.9. The Attribute fs_locations . . . . . . . . . . . . . . . 242 11.9. The Attribute fs_locations . . . . . . . . . . . . . . . 242
11.10. The Attribute fs_locations_info . . . . . . . . . . . . 244 11.10. The Attribute fs_locations_info . . . . . . . . . . . . 244
11.10.1. The fs_locations_server4 Structure . . . . . . . . . 247 11.10.1. The fs_locations_server4 Structure . . . . . . . . . 248
11.10.2. The fs_locations_info4 Structure . . . . . . . . . . 253 11.10.2. The fs_locations_info4 Structure . . . . . . . . . . 253
11.10.3. The fs_locations_item4 Structure . . . . . . . . . . 254 11.10.3. The fs_locations_item4 Structure . . . . . . . . . . 254
11.11. The Attribute fs_status . . . . . . . . . . . . . . . . 256 11.11. The Attribute fs_status . . . . . . . . . . . . . . . . 256
12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 259 12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 260
12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 259 12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 260
12.2. pNFS Definitions . . . . . . . . . . . . . . . . . . . . 261 12.2. pNFS Definitions . . . . . . . . . . . . . . . . . . . . 262
12.2.1. Metadata . . . . . . . . . . . . . . . . . . . . . . 261 12.2.1. Metadata . . . . . . . . . . . . . . . . . . . . . . 262
12.2.2. Metadata Server . . . . . . . . . . . . . . . . . . 261 12.2.2. Metadata Server . . . . . . . . . . . . . . . . . . 262
12.2.3. pNFS Client . . . . . . . . . . . . . . . . . . . . 262 12.2.3. pNFS Client . . . . . . . . . . . . . . . . . . . . 263
12.2.4. Storage Device . . . . . . . . . . . . . . . . . . . 262 12.2.4. Storage Device . . . . . . . . . . . . . . . . . . . 263
12.2.5. Storage Protocol . . . . . . . . . . . . . . . . . . 262 12.2.5. Storage Protocol . . . . . . . . . . . . . . . . . . 263
12.2.6. Control Protocol . . . . . . . . . . . . . . . . . . 262 12.2.6. Control Protocol . . . . . . . . . . . . . . . . . . 263
12.2.7. Layout Types . . . . . . . . . . . . . . . . . . . . 262 12.2.7. Layout Types . . . . . . . . . . . . . . . . . . . . 263
12.2.8. Layout . . . . . . . . . . . . . . . . . . . . . . . 263 12.2.8. Layout . . . . . . . . . . . . . . . . . . . . . . . 264
12.2.9. Layout Iomode . . . . . . . . . . . . . . . . . . . 263 12.2.9. Layout Iomode . . . . . . . . . . . . . . . . . . . 264
12.2.10. Device IDs . . . . . . . . . . . . . . . . . . . . . 264 12.2.10. Device IDs . . . . . . . . . . . . . . . . . . . . . 265
12.3. pNFS Operations . . . . . . . . . . . . . . . . . . . . 265 12.3. pNFS Operations . . . . . . . . . . . . . . . . . . . . 266
12.4. pNFS Attributes . . . . . . . . . . . . . . . . . . . . 266 12.4. pNFS Attributes . . . . . . . . . . . . . . . . . . . . 267
12.5. Layout Semantics . . . . . . . . . . . . . . . . . . . . 266 12.5. Layout Semantics . . . . . . . . . . . . . . . . . . . . 267
12.5.1. Guarantees Provided by Layouts . . . . . . . . . . . 266 12.5.1. Guarantees Provided by Layouts . . . . . . . . . . . 267
12.5.2. Getting a Layout . . . . . . . . . . . . . . . . . . 268 12.5.2. Getting a Layout . . . . . . . . . . . . . . . . . . 269
12.5.3. Layout Stateid . . . . . . . . . . . . . . . . . . . 269 12.5.3. Layout Stateid . . . . . . . . . . . . . . . . . . . 270
12.5.4. Committing a Layout . . . . . . . . . . . . . . . . 270 12.5.4. Committing a Layout . . . . . . . . . . . . . . . . 271
12.5.5. Recalling a Layout . . . . . . . . . . . . . . . . . 272 12.5.5. Recalling a Layout . . . . . . . . . . . . . . . . . 274
12.5.6. Revoking Layouts . . . . . . . . . . . . . . . . . . 279 12.5.6. Revoking Layouts . . . . . . . . . . . . . . . . . . 281
12.5.7. Metadata Server Write Propagation . . . . . . . . . 279 12.5.7. Metadata Server Write Propagation . . . . . . . . . 281
12.6. pNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 280 12.6. pNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 281
12.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 281 12.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 283
12.7.1. Recovery from Client Restart . . . . . . . . . . . . 282 12.7.1. Recovery from Client Restart . . . . . . . . . . . . 283
12.7.2. Dealing with Lease Expiration on the Client . . . . 282 12.7.2. Dealing with Lease Expiration on the Client . . . . 283
12.7.3. Dealing with Loss of Layout State on the Metadata 12.7.3. Dealing with Loss of Layout State on the Metadata
Server . . . . . . . . . . . . . . . . . . . . . . . 283 Server . . . . . . . . . . . . . . . . . . . . . . . 284
12.7.4. Recovery from Metadata Server Restart . . . . . . . 284 12.7.4. Recovery from Metadata Server Restart . . . . . . . 285
12.7.5. Operations During Metadata Server Grace Period . . . 286 12.7.5. Operations During Metadata Server Grace Period . . . 287
12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 286 12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 287
12.8. Metadata and Storage Device Roles . . . . . . . . . . . 286 12.8. Metadata and Storage Device Roles . . . . . . . . . . . 288
12.9. Security Considerations for pNFS . . . . . . . . . . . . 287 12.9. Security Considerations for pNFS . . . . . . . . . . . . 288
13. PNFS: NFSv4.1 File Layout Type . . . . . . . . . . . . . . . 288 13. PNFS: NFSv4.1 File Layout Type . . . . . . . . . . . . . . . 289
13.1. Client ID and Session Considerations . . . . . . . . . . 288 13.1. Client ID and Session Considerations . . . . . . . . . . 289
13.2. File Layout Definitions . . . . . . . . . . . . . . . . 290 13.1.1. Sessions Considerations for Data Servers . . . . . . 292
13.3. File Layout Data Types . . . . . . . . . . . . . . . . . 291 13.2. File Layout Definitions . . . . . . . . . . . . . . . . 292
13.4. Interpreting the File Layout . . . . . . . . . . . . . . 295 13.3. File Layout Data Types . . . . . . . . . . . . . . . . . 293
13.4.1. Determining the Stripe Unit Number . . . . . . . . . 295 13.4. Interpreting the File Layout . . . . . . . . . . . . . . 297
13.4.2. Interpreting the File Layout Using Sparse Packing . 295 13.4.1. Determining the Stripe Unit Number . . . . . . . . . 297
13.4.3. Interpreting the File Layout Using Dense Packing . . 298 13.4.2. Interpreting the File Layout Using Sparse Packing . 297
13.4.4. Sparse and Dense Stripe Unit Packing . . . . . . . . 300 13.4.3. Interpreting the File Layout Using Dense Packing . . 300
13.5. Data Server Multipathing . . . . . . . . . . . . . . . . 302 13.4.4. Sparse and Dense Stripe Unit Packing . . . . . . . . 302
13.6. Operations Sent to NFSv4.1 Data Servers . . . . . . . . 303 13.5. Data Server Multipathing . . . . . . . . . . . . . . . . 304
13.7. COMMIT Through Metadata Server . . . . . . . . . . . . . 305 13.6. Operations Sent to NFSv4.1 Data Servers . . . . . . . . 305
13.8. The Layout Iomode . . . . . . . . . . . . . . . . . . . 307 13.7. COMMIT Through Metadata Server . . . . . . . . . . . . . 307
13.9. Metadata and Data Server State Coordination . . . . . . 307 13.8. The Layout Iomode . . . . . . . . . . . . . . . . . . . 309
13.9.1. Global Stateid Requirements . . . . . . . . . . . . 307 13.9. Metadata and Data Server State Coordination . . . . . . 309
13.9.2. Data Server State Propagation . . . . . . . . . . . 308 13.9.1. Global Stateid Requirements . . . . . . . . . . . . 309
13.10. Data Server Component File Size . . . . . . . . . . . . 310 13.9.2. Data Server State Propagation . . . . . . . . . . . 310
13.11. Layout Revocation and Fencing . . . . . . . . . . . . . 311 13.10. Data Server Component File Size . . . . . . . . . . . . 312
13.12. Security Considerations for the File Layout Type . . . . 311 13.11. Layout Revocation and Fencing . . . . . . . . . . . . . 313
14. Internationalization . . . . . . . . . . . . . . . . . . . . 312 13.12. Security Considerations for the File Layout Type . . . . 313
14.1. Stringprep profile for the utf8str_cs type . . . . . . . 313 14. Internationalization . . . . . . . . . . . . . . . . . . . . 314
14.2. Stringprep profile for the utf8str_cis type . . . . . . 315 14.1. Stringprep profile for the utf8str_cs type . . . . . . . 315
14.3. Stringprep profile for the utf8str_mixed type . . . . . 316 14.2. Stringprep profile for the utf8str_cis type . . . . . . 317
14.4. UTF-8 Capabilities . . . . . . . . . . . . . . . . . . . 318 14.3. Stringprep profile for the utf8str_mixed type . . . . . 318
14.5. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 318 14.4. UTF-8 Capabilities . . . . . . . . . . . . . . . . . . . 320
15. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 319 14.5. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 320
15.1. Error Definitions . . . . . . . . . . . . . . . . . . . 319 15. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 321
15.1.1. General Errors . . . . . . . . . . . . . . . . . . . 321 15.1. Error Definitions . . . . . . . . . . . . . . . . . . . 321
15.1.2. Filehandle Errors . . . . . . . . . . . . . . . . . 323 15.1.1. General Errors . . . . . . . . . . . . . . . . . . . 323
15.1.3. Compound Structure Errors . . . . . . . . . . . . . 324 15.1.2. Filehandle Errors . . . . . . . . . . . . . . . . . 325
15.1.4. File System Errors . . . . . . . . . . . . . . . . . 326 15.1.3. Compound Structure Errors . . . . . . . . . . . . . 326
15.1.5. State Management Errors . . . . . . . . . . . . . . 328 15.1.4. File System Errors . . . . . . . . . . . . . . . . . 328
15.1.6. Security Errors . . . . . . . . . . . . . . . . . . 329 15.1.5. State Management Errors . . . . . . . . . . . . . . 330
15.1.7. Name Errors . . . . . . . . . . . . . . . . . . . . 329 15.1.6. Security Errors . . . . . . . . . . . . . . . . . . 331
15.1.8. Locking Errors . . . . . . . . . . . . . . . . . . . 330 15.1.7. Name Errors . . . . . . . . . . . . . . . . . . . . 331
15.1.9. Reclaim Errors . . . . . . . . . . . . . . . . . . . 331 15.1.8. Locking Errors . . . . . . . . . . . . . . . . . . . 332
15.1.10. pNFS Errors . . . . . . . . . . . . . . . . . . . . 332 15.1.9. Reclaim Errors . . . . . . . . . . . . . . . . . . . 333
15.1.11. Session Use Errors . . . . . . . . . . . . . . . . . 333 15.1.10. pNFS Errors . . . . . . . . . . . . . . . . . . . . 334
15.1.12. Session Management Errors . . . . . . . . . . . . . 334 15.1.11. Session Use Errors . . . . . . . . . . . . . . . . . 335
15.1.13. Client Management Errors . . . . . . . . . . . . . . 335 15.1.12. Session Management Errors . . . . . . . . . . . . . 336
15.1.14. Delegation Errors . . . . . . . . . . . . . . . . . 336 15.1.13. Client Management Errors . . . . . . . . . . . . . . 337
15.1.15. Attribute Handling Errors . . . . . . . . . . . . . 336 15.1.14. Delegation Errors . . . . . . . . . . . . . . . . . 338
15.1.16. Obsoleted Errors . . . . . . . . . . . . . . . . . . 337 15.1.15. Attribute Handling Errors . . . . . . . . . . . . . 338
15.2. Operations and their valid errors . . . . . . . . . . . 338 15.1.16. Obsoleted Errors . . . . . . . . . . . . . . . . . . 339
15.3. Callback operations and their valid errors . . . . . . . 354 15.2. Operations and their valid errors . . . . . . . . . . . 340
15.4. Errors and the operations that use them . . . . . . . . 356 15.3. Callback operations and their valid errors . . . . . . . 356
16. NFSv4.1 Procedures . . . . . . . . . . . . . . . . . . . . . 370 15.4. Errors and the operations that use them . . . . . . . . 358
16.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 370 16. NFSv4.1 Procedures . . . . . . . . . . . . . . . . . . . . . 372
16.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 371 16.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 372
17. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . 381 16.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 373
18. NFSv4.1 Operations . . . . . . . . . . . . . . . . . . . . . 384 17. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . 383
18.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 384 18. NFSv4.1 Operations . . . . . . . . . . . . . . . . . . . . . 386
18.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 387 18.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 386
18.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 388 18.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 389
18.4. Operation 6: CREATE - Create a Non-Regular File Object . 391 18.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 390
18.4. Operation 6: CREATE - Create a Non-Regular File Object . 393
18.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting 18.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting
Recovery . . . . . . . . . . . . . . . . . . . . . . . . 394 Recovery . . . . . . . . . . . . . . . . . . . . . . . . 396
18.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 395 18.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 397
18.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 395 18.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 397
18.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 397 18.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 399
18.9. Operation 11: LINK - Create Link to a File . . . . . . . 398 18.9. Operation 11: LINK - Create Link to a File . . . . . . . 400
18.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 400 18.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 402
18.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 404 18.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 406
18.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 406 18.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 408
18.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 407 18.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 409
18.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 409 18.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 411
18.15. Operation 17: NVERIFY - Verify Difference in 18.15. Operation 17: NVERIFY - Verify Difference in
Attributes . . . . . . . . . . . . . . . . . . . . . . . 410 Attributes . . . . . . . . . . . . . . . . . . . . . . . 412
18.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 411 18.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 413
18.17. Operation 19: OPENATTR - Open Named Attribute 18.17. Operation 19: OPENATTR - Open Named Attribute
Directory . . . . . . . . . . . . . . . . . . . . . . . 430 Directory . . . . . . . . . . . . . . . . . . . . . . . 432
18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 431 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 433
18.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 432 18.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 434
18.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 433 18.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 435
18.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 435 18.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 437
18.22. Operation 25: READ - Read from File . . . . . . . . . . 435 18.22. Operation 25: READ - Read from File . . . . . . . . . . 437
18.23. Operation 26: READDIR - Read Directory . . . . . . . . . 438 18.23. Operation 26: READDIR - Read Directory . . . . . . . . . 440
18.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 441 18.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 443
18.25. Operation 28: REMOVE - Remove File System Object . . . . 442 18.25. Operation 28: REMOVE - Remove File System Object . . . . 444
18.26. Operation 29: RENAME - Rename Directory Entry . . . . . 445 18.26. Operation 29: RENAME - Rename Directory Entry . . . . . 447
18.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 448 18.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 450
18.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 449 18.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 451
18.29. Operation 33: SECINFO - Obtain Available Security . . . 450 18.29. Operation 33: SECINFO - Obtain Available Security . . . 452
18.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 453 18.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 455
18.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 456 18.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 458
18.32. Operation 38: WRITE - Write to File . . . . . . . . . . 457 18.32. Operation 38: WRITE - Write to File . . . . . . . . . . 459
18.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 462 18.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 464
18.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 463 18.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 465
18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 466 18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 468
18.36. Operation 43: CREATE_SESSION - Create New Session and 18.36. Operation 43: CREATE_SESSION - Create New Session and
Confirm Client ID . . . . . . . . . . . . . . . . . . . 482 Confirm Client ID . . . . . . . . . . . . . . . . . . . 484
18.37. Operation 44: DESTROY_SESSION - Destroy existing 18.37. Operation 44: DESTROY_SESSION - Destroy existing
session . . . . . . . . . . . . . . . . . . . . . . . . 492 session . . . . . . . . . . . . . . . . . . . . . . . . 494
18.38. Operation 45: FREE_STATEID - Free stateid with no 18.38. Operation 45: FREE_STATEID - Free stateid with no
locks . . . . . . . . . . . . . . . . . . . . . . . . . 494 locks . . . . . . . . . . . . . . . . . . . . . . . . . 496
18.39. Operation 46: GET_DIR_DELEGATION - Get a directory 18.39. Operation 46: GET_DIR_DELEGATION - Get a directory
delegation . . . . . . . . . . . . . . . . . . . . . . . 495 delegation . . . . . . . . . . . . . . . . . . . . . . . 497
18.40. Operation 47: GETDEVICEINFO - Get Device Information . . 499 18.40. Operation 47: GETDEVICEINFO - Get Device Information . . 501
18.41. Operation 48: GETDEVICELIST - Get All Device Mappings 18.41. Operation 48: GETDEVICELIST - Get All Device Mappings
for a File System . . . . . . . . . . . . . . . . . . . 501 for a File System . . . . . . . . . . . . . . . . . . . 503
18.42. Operation 49: LAYOUTCOMMIT - Commit writes made using 18.42. Operation 49: LAYOUTCOMMIT - Commit writes made using
a layout . . . . . . . . . . . . . . . . . . . . . . . . 503 a layout . . . . . . . . . . . . . . . . . . . . . . . . 505
18.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 506 18.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 508
18.44. Operation 51: LAYOUTRETURN - Release Layout 18.44. Operation 51: LAYOUTRETURN - Release Layout
Information . . . . . . . . . . . . . . . . . . . . . . 510 Information . . . . . . . . . . . . . . . . . . . . . . 512
18.45. Operation 52: SECINFO_NO_NAME - Get Security on 18.45. Operation 52: SECINFO_NO_NAME - Get Security on
Unnamed Object . . . . . . . . . . . . . . . . . . . . . 515 Unnamed Object . . . . . . . . . . . . . . . . . . . . . 517
18.46. Operation 53: SEQUENCE - Supply per-procedure 18.46. Operation 53: SEQUENCE - Supply per-procedure
sequencing and control . . . . . . . . . . . . . . . . . 516 sequencing and control . . . . . . . . . . . . . . . . . 518
18.47. Operation 54: SET_SSV - Update SSV for a Client ID . . . 522 18.47. Operation 54: SET_SSV - Update SSV for a Client ID . . . 524
18.48. Operation 55: TEST_STATEID - Test stateids for 18.48. Operation 55: TEST_STATEID - Test stateids for
validity . . . . . . . . . . . . . . . . . . . . . . . . 524 validity . . . . . . . . . . . . . . . . . . . . . . . . 526
18.49. Operation 56: WANT_DELEGATION - Request Delegation . . . 526 18.49. Operation 56: WANT_DELEGATION - Request Delegation . . . 528
18.50. Operation 57: DESTROY_CLIENTID - Destroy existing 18.50. Operation 57: DESTROY_CLIENTID - Destroy existing
client ID . . . . . . . . . . . . . . . . . . . . . . . 529 client ID . . . . . . . . . . . . . . . . . . . . . . . 532
18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims 18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims
Finished . . . . . . . . . . . . . . . . . . . . . . . . 530 Finished . . . . . . . . . . . . . . . . . . . . . . . . 532
18.52. Operation 10044: ILLEGAL - Illegal operation . . . . . . 532 18.52. Operation 10044: ILLEGAL - Illegal operation . . . . . . 535
19. NFSv4.1 Callback Procedures . . . . . . . . . . . . . . . . . 533 19. NFSv4.1 Callback Procedures . . . . . . . . . . . . . . . . . 535
19.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 533 19.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 536
19.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 533 19.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 536
20. NFSv4.1 Callback Operations . . . . . . . . . . . . . . . . . 538 20. NFSv4.1 Callback Operations . . . . . . . . . . . . . . . . . 540
20.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 538 20.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 540
20.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 539 20.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 541
20.3. Operation 5: CB_LAYOUTRECALL - Recall Layout from 20.3. Operation 5: CB_LAYOUTRECALL - Recall Layout from
Client . . . . . . . . . . . . . . . . . . . . . . . . . 540 Client . . . . . . . . . . . . . . . . . . . . . . . . . 542
20.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 544 20.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 546
20.5. Operation 7: CB_PUSH_DELEG - Offer Delegation to 20.5. Operation 7: CB_PUSH_DELEG - Offer Delegation to
Client . . . . . . . . . . . . . . . . . . . . . . . . . 548 Client . . . . . . . . . . . . . . . . . . . . . . . . . 550
20.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 549 20.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 551
20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal 20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal
Resources for Recallable Objects . . . . . . . . . . . . 551 Resources for Recallable Objects . . . . . . . . . . . . 553
20.8. Operation 10: CB_RECALL_SLOT - change flow control 20.8. Operation 10: CB_RECALL_SLOT - change flow control
limits . . . . . . . . . . . . . . . . . . . . . . . . . 552 limits . . . . . . . . . . . . . . . . . . . . . . . . . 554
20.9. Operation 11: CB_SEQUENCE - Supply backchannel 20.9. Operation 11: CB_SEQUENCE - Supply backchannel
sequencing and control . . . . . . . . . . . . . . . . . 553 sequencing and control . . . . . . . . . . . . . . . . . 555
20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending 20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending
Delegation Wants . . . . . . . . . . . . . . . . . . . . 555 Delegation Wants . . . . . . . . . . . . . . . . . . . . 557
20.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible 20.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible
lock availability . . . . . . . . . . . . . . . . . . . 556 lock availability . . . . . . . . . . . . . . . . . . . 558
20.12. Operation 14: CB_NOTIFY_DEVICEID - Notify device ID 20.12. Operation 14: CB_NOTIFY_DEVICEID - Notify device ID
changes . . . . . . . . . . . . . . . . . . . . . . . . 558 changes . . . . . . . . . . . . . . . . . . . . . . . . 560
20.13. Operation 10044: CB_ILLEGAL - Illegal Callback 20.13. Operation 10044: CB_ILLEGAL - Illegal Callback
Operation . . . . . . . . . . . . . . . . . . . . . . . 560 Operation . . . . . . . . . . . . . . . . . . . . . . . 562
21. Security Considerations . . . . . . . . . . . . . . . . . . . 560 21. Security Considerations . . . . . . . . . . . . . . . . . . . 562
22. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 562 22. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 564
22.1. Named Attribute Definitions . . . . . . . . . . . . . . 562 22.1. Named Attribute Definitions . . . . . . . . . . . . . . 564
22.2. ONC RPC Network Identifiers (netids) . . . . . . . . . . 562 22.2. ONC RPC Network Identifiers (netids) . . . . . . . . . . 564
22.3. Defining New Notifications . . . . . . . . . . . . . . . 563 22.3. Defining New Notifications . . . . . . . . . . . . . . . 565
22.4. Defining New Layout Types . . . . . . . . . . . . . . . 563 22.4. Defining New Layout Types . . . . . . . . . . . . . . . 565
22.5. Path Variable Definitions . . . . . . . . . . . . . . . 565 22.5. Path Variable Definitions . . . . . . . . . . . . . . . 567
22.5.1. Path Variable Values . . . . . . . . . . . . . . . . 565 22.5.1. Path Variable Values . . . . . . . . . . . . . . . . 567
22.5.2. Path Variable Names . . . . . . . . . . . . . . . . 565 22.5.2. Path Variable Names . . . . . . . . . . . . . . . . 567
23. References . . . . . . . . . . . . . . . . . . . . . . . . . 565 23. References . . . . . . . . . . . . . . . . . . . . . . . . . 567
23.1. Normative References . . . . . . . . . . . . . . . . . . 565 23.1. Normative References . . . . . . . . . . . . . . . . . . 567
23.2. Informative References . . . . . . . . . . . . . . . . . 567 23.2. Informative References . . . . . . . . . . . . . . . . . 569
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 568 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 570
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 570 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 572
Intellectual Property and Copyright Statements . . . . . . . . . 572 Intellectual Property and Copyright Statements . . . . . . . . . 574
1. Introduction 1. Introduction
1.1. The NFS Version 4 Minor Version 1 Protocol 1.1. The NFS Version 4 Minor Version 1 Protocol
The NFS version 4 minor version 1 (NFSv4.1) protocol is the second The NFS version 4 minor version 1 (NFSv4.1) protocol is the second
minor version of the NFS version 4 (NFSv4) protocol. The first minor minor version of the NFS version 4 (NFSv4) protocol. The first minor
version, NFSv4.0 is described in [21]. It generally follows the version, NFSv4.0 is described in [21]. It generally follows the
guidelines for minor versioning model listed in Section 10 of RFC guidelines for minor versioning model listed in Section 10 of RFC
3530. However, it diverges from guidelines 11 ("a client and server 3530. However, it diverges from guidelines 11 ("a client and server
skipping to change at page 18, line 42 skipping to change at page 18, line 42
long as the delegation is held. long as the delegation is held.
o Layouts, which are recallable objects that assure the holder that o Layouts, which are recallable objects that assure the holder that
direct access to the file data may be performed directly by the direct access to the file data may be performed directly by the
client and that no change to the data's location inconsistent with client and that no change to the data's location inconsistent with
that access may be made so long as the layout is held. that access may be made so long as the layout is held.
All locks for a given client are tied together under a single client- All locks for a given client are tied together under a single client-
wide lease. All requests made on sessions associated with the client wide lease. All requests made on sessions associated with the client
renew that lease. When leases are not promptly renewed locks are renew that lease. When leases are not promptly renewed locks are
subject to revocation. In the event of server reboot, clients have subject to revocation. In the event of server restart, clients have
the opportunity to safely reclaim their locks within a special grace the opportunity to safely reclaim their locks within a special grace
period. period.
1.7. Differences from NFSv4.0 1.7. Differences from NFSv4.0
The following summarizes the major differences between minor version The following summarizes the major differences between minor version
one and the base protocol: one and the base protocol:
o Implementation of the sessions model. o Implementation of the sessions model.
skipping to change at page 26, line 25 skipping to change at page 26, line 25
the session is persistent (see Section 2.10.5.5), but in each case the session is persistent (see Section 2.10.5.5), but in each case
the client will receive this error when it attempts to establish a the client will receive this error when it attempts to establish a
new session with the existing client ID and receives the error new session with the existing client ID and receives the error
NFS4ERR_STALE_CLIENTID, indicating that a new client ID must be NFS4ERR_STALE_CLIENTID, indicating that a new client ID must be
obtained via EXCHANGE_ID and the new session established with that obtained via EXCHANGE_ID and the new session established with that
client ID. client ID.
When a session is not persistent, the client will find out that it When a session is not persistent, the client will find out that it
needs to create a new session as a result of getting an needs to create a new session as a result of getting an
NFS4ERR_BADSESSION, since the session in question was lost as part of NFS4ERR_BADSESSION, since the session in question was lost as part of
a server reboot. When the existing client ID is presented to a a server restart. When the existing client ID is presented to a
server as part of creating a session and that client ID is not server as part of creating a session and that client ID is not
recognized, as would happen after a server restart, the server will recognized, as would happen after a server restart, the server will
reject the request with the error NFS4ERR_STALE_CLIENTID. reject the request with the error NFS4ERR_STALE_CLIENTID.
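The recovery order described above for a non-persistent session can be sketched as follows. This is only an illustration: the function name and the step strings are hypothetical, and the boolean parameter stands in for whether the server still recognizes the client ID (it would not after a restart).

```python
from typing import List


def recovery_steps(client_id_recognized: bool) -> List[str]:
    """Return the operations a client would send after NFS4ERR_BADSESSION.

    Hypothetical helper: models only the ordering described in the text,
    not the wire protocol.
    """
    # The client first tries to create a new session with its existing client ID.
    steps = ["CREATE_SESSION (existing client ID)"]
    if not client_id_recognized:
        # The server answered NFS4ERR_STALE_CLIENTID, so a new client ID
        # must be obtained and the session established with it.
        steps.append("EXCHANGE_ID")
        steps.append("CREATE_SESSION (new client ID)")
    return steps
```

When the client ID is still recognized, the sketch returns only the first step.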
In the case of the session being persistent, the client will re- In the case of the session being persistent, the client will re-
establish communication using the existing session after the restart. establish communication using the existing session after the restart.
This session will be associated with the existing client ID but may This session will be associated with the existing client ID but may
only be used to retransmit operations that the client previously only be used to retransmit operations that the client previously
transmitted and did not see replies to. Replies to operations that transmitted and did not see replies to. Replies to operations that
the server previously performed will come from the reply cache, the server previously performed will come from the reply cache,
skipping to change at page 27, line 45 skipping to change at page 27, line 45
the client ID in order to conserve resources. If the client contacts the client ID in order to conserve resources. If the client contacts
the server after this release, the server must ensure the client the server after this release, the server must ensure the client
receives the appropriate error so that it will use the EXCHANGE_ID/ receives the appropriate error so that it will use the EXCHANGE_ID/
CREATE_SESSION sequence to establish a new client ID. The server CREATE_SESSION sequence to establish a new client ID. The server
ought to be very hesitant to release a client ID since the resulting ought to be very hesitant to release a client ID since the resulting
work on the client to recover from such an event will be the same work on the client to recover from such an event will be the same
burden as if the server had failed and restarted. Typically a server burden as if the server had failed and restarted. Typically a server
would not release a client ID unless there had been no activity from would not release a client ID unless there had been no activity from
that client for many minutes. As long as there are sessions, opens, that client for many minutes. As long as there are sessions, opens,
locks, delegations, layouts, or wants, the server MUST NOT release locks, delegations, layouts, or wants, the server MUST NOT release
the client ID. See Section 2.10.10.1.4 for discussion on releasing the client ID. See Section 2.10.11.1.4 for discussion on releasing
inactive sessions. inactive sessions.
2.4.3. Resolving Client Owner Conflicts 2.4.3. Resolving Client Owner Conflicts
When the server gets an EXCHANGE_ID for a client owner that currently When the server gets an EXCHANGE_ID for a client owner that currently
has no state, or that has state, but the lease has expired, the has no state, or that has state, but the lease has expired, the
server MUST allow the EXCHANGE_ID, and confirm the new client ID if server MUST allow the EXCHANGE_ID, and confirm the new client ID if
followed by the appropriate CREATE_SESSION. followed by the appropriate CREATE_SESSION.
When the server gets an EXCHANGE_ID for a new incarnation of a client When the server gets an EXCHANGE_ID for a new incarnation of a client
skipping to change at page 46, line 43 skipping to change at page 46, line 43
2.10.5. Exactly Once Semantics 2.10.5. Exactly Once Semantics
Via the session, NFSv4.1 offers Exactly Once Semantics (EOS) for Via the session, NFSv4.1 offers Exactly Once Semantics (EOS) for
requests sent over a channel. EOS is supported on both the fore and requests sent over a channel. EOS is supported on both the fore and
back channels. back channels.
Each COMPOUND or CB_COMPOUND request that is sent with a leading Each COMPOUND or CB_COMPOUND request that is sent with a leading
SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver
exactly once. This requirement holds regardless of whether the exactly once. This requirement holds regardless of whether the
request is sent with reply caching specified (see request is sent with reply caching specified (see
Section 2.10.5.1.2). The requirement holds even if the requester is Section 2.10.5.1.3). The requirement holds even if the requester is
issuing the request over a session created between a pNFS data client issuing the request over a session created between a pNFS data client
and pNFS data server. To understand the rationale for this and pNFS data server. To understand the rationale for this
requirement, divide the requests into three classifications: requirement, divide the requests into three classifications:
o Nonidempotent requests. o Nonidempotent requests.
o Idempotent modifying requests. o Idempotent modifying requests.
o Idempotent non-modifying requests. o Idempotent non-modifying requests.
skipping to change at page 49, line 40 skipping to change at page 49, line 40
seen in the slot. Note that because the sequence id must seen in the slot. Note that because the sequence id must
wraparound to zero (0) once it reaches 0xFFFFFFFF, a misordered wraparound to zero (0) once it reaches 0xFFFFFFFF, a misordered
new request and a misordered retry cannot be distinguished. Thus, new request and a misordered retry cannot be distinguished. Thus,
the replier MUST return NFS4ERR_SEQ_MISORDERED (as the result from the replier MUST return NFS4ERR_SEQ_MISORDERED (as the result from
SEQUENCE or CB_SEQUENCE). SEQUENCE or CB_SEQUENCE).
Unlike the XID, the slot id is always within a specific range; this Unlike the XID, the slot id is always within a specific range; this
has two implications. The first implication is that for a given has two implications. The first implication is that for a given
session, the replier need only cache the results of a limited number session, the replier need only cache the results of a limited number
of COMPOUND requests. The second implication derives from the of COMPOUND requests. The second implication derives from the
first, which is unlike XID-indexed reply caches (also known as first, which is that unlike XID-indexed reply caches (also known as
duplicate request caches - DRCs), the slot id-based reply cache duplicate request caches - DRCs), the slot id-based reply cache
cannot be overflowed. Through use of the sequence id to identify cannot be overflowed. Through use of the sequence id to identify
retransmitted requests, the replier does not need to actually cache retransmitted requests, the replier does not need to actually cache
the request itself, reducing the storage requirements of the reply the request itself, reducing the storage requirements of the reply
cache further. These facilities make it practical to maintain all cache further. These facilities make it practical to maintain all
the required entries for an effective reply cache. the required entries for an effective reply cache.
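A minimal sketch of a slot-indexed reply cache follows, assuming a fixed number of slots and an initial per-slot sequence id of 0. The class and method names are hypothetical, and NFS4ERR_BADSLOT is used for an out-of-range slot id as an assumption not stated in the text above. Because the table size is fixed, the cache cannot overflow, and only the reply (not the request) is stored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

SEQ_MODULUS = 1 << 32  # sequence ids wrap from 0xFFFFFFFF back to 0


@dataclass
class Slot:
    seqid: int = 0                       # last sequence id executed in this slot
    cached_reply: Optional[bytes] = None


class SlotReplyCache:
    """Hypothetical replier-side cache: one entry per slot, bounded by nslots."""

    def __init__(self, nslots: int) -> None:
        self.slots: List[Slot] = [Slot() for _ in range(nslots)]

    def classify(self, slotid: int, seqid: int) -> Tuple[str, Optional[bytes]]:
        """Decide whether a request is new, a retry, or misordered."""
        if not 0 <= slotid < len(self.slots):
            return ("NFS4ERR_BADSLOT", None)
        slot = self.slots[slotid]
        if seqid == slot.seqid:
            # Same sequence id as the last request seen: a retry.
            return ("RETRY", slot.cached_reply)
        if seqid == (slot.seqid + 1) % SEQ_MODULUS:
            # One higher (with wraparound): a new request.
            return ("NEW", None)
        return ("NFS4ERR_SEQ_MISORDERED", None)

    def record(self, slotid: int, seqid: int, reply: bytes) -> None:
        """Store the reply after a new request has been executed."""
        slot = self.slots[slotid]
        slot.seqid = seqid
        slot.cached_reply = reply
```

A retry is recognized purely from the slot id and sequence id, which is why the request itself never needs to be retained.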
The slot id, sequence id, and sessionid therefore take over the The slot id, sequence id, and sessionid therefore take over the
traditional role of the XID and source network address in the traditional role of the XID and source network address in the
replier's reply cache implementation. This approach is considerably replier's reply cache implementation. This approach is considerably
skipping to change at page 52, line 23 skipping to change at page 52, line 23
because the request may have been sent from the requester before because the request may have been sent from the requester before
the update was received. Therefore, in the downward adjustment the update was received. Therefore, in the downward adjustment
case, the replier may have to retain a number of reply cache case, the replier may have to retain a number of reply cache
entries at least as large as the old value of maximum requests entries at least as large as the old value of maximum requests
outstanding, until it can infer that the requester has seen a outstanding, until it can infer that the requester has seen a
reply containing the new granted highest_slotid. The replier can reply containing the new granted highest_slotid. The replier can
infer that the requester has seen such a reply when it receives a new infer that the requester has seen such a reply when it receives a new
request with the same slotid as the request replied to and the request with the same slotid as the request replied to and the
next higher sequenceid. next higher sequenceid.
2.10.5.1.1. Errors from SEQUENCE and CB_SEQUENCE 2.10.5.1.1. Caching of SEQUENCE and CB_SEQUENCE Replies
When a SEQUENCE or CB_SEQUENCE operation is successfully executed,
its reply MUST always be cached. Specifically, sessionid,
sequenceid, and slotid MUST be cached in the reply cache. The reply
from SEQUENCE also includes the highest slotid, target highest
slotid, and status flags. Instead of caching these values, the
server MAY re-compute the values from the current state of the fore
channel, session and/or client ID as appropriate. Similarly, the
reply from CB_SEQUENCE includes a highest slotid and target highest
slotid. The client MAY re-compute the values from the current state
of the session as appropriate.
Regardless of whether a replier is re-computing highest slotid,
target slotid, and status on replies to retries or not, the requester
MUST NOT assume the values are being re-computed whenever it receives
a reply after a retry is sent, since it has no way of knowing whether
the reply it has received was sent by the server in response to the
retry, or is a delayed response to the original request. Therefore,
it may be the case that highest slotid, target slotid, or status bits
may reflect the state of affairs when the request was first executed.
Although acting based on such delayed information is valid, it may
cause the receiver to do unneeded work. Requesters MAY choose to
send additional requests to get the current state of affairs or use
the state of affairs reported by subsequent requests, in preference
to acting immediately on data which may be out of date.
2.10.5.1.2. Errors from SEQUENCE and CB_SEQUENCE
Any time SEQUENCE or CB_SEQUENCE returns an error, the sequence id of Any time SEQUENCE or CB_SEQUENCE returns an error, the sequence id of
the slot MUST NOT change. The replier MUST NOT modify the reply the slot MUST NOT change. The replier MUST NOT modify the reply
cache entry for the slot whenever an error is returned from SEQUENCE cache entry for the slot whenever an error is returned from SEQUENCE
or CB_SEQUENCE. or CB_SEQUENCE.
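The rule above can be illustrated with a short sketch. The dictionary layout of a slot and the function name are hypothetical; the only point shown is that a failed SEQUENCE or CB_SEQUENCE leaves both the slot's sequence id and its cached reply untouched.

```python
from typing import Dict, Optional


def apply_sequence_result(slot: Dict[str, Optional[object]],
                          seqid: int,
                          status: str,
                          reply: bytes) -> Dict[str, Optional[object]]:
    """Return the slot state after a SEQUENCE/CB_SEQUENCE result.

    Hypothetical helper: 'slot' holds 'seqid' and 'reply' keys.
    """
    if status != "NFS4_OK":
        # Error: the sequence id MUST NOT change and the cache entry
        # MUST NOT be modified, so the slot is returned as it was.
        return slot
    # Success: advance the slot and remember the new reply.
    return {"seqid": seqid, "reply": reply}
```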
2.10.5.1.2. Optional Reply Caching 2.10.5.1.3. Optional Reply Caching
On a per-request basis the requester can choose to direct the replier On a per-request basis the requester can choose to direct the replier
to cache the reply to all operations after the first operation to cache the reply to all operations after the first operation
(SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis
fields of the arguments to SEQUENCE or CB_SEQUENCE. The reason it fields of the arguments to SEQUENCE or CB_SEQUENCE. The reason it
would not direct the replier to cache the entire reply is that the would not direct the replier to cache the entire reply is that the
request is composed of all idempotent operations [24]. Caching the request is composed of all idempotent operations [24]. Caching the
reply may offer little benefit. If the reply is too large (see reply may offer little benefit. If the reply is too large (see
Section 2.10.5.4), it may not be cacheable anyway. Even if the reply Section 2.10.5.4), it may not be cacheable anyway. Even if the reply
to an idempotent request is small enough to cache, unnecessarily caching to an idempotent request is small enough to cache, unnecessarily caching
skipping to change at page 53, line 9 skipping to change at page 53, line 38
incremented by one. If a requester does not direct the replier to incremented by one. If a requester does not direct the replier to
cache the reply, the replier MUST do one of the following: cache the reply, the replier MUST do one of the following:
o The replier can cache the entire original reply. Even though o The replier can cache the entire original reply. Even though
sa_cachethis or csa_cachethis are FALSE, the replier is always sa_cachethis or csa_cachethis are FALSE, the replier is always
free to cache. It may choose this approach in order to simplify free to cache. It may choose this approach in order to simplify
implementation. implementation.
o The replier enters into its reply cache a reply consisting of the o The replier enters into its reply cache a reply consisting of the
original results to the SEQUENCE or CB_SEQUENCE operation, and original results to the SEQUENCE or CB_SEQUENCE operation, and
with the next operation in COMPOUND or CB)COMPOUND having the with the next operation in COMPOUND or CB_COMPOUND having the
error NFS4ERR_RETRY_UNCACHED_REP. Thus if the requester later error NFS4ERR_RETRY_UNCACHED_REP. Thus if the requester later
retries the request, it will get NFS4ERR_RETRY_UNCACHED_REP. retries the request, it will get NFS4ERR_RETRY_UNCACHED_REP.
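The two choices above can be sketched as follows. This is a minimal illustration, not text from either draft: the OpResult type, the build_cache_entry helper, and the cache_everything switch are hypothetical names, and a reply is modeled simply as a list of per-operation results whose first element is SEQUENCE or CB_SEQUENCE.

```python
from dataclasses import dataclass
from typing import List, Optional

NFS4_OK = "NFS4_OK"
NFS4ERR_RETRY_UNCACHED_REP = "NFS4ERR_RETRY_UNCACHED_REP"


@dataclass
class OpResult:
    """Hypothetical per-operation result: operation name plus status."""
    op: str
    status: str


def build_cache_entry(reply: List[OpResult], cachethis: bool,
                      cache_everything: bool = False) -> Optional[List[OpResult]]:
    """Return what the replier stores in the slot's reply cache.

    None means "leave the existing cache entry untouched", which applies
    when SEQUENCE or CB_SEQUENCE itself returned an error.
    """
    if reply[0].status != NFS4_OK:
        return None
    # Requester asked for caching, or the replier chooses to cache anyway.
    if cachethis or cache_everything:
        return list(reply)
    # Otherwise keep only the SEQUENCE result and mark the next operation
    # so that a later retry sees NFS4ERR_RETRY_UNCACHED_REP.
    entry = [reply[0]]
    if len(reply) > 1:
        entry.append(OpResult(reply[1].op, NFS4ERR_RETRY_UNCACHED_REP))
    return entry
```

A replier that always passes cache_everything=True corresponds to the first choice, which trades memory for a simpler implementation.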
2.10.5.2. Retry and Replay of Reply 2.10.5.2. Retry and Replay of Reply
A requester MUST NOT retry a request, unless the connection it used A requester MUST NOT retry a request, unless the connection it used
to send the request disconnects. The requester can then reconnect to send the request disconnects. The requester can then reconnect
and re-send the request, or it can re-send the request over a and re-send the request, or it can re-send the request over a
different connection that is associated with the same session. different connection that is associated with the same session.
skipping to change at page 56, line 11 skipping to change at page 56, line 40
If a reply exceeds ca_maxresponsesize, the reply will have the status If a reply exceeds ca_maxresponsesize, the reply will have the status
NFS4ERR_REP_TOO_BIG. A replier MAY return NFS4ERR_REP_TOO_BIG as the NFS4ERR_REP_TOO_BIG. A replier MAY return NFS4ERR_REP_TOO_BIG as the
status for the first operation (SEQUENCE or CB_SEQUENCE) in the request, status for the first operation (SEQUENCE or CB_SEQUENCE) in the request,
or it MAY choose to return it on a subsequent operation (in the same or it MAY choose to return it on a subsequent operation (in the same
COMPOUND or CB_COMPOUND reply). A replier MAY return COMPOUND or CB_COMPOUND reply). A replier MAY return
NFS4ERR_REP_TOO_BIG in the reply to SEQUENCE or CB_SEQUENCE, even if NFS4ERR_REP_TOO_BIG in the reply to SEQUENCE or CB_SEQUENCE, even if
the response would still exceed ca_maxresponsesize. the response would still exceed ca_maxresponsesize.
If sa_cachethis or csa_cachethis are TRUE, then the replier MUST If sa_cachethis or csa_cachethis are TRUE, then the replier MUST
cache a reply except if an error is returned by the SEQUENCE or cache a reply except if an error is returned by the SEQUENCE or
CB_SEQUENCE operation (see Section 2.10.5.1.1). If the reply exceeds CB_SEQUENCE operation (see Section 2.10.5.1.2). If the reply exceeds
ca_maxresponsesize_cached, (and sa_cachethis or csa_cachethis are ca_maxresponsesize_cached, (and sa_cachethis or csa_cachethis are
TRUE) then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. Even TRUE) then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. Even
if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that matter) if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that matter)
is returned on an operation other than the first operation (SEQUENCE or is returned on an operation other than the first operation (SEQUENCE or
CB_SEQUENCE), then the reply MUST be cached if sa_cachethis or CB_SEQUENCE), then the reply MUST be cached if sa_cachethis or
csa_cachethis are TRUE. For example, if a COMPOUND has eleven csa_cachethis are TRUE. For example, if a COMPOUND has eleven
operations, including SEQUENCE, the fifth operation is a RENAME, and operations, including SEQUENCE, the fifth operation is a RENAME, and
the tenth operation is a READ for one million bytes, the server may the tenth operation is a READ for one million bytes, the server may
return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation. Since return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation. Since
the server executed several operations, especially the non-idempotent the server executed several operations, especially the non-idempotent
skipping to change at page 71, line 18 skipping to change at page 71, line 44
Section 5.2.2 "Context Creation Requests" in [4]). Section 5.2.2 "Context Creation Requests" in [4]).
2.10.9. Session Mechanics - Steady State 2.10.9. Session Mechanics - Steady State
2.10.9.1. Obligations of the Server 2.10.9.1. Obligations of the Server
The server has the primary obligation to monitor the state of The server has the primary obligation to monitor the state of
backchannel resources that the client has created for the server backchannel resources that the client has created for the server
(RPCSEC_GSS contexts and backchannel connections). If these (RPCSEC_GSS contexts and backchannel connections). If these
resources vanish, the server takes action as specified in resources vanish, the server takes action as specified in
Section 2.10.10.2. Section 2.10.11.2.
2.10.9.2. Obligations of the Client 2.10.9.2. Obligations of the Client
The client SHOULD honor the following obligations in order to utilize The client SHOULD honor the following obligations in order to utilize
the session: the session:
o Keep a necessary session from going idle on the server. A client o Keep a necessary session from going idle on the server. A client
that requires a session, but nonetheless is not sending operations that requires a session, but nonetheless is not sending operations
risks having the session be destroyed by the server. This is risks having the session be destroyed by the server. This is
because sessions consume resources, and resource limitations may because sessions consume resources, and resource limitations may
force the server to cull an inactive session. force the server to cull an inactive session. A server MAY
consider a session to be inactive if the client has not used the
session before the session inactivity timer (Section 2.10.10) has
expired.
o Destroy the session when not needed. If a client has multiple o Destroy the session when not needed. If a client has multiple
sessions, one of which has no requests waiting for replies, and sessions, one of which has no requests waiting for replies, and
has been idle for some period of time, it SHOULD destroy the has been idle for some period of time, it SHOULD destroy the
session. session.
o Maintain GSS contexts for the backchannel. If the client requires o Maintain GSS contexts for the backchannel. If the client requires
the server to use the RPCSEC_GSS security flavor for callbacks, the server to use the RPCSEC_GSS security flavor for callbacks,
then it needs to be sure the contexts handed to the server via then it needs to be sure the contexts handed to the server via
BACKCHANNEL_CTL are unexpired. BACKCHANNEL_CTL are unexpired.
skipping to change at page 72, line 47 skipping to change at page 73, line 28
If the client wants to use additional connections for the If the client wants to use additional connections for the
backchannel, then it must call BIND_CONN_TO_SESSION on each backchannel, then it must call BIND_CONN_TO_SESSION on each
connection it wants to use with the session. If the client wants to connection it wants to use with the session. If the client wants to
use additional connections for the fore channel, then it must call use additional connections for the fore channel, then it must call
BIND_CONN_TO_SESSION if it specified SP4_SSV or SP4_MACH_CRED state BIND_CONN_TO_SESSION if it specified SP4_SSV or SP4_MACH_CRED state
protection when the client ID was created. protection when the client ID was created.
At this point the session has reached steady state. At this point the session has reached steady state.
2.10.10. Session Mechanics - Recovery 2.10.10. Session Inactivity Timer
2.10.10.1. Events Requiring Client Action The server MAY maintain a session inactivity timer for each session.
If the session inactivity timer expires, then the server MAY destroy
the session. To avoid losing a session due to inactivity, the client
MUST renew the session inactivity timer. The length of the session
inactivity timer MUST NOT be less than the lease_time attribute
(Section 5.7.1.11). As with lease renewal (Section 8.3), when the
server receives a SEQUENCE operation, it resets the session
inactivity timer, and MUST NOT allow the timer to expire while the
rest of the operations in the COMPOUND procedure's request are still
executing. Once the last operation has finished, the server MUST set
the session inactivity timer to expire no sooner than the sum of the
current time and the value of the lease_time attribute.
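The timer rules in the new Section 2.10.10 can be sketched as follows. This is a minimal illustration under stated assumptions: the SessionInactivityTimer class, its method names, and the use of a plain numeric clock in seconds are hypothetical, and the counter of executing COMPOUNDs is just one way to keep the timer from expiring mid-request.

```python
class SessionInactivityTimer:
    """Hypothetical per-session timer kept by the server."""

    def __init__(self, lease_time: float, now: float):
        self.lease_time = lease_time   # value of the lease_time attribute
        self.in_progress = 0           # COMPOUNDs currently executing
        self.expires_at = now + lease_time

    def on_sequence(self, now: float) -> None:
        # Receiving SEQUENCE resets the timer and pins the session
        # until the rest of the COMPOUND has finished.
        self.in_progress += 1
        self.expires_at = now + self.lease_time

    def on_compound_done(self, now: float) -> None:
        # After the last operation, expire no sooner than now + lease_time.
        self.in_progress -= 1
        self.expires_at = now + self.lease_time

    def expired(self, now: float) -> bool:
        # Never report expiry while a request on the session is executing.
        return self.in_progress == 0 and now >= self.expires_at
```

Expiry only makes the session eligible for destruction; the server MAY still keep it.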
2.10.11. Session Mechanics - Recovery
2.10.11.1. Events Requiring Client Action
The following events require client action to recover. The following events require client action to recover.
2.10.10.1.1. RPCSEC_GSS Context Loss by Callback Path 2.10.11.1.1. RPCSEC_GSS Context Loss by Callback Path
If all RPCSEC_GSS contexts granted by the client to the server for If all RPCSEC_GSS contexts granted by the client to the server for
callback use have expired, the client MUST establish a new context callback use have expired, the client MUST establish a new context
via BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE via BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE
results indicates when callback contexts are nearly expired, or fully results indicates when callback contexts are nearly expired, or fully
expired (see Section 18.46.3). expired (see Section 18.46.3).
2.10.10.1.2. Connection Loss 2.10.11.1.2. Connection Loss
If the client loses the last connection of the session, and if it wants If the client loses the last connection of the session, and if it wants
to retain the session, then it must create a new connection, and if, to retain the session, then it must create a new connection, and if,
when the client ID was created, BIND_CONN_TO_SESSION was specified in when the client ID was created, BIND_CONN_TO_SESSION was specified in
the spo_must_enforce list, the client MUST use BIND_CONN_TO_SESSION the spo_must_enforce list, the client MUST use BIND_CONN_TO_SESSION
to associate the connection with the session. to associate the connection with the session.
If there was a request outstanding at the time of the connection If there was a request outstanding at the time of the connection
loss, then if the client wants to continue to use the session it MUST loss, then if the client wants to continue to use the session it MUST
retry the request, as described in Section 2.10.5.2. Note that it is retry the request, as described in Section 2.10.5.2. Note that it is
skipping to change at page 73, line 39 skipping to change at page 74, line 39
disconnect. disconnect.
If the connection that was lost was the last one associated with the If the connection that was lost was the last one associated with the
backchannel, and the client wants to retain the backchannel and/or backchannel, and the client wants to retain the backchannel and/or
not put recallable state subject to revocation, the client must not put recallable state subject to revocation, the client must
reconnect, and if it does, it MUST associate the connection to the reconnect, and if it does, it MUST associate the connection to the
session and backchannel via BIND_CONN_TO_SESSION. The server SHOULD session and backchannel via BIND_CONN_TO_SESSION. The server SHOULD
indicate when it has no callback connection via the sr_status_flags indicate when it has no callback connection via the sr_status_flags
result from SEQUENCE. result from SEQUENCE.
2.10.10.1.3. Backchannel GSS Context Loss 2.10.11.1.3. Backchannel GSS Context Loss
Via the sr_status_flags result of the SEQUENCE operation or other Via the sr_status_flags result of the SEQUENCE operation or other
means, the client will learn if some or all of the RPCSEC_GSS means, the client will learn if some or all of the RPCSEC_GSS
contexts it assigned to the backchannel have been lost. If the contexts it assigned to the backchannel have been lost. If the
client wants to retain the backchannel and/or not put recallable client wants to retain the backchannel and/or not put recallable
state subject to revocation, the client must use BACKCHANNEL_CTL state subject to revocation, the client must use BACKCHANNEL_CTL
to assign new contexts. to assign new contexts.
2.10.10.1.4. Loss of Session 2.10.11.1.4. Loss of Session
The replier might lose a record of the session. Causes include: The replier might lose a record of the session. Causes include:
o Replier failure and restart o Replier failure and restart
o A catastrophe that causes the reply cache to be corrupted or lost o A catastrophe that causes the reply cache to be corrupted or lost
on the media it was stored on. This applies even if the replier on the media it was stored on. This applies even if the replier
indicated in the CREATE_SESSION results that it would persist the indicated in the CREATE_SESSION results that it would persist the
cache. cache.
skipping to change at page 75, line 5 skipping to change at page 76, line 5
client ID; loss of client ID however does imply loss of session, client ID; loss of client ID however does imply loss of session,
lock, open, delegation, and layout state. See Section 8.4.2. A lock, open, delegation, and layout state. See Section 8.4.2. A
session can survive a server restart, but lock recovery may still be session can survive a server restart, but lock recovery may still be
needed. needed.
It is possible CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID It is possible CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID
(for example the server restarts and does not preserve client ID (for example the server restarts and does not preserve client ID
state). If so, the client needs to call EXCHANGE_ID, followed by state). If so, the client needs to call EXCHANGE_ID, followed by
CREATE_SESSION. CREATE_SESSION.
2.10.10.2. Events Requiring Server Action 2.10.11.2. Events Requiring Server Action
The following events require server action to recover. The following events require server action to recover.
2.10.10.2.1. Client Crash and Restart 2.10.11.2.1. Client Crash and Restart
As described in Section 18.35, a restarted client sends EXCHANGE_ID As described in Section 18.35, a restarted client sends EXCHANGE_ID
in such a way it causes the server to delete any sessions it had. in such a way it causes the server to delete any sessions it had.
2.10.10.2.2. Client Crash with No Restart 2.10.11.2.2. Client Crash with No Restart
If a client crashes and never comes back, it will never send If a client crashes and never comes back, it will never send
EXCHANGE_ID with its old client owner. Thus the server has session EXCHANGE_ID with its old client owner. Thus the server has session
state that will never be used again. After an extended period of state that will never be used again. After an extended period of
time and if the server has resource constraints, it MAY destroy the time and if the server has resource constraints, it MAY destroy the
old session as well as locking state. old session as well as locking state.
2.10.10.2.3. Extended Network Partition 2.10.11.2.3. Extended Network Partition
To the server, the extended network partition may be no different To the server, the extended network partition may be no different
from a client crash with no restart (see Section 2.10.10.2.2). from a client crash with no restart (see Section 2.10.11.2.2).
Unless the server can discern that there is a network partition, it Unless the server can discern that there is a network partition, it
is free to treat the situation as if the client has crashed is free to treat the situation as if the client has crashed
permanently. permanently.
2.10.10.2.4. Backchannel Connection Loss 2.10.11.2.4. Backchannel Connection Loss
If there were callback requests outstanding at the time of a If there were callback requests outstanding at the time of a
connection loss, then the server MUST retry the request, as described connection loss, then the server MUST retry the request, as described
in Section 2.10.5.2. Note that it is not necessary to retry requests in Section 2.10.5.2. Note that it is not necessary to retry requests
over a connection with the same source network address or the same over a connection with the same source network address or the same
destination network address as the lost connection. As long as the destination network address as the lost connection. As long as the
sessionid, slot id, and sequence id in the retry match that of the sessionid, slot id, and sequence id in the retry match that of the
original request, the callback target will recognize the request as a original request, the callback target will recognize the request as a
retry even if it did see the request prior to disconnect. retry even if it did see the request prior to disconnect.
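How a replier might recognize a retry from the sessionid, slot id, and sequence id alone can be sketched as follows. This is an illustration only: the classify_request helper and the dictionary-based slot table are hypothetical, and the rule that a new request carries the slot's last sequence id plus one (wrapping at 32 bits) is assumed from the slot sequencing rules elsewhere in the session model rather than from this paragraph.

```python
from typing import Dict, Optional, Tuple

SlotKey = Tuple[bytes, int]          # (sessionid, slot id)
SlotEntry = Tuple[int, object]       # (last sequence id, cached reply)


def classify_request(slot_table: Dict[SlotKey, SlotEntry],
                     sessionid: bytes, slotid: int,
                     seqid: int) -> Tuple[str, Optional[object]]:
    """Classify an incoming request without looking at network addresses."""
    entry = slot_table.get((sessionid, slotid))
    if entry is None:
        return "unknown-slot", None
    cached_seqid, cached_reply = entry
    if seqid == cached_seqid:
        # Same triple as the last request on this slot: a retry,
        # answered from the reply cache.
        return "retry", cached_reply
    if seqid == (cached_seqid + 1) % 2**32:
        # Next sequence id on the slot (sequenceid4 is a uint32).
        return "new", None
    return "misordered", None
```

Because the lookup key contains no transport information, the retry may arrive over any connection associated with the session.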
If the connection lost is the last one associated with the If the connection lost is the last one associated with the
backchannel, then the server MUST indicate that in the backchannel, then the server MUST indicate that in the
sr_status_flags field of every SEQUENCE reply until the backchannel sr_status_flags field of every SEQUENCE reply until the backchannel
is reestablished. There are two situations, each of which uses is reestablished. There are two situations, each of which uses
different status flags: no connectivity for the session's different status flags: no connectivity for the session's
backchannel, and no connectivity for any session backchannel of the backchannel, and no connectivity for any session backchannel of the
client. See Section 18.46 for a description of the appropriate flags client. See Section 18.46 for a description of the appropriate flags
in sr_status_flags. in sr_status_flags.
2.10.10.2.5. GSS Context Loss 2.10.11.2.5. GSS Context Loss
The server SHOULD monitor when the number of RPCSEC_GSS contexts The server SHOULD monitor when the number of RPCSEC_GSS contexts
assigned to the backchannel reaches one, and when that one context is assigned to the backchannel reaches one, and when that one context is
near expiry (i.e. between one and two periods of lease time), near expiry (i.e. between one and two periods of lease time),
indicate so in the sr_status_flags field of all SEQUENCE replies. indicate so in the sr_status_flags field of all SEQUENCE replies.
The server MUST indicate when all of the backchannel's assigned The server MUST indicate when all of the backchannel's assigned
RPCSEC_GSS contexts have expired in the sr_status_flags field of all RPCSEC_GSS contexts have expired in the sr_status_flags field of all
SEQUENCE replies. SEQUENCE replies.
2.10.11. Parallel NFS and Sessions 2.10.12. Parallel NFS and Sessions
A client and server can potentially be a non-pNFS implementation, a A client and server can potentially be a non-pNFS implementation, a
metadata server implementation, a data server implementation, or two metadata server implementation, a data server implementation, or two
or three types of implementations. The EXCHGID4_FLAG_USE_NON_PNFS, or three types of implementations. The EXCHGID4_FLAG_USE_NON_PNFS,
EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not
mutually exclusive) are passed in the EXCHANGE_ID arguments and mutually exclusive) are passed in the EXCHANGE_ID arguments and
results to allow the client to indicate how it wants to use sessions results to allow the client to indicate how it wants to use sessions
created under the client ID, and to allow the server to indicate how created under the client ID, and to allow the server to indicate how
it will allow the sessions to be used. See Section 13.1 for pNFS it will allow the sessions to be used. See Section 13.1 for pNFS
sessions considerations. sessions considerations.
skipping to change at page 78, line 19 skipping to change at page 79, line 19
| | Various defined file types. | | | Various defined file types. |
| nfsstat4 | enum nfsstat4; | | nfsstat4 | enum nfsstat4; |
| | Return value for operations. | | | Return value for operations. |
| offset4 | typedef uint64_t offset4; | | offset4 | typedef uint64_t offset4; |
| | Various offset designations (READ, WRITE, LOCK, | | | Various offset designations (READ, WRITE, LOCK, |
| | COMMIT). | | | COMMIT). |
| qop4 | typedef uint32_t qop4; | | qop4 | typedef uint32_t qop4; |
| | Quality of protection designation in SECINFO. | | | Quality of protection designation in SECINFO. |
| sec_oid4 | typedef opaque sec_oid4<>; | | sec_oid4 | typedef opaque sec_oid4<>; |
| | Security Object Identifier. The sec_oid4 data | | | Security Object Identifier. The sec_oid4 data |
| | type is not really opaque. Instead it contains | | | type is not really opaque. Instead it contains an |
| | an ASN.1 OBJECT IDENTIFIER as used by GSS-API in | | | ASN.1 OBJECT IDENTIFIER as used by GSS-API in the |
| | the mech_type argument to GSS_Init_sec_context. | | | mech_type argument to GSS_Init_sec_context. See |
| | See [7] for details. | | | [7] for details. |
| sequenceid4 | typedef uint32_t sequenceid4; | | sequenceid4 | typedef uint32_t sequenceid4; |
| | Sequence number used for various session | | | Sequence number used for various session |
| | operations (EXCHANGE_ID, CREATE_SESSION, | | | operations (EXCHANGE_ID, CREATE_SESSION, |
| | SEQUENCE, CB_SEQUENCE). | | | SEQUENCE, CB_SEQUENCE). |
| seqid4 | typedef uint32_t seqid4; | | seqid4 | typedef uint32_t seqid4; |
| | Sequence identifier used for file locking. | | | Sequence identifier used for file locking. |
| sessionid4 | typedef opaque sessionid4[NFS4_SESSIONID_SIZE]; | | sessionid4 | typedef opaque sessionid4[NFS4_SESSIONID_SIZE]; |
| | Session identifier. | | | Session identifier. |
| slotid4 | typedef uint32_t slotid4; | | slotid4 | typedef uint32_t slotid4; |
| | Sequencing artifact for various session | | | Sequencing artifact for various session |
skipping to change at page 84, line 46 skipping to change at page 85, line 46
3.3.14. deviceid4 3.3.14. deviceid4
const NFS4_DEVICEID4_SIZE = 16; const NFS4_DEVICEID4_SIZE = 16;
typedef opaque deviceid4[NFS4_DEVICEID4_SIZE]; typedef opaque deviceid4[NFS4_DEVICEID4_SIZE];
Layout information includes device IDs that specify a storage device Layout information includes device IDs that specify a storage device
through a compact handle. Addressing and type information is through a compact handle. Addressing and type information is
obtained with the GETDEVICEINFO operation. Device IDs are not obtained with the GETDEVICEINFO operation. Device IDs are not
guaranteed to be valid across metadata server reboots. A device ID guaranteed to be valid across metadata server restarts. A device ID
is unique per client ID and layout type. See Section 12.2.10 for is unique per client ID and layout type. See Section 12.2.10 for
more details. more details.
3.3.15. device_addr4 3.3.15. device_addr4
struct device_addr4 { struct device_addr4 {
layouttype4 da_layout_type; layouttype4 da_layout_type;
opaque da_addr_body<>; opaque da_addr_body<>;
}; };
skipping to change at page 91, line 11 skipping to change at page 92, line 11
the same underlying file object and associated data. For example, if the same underlying file object and associated data. For example, if
paths /a/b/c and /a/d/c refer to the same file, the server SHOULD paths /a/b/c and /a/d/c refer to the same file, the server SHOULD
return the same filehandle for both path name traversals. return the same filehandle for both path name traversals.
4.2.2. Persistent Filehandle 4.2.2. Persistent Filehandle
A persistent filehandle is defined as having a fixed value for the A persistent filehandle is defined as having a fixed value for the
lifetime of the file system object to which it refers. Once the lifetime of the file system object to which it refers. Once the
server creates the filehandle for a file system object, the server server creates the filehandle for a file system object, the server
MUST accept the same filehandle for the object for the lifetime of MUST accept the same filehandle for the object for the lifetime of
the object. If the server restarts or reboots the NFS server must the object. If the server restarts, the NFS server must honor the
honor the same filehandle value as it did in the server's previous same filehandle value as it did in the server's previous
instantiation. Similarly, if the file system is migrated, the new instantiation. Similarly, if the file system is migrated, the new
NFS server must honor the same filehandle as the old NFS server. NFS server must honor the same filehandle as the old NFS server.
The persistent filehandle will become stale or invalid when the The persistent filehandle will become stale or invalid when the
file system object is removed. When the server is presented with a file system object is removed. When the server is presented with a
persistent filehandle that refers to a deleted object, it MUST return persistent filehandle that refers to a deleted object, it MUST return
an error of NFS4ERR_STALE. A filehandle may become stale when the an error of NFS4ERR_STALE. A filehandle may become stale when the
file system containing the object is no longer available. The file file system containing the object is no longer available. The file
system may become unavailable if it exists on removable media and the system may become unavailable if it exists on removable media and the
media is no longer available at the server or the file system in media is no longer available at the server or the file system in
skipping to change at page 93, line 16 skipping to change at page 94, line 16
o generation number is the generation number for the table entry/ o generation number is the generation number for the table entry/
slot slot
When the client presents a volatile filehandle, the server makes the When the client presents a volatile filehandle, the server makes the
following checks, which assume that the check for the volatile bit following checks, which assume that the check for the volatile bit
has passed. If the server boot time is less than the current server has passed. If the server boot time is less than the current server
boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return
NFS4ERR_BADHANDLE. If the generation number does not match, return NFS4ERR_BADHANDLE. If the generation number does not match, return
NFS4ERR_FHEXPIRED. NFS4ERR_FHEXPIRED.
When the server reboots, the table is gone (it is volatile). When the server restarts, the table is gone (it is volatile).
If the volatile bit is 0, then it is a persistent filehandle with a If the volatile bit is 0, then it is a persistent filehandle with a
different structure following it. different structure following it.
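The checks on a volatile filehandle can be sketched as follows. This is a minimal illustration: the VolatileFH type, the check_volatile_fh helper, and the representation of the server's table as a list of generation numbers indexed by slot are hypothetical, and the volatile bit is assumed to have been tested already.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VolatileFH:
    """Hypothetical decoded form of the example volatile filehandle."""
    boot_time: int    # server boot time recorded in the filehandle
    slot: int         # index into the server's in-memory table
    generation: int   # generation number of that table entry


def check_volatile_fh(fh: VolatileFH, current_boot_time: int,
                      table: List[int]) -> str:
    """Apply the three checks in the order the text gives them."""
    if fh.boot_time < current_boot_time:
        # Issued by an earlier server instance; its table is gone.
        return "NFS4ERR_FHEXPIRED"
    if not 0 <= fh.slot < len(table):
        return "NFS4ERR_BADHANDLE"
    if table[fh.slot] != fh.generation:
        # The slot has been reused for another object.
        return "NFS4ERR_FHEXPIRED"
    return "NFS4_OK"
```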
4.4. Client Recovery from Filehandle Expiration 4.4. Client Recovery from Filehandle Expiration
If possible, the client SHOULD recover from the receipt of an If possible, the client SHOULD recover from the receipt of an
NFS4ERR_FHEXPIRED error. The client must take on additional NFS4ERR_FHEXPIRED error. The client must take on additional
responsibility so that it may prepare itself to recover from the responsibility so that it may prepare itself to recover from the
expiration of a volatile filehandle. If the server returns expiration of a volatile filehandle. If the server returns
skipping to change at page 94, line 32 skipping to change at page 95, line 32
server supports and construct requests with only those supported server supports and construct requests with only those supported
attributes (or a subset thereof). attributes (or a subset thereof).
To this end, attributes are divided into three groups: REQUIRED, To this end, attributes are divided into three groups: REQUIRED,
RECOMMENDED, and named. Both REQUIRED and RECOMMENDED attributes are RECOMMENDED, and named. Both REQUIRED and RECOMMENDED attributes are
supported in the NFSv4.1 protocol by a specific and well-defined supported in the NFSv4.1 protocol by a specific and well-defined
encoding and are identified by number. They are requested by setting encoding and are identified by number. They are requested by setting
a bit in the bit vector sent in the GETATTR request; the server a bit in the bit vector sent in the GETATTR request; the server
response includes a bit vector to list what attributes were returned response includes a bit vector to list what attributes were returned
in the response. New REQUIRED or RECOMMENDED attributes may be added in the response. New REQUIRED or RECOMMENDED attributes may be added
to the NFS protocol between major revisions by publishing a to the NFSv4 protocol as part of a new minor version by publishing a
standards-track RFC which allocates a new attribute number value and standards-track RFC which allocates a new attribute number value and
defines the encoding for the attribute. See Section 2.7 for further defines the encoding for the attribute. See Section 2.7 for further
discussion. discussion.
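Requesting attributes by number through a bit vector can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, the bit vector is modeled as a list of 32-bit words with attribute n at bit (n mod 32) of word (n div 32), and the attribute numbers in the usage note are arbitrary examples.

```python
from typing import Iterable, List


def attrs_to_bitmap(attr_numbers: Iterable[int]) -> List[int]:
    """Build the bit vector sent in a GETATTR request."""
    words: List[int] = []
    for n in attr_numbers:
        idx, bit = divmod(n, 32)
        # Grow the vector only as far as the highest attribute needs.
        while len(words) <= idx:
            words.append(0)
        words[idx] |= 1 << bit
    return words


def bitmap_to_attrs(words: List[int]) -> List[int]:
    """List the attribute numbers set in a bit vector, in ascending order."""
    return [i * 32 + b
            for i, w in enumerate(words)
            for b in range(32)
            if w & (1 << b)]
```

For example, attrs_to_bitmap([1, 4, 53]) sets bits 1 and 4 of the first word and bit 21 of the second; applying bitmap_to_attrs to the vector in a server's response shows which of the requested attributes were actually returned.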
Named attributes are accessed by the new OPENATTR operation, which Named attributes are accessed by the new OPENATTR operation, which
accesses a hidden directory of attributes associated with a file accesses a hidden directory of attributes associated with a file
system object. OPENATTR takes a filehandle for the object and system object. OPENATTR takes a filehandle for the object and
returns the filehandle for the attribute hierarchy. The filehandle returns the filehandle for the attribute hierarchy. The filehandle
for the named attributes is a directory object accessible by LOOKUP for the named attributes is a directory object accessible by LOOKUP
or READDIR and contains files whose names represent the named or READDIR and contains files whose names represent the named
skipping to change at page 95, line 37 skipping to change at page 96, line 37
Note that the hidden directory returned by OPENATTR is a convenience Note that the hidden directory returned by OPENATTR is a convenience
for protocol processing. The client should not make any assumptions for protocol processing. The client should not make any assumptions
about the server's implementation of named attributes and whether the about the server's implementation of named attributes and whether the
underlying file system at the server has a named attribute directory underlying file system at the server has a named attribute directory
or not. Therefore, operations such as SETATTR and GETATTR on the or not. Therefore, operations such as SETATTR and GETATTR on the
named attribute directory are undefined. named attribute directory are undefined.
5.1. REQUIRED Attributes 5.1. REQUIRED Attributes
These MUST be supported by every NFSv4.1 client and server in order These MUST be supported by every NFSv4.1 client and server in order
to ensure a minimum level of interoperability. The server must store to ensure a minimum level of interoperability. The server MUST store
and return these attributes and the client must be able to function and return these attributes and the client MUST be able to function
with an attribute set limited to these attributes. With just the with an attribute set limited to these attributes. With just the
REQUIRED attributes some client functionality may be impaired or REQUIRED attributes some client functionality may be impaired or
limited in some ways. A client may ask for any of these attributes limited in some ways. A client may ask for any of these attributes
to be returned by setting a bit in the GETATTR request and the server to be returned by setting a bit in the GETATTR request and the server
must return their value. must return their value.
5.2. RECOMMENDED Attributes 5.2. RECOMMENDED Attributes
These attributes are understood well enough to warrant support in the These attributes are understood well enough to warrant support in the
NFSv4.1 protocol. However, they may not be supported on all clients NFSv4.1 protocol. However, they may not be supported on all clients
and servers. A client may ask for any of these attributes to be and servers. A client may ask for any of these attributes to be
returned by setting a bit in the GETATTR request but must handle the returned by setting a bit in the GETATTR request but must handle the
case where the server does not return them. A client may ask for the case where the server does not return them. A client may ask for the
set of attributes the server supports and should not request set of attributes the server supports and SHOULD NOT request
attributes the server does not support. A server should be tolerant attributes the server does not support. A server should be tolerant
of requests for unsupported attributes and simply not return them of requests for unsupported attributes and simply not return them
rather than considering the request an error. It is expected that rather than considering the request an error. It is expected that
servers will support all attributes they comfortably can and only servers will support all attributes they comfortably can and only
fail to support attributes which are difficult to support in their fail to support attributes which are difficult to support in their
operating environments. A server should provide attributes whenever operating environments. A server should provide attributes whenever
they don't have to "tell lies" to the client. For example, a file they don't have to "tell lies" to the client. For example, a file
modification time should be either an accurate time or should not be modification time should be either an accurate time or should not be
supported by the server. This will not always be comfortable to supported by the server. This will not always be comfortable to
clients but the client is better positioned to decide whether and how to clients but the client is better positioned to decide whether and how to
skipping to change at page 97, line 5 skipping to change at page 98, line 5
of delegations (in the case of the named attribute directory these of delegations (in the case of the named attribute directory these
will be directory delegations). However, since granting of will be directory delegations). However, since granting of
delegations or not is within the server's discretion, a server need delegations or not is within the server's discretion, a server need
not support delegations on named attributes or the named attribute not support delegations on named attributes or the named attribute
directory. directory.
It is RECOMMENDED that servers support arbitrary named attributes. A It is RECOMMENDED that servers support arbitrary named attributes. A
client should not depend on the ability to store any named attributes client should not depend on the ability to store any named attributes
in the server's file system. If a server does support named in the server's file system. If a server does support named
attributes, a client which is also able to handle them should be able attributes, a client which is also able to handle them should be able
to copy a file's data and meta-data with complete transparency from to copy a file's data and metadata with complete transparency from
one location to another; this would imply that names allowed for one location to another; this would imply that names allowed for
regular directory entries are valid for named attribute names as regular directory entries are valid for named attribute names as
well. well.
In NFSv4.1, the structure of named attribute directories is In NFSv4.1, the structure of named attribute directories is
restricted in a number of ways, in order to prevent the development restricted in a number of ways, in order to prevent the development
of non-interoperable implementations in which some servers support a of non-interoperable implementations in which some servers support a
fully general hierarchical directory structure for named attributes fully general hierarchical directory structure for named attributes
while others support a limited set, but fully adequate to the while others support a limited set, but fully adequate to the
feature's goals. In such an environment, clients or applications feature's goals. In such an environment, clients or applications
might come to depend on non-portable extensions. The restrictions might come to depend on non-portable extensions. The restrictions
are: are:
o CREATE is not allowed in a named attribute directory. Thus, such o CREATE is not allowed in a named attribute directory. Thus, such
objects as symbolic links and special files are not allowed to be objects as symbolic links and special files are not allowed to be
named attributes. Further, directories may not be created in a named attributes. Further, directories may not be created in a
named attribute directory so no hierarchical structure of named named attribute directory so no hierarchical structure of named
attributes for a single object is allowed. attributes for a single object is allowed.
o OPENATTR many not be done on a named attribute directory or on a o If OPENATTR is done on a named attribute directory or on a named
named attribute. Thus, although these object have attributes, attribute, the server MUST return NFS4ERR_WRONG_TYPE.
they may not may named attributes.
o Doing a RENAME of a named attribute to a different named attribute o Doing a RENAME of a named attribute to a different named attribute
directory or to an ordinary (i.e. non-named-attribute) directory directory or to an ordinary (i.e. non-named-attribute) directory
is not allowed. is not allowed.
o Creating hard links between names attribute directories or between o Creating hard links between named attribute directories or between
named attribute directories and ordinary directories is not named attribute directories and ordinary directories is not
allowed. allowed.
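The restrictions listed above can be sketched as a set of server-side checks. This is only an illustration under stated assumptions: the Dir tuple and the function names are hypothetical, the NFS4ERR_WRONG_TYPE result follows the newer draft text for OPENATTR, and no specific error is assumed for the other cases because this passage does not name one.

```python
from collections import namedtuple

# Hypothetical model: a directory has an identifier and a flag saying
# whether it is a named attribute directory.
Dir = namedtuple("Dir", ["dir_id", "is_named_attr_dir"])


def create_allowed(parent):
    """CREATE is not allowed in a named attribute directory."""
    return not parent.is_named_attr_dir


def openattr_error(is_named_attr_dir, is_named_attr):
    """Return the error for OPENATTR on this object, or None if none applies."""
    if is_named_attr_dir or is_named_attr:
        return "NFS4ERR_WRONG_TYPE"
    return None


def rename_or_link_allowed(src_dir, dst_dir):
    """RENAME or LINK may not cross into or out of a named attribute
    directory, nor between two different named attribute directories."""
    if src_dir == dst_dir:
        return True
    return not (src_dir.is_named_attr_dir or dst_dir.is_named_attr_dir)
```

Only the restrictions from the list are modeled; ordinary permission and type checks that a real server performs are left out.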
Names of attributes will not be controlled by this document or other Names of attributes will not be controlled by this document or other
IETF standards track documents. See Section 22.1 for further IETF standards track documents. See Section 22.1 for further
discussion. discussion.
5.4. Classification of Attributes 5.4. Classification of Attributes
Each of the REQUIRED and RECOMMENDED attributes can be classified in Each of the REQUIRED and RECOMMENDED attributes can be classified in
skipping to change at page 103, line 43 skipping to change at page 104, line 43
True, if the server is able to change the times for a file system object True, if the server is able to change the times for a file system object
as specified in a SETATTR operation. as specified in a SETATTR operation.
5.7.2.3. Attribute 16: case_insensitive 5.7.2.3. Attribute 16: case_insensitive
True, if filename comparisons on this file system are case True, if filename comparisons on this file system are case
insensitive. insensitive.
5.7.2.4. Attribute 17: case_preserving 5.7.2.4. Attribute 17: case_preserving
True, if filename case on this file system are preserved. True, if file name case on this file system is preserved.
5.7.2.5. Attribute 60: change_policy 5.7.2.5. Attribute 60: change_policy
A value created by the server that the client can use to determine if A value created by the server that the client can use to determine if
some server policy related to the current file system has been some server policy related to the current file system has been
subject to change. If the value remains the same then the client can subject to change. If the value remains the same then the client can
be sure that the values of the attributes related to fs location and be sure that the values of the attributes related to fs location and
the fss_type field of the fs_status attribute have not changed. On the fss_type field of the fs_status attribute have not changed. On
the other hand, a change in this value does not necessarily imply a the other hand, a change in this value does not necessarily imply a
change in policy. It is up to the client to interrogate the server change in policy. It is up to the client to interrogate the server
skipping to change at page 105, line 49 skipping to change at page 106, line 49
lead to the client either wasting bandwidth or not receiving the best lead to the client either wasting bandwidth or not receiving the best
performance. performance.
5.7.2.22. Attribute 32: mimetype 5.7.2.22. Attribute 32: mimetype
MIME body type/subtype of this object. MIME body type/subtype of this object.
5.7.2.23. Attribute 55: mounted_on_fileid 5.7.2.23. Attribute 55: mounted_on_fileid
Like fileid, but if the target filehandle is the root of a file Like fileid, but if the target filehandle is the root of a file
system return the fileid of the underlying directory. system, this attribute represents the fileid of the underlying
directory.
UNIX-based operating environments connect a file system into the UNIX-based operating environments connect a file system into the
namespace by connecting (mounting) the file system onto the existing namespace by connecting (mounting) the file system onto the existing
file object (the mount point, usually a directory) of an existing file object (the mount point, usually a directory) of an existing
file system. When the mount point's parent directory is read via an file system. When the mount point's parent directory is read via an
API like readdir(), the return results are directory entries, each API like readdir(), the return results are directory entries, each
with a component name and a fileid. The fileid of the mount point's with a component name and a fileid. The fileid of the mount point's
directory entry will be different from the fileid that the stat() directory entry will be different from the fileid that the stat()
system call returns. The stat() system call is returning the fileid system call returns. The stat() system call is returning the fileid
of the root of the mounted file system, whereas readdir() is of the root of the mounted file system, whereas readdir() is
skipping to change at page 107, line 7 skipping to change at page 108, line 7
should obey an invariant that has it returning a value that is equal should obey an invariant that has it returning a value that is equal
to the file object's entry in the object's parent directory, i.e. to the file object's entry in the object's parent directory, i.e.
what readdir() would have returned. Some operating environments what readdir() would have returned. Some operating environments
allow a series of two or more file systems to be mounted onto a allow a series of two or more file systems to be mounted onto a
single mount point. In this case, for the server to obey the single mount point. In this case, for the server to obey the
aforementioned invariant, it will need to find the base mount point, aforementioned invariant, it will need to find the base mount point,
and not the intermediate mount points. and not the intermediate mount points.
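The search for the base mount point can be sketched as a walk down a chain of stacked mounts. This is a hypothetical model, not a description of any particular server: objects are (fsid, fileid) pairs, and covered_by_root maps the root of each mounted file system to the object it was mounted on.

```python
def mounted_on_fileid(obj, covered_by_root):
    """Return the fileid that the parent directory's entry would show.

    obj is a hypothetical (fsid, fileid) pair. covered_by_root maps the
    root object of a mounted file system to the object it covers. For an
    object that is not a file system root, this is simply its own fileid.
    """
    seen = set()
    # Follow intermediate mount points until the base mount point is reached.
    while obj in covered_by_root:
        if obj in seen:
            raise ValueError("cycle in mount table")
        seen.add(obj)
        obj = covered_by_root[obj]
    return obj[1]
```

With two file systems stacked on one mount point, the walk passes through the intermediate root and ends at the directory in the underlying file system, which is the value the invariant calls for.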
5.7.2.24. Attribute 34: no_trunc 5.7.2.24. Attribute 34: no_trunc
True, if a name longer than name_max is used, an error be returned If this attribute is TRUE, then if the client uses a file name longer
and name is not truncated. than name_max, an error will be returned instead of the name being
truncated.
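The effect of no_trunc on an over-long name can be sketched as below. This is a simplified illustration: the function name is hypothetical, lengths are counted with Python's len() on the string, and a plain ValueError stands in for whatever error the server actually returns, since this passage does not name it.

```python
def effective_name(name, name_max, no_trunc):
    """Return the name the file system would store, or raise if rejected.

    Assumption: name_max is compared against len(name); a real server
    may measure the limit differently.
    """
    if len(name) <= name_max:
        return name
    if no_trunc:
        # no_trunc is TRUE: the request fails instead of being truncated.
        raise ValueError("name longer than name_max")
    # no_trunc is FALSE: the name is silently shortened.
    return name[:name_max]
```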
5.7.2.25. Attribute 35: numlinks 5.7.2.25. Attribute 35: numlinks
Number of hard links to this object. Number of hard links to this object.
5.7.2.26. Attribute 36: owner 5.7.2.26. Attribute 36: owner
The string name of the owner of this object. The string name of the owner of this object.
5.7.2.27. Attribute 37: owner_group 5.7.2.27. Attribute 37: owner_group
The string name of the group ownership of this object. The string name of the group ownership of this object.
5.7.2.28. Attribute 38: quota_avail_hard 5.7.2.28. Attribute 38: quota_avail_hard
The value in bytes which represent the amount of additional disk The value in bytes which represents the amount of additional disk
space beyond the current allocation that can be allocated to this space beyond the current allocation that can be allocated to this
file or directory before further allocations will be refused. It is file or directory before further allocations will be refused. It is
understood that this space may be consumed by allocations to other understood that this space may be consumed by allocations to other
files or directories. files or directories.
5.7.2.29. Attribute 39: quota_avail_soft 5.7.2.29. Attribute 39: quota_avail_soft
The value in bytes which represents the amount of additional disk The value in bytes which represents the amount of additional disk
space that can be allocated to this file or directory before the user space that can be allocated to this file or directory before the user
may reasonably be warned. It is understood that this space may be may reasonably be warned. It is understood that this space may be
skipping to change at page 108, line 9 skipping to change at page 109, line 11
files or directories for which a quota_used value is maintained. files or directories for which a quota_used value is maintained.
E.g. "all files with a given owner", "all files with a given group E.g. "all files with a given owner", "all files with a given group
owner". etc. owner". etc.
The server is at liberty to choose any of those sets but should do so The server is at liberty to choose any of those sets but should do so
in a repeatable way. The rule may be configured per file system or in a repeatable way. The rule may be configured per file system or
may be "choose the set with the smallest quota". may be "choose the set with the smallest quota".
5.7.2.31. Attribute 41: rawdev 5.7.2.31. Attribute 41: rawdev
Raw device identifier. UNIX device major/minor node information. If Raw device identifier; the UNIX device major/minor node information.
the value of type is not NF4BLK or NF4CHR, the value return SHOULD If the value of type is not NF4BLK or NF4CHR, the value returned
NOT be considered useful. SHOULD NOT be considered useful.
5.7.2.32. Attribute 42: space_avail 5.7.2.32. Attribute 42: space_avail
Disk space in bytes available to this user on the file system Disk space in bytes available to this user on the file system
containing this object - this should be the smallest relevant limit. containing this object - this should be the smallest relevant limit.
5.7.2.33. Attribute 43: space_free 5.7.2.33. Attribute 43: space_free
Free disk space in bytes on the file system containing this object - Free disk space in bytes on the file system containing this object -
this should be the smallest relevant limit. this should be the smallest relevant limit.
skipping to change at page 108, line 33 skipping to change at page 109, line 35
5.7.2.34. Attribute 44: space_total 5.7.2.34. Attribute 44: space_total
Total disk space in bytes on the file system containing this object. Total disk space in bytes on the file system containing this object.
5.7.2.35. Attribute 45: space_used 5.7.2.35. Attribute 45: space_used
Number of file system bytes allocated to this object. Number of file system bytes allocated to this object.
5.7.2.36. Attribute 46: system 5.7.2.36. Attribute 46: system
True, if this file is a "system" file with respect to the Windows This attribute is TRUE if this file is a "system" file with respect
API. to the Windows operating environment.
5.7.2.37. Attribute 47: time_access 5.7.2.37. Attribute 47: time_access
The time_access attribute represents the time of last access to the The time_access attribute represents the time of last access to the
object by a read that was satisfied by the server. The notion of object by a read that was satisfied by the server. The notion of
what is an "access" depends on the server's operating environment and/or what is an "access" depends on the server's operating environment and/or
the server's file system semantics. For example, for servers obeying the server's file system semantics. For example, for servers obeying
POSIX semantics, time_access would be updated only by the READLINK, POSIX semantics, time_access would be updated only by the READLINK,
READ, and READDIR operations and not any of the operations that READ, and READDIR operations and not any of the operations that
modify the content of the object. Of course, setting the modify the content of the object. Of course, setting the
skipping to change at page 109, line 29 skipping to change at page 110, line 30
The time of creation of the object. This attribute does not have any The time of creation of the object. This attribute does not have any
relation to the traditional UNIX file attribute "ctime" or "change relation to the traditional UNIX file attribute "ctime" or "change
time". time".
5.7.2.41. Attribute 51: time_delta 5.7.2.41. Attribute 51: time_delta
Smallest useful server time granularity. Smallest useful server time granularity.
5.7.2.42. Attribute 52: time_metadata 5.7.2.42. Attribute 52: time_metadata
The time of last meta-data modification of the object. The time of last metadata modification of the object.
5.7.2.43. Attribute 53: time_modify 5.7.2.43. Attribute 53: time_modify
The time of last modification to the object. The time of last modification to the object.
5.7.2.44. Attribute 54: time_modify_set 5.7.2.44. Attribute 54: time_modify_set
Set the time of last modification to the object. SETATTR use only. Set the time of last modification to the object. SETATTR use only.
5.8. Interpreting owner and owner_group 5.8. Interpreting owner and owner_group
skipping to change at page 110, line 31 skipping to change at page 111, line 32
service may also be used to accomplish the translation. A server may service may also be used to accomplish the translation. A server may
provide a more general service, not limited by any particular provide a more general service, not limited by any particular
translation (which would only translate a limited set of possible translation (which would only translate a limited set of possible
strings) by storing the owner and owner_group attributes in local strings) by storing the owner and owner_group attributes in local
storage without any translation or it may augment a translation storage without any translation or it may augment a translation
method by storing the entire string for attributes for which no method by storing the entire string for attributes for which no
translation is available while using the local representation for translation is available while using the local representation for
those cases in which a translation is available. those cases in which a translation is available.
Servers that do not provide support for all possible values of the Servers that do not provide support for all possible values of the
owner and owner_group attributes, should return an error owner and owner_group attributes, SHOULD return an error
(NFS4ERR_BADOWNER) when a string is presented that has no (NFS4ERR_BADOWNER) when a string is presented that has no
translation, as the value to be set for a SETATTR of the owner, translation, as the value to be set for a SETATTR of the owner,
owner_group, or acl attributes. When a server does accept an owner owner_group, or acl attributes. When a server does accept an owner
or owner_group value as valid on a SETATTR (and similarly for the or owner_group value as valid on a SETATTR (and similarly for the
owner and group strings in an acl), it is promising to return that owner and group strings in an acl), it is promising to return that
same string when a corresponding GETATTR is done. Configuration same string when a corresponding GETATTR is done. Configuration
changes and ill-constructed name translations (those that contain changes (including changes from the mapping of the string to the
aliasing) may make that promise impossible to honor. Servers should local representation) and ill-constructed name translations (those
make appropriate efforts to avoid a situation in which these that contain aliasing) may make that promise impossible to honor.
attributes have their values changed when no real change to ownership Servers should make appropriate efforts to avoid a situation in which
has occurred. these attributes have their values changed when no real change to
ownership has occurred.
The "dns_domain" portion of the owner string is meant to be a DNS The "dns_domain" portion of the owner string is meant to be a DNS
domain name. For example, user@ietf.org. Servers should accept as domain name. For example, user@ietf.org. Servers should accept as
valid a set of users for at least one domain. A server may treat valid a set of users for at least one domain. A server may treat
other domains as having no valid translations. A more general other domains as having no valid translations. A more general
service is provided when a server is capable of accepting users for service is provided when a server is capable of accepting users for
multiple domains, or for all domains, subject to security multiple domains, or for all domains, subject to security
constraints. constraints.
In the case where there is no translation available to the client or In the case where there is no translation available to the client or
server, the attribute value must be constructed without the "@". server, the attribute value must be constructed without the "@".
Therefore, the absence of the @ from the owner or owner_group Therefore, the absence of the @ from the owner or owner_group
attribute signifies that no translation was available at the sender attribute signifies that no translation was available at the sender
and that the receiver of the attribute should not use that string as and that the receiver of the attribute should not use that string as
a basis for translation into its own internal format. Even though a basis for translation into its own internal format. Even though
the attribute value can not be translated, it may still be useful. the attribute value can not be translated, it may still be useful.
In the case of a client, the attribute string may be used for local In the case of a client, the attribute string may be used for local
display of ownership. display of ownership.
To provide a greater degree of compatibility with NFSv3, which To provide a greater degree of compatibility with NFSv3, which
identified users and groups by 32-bit unsigned uid's and gid's, owner identified users and groups by 32-bit unsigned user identifiers and
and group strings that consist of decimal numeric values with no group identifiers, owner and group strings that consist of decimal
leading zeros can be given a special interpretation by clients and numeric values with no leading zeros can be given a special
servers which choose to provide such support. The receiver may treat interpretation by clients and servers which choose to provide such
such a user or group string as representing the same user as would be support. The receiver may treat such a user or group string as
represented by an NFSv3 uid or gid having the corresponding numeric representing the same user as would be represented by an NFSv3 uid or
value. A server is not obligated to accept such a string, but may gid having the corresponding numeric value. A server is not
return an NFS4ERR_BADOWNER instead. To avoid this mechanism being obligated to accept such a string, but may return an NFS4ERR_BADOWNER
used to subvert user and group translation, so that a client might instead. To avoid this mechanism being used to subvert user and
pass all of the owners and groups in numeric form, a server SHOULD group translation, so that a client might pass all of the owners and
return an NFS4ERR_BADOWNER error when there is a valid translation groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER
for the user or owner designated in this way. In that case, the error when there is a valid translation for the user or owner
client must use the appropriate name@domain string and not the designated in this way. In that case, the client must use the
special form for compatibility. appropriate name@domain string and not the special form for
compatibility.
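The three forms of owner and owner_group strings described above can be told apart with a small classifier. This is a sketch under stated assumptions: the function names and the returned tags are hypothetical, the string is split at the last "@", "0" is treated as a value with no leading zeros, and numeric values are limited to 32 bits to match NFSv3 identifiers.

```python
import re

# Decimal digits with no leading zeros (ASCII digits only).
_NUMERIC = re.compile(r"0|[1-9][0-9]*")


def classify_owner(value):
    """Classify an owner or owner_group string as seen by the receiver."""
    if "@" in value:
        name, _, domain = value.rpartition("@")
        return ("name_at_domain", name, domain)
    if _NUMERIC.fullmatch(value) and int(value) <= 0xFFFFFFFF:
        # Special form for NFSv3 compatibility: a 32-bit uid or gid.
        return ("numeric_id", int(value))
    # No "@": the sender had no translation; usable for display only.
    return ("untranslated", value)


def server_check_numeric_owner(uid, uid_to_name):
    """Reject the numeric form when a valid translation exists.

    uid_to_name is a hypothetical mapping from uid to a name@domain string.
    """
    if uid in uid_to_name:
        return "NFS4ERR_BADOWNER"
    return None
```

A server that chooses not to support the numeric form at all could return NFS4ERR_BADOWNER for every "numeric_id" result, which the text above also permits.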
The owner string "nobody" may be used to designate an anonymous user, The owner string "nobody" may be used to designate an anonymous user,
which will be associated with a file created by a security principal which will be associated with a file created by a security principal
that cannot be mapped through normal means to the owner attribute. that cannot be mapped through normal means to the owner attribute.
5.9. Character Case Attributes 5.9. Character Case Attributes
With respect to the case_insensitive and case_preserving attributes, With respect to the case_insensitive and case_preserving attributes,
each UCS-4 character (which UTF-8 encodes) has a "long descriptive each UCS-4 character (which UTF-8 encodes) has a "long descriptive
name" RFC1345 [35] which may or may not included the word "CAPITAL" name" RFC1345 [35] which may or may not include the word "CAPITAL" or
or "SMALL". The presence of SMALL or CAPITAL allows an NFS server to "SMALL". The presence of SMALL or CAPITAL allows an NFS server to
implement unambiguous and efficient table driven mappings for case implement unambiguous and efficient table driven mappings for case
insensitive comparisons, and non-case-preserving storage. For insensitive comparisons, and non-case-preserving storage. For
general character handling and internationalization issues, see general character handling and internationalization issues, see
Section 14. Section 14.
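A name-driven case mapping of the kind described above can be sketched with Python's unicodedata module. Note the assumption: the Unicode character names exposed by that module are used here as a stand-in for the RFC 1345 long descriptive names, and the function names are hypothetical.

```python
import unicodedata


def fold_by_name(ch):
    """Map a CAPITAL character to its SMALL counterpart via its name."""
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return ch  # character has no name; leave it alone
    if "CAPITAL" not in name.split():
        return ch
    small_name = name.replace("CAPITAL", "SMALL", 1)
    try:
        return unicodedata.lookup(small_name)
    except KeyError:
        return ch  # no character carries the derived name


def names_equal_case_insensitive(a, b):
    """Compare two file names after folding every character."""
    return [fold_by_name(c) for c in a] == [fold_by_name(c) for c in b]
```

A production server would precompute the mapping into a table rather than consult the name database for every character; the lookup here only shows where such a table could come from.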
5.10. Directory Notification Attributes 5.10. Directory Notification Attributes
As described in Section 18.39, the client can request a minimum delay As described in Section 18.39, the client can request a minimum delay
for notifications of changes to attributes, but the server is free to for notifications of changes to attributes, but the server is free to
ignore what the client requests. The client can determine in advance ignore what the client requests. The client can determine in advance
skipping to change at page 112, line 24 skipping to change at page 113, line 27
5.10.2. Attribute 57: dirent_notif_delay 5.10.2. Attribute 57: dirent_notif_delay
The dirent_notif_delay attribute is the minimum number of seconds the The dirent_notif_delay attribute is the minimum number of seconds the
server will delay before notifying the client of a change to a file server will delay before notifying the client of a change to a file
object that has an entry in the directory. object that has an entry in the directory.
5.11. pNFS Attribute Definitions 5.11. pNFS Attribute Definitions
5.11.1. Attribute 62: fs_layout_type 5.11.1. Attribute 62: fs_layout_type
The fs_layout_type attribute (data type layouttype4 (Section 3.3.13)) The fs_layout_type attribute (see Section 3.3.13) applies to a file
applies to a file system and indicates what layout types are system and indicates what layout types are supported by the file
supported by the file system. When the client encounters a new fsid, system. When the client encounters a new fsid, the client SHOULD
the client should obtain the value for the fs_layout_type attribute obtain the value for the fs_layout_type attribute associated with the
associated with the new file system. This attribute is used by the new file system. This attribute is used by the client to determine
client to determine if the layout types supported by the server match if the layout types supported by the server match any of the client's
any of the client's supported layout types. supported layout types.
5.11.2. Attribute 66: layout_alignment 5.11.2. Attribute 66: layout_alignment
When a client has layouts for a file system, the layout_alignment When a client has layouts for a file system, the layout_alignment
attribute indicates the preferred alignment for I/O to files on that attribute indicates the preferred alignment for I/O to files on that
file system. Where possible, the client should send READ and WRITE file system. Where possible, the client should send READ and WRITE
operations with offsets that are whole multiples of the operations with offsets that are whole multiples of the
layout_alignment attribute. layout_alignment attribute.
5.11.3. Attribute 65: layout_blksize 5.11.3. Attribute 65: layout_blksize
When a client has layouts for a file system, the layout_blksize When a client has layouts for a file system, the layout_blksize
attribute indicates the preferred block size for I/O to files on that attribute indicates the preferred block size for I/O to files on that
file system. Where possible, the client should send READ operations file system. Where possible, the client should send READ operations
with a count argument that is a whole multiple of layout_blksize, and with a count argument that is a whole multiple of layout_blksize, and
WRITE operations with a data argument of size that is a whole WRITE operations with a data argument of size that is a whole
multiple of layout_blksize. multiple of layout_blksize.
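The alignment preferences for layout_alignment and layout_blksize can be applied to a READ as sketched below. This is an illustration only: the function name is hypothetical, a value of 0 is treated as "no preference", and the request is widened rather than split. Widening a WRITE in the same way would require the client to already hold the surrounding data, so the sketch is framed around READ.

```python
def align_read(offset, count, layout_alignment, layout_blksize):
    """Widen a READ so its offset is a whole multiple of layout_alignment
    and its count is a whole multiple of layout_blksize.

    Returns (aligned_offset, aligned_count) covering the original range.
    """
    alignment = layout_alignment or 1  # 0 treated as no preference
    blksize = layout_blksize or 1
    start = offset - (offset % alignment)
    length = (offset + count) - start
    # Round the length up to a whole number of blocks.
    blocks = -(-length // blksize)
    return start, blocks * blksize
```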
5.11.4. Attribute 63: layout_hint 5.11.4. Attribute 63: layout_hint
The layout_hint attribute (data type layouthint4 (Section 3.3.19)) The layout_hint attribute (see Section 3.3.19) may be set on newly
may be set on newly created files to influence the metadata server's created files to influence the metadata server's choice for the
choice for the file's layout. If possible, this attribute is one of file's layout. If possible, this attribute is one of those set in
those set in the initial attributes within the OPEN operation. The the initial attributes within the OPEN operation. The metadata
metadata server may choose to ignore this attribute. The layout_hint server may choose to ignore this attribute. The layout_hint
attribute is a sub-set of the layout structure returned by LAYOUTGET. attribute is a sub-set of the layout structure returned by LAYOUTGET.
For example, instead of specifying particular devices, this would be For example, instead of specifying particular devices, this would be
used to suggest the stripe width of a file. The server used to suggest the stripe width of a file. The server
implementation determines which fields within the layout will be implementation determines which fields within the layout will be
used. used.
5.11.5. Attribute 64: layout_type 5.11.5. Attribute 64: layout_type
This attribute lists the layout type(s) available for a file. The This attribute lists the layout type(s) available for a file. The
value returned by the server is for informational purposes only. The value returned by the server is for informational purposes only. The
skipping to change at page 113, line 33 skipping to change at page 114, line 33
needed in order to perform I/O. For example, the specific device needed in order to perform I/O. For example, the specific device
information for the file and its layout. information for the file and its layout.
5.11.6. Attribute 68: mdsthreshold 5.11.6. Attribute 68: mdsthreshold
This attribute is a server provided hint used to communicate to the This attribute is a server provided hint used to communicate to the
client when it is more efficient to send READ and WRITE operations to client when it is more efficient to send READ and WRITE operations to
the metadata server or the data server. The two types of thresholds the metadata server or the data server. The two types of thresholds
described are file size thresholds and I/O size thresholds. If a described are file size thresholds and I/O size thresholds. If a
file's size is smaller than the file size threshold, data accesses file's size is smaller than the file size threshold, data accesses
should be sent to the metadata server. If an I/O is below the I/O SHOULD be sent to the metadata server. If an I/O request has a
size threshold, the I/O should be sent to the metadata server. As length that is below the I/O size threshold, the I/O SHOULD be sent
defined, each threshold type is specified separately for READ and to the metadata server. Each threshold type is specified separately
WRITE. for READ and WRITE.
The server may provide both types of thresholds for a file. If both The server MAY provide both types of thresholds for a file. If both
file size and I/O size are provided, the client should exceed both file size and I/O size are provided, the client SHOULD reach or
thresholds before issuing its READ or WRITE requests to the data exceed both thresholds before issuing its READ or WRITE requests to
server. Alternatively, if only one of the specified thresholds is the data server. Alternatively, if only one of the specified
exceeded, the I/O requests are sent to the metadata server. thresholds is reached or exceeded, the I/O requests are sent to the
metadata server.
For each threshold type, a value of 0 indicates no READ or WRITE For each threshold type, a value of 0 indicates no READ or WRITE
should be sent to the metadata server, while a value of all 1s should be sent to the metadata server, while a value of all 1s
indicates all READS or WRITES should be sent to the metadata server. indicates all READS or WRITES should be sent to the metadata server.
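The threshold rules above can be read as a small decision procedure. The Python sketch below is illustrative only: the function name choose_target, the use of None for a threshold the server did not supply, the 64-bit width assumed for the "all 1s" value, and the choice to default to the data server when no hint is present are all assumptions of this sketch, not protocol definitions.

```python
from typing import Optional

ALL_ONES = 0xFFFFFFFFFFFFFFFF  # "all 1s", assuming a 64-bit threshold value


def choose_target(file_size: int, io_len: int,
                  size_threshold: Optional[int],
                  io_threshold: Optional[int]) -> str:
    """Return "mds" or "ds" for a single READ or WRITE.

    A threshold of None means the server did not supply that hint
    (an assumption of this sketch).
    """
    for value, threshold in ((file_size, size_threshold),
                             (io_len, io_threshold)):
        if threshold is None:
            continue
        # All 1s forces the metadata server; otherwise any value below
        # the threshold also goes to the metadata server.
        if threshold == ALL_ONES or value < threshold:
            return "mds"
    # Every supplied threshold was reached or exceeded.
    return "ds"
```

A threshold of 0 can never be undercut, so it never routes I/O to the metadata server, which matches the "value of 0" rule. The same function would be called with the READ thresholds for a READ and the WRITE thresholds for a WRITE.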
The attribute is available on a per filehandle basis. If the current The attribute is available on a per filehandle basis. If the current
filehandle refers to a non-pNFS file or directory, the metadata filehandle refers to a non-pNFS file or directory, the metadata
server should return an attribute that is representative of the server should return an attribute that is representative of the
filehandle's file system. It is suggested that this attribute is filehandle's file system. It is suggested that this attribute is
queried as part of the OPEN operation. Due to dynamic system queried as part of the OPEN operation. Due to dynamic system
skipping to change at page 114, line 24 skipping to change at page 115, line 25
reached. reached.
When retention is enabled, retention MUST extend to the data of the When retention is enabled, retention MUST extend to the data of the
file, and the name of the file. The server MAY extend retention to any file, and the name of the file. The server MAY extend retention to any
other property of the file, including any subset of REQUIRED, other property of the file, including any subset of REQUIRED,
RECOMMENDED, and named attributes, with the exceptions noted in this RECOMMENDED, and named attributes, with the exceptions noted in this
section. section.
Servers MAY support or not support retention on any file object type. Servers MAY support or not support retention on any file object type.
The five retention attributes are as follows: The five retention attributes are explained in the next subsections.
5.12.1. Attribute 69: retention_get 5.12.1. Attribute 69: retention_get
If retention is enabled for the associated file, this attribute's If retention is enabled for the associated file, this attribute's
value represents the retention begin time of the file object. This value represents the retention begin time of the file object. This
attribute's value is only readable with the GETATTR operation and may attribute's value is only readable with the GETATTR operation and may
not be modified by the SETATTR operation. The value of the attribute not be modified by the SETATTR operation. The value of the attribute
consists of: consists of:
const RET4_DURATION_INFINITE = 0xffffffffffffffff; const RET4_DURATION_INFINITE = 0xffffffffffffffff;
skipping to change at page 115, line 43 skipping to change at page 116, line 48
5.12.4. Attribute 72: retentevt_set 5.12.4. Attribute 72: retentevt_set
Set the event-based retention duration, and optionally enable event- Set the event-based retention duration, and optionally enable event-
based retention on the file object. This attribute corresponds to based retention on the file object. This attribute corresponds to
retentevt_get, is like retention_set, but refers to event-based retentevt_get, is like retention_set, but refers to event-based
retention. When event based retention is set, the file MUST be retention. When event based retention is set, the file MUST be
retained even if non-event-based retention has been set, and the retained even if non-event-based retention has been set, and the
duration of non-event-based retention has been reached. Conversely, duration of non-event-based retention has been reached. Conversely,
when non-event-based retention has been set, the file MUST be when non-event-based retention has been set, the file MUST be
retained even the event-based retention has been set, and the retained even if event-based retention has been set, and the duration
duration of event-based retention has been reached. The server MAY of event-based retention has been reached. The server MAY restrict
restrict the enabling of event-based retention or the duration of the enabling of event-based retention or the duration of event-based
event-based retention on the basis of the ACE4_WRITE_RETENTION ACL retention on the basis of the ACE4_WRITE_RETENTION ACL permission.
permission. The enabling of event-based retention does not prevent The enabling of event-based retention does not prevent the enabling
the enabling of non-event-based retention nor the modification of the of non-event-based retention nor the modification of the
retention_hold attribute. retention_hold attribute.
5.12.5. Attribute 73: retention_hold 5.12.5. Attribute 73: retention_hold
Get or set administrative retention holds, one hold per bit position. Get or set administrative retention holds, one hold per bit position.
This attribute allows one to 64 administrative holds, one hold per This attribute allows one to 64 administrative holds, one hold per
bit on the attribute. If retention_hold is not zero, then the file bit on the attribute. If retention_hold is not zero, then the file
MUST NOT be deleted, renamed, or modified, even if the duration on MUST NOT be deleted, renamed, or modified, even if the duration on
enabled event or non-event-based retention has been reached. The enabled event or non-event-based retention has been reached. The
server MAY restrict the modification of retention_hold on the basis server MAY restrict the modification of retention_hold on the basis
of the ACE4_WRITE_RETENTION_HOLD ACL permission. The enabling of of the ACE4_WRITE_RETENTION_HOLD ACL permission. The enabling of
administrative retention holds does not prevent the enabling of administrative retention holds does not prevent the enabling of
event-based or non-event-based retention. event-based or non-event-based retention.
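The one-hold-per-bit model of retention_hold can be sketched as plain bit manipulation. This Python fragment is a hedged illustration: the helper names set_hold, clear_hold, and may_modify are hypothetical, and the retention_elapsed flag stands in for "no enabled event-based or non-event-based retention is still in force", which the sketch does not compute.

```python
HOLD_BITS = 64  # one administrative hold per bit of a 64-bit value


def _check(bit: int) -> None:
    if not 0 <= bit < HOLD_BITS:
        raise ValueError("hold bit out of range")


def set_hold(retention_hold: int, bit: int) -> int:
    """Return the attribute value with one administrative hold set."""
    _check(bit)
    return retention_hold | (1 << bit)


def clear_hold(retention_hold: int, bit: int) -> int:
    """Return the attribute value with one administrative hold cleared."""
    _check(bit)
    return retention_hold & ~(1 << bit) & ((1 << HOLD_BITS) - 1)


def may_modify(retention_hold: int, retention_elapsed: bool) -> bool:
    """A non-zero hold blocks delete, rename, and modify even when the
    retention duration has already been reached."""
    return retention_hold == 0 and retention_elapsed
```

Because each hold is an independent bit, separate administrators could each own one bit position without coordinating on a shared counter.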
6. Security Related Attributes 6. Access Control Attributes
Access Control Lists (ACLs) are file attributes that specify fine Access Control Lists (ACLs) are file attributes that specify fine
grained access control. This chapter covers the "acl", "dacl", grained access control. This chapter covers the "acl", "dacl",
"sacl", "aclsupport", "mode", "mode_set_masked" file attributes, and "sacl", "aclsupport", "mode", "mode_set_masked" file attributes, and
their interactions. Note that file attributes may apply to any file their interactions. Note that file attributes may apply to any file
system objects. system objects.
6.1. Goals 6.1. Goals
ACLs and modes represent two well established models for specifying ACLs and modes represent two well established models for specifying
skipping to change at page 119, line 10 skipping to change at page 120, line 10
The situation is complicated by the fact that a server may have The situation is complicated by the fact that a server may have
multiple modules that enforce ACLs. For example, the enforcement for multiple modules that enforce ACLs. For example, the enforcement for
NFSv4.1 access may be different from, but not weaker than, the NFSv4.1 access may be different from, but not weaker than, the
enforcement for local access, and both may be different from the enforcement for local access, and both may be different from the
enforcement for access through other protocols such as SMB. So it enforcement for access through other protocols such as SMB. So it
may be useful for a server to accept an ACL even if not all of its may be useful for a server to accept an ACL even if not all of its
modules are able to support it. modules are able to support it.
The guiding principle with regard to NFSv4 access is that the server The guiding principle with regard to NFSv4 access is that the server
must not accept ACLs that appear to make the file more secure than it must not accept ACLs that appear to make access to the file more
really is. restrictive than it really is.
6.2.1.1. ACE Type 6.2.1.1. ACE Type
The constants used for the type field (acetype4) are as follows: The constants used for the type field (acetype4) are as follows:
const ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000; const ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000;
const ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001; const ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001;
const ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002; const ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002;
const ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003; const ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003;
skipping to change at page 142, line 39 skipping to change at page 143, line 39
This chapter describes the NFSv4 single-server namespace. Single- This chapter describes the NFSv4 single-server namespace. Single-
server namespaces may be presented directly to clients, or they may server namespaces may be presented directly to clients, or they may
be used as a basis to form larger multi-server namespaces (e.g. site- be used as a basis to form larger multi-server namespaces (e.g. site-
wide or organization-wide) to be presented to clients, as described wide or organization-wide) to be presented to clients, as described
in Section 11. in Section 11.
7.1. Server Exports 7.1. Server Exports
On a UNIX server, the namespace describes all the files reachable by On a UNIX server, the namespace describes all the files reachable by
pathnames under the root directory or "/". On a Windows NT server pathnames under the root directory or "/". On a Windows server the
the namespace constitutes all the files on disks named by mapped disk namespace constitutes all the files on disks named by mapped disk
letters. NFS server administrators rarely make the entire server's letters. NFS server administrators rarely make the entire server's
file system namespace available to NFS clients. More often portions file system namespace available to NFS clients. More often portions
of the namespace are made available via an "export" feature. In of the namespace are made available via an "export" feature. In
previous versions of the NFS protocol, the root filehandle for each previous versions of the NFS protocol, the root filehandle for each
export is obtained through the MOUNT protocol; the client sent a export is obtained through the MOUNT protocol; the client sent a
string that identified the export name within the namespace and the string that identified the export name within the namespace and the
server returned the root filehandle for that export. The MOUNT server returned the root filehandle for that export. The MOUNT
protocol also provided an EXPORTS procedure that enumerated the server's protocol also provided an EXPORTS procedure that enumerated the server's
exports. exports.
skipping to change at page 146, line 6 skipping to change at page 147, line 6
a particular file system, as opposed to all of the data within it, a particular file system, as opposed to all of the data within it,
the server can apply the security policy of a shared resource in the the server can apply the security policy of a shared resource in the
server's namespace to components of the resource's ancestors. For server's namespace to components of the resource's ancestors. For
example: example:
/ (place holder/not exported) / (place holder/not exported)
/a/b (file system 1) /a/b (file system 1)
/a/b/MySecretProject (file system 2) /a/b/MySecretProject (file system 2)
The /a/b/MySecretProject directory is a real file system and is the The /a/b/MySecretProject directory is a real file system and is the
shared resource. Suppose the security policy for /a/b/ shared resource. Suppose the security policy for /a/b/
MySecretProject is Kerberos with integrity and it desired to prevent MySecretProject is Kerberos with integrity and it is desired that
knowledge of the existence of this file system to be very limited. knowledge of the existence of this file system be very limited.
In this case the server should apply the same security policy to In this case the server should apply the same security policy to
/a/b. This allows for knowledge of the existence of a file system to be /a/b. This allows for knowledge of the existence of a file system to be
secured in cases where this is desirable. secured in cases where this is desirable.
For the case of the use of multiple, disjoint security mechanisms in For the case of the use of multiple, disjoint security mechanisms in
the server's resources, applying that sort of policy would result in the server's resources, applying that sort of policy would result in
the higher-level file system not being accessible using any security the higher-level file system not being accessible using any security
flavor, which would make that higher-level file system flavor, which would make that higher-level file system
inaccessible. Therefore, that sort of configuration is not inaccessible. Therefore, that sort of configuration is not
skipping to change at page 148, line 19 skipping to change at page 149, line 19
associated state-owner or state-owners (in the case of an open-owner/ associated state-owner or state-owners (in the case of an open-owner/
lock-owner pair) and the associated filehandle. When stateids are lock-owner pair) and the associated filehandle. When stateids are
used, the current filehandle must be the one associated with that used, the current filehandle must be the one associated with that
stateid. stateid.
All stateids associated with a given clientid are associated with a All stateids associated with a given clientid are associated with a
common lease which represents the claim of those stateids and the common lease which represents the claim of those stateids and the
objects they represent to be maintained by the server. See objects they represent to be maintained by the server. See
Section 8.3 for a discussion of leases. Section 8.3 for a discussion of leases.
The server may assign stateids independently for different clients The server may assign stateids independently for different clients.
and a stateid with the same bit pattern for one client may designate A stateid with the same bit pattern for one client may designate an
an entirely different set of locks for a different client. The entirely different set of locks for a different client. The stateid
stateid is always interpreted with respect to the client ID is always interpreted with respect to the client ID associated with
associated with the current session. Stateids apply to all sessions the current session. Stateids apply to all sessions associated with
associated with the given client ID and the client may use a stateid the given client ID and the client may use a stateid obtained from
obtained from one session on another session associated with the same one session on another session associated with the same client ID.
client ID.
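One way to picture the scoping rule above is a server-side table keyed by the pair (client ID, stateid) rather than by stateid alone. The Python sketch below is an assumption-laden illustration: StateTable, lookup_state, and the session-to-client mapping are hypothetical names, and real servers are free to organize this state differently.

```python
from typing import Dict, Tuple

# Hypothetical server-side table: (client ID, stateid bytes) -> lock state.
StateTable = Dict[Tuple[int, bytes], str]


def lookup_state(table: StateTable, session_client: Dict[str, int],
                 session_id: str, stateid: bytes) -> str:
    """Resolve a stateid relative to the client ID that owns the session."""
    client_id = session_client[session_id]
    return table[(client_id, stateid)]
```

With this layout the same stateid bit pattern can name different locks for different clients, while two sessions that belong to one client ID resolve it to the same state.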
8.2.1. Stateid Types 8.2.1. Stateid Types
With the exception of special stateids, to be discussed later, each With the exception of special stateids, to be discussed later, each
stateid represents locking objects of one of a set of types defined stateid represents locking objects of one of a set of types defined
by the NFSv4.1 protocol. Note that in all these cases, where we by the NFSv4.1 protocol. Note that in all these cases, where we
speak of guarantee, there is always an implied codicil that any speak of guarantee, it is understood there are situations such as a
situation such as a client reboot, or lock revocation, allows the client restart, or lock revocation, that allow the guarantee to be
guarantee to be voided. voided.
o Stateids may represent opens of files. o Stateids may represent opens of files.
Each stateid in this case represents the open for a given Each stateid in this case represents the open for a given
clientid/open-owner/filehandle triple. Such stateids are subject clientid/open-owner/filehandle triple. Such stateids are subject
to change (with consequent bumping of the seqid) in response to to change (with consequent bumping of the seqid) in response to
OPENs that result in upgrade and OPEN_DOWNGRADE operations. OPENs that result in upgrade and OPEN_DOWNGRADE operations.
o Stateids may represent sets of byte-range locks. o Stateids may represent sets of byte-range locks.
All locks held on a particular file by a particular owner and all All locks held on a particular file by a particular owner and all
gotten under the aegis of a particular open file are associated gotten under the aegis of a particular open file are associated
with a single stateid with the seqid being bumped as LOCK and with a single stateid with the seqid being bumped as LOCK and
LOCKU operations affect that set of locks. LOCKU operations affect that set of locks.
skipping to change at page 152, line 7 skipping to change at page 152, line 49
client IDs and filehandles. In the case of a special stateid client IDs and filehandles. In the case of a special stateid
designating the current stateid, the current stateid value designating the current stateid, the current stateid value
substituted for the special stateid is associated with a particular substituted for the special stateid is associated with a particular
client ID and filehandle, and so, if it is used where current client ID and filehandle, and so, if it is used where current
filehandle does not match that associated with the current stateid, filehandle does not match that associated with the current stateid,
the operation to which the stateid is passed will return the operation to which the stateid is passed will return
NFS4ERR_BAD_STATEID. NFS4ERR_BAD_STATEID.
8.2.4. Stateid Lifetime and Validation 8.2.4. Stateid Lifetime and Validation
Stateids must remain valid until either a client reboot or a server Stateids must remain valid until either a client restart or a server
reboot or until the client returns all of the locks associated with restart or until the client returns all of the locks associated with
the stateid by means of an operation such as CLOSE or DELEGRETURN. the stateid by means of an operation such as CLOSE or DELEGRETURN.
If the locks are lost due to revocation the stateid remains a valid If the locks are lost due to revocation the stateid remains a valid
designation of that revoked state until the client frees it by using designation of that revoked state until the client frees it by using
FREE_STATEID. Stateids associated with record locks are an FREE_STATEID. Stateids associated with record locks are an
exception. They remain valid even if a LOCKU frees all remaining exception. They remain valid even if a LOCKU frees all remaining
locks, so long as the open file with which they are associated locks, so long as the open file with which they are associated
remains open, unless the client does a FREE_STATEID to cause the remains open, unless the client does a FREE_STATEID to cause the
stateid to be freed. stateid to be freed.
It should be noted that there are situations in which the client's It should be noted that there are situations in which the client's
locks become invalid, without the client requesting they be returned. locks become invalid, without the client requesting they be returned.
These include lease expiration and a number if forms lock revocation These include lease expiration and a number of forms of lock
within the lease period. It is important to note that in these revocation within the lease period. It is important to note that in
situations, the stateid remains valid and the client can use it to these situations, the stateid remains valid and the client can use it
determine the disposition of the associated lost locks. to determine the disposition of the associated lost locks.
An "other" value must never be reused for a different purpose (i.e. An "other" value must never be reused for a different purpose (i.e.
different filehandle, owner, or type of locks) within the context of different filehandle, owner, or type of locks) within the context of
a single client ID. A server may retain the "other" value for the a single client ID. A server may retain the "other" value for the
same purpose beyond the point where it may otherwise be freed but if same purpose beyond the point where it may otherwise be freed but if
it does so, it must maintain "seqid" continuity with previous values. it does so, it must maintain "seqid" continuity with previous values.
One mechanism that may be used to satisfy the requirement that the One mechanism that may be used to satisfy the requirement that the
server recognize invalid and out-of-date stateids is for the server server recognize invalid and out-of-date stateids is for the server
to divide the "other" field of the stateid into two fields. to divide the "other" field of the stateid into two fields.
skipping to change at page 156, line 25 skipping to change at page 157, line 18
steps to ensure that the renewal messages actually reach the server steps to ensure that the renewal messages actually reach the server
in good time. For example: in good time. For example:
o When trunking is in effect, the client should consider issuing o When trunking is in effect, the client should consider issuing
multiple requests on different connections, in order to ensure multiple requests on different connections, in order to ensure
that renewal occurs, even in the event of blockage in the path that renewal occurs, even in the event of blockage in the path
used for one of those connections. used for one of those connections.
o TCP retransmission delays might become so large as to approach or o TCP retransmission delays might become so large as to approach or
exceed the length of the lease period. This may be particularly exceed the length of the lease period. This may be particularly
likely when the server is unresponsive due to a reboot; see likely when the server is unresponsive due to a restart; see
Section 8.4.2.1 Section 8.4.2.1
If the server renews the lease upon receiving a SEQUENCE operation, If the server renews the lease upon receiving a SEQUENCE operation,
the server MUST NOT allow the lease to expire while the rest of the the server MUST NOT allow the lease to expire while the rest of the
operations in the COMPOUND procedure's request are still executing. operations in the COMPOUND procedure's request are still executing.
Once the last operation has finished, and the response to COMPOUND Once the last operation has finished, and the response to COMPOUND
has been sent, the server MUST set the lease to expire no sooner than has been sent, the server MUST set the lease to expire no sooner than
the sum of current time and the value of the lease_time attribute. the sum of current time and the value of the lease_time attribute.
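The two lease requirements in the preceding paragraph (no expiry while a COMPOUND is still executing, and expiry no sooner than reply time plus lease_time) can be sketched with a small tracker. This Python class is a hypothetical illustration for a server that renews on SEQUENCE; the class name, the in_flight counter, and passing the clock value explicitly are assumptions made to keep the sketch self-contained.

```python
class Lease:
    """Toy lease tracker; the current time is passed in explicitly."""

    def __init__(self, lease_time: float, now: float) -> None:
        self.lease_time = lease_time
        self.expires_at = now + lease_time
        self.in_flight = 0  # COMPOUND requests still executing

    def compound_started(self) -> None:
        """Called when a COMPOUND beginning with SEQUENCE is received."""
        self.in_flight += 1

    def compound_replied(self, now: float) -> None:
        """Called once the COMPOUND response has been sent."""
        self.in_flight -= 1
        # Expire no sooner than the reply time plus lease_time.
        self.expires_at = max(self.expires_at, now + self.lease_time)

    def expired(self, now: float) -> bool:
        # Never expire while any COMPOUND is still executing.
        return self.in_flight == 0 and now >= self.expires_at
```

Counting in-flight requests, rather than renewing only on arrival, is what keeps a long-running COMPOUND from outliving its own lease.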
A client ID's lease can expire when it has been at least the lease A client ID's lease can expire when it has been at least the lease
skipping to change at page 157, line 23 skipping to change at page 158, line 15
SEQ4_STATUS_RECALLABLE_STATE_REVOKED notify the client of lock SEQ4_STATUS_RECALLABLE_STATE_REVOKED notify the client of lock
revocation events. When these bits are set, the client should use revocation events. When these bits are set, the client should use
TEST_STATEID to find what stateids have been revoked and use TEST_STATEID to find what stateids have been revoked and use
FREE_STATEID to acknowledge loss of the associated state. FREE_STATEID to acknowledge loss of the associated state.
o The status bit SEQ4_STATUS_LEASE_MOVE indicates that o The status bit SEQ4_STATUS_LEASE_MOVE indicates that
responsibility for lease renewal has been transferred to one or responsibility for lease renewal has been transferred to one or
more new servers. more new servers.
o The status bit SEQ4_STATUS_RESTART_RECLAIM_NEEDED indicates that o The status bit SEQ4_STATUS_RESTART_RECLAIM_NEEDED indicates that
due to server restart or reboot the client must reclaim locking due to server restart the client must reclaim locking
state. state.
o The status bit SEQ4_STATUS_BACKCHANNEL_FAULT indicates the server has o The status bit SEQ4_STATUS_BACKCHANNEL_FAULT indicates the server has
encountered an unrecoverable fault with the backchannel (e.g. it encountered an unrecoverable fault with the backchannel (e.g. it
has lost track of a sequence id for a slot in the backchannel). has lost track of a sequence id for a slot in the backchannel).
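A client might dispatch on the status bits listed above roughly as follows. This Python sketch is not taken from the protocol definition: the bit positions are placeholders chosen for the example (the real values come from the protocol's XDR), only the four flags named in this excerpt are modeled, and the action strings are informal summaries of the bullets rather than prescribed behavior.

```python
from enum import IntFlag
from typing import List


class SeqStatus(IntFlag):
    # Placeholder bit positions for this sketch only.
    RECALLABLE_STATE_REVOKED = 0x1
    LEASE_MOVE = 0x2
    RESTART_RECLAIM_NEEDED = 0x4
    BACKCHANNEL_FAULT = 0x8


def client_actions(flags: SeqStatus) -> List[str]:
    """Map the status bits of one SEQUENCE reply to follow-up work."""
    actions: List[str] = []
    if flags & SeqStatus.RECALLABLE_STATE_REVOKED:
        actions.append("TEST_STATEID, then FREE_STATEID for revoked stateids")
    if flags & SeqStatus.LEASE_MOVE:
        actions.append("lease renewal responsibility moved to new server(s)")
    if flags & SeqStatus.RESTART_RECLAIM_NEEDED:
        actions.append("reclaim locking state")
    if flags & SeqStatus.BACKCHANNEL_FAULT:
        actions.append("backchannel has an unrecoverable fault")
    return actions
```

Several bits may be set in the same reply, so the sketch tests each flag independently instead of using an if/elif chain.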
8.4. Crash Recovery 8.4. Crash Recovery
A critical requirement in crash recovery is that both the client and A critical requirement in crash recovery is that both the client and
the server know when the other has failed. Additionally, it is the server know when the other has failed. Additionally, it is
required that a client sees a consistent view of data across server required that a client sees a consistent view of data across server
restarts or reboots. All READ and WRITE operations that may have restarts. All READ and WRITE operations that may have been queued
been queued within the client or network buffers must wait until the within the client or network buffers must wait until the client has
client has successfully recovered the locks protecting the READ and successfully recovered the locks protecting the READ and WRITE
WRITE operations. Any that reach the server before the server can operations. Any that reach the server before the server can safely
safely determine that the client has recovered enough locking state determine that the client has recovered enough locking state to be
to be sure that such operations can be safely processed must be sure that such operations can be safely processed must be rejected.
rejected. This will happen because either: This will happen because either:
o The state presented is no longer valid since it is associated with o The state presented is no longer valid since it is associated with
a now invalid clientid. In this case the client will receive a now invalid clientid. In this case the client will receive
either an NFS4ERR_BADSESSION or NFS4ERR_DEADSESSION error, and any either an NFS4ERR_BADSESSION or NFS4ERR_DEADSESSION error, and any
attempt to attach a new session to the existing clientid will attempt to attach a new session to the existing clientid will
encounter an NFS4ERR_STALE_CLIENTID error. encounter an NFS4ERR_STALE_CLIENTID error.
o Subsequent recovery of locks may make execution of the operation o Subsequent recovery of locks may make execution of the operation
inappropriate (NFS4ERR_GRACE). inappropriate (NFS4ERR_GRACE).
8.4.1. Client Failure and Recovery 8.4.1. Client Failure and Recovery
In the event that a client fails, the server may release the client's In the event that a client fails, the server may release the client's
locks when the associated lease has expired. Conflicting locks from locks when the associated lease has expired. Conflicting locks from
another client may only be granted after this lease expiration. As another client may only be granted after this lease expiration. As
discussed in Section 8.3, when a client has not failed and re- discussed in Section 8.3, when a client has not failed and re-
establishes his lease before expiration occurs, requests for establishes its lease before expiration occurs, requests for
conflicting locks will not be granted. conflicting locks will not be granted.
To minimize client delay upon restart, lock requests are associated To minimize client delay upon restart, lock requests are associated
with an instance of the client by a client-supplied verifier. This with an instance of the client by a client-supplied verifier. This
verifier is part of the client_owner4 sent in the initial EXCHANGE_ID verifier is part of the client_owner4 sent in the initial EXCHANGE_ID
call made by the client. The server returns a client ID as a result call made by the client. The server returns a client ID as a result
of the EXCHANGE_ID operation. The client then confirms the use of of the EXCHANGE_ID operation. The client then confirms the use of
the client ID by establishing a session associated with that client the client ID by establishing a session associated with that client
ID. See Section 18.36.3 for a description of how this is done. All ID. See Section 18.36.3 for a description of how this is done. All
locks, including opens, record locks, delegations, and layouts locks, including opens, record locks, delegations, and layouts
skipping to change at page 158, line 41 skipping to change at page 159, line 32
derived from the old verifier. At this point conflicting locks from derived from the old verifier. At this point conflicting locks from
other clients, kept waiting while the lease had not yet expired, can other clients, kept waiting while the lease had not yet expired, can
be granted. In addition, all stateids associated with the old be granted. In addition, all stateids associated with the old
clientid can also be freed, as they are no longer reference-able. clientid can also be freed, as they are no longer reference-able.
Note that the verifier must have the same uniqueness properties as Note that the verifier must have the same uniqueness properties as
the verifier for the COMMIT operation. the verifier for the COMMIT operation.
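The role of the client-supplied verifier can be sketched as a comparison made when EXCHANGE_ID arrives. The Python model below is a deliberately simplified assumption: the Server class and its fields are hypothetical, it frees old state immediately, and it ignores the confirmation step through session creation as well as any other conditions the protocol places on discarding state.

```python
from typing import Dict, List, Tuple


class Server:
    """Toy model of verifier-based client restart detection."""

    def __init__(self) -> None:
        self.next_id = 1
        # owner identifier -> (verifier, client ID)
        self.clients: Dict[bytes, Tuple[bytes, int]] = {}
        # client ID -> locks held under that client ID
        self.locks: Dict[int, List[str]] = {}

    def exchange_id(self, owner: bytes, verifier: bytes) -> int:
        known = self.clients.get(owner)
        if known and known[0] == verifier:
            return known[1]  # same client instance, same client ID
        if known:
            # New verifier: the old instance is gone, so its locks and
            # stateids can be freed and waiting clients can proceed.
            self.locks.pop(known[1], None)
        client_id = self.next_id
        self.next_id += 1
        self.clients[owner] = (verifier, client_id)
        self.locks[client_id] = []
        return client_id
```

Tying state to the verifier lets a restarted client shed its old locks at once instead of waiting for the old lease to run out.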
8.4.2. Server Failure and Recovery 8.4.2. Server Failure and Recovery
If the server loses locking state (usually as a result of a restart If the server loses locking state (usually as a result of a restart),
or reboot), it must allow clients time to discover this fact and re- it must allow clients time to discover this fact and re-establish the
establish the lost locking state. The client must be able to re- lost locking state. The client must be able to re-establish the
establish the locking state without having the server deny valid locking state without having the server deny valid requests because
requests because the server has granted conflicting access to another the server has granted conflicting access to another client.
client. Likewise, if there is a possibility that clients have not Likewise, if there is a possibility that clients have not yet re-
yet re-established their locking state for a file, and that such established their locking state for a file, and that such locking
locking state might make it invalid to perform READ or WRITE state might make it invalid to perform READ or WRITE operations, for
operations, for example through the establishment of mandatory locks, example through the establishment of mandatory locks, the server must
the server must disallow READ and WRITE operations for that file. disallow READ and WRITE operations for that file.
A client can determine that loss of locking state has occurred via A client can determine that loss of locking state has occurred via
several methods. several methods.
1. When a SEQUENCE (most common) or other operation returns 1. When a SEQUENCE (most common) or other operation returns
NFS4ERR_BADSESSION, this may mean the session has been destroyed, NFS4ERR_BADSESSION, this may mean the session has been destroyed,
but the client ID is still valid. The client sends a but the client ID is still valid. The client sends a
CREATE_SESSION request with the client ID to re-establish the CREATE_SESSION request with the client ID to re-establish the
session. If CREATE_SESSION fails with NFS4ERR_STALE_CLIENTID, session. If CREATE_SESSION fails with NFS4ERR_STALE_CLIENTID,
the client must establish a new client ID (see Section 8.1) and the client must establish a new client ID (see Section 8.1) and
skipping to change at page 159, line 36 skipping to change at page 160, line 27
3. When an operation, neither SEQUENCE nor preceded by SEQUENCE (for 3. When an operation, neither SEQUENCE nor preceded by SEQUENCE (for
example, CREATE_SESSION, DESTROY_SESSION) returns example, CREATE_SESSION, DESTROY_SESSION) returns
NFS4ERR_STALE_CLIENTID. The client MUST establish a new client NFS4ERR_STALE_CLIENTID. The client MUST establish a new client
ID (Section 8.1) and re-establish its lock state ID (Section 8.1) and re-establish its lock state
(Section 8.4.2.1). (Section 8.4.2.1).
8.4.2.1. State Reclaim 8.4.2.1. State Reclaim
When state information and the associated locks are lost as a result When state information and the associated locks are lost as a result
of a server reboot, the protocol must provide a way to cause that of a server restart, the protocol must provide a way to cause that
state to be re-established. The approach used is to define, for most state to be re-established. The approach used is to define, for most
types of locking state (layouts are an exception), a request whose types of locking state (layouts are an exception), a request whose
function is to allow the client to re-establish on the server a lock function is to allow the client to re-establish on the server a lock
first obtained from a previous instance. Generally these requests first obtained from a previous instance. Generally these requests
are variants of the requests normally used to create locks of that are variants of the requests normally used to create locks of that
type and are referred to as "reclaim-type" requests and the process type and are referred to as "reclaim-type" requests and the process
of re-establishing such locks is referred to as "reclaiming" them. of re-establishing such locks is referred to as "reclaiming" them.
Because each client must have an opportunity to reclaim all of the Because each client must have an opportunity to reclaim all of the
locks that it has without the possibility that some other client will locks that it has without the possibility that some other client will
be granted a conflicting lock, a special period called the "grace be granted a conflicting lock, a special period called the "grace
period" is devoted to the reclaim process. During this period, period" is devoted to the reclaim process. During this period,
requests creating client IDs and sessions are handled normally, but requests creating client IDs and sessions are handled normally, but
locking requests are subject to special restrictions. Only reclaim- locking requests are subject to special restrictions. Only reclaim-
type locking requests are allowed, unless the server is able to type locking requests are allowed, unless the server is able to
reliably determine (through state persistently maintained across reliably determine (through state persistently maintained across
reboot instances), that granting any such lock cannot possibly restart instances), that granting any such lock cannot possibly
conflict with a subsequent reclaim. When a request is made to obtain conflict with a subsequent reclaim. When a request is made to obtain
a new lock (i.e. not a reclaim-type request) during the grace period a new lock (i.e. not a reclaim-type request) during the grace period
and such a determination cannot be made, the server must return the and such a determination cannot be made, the server must return the
error NFS4ERR_GRACE. error NFS4ERR_GRACE.
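The grace-period rule described above can be sketched as a small decision function. This is an illustrative sketch only, not text from either draft: the function name `grace_period_check`, its parameters, and the `can_prove_no_conflict` flag (standing in for the server's determination from persistently maintained state) are hypothetical, and status codes are modelled as plain strings rather than XDR values. The sketch covers only requests that arrive while the grace period is in effect.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_GRACE = "NFS4ERR_GRACE"


def grace_period_check(is_reclaim: bool, can_prove_no_conflict: bool) -> str:
    """Return the status for a locking request received during the grace period.

    is_reclaim            -- True for reclaim-type requests (hypothetical flag).
    can_prove_no_conflict -- True only if the server can reliably determine,
                             from state kept across restarts, that granting the
                             lock cannot conflict with a subsequent reclaim.
    Ordinary lock-conflict checking is omitted from this sketch.
    """
    if is_reclaim:
        # Reclaim-type requests are what the grace period exists for.
        return NFS4_OK
    if can_prove_no_conflict:
        # New lock, but provably safe with respect to later reclaims.
        return NFS4_OK
    # New lock and no such determination can be made.
    return NFS4ERR_GRACE
```

A server with no persistent lock state would always pass `can_prove_no_conflict=False`, so every non-reclaim request during the grace period yields NFS4ERR_GRACE.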
Once a session is established using the new client ID, the client Once a session is established using the new client ID, the client
will use reclaim-type locking requests (e.g. LOCK requests with will use reclaim-type locking requests (e.g. LOCK requests with
reclaim set to true and OPEN operations with a claim type of reclaim set to TRUE and OPEN operations with a claim type of
CLAIM_PREVIOUS. See Section 9.11) to re-establish its locking state. CLAIM_PREVIOUS. See Section 9.11) to re-establish its locking state.
Once this is done, or if there is no such locking state to reclaim, Once this is done, or if there is no such locking state to reclaim,
the client sends a global RECLAIM_COMPLETE operation, i.e. one with the client sends a global RECLAIM_COMPLETE operation, i.e. one with
the one_fs argument set to false, to indicate that it has reclaimed the rca_one_fs argument set to FALSE, to indicate that it has
all of the locking state that it will reclaim. Once a client sends reclaimed all of the locking state that it will reclaim. Once a
such a RECLAIM_COMPLETE operation, it may attempt non-reclaim locking client sends such a RECLAIM_COMPLETE operation, it may attempt non-
operations, although it may get NFS4ERR_GRACE errors from the operations reclaim locking operations, although it may get NFS4ERR_GRACE errors
until the period of special handling is over. See Section 11.7.7 for from the operations until the period of special handling is over. See
a discussion of the analogous handling of lock reclamation in the case Section 11.7.7 for a discussion of the analogous handling of lock
of file systems transitioning from server to server. reclamation in the case of file systems transitioning from server to
server.
During the grace period, the server must reject READ and WRITE During the grace period, the server must reject READ and WRITE
operations and non-reclaim locking requests (i.e. other LOCK and OPEN operations and non-reclaim locking requests (i.e. other LOCK and OPEN
operations) with an error of NFS4ERR_GRACE, unless it is able to operations) with an error of NFS4ERR_GRACE, unless it is able to
guarantee that these may be done safely, as described below. guarantee that these may be done safely, as described below.
The grace period may last until all clients who are known to possibly The grace period may last until all clients which are known to
have had locks have done a global RECLAIM_COMPLETE operation, possibly have had locks have done a global RECLAIM_COMPLETE
indicating that they have finished reclaiming the locks they held operation, indicating that they have finished reclaiming the locks
before the server reboot. This means that a client which has done a they held before the server restart. This means that a client which
RECLAIM_COMPLETE must be prepared to receive an NFS4ERR_GRACE when has done a RECLAIM_COMPLETE must be prepared to receive an
attempting to acquire new locks. The server is assumed to maintain NFS4ERR_GRACE when attempting to acquire new locks. The server is
in stable storage a list of clients who may have such locks. The assumed to maintain in stable storage a list of clients which may
server may also terminate the grace period before all clients have have such locks. The server may also terminate the grace period
done a global RECLAIM_COMPLETE. The server SHOULD NOT terminate the before all clients have done a global RECLAIM_COMPLETE. The server
grace period before a time equal to the lease period in order to give SHOULD NOT terminate the grace period before a time equal to the
clients an opportunity to find out about the server reboot, as a lease period in order to give clients an opportunity to find out
result of issuing requests on associated sessions with a frequency about the server restart, as a result of issuing requests on
governed by the lease time. Note that when a client does not issue associated sessions with a frequency governed by the lease time.
such requests (or they are issued by the client but not received by Note that when a client does not issue such requests (or they are
the server), it is possible for the grace period to expire before the issued by the client but not received by the server), it is possible
client finds out that the server reboot has occurred. for the grace period to expire before the client finds out that the
server restart has occurred.
Some additional time in order to allow time to establish a new client Some additional time in order to allow a client to establish a new
ID and session and to effect lock reclaims may be added to the lease client ID and session and to effect lock reclaims may be added to the
time. Note that analogous rules apply to file system-specific grace lease time. Note that analogous rules apply to file system-specific
periods discussed in Section 11.7.7. grace periods discussed in Section 11.7.7.
If the server can reliably determine that granting a non-reclaim If the server can reliably determine that granting a non-reclaim
request will not conflict with reclamation of locks by other clients, request will not conflict with reclamation of locks by other clients,
the NFS4ERR_GRACE error does not have to be returned even within the the NFS4ERR_GRACE error does not have to be returned even within the
grace period, although NFS4ERR_GRACE must always be returned to grace period, although NFS4ERR_GRACE must always be returned to
clients attempting a non-reclaim lock request before doing their own clients attempting a non-reclaim lock request before doing their own
global RECLAIM_COMPLETE. For the server to be able to service READ global RECLAIM_COMPLETE. For the server to be able to service READ
and WRITE operations during the grace period, it must again be able and WRITE operations during the grace period, it must again be able
to guarantee that no possible conflict could arise between a to guarantee that no possible conflict could arise between a
potential reclaim locking request and the READ or WRITE operation. potential reclaim locking request and the READ or WRITE operation.
skipping to change at page 162, line 11 skipping to change at page 163, line 4
non-reclaim lock and I/O requests. In this case the client should non-reclaim lock and I/O requests. In this case the client should
employ a retry mechanism for the request. A delay (on the order of employ a retry mechanism for the request. A delay (on the order of
several seconds) between retries should be used to avoid overwhelming several seconds) between retries should be used to avoid overwhelming
the server. Further discussion of the general issue is included in the server. Further discussion of the general issue is included in
[37]. The client must account for servers that are able to perform [37]. The client must account for servers that are able to perform
I/O and non-reclaim locking requests within the grace period as well I/O and non-reclaim locking requests within the grace period as well
as those that cannot do so. as those that cannot do so.
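The retry behaviour suggested above might look like the following on the client side. This is a minimal sketch under stated assumptions: `retry_during_grace`, `send_request`, `delay_seconds`, and `max_attempts` are hypothetical names, the five-second default is just one reading of "on the order of several seconds", and the attempt limit is an addition of this sketch rather than anything the draft requires.

```python
import time
from typing import Callable

NFS4ERR_GRACE = "NFS4ERR_GRACE"


def retry_during_grace(send_request: Callable[[], str],
                       delay_seconds: float = 5.0,
                       max_attempts: int = 20,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Re-send a non-reclaim lock or I/O request while the server is in grace.

    send_request -- hypothetical callable that issues the request once and
                    returns its status as a string.
    The delay between attempts avoids overwhelming a recovering server.
    """
    status = send_request()
    attempts = 1
    while status == NFS4ERR_GRACE and attempts < max_attempts:
        sleep(delay_seconds)
        status = send_request()
        attempts += 1
    # Either the request got a non-grace status or the attempt limit was hit.
    return status
```

Because the first attempt is always made, the same loop works unchanged against a server that can service the request within the grace period.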
A reclaim-type locking request outside the server's grace period can A reclaim-type locking request outside the server's grace period can
only succeed if the server can guarantee that no conflicting lock or only succeed if the server can guarantee that no conflicting lock or
I/O request has been granted since reboot or restart. I/O request has been granted since restart.
A server may, upon restart, establish a new value for the lease A server may, upon restart, establish a new value for the lease
period. Therefore, clients should, once a new client ID is period. Therefore, clients should, once a new client ID is
established, refetch the lease_time attribute and use it as the basis established, refetch the lease_time attribute and use it as the basis
for lease renewal for the lease associated with that server. for lease renewal for the lease associated with that server.
However, the server must establish, for this restart event, a grace However, the server must establish, for this restart event, a grace
period at least as long as the lease period for the previous server period at least as long as the lease period for the previous server
instantiation. This allows the client state obtained during the instantiation. This allows the client state obtained during the
previous server instance to be reliably re-established. previous server instance to be reliably re-established.
skipping to change at page 163, line 11 skipping to change at page 164, line 4
to create a new session, it would get an NFS4ERR_STALE_CLIENTID. to create a new session, it would get an NFS4ERR_STALE_CLIENTID.
Upon creating the new clientid and new session it would attempt to Upon creating the new clientid and new session it would attempt to
reclaim locks but not be allowed to do so by the server. reclaim locks but not be allowed to do so by the server.
Another possibility is for the server to maintain the session and Another possibility is for the server to maintain the session and
clientid but for all stateids held by the client to become invalid or clientid but for all stateids held by the client to become invalid or
stale. Once the client is able to reach the server after such a stale. Once the client is able to reach the server after such a
network partition, the status returned by the SEQUENCE operation will network partition, the status returned by the SEQUENCE operation will
indicate a loss of locking state. (The flag indicate a loss of locking state. (The flag
SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED will be set in SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED will be set in
sr_status_flags). In addition all I/O submitted by the client with sr_status_flags.) In addition, all I/O submitted by the client with
the now invalid stateids will fail with the server returning the the now invalid stateids will fail with the server returning the
error NFS4ERR_EXPIRED. Once the client learns of the loss of locking error NFS4ERR_EXPIRED. Once the client learns of the loss of locking
state, it will suitably notify the applications that held the state, it will suitably notify the applications that held the
invalidated locks. The client should then take action to free invalidated locks. The client should then take action to free
invalidated stateids, either by establishing a new client ID using a invalidated stateids, either by establishing a new client ID using a
new verifier or by doing a FREE_STATEID operation to release each of new verifier or by doing a FREE_STATEID operation to release each of
the invalidated stateids. the invalidated stateids.
When the server adopts a finer-grained approach to revocation of When the server adopts a finer-grained approach to revocation of
locks when leases have expired, only a subset of stateids will locks when leases have expired, only a subset of stateids will
skipping to change at page 163, line 36 skipping to change at page 164, line 29
including I/O submitted by the client with the now invalid stateids including I/O submitted by the client with the now invalid stateids
will fail with the server returning the error NFS4ERR_EXPIRED. Once will fail with the server returning the error NFS4ERR_EXPIRED. Once
the client learns of the loss of locking state, it will use the the client learns of the loss of locking state, it will use the
TEST_STATEID operation on all of its stateids to determine which TEST_STATEID operation on all of its stateids to determine which
locks have been lost and then suitably notify the applications that locks have been lost and then suitably notify the applications that
held the invalidated locks. The client can then release the held the invalidated locks. The client can then release the
invalidated locking state and acknowledge the revocation of the invalidated locking state and acknowledge the revocation of the
associated locks by doing a FREE_STATEID operation on each of the associated locks by doing a FREE_STATEID operation on each of the
invalidated stateids. invalidated stateids.
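The client-side sequence described above (test every stateid, notify the owners of lost locks, then free the invalidated stateids) could be sketched as follows. The helper names `recover_revoked_state`, `test_stateid`, `free_stateid`, and `notify_application` are hypothetical wrappers assumed for illustration; they are not operations or signatures defined by the draft, and the per-stateid call shape is a simplification.

```python
from typing import Callable, Iterable, List, Tuple

NFS4_OK = "NFS4_OK"


def recover_revoked_state(stateids: Iterable[str],
                          test_stateid: Callable[[str], str],
                          free_stateid: Callable[[str], None],
                          notify_application: Callable[[str], None]
                          ) -> Tuple[List[str], List[str]]:
    """Split stateids into still-valid and lost, then release the lost ones.

    test_stateid -- hypothetical wrapper returning a status string for one
                    stateid (standing in for a TEST_STATEID round trip).
    free_stateid -- hypothetical wrapper standing in for FREE_STATEID.
    """
    kept: List[str] = []
    lost: List[str] = []
    for sid in stateids:
        if test_stateid(sid) == NFS4_OK:
            kept.append(sid)
        else:
            lost.append(sid)
    for sid in lost:
        # Tell the lock holder first, then acknowledge the revocation.
        notify_application(sid)
        free_stateid(sid)
    return kept, lost
```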
When a network partition is combined with a server reboot, there are When a network partition is combined with a server restart, there are
edge conditions that place requirements on the server in order to edge conditions that place requirements on the server in order to
avoid silent data corruption following the server reboot. Two of avoid silent data corruption following the server restart. Two of
these edge conditions are known, and are discussed below. these edge conditions are known, and are discussed below.
The first edge condition arises as a result of scenarios such as The first edge condition arises as a result of scenarios such as
the following: the following:
1. Client A acquires a lock. 1. Client A acquires a lock.
2. Client A and server experience mutual network partition, such 2. Client A and server experience mutual network partition, such
that client A is unable to renew its lease. that client A is unable to renew its lease.
3. Client A's lease expires, and the server releases lock. 3. Client A's lease expires, and the server releases the lock.
4. Client B acquires a lock that would have conflicted with that of 4. Client B acquires a lock that would have conflicted with that of
Client A. Client A.
5. Client B releases its lock. 5. Client B releases its lock.
6. Server reboots. 6. Server restarts.
7. Network partition between client A and server heals. 7. Network partition between client A and server heals.
8. Client A connects to new server instance and finds out about 8. Client A connects to new server instance and finds out about
server reboot. server restart.
9. Client A reclaims its lock within the server's grace period. 9. Client A reclaims its lock within the server's grace period.
Thus, at the final step, the server has erroneously granted client Thus, at the final step, the server has erroneously granted client
A's lock reclaim. If client B modified the object the lock was A's lock reclaim. If client B modified the object the lock was
protecting, client A will experience object corruption. protecting, client A will experience object corruption.
The second known edge condition arises in situations such as the The second known edge condition arises in situations such as the
following: following:
1. Client A acquires one or more locks. 1. Client A acquires one or more locks.
2. Server reboots. 2. Server restarts.
3. Client A and server experience mutual network partition, such 3. Client A and server experience mutual network partition, such
that client A is unable to reclaim all of its locks within the that client A is unable to reclaim all of its locks within the
grace period. grace period.
4. Server's reclaim grace period ends. Client A has either no 4. Server's reclaim grace period ends. Client A has either no
locks or an incomplete set of locks known to the server. locks or an incomplete set of locks known to the server.
5. Client B acquires a lock that would have conflicted with a lock 5. Client B acquires a lock that would have conflicted with a lock
of client A that was not reclaimed. of client A that was not reclaimed.
6. Client B releases the lock. 6. Client B releases the lock.
7. Server reboots a second time. 7. Server restarts a second time.
8. Network partition between client A and server heals. 8. Network partition between client A and server heals.
9. Client A connects to new server instance and finds out about 9. Client A connects to new server instance and finds out about
server reboot. server restart.
10. Client A reclaims its lock within the server's grace period. 10. Client A reclaims its lock within the server's grace period.
As with the first edge condition, the final step of the scenario of As with the first edge condition, the final step of the scenario of
the second edge condition has the server erroneously granting client the second edge condition has the server erroneously granting client
A's lock reclaim. A's lock reclaim.
Solving the first and second edge conditions requires that the server Solving the first and second edge conditions requires that the server
either always assumes after it reboots that some edge condition either always assumes after it restarts that some edge condition
occurs, and thus return NFS4ERR_NO_GRACE for all reclaim attempts, or occurs, and thus return NFS4ERR_NO_GRACE for all reclaim attempts, or
that the server record some information in stable storage. The that the server record some information in stable storage. The
amount of information the server records in stable storage is in amount of information the server records in stable storage is in
inverse proportion to how harsh the server intends to be whenever inverse proportion to how harsh the server intends to be whenever
edge conditions arise. The server that is completely tolerant of all edge conditions arise. The server that is completely tolerant of all
edge conditions will record in stable storage every lock that is edge conditions will record in stable storage every lock that is
acquired, removing the lock record from stable storage only when the acquired, removing the lock record from stable storage only when the
lock is released. For the two edge conditions discussed above, the lock is released. For the two edge conditions discussed above, the
harshest a server can be, and still support a grace period for harshest a server can be, and still support a grace period for
reclaims, requires that the server record in stable storage reclaims, requires that the server record in stable storage
skipping to change at page 165, line 37 skipping to change at page 166, line 33
o a boolean that indicates whether the client may have locks that it o a boolean that indicates whether the client may have locks that it
believes to be reclaimable in situations in which the grace period believes to be reclaimable in situations in which the grace period
was terminated, making the server's view of lock reclaimability was terminated, making the server's view of lock reclaimability
suspect. The server will set this for any client record in stable suspect. The server will set this for any client record in stable
storage where the client has not done a suitable RECLAIM_COMPLETE storage where the client has not done a suitable RECLAIM_COMPLETE
(global or file system-specific depending on the target of the (global or file system-specific depending on the target of the
lock request) before it grants any new (i.e. not reclaimed) lock lock request) before it grants any new (i.e. not reclaimed) lock
to any client. to any client.
Assuming the above record keeping, for the first edge condition, Assuming the above record keeping, for the first edge condition,
after the server reboots, the record that client A's lease expired after the server restarts, the record that client A's lease expired
means that another client could have acquired a conflicting record means that another client could have acquired a conflicting record
lock, share reservation, or delegation. Hence the server must reject lock, share reservation, or delegation. Hence the server must reject
a reclaim from client A with the error NFS4ERR_NO_GRACE. a reclaim from client A with the error NFS4ERR_NO_GRACE.
For the second edge condition, after the server reboots for a second For the second edge condition, after the server restarts for a second
time, the indication that the client had not completed its reclaims time, the indication that the client had not completed its reclaims
at the time at which the grace period ended means that the server at the time at which the grace period ended means that the server
must reject a reclaim from client A with the error NFS4ERR_NO_GRACE. must reject a reclaim from client A with the error NFS4ERR_NO_GRACE.
When either edge condition occurs, the client's attempt to reclaim When either edge condition occurs, the client's attempt to reclaim
locks will result in the error NFS4ERR_NO_GRACE. When this is locks will result in the error NFS4ERR_NO_GRACE. When this is
received, or after the client reboots with no lock state, the client received, or after the client restarts with no lock state, the client
will send a global RECLAIM_COMPLETE. When the RECLAIM_COMPLETE is will send a global RECLAIM_COMPLETE. When the RECLAIM_COMPLETE is
received, the server and client are again in agreement regarding received, the server and client are again in agreement regarding
reclaimable locks and both booleans in persistent storage can be reclaimable locks and both booleans in persistent storage can be
reset, to be set again only when there is a subsequent event that reset, to be set again only when there is a subsequent event that
causes lock reclaim operations to be questionable. causes lock reclaim operations to be questionable.
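The record keeping discussed above can be illustrated with a per-client record holding two booleans. This is a sketch under assumptions: the class `ClientRecord`, its field names, and the two functions are hypothetical, and since the definition of the first boolean is not shown in this excerpt, the sketch assumes it records that the client's lease expired (as the treatment of the first edge condition implies). Persisting the record to stable storage is left out.

```python
from dataclasses import dataclass

NFS4_OK = "NFS4_OK"
NFS4ERR_NO_GRACE = "NFS4ERR_NO_GRACE"


@dataclass
class ClientRecord:
    """Hypothetical per-client record kept in stable storage."""
    lease_expired: bool = False       # assumed meaning of the first boolean
    reclaim_incomplete: bool = False  # grace ended before RECLAIM_COMPLETE


def reclaim_status(record: ClientRecord) -> str:
    """Status for a reclaim request from this client after a server restart."""
    if record.lease_expired or record.reclaim_incomplete:
        # Another client may have held a conflicting lock in the meantime.
        return NFS4ERR_NO_GRACE
    return NFS4_OK


def on_reclaim_complete(record: ClientRecord) -> None:
    """Client and server agree again, so both booleans can be reset."""
    record.lease_expired = False
    record.reclaim_incomplete = False
```

In the first edge condition the server would have set `lease_expired` before restarting, and in the second it would have set `reclaim_incomplete` when the grace period ended; in both cases client A's later reclaim is rejected.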
Regardless of the level and approach to record keeping, the server Regardless of the level and approach to record keeping, the server
MUST implement one of the following strategies (which apply to MUST implement one of the following strategies (which apply to
reclaims of share reservations, record locks, and delegations): reclaims of share reservations, record locks, and delegations):
1. Reject all reclaims with NFS4ERR_NO_GRACE. This is extremely 1. Reject all reclaims with NFS4ERR_NO_GRACE. This is extremely
unforgiving, but necessary if the server does not record lock unforgiving, but necessary if the server does not record lock
state in stable storage. state in stable storage.
2. Record sufficient state in stable storage such that all known 2. Record sufficient state in stable storage such that all known
edge conditions involving server reboot, including the two noted edge conditions involving server restart, including the two noted
in this section, are detected. Erroneously recognizing an edge in this section, are detected. Erroneously recognizing an edge
condition and not allowing a reclaim when, with sufficient knowledge, it condition and not allowing a reclaim when, with sufficient knowledge, it
would be grantable, is acceptable. Note that at this time, it is would be grantable, is acceptable. Note that at this time, it is
not known if there are other edge conditions. not known if there are other edge conditions.
In the event that, after a server reboot, the server determines In the event that, after a server restart, the server determines
that there is unrecoverable damage or corruption to the that there is unrecoverable damage or corruption to the
information in stable storage, then for all clients and/or locks information in stable storage, then for all clients and/or locks
which may be affected, the server MUST return NFS4ERR_NO_GRACE. which may be affected, the server MUST return NFS4ERR_NO_GRACE.
A mandate for the client's handling of the NFS4ERR_NO_GRACE error is A mandate for the client's handling of the NFS4ERR_NO_GRACE error is
outside the scope of this specification, since the strategies for outside the scope of this specification, since the strategies for
such handling are very dependent on the client's operating such handling are very dependent on the client's operating
environment. However, one potential approach is described below. environment. However, one potential approach is described below.
When the client receives NFS4ERR_NO_GRACE, it could examine the When the client receives NFS4ERR_NO_GRACE, it could examine the
skipping to change at page 167, line 8 skipping to change at page 168, line 5
8.5. Server Revocation of Locks 8.5. Server Revocation of Locks
At any point, the server can revoke locks held by a client and the At any point, the server can revoke locks held by a client and the
client must be prepared for this event. When the client detects that client must be prepared for this event. When the client detects that
its locks have been or may have been revoked, the client is its locks have been or may have been revoked, the client is
responsible for validating the state information between itself and responsible for validating the state information between itself and
the server. Validating locking state for the client means that it the server. Validating locking state for the client means that it
must verify or reclaim state for each lock currently held. must verify or reclaim state for each lock currently held.
The first occasion of lock revocation is upon server reboot or The first occasion of lock revocation is upon server restart. Note
restart. Note that this includes situations in which sessions are that this includes situations in which sessions are persistent and
persistent and locking state is lost. In this class of instances, locking state is lost. In this class of instances, the client will
the client will receive an error (NFS4ERR_STALE_CLIENTID on an receive an error (NFS4ERR_STALE_CLIENTID on an operation that takes
operation that takes client ID, usually as part of recovery in client ID, usually as part of recovery in response to a problem with
response to a problem with the current session) and the client will the current session) and the client will proceed with normal crash
proceed with normal crash recovery as described in recovery as described in Section 8.4.2.1.
Section 8.4.2.1.
The second occasion of lock revocation is the inability to renew the The second occasion of lock revocation is the inability to renew the
lease before expiration, as discussed in Section 8.4.3. While this lease before expiration, as discussed in Section 8.4.3. While this
is considered a rare or unusual event, the client must be prepared to is considered a rare or unusual event, the client must be prepared to
recover. The server is responsible for determining the precise recover. The server is responsible for determining the precise
consequences of the lease expiration, informing the client of the consequences of the lease expiration, informing the client of the
scope of the lock revocation decided upon. The client then uses the scope of the lock revocation decided upon. The client then uses the
status information provided by the server in the SEQUENCE results status information provided by the server in the SEQUENCE results
(field sr_status_flags, see Section 18.46.3) to synchronize its (field sr_status_flags, see Section 18.46.3) to synchronize its
locking state with that of the server, in order to recover. locking state with that of the server, in order to recover.
skipping to change at page 168, line 48 skipping to change at page 169, line 43
before the lease would expire. before the lease would expire.
The server's lease period configuration should take into account the The server's lease period configuration should take into account the
network distance of the clients that will be accessing the server's network distance of the clients that will be accessing the server's
resources. It is expected that the lease period will take into resources. It is expected that the lease period will take into
account the network propagation delays and other network delay account the network propagation delays and other network delay
factors for the client population. Since the protocol does not allow factors for the client population. Since the protocol does not allow
for an automatic method to determine an appropriate lease period, the for an automatic method to determine an appropriate lease period, the
server's administrator may have to tune the lease period. server's administrator may have to tune the lease period.
8.8. Vestigial Locking Infrastructure From V4.0 8.8. Obsolete Locking Infrastructure From NFSv4.0
There are a number of operations and fields within existing There are a number of operations and fields within existing
operations that no longer have a function in minor version one. In operations that no longer have a function in NFSv4.1. In one way or
one way or another, these changes are all due to the implementation another, these changes are all due to the implementation of sessions
of sessions which provides client context and exactly once semantics which provides client context and exactly once semantics as a base
as a base feature of the protocol, separate from locking itself. feature of the protocol, separate from locking itself.
The following NFSv4.0 operations MUST NOT be implemented in NFSv4.1. The following NFSv4.0 operations MUST NOT be implemented in NFSv4.1.
The server MUST return NFS4ERR_NOTSUPP if these operations are found The server MUST return NFS4ERR_NOTSUPP if these operations are found
in an NFSv4.1 COMPOUND. in an NFSv4.1 COMPOUND.
o SETCLIENTID since its function has been replaced by EXCHANGE_ID. o SETCLIENTID since its function has been replaced by EXCHANGE_ID.
o SETCLIENTID_CONFIRM since client ID confirmation now happens by o SETCLIENTID_CONFIRM since client ID confirmation now happens by
means of CREATE_SESSION. means of CREATE_SESSION.
o OPEN_CONFIRM because OPENs no longer require confirmation to o OPEN_CONFIRM because OPENs no longer require confirmation to
establish an owner-based sequence value. establish an owner-based sequence value.
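The rule above can be sketched as a server-side dispatch check. This is a minimal model, not a real server: operations are represented by name only, the set holds just the three operations listed in this excerpt, and check_compound is a hypothetical helper. It assumes the usual COMPOUND behavior of stopping at the first operation that fails.

```python
NFS4_OK = 0
NFS4ERR_NOTSUPP = 10004

# Only the operations named in the list above; the full section may list more.
OBSOLETE_IN_NFSV41 = frozenset(
    {"SETCLIENTID", "SETCLIENTID_CONFIRM", "OPEN_CONFIRM"}
)


def check_compound(op_names):
    """Return (op, status) pairs for an NFSv4.1 COMPOUND.

    Processing stops at the first operation that fails, so an obsolete
    NFSv4.0 operation ends the COMPOUND with NFS4ERR_NOTSUPP.
    """
    results = []
    for op in op_names:
        if op in OBSOLETE_IN_NFSV41:
            results.append((op, NFS4ERR_NOTSUPP))
            break
        results.append((op, NFS4_OK))  # stand-in for real op processing
    return results
```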
skipping to change at page 183, line 49 skipping to change at page 184, line 41
A client failure or a network partition can result in failure to A client failure or a network partition can result in failure to
respond to a recall callback. In this case, the server will revoke respond to a recall callback. In this case, the server will revoke
the delegation which in turn will render useless any modified state the delegation which in turn will render useless any modified state
still on the client. still on the client.
10.2.1. Delegation Recovery 10.2.1. Delegation Recovery
There are three situations that delegation recovery must deal with: There are three situations that delegation recovery must deal with:
o Client reboot or restart o Client restart
o Server restart
o Server reboot or restart
o Network partition (full or backchannel-only) o Network partition (full or backchannel-only)
In the event the client reboots or restarts, the failure to renew the In the event the client restarts, the failure to renew the lease will
lease will result in the revocation of record locks and share result in the revocation of record locks and share reservations.
reservations. Delegations, however, may be treated a bit Delegations, however, may be treated a bit differently.
differently.
There will be situations in which delegations will need to be There will be situations in which delegations will need to be
reestablished after a client reboots or restarts. The reason for reestablished after a client restarts. The reason for this is the
this is the client may have file data stored locally and this data client may have file data stored locally and this data was associated
was associated with the previously held delegations. The client will with the previously held delegations. The client will need to
need to reestablish the appropriate file state on the server. reestablish the appropriate file state on the server.
To allow for this type of client recovery, the server MAY extend the To allow for this type of client recovery, the server MAY extend the
period for delegation recovery beyond the typical lease expiration period for delegation recovery beyond the typical lease expiration
period. This implies that requests from other clients that conflict period. This implies that requests from other clients that conflict
with these delegations will need to wait. Because the normal recall with these delegations will need to wait. Because the normal recall
process may require significant time for the client to flush changed process may require significant time for the client to flush changed
state to the server, other clients need to be prepared for delays that state to the server, other clients need to be prepared for delays that
occur because of a conflicting delegation. This longer interval occur because of a conflicting delegation. This longer interval
would increase the window for clients to reboot and consult stable would increase the window for clients to restart and consult stable
storage so that the delegations can be reclaimed. For open storage so that the delegations can be reclaimed. For open
delegations, such delegations are reclaimed using OPEN with a claim delegations, such delegations are reclaimed using OPEN with a claim
type of CLAIM_DELEGATE_PREV. (See Section 10.5 and Section 18.16 for type of CLAIM_DELEGATE_PREV. (See Section 10.5 and Section 18.16 for
discussion of open delegation and the details of OPEN respectively). discussion of open delegation and the details of OPEN respectively).
A server MAY support a claim type of CLAIM_DELEGATE_PREV, and if it A server MAY support a claim type of CLAIM_DELEGATE_PREV, and if it
does, it MUST NOT remove delegations upon a CREATE_SESSION that does, it MUST NOT remove delegations upon a CREATE_SESSION that
confirms a client ID created by EXCHANGE_ID, and instead MUST, for a confirms a client ID created by EXCHANGE_ID, and instead MUST, for a
period of time no less than that of the value of the lease_time period of time no less than that of the value of the lease_time
attribute, maintain the client's delegations to allow time for the attribute, maintain the client's delegations to allow time for the
client to send CLAIM_DELEGATE_PREV requests. The server that client to send CLAIM_DELEGATE_PREV requests. The server that
supports CLAIM_DELEGATE_PREV MUST support the DELEGPURGE operation. supports CLAIM_DELEGATE_PREV MUST support the DELEGPURGE operation.
When the server reboots or restarts, delegations are reclaimed (using When the server restarts, delegations are reclaimed (using the OPEN
the OPEN operation with CLAIM_PREVIOUS) in a similar fashion to operation with CLAIM_PREVIOUS) in a similar fashion to record locks
record locks and share reservations. However, there is a slight and share reservations. However, there is a slight semantic
semantic difference. In the normal case if the server decides that a difference. In the normal case if the server decides that a
delegation should not be granted, it performs the requested action delegation should not be granted, it performs the requested action
(e.g. OPEN) without granting any delegation. For reclaim, the (e.g. OPEN) without granting any delegation. For reclaim, the
server grants the delegation but a special designation is applied so server grants the delegation but a special designation is applied so
that the client treats the delegation as having been granted but that the client treats the delegation as having been granted but
recalled by the server. Because of this, the client has the duty to recalled by the server. Because of this, the client has the duty to
write all modified state to the server and then return the write all modified state to the server and then return the
delegation. This process of handling delegation reclaim reconciles delegation. This process of handling delegation reclaim reconciles
three principles of the NFSv4.1 protocol: three principles of the NFSv4.1 protocol:
o Upon reclaim, a client reporting resources assigned to it by an o Upon reclaim, a client reporting resources assigned to it by an
skipping to change at page 185, line 44 skipping to change at page 186, line 34
requests are held off. Eventually the occurrence of a conflicting requests are held off. Eventually the occurrence of a conflicting
request from another client will cause revocation of the delegation. request from another client will cause revocation of the delegation.
A loss of the backchannel (e.g. by later network configuration A loss of the backchannel (e.g. by later network configuration
change) will have the same effect. A recall request will fail and change) will have the same effect. A recall request will fail and
revocation of the delegation will result. revocation of the delegation will result.
A client normally finds out about revocation of a delegation when it A client normally finds out about revocation of a delegation when it
uses a stateid associated with a delegation and receives one of the uses a stateid associated with a delegation and receives one of the
errors NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or errors NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or
NFS4ERR_DELEG_REVOKED. It also may find out about delegation NFS4ERR_DELEG_REVOKED. It also may find out about delegation
revocation after a client reboot when it attempts to reclaim a revocation after a client restart when it attempts to reclaim a
delegation and receives that same error. Note that in the case of a delegation and receives that same error. Note that in the case of a
revoked write open delegation, there are issues because data may have revoked write open delegation, there are issues because data may have
been modified by the client whose delegation is revoked and been modified by the client whose delegation is revoked and
separately by other clients. See Section 10.5.1 for a discussion of separately by other clients. See Section 10.5.1 for a discussion of
such issues. Note also that when delegations are revoked, such issues. Note also that when delegations are revoked,
information about the revoked delegation will be written by the information about the revoked delegation will be written by the
server to stable storage (as described in Section 8.4.3). This is server to stable storage (as described in Section 8.4.3). This is
done to deal with the case in which a server reboots after revoking a done to deal with the case in which a server restarts after revoking
delegation but before the client holding the revoked delegation is a delegation but before the client holding the revoked delegation is
notified about the revocation. notified about the revocation.
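A client-side check for the three revocation errors named above might look like the following sketch. The numeric values are assumed from the published NFSv4.1 XDR and should be verified against the revision in use. The function name is hypothetical.

```python
# Assumed status values; confirm against the protocol's XDR description.
NFS4ERR_EXPIRED = 10011
NFS4ERR_ADMIN_REVOKED = 10047
NFS4ERR_DELEG_REVOKED = 10087

DELEGATION_REVOKED_ERRORS = frozenset(
    {NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, NFS4ERR_DELEG_REVOKED}
)


def delegation_was_revoked(status: int) -> bool:
    """True if a status returned for an operation that used a delegation
    stateid is one of the errors that signals revocation."""
    return status in DELEGATION_REVOKED_ERRORS
```

On a True result the client would discard the delegation and, for a write open delegation, handle locally modified data as discussed in Section 10.5.1.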
10.3. Data Caching 10.3. Data Caching
When applications share access to a set of files, they need to be When applications share access to a set of files, they need to be
implemented so as to take account of the possibility of conflicting implemented so as to take account of the possibility of conflicting
access by another application. This is true whether the applications access by another application. This is true whether the applications
in question execute on different clients or reside on the same in question execute on different clients or reside on the same
client. client.
skipping to change at page 187, line 17 skipping to change at page 188, line 6
cache validation logic of time_modify and not change runs the risk cache validation logic of time_modify and not change runs the risk
of the client incorrectly marking stale data as valid. of the client incorrectly marking stale data as valid.
o Second, modified data must be flushed to the server before closing o Second, modified data must be flushed to the server before closing
a file OPENed for write. This is complementary to the first rule. a file OPENed for write. This is complementary to the first rule.
If the data is not flushed at CLOSE, the revalidation done after If the data is not flushed at CLOSE, the revalidation done after
client OPENs a file is unable to achieve its purpose. The other client OPENs a file is unable to achieve its purpose. The other
aspect to flushing the data before close is that the data must be aspect to flushing the data before close is that the data must be
committed to stable storage, at the server, before the CLOSE committed to stable storage, at the server, before the CLOSE
operation is requested by the client. In the case of a server operation is requested by the client. In the case of a server
reboot or restart and a CLOSEd file, it may not be possible to restart and a CLOSEd file, it may not be possible to retransmit
retransmit the data to be written to the file. Hence, this the data to be written to the file. Hence, this requirement.
requirement.
10.3.2. Data Caching and File Locking 10.3.2. Data Caching and File Locking
For those applications that choose to use file locking instead of For those applications that choose to use file locking instead of
share reservations to exclude inconsistent file access, there is an share reservations to exclude inconsistent file access, there is an
analogous set of constraints that apply to client side data caching. analogous set of constraints that apply to client side data caching.
These rules are effective only if the file locking is used in a way These rules are effective only if the file locking is used in a way
that matches in an equivalent way the actual READ and WRITE that matches in an equivalent way the actual READ and WRITE
operations executed. This is as opposed to file locking that is operations executed. This is as opposed to file locking that is
based on pure convention. For example, it is possible to manipulate based on pure convention. For example, it is possible to manipulate
skipping to change at page 188, line 29 skipping to change at page 189, line 16
a locked area which is not an integral number of full buffer blocks a locked area which is not an integral number of full buffer blocks
would require the client to read one or two partial blocks from the would require the client to read one or two partial blocks from the
server if the revalidation procedure shows that the data which the server if the revalidation procedure shows that the data which the
client possesses may not be valid. client possesses may not be valid.
The data that is written to the server as a prerequisite to the The data that is written to the server as a prerequisite to the
unlocking of a region must be written, at the server, to stable unlocking of a region must be written, at the server, to stable
storage. The client may accomplish this either with synchronous storage. The client may accomplish this either with synchronous
writes or by following asynchronous writes with a COMMIT operation. writes or by following asynchronous writes with a COMMIT operation.
This is required because retransmission of the modified data after a This is required because retransmission of the modified data after a
server reboot might conflict with a lock held by another client. server restart might conflict with a lock held by another client.
A client implementation may choose to accommodate applications which A client implementation may choose to accommodate applications which
use record locking in non-standard ways (e.g. using a record lock as use record locking in non-standard ways (e.g. using a record lock as
a global semaphore) by flushing to the server more data upon a LOCKU a global semaphore) by flushing to the server more data upon a LOCKU
than is covered by the locked range. This may include modified data than is covered by the locked range. This may include modified data
within files other than the one for which the unlocks are being done. within files other than the one for which the unlocks are being done.
In such cases, the client must not interfere with applications whose In such cases, the client must not interfere with applications whose
READs and WRITEs are being done only within the bounds of record READs and WRITEs are being done only within the bounds of record
locks which the application holds. For example, an application locks locks which the application holds. For example, an application locks
a single byte of a file and proceeds to write that single byte. A a single byte of a file and proceeds to write that single byte. A
skipping to change at page 196, line 34 skipping to change at page 197, line 25
storage is OPTIONAL. storage is OPTIONAL.
As discussed earlier in this section, the client MAY return the same As discussed earlier in this section, the client MAY return the same
cc value on subsequent CB_GETATTR calls, even if the file was cc value on subsequent CB_GETATTR calls, even if the file was
modified in the client's cache yet again between successive modified in the client's cache yet again between successive
CB_GETATTR calls. Therefore, the server must assume that the file CB_GETATTR calls. Therefore, the server must assume that the file
has been modified yet again, and MUST take care to ensure that the has been modified yet again, and MUST take care to ensure that the
new nsc it constructs and returns is greater than the previous nsc it new nsc it constructs and returns is greater than the previous nsc it
returned. An example implementation's delegation record would returned. An example implementation's delegation record would
satisfy this mandate by including a boolean field (let us call it satisfy this mandate by including a boolean field (let us call it
"modified") that is set to false when the delegation is granted, and "modified") that is set to FALSE when the delegation is granted, and
an sc value set at the time of grant to the change attribute value. an sc value set at the time of grant to the change attribute value.
The modified field would be set to true the first time cc != sc, and The modified field would be set to true the first time cc != sc, and
would stay true until the delegation is returned or revoked. The would stay true until the delegation is returned or revoked. The
processing for constructing nsc, time_modify, and time_metadata would processing for constructing nsc, time_modify, and time_metadata would
use this pseudo code: use this pseudo code:
if (!modified) { if (!modified) {
do CB_GETATTR for change and size; do CB_GETATTR for change and size;
if (cc != sc) if (cc != sc)
skipping to change at page 217, line 45 skipping to change at page 218, line 30
The fs_locations and fs_locations_info attributes provide alternative The fs_locations and fs_locations_info attributes provide alternative
locations, to be used to access data in place of or in addition to locations, to be used to access data in place of or in addition to
the current file system instance. On first access to a file system, the current file system instance. On first access to a file system,
the client should obtain the value of the set of alternate locations the client should obtain the value of the set of alternate locations
by interrogating the fs_locations or fs_locations_info attribute, by interrogating the fs_locations or fs_locations_info attribute,
with the latter being preferred. with the latter being preferred.
In the event that server failures, communications problems, or other In the event that server failures, communications problems, or other
difficulties make continued access to the current file system difficulties make continued access to the current file system
impossible or otherwise impractical, the client can use the alternate impossible or otherwise impractical, the client can use the alternate
locations as a way to get continued access to his data. Depending on locations as a way to get continued access to its data. Depending on
specific attributes of these alternate locations, as indicated within specific attributes of these alternate locations, as indicated within
the fs_locations_info attribute, multiple locations may be used the fs_locations_info attribute, multiple locations may be used
simultaneously, to provide higher performance through the simultaneously, to provide higher performance through the
exploitation of multiple paths between client and target file system. exploitation of multiple paths between client and target file system.
The alternate locations may be physical replicas of the (typically The alternate locations may be physical replicas of the (typically
read-only) file system data, or they may reflect alternate paths to read-only) file system data, or they may reflect alternate paths to
the same server or provide for the use of various forms of server the same server or provide for the use of various forms of server
clustering in which multiple servers provide alternate ways of clustering in which multiple servers provide alternate ways of
accessing the same physical file system. How these different modes accessing the same physical file system. How these different modes
skipping to change at page 226, line 26 skipping to change at page 227, line 9
locations. locations.
o Change attribute values are consistent across the transition and o Change attribute values are consistent across the transition and
do not have to be refetched. When change attributes indicate that do not have to be refetched. When change attributes indicate that
a cached object is still valid, it can remain cached. a cached object is still valid, it can remain cached.
o Client and state identifiers retain their validity across the o Client and state identifiers retain their validity across the
transition, except where their staleness is recognized and transition, except where their staleness is recognized and
reported by the new server. Except where such staleness requires reported by the new server. Except where such staleness requires
it, no lock reclamation is needed. Any such staleness is an it, no lock reclamation is needed. Any such staleness is an
indication that the server should be considered to have rebooted indication that the server should be considered to have restarted
and is reported as discussed in Section 8.4.2. and is reported as discussed in Section 8.4.2.
o Write verifiers are presumed to retain their validity and can be o Write verifiers are presumed to retain their validity and can be
used to compare with verifiers returned by COMMIT on the new used to compare with verifiers returned by COMMIT on the new
server, with the expectation that if COMMIT on the new server server, with the expectation that if COMMIT on the new server
returns an identical verifier, then that server has all of the returns an identical verifier, then that server has all of the
data unstably written to the original server and has committed it data unstably written to the original server and has committed it
to stable storage as requested. to stable storage as requested.
o Readdir cookies are presumed to retain their validity and can be o Readdir cookies are presumed to retain their validity and can be
skipping to change at page 230, line 22 skipping to change at page 231, line 5
transferring all state, the client can attempt to establish sessions transferring all state, the client can attempt to establish sessions
associated with the client ID used for the source file system associated with the client ID used for the source file system
instance. If the server accepts that as a valid client ID, then the instance. If the server accepts that as a valid client ID, then the
client may use the existing stateids associated with that client ID client may use the existing stateids associated with that client ID
for the old file system instance in connection with that same client for the old file system instance in connection with that same client
ID in connection with the transitioned file system instance. ID in connection with the transitioned file system instance.
When the two servers belong to the same server scope, it does not When the two servers belong to the same server scope, it does not
mean that when dealing with the transition, the client will not have mean that when dealing with the transition, the client will not have
to reclaim state. However it does mean that the client may proceed to reclaim state. However it does mean that the client may proceed
using his current client ID when establishing communication with the using its current client ID when establishing communication with the
new server and the new server will either recognize the client ID as new server and the new server will either recognize the client ID as
valid, or reject it, in which case locks must be reclaimed by the valid, or reject it, in which case locks must be reclaimed by the
client. client.
File systems co-operating in state management may actually share File systems co-operating in state management may actually share
state or simply divide the id space so as to recognize (and reject as state or simply divide the id space so as to recognize (and reject as
stale) each other's stateids and client IDs. Servers which do share stale) each other's stateids and client IDs. Servers which do share
state may not do so under all conditions or at all times. The state may not do so under all conditions or at all times. The
requirement for the server is that if it cannot be sure in accepting requirement for the server is that if it cannot be sure in accepting
a client ID that it reflects the locks the client was given, it must a client ID that it reflects the locks the client was given, it must
skipping to change at page 231, line 5 skipping to change at page 231, line 36
In either case, when actual locks are not known to be maintained, the In either case, when actual locks are not known to be maintained, the
destination server may establish a grace period specific to the given destination server may establish a grace period specific to the given
file system, with non-reclaim locks being rejected for that file file system, with non-reclaim locks being rejected for that file
system, even though normal locks are being granted for other file system, even though normal locks are being granted for other file
systems. Clients should not infer the absence of a grace period for systems. Clients should not infer the absence of a grace period for
file systems being transitioned to a server from responses to file systems being transitioned to a server from responses to
requests for other file systems. requests for other file systems.
In the case of lock reclamation for a given file system after a file In the case of lock reclamation for a given file system after a file
system transition, edge conditions can arise similar to those for system transition, edge conditions can arise similar to those for
reclaim after server reboot (although in the case of the planned reclaim after server restart (although in the case of the planned
state transfer associated with migration, these can be avoided by state transfer associated with migration, these can be avoided by
securely recording lock state as part of state migration). Unless securely recording lock state as part of state migration). Unless
the destination server can guarantee that locks will not be the destination server can guarantee that locks will not be
incorrectly granted, the destination server should not allow lock incorrectly granted, the destination server should not allow lock
reclaims and should avoid establishing a grace period. reclaims and should avoid establishing a grace period.
Once all locks have been reclaimed, or there were no locks to Once all locks have been reclaimed, or there were no locks to
reclaim, the client indicates that there are no more reclaims to be reclaim, the client indicates that there are no more reclaims to be
done for the file system in question by issuing a RECLAIM_COMPLETE done for the file system in question by issuing a RECLAIM_COMPLETE
operation with the one_fs parameter set to true. Once this has been operation with the rca_one_fs parameter set to true. Once this has
done, non-reclaim locking operations may be done, and any subsequent been done, non-reclaim locking operations may be done, and any
request to do reclaims will be rejected with the error subsequent request to do reclaims will be rejected with the error
NFS4ERR_NO_GRACE. NFS4ERR_NO_GRACE.
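The per-file-system sequence described above can be modeled as a small state machine. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, it tracks a single client and a single file system, and it assumes that a non-reclaim lock request made before RECLAIM_COMPLETE is rejected with NFS4ERR_GRACE. Error values are assumed from the published XDR.

```python
NFS4_OK = 0
NFS4ERR_GRACE = 10013     # assumed value
NFS4ERR_NO_GRACE = 10033  # assumed value


class FsReclaimState:
    """Reclaim state for one client on one transitioned file system."""

    def __init__(self):
        self.reclaim_done = False

    def reclaim_complete(self, rca_one_fs: bool) -> int:
        # Only the per-file-system form (rca_one_fs true) is modeled here.
        if rca_one_fs:
            self.reclaim_done = True
        return NFS4_OK

    def lock_request(self, is_reclaim: bool) -> int:
        if is_reclaim:
            # Reclaims are accepted only until RECLAIM_COMPLETE is sent.
            return NFS4ERR_NO_GRACE if self.reclaim_done else NFS4_OK
        # Non-reclaim locking is held off until reclaim has finished.
        return NFS4_OK if self.reclaim_done else NFS4ERR_GRACE
```

For example, a client that reclaims one lock and then sends RECLAIM_COMPLETE with rca_one_fs set to true would see later reclaim attempts fail, while ordinary LOCK requests start to succeed.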
Information about client identity may be propagated between servers Information about client identity may be propagated between servers
in the form of client_owner4 and associated verifiers, under the in the form of client_owner4 and associated verifiers, under the
assumption that the client presents the same values to all the assumption that the client presents the same values to all the
servers with which it deals. servers with which it deals.
Servers are encouraged to provide facilities to allow locks to be Servers are encouraged to provide facilities to allow locks to be
reclaimed on the new server after a file system transition. Often, reclaimed on the new server after a file system transition. Often,
however, in cases in which the two servers do not share a server however, in cases in which the two servers do not share a server
scope value, such facilities may not be available and the client should scope value, such facilities may not be available and the client should
be prepared to re-obtain locks, even though it is possible that the be prepared to re-obtain locks, even though it is possible that the
client may have his LOCK or OPEN request denied due to a conflicting client may have its LOCK or OPEN request denied due to a conflicting
lock. lock.
The consequences of having no facilities available to reclaim locks The consequences of having no facilities available to reclaim locks
on the new server will depend on the type of environment. In some on the new server will depend on the type of environment. In some
environments, such as the transition between read-only file systems, environments, such as the transition between read-only file systems,
such denial of locks should not pose large difficulties in practice. such denial of locks should not pose large difficulties in practice.
When an attempt to re-establish a lock on a new server is denied, the When an attempt to re-establish a lock on a new server is denied, the
client should treat the situation as if his original lock had been client should treat the situation as if its original lock had been
revoked. Note that when the lock is granted, the client cannot revoked. Note that when the lock is granted, the client cannot
assume that no conflicting lock could have been granted in the assume that no conflicting lock could have been granted in the
interim. Where change attribute continuity is present, the client interim. Where change attribute continuity is present, the client
may examine the change attribute to check for unwanted file may examine the change attribute to check for unwanted file
modifications. Where even this is not available, and the file system modifications. Where even this is not available, and the file system
is not read-only, a client may reasonably treat all pending locks as is not read-only, a client may reasonably treat all pending locks as
having been revoked. having been revoked.
11.7.7.1. Leases and File System Transitions 11.7.7.1. Leases and File System Transitions
skipping to change at page 236, line 51 skipping to change at page 237, line 34
exception for GETFH. exception for GETFH.
o GETATTR fsid,fileid,size,time_modify. Not executed because the o GETATTR fsid,fileid,size,time_modify. Not executed because the
failure of the GETFH stops processing of the COMPOUND. failure of the GETFH stops processing of the COMPOUND.
Given the failure of the GETFH, the client has the job of determining Given the failure of the GETFH, the client has the job of determining
the root of the absent file system and where to find that file the root of the absent file system and where to find that file
system, i.e. the server and path relative to that server's root fh. system, i.e. the server and path relative to that server's root fh.
Note here that in this example, the client did not obtain filehandles Note here that in this example, the client did not obtain filehandles
and attribute information (e.g. fsid) for the intermediate and attribute information (e.g. fsid) for the intermediate
directories, so that he would not be sure where the absent file directories, so that it would not be sure where the absent file
system starts. It could be the case, for example, that /this/is/the system starts. It could be the case, for example, that /this/is/the
is the root of the moved file system and that the reason that the is the root of the moved file system and that the reason that the
lookup of "path" succeeded is that the file system was not absent on lookup of "path" succeeded is that the file system was not absent on
that op but was moved between the last LOOKUP and the GETFH (since that op but was moved between the last LOOKUP and the GETFH (since
COMPOUND is not atomic). Even if we had the fsids for all of the COMPOUND is not atomic). Even if we had the fsids for all of the
intermediate directories, we could have no way of knowing that /this/ intermediate directories, we could have no way of knowing that /this/
is/the/path was the root of a new file system, since we don't yet is/the/path was the root of a new file system, since we don't yet
have its fsid. have its fsid.
In order to get the necessary information, let us re-send the chain In order to get the necessary information, let us re-send the chain
skipping to change at page 249, line 21 skipping to change at page 249, line 52
o FSLI4GF_WRITABLE indicates that this file system target is o FSLI4GF_WRITABLE indicates that this file system target is
writable, allowing it to be selected by clients which may need to writable, allowing it to be selected by clients which may need to
write on this file system. When the current file system instance write on this file system. When the current file system instance
is writable, and is defined as of the same simultaneous use class is writable, and is defined as of the same simultaneous use class
(as specified by the value at index FSLI4BX_CLSIMUL) to which the (as specified by the value at index FSLI4BX_CLSIMUL) to which the
client was previously writing, then it must incorporate within its client was previously writing, then it must incorporate within its
data any committed write made on the source file system instance. data any committed write made on the source file system instance.
See Section 11.7.8 which discusses the write-verifier class. See Section 11.7.8 which discusses the write-verifier class.
While there is no harm in not setting this flag for a file system While there is no harm in not setting this flag for a file system
that turns out to be writable, turning the flag on for a read-only that turns out to be writable, turning the flag on for a read-only
file system can cause problems for clients who select a migration file system can cause problems for clients which select a
or replication target based on it and then find themselves unable migration or replication target based on it and then find
to write. themselves unable to write.
o FSLI4GF_CUR_REQ indicates that this replica is the one on which o FSLI4GF_CUR_REQ indicates that this replica is the one on which
the request is being made. Only a single server entry may have the request is being made. Only a single server entry may have
this flag set and in the case of a referral, no entry will have this flag set and in the case of a referral, no entry will have
it. it.
o FSLI4GF_ABSENT indicates that this entry corresponds to an absent o FSLI4GF_ABSENT indicates that this entry corresponds to an absent
file system replica. It can only be set if FSLI4GF_CUR_REQ is file system replica. It can only be set if FSLI4GF_CUR_REQ is
set. When both such bits are set it indicates that a file system set. When both such bits are set it indicates that a file system
instance is not usable but that the information in the entry can instance is not usable but that the information in the entry can
skipping to change at page 250, line 40 skipping to change at page 251, line 23
boundaries. Note that in the event of a referral, there will not boundaries. Note that in the event of a referral, there will not
be any such files and so these actions will not be performed. be any such files and so these actions will not be performed.
Instead, reference to portions of the original file system split Instead, reference to portions of the original file system split
off into other file systems will encounter an fsid change and possibly a off into other file systems will encounter an fsid change and possibly a
further referral. further referral.
Once the client recognizes that one file system has been split Once the client recognizes that one file system has been split
into two, it could maintain applications running without into two, it could maintain applications running without
disruption by presenting the two file systems as a single one disruption by presenting the two file systems as a single one
until a convenient point to recognize the transition, such as a until a convenient point to recognize the transition, such as a
reboot. This would require a mapping of fsids from the server's restart. This would require a mapping of fsids from the server's
fsids to fsids as seen by the client but this is already necessary fsids to fsids as seen by the client but this is already necessary
for other reasons. As noted above, existing fileids within the for other reasons. As noted above, existing fileids within the
two descendant file systems will not conflict. Providing two descendant file systems will not conflict. Providing
non-conflicting fileids for newly-created files on the non-conflicting fileids for newly-created files on the
split file systems is the responsibility of the server (or servers split file systems is the responsibility of the server (or servers
working in concert). Note that filehandles could be different for working in concert). Note that filehandles could be different for
file systems that took part in the split from those newly file systems that took part in the split from those newly
accessed, allowing the server to determine when the need for such accessed, allowing the server to determine when the need for such
treatment is over. treatment is over.
skipping to change at page 268, line 20 skipping to change at page 269, line 20
the server supports and the client is prepared to use. The layout the server supports and the client is prepared to use. The layout
returned to the client may not exactly align with the requested byte returned to the client may not exactly align with the requested byte
range. A field within the LAYOUTGET request, loga_minlength, range. A field within the LAYOUTGET request, loga_minlength,
specifies the minimum length of the layout. The loga_minlength field specifies the minimum length of the layout. The loga_minlength field
should be at least one. As needed a client may make multiple should be at least one. As needed a client may make multiple
LAYOUTGET requests; these will result in multiple overlapping, non- LAYOUTGET requests; these will result in multiple overlapping, non-
conflicting layouts. conflicting layouts.
In order to get a layout, the client must first have opened the file In order to get a layout, the client must first have opened the file
via the OPEN operation. When a client has no layout on a file, it via the OPEN operation. When a client has no layout on a file, it
presents a stateid as returned by OPEN, a delegation stateid, or a MUST present a stateid as returned by OPEN, a delegation stateid, or
byte-range lock stateid in the loga_stateid argument. A successful a byte-range lock stateid in the loga_stateid argument. A successful
LAYOUTGET result includes a layout stateid. The first successful LAYOUTGET result includes a layout stateid. The first successful
LAYOUTGET processed by the server using a non-layout stateid as an LAYOUTGET processed by the server using a non-layout stateid as an
argument MUST have the "seqid" field of the layout stateid in the argument MUST have the "seqid" field of the layout stateid in the
response set to one. Thereafter, the client uses a layout stateid response set to one. Thereafter, the client uses a layout stateid
(see Section 12.5.3) on future invocations of LAYOUTGET on the file, (see Section 12.5.3) on future invocations of LAYOUTGET on the file,
and the "seqid" MUST NOT ever be set to zero. Once the layout has and the "seqid" MUST NOT ever be set to zero. Once the layout has
been retrieved, it can be held across multiple OPEN and CLOSE been retrieved, it can be held across multiple OPEN and CLOSE
sequences. Therefore, a client may hold a layout for a file that is sequences. Therefore, a client may hold a layout for a file that is
not currently open by any user on the client. This allows for the not currently open by any user on the client. This allows for the
caching of layouts beyond CLOSE. caching of layouts beyond CLOSE.
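The seqid rules in the paragraph above can be illustrated with a minimal sketch. The text only states that the first layout stateid carries a seqid of one and that the seqid is never zero; the helper name `next_layout_seqid`, the increment-by-one step, and the 32-bit wraparound are assumptions made for illustration and are not taken from this excerpt.

```python
# Hypothetical helper: picks the seqid a server might place in the layout
# stateid it returns. Only "first seqid is 1" and "never 0" come from the
# text; the increment and 32-bit width are assumptions.
SEQID_MAX = 0xFFFFFFFF  # assumption: seqid is a 32-bit unsigned value


def next_layout_seqid(current):
    """Return the seqid for the next layout stateid.

    current is None when the client presented a non-layout stateid
    (open, delegation, or byte-range lock) and holds no layout yet.
    """
    if current is None:
        return 1  # first successful LAYOUTGET: seqid MUST be one
    nxt = (current + 1) & SEQID_MAX
    # On wraparound, skip zero so the seqid is never set to zero.
    return nxt if nxt != 0 else 1
```

A server following this sketch would hand out 1, 2, 3, and so on for a given file and client, and would step from the maximum value back to 1 rather than 0.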
skipping to change at page 270, line 10 skipping to change at page 271, line 10
CB_LAYOUTRECALL request. Simply seeing the result or the CB_LAYOUTRECALL request. Simply seeing the result or the
CB_LAYOUTRECALL request is not sufficient cause to use the seqid. CB_LAYOUTRECALL request is not sufficient cause to use the seqid.
For LAYOUTGET results, if the client is not using the forgetful model For LAYOUTGET results, if the client is not using the forgetful model
(Section 12.5.5.1), it MUST first update its record of what ranges of (Section 12.5.5.1), it MUST first update its record of what ranges of
the file's layout it has before using the seqid. For LAYOUTRETURN the file's layout it has before using the seqid. For LAYOUTRETURN
results, the client MUST delete the range from its record of what results, the client MUST delete the range from its record of what
ranges of the file's layout it had before using the seqid. For ranges of the file's layout it had before using the seqid. For
CB_LAYOUTRECALL arguments, the client MUST send a response to the CB_LAYOUTRECALL arguments, the client MUST send a response to the
recall before using the seqid. recall before using the seqid.
Once a client has no more layouts on a file, the layout stateid is no
longer valid, and MUST NOT be used. Any attempt to use such a layout
stateid will result in NFS4ERR_BAD_STATEID.
12.5.4. Committing a Layout 12.5.4. Committing a Layout
Allowing for varying storage protocol capabilities, the pNFS Allowing for varying storage protocol capabilities, the pNFS
protocol does not require the metadata server and storage devices to protocol does not require the metadata server and storage devices to
have a consistent view of file attributes and data location mappings. have a consistent view of file attributes and data location mappings.
Data location mapping refers to aspects such as which offsets store Data location mapping refers to aspects such as which offsets store
data as opposed to storing holes (see Section 13.4.4 for a data as opposed to storing holes (see Section 13.4.4 for a
discussion). Related issues arise for storage protocols where a discussion). Related issues arise for storage protocols where a
layout may hold provisionally allocated blocks where the allocation layout may hold provisionally allocated blocks where the allocation
of those blocks does not survive a complete restart of both the of those blocks does not survive a complete restart of both the
skipping to change at page 271, line 5 skipping to change at page 272, line 8
The control protocol is free to synchronize the attributes before it The control protocol is free to synchronize the attributes before it
receives a LAYOUTCOMMIT; however, upon successful completion of a receives a LAYOUTCOMMIT; however, upon successful completion of a
LAYOUTCOMMIT, state that exists on the metadata server that describes LAYOUTCOMMIT, state that exists on the metadata server that describes
the file MUST be in sync with the state existing on the storage the file MUST be in sync with the state existing on the storage
devices that comprise that file as of the issuing client's last devices that comprise that file as of the issuing client's last
operation. Thus, a client that queries the size of a file between a operation. Thus, a client that queries the size of a file between a
WRITE to a storage device and the LAYOUTCOMMIT may observe a size WRITE to a storage device and the LAYOUTCOMMIT may observe a size
that does not reflect the actual data written. that does not reflect the actual data written.
The client MUST have a layout in order to issue LAYOUTCOMMIT.
12.5.4.1. LAYOUTCOMMIT and change/time_modify 12.5.4.1. LAYOUTCOMMIT and change/time_modify
The change and time_modify attributes may be updated by the server The change and time_modify attributes may be updated by the server
when the LAYOUTCOMMIT operation is processed. The reason for this is when the LAYOUTCOMMIT operation is processed. The reason for this is
that some layout types do not support the update of these attributes that some layout types do not support the update of these attributes
when the storage devices process I/O operations. The client is when the storage devices process I/O operations. If the client has a
capable providing a suggested value to the server for time_modify layout with the LAYOUTIOMODE4_RW iomode on the file, the client MAY
within the arguments to LAYOUTCOMMIT. Based on layout type, the provide a suggested value to the server for time_modify within the
provided value may or may not be used. The server should sanity arguments to LAYOUTCOMMIT. Based on the layout type, the provided
check the client provided values before they are used. For example, value may or may not be used. The server should sanity check the
the server should ensure that time does not flow backwards. The client provided values before they are used. For example, the server
client always has the option to set time_modify through an explicit should ensure that time does not flow backwards. The client always
SETATTR operation. has the option to set time_modify through an explicit SETATTR
operation.
For some layout protocols, the storage device is able to notify the For some layout protocols, the storage device is able to notify the
metadata server of the occurrence of an I/O and as a result the metadata server of the occurrence of an I/O and as a result the
change and time_modify attributes may be updated at the metadata change and time_modify attributes may be updated at the metadata
server. For a metadata server that is capable of monitoring updates server. For a metadata server that is capable of monitoring updates
to the change and time_modify attributes, LAYOUTCOMMIT processing is to the change and time_modify attributes, LAYOUTCOMMIT processing is
not required to update the change attribute; in this case the not required to update the change attribute; in this case the
metadata server must ensure that no further update to the data has metadata server must ensure that no further update to the data has
occurred since the last update of the attributes; file-based occurred since the last update of the attributes; file-based
protocols may have enough information to make this determination or protocols may have enough information to make this determination or
skipping to change at page 271, line 45 skipping to change at page 272, line 51
12.5.4.2. LAYOUTCOMMIT and size 12.5.4.2. LAYOUTCOMMIT and size
The size of a file may be updated when the LAYOUTCOMMIT operation is The size of a file may be updated when the LAYOUTCOMMIT operation is
used by the client. One of the fields in the argument to used by the client. One of the fields in the argument to
LAYOUTCOMMIT is loca_last_write_offset; this field indicates the LAYOUTCOMMIT is loca_last_write_offset; this field indicates the
highest byte offset written but not yet committed with the highest byte offset written but not yet committed with the
LAYOUTCOMMIT operation. The data type of loca_last_write_offset is LAYOUTCOMMIT operation. The data type of loca_last_write_offset is
newoffset4 and is switched on a boolean value, no_newoffset, that newoffset4 and is switched on a boolean value, no_newoffset, that
indicates if a previous write occurred or not. If no_newoffset is indicates if a previous write occurred or not. If no_newoffset is
FALSE, an offset is not given. A loca_last_write_offset value of FALSE, an offset is not given. If the client has a layout with
zero means that one byte was written at offset zero. LAYOUTIOMODE4_RW iomode on the file, with an lo_offset and lo_length
that overlaps loca_last_write_offset, then the client MAY set
no_newoffset to TRUE and provide an offset that will update the file
size. Keep in mind that offset is not the same as length, though
they are related. For example, a loca_last_write_offset value of
zero means that one byte was written at offset zero, and so the
length of the file is at least one byte.
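The offset-versus-length distinction can be made concrete with a small sketch. The function name and the use of `None` for "no offset given" are illustrative assumptions; only the relationship between loca_last_write_offset and the minimum file length comes from the text.

```python
def implied_minimum_length(no_newoffset, last_write_offset=None):
    """Return the smallest file length implied by a LAYOUTCOMMIT argument.

    no_newoffset mirrors the boolean switch of newoffset4: when it is
    False, no offset is supplied and nothing can be inferred (None).
    """
    if not no_newoffset:
        return None
    if last_write_offset is None or last_write_offset < 0:
        raise ValueError("an offset >= 0 is required when no_newoffset is TRUE")
    # The byte at offset N is the (N + 1)-th byte of the file.
    return last_write_offset + 1
```

With this sketch, an offset of zero yields a minimum length of one byte, matching the example in the text.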
The metadata server may do one of the following: The metadata server may do one of the following:
1. Update the file's size using the last write offset provided by 1. Update the file's size using the last write offset provided by
the client as either the true file size or as a hint of the file the client as either the true file size or as a hint of the file
size. If the metadata server has a method available, any new size. If the metadata server has a method available, any new
value for file size should be sanity checked. For example, the value for file size should be sanity checked. For example, the
file must not be truncated if the client presents a last write file must not be truncated if the client presents a last write
offset less than the file's current size. offset less than the file's current size.
skipping to change at page 281, line 46 skipping to change at page 283, line 11
LAYOUTCOMMIT to commit the modification time and the new size of the LAYOUTCOMMIT to commit the modification time and the new size of the
file (if it believes it extended the file size) to the metadata file (if it believes it extended the file size) to the metadata
server and the modified data to the file system. server and the modified data to the file system.
12.7. Recovery 12.7. Recovery
Recovery is complicated by the distributed nature of the pNFS Recovery is complicated by the distributed nature of the pNFS
protocol. In general, crash recovery for layouts is similar to crash protocol. In general, crash recovery for layouts is similar to crash
recovery for delegations in the base NFSv4.1 protocol. However, the recovery for delegations in the base NFSv4.1 protocol. However, the
client's ability to perform I/O without contacting the metadata client's ability to perform I/O without contacting the metadata
server subtleties that must be handled correctly if the possibility server introduces subtleties that must be handled correctly if the
of file system corruption is to be avoided. [[Comment.4: mre: possibility of file system corruption is to be avoided.
layouts are bound to stateids]]
12.7.1. Recovery from Client Restart 12.7.1. Recovery from Client Restart
Client recovery for layouts is similar to client recovery for other Client recovery for layouts is similar to client recovery for other
lock and delegation state. When a pNFS client restarts, it will lock and delegation state. When a pNFS client restarts, it will
lose all information about the layouts that it previously owned. lose all information about the layouts that it previously owned.
There are two methods by which the server can reclaim these resources There are two methods by which the server can reclaim these resources
and allow otherwise conflicting layouts to be provided to other and allow otherwise conflicting layouts to be provided to other
clients. clients.
skipping to change at page 284, line 8 skipping to change at page 285, line 14
fencing the client. In other words, prevent the execution of I/O fencing the client. In other words, prevent the execution of I/O
operations from the client to the storage devices after layout state operations from the client to the storage devices after layout state
loss. The details of how fencing is done are specific to the layout loss. The details of how fencing is done are specific to the layout
type. The solution for NFSv4.1 file-based layouts is described in type. The solution for NFSv4.1 file-based layouts is described in
(Section 13.11), and for other layout types in their respective (Section 13.11), and for other layout types in their respective
external specification documents. external specification documents.
12.7.4. Recovery from Metadata Server Restart 12.7.4. Recovery from Metadata Server Restart
The pNFS client will discover that the metadata server has restarted The pNFS client will discover that the metadata server has restarted
(e.g. rebooted) via the methods described in Section 8.4.2 and (e.g. restarted) via the methods described in Section 8.4.2 and
discussed in a pNFS-specific context in Paragraph 2 of discussed in a pNFS-specific context in Paragraph 2 of
Section 12.7.2. The client MUST stop using layouts and delete the Section 12.7.2. The client MUST stop using layouts and delete the
device ID to device address mappings it previously received from the device ID to device address mappings it previously received from the
metadata server. Having done that, if the client wrote data to the metadata server. Having done that, if the client wrote data to the
storage device without committing the layouts via LAYOUTCOMMIT, then storage device without committing the layouts via LAYOUTCOMMIT, then
the client has additional work to do in order to have the client, the client has additional work to do in order to have the client,
metadata server and storage device(s) all synchronized on the state metadata server and storage device(s) all synchronized on the state
of the data. of the data.
o If the client has data still modified and unwritten in the o If the client has data still modified and unwritten in the
skipping to change at page 286, line 38 skipping to change at page 287, line 46
effect at the time of restart, and use this information during the effect at the time of restart, and use this information during the
recovery grace period to determine that a WRITE request is safe. recovery grace period to determine that a WRITE request is safe.
12.7.6. Storage Device Recovery 12.7.6. Storage Device Recovery
Recovery from storage device restart is mostly dependent upon the Recovery from storage device restart is mostly dependent upon the
layout type in use. However, there are a few general techniques a layout type in use. However, there are a few general techniques a
client can use if it discovers a storage device has crashed while client can use if it discovers a storage device has crashed while
holding modified, uncommitted data that was asynchronously written. holding modified, uncommitted data that was asynchronously written.
First and foremost, it is important to realize that the client is the First and foremost, it is important to realize that the client is the
only one who has the information necessary to recover non-committed only one which has the information necessary to recover non-committed
data, since it holds the modified data and probably nothing else data, since it holds the modified data and probably nothing else
does. Second, the best solution is for the client to err on the side does. Second, the best solution is for the client to err on the side
of caution and attempt to re-write the modified data through another of caution and attempt to re-write the modified data through another
path. path.
The client SHOULD immediately write the data to the metadata server, The client SHOULD immediately write the data to the metadata server,
with the stable field in the WRITE4args set to FILE_SYNC4. Once it with the stable field in the WRITE4args set to FILE_SYNC4. Once it
does this, there is no need to wait for the original storage device. does this, there is no need to wait for the original storage device.
12.8. Metadata and Storage Device Roles 12.8. Metadata and Storage Device Roles
skipping to change at page 290, line 39 skipping to change at page 291, line 45
If a server is both a metadata server and a data server, the server If a server is both a metadata server and a data server, the server
might need to distinguish operations on files that are directed to might need to distinguish operations on files that are directed to
the metadata server from those that are directed to the data server. the metadata server from those that are directed to the data server.
It is RECOMMENDED that the values of the filehandles returned by the It is RECOMMENDED that the values of the filehandles returned by the
LAYOUTGET operation be different than the value of the filehandle LAYOUTGET operation be different than the value of the filehandle
returned by the OPEN of the same file. returned by the OPEN of the same file.
Another scenario is for the metadata server and the storage device to Another scenario is for the metadata server and the storage device to
be distinct from one client's point of view, and the roles reversed be distinct from one client's point of view, and the roles reversed
from another client's point of view. For example, in the cluster from another client's point of view. For example, in the cluster
file system model a metadata server to one client, may be a data file system model, a metadata server to one client may be a data
server to another client. If NFSv4.1 is being used as the storage server to another client. If NFSv4.1 is being used as the storage
protocol, then pNFS servers need to encode the values of filehandles protocol, then pNFS servers need to encode the values of filehandles
according to their specific roles. according to their specific roles.
13.1.1. Sessions Considerations for Data Servers
Section 2.10.9.2 states that a client has to keep its lease renewed
in order to prevent a session from being deleted by the server. If
the reply to EXCHANGE_ID has just the EXCHGID4_FLAG_USE_PNFS_DS role
set, then as noted in Section 13.6 the client will not be able to
determine the data server's lease_time attribute, because GETATTR
will not be permitted. Instead, the rule is that any time a client
receives a layout referring it to a data server that returns just the
EXCHGID4_FLAG_USE_PNFS_DS role, the client MAY assume that the
lease_time attribute from the metadata server that returned the
layout applies to the data server. Thus the data server MUST be
aware of the values of all lease_time attributes of all metadata
servers it is providing I/O for, and MUST use the maximum of all such
lease_time values as the lease interval for all client IDs and
sessions established on it.
For example, if one metadata server has a lease_time attribute of 20
seconds, and a second metadata server has a lease_time attribute of
10 seconds, then if both servers return layouts that refer to an
EXCHGID4_FLAG_USE_PNFS_DS-only data server, the data server MUST
renew a client's lease if the interval between two SEQUENCE
operations on different COMPOUND requests is less than 20 seconds.
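The lease rule for a data server that returns only the EXCHGID4_FLAG_USE_PNFS_DS role can be sketched as follows. The function names and the representation of lease times as plain integers in seconds are assumptions for illustration; the rule itself (use the maximum lease_time of all metadata servers served) is the one stated above.

```python
def data_server_lease_interval(mds_lease_times):
    """Lease interval (seconds) a DS-only server applies to all client IDs
    and sessions: the maximum lease_time of the metadata servers it serves."""
    if not mds_lease_times:
        raise ValueError("at least one metadata server lease_time is required")
    return max(mds_lease_times)


def must_renew(seconds_between_sequence_ops, lease_interval):
    """True when the gap between two SEQUENCE operations is short enough
    that the data server MUST renew the client's lease."""
    return seconds_between_sequence_ops < lease_interval
```

For the example in the text, lease times of 20 and 10 seconds give an interval of 20 seconds, so a gap of 19 seconds between SEQUENCE operations requires renewal.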
13.2. File Layout Definitions 13.2. File Layout Definitions
The following definitions apply to the LAYOUT4_NFSV4_1_FILES layout The following definitions apply to the LAYOUT4_NFSV4_1_FILES layout
type, and may be applicable to other layout types. type, and may be applicable to other layout types.
Unit. A unit is a fixed size quantity of data written to a data Unit. A unit is a fixed size quantity of data written to a data
server. server.
Pattern. A pattern is a method of distributing one or more equal Pattern. A pattern is a method of distributing one or more equal
sized units across a set of data servers. A pattern is iterated sized units across a set of data servers. A pattern is iterated
skipping to change at page 292, line 41 skipping to change at page 294, line 18
then the client indicates in the second field, nflh_util, a then the client indicates in the second field, nflh_util, a
preference for how the data file is packed (Section 13.4.4), which is preference for how the data file is packed (Section 13.4.4), which is
controlled by the value of nflh_util & NFL4_UFLG_DENSE. If the controlled by the value of nflh_util & NFL4_UFLG_DENSE. If the
NFLH4_CARE_COMMIT_THRU_MDS flag is set, then the client indicates a NFLH4_CARE_COMMIT_THRU_MDS flag is set, then the client indicates a
preference for whether the client should send COMMIT operations to preference for whether the client should send COMMIT operations to
the metadata server or data server (Section 13.7), which is the metadata server or data server (Section 13.7), which is
controlled by the value of nflh_util & NFL4_UFLG_COMMIT_THRU_MDS. If controlled by the value of nflh_util & NFL4_UFLG_COMMIT_THRU_MDS. If
the NFLH4_CARE_STRIPE_UNIT_SIZE flag is set, the client indicates its the NFLH4_CARE_STRIPE_UNIT_SIZE flag is set, the client indicates its
preferred stripe unit size, which is indicated in nflh_util & preferred stripe unit size, which is indicated in nflh_util &
NFL4_UFLG_STRIPE_UNIT_SIZE_MASK (thus the stripe unit size MUST be a NFL4_UFLG_STRIPE_UNIT_SIZE_MASK (thus the stripe unit size MUST be a
multiple of 64 bytes). If the NFLH4_CARE_STRIPE_COUNT flag is set, multiple of 64 bytes). The minimum stripe unit size is 64 bytes. If
the client indicates in the third field, nflh_stripe_count, the the NFLH4_CARE_STRIPE_COUNT flag is set, the client indicates in the
stripe count. The stripe count multiplied by the stripe unit size is third field, nflh_stripe_count, the stripe count. The stripe count
the stripe width. multiplied by the stripe unit size is the stripe width.
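How the nflh_util hint is unpacked can be sketched as below. The numeric constant values are not given in this excerpt; they are assumptions chosen so that the stripe unit size mask clears the low six bits, which is what makes every stripe unit size a multiple of 64 bytes. The function names are likewise hypothetical.

```python
# Assumed constant values (not stated in this excerpt).
NFL4_UFLG_DENSE = 0x00000001
NFL4_UFLG_COMMIT_THRU_MDS = 0x00000002
NFL4_UFLG_STRIPE_UNIT_SIZE_MASK = 0xFFFFFFC0  # low 6 bits cleared -> multiple of 64


def decode_nflh_util(nflh_util):
    """Split the nflh_util word into the three preferences it carries."""
    return {
        "dense": bool(nflh_util & NFL4_UFLG_DENSE),
        "commit_thru_mds": bool(nflh_util & NFL4_UFLG_COMMIT_THRU_MDS),
        "stripe_unit_size": nflh_util & NFL4_UFLG_STRIPE_UNIT_SIZE_MASK,
    }


def stripe_width(stripe_count, stripe_unit_size):
    """Stripe width is the stripe count multiplied by the stripe unit size."""
    return stripe_count * stripe_unit_size
```

Under these assumptions, a value of 0x10001 decodes to dense packing with a 65536-byte stripe unit, and four such units give a stripe width of 262144 bytes.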
When LAYOUTGET returns a LAYOUT4_NFSV4_1_FILES layout (indicated in When LAYOUTGET returns a LAYOUT4_NFSV4_1_FILES layout (indicated in
the loc_type field of the lo_content field), the loc_body field of the loc_type field of the lo_content field), the loc_body field of
the lo_content field contains a value of data type the lo_content field contains a value of data type
nfsv4_1_file_layout4. Among other content, nfsv4_1_file_layout4 has nfsv4_1_file_layout4. Among other content, nfsv4_1_file_layout4 has
a storage device ID (field nfl_deviceid) of data type deviceid4. The a storage device ID (field nfl_deviceid) of data type deviceid4. The
GETDEVICEINFO operation maps a device ID to a storage device address GETDEVICEINFO operation maps a device ID to a storage device address
(type device_addr4). When GETDEVICEINFO returns a device address (type device_addr4). When GETDEVICEINFO returns a device address
with a layout type of LAYOUT4_NFSV4_1_FILES (the da_layout_type with a layout type of LAYOUT4_NFSV4_1_FILES (the da_layout_type
field), the da_addr_body field contains a value of data type field), the da_addr_body field contains a value of data type
skipping to change at page 304, line 20 skipping to change at page 306, line 20
personalities, each COMPOUND sent by the client MUST be constructed personalities, each COMPOUND sent by the client MUST be constructed
so that it is appropriate to one of the two personalities, and must so that it is appropriate to one of the two personalities, and must
not contain operations directed to a mix of those personalities. The not contain operations directed to a mix of those personalities. The
server MUST enforce this. To understand the constraints, operations server MUST enforce this. To understand the constraints, operations
within a COMPOUND are divided into the following three classes: within a COMPOUND are divided into the following three classes:
1. An operation which is ambiguous regarding its personality 1. An operation which is ambiguous regarding its personality
assignment. These include all of the data-server housekeeping assignment. These include all of the data-server housekeeping
operations. Additionally, if the server has assigned filehandles operations. Additionally, if the server has assigned filehandles
so that the ones defined by the layout are the same as those used so that the ones defined by the layout are the same as those used
by the meta-data server, all operations in the second class are by the metadata server, all operations in the second class are
within this group unless a stateid used is incompatible with a within this group unless a stateid used is incompatible with a
data-server personality in that it is a special stateid or has a data-server personality in that it is a special stateid or has a
non-zero seqid field. non-zero seqid field.
2. An operation which is referable to the data server personality. 2. An operation which is referable to the data server personality.
These are data-server I/O operations where the filehandle is one These are data-server I/O operations where the filehandle is one
that can only be validly directed to the data-server personality. that can only be validly directed to the data-server personality.
3. An operation which is referable to the non-data-server 3. An operation which is referable to the non-data-server
personality. These include all COMPOUND operations that are personality. These include all COMPOUND operations that are
skipping to change at page 305, line 41 skipping to change at page 307, line 41
has completed (see Section 12.5.4.2). Section 13.10 describes the has completed (see Section 12.5.4.2). Section 13.10 describes the
mechanism by which the client is to handle data server files that do mechanism by which the client is to handle data server files that do
not reflect the metadata server's size. not reflect the metadata server's size.
13.7. COMMIT Through Metadata Server 13.7. COMMIT Through Metadata Server
The file layout provides two alternate means of providing for the The file layout provides two alternate means of providing for the
commit of data written through data servers. The flag commit of data written through data servers. The flag
NFL4_UFLG_COMMIT_THRU_MDS in the field nfl_util of the file layout NFL4_UFLG_COMMIT_THRU_MDS in the field nfl_util of the file layout
(data type nfsv4_1_file_layout4) is an indication from the metadata (data type nfsv4_1_file_layout4) is an indication from the metadata
server to the client of the preferred way of performing COMMIT, server to the client of the REQUIRED way of performing COMMIT, either
either by sending the COMMIT to the data server or the metadata by sending the COMMIT to the data server or the metadata server.
server. These two methods of dealing with the issue correspond to These two methods of dealing with the issue correspond to broad
broad styles of implementation for a pNFS server supporting the files styles of implementation for a pNFS server supporting the files
layout type. layout type.
o When the flag is false, COMMIT operations are to be done to the o When the flag is FALSE, COMMIT operations MUST be sent to the
data server to which the corresponding writes were done. This data server to which the corresponding WRITE operations were sent.
approach is most useful when striping of files is implemented as This approach is most useful when striping of files is implemented
part of the pNFS server, with the individual data servers each as part of the pNFS server, with the individual data servers each
implementing their own file systems. implementing their own file systems.
o When the flag is true, COMMIT operations are done to the metadata o When the flag is TRUE, COMMIT operations MUST be sent to the
server, rather than to the individual data servers. This approach metadata server, rather than to the individual data servers. This
is most useful when the pNFS server is implemented on top of a approach is most useful when the pNFS server is implemented on top
clustered file system. In such an implementation, sending of a clustered file system. In such an implementation, sending
COMMITs to multiple data servers may result in repeated writes of COMMITs to multiple data servers may result in repeated writes of
metadata blocks as each individual COMMIT is executed, to the metadata blocks as each individual COMMIT is executed, to the
detriment of write performance. Sending a single COMMIT to the detriment of write performance. Sending a single COMMIT to the
metadata server can provide more efficiency when there exists a metadata server can provide more efficiency when there exists a
clustered file system capable of implementing such a co-ordinated clustered file system capable of implementing such a co-ordinated
COMMIT. COMMIT.
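The choice described by the two bullets above can be sketched as follows. This is a minimal illustration rather than a definitive client implementation. The numeric value given to NFL4_UFLG_COMMIT_THRU_MDS is an assumption, since the text above names the flag but does not state its value, and the function name commit_targets and the string server labels are hypothetical.

```python
# Assumed bit value for illustration; the text above names the flag
# but does not state its numeric value.
NFL4_UFLG_COMMIT_THRU_MDS = 0x00000002


def commit_targets(nfl_util, written_data_servers, metadata_server="mds"):
    """Return the servers to which a client would send COMMIT.

    nfl_util              -- the nfl_util field of the file layout
    written_data_servers  -- data servers that received WRITE operations
    metadata_server       -- hypothetical label for the metadata server
    """
    if nfl_util & NFL4_UFLG_COMMIT_THRU_MDS:
        # Flag set: a single COMMIT goes to the metadata server.
        return [metadata_server]
    # Flag clear: one COMMIT per data server that was written to,
    # with duplicates removed and first-seen order preserved.
    return list(dict.fromkeys(written_data_servers))
```

With the flag clear, a client that wrote to ds1, ds2, and ds1 again would send COMMIT to ds1 and ds2 only.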
If nfl_util & NFL4_UFLG_COMMIT_THRU_MDS is TRUE, then in order to If nfl_util & NFL4_UFLG_COMMIT_THRU_MDS is TRUE, then in order to
maintain the current NFSv4.1 commit and recovery model, the data maintain the current NFSv4.1 commit and recovery model, the data
servers MUST return a common writeverf verifier in all WRITE servers MUST return a common writeverf verifier in all WRITE
skipping to change at page 314, line 32 skipping to change at page 316, line 32
Table B.1 Table B.1
Table B.2 is normally not part of the nfs4_cs_prep profile as it is Table B.2 is normally not part of the nfs4_cs_prep profile as it is
primarily for dealing with case-insensitive comparisons. However, if primarily for dealing with case-insensitive comparisons. However, if
the NFSv4.1 file server supports the case_insensitive file system the NFSv4.1 file server supports the case_insensitive file system
attribute, and if case_insensitive is true, the NFSv4.1 server MUST attribute, and if case_insensitive is true, the NFSv4.1 server MUST
use Table B.2 (in addition to Table B.1) when processing utf8str_cs use Table B.2 (in addition to Table B.1) when processing utf8str_cs
strings, and the NFSv4.1 client MUST assume Table B.2 (in addition to strings, and the NFSv4.1 client MUST assume Table B.2 (in addition to
Table B.1) are being used. Table B.1) are being used.
If the case_preserving attribute is present and set to false, then If the case_preserving attribute is present and set to FALSE, then
the NFSv4.1 server MUST use Table B.2 to map case when processing the NFSv4.1 server MUST use Table B.2 to map case when processing
utf8str_cs strings. Whether the server maps from lower to upper case utf8str_cs strings. Whether the server maps from lower to upper case
or from upper to lower case is an implementation dependency. or from upper to lower case is an implementation dependency.
14.1.4. Normalization used by nfs4_cs_prep 14.1.4. Normalization used by nfs4_cs_prep
The nfs4_cs_prep profile does not specify a normalization form. A The nfs4_cs_prep profile does not specify a normalization form. A
later revision of this specification may specify a particular later revision of this specification may specify a particular
normalization form. Therefore, the server and client can expect that normalization form. Therefore, the server and client can expect that
they may receive unnormalized characters within protocol requests and they may receive unnormalized characters within protocol requests and
skipping to change at page 342, line 35 skipping to change at page 344, line 35
| GETFH | NFS4ERR_FHEXPIRED, NFS4ERR_MOVED, | | GETFH | NFS4ERR_FHEXPIRED, NFS4ERR_MOVED, |
| | NFS4ERR_NOFILEHANDLE, | | | NFS4ERR_NOFILEHANDLE, |
| | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_STALE | | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_STALE |
| ILLEGAL | NFS4ERR_BADXDR, NFS4ERR_OP_ILLEGAL | | ILLEGAL | NFS4ERR_BADXDR, NFS4ERR_OP_ILLEGAL |
| LAYOUTCOMMIT | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | | LAYOUTCOMMIT | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, |
| | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADIOMODE, | | | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADIOMODE, |
| | NFS4ERR_BADLAYOUT, NFS4ERR_BADXDR, | | | NFS4ERR_BADLAYOUT, NFS4ERR_BADXDR, |
| | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, |
| | NFS4ERR_EXPIRED, NFS4ERR_FBIG, | | | NFS4ERR_EXPIRED, NFS4ERR_FBIG, |
| | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, |
| | NFS4ERR_IO, NFS4ERR_ISDIR, NFS4ERR_MOVED, | | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_ISDIR, |
| | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, |
| | NFS4ERR_NO_GRACE, | | | NFS4ERR_NOTSUPP, NFS4ERR_NO_GRACE, |
| | NFS4ERR_OP_NOT_IN_SESSION, | | | NFS4ERR_OP_NOT_IN_SESSION, |
| | NFS4ERR_RECLAIM_BAD, | | | NFS4ERR_RECLAIM_BAD, |
| | NFS4ERR_RECLAIM_CONFLICT, | | | NFS4ERR_RECLAIM_CONFLICT, |
| | NFS4ERR_REP_TOO_BIG, | | | NFS4ERR_REP_TOO_BIG, |
| | NFS4ERR_REP_TOO_BIG_TO_CACHE, | | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
| | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, | | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_SERVERFAULT, |
| | NFS4ERR_STALE, NFS4ERR_SYMLINK, | | | NFS4ERR_STALE, NFS4ERR_SYMLINK, |
| | NFS4ERR_TOO_MANY_OPS, | | | NFS4ERR_TOO_MANY_OPS, |
| | NFS4ERR_UNKNOWN_LAYOUTTYPE, | | | NFS4ERR_UNKNOWN_LAYOUTTYPE, |
| | NFS4ERR_WRONG_CRED | | | NFS4ERR_WRONG_CRED |
skipping to change at page 361, line 38 skipping to change at page 363, line 38
| NFS4ERR_INVAL | ACCESS, BACKCHANNEL_CTL, | | NFS4ERR_INVAL | ACCESS, BACKCHANNEL_CTL, |
| | BIND_CONN_TO_SESSION, | | | BIND_CONN_TO_SESSION, |
| | CB_GETATTR, CB_LAYOUTRECALL, | | | CB_GETATTR, CB_LAYOUTRECALL, |
| | CB_NOTIFY, CB_PUSH_DELEG, | | | CB_NOTIFY, CB_PUSH_DELEG, |
| | CB_RECALLABLE_OBJ_AVAIL, | | | CB_RECALLABLE_OBJ_AVAIL, |
| | CB_RECALL_ANY, CREATE, | | | CB_RECALL_ANY, CREATE, |
| | CREATE_SESSION, DELEGRETURN, | | | CREATE_SESSION, DELEGRETURN, |
| | EXCHANGE_ID, GETATTR, | | | EXCHANGE_ID, GETATTR, |
| | GETDEVICEINFO, GETDEVICELIST, | | | GETDEVICEINFO, GETDEVICELIST, |
| | GET_DIR_DELEGATION, | | | GET_DIR_DELEGATION, |
| | LAYOUTGET, LAYOUTRETURN, | | | LAYOUTCOMMIT, LAYOUTGET, |
| | LINK, LOCK, LOCKT, LOCKU, | | | LAYOUTRETURN, LINK, LOCK, |
| | LOOKUP, NVERIFY, OPEN, | | | LOCKT, LOCKU, LOOKUP, |
| | NVERIFY, OPEN, |
| | OPEN_DOWNGRADE, READ, | | | OPEN_DOWNGRADE, READ, |
| | READDIR, READLINK, | | | READDIR, READLINK, |
| | RECLAIM_COMPLETE, REMOVE, | | | RECLAIM_COMPLETE, REMOVE, |
| | RENAME, SECINFO, | | | RENAME, SECINFO, |
| | SECINFO_NO_NAME, SETATTR, | | | SECINFO_NO_NAME, SETATTR, |
| | VERIFY, WANT_DELEGATION, | | | VERIFY, WANT_DELEGATION, |
| | WRITE | | | WRITE |
| NFS4ERR_IO | ACCESS, COMMIT, CREATE, | | NFS4ERR_IO | ACCESS, COMMIT, CREATE, |
| | GETATTR, GETDEVICELIST, | | | GETATTR, GETDEVICELIST, |
| | GET_DIR_DELEGATION, | | | GET_DIR_DELEGATION, |
skipping to change at page 389, line 33 skipping to change at page 391, line 33
stable field set to UNSTABLE4. stable field set to UNSTABLE4.
The offset specifies the position within the file where the flush is The offset specifies the position within the file where the flush is
to begin. An offset value of 0 (zero) means to flush data starting to begin. An offset value of 0 (zero) means to flush data starting
at the beginning of the file. The count specifies the number of at the beginning of the file. The count specifies the number of
bytes of data to flush. If count is 0 (zero), a flush from offset to bytes of data to flush. If count is 0 (zero), a flush from offset to
the end of the file is done. the end of the file is done.
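The interpretation of offset and count given above can be sketched as a small helper. This is an illustration only. The function name flush_range and the file_size parameter are hypothetical, and the half-open (start, end) convention is a choice made for the example rather than part of the protocol.

```python
def flush_range(offset, count, file_size):
    """Return the (start, end) byte range a COMMIT covers, end exclusive.

    A count of zero means "from offset to the end of the file".
    file_size is a hypothetical input standing for the server's view of
    the file length at the time of the COMMIT.
    """
    if count == 0:
        # An offset at or beyond the end of file yields an empty range.
        return (offset, max(offset, file_size))
    return (offset, offset + count)
```

An offset of 0 with a count of 0 therefore covers the entire file.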
The server returns a write verifier upon successful completion of the The server returns a write verifier upon successful completion of the
COMMIT. The write verifier is used by the client to determine if the COMMIT. The write verifier is used by the client to determine if the
server has restarted or rebooted between the initial WRITE(s) and the server has restarted between the initial WRITE(s) and the COMMIT.
COMMIT. The client does this by comparing the write verifier The client does this by comparing the write verifier returned from
returned from the initial writes and the verifier returned by the the initial writes and the verifier returned by the COMMIT operation.
COMMIT operation. The server must vary the value of the write The server must vary the value of the write verifier at each server
verifier at each server event or instantiation that may lead to a event or instantiation that may lead to a loss of uncommitted data.
loss of uncommitted data. Most commonly this occurs when the server Most commonly this occurs when the server is restarted; however,
is rebooted; however, other events at the server may result in other events at the server may result in uncommitted data loss as
uncommitted data loss as well. well.
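The verifier comparison described above can be sketched from the client's side as follows. This is a minimal sketch under stated assumptions: the tuple layout of a pending write and the function name writes_to_resend are hypothetical, and a real client would track this state per file and per server.

```python
def writes_to_resend(pending_writes, commit_verifier):
    """Return the uncommitted writes the client would need to send again.

    pending_writes   -- list of (offset, data, write_verifier) tuples for
                        writes that were answered with UNSTABLE4
    commit_verifier  -- the write verifier returned by COMMIT

    A write whose verifier differs from the one returned by COMMIT may
    have been lost, for example because the server restarted in between.
    """
    return [w for w in pending_writes if w[2] != commit_verifier]
```

If every pending write carries the same verifier as the COMMIT reply, the result is empty and the client can release its cached copies.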
On success, the current filehandle retains its value. On success, the current filehandle retains its value.
18.3.4. IMPLEMENTATION 18.3.4. IMPLEMENTATION
The COMMIT operation is similar in operation and semantics to the The COMMIT operation is similar in operation and semantics to the
POSIX fsync(2) system call that synchronizes a file's state with the POSIX fsync(2) system call that synchronizes a file's state with the
disk (file data and metadata are flushed to disk or stable storage). disk (file data and metadata are flushed to disk or stable storage).
COMMIT performs the same operation for a client, flushing any COMMIT performs the same operation for a client, flushing any
unsynchronized data and metadata on the server to the server's disk unsynchronized data and metadata on the server to the server's disk
skipping to change at page 394, line 38 skipping to change at page 396, line 38
This is useful for clients which do not commit delegation information This is useful for clients which do not commit delegation information
to stable storage to indicate that conflicting requests need not be to stable storage to indicate that conflicting requests need not be
delayed by the server awaiting recovery of delegation information. delayed by the server awaiting recovery of delegation information.
This operation should be used by clients that record delegation This operation should be used by clients that record delegation
information on stable storage on the client. In this case, information on stable storage on the client. In this case,
DELEGPURGE should be sent immediately after doing delegation recovery DELEGPURGE should be sent immediately after doing delegation recovery
on all delegations known to the client. Doing so will notify the on all delegations known to the client. Doing so will notify the
server that no additional delegations for the client will be server that no additional delegations for the client will be
recovered, allowing it to free resources, and avoid delaying other recovered, allowing it to free resources, and avoid delaying other
clients who make requests that conflict with the unrecovered clients which make requests that conflict with the unrecovered
delegations. The set of delegations known to the server and the delegations. The set of delegations known to the server and the
client may be different. The reason for this is that a client may client may be different. The reason for this is that a client may
fail after making a request which resulted in delegation but before fail after making a request which resulted in delegation but before
it received the results and committed them to the client's stable it received the results and committed them to the client's stable
storage. storage.
The server MAY support DELEGPURGE, but if it does not, it MUST NOT The server MAY support DELEGPURGE, but if it does not, it MUST NOT
support CLAIM_DELEGATE_PREV. support CLAIM_DELEGATE_PREV.
18.6. Operation 8: DELEGRETURN - Return Delegation 18.6. Operation 8: DELEGRETURN - Return Delegation
skipping to change at page 421, line 34 skipping to change at page 423, line 34
| CLAIM_NULL, CLAIM_FH | For the client, this is a new OPEN request | | CLAIM_NULL, CLAIM_FH | For the client, this is a new OPEN request |
| | and there is no previous state associated | | | and there is no previous state associated |
| | with the file for the client. With | | | with the file for the client. With |
| | CLAIM_NULL the file is identified by the | | | CLAIM_NULL the file is identified by the |
| | current filehandle and the specified | | | current filehandle and the specified |
| | component name. With CLAIM_FH (new to | | | component name. With CLAIM_FH (new to |
| | NFSv4.1) the file is identified by just | | | NFSv4.1) the file is identified by just |
| | the current filehandle. | | | the current filehandle. |
| CLAIM_PREVIOUS | The client is claiming basic OPEN state | | CLAIM_PREVIOUS | The client is claiming basic OPEN state |
| | for a file that was held previous to a | | | for a file that was held previous to a |
| | server reboot. Generally used when a | | | server restart. Generally used when a |
| | server is returning persistent | | | server is returning persistent |
| | filehandles; the client may not have the | | | filehandles; the client may not have the |
| | file name to reclaim the OPEN. | | | file name to reclaim the OPEN. |
| CLAIM_DELEGATE_CUR, | The client is claiming a delegation for | | CLAIM_DELEGATE_CUR, | The client is claiming a delegation for |
| CLAIM_DELEG_CUR_FH | OPEN as granted by the server. Generally | | CLAIM_DELEG_CUR_FH | OPEN as granted by the server. Generally |
| | this is done as part of recalling a | | | this is done as part of recalling a |
| | delegation. With CLAIM_DELEGATE_CUR, the | | | delegation. With CLAIM_DELEGATE_CUR, the |
| | file is identified by the current | | | file is identified by the current |
| | filehandle and the specified component | | | filehandle and the specified component |
| | name. With CLAIM_DELEG_CUR_FH (new to | | | name. With CLAIM_DELEG_CUR_FH (new to |
| | NFSv4.1), the file is identified by just | | | NFSv4.1), the file is identified by just |
| | the current filehandle. | | | the current filehandle. |
| CLAIM_DELEGATE_PREV, | The client is claiming a delegation | | CLAIM_DELEGATE_PREV, | The client is claiming a delegation |
| CLAIM_DELEG_PREV_FH | granted to a previous client instance; | | CLAIM_DELEG_PREV_FH | granted to a previous client instance; |
| | used after the client reboots. The server | | | used after the client restarts. The server |
| | MAY support CLAIM_DELEGATE_PREV or | | | MAY support CLAIM_DELEGATE_PREV or |
| | CLAIM_DELEG_PREV_FH (new to NFSv4.1). If | | | CLAIM_DELEG_PREV_FH (new to NFSv4.1). If |
| | it does support either open type, | | | it does support either open type, |
| | CREATE_SESSION MUST NOT remove the | | | CREATE_SESSION MUST NOT remove the |
| | client's delegation state, and the server | | | client's delegation state, and the server |
| | MUST support the DELEGPURGE operation. | | | MUST support the DELEGPURGE operation. |
+----------------------+--------------------------------------------+ +----------------------+--------------------------------------------+
For OPEN requests whose claim type is other than CLAIM_PREVIOUS (i.e. For OPEN requests whose claim type is other than CLAIM_PREVIOUS (i.e.
requests other than those devoted to reclaiming opens after a server requests other than those devoted to reclaiming opens after a server
reboot) that reach the server during its grace or lease expiration restart) that reach the server during its grace or lease expiration
period, the server returns an error of NFS4ERR_GRACE. period, the server returns an error of NFS4ERR_GRACE.
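The grace-period rule in the preceding paragraph can be sketched as a server-side check. This is an illustration only. The function name check_open_during_grace and the use of strings for claim types and status codes are assumptions made for the example, and a return value of None simply means that this particular check does not reject the request.

```python
def check_open_during_grace(claim_type, in_grace_period):
    """Return "NFS4ERR_GRACE" if the OPEN must be refused, else None.

    claim_type       -- name of the OPEN claim type, e.g. "CLAIM_NULL"
    in_grace_period  -- True while the server is in its grace or lease
                        expiration period
    """
    if in_grace_period and claim_type != "CLAIM_PREVIOUS":
        # Only reclaims of opens held before the server restart are
        # admitted during this period.
        return "NFS4ERR_GRACE"
    return None
```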
For any OPEN request, the server may return an open delegation, which For any OPEN request, the server may return an open delegation, which
allows further opens and closes to be handled locally on the client allows further opens and closes to be handled locally on the client
as described in Section 10.4. Note that delegation is up to the as described in Section 10.4. Note that delegation is up to the
server to decide. The client should never assume that delegation server to decide. The client should never assume that delegation
will or will not be granted in a particular instance. It should will or will not be granted in a particular instance. It should
always be prepared for either case. A partial exception is the always be prepared for either case. A partial exception is the
reclaim (CLAIM_PREVIOUS) case, in which a delegation type is claimed. reclaim (CLAIM_PREVIOUS) case, in which a delegation type is claimed.
In this case, delegation will always be granted, although the server In this case, delegation will always be granted, although the server
skipping to change at page 422, line 44 skipping to change at page 424, line 44
NFSv4.1 server. NFSv4.1 server.
o OPEN4_RESULT_LOCKTYPE_POSIX indicates the server's file locking o OPEN4_RESULT_LOCKTYPE_POSIX indicates the server's file locking
behavior supports the complete set of POSIX locking techniques. behavior supports the complete set of POSIX locking techniques.
From this the client can choose to manage file locking state in a From this the client can choose to manage file locking state in a
way to handle a mismatch of file locking management. way to handle a mismatch of file locking management.
o OPEN4_RESULT_PRESERVE_UNLINKED indicates the server will preserve o OPEN4_RESULT_PRESERVE_UNLINKED indicates the server will preserve
the open file if the client (or any other client) removes the file the open file if the client (or any other client) removes the file
as long as it is open. Furthermore, the server promises to as long as it is open. Furthermore, the server promises to
preserve the file through the grace period after server reboot, preserve the file through the grace period after server restart,
thereby giving the client the opportunity to reclaim his open. thereby giving the client the opportunity to reclaim its open.
o OPEN4_RESULT_MAY_NOTIFY_LOCK indicates that the server may attempt o OPEN4_RESULT_MAY_NOTIFY_LOCK indicates that the server may attempt
CB_NOTIFY_LOCK callbacks for locks on this file. This flag is a CB_NOTIFY_LOCK callbacks for locks on this file. This flag is a
hint only, and may be safely ignored by the client. hint only, and may be safely ignored by the client.
If the component is of zero length, NFS4ERR_INVAL will be returned. If the component is of zero length, NFS4ERR_INVAL will be returned.
The component is also subject to the normal UTF-8, character support, The component is also subject to the normal UTF-8, character support,
and name checks. See Section 14.5 for further discussion. and name checks. See Section 14.5 for further discussion.
skipping to change at page 426, line 48 skipping to change at page 428, line 48
In the absence of a persistent session, the client invokes exclusive In the absence of a persistent session, the client invokes exclusive
create by setting the how parameter to EXCLUSIVE4 or EXCLUSIVE4_1. create by setting the how parameter to EXCLUSIVE4 or EXCLUSIVE4_1.
In these cases, the client provides a verifier that can reasonably be In these cases, the client provides a verifier that can reasonably be
expected to be unique. A combination of a client identifier, perhaps expected to be unique. A combination of a client identifier, perhaps
the client network address, and a unique number generated by the the client network address, and a unique number generated by the
client, perhaps the RPC transaction identifier, may be appropriate. client, perhaps the RPC transaction identifier, may be appropriate.
If the object does not exist, the server creates the object and If the object does not exist, the server creates the object and
stores the verifier in stable storage. For file systems that do not stores the verifier in stable storage. For file systems that do not
provide a mechanism for the storage of arbitrary file attributes, the provide a mechanism for the storage of arbitrary file attributes, the
server may use one or more elements of the object meta-data to store server may use one or more elements of the object metadata to store
the verifier. The verifier must be stored in stable storage to the verifier. The verifier must be stored in stable storage to
prevent erroneous failure on retransmission of the request. It is prevent erroneous failure on retransmission of the request. It is
assumed that an exclusive create is being performed because exclusive assumed that an exclusive create is being performed because exclusive
semantics are critical to the application. Because of the expected semantics are critical to the application. Because of the expected
usage, exclusive CREATE does not rely solely on the server's reply usage, exclusive CREATE does not rely solely on the server's reply
cache for storage of the verifier. A nonpersistent reply cache does cache for storage of the verifier. A nonpersistent reply cache does
not survive a crash and the session and reply cache may be deleted not survive a crash and the session and reply cache may be deleted
after a network partition that exceeds the lease time, thus opening after a network partition that exceeds the lease time, thus opening
failure windows. failure windows.
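The role of the stored verifier can be sketched with a toy server. This is a minimal sketch, not a definitive implementation: the class ToyServer and its dictionary are hypothetical stand-ins for stable storage, and the handling of an object that already exists (success on a matching verifier, NFS4ERR_EXIST otherwise) follows the usual exclusive create rule, which is not spelled out in the excerpt above.

```python
class ToyServer:
    """Hypothetical server keeping create verifiers per object name."""

    def __init__(self):
        # Stands in for stable storage: object name -> stored verifier.
        self.stable_storage = {}

    def exclusive_create(self, name, verifier):
        """Handle an exclusive create and return a status string."""
        if name not in self.stable_storage:
            # Object does not exist: create it and record the verifier.
            self.stable_storage[name] = verifier
            return "NFS4_OK"
        if self.stable_storage[name] == verifier:
            # Same verifier: treat the request as a retransmission of
            # the create that already succeeded.
            return "NFS4_OK"
        # A different verifier means some other create got there first.
        return "NFS4ERR_EXIST"
```

Because the verifier is kept with the object rather than only in the reply cache, a retransmitted request still succeeds after the reply cache has been lost.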
skipping to change at page 460, line 45 skipping to change at page 462, line 45
The definition of stable storage has been historically a point of The definition of stable storage has been historically a point of
contention. The following expected properties of stable storage may contention. The following expected properties of stable storage may
help in resolving design issues in the implementation. Stable storage help in resolving design issues in the implementation. Stable storage
is persistent storage that survives: is persistent storage that survives:
1. Repeated power failures. 1. Repeated power failures.
2. Hardware failures (of any board, power supply, etc.). 2. Hardware failures (of any board, power supply, etc.).
3. Repeated software crashes, including reboot cycle. 3. Repeated software crashes, including restart cycle.
This definition does not address failure of the stable storage module This definition does not address failure of the stable storage module
itself. itself.
The verifier is defined to allow a client to detect different The verifier is defined to allow a client to detect different
instances of an NFSv4.1 protocol server over which cached, instances of an NFSv4.1 protocol server over which cached,
uncommitted data may be lost. In the most likely case, the verifier uncommitted data may be lost. In the most likely case, the verifier
allows the client to detect server reboots. This information is allows the client to detect server restarts. This information is
required so that the client can safely determine whether the server required so that the client can safely determine whether the server
could have lost cached data. If the server fails unexpectedly and could have lost cached data. If the server fails unexpectedly and
the client has uncommitted data from previous WRITE requests (done the client has uncommitted data from previous WRITE requests (done
with the stable argument set to UNSTABLE4 and in which the result with the stable argument set to UNSTABLE4 and in which the result
committed was returned as UNSTABLE4 as well) it may not have flushed committed was returned as UNSTABLE4 as well) it may not have flushed
cached data to stable storage. The burden of recovery is on the cached data to stable storage. The burden of recovery is on the
client and the client will need to retransmit the data to the server. client and the client will need to retransmit the data to the server.
A suggested verifier would be to use the time that the server was A suggested verifier would be to use the time that the server was
booted or the time the server was last started (if restarting the booted or the time the server was last started (if restarting the
server without a reboot results in lost buffers). server without rebooting the host results in lost buffers).
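The suggested verifier can be sketched as a value derived from the server's start time. This is one possible construction under stated assumptions: the function name make_write_verifier, the use of nanosecond resolution, and the 8-byte big-endian encoding are choices made for the example.

```python
import struct
import time


def make_write_verifier(start_time_ns=None):
    """Build an 8-byte write verifier from the server's start time.

    start_time_ns is a hypothetical parameter: the time, in nanoseconds,
    at which this server instance started. When omitted, the current
    time is used, which suits a value computed once at startup.
    """
    if start_time_ns is None:
        start_time_ns = time.time_ns()
    # Keep the low 64 bits and encode them as 8 big-endian bytes.
    return struct.pack(">Q", start_time_ns & 0xFFFFFFFFFFFFFFFF)
```

A server would compute this value once per instance and return it in every WRITE and COMMIT reply until its next start.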
The committed field in the results allows the client to do more The committed field in the results allows the client to do more
effective caching. If the server is committing all WRITE requests to effective caching. If the server is committing all WRITE requests to
stable storage, then it should return with committed set to stable storage, then it should return with committed set to
FILE_SYNC4, regardless of the value of the stable field in the FILE_SYNC4, regardless of the value of the stable field in the
arguments. A server that uses an NVRAM accelerator may choose to arguments. A server that uses an NVRAM accelerator may choose to
implement this policy. The client can use this to increase the implement this policy. The client can use this to increase the
effectiveness of the cache by discarding cached data that has already effectiveness of the cache by discarding cached data that has already
been committed on the server. been committed on the server.
skipping to change at page 465, line 33 skipping to change at page 467, line 33
csr_use_conn_in_rdma_mode to understand bctsr_use_conn_in_rdma_mode. csr_use_conn_in_rdma_mode to understand bctsr_use_conn_in_rdma_mode.
Invoking BIND_CONN_TO_SESSION on a connection already associated with Invoking BIND_CONN_TO_SESSION on a connection already associated with
the specified session has no effect, and the server SHOULD respond the specified session has no effect, and the server SHOULD respond
with NFS4_OK. with NFS4_OK.
18.34.4. IMPLEMENTATION 18.34.4. IMPLEMENTATION
If a session's channel loses all connections, the client needs to use If a session's channel loses all connections, the client needs to use
BIND_CONN_TO_SESSION to associate a new connection. If the server BIND_CONN_TO_SESSION to associate a new connection. If the server
rebooted and does not keep the reply cache in stable storage, the restarted and does not keep the reply cache in stable storage, the
server will not recognize the sessionid. The client will ultimately server will not recognize the sessionid. The client will ultimately
have to invoke EXCHANGE_ID to create a new client ID and session. have to invoke EXCHANGE_ID to create a new client ID and session.
Assuming SP4_SSV state protection is being used, there is an issue if Assuming SP4_SSV state protection is being used, there is an issue if
SET_SSV is sent, no response is returned, and the last connection SET_SSV is sent, no response is returned, and the last connection
associated with the client ID disconnects. The client, per the associated with the client ID disconnects. The client, per the
sessions model, needs to retry the SET_SSV. But it needs a new sessions model, needs to retry the SET_SSV. But it needs a new
connection to do so, and needs to associate that connection with the connection to do so, and needs to associate that connection with the
session via a BIND_CONN_TO_SESSION authenticated with the SSV GSS session via a BIND_CONN_TO_SESSION authenticated with the SSV GSS
mechanism. The problem is that the RPCSEC_GSS message integrity mechanism. The problem is that the RPCSEC_GSS message integrity
skipping to change at page 485, line 31 skipping to change at page 487, line 31
uses (which will be either what the client offered, or what the uses (which will be either what the client offered, or what the
server is insisting on). return the value used to the client. These server is insisting on). return the value used to the client. These
parameters have the following interpretation. parameters have the following interpretation.
csa_flags: csa_flags:
The csa_flags field contains a list of the following flag bits: The csa_flags field contains a list of the following flag bits:
CREATE_SESSION4_FLAG_PERSIST: CREATE_SESSION4_FLAG_PERSIST:
If CREATE_SESSION4_FLAG_PERSIST is set, the client desires If CREATE_SESSION4_FLAG_PERSIST is set, the client wants the
server support for persistent reply cache. For sessions in server to provide a persistent reply cache. For sessions in
which only idempotent operations will be used (e.g. a read-only which only idempotent operations will be used (e.g. a read-only
session), clients SHOULD NOT set CREATE_SESSION4_FLAG_PERSIST. session), clients SHOULD NOT set CREATE_SESSION4_FLAG_PERSIST.
If the server does not or cannot provide a persistent reply If the server does not or cannot provide a persistent reply
cache, the server MUST NOT set CREATE_SESSION4_FLAG_PERSIST in cache, the server MUST NOT set CREATE_SESSION4_FLAG_PERSIST in
the field csr_flags. the field csr_flags.
If the server is a pNFS metadata server, for reasons described If the server is a pNFS metadata server, for reasons described
in Section 12.5.2 it SHOULD support in Section 12.5.2 it SHOULD support
CREATE_SESSION4_FLAG_PERSIST if it supports the layout_hint CREATE_SESSION4_FLAG_PERSIST if it supports the layout_hint
(Section 5.11.4) attribute. (Section 5.11.4) attribute.
skipping to change at page 489, line 44 skipping to change at page 491, line 44
To describe a possible implementation, the same notation for client To describe a possible implementation, the same notation for client
records introduced in the description of EXCHANGE_ID is used with the records introduced in the description of EXCHANGE_ID is used with the
following addition: following addition:
clientid_arg: The value of the csa_clientid field of the clientid_arg: The value of the csa_clientid field of the
CREATE_SESSION4args structure of the current request. CREATE_SESSION4args structure of the current request.
Since CREATE_SESSION is a non-idempotent operation, we must consider Since CREATE_SESSION is a non-idempotent operation, we must consider
the possibility that retries may occur as a result of a client the possibility that retries may occur as a result of a client
reboot, network partition, malfunctioning router, etc. For each restart, network partition, malfunctioning router, etc. For each
client ID created by EXCHANGE_ID, the server maintains a separate client ID created by EXCHANGE_ID, the server maintains a separate
reply cache similar to the session reply cache used for SEQUENCE reply cache similar to the session reply cache used for SEQUENCE
operations, with two distinctions. operations, with two distinctions.
o First, this is a reply cache just for detecting and processing o First, this is a reply cache just for detecting and processing
CREATE_SESSION requests for a given client ID. CREATE_SESSION requests for a given client ID.
o Second, the size of the client ID reply cache is one slot (and o Second, the size of the client ID reply cache is one slot (and
as a result, the CREATE_SESSION request does not carry a slot as a result, the CREATE_SESSION request does not carry a slot
number). This means that at most one CREATE_SESSION request for a number). This means that at most one CREATE_SESSION request for a
skipping to change at page 493, line 20 skipping to change at page 495, line 20
18.37.2. RESULT 18.37.2. RESULT
struct DESTROY_SESSION4res { struct DESTROY_SESSION4res {
nfsstat4 dsr_status; nfsstat4 dsr_status;
}; };
18.37.3. DESCRIPTION 18.37.3. DESCRIPTION
The DESTROY_SESSION operation closes the session and discards the The DESTROY_SESSION operation closes the session and discards the
session's its reply cache, if any. Any remaining connections session's reply cache, if any. Any remaining connections associated
associated with the session are immediately disassociated and it not with the session are immediately disassociated. If the connection
associated with out sessions, MAY be closed by the server. Locks, has no remaining associated sessions, the connection MAY be closed by
delegations, layouts, wants, and the lease, which are all tied to the the server. Locks, delegations, layouts, wants, and the lease, which
client ID, are not affected by DESTROY_SESSION. are all tied to the client ID, are not affected by DESTROY_SESSION.
DESTROY_SESSION MUST be invoked on a connection that is associated DESTROY_SESSION MUST be invoked on a connection that is associated
with the session being destroyed. In addition if SP4_MACH_CRED state with the session being destroyed. In addition if SP4_MACH_CRED state
protection was specified when the client ID was created, the protection was specified when the client ID was created, the
RPCSEC_GSS principal that created the session MUST be the one that RPCSEC_GSS principal that created the session MUST be the one that
destroys the session, using RPCSEC_GSS privacy or integrity. If destroys the session, using RPCSEC_GSS privacy or integrity. If
SP4_SSV state protection was specified when the client ID was SP4_SSV state protection was specified when the client ID was
created, RPCSEC_GSS using the SSV mechanism (Section 2.10.8) MUST be created, RPCSEC_GSS using the SSV mechanism (Section 2.10.8) MUST be
used, with integrity or privacy. used, with integrity or privacy.
If the COMPOUND request starts with SEQUENCE, and if the sessions If the COMPOUND request starts with SEQUENCE, and if the sessionids
referred to by SEQUENCE and DESTROY_SESSION are the same, then specified in SEQUENCE and DESTROY_SESSION are the same, then
o DESTROY_SESSION MUST be the final operation in the COMPOUND o DESTROY_SESSION MUST be the final operation in the COMPOUND
request. request.
o It is advisable to not place DESTROY_SESSION in a COMPOUND request o It is advisable to not place DESTROY_SESSION in a COMPOUND request
with other state-modifying operations, because the DESTROY_SESSION with other state-modifying operations, because the DESTROY_SESSION
will destroy reply cache. will destroy the reply cache.
DESTROY_SESSION MAY be the only operation in a COMPOUND request. DESTROY_SESSION MAY be the only operation in a COMPOUND request.
Because the session is destroyed, a client that retries the request Because the session is destroyed, a client that retries the request
may receive an error in reply to the retry, even though the original may receive an error in reply to the retry, even though the original
request was successful. request was successful.
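The placement rule above can be sketched as a small check. This is an illustrative sketch only: the function name and the representation of a COMPOUND as a list of (operation name, sessionid) pairs are assumptions made for the example and are not part of the protocol.

```python
def destroy_session_placement_ok(ops):
    """Check the DESTROY_SESSION placement rule for one COMPOUND.

    ops is a hypothetical list of (op_name, sessionid_or_None) tuples.
    The rule only applies when the COMPOUND starts with SEQUENCE and
    DESTROY_SESSION names the same sessionid as that SEQUENCE.
    """
    if not ops or ops[0][0] != "SEQUENCE":
        return True  # rule does not apply without a leading SEQUENCE
    seq_sessionid = ops[0][1]
    last = len(ops) - 1
    for i, (name, sessionid) in enumerate(ops):
        if name == "DESTROY_SESSION" and sessionid == seq_sessionid and i != last:
            return False  # must be the final operation
    return True
```

A COMPOUND that contains only DESTROY_SESSION passes trivially, which matches the statement that DESTROY_SESSION MAY be the only operation in a COMPOUND request.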
If there is a backchannel on the session and the server has If there is a backchannel on the session and the server has
outstanding CB_COMPOUND operations for the session which have not outstanding CB_COMPOUND operations for the session which have not
been replied to, then the server MAY refuse to destroy the session been replied to, then the server MAY refuse to destroy the session
skipping to change at page 504, line 32 skipping to change at page 506, line 32
void; void;
}; };
18.42.3. DESCRIPTION 18.42.3. DESCRIPTION
Commits changes in the layout represented by the current filehandle, Commits changes in the layout represented by the current filehandle,
client ID (derived from the sessionid in the preceding SEQUENCE client ID (derived from the sessionid in the preceding SEQUENCE
operation), byte range, and stateid. Since layouts are sub- operation), byte range, and stateid. Since layouts are sub-
dividable, a smaller portion of a layout, retrieved via LAYOUTGET, dividable, a smaller portion of a layout, retrieved via LAYOUTGET,
may be committed. The region being committed is specified through may be committed. The region being committed is specified through
the byte range (loca_offset and loca_length). the byte range (loca_offset and loca_length). This region MUST
overlap with one or more existing layouts previously granted via
LAYOUTGET (Section 18.43), each with an iomode of LAYOUTIOMODE4_RW.
In the case where any held layout segment's iomode is not
LAYOUTIOMODE4_RW the server should return the error
NFS4ERR_BADIOMODE. For the case where the client does not hold
matching layout segment(s) for the defined region, the server should
return the error NFS4ERR_BADLAYOUT.
The LAYOUTCOMMIT operation indicates that the client has completed The LAYOUTCOMMIT operation indicates that the client has completed
writes using a layout obtained by a previous LAYOUTGET. The client writes using a layout obtained by a previous LAYOUTGET. The client
may have only written a subset of the data range it previously may have only written a subset of the data range it previously
requested. LAYOUTCOMMIT allows it to commit or discard provisionally requested. LAYOUTCOMMIT allows it to commit or discard provisionally
allocated space and to update the server with a new end of file. The allocated space and to update the server with a new end of file. The
layout referenced by LAYOUTCOMMIT is still valid after the operation layout referenced by LAYOUTCOMMIT is still valid after the operation
completes and can continue to be referenced by the client ID, completes and can continue to be referenced by the client ID,
filehandle, byte range, layout type, and stateid. filehandle, byte range, layout type, and stateid.
If the loca_reclaim field is set to TRUE, this indicates that the If the loca_reclaim field is set to TRUE, this indicates that the
client is attempting to commit changes to a layout after the reboot client is attempting to commit changes to a layout after the restart
of the metadata server during the metadata server's recovery grace of the metadata server during the metadata server's recovery grace
period. This type of request may be necessary when the client has period (see Section 12.7.4). This type of request may be necessary
uncommitted writes to provisionally allocated regions of a file which when the client has uncommitted writes to provisionally allocated
were sent to the storage devices before the reboot of the metadata regions of a file which were sent to the storage devices before the
server. In this case the layout provided by the client MUST be a restart of the metadata server. In this case the layout provided by
subset of a writable layout that the client held immediately before the client MUST be a subset of a writable layout that the client held
the reboot of the metadata server. The metadata server is free to immediately before the restart of the metadata server. The metadata
accept or reject this request based on its own internal metadata server is free to accept or reject this request based on its own
consistency checks. If the metadata server finds that the layout internal metadata consistency checks. If the metadata server finds
provided by the client does not pass its consistency checks, it MUST that the layout provided by the client does not pass its consistency
reject the request with the status NFS4ERR_RECLAIM_BAD. The checks, it MUST reject the request with the status
successful completion of the LAYOUTCOMMIT request with loca_reclaim NFS4ERR_RECLAIM_BAD. The successful completion of the LAYOUTCOMMIT
set to TRUE does NOT provide the client with a layout for the file. request with loca_reclaim set to TRUE does NOT provide the client
It simply commits the changes to the layout specified in the with a layout for the file. It simply commits the changes to the
loca_layoutupdate field. To obtain a layout for the file the client layout specified in the loca_layoutupdate field. To obtain a layout
must send a LAYOUTGET request to the server after the server's grace for the file the client must send a LAYOUTGET request to the server
period has expired. If the metadata server receives a LAYOUTCOMMIT after the server's grace period has expired. If the metadata server
request with loca_reclaim set to TRUE when the metadata server is not receives a LAYOUTCOMMIT request with loca_reclaim set to TRUE when
in its recovery grace period, it MUST reject the request with the the metadata server is not in its recovery grace period, it MUST
status NFS4ERR_NO_GRACE. reject the request with the status NFS4ERR_NO_GRACE.
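The reclaim handling described above can be summarized as a decision function. This is a minimal sketch under stated assumptions: the function name and its boolean inputs are hypothetical, status codes are modelled as strings, and the server's internal consistency checks are reduced to a single flag.

```python
def layoutcommit_reclaim_status(loca_reclaim, in_grace, passes_consistency_checks):
    """Return the status a metadata server might give for the reclaim aspect
    of a LAYOUTCOMMIT. Other LAYOUTCOMMIT checks are not modelled here.
    """
    if not loca_reclaim:
        return "NFS4_OK"  # ordinary commit, no reclaim-specific checks
    if not in_grace:
        return "NFS4ERR_NO_GRACE"  # reclaim outside the recovery grace period
    if not passes_consistency_checks:
        return "NFS4ERR_RECLAIM_BAD"  # layout fails the server's own checks
    return "NFS4_OK"
```

Note that a successful reclaim commit in this sketch, as in the text, does not hand a layout back to the client; a later LAYOUTGET is still needed.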
Setting the loca_reclaim field to TRUE is required if and only if the Setting the loca_reclaim field to TRUE is required if and only if the
committed layout was acquired before the metadata server reboot. If committed layout was acquired before the metadata server restart. If
the client is committing a layout that was acquired during the the client is committing a layout that was acquired during the
metadata server's grace period, it MUST set the "reclaim" field to metadata server's grace period, it MUST set the "reclaim" field to
FALSE. FALSE.
The loca_stateid is a layout stateid value as returned by previously The loca_stateid is a layout stateid value as returned by previously
successful layout operations (see Section 12.5.3). successful layout operations (see Section 12.5.3).
The loca_last_write_offset field specifies the offset of the last The loca_last_write_offset field specifies the offset of the last
byte written by the client previous to the LAYOUTCOMMIT. Note that byte written by the client previous to the LAYOUTCOMMIT. Note that
this value is never equal to the file's size (at most it is one byte this value is never equal to the file's size (at most it is one byte
less than the file's size) and MUST be less than or equal to less than the file's size) and MUST be less than or equal to
NFS4_MAXFILEOFF. The metadata server may use this information to NFS4_MAXFILEOFF. Also, loca_last_write_offset MUST overlap the range
determine whether the file's size needs to be updated. If the described by loca_offset and loca_length. The metadata server may
metadata server updates the file's size as the result of the use this information to determine whether the file's size needs to be
LAYOUTCOMMIT operation, it must return the new size updated. If the metadata server updates the file's size as the
result of the LAYOUTCOMMIT operation, it must return the new size
(locr_newsize.ns_size) as part of the results. (locr_newsize.ns_size) as part of the results.
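The constraints on loca_last_write_offset can be illustrated as follows. This is a sketch, not a definitive implementation: the helper names are hypothetical, NFS4_MAXFILEOFF is taken to be the all-ones 64-bit value, and the size update (last written offset plus one, applied only when it grows the file) is an assumption about how a server might use the field.

```python
NFS4_MAXFILEOFF = 0xFFFFFFFFFFFFFFFF  # assumed value: all 64 bits set


def last_write_offset_ok(loca_offset, loca_length, loca_last_write_offset):
    """True if the last written byte lies inside the committed byte range."""
    if loca_last_write_offset > NFS4_MAXFILEOFF:
        return False
    # Last byte covered by the range, clamped to the largest file offset.
    range_end = min(loca_offset + loca_length - 1, NFS4_MAXFILEOFF)
    return loca_offset <= loca_last_write_offset <= range_end


def new_size_after_commit(current_size, loca_last_write_offset):
    """Return the new file size if the commit extends the file, else None."""
    candidate = loca_last_write_offset + 1
    return candidate if candidate > current_size else None
```

The "plus one" reflects the statement that the last write offset is at most one byte less than the file's size.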
The loca_time_modify field allows the client to suggest a The loca_time_modify field allows the client to suggest a
modification time it would like the metadata server to set. The modification time it would like the metadata server to set. The
metadata server may use the suggestion or it may use the time of the metadata server may use the suggestion or it may use the time of the
LAYOUTCOMMIT operation to set the modification time. If the metadata LAYOUTCOMMIT operation to set the modification time. If the metadata
server uses the client provided modification time, it should ensure server uses the client provided modification time, it should ensure
time does not flow backwards. If the client wants to force the time does not flow backwards. If the client wants to force the
metadata server to set an exact time, the client should use a SETATTR metadata server to set an exact time, the client should use a SETATTR
operation in a compound right after LAYOUTCOMMIT. See Section 12.5.4 operation in a compound right after LAYOUTCOMMIT. See Section 12.5.4
skipping to change at page 506, line 31 skipping to change at page 508, line 39
On success, the current filehandle retains its value and the current On success, the current filehandle retains its value and the current
stateid retains its value. stateid retains its value.
18.42.4. IMPLEMENTATION 18.42.4. IMPLEMENTATION
The client MAY also use LAYOUTCOMMIT with the loca_reclaim field set The client MAY also use LAYOUTCOMMIT with the loca_reclaim field set
to TRUE to convey hints to modified file attributes or to report to TRUE to convey hints to modified file attributes or to report
layout-type specific information such as I/O errors for object-based layout-type specific information such as I/O errors for object-based
storage layouts, as normally done during normal operation. Doing so storage layouts, as normally done during normal operation. Doing so
may help the metadata server to recover files more efficiently after may help the metadata server to recover files more efficiently after
reboot. For example, some file system implementations may require restart. For example, some file system implementations may require
expansive recovery of file system objects if the metadata server does expansive recovery of file system objects if the metadata server does
not get a positive indication from all clients holding a write layout not get a positive indication from all clients holding a write layout
that they have successfully completed all their writes. Sending a that they have successfully completed all their writes. Sending a
LAYOUTCOMMIT (if required) and then following with LAYOUTRETURN can LAYOUTCOMMIT (if required) and then following with LAYOUTRETURN can
provide such an indication and allow for graceful and efficient provide such an indication and allow for graceful and efficient
recovery. recovery.
18.43. Operation 50: LAYOUTGET - Get Layout Information 18.43. Operation 50: LAYOUTGET - Get Layout Information
18.43.1. ARGUMENT 18.43.1. ARGUMENT
skipping to change at page 508, line 11 skipping to change at page 510, line 11
The LAYOUTGET operation returns layout information for the specified The LAYOUTGET operation returns layout information for the specified
byte range: a layout. To get a layout from a specific offset through byte range: a layout. To get a layout from a specific offset through
the end-of-file, regardless of the file's length, a loga_length field the end-of-file, regardless of the file's length, a loga_length field
with all bits set to 1 (one) should be used. If loga_length is zero, with all bits set to 1 (one) should be used. If loga_length is zero,
or if a loga_length which is not all bits set to one is specified, or if a loga_length which is not all bits set to one is specified,
and loga_length when added to loga_offset exceeds the maximum 64-bit and loga_length when added to loga_offset exceeds the maximum 64-bit
unsigned integer value, the error NFS4ERR_INVAL will result. unsigned integer value, the error NFS4ERR_INVAL will result.
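The range validation described above can be sketched as below. The function name and the use of strings for status codes are assumptions made for illustration; Python integers do not wrap, so the 64-bit overflow test is written as an explicit comparison.

```python
UINT64_MAX = 2**64 - 1  # loga_length with all bits set means "through end-of-file"


def check_layoutget_range(loga_offset, loga_length):
    """Return "NFS4_OK" or "NFS4ERR_INVAL" for a requested LAYOUTGET range."""
    if loga_length == 0:
        return "NFS4ERR_INVAL"
    if loga_length != UINT64_MAX and loga_offset + loga_length > UINT64_MAX:
        return "NFS4ERR_INVAL"  # offset + length exceeds the 64-bit maximum
    return "NFS4_OK"
```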
The loga_minlength field specifies the minimum length of layout the The loga_minlength field specifies the minimum length of layout the
server MUST return. If this requirement cannot be met, no layout server MUST return with two exceptions:
must be returned; the error NFS4ERR_BADLAYOUT will be returned.
1. The argument loga_iomode was set to LAYOUTIOMODE4_READ, and
loga_offset plus loga_minlength goes past the end of the file.
2. The range from loga_offset through loga_offset + loga_minlength -
1 overlaps two or more striping patterns, in which case
logr_layout will contain two or more elements, and the sum of the
lo_length fields of each element MUST be at least loga_minlength
unless the first exception also applies.
If this requirement cannot be met, the server MUST NOT return a
layout and the error NFS4ERR_BADLAYOUT MUST be returned.
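One way to read the loga_minlength rule and its two exceptions is sketched below. The function name and parameters are hypothetical, and the sketch only answers whether the minimum-length requirement is met; it does not model what the server returns when the first exception applies.

```python
def minlength_satisfied(lo_lengths, loga_offset, loga_minlength, is_read, file_size):
    """Decide whether a candidate reply meets loga_minlength.

    lo_lengths holds the lo_length of each element of logr_layout; with a
    single striping pattern it has one entry, otherwise several whose sum
    is compared against loga_minlength (second exception).
    """
    # First exception: a read layout whose minimum range runs past end-of-file.
    if is_read and loga_offset + loga_minlength > file_size:
        return True
    return sum(lo_lengths) >= loga_minlength
```

If this check fails, the text above calls for no layout to be returned and for the error NFS4ERR_BADLAYOUT.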
The loga_stateid field specifies a valid stateid. If a layout is not The loga_stateid field specifies a valid stateid. If a layout is not
currently held by the client, the loga_stateid field represents a currently held by the client, the loga_stateid field represents a
stateid reflecting the correspondingly valid open, record lock, or stateid reflecting the correspondingly valid open, record lock, or
delegation stateid. Once a layout is held by the client for the delegation stateid. Once a layout is held by the client for the
file, the loga_stateid field is a stateid as returned from a previous file, the loga_stateid field is a stateid as returned from a previous
LAYOUTGET or LAYOUTRETURN operation or provided by a CB_LAYOUTRECALL LAYOUTGET or LAYOUTRETURN operation or provided by a CB_LAYOUTRECALL
operation (see Section 12.5.3). operation (see Section 12.5.3).
The loga_maxcount field specifies the maximum layout size (in bytes) The loga_maxcount field specifies the maximum layout size (in bytes)
skipping to change at page 508, line 39 skipping to change at page 510, line 50
then logr_layout will contain just one entry. Otherwise, if the then logr_layout will contain just one entry. Otherwise, if the
requested range overlaps more than one striping pattern, logr_layout requested range overlaps more than one striping pattern, logr_layout
will contain the required number of entries. The elements of will contain the required number of entries. The elements of
logr_layout MUST be sorted in ascending order of the value of the logr_layout MUST be sorted in ascending order of the value of the
lo_offset field of each element. There MUST be no gaps or overlaps lo_offset field of each element. There MUST be no gaps or overlaps
in the range between two successive elements of logr_layout. The in the range between two successive elements of logr_layout. The
lo_iomode field in each element of logr_layout MUST be the same. lo_iomode field in each element of logr_layout MUST be the same.
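A client could verify these ordering and contiguity requirements with a check along the following lines. This is a sketch: the Layout class is a hypothetical stand-in carrying only the three fields named in the text, and iomodes are modelled as plain integers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layout:
    lo_offset: int
    lo_length: int
    lo_iomode: int


def logr_layout_is_well_formed(layouts: List[Layout]) -> bool:
    """Check ascending lo_offset, no gaps or overlaps, and a single iomode."""
    for prev, cur in zip(layouts, layouts[1:]):
        if cur.lo_offset <= prev.lo_offset:
            return False  # not in ascending order
        if prev.lo_offset + prev.lo_length != cur.lo_offset:
            return False  # gap or overlap between successive elements
        if cur.lo_iomode != prev.lo_iomode:
            return False  # every element must carry the same lo_iomode
    return True
```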
The metadata server may adjust the range of the returned layout based The metadata server may adjust the range of the returned layout based
on the usage implied by the loga_iomode. The client MUST be prepared on the usage implied by the loga_iomode. The client MUST be prepared
to get a layout that does not align exactly with its request. The to get a layout that does not align exactly with its request. See
lo_length field in each element of logr_layout SHOULD be at least as
long as loga_minlength or the server SHOULD reject the request. See
Section 12.5.2 for more details. Section 12.5.2 for more details.
The metadata server may also return a layout with an lo_iomode other The metadata server may also return a layout with an lo_iomode other
than that requested by the client. If it does so, it must ensure than that requested by the client. If it does so, it MUST ensure
that the lo_iomode is more permissive than the loga_iomode requested. that the lo_iomode is more permissive than the loga_iomode requested.
For example, this behavior allows an implementation to upgrade read- For example, this behavior allows an implementation to upgrade read-
only requests to read/write requests at its discretion, within the only requests to read/write requests at its discretion, within the
limits of the layout type specific protocol. A lo_iomode of either limits of the layout type specific protocol. A lo_iomode of either
LAYOUTIOMODE4_READ or LAYOUTIOMODE4_RW must be returned. LAYOUTIOMODE4_READ or LAYOUTIOMODE4_RW MUST be returned.
The logr_return_on_close result field is a directive to return the The logr_return_on_close result field is a directive to return the
layout before closing the file. When the server sets this return layout before closing the file. When the server sets this return
value to TRUE, it must be prepared to recall the layout in the case value to TRUE, it MUST be prepared to recall the layout in the case
the client fails to return the layout before close. For the server the client fails to return the layout before close. For the server
that knows a layout must be returned before a close of the file, this that knows a layout must be returned before a close of the file, this
return value can be used to communicate the desired behavior to the return value can be used to communicate the desired behavior to the
client and thus remove one extra step from the client's and server's client and thus remove one extra step from the client's and server's
interaction. interaction.
The logr_stateid, as with all stateid processing, is returned to the The logr_stateid, as with all stateid processing, is returned to the
client for use in subsequent layout related operations. See client for use in subsequent layout related operations. See
Section 8.2 for a further discussion. Section 8.2 for a further discussion.
skipping to change at page 509, line 36 skipping to change at page 511, line 44
If layouts are not supported for the requested file or its containing If layouts are not supported for the requested file or its containing
file system the server SHOULD return NFS4ERR_LAYOUTUNAVAILABLE. If file system the server SHOULD return NFS4ERR_LAYOUTUNAVAILABLE. If
the layout type is not supported, the metadata server should return the layout type is not supported, the metadata server should return
NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts are supported but no layout NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts are supported but no layout
matches the client provided layout identification, the server should matches the client provided layout identification, the server should
return NFS4ERR_BADLAYOUT. If an invalid loga_iomode is specified, or return NFS4ERR_BADLAYOUT. If an invalid loga_iomode is specified, or
a loga_iomode of LAYOUTIOMODE4_ANY is specified, the server should a loga_iomode of LAYOUTIOMODE4_ANY is specified, the server should
return NFS4ERR_BADIOMODE. return NFS4ERR_BADIOMODE.
If the layout for the file is unavailable due to transient If the layout for the file is unavailable due to transient
conditions, e.g. file sharing prohibits layouts, the server must conditions, e.g. file sharing prohibits layouts, the server MUST
return NFS4ERR_LAYOUTTRYLATER. return NFS4ERR_LAYOUTTRYLATER.
If the layout request is rejected due to an overlapping layout If the layout request is rejected due to an overlapping layout
recall, the server must return NFS4ERR_RECALLCONFLICT. See recall, the server MUST return NFS4ERR_RECALLCONFLICT. See
Section 12.5.5.2 for details. Section 12.5.5.2 for details.
If the layout conflicts with a mandatory byte range lock held on the If the layout conflicts with a mandatory byte range lock held on the
file, and if the storage devices have no method of enforcing file, and if the storage devices have no method of enforcing
mandatory locks, other than through the restriction of layouts, the mandatory locks, other than through the restriction of layouts, the
metadata server should return NFS4ERR_LOCKED. metadata server should return NFS4ERR_LOCKED.
If the client sets loga_signal_layout_avail to TRUE, then it is If the client sets loga_signal_layout_avail to TRUE, then it is
registering with the server a "want" for a layout in the event the registering with the server a "want" for a layout in the event the
layout cannot be obtained due to resource exhaustion. If the server layout cannot be obtained due to resource exhaustion. If the server
skipping to change at page 513, line 29 skipping to change at page 515, line 29
layout operations. Although the precise value returned (with a non- layout operations. Although the precise value returned (with a non-
zero seqid) may be used, it is generally best to use the same "other" zero seqid) may be used, it is generally best to use the same "other"
value and set the seqid to zero. value and set the seqid to zero.
Return of a layout or all layouts does not invalidate the mapping of Return of a layout or all layouts does not invalidate the mapping of
storage device ID to storage device address which remains in effect storage device ID to storage device address which remains in effect
until specifically recalled or changed via notification callbacks. until specifically recalled or changed via notification callbacks.
The lora_reclaim field set to TRUE in a LAYOUTRETURN request The lora_reclaim field set to TRUE in a LAYOUTRETURN request
specifies that the client is attempting to return a layout that was specifies that the client is attempting to return a layout that was
acquired before the reboot of the metadata server during the metadata acquired before the restart of the metadata server during the
server's grace period. When returning layouts that were acquired metadata server's grace period. When returning layouts that were
during the metadata server's grace period, the client MUST set the lora_reclaim acquired during the metadata server's grace period, the client MUST set the
field to FALSE. The lora_reclaim field MUST be set to FALSE also lora_reclaim field to FALSE. The lora_reclaim field MUST be set to
when lr_layoutreturn is LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL. See FALSE also when lr_layoutreturn is LAYOUTRETURN4_FSID or
LAYOUTCOMMIT (Section 18.42) for more details. LAYOUTRETURN4_ALL. See LAYOUTCOMMIT (Section 18.42) for more
details.
Layouts may be returned when recalled or voluntarily (i.e., before Layouts may be returned when recalled or voluntarily (i.e., before
the server has recalled them). In either case the client must the server has recalled them). In either case the client must
properly propagate state changed under the context of the layout to properly propagate state changed under the context of the layout to
the storage device(s) or to the metadata server before returning the the storage device(s) or to the metadata server before returning the
layout. layout.
If the client is returning the layout in response to a If the client is returning the layout in response to a
CB_LAYOUTRECALL where the lor_recalltype was LAYOUTRECALL4_FILE, the CB_LAYOUTRECALL where the lor_recalltype was LAYOUTRECALL4_FILE, the
client should use the lor_stateid value from CB_LAYOUTRECALL as client should use the lor_stateid value from CB_LAYOUTRECALL as
skipping to change at page 514, line 22 skipping to change at page 516, line 24
layout. See Section 12.5.5 for more details. layout. See Section 12.5.5 for more details.
If the LAYOUTRETURN request sets the lora_reclaim field to TRUE after If the LAYOUTRETURN request sets the lora_reclaim field to TRUE after
the metadata server's grace period, NFS4ERR_NO_GRACE is returned. the metadata server's grace period, NFS4ERR_NO_GRACE is returned.
If the LAYOUTRETURN request sets the lora_reclaim field to TRUE and If the LAYOUTRETURN request sets the lora_reclaim field to TRUE and
lr_returntype is set to LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL, lr_returntype is set to LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL,
NFS4ERR_INVAL is returned. NFS4ERR_INVAL is returned.
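The two lora_reclaim error cases above can be sketched as a single check. The function name is hypothetical, return types and status codes are modelled as strings, and the order in which the two conditions are tested is an assumption, since the text does not say which error wins when both apply.

```python
def layoutreturn_reclaim_status(lora_reclaim, lr_returntype, in_grace):
    """Return the status for the reclaim aspect of a LAYOUTRETURN request."""
    if not lora_reclaim:
        return "NFS4_OK"  # no reclaim-specific checks apply
    if lr_returntype in ("LAYOUTRETURN4_FSID", "LAYOUTRETURN4_ALL"):
        return "NFS4ERR_INVAL"  # reclaim is only valid for a single file
    if not in_grace:
        return "NFS4ERR_NO_GRACE"  # reclaim after the grace period has ended
    return "NFS4_OK"
```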
If the operation specified lr_returntype of LAYOUTRETURN4_FILE, then If the operation specified lr_returntype of LAYOUTRETURN4_FILE, then
the lorr_stateid will represent the layout stateid as updated for lrs_stateid will represent the layout stateid as updated for this
this operation's processing; the current stateid will also be updated operation's processing; the current stateid will also be updated to
to match the returned value. If the last byte of any layout for the match the returned value. If the last byte of any layout for the
current file, client ID, and layout type is being returned and there current file, client ID, and layout type is being returned and there
are not remaining pending CB_LAYOUTRECALL operations for which a are no remaining pending CB_LAYOUTRECALL operations for which a
LAYOUTRETURN operation must be done as a completing operation, this LAYOUTRETURN operation must be done as a completing operation,
stateid value may be the special stateid consisting of all zeros. lrs_present MUST be FALSE, and thus no stateid will be returned.
On success, the current filehandle retains its value. On success, the current filehandle retains its value.
The server MAY require that the principal, security flavor, and if The server MAY require that the principal, security flavor, and if
applicable, the GSS mechanism, combination that acquired the layout applicable, the GSS mechanism, combination that acquired the layout
also be the one to send LAYOUTRETURN. This might not be possible if also be the one to send LAYOUTRETURN. This might not be possible if
credentials for the principal are no longer available. The server credentials for the principal are no longer available. The server
MAY allow the machine credential or SSV credential (see MAY allow the machine credential or SSV credential (see
Section 18.35) to send LAYOUTRETURN. Section 18.35) to send LAYOUTRETURN.
skipping to change at page 518, line 28 skipping to change at page 520, line 44
a request outstanding for; it could be equal to sa_slotid. The a request outstanding for; it could be equal to sa_slotid. The
server returns two "highest_slotid" values: sr_highest_slotid, and server returns two "highest_slotid" values: sr_highest_slotid, and
sr_target_highest_slotid. The former is the highest slot id the sr_target_highest_slotid. The former is the highest slot id the
server will accept in a future SEQUENCE operation, and SHOULD NOT be server will accept in a future SEQUENCE operation, and SHOULD NOT be
less than the value of sa_highest_slotid (but see Section 2.10.5.1 less than the value of sa_highest_slotid (but see Section 2.10.5.1
for an exception). The latter is the highest slot id the server for an exception). The latter is the highest slot id the server
would prefer the client use on a future SEQUENCE operation. would prefer the client use on a future SEQUENCE operation.
If sa_cachethis is TRUE, then the client is requesting that the If sa_cachethis is TRUE, then the client is requesting that the
server cache the entire reply in the server's reply cache; therefore server cache the entire reply in the server's reply cache; therefore
the server MUST cache the reply (see Section 2.10.5.1.2). The server the server MUST cache the reply (see Section 2.10.5.1.3). The server
MAY cache the reply if sa_cachethis is FALSE. If the server does not MAY cache the reply if sa_cachethis is FALSE. If the server does not
cache the entire reply, it MUST still record that it executed the cache the entire reply, it MUST still record that it executed the
request at the specified slot and sequence id. request at the specified slot and sequence id.
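
The caching rule above can be sketched roughly as follows. This is an illustrative model only, not part of either draft: the Slot class, the record_reply function, and the server_caches_anyway parameter are hypothetical names. It shows that the server always records the sequence id for the slot, and retains the full reply when sa_cachethis is TRUE (or when it chooses to).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    """Hypothetical per-slot record kept by the server."""
    sequence_id: int = 0
    cached_reply: Optional[bytes] = None  # None: executed, reply not retained


def record_reply(slot: Slot, sequence_id: int, reply: bytes,
                 sa_cachethis: bool, server_caches_anyway: bool = False) -> None:
    """Record the outcome of a request executed on a slot.

    The full reply is kept when sa_cachethis is TRUE (MUST) or when the
    server opts to cache it (MAY). In every case the slot remembers that
    the request with this sequence id was executed.
    """
    slot.sequence_id = sequence_id
    if sa_cachethis or server_caches_anyway:
        slot.cached_reply = reply
    else:
        slot.cached_reply = None
```

A retransmission that finds a matching sequence id but no cached reply can then be recognized as already executed, even though the original reply is unavailable.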
The response to the SEQUENCE operation contains a word of status The response to the SEQUENCE operation contains a word of status
flags (sr_status_flags) that can provide to the client information flags (sr_status_flags) that can provide to the client information
related to the status of the client's lock state and communications related to the status of the client's lock state and communications
paths. Note that any status bits relating to lock state MAY be reset paths. Note that any status bits relating to lock state MAY be reset
when lock state is lost due to a server reboot (even if the session when lock state is lost due to a server restart (even if the session
is persistent across reboots; session persistence does not imply lock is persistent across restarts; session persistence does not imply
state persistence) or the establishment of a new client instance. lock state persistence) or the establishment of a new client
instance.
SEQ4_STATUS_CB_PATH_DOWN SEQ4_STATUS_CB_PATH_DOWN
When set, indicates that the client has no operational backchannel When set, indicates that the client has no operational backchannel
path for any session associated with the client ID, making it path for any session associated with the client ID, making it
necessary for the client to re-establish one. This bit remains necessary for the client to re-establish one. This bit remains
set on all SEQUENCE responses on all sessions associated with the set on all SEQUENCE responses on all sessions associated with the
client ID until at least one backchannel is available on any client ID until at least one backchannel is available on any
session associated with the client ID. If the client fails to re- session associated with the client ID. If the client fails to re-
establish a backchannel for the client ID, it is subject to having establish a backchannel for the client ID, it is subject to having
recallable state revoked. recallable state revoked.
skipping to change at page 520, line 34 skipping to change at page 522, line 50
SEQ4_STATUS_LEASE_MOVED SEQ4_STATUS_LEASE_MOVED
When set, indicates that responsibility for lease renewal has been When set, indicates that responsibility for lease renewal has been
transferred to one or more new servers. This condition will transferred to one or more new servers. This condition will
continue until the client receives an NFS4ERR_MOVED error and the continue until the client receives an NFS4ERR_MOVED error and the
server receives the subsequent GETATTR for the fs_locations or server receives the subsequent GETATTR for the fs_locations or
fs_locations_info attribute for an access to each file system for fs_locations_info attribute for an access to each file system for
which a lease has been moved to a new server. See which a lease has been moved to a new server. See
Section 11.7.7.1. Section 11.7.7.1.
SEQ4_STATUS_RESTART_RECLAIM_NEEDED SEQ4_STATUS_RESTART_RECLAIM_NEEDED
When set indicates that due to server restart or reboot the client When set indicates that due to server restart the
must reclaim locking state. Until the client sends a global client must reclaim locking state. Until the client sends a
RECLAIM_COMPLETE (Section 18.51, every SEQUENCE operation will global RECLAIM_COMPLETE (Section 18.51), every SEQUENCE operation
return SEQ4_STATUS_RESTART_RECLAIM_NEEDED. will return SEQ4_STATUS_RESTART_RECLAIM_NEEDED.
SEQ4_STATUS_BACKCHANNEL_FAULT SEQ4_STATUS_BACKCHANNEL_FAULT
The server has encountered an unrecoverable fault with the The server has encountered an unrecoverable fault with the
backchannel (e.g. it has lost track of the sequence id for a slot backchannel (e.g. it has lost track of the sequence id for a slot
in the backchannel). The client MUST stop sending more requests in the backchannel). The client MUST stop sending more requests
on the session's fore channel, wait for all outstanding requests on the session's fore channel, wait for all outstanding requests
to complete on the fore and back channel, and then destroy the to complete on the fore and back channel, and then destroy the
session. session.
SEQ4_STATUS_DEVID_CHANGED SEQ4_STATUS_DEVID_CHANGED
skipping to change at page 521, line 52 skipping to change at page 524, line 20
renewed (see Section 8.3), except if renewed (see Section 8.3), except if
SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED is returned in sr_status_flags. SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED is returned in sr_status_flags.
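
As a rough illustration of how a client might act on the sr_status_flags bits described above, the sketch below maps four of the flags to the recovery steps given in their descriptions. The numeric bit values, the _ACTIONS table, and the client_actions function are assumptions made for this example; the authoritative values are those in the protocol's XDR definition.

```python
from typing import List

# Assumed bit values, for illustration only; consult the XDR definition.
SEQ4_STATUS_CB_PATH_DOWN = 0x00000001
SEQ4_STATUS_LEASE_MOVED = 0x00000080
SEQ4_STATUS_RESTART_RECLAIM_NEEDED = 0x00000100
SEQ4_STATUS_BACKCHANNEL_FAULT = 0x00000400

# Hypothetical table pairing each flag with the action its description implies.
_ACTIONS = [
    (SEQ4_STATUS_BACKCHANNEL_FAULT,
     "stop sending, drain outstanding requests, destroy the session"),
    (SEQ4_STATUS_RESTART_RECLAIM_NEEDED,
     "reclaim locks, then send a global RECLAIM_COMPLETE"),
    (SEQ4_STATUS_CB_PATH_DOWN,
     "re-establish a backchannel for the client ID"),
    (SEQ4_STATUS_LEASE_MOVED,
     "fetch fs_locations for each file system that returns NFS4ERR_MOVED"),
]


def client_actions(sr_status_flags: int) -> List[str]:
    """Return the recovery steps implied by the flags set in a SEQUENCE reply."""
    return [action for bit, action in _ACTIONS if sr_status_flags & bit]
```

The ordering of the table (session-fatal conditions first) is a design choice of this sketch, not something the text prescribes.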
18.46.4. IMPLEMENTATION 18.46.4. IMPLEMENTATION
The server MUST maintain a mapping of sessionid to client ID in order The server MUST maintain a mapping of sessionid to client ID in order
to validate any operations that follow SEQUENCE that take a stateid to validate any operations that follow SEQUENCE that take a stateid
as an argument and/or result. as an argument and/or result.
If the client establishes a persistent session, then a SEQUENCE done If the client establishes a persistent session, then a SEQUENCE done
after a server reboot may encounter requests performed and recorded after a server restart may encounter requests performed and recorded
in a persistent reply cache before the server reboot. In this case, in a persistent reply cache before the server restart. In this case,
SEQUENCE will be processed successfully, while requests which were SEQUENCE will be processed successfully, while requests which were
not processed previously are rejected with NFS4ERR_DEADSESSION. not processed previously are rejected with NFS4ERR_DEADSESSION.
Depending on the operations within the COMPOUND successfully Depending on the operations within the COMPOUND successfully
performed before the server reboot, these operations will also have performed before the server restart, these operations will also have
replies sent from the server reply cache. Note that when these replies sent from the server reply cache. Note that when these
operations establish locking state it is locking state that applies operations establish locking state it is locking state that applies
to the previous server instance and to the previous client ID, even to the previous server instance and to the previous client ID, even
though the server reboot, which logically happened after these though the server restart, which logically happened after these
operations, eliminated that state. In the case of a partially operations, eliminated that state. In the case of a partially
executed COMPOUND, processing may reach an operation not processed executed COMPOUND, processing may reach an operation not processed
during the earlier server instance, making this operation a new one during the earlier server instance, making this operation a new one
and not performable on the existing session. In this case and not performable on the existing session. In this case
NFS4ERR_DEADSESSION will be returned from that operation. NFS4ERR_DEADSESSION will be returned from that operation.
18.47. Operation 54: SET_SSV - Update SSV for a Client ID 18.47. Operation 54: SET_SSV - Update SSV for a Client ID
18.47.1. ARGUMENT 18.47.1. ARGUMENT
skipping to change at page 525, line 48 skipping to change at page 528, line 20
o Special stateids are always considered invalid (they result in the o Special stateids are always considered invalid (they result in the
error code NFS4ERR_BAD_STATEID). error code NFS4ERR_BAD_STATEID).
All stateids are interpreted as being associated with the client for All stateids are interpreted as being associated with the client for
the current session. Any possible association with a previous the current session. Any possible association with a previous
instance of the client (as stale stateids) is not considered. instance of the client (as stale stateids) is not considered.
The errors which are validly returned within the status_code array The errors which are validly returned within the status_code array
are: NFS4ERR_OK, NFS4ERR_BAD_STATEID, NFS4ERR_OLD_STATEID, are: NFS4ERR_OK, NFS4ERR_BAD_STATEID, NFS4ERR_OLD_STATEID,
NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, and NFS4ERR_DELEG_REVOKED. NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, and NFS4ERR_DELEG_REVOKED.
[[Comment.5: _LAYOUT_REVOKED]]. [[Comment.4: _LAYOUT_REVOKED]].
18.48.4. IMPLEMENTATION 18.48.4. IMPLEMENTATION
See Section 8.2.2 and Section 8.2.4 for a discussion of stateid See Section 8.2.2 and Section 8.2.4 for a discussion of stateid
structure, lifetime, and validation. structure, lifetime, and validation.
18.49. Operation 56: WANT_DELEGATION - Request Delegation 18.49. Operation 56: WANT_DELEGATION - Request Delegation
18.49.1. ARGUMENT 18.49.1. ARGUMENT
skipping to change at page 530, line 38 skipping to change at page 533, line 26
}; };
18.51.2. RESULTS 18.51.2. RESULTS
struct RECLAIM_COMPLETE4res { struct RECLAIM_COMPLETE4res {
nfsstat4 rcr_status; nfsstat4 rcr_status;
}; };
18.51.3. DESCRIPTION 18.51.3. DESCRIPTION
A RECLAIM_COMPLETE operation must be used to indicate that the client A RECLAIM_COMPLETE operation is used to indicate that the client has
has reclaimed all of the locking state that it will recover, when it reclaimed all of the locking state that it will recover, when it is
is recovering state due to either a server restart or the transfer of recovering state due to either a server restart or the transfer of a
a file system to another server. There are two types of file system to another server. There are two types of
RECLAIM_COMPLETE operations: RECLAIM_COMPLETE operations:
o When one_fs is false, a global RECLAIM_COMPLETE is being done. o When rca_one_fs is FALSE, a global RECLAIM_COMPLETE is being done.
This indicates that recovery of all locks that the client held on This indicates that recovery of all locks that the client held on
the previous server instance has been completed. the previous server instance has been completed.
o When one_fs is true, a file system-specific RECLAIM_COMPLETE is o When rca_one_fs is TRUE, a file system-specific RECLAIM_COMPLETE
being done. This indicates that recovery of locks for a single fs is being done. This indicates that recovery of locks for a single
(the one designated by the current filehandle) due to a file fs (the one designated by the current filehandle) due to a file
system transition has been completed. Presence of a current system transition has been completed. Presence of a current
filehandle is only required when one_fs is true. filehandle is only required when rca_one_fs is true.
Once a RECLAIM_COMPLETE is done, there can be no further reclaim Once a RECLAIM_COMPLETE is done, there can be no further reclaim
operations for locks whose scope is defined as having completed operations for locks whose scope is defined as having completed
recovery. Once the client sends RECLAIM_COMPLETE, the server will recovery. Once the client sends RECLAIM_COMPLETE, the server will
not allow the client to do subsequent reclaims of locking state for not allow the client to do subsequent reclaims of locking state for
that scope and will return NFS4ERR_NO_GRACE, if these are attempted. that scope and will return NFS4ERR_NO_GRACE, if these are attempted.
Whenever a client establishes a new client ID and before it does the Whenever a client establishes a new client ID and before it does the
first non-reclaim operation that obtains a lock, it MUST do a global first non-reclaim operation that obtains a lock, it MUST do a global
RECLAIM_COMPLETE, even if there are no locks to reclaim. If non- RECLAIM_COMPLETE, even if there are no locks to reclaim. If non-
reclaim locking operations are done before the RECLAIM_COMPLETE, a reclaim locking operations are done before the RECLAIM_COMPLETE, a
NFS4ERR_GRACE will be returned. NFS4ERR_GRACE will be returned.
Similarly, when the client accesses a file system on a new server, Similarly, when the client accesses a file system on a new server,
before it sends the first non-reclaim operation that obtains a lock before it sends the first non-reclaim operation that obtains a lock
on this new server, it must do a RECLAIM_COMPLETE with one_fs true on this new server, it must do a RECLAIM_COMPLETE with rca_one_fs
and current filehandle within that file system, even if there are no true and current filehandle within that file system, even if there
locks to reclaim. If non-reclaim locking operations are done on that are no locks to reclaim. If non-reclaim locking operations are done
file system before the RECLAIM_COMPLETE, a NFS4ERR_GRACE will be on that file system before the RECLAIM_COMPLETE, a NFS4ERR_GRACE will
returned. be returned.
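
The interaction between RECLAIM_COMPLETE and locking requests described above can be summarized in a small sketch. It is a simplified model under stated assumptions: status codes are represented as strings, a "scope" is either the hypothetical key "global" or a file system identifier, and the end of the grace period is not modeled. ReclaimTracker and its methods are illustrative names, not protocol elements.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_GRACE = "NFS4ERR_GRACE"
NFS4ERR_NO_GRACE = "NFS4ERR_NO_GRACE"


class ReclaimTracker:
    """Hypothetical per-client record of scopes that finished reclaim."""

    def __init__(self):
        self.completed = set()

    def reclaim_complete(self, rca_one_fs, current_fs=None):
        """Handle RECLAIM_COMPLETE; rca_one_fs selects the scope."""
        if rca_one_fs:
            if current_fs is None:
                raise ValueError("a current filehandle is required when rca_one_fs is TRUE")
            self.completed.add(current_fs)
        else:
            self.completed.add("global")
        return NFS4_OK

    def lock_request(self, scope, is_reclaim):
        """Status for a locking request in the given scope."""
        done = scope in self.completed
        if is_reclaim:
            # Reclaims are refused once the scope has completed recovery.
            return NFS4ERR_NO_GRACE if done else NFS4_OK
        # Non-reclaim locks are refused until RECLAIM_COMPLETE is sent.
        return NFS4_OK if done else NFS4ERR_GRACE
```

For example, a client with a new client ID and no locks to reclaim would still call reclaim_complete(False) before its first non-reclaim lock request.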
Any locks not reclaimed at the point at which RECLAIM_COMPLETE is Any locks not reclaimed at the point at which RECLAIM_COMPLETE is
done become non-reclaimable. The client MUST NOT attempt to reclaim done become non-reclaimable. The client MUST NOT attempt to reclaim
them, either during the current server instance or in any subsequent them, either during the current server instance or in any subsequent
server instance, or on another server to which responsibility for server instance, or on another server to which responsibility for
that file system is transferred. If the client were to do so, it that file system is transferred. If the client were to do so, it
would be violating the protocol by representing itself as owning would be violating the protocol by representing itself as owning
locks that it does not own, and so has no right to reclaim. See locks that it does not own, and so has no right to reclaim. See
Section 8.4.3 for a discussion of edge conditions related to lock Section 8.4.3 for a discussion of edge conditions related to lock
reclaim. reclaim.
skipping to change at page 549, line 49 skipping to change at page 551, line 49
The server may decide that it cannot hold all of the state for The server may decide that it cannot hold all of the state for
recallable objects, such as delegations and layouts, without running recallable objects, such as delegations and layouts, without running
out of resources. In such a case, it is free to recall individual out of resources. In such a case, it is free to recall individual
objects to reduce the load but this would be far from optimal. objects to reduce the load but this would be far from optimal.
Because the general purpose of such recallable objects as delegations Because the general purpose of such recallable objects as delegations
is to eliminate client interaction with the server, the server cannot is to eliminate client interaction with the server, the server cannot
interpret lack of recent use as indicating that the object is no interpret lack of recent use as indicating that the object is no
longer useful. The absence of visible use may be the result of a longer useful. The absence of visible use may be the result of a
large number of potential operations eliminated. In the case of large number of potential operations eliminated. In the case of
layouts, the layout will be used explicitly but the meta-data server layouts, the layout will be used explicitly but the metadata server
does not have direct knowledge of such use. does not have direct knowledge of such use.
In order to implement an effective reclaim scheme for such objects, In order to implement an effective reclaim scheme for such objects,
the server's knowledge of available resources must be used to the server's knowledge of available resources must be used to
determine when objects must be recalled with the clients selecting determine when objects must be recalled with the clients selecting
the actual objects to be returned. the actual objects to be returned.
Server implementations may differ in their resource allocation Server implementations may differ in their resource allocation
requirements. For example, one server may share resources among all requirements. For example, one server may share resources among all
classes of recallable objects whereas another may use separate classes of recallable objects whereas another may use separate
skipping to change at page 553, line 9 skipping to change at page 555, line 9
slots, and if applicable, transport credits (e.g. RDMA credits for slots, and if applicable, transport credits (e.g. RDMA credits for
connections associated with the operations channel) to the server. connections associated with the operations channel) to the server.
CB_RECALL_SLOT specifies rsa_target_highest_slotid, the target CB_RECALL_SLOT specifies rsa_target_highest_slotid, the target
highest_slot the server wants for the session. The client should highest_slot the server wants for the session. The client should
then work toward reducing the highest_slot to the target. then work toward reducing the highest_slot to the target.
If the session has only non-RDMA connections associated with its If the session has only non-RDMA connections associated with its
operations channel, then the client need only wait for all operations channel, then the client need only wait for all
outstanding requests with a slotid > rsa_target_highest_slotid to outstanding requests with a slotid > rsa_target_highest_slotid to
complete, then send a single COMPOUND consisting of a single SEQUENCE complete, then send a single COMPOUND consisting of a single SEQUENCE
operation, with the sa_highslot field set to operation, with the sa_highestslot field set to
rsa_target_highest_slotid. If there are RDMA-based connections rsa_target_highest_slotid. If there are RDMA-based connections
associated with the operations channel, then the client also needs to send associated with the operations channel, then the client also needs to send
enough zero-length RDMA Sends to take the total RDMA credit count to enough zero-length RDMA Sends to take the total RDMA credit count to
rsa_target_highest_slotid + 1 or below. rsa_target_highest_slotid + 1 or below.
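
For the non-RDMA case, the client's response to CB_RECALL_SLOT might be planned as in the sketch below. The recall_slot_plan function and the shape of its result are hypothetical; the field name sa_highest_slotid follows the SEQUENCE description, and RDMA credit handling is deliberately left out.

```python
from typing import Dict, Iterable, List, Union


def recall_slot_plan(outstanding_slotids: Iterable[int],
                     rsa_target_highest_slotid: int
                     ) -> Dict[str, Union[List[int], int]]:
    """Plan the client's reaction to CB_RECALL_SLOT (non-RDMA case only).

    Returns the slot ids whose outstanding requests must complete first,
    and the value to place in sa_highest_slotid of the follow-up SEQUENCE.
    """
    wait_for = sorted(s for s in outstanding_slotids
                      if s > rsa_target_highest_slotid)
    return {
        "wait_for": wait_for,
        "sa_highest_slotid": rsa_target_highest_slotid,
    }
```

With requests outstanding on slots 1, 5, and 9 and a target of 4, the client would wait for slots 5 and 9 to complete before sending the single SEQUENCE operation.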
20.8.4. IMPLEMENTATION 20.8.4. IMPLEMENTATION
If the client fails to reduce the highest slot it has on the fore channel If the client fails to reduce the highest slot it has on the fore channel
to what the server requests, the server can force the issue by to what the server requests, the server can force the issue by
asserting flow control on the receive side of all connections bound asserting flow control on the receive side of all connections bound
skipping to change at page 554, line 36 skipping to change at page 556, line 36
contents include the session to which this request belongs, slot id contents include the session to which this request belongs, slot id
and sequence id used by the server to implement session request and sequence id used by the server to implement session request
control and exactly once semantics, and exchanged slot maximums which control and exactly once semantics, and exchanged slot maximums which
are used to adjust the size of the reply cache. This operation MUST are used to adjust the size of the reply cache. This operation MUST
appear once as the first operation in each CB_COMPOUND request or a appear once as the first operation in each CB_COMPOUND request or a
protocol error must result. See Section 18.46.3 for a description of protocol error must result. See Section 18.46.3 for a description of
how slots are processed. how slots are processed.
If csa_cachethis is TRUE, then the server is requesting that the If csa_cachethis is TRUE, then the server is requesting that the
client cache the reply in the callback reply cache. The client MUST client cache the reply in the callback reply cache. The client MUST
cache the reply (see Section 2.10.5.1.2). cache the reply (see Section 2.10.5.1.3).
The csa_referring_call_lists array is the list of COMPOUND requests, The csa_referring_call_lists array is the list of COMPOUND requests,
identified by sessionid, slot id and sequence id. These are requests identified by sessionid, slot id and sequence id. These are requests
that the client previously sent to the server. These previous that the client previously sent to the server. These previous
requests created state that some operation(s) in the same requests created state that some operation(s) in the same
CB_COMPOUND as the csa_referring_call_lists are identifying. A CB_COMPOUND as the csa_referring_call_lists are identifying. A
sessionid is included because leased state is tied to a client ID, sessionid is included because leased state is tied to a client ID,
and a client ID can have multiple sessions. See Section 2.10.5.3. and a client ID can have multiple sessions. See Section 2.10.5.3.
The value of csa_sequenceid argument relative to the cached sequence The value of csa_sequenceid argument relative to the cached sequence
skipping to change at page 571, line 23 skipping to change at page 573, line 23
Phone: +1-512-401-1080 Phone: +1-512-401-1080
Email: spencer.shepler@sun.com Email: spencer.shepler@sun.com
Mike Eisler Mike Eisler
NetApp NetApp
5765 Chase Point Circle 5765 Chase Point Circle
Colorado Springs, CO 80919 Colorado Springs, CO 80919
USA USA
Phone: +1-719-599-9026 Phone: +1-719-599-9026
Email: email2mre-@yahoo.com Email: mike@eisler.com
URI: Insert ietf2 between the - and @ symbols in the above address URI: http://www.eisler.com
David Noveck David Noveck
NetApp NetApp
1601 Trapelo Road, Suite 16 1601 Trapelo Road, Suite 16
Waltham, MA 02454 Waltham, MA 02454
USA USA
Phone: +1-781-768-5347 Phone: +1-781-768-5347
Email: dnoveck@netapp.com Email: dnoveck@netapp.com
 End of changes. 230 change blocks. 
693 lines changed or deleted 801 lines changed or added

This html diff was produced by rfcdiff 1.34. The latest version is available from http://tools.ietf.org/tools/rfcdiff/