draft-ietf-nfsv4-minorversion1-12.txt   draft-ietf-nfsv4-minorversion1-13.txt 
NFSv4 S. Shepler NFSv4 S. Shepler
Internet-Draft M. Eisler Internet-Draft M. Eisler
Intended status: Standards Track D. Noveck Intended status: Standards Track D. Noveck
Expires: December 3, 2007 Editors Expires: January 2, 2008 Editors
NFSv4 Minor Version 1 NFSv4 Minor Version 1
draft-ietf-nfsv4-minorversion1-12.txt draft-ietf-nfsv4-minorversion1-13.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 33 skipping to change at page 1, line 33
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on December 3, 2007. This Internet-Draft will expire on January 2, 2008.
Copyright Notice Copyright Notice
Copyright (C) The IETF Trust (2007). Copyright (C) The IETF Trust (2007).
Abstract Abstract
This Internet-Draft describes NFSv4 minor version one, including This Internet-Draft describes NFSv4 minor version one, including
features retained from the base protocol and protocol extensions made features retained from the base protocol and protocol extensions made
subsequently. The current draft includes description of the major subsequently. The current draft includes description of the major
skipping to change at page 3, line 8 skipping to change at page 3, line 8
2.9.1. Required and Recommended Properties of Transports . 35 2.9.1. Required and Recommended Properties of Transports . 35
2.9.2. Client and Server Transport Behavior . . . . . . . . 36 2.9.2. Client and Server Transport Behavior . . . . . . . . 36
2.9.3. Ports . . . . . . . . . . . . . . . . . . . . . . . 37 2.9.3. Ports . . . . . . . . . . . . . . . . . . . . . . . 37
2.10. Session . . . . . . . . . . . . . . . . . . . . . . . . 37 2.10. Session . . . . . . . . . . . . . . . . . . . . . . . . 37
2.10.1. Motivation and Overview . . . . . . . . . . . . . . 37 2.10.1. Motivation and Overview . . . . . . . . . . . . . . 37
2.10.2. NFSv4 Integration . . . . . . . . . . . . . . . . . 38 2.10.2. NFSv4 Integration . . . . . . . . . . . . . . . . . 38
2.10.3. Channels . . . . . . . . . . . . . . . . . . . . . . 40 2.10.3. Channels . . . . . . . . . . . . . . . . . . . . . . 40
2.10.4. Trunking . . . . . . . . . . . . . . . . . . . . . . 41 2.10.4. Trunking . . . . . . . . . . . . . . . . . . . . . . 41
2.10.5. Exactly Once Semantics . . . . . . . . . . . . . . . 44 2.10.5. Exactly Once Semantics . . . . . . . . . . . . . . . 44
2.10.6. RDMA Considerations . . . . . . . . . . . . . . . . 56 2.10.6. RDMA Considerations . . . . . . . . . . . . . . . . 56
2.10.7. Sessions Security . . . . . . . . . . . . . . . . . 58 2.10.7. Sessions Security . . . . . . . . . . . . . . . . . 59
2.10.8. Session Mechanics - Steady State . . . . . . . . . . 67 2.10.8. Session Mechanics - Steady State . . . . . . . . . . 67
2.10.9. Session Mechanics - Recovery . . . . . . . . . . . . 69 2.10.9. Session Mechanics - Recovery . . . . . . . . . . . . 69
2.10.10. Parallel NFS and Sessions . . . . . . . . . . . . . 72 2.10.10. Parallel NFS and Sessions . . . . . . . . . . . . . 72
3. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 72 3. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 72
3.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 72 3.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 72
3.2. Structured Data Types . . . . . . . . . . . . . . . . . 74 3.2. Structured Data Types . . . . . . . . . . . . . . . . . 74
4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 83 4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 84 4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 84
4.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 84 4.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 84
4.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 84 4.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 85
4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 85 4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 85
4.2.1. General Properties of a Filehandle . . . . . . . . . 85 4.2.1. General Properties of a Filehandle . . . . . . . . . 85
4.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 86 4.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 86
4.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 86 4.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 86
4.3. One Method of Constructing a Volatile Filehandle . . . . 87 4.3. One Method of Constructing a Volatile Filehandle . . . . 88
4.4. Client Recovery from Filehandle Expiration . . . . . . . 88 4.4. Client Recovery from Filehandle Expiration . . . . . . . 88
5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 89 5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 89
5.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 90 5.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 90
5.2. Recommended Attributes . . . . . . . . . . . . . . . . . 90 5.2. Recommended Attributes . . . . . . . . . . . . . . . . . 91
5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 91 5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 91
5.4. Classification of Attributes . . . . . . . . . . . . . . 91 5.4. Classification of Attributes . . . . . . . . . . . . . . 92
5.5. Mandatory Attributes - Definitions . . . . . . . . . . . 93 5.5. Mandatory Attributes - Definitions . . . . . . . . . . . 93
5.6. Recommended Attributes - Definitions . . . . . . . . . . 94 5.6. Recommended Attributes - Definitions . . . . . . . . . . 94
5.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 104 5.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 104
5.8. Interpreting owner and owner_group . . . . . . . . . . . 105 5.8. Interpreting owner and owner_group . . . . . . . . . . . 105
5.9. Character Case Attributes . . . . . . . . . . . . . . . 107 5.9. Character Case Attributes . . . . . . . . . . . . . . . 107
5.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 107 5.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 107
5.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 108 5.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 108
5.12. Directory Notification Attributes . . . . . . . . . . . 109 5.12. Directory Notification Attributes . . . . . . . . . . . 109
5.12.1. dir_notif_delay . . . . . . . . . . . . . . . . . . 109 5.12.1. dir_notif_delay . . . . . . . . . . . . . . . . . . 109
5.12.2. dirent_notif_delay . . . . . . . . . . . . . . . . . 109 5.12.2. dirent_notif_delay . . . . . . . . . . . . . . . . . 109
skipping to change at page 4, line 14 skipping to change at page 4, line 14
6.2.2. dacl and sacl Attributes . . . . . . . . . . . . . . 127 6.2.2. dacl and sacl Attributes . . . . . . . . . . . . . . 127
6.2.3. mode Attribute . . . . . . . . . . . . . . . . . . . 127 6.2.3. mode Attribute . . . . . . . . . . . . . . . . . . . 127
6.2.4. mode_set_masked Attribute . . . . . . . . . . . . . 128 6.2.4. mode_set_masked Attribute . . . . . . . . . . . . . 128
6.3. Common Methods . . . . . . . . . . . . . . . . . . . . . 129 6.3. Common Methods . . . . . . . . . . . . . . . . . . . . . 129
6.3.1. Interpreting an ACL . . . . . . . . . . . . . . . . 129 6.3.1. Interpreting an ACL . . . . . . . . . . . . . . . . 129
6.3.2. Computing a Mode Attribute from an ACL . . . . . . . 130 6.3.2. Computing a Mode Attribute from an ACL . . . . . . . 130
6.4. Requirements . . . . . . . . . . . . . . . . . . . . . . 131 6.4. Requirements . . . . . . . . . . . . . . . . . . . . . . 131
6.4.1. Setting the mode and/or ACL Attributes . . . . . . . 132 6.4.1. Setting the mode and/or ACL Attributes . . . . . . . 132
6.4.2. Retrieving the mode and/or ACL Attributes . . . . . 133 6.4.2. Retrieving the mode and/or ACL Attributes . . . . . 133
6.4.3. Creating New Objects . . . . . . . . . . . . . . . . 134 6.4.3. Creating New Objects . . . . . . . . . . . . . . . . 134
7. Single-server Name Space . . . . . . . . . . . . . . . . . . 138 7. Single-server Namespace . . . . . . . . . . . . . . . . . . . 138
7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 138 7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 138
7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 138 7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 138
7.3. Server Pseudo File System . . . . . . . . . . . . . . . 139 7.3. Server Pseudo File System . . . . . . . . . . . . . . . 139
7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 139 7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 139
7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 139 7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 140
7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 140 7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 140
7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 140 7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 140
7.8. Security Policy and Name Space Presentation . . . . . . 141 7.8. Security Policy and Namespace Presentation . . . . . . . 141
8. File Locking and Share Reservations . . . . . . . . . . . . . 141 8. State Management . . . . . . . . . . . . . . . . . . . . . . 142
8.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 142 8.1. Client and Session ID . . . . . . . . . . . . . . . . . 142
8.1.1. Client and Session ID . . . . . . . . . . . . . . . 142 8.2. Stateid Definition . . . . . . . . . . . . . . . . . . . 143
8.1.2. State-owner Definition . . . . . . . . . . . . . . . 142 8.2.1. Stateid Types . . . . . . . . . . . . . . . . . . . 143
8.1.3. Stateid Definition . . . . . . . . . . . . . . . . . 143 8.2.2. Stateid Structure . . . . . . . . . . . . . . . . . 144
8.1.4. Use of the Stateid and Locking . . . . . . . . . . . 147 8.2.3. Special Stateids . . . . . . . . . . . . . . . . . . 145
8.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 149 8.2.4. Stateid Lifetime and Validation . . . . . . . . . . 146
8.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 150 8.3. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 148
8.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 150 8.4. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 149
8.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 151 8.4.1. Client Failure and Recovery . . . . . . . . . . . . 149
8.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 152 8.4.2. Server Failure and Recovery . . . . . . . . . . . . 150
8.6.1. Client Failure and Recovery . . . . . . . . . . . . 152 8.4.3. Network Partitions and Recovery . . . . . . . . . . 154
8.6.2. Server Failure and Recovery . . . . . . . . . . . . 152 8.5. Server Revocation of Locks . . . . . . . . . . . . . . . 158
8.6.3. Network Partitions and Recovery . . . . . . . . . . 156 8.6. Short and Long Leases . . . . . . . . . . . . . . . . . 159
8.7. Server Revocation of Locks . . . . . . . . . . . . . . . 160 8.7. Clocks, Propagation Delay, and Calculating Lease
8.8. Share Reservations . . . . . . . . . . . . . . . . . . . 161 Expiration . . . . . . . . . . . . . . . . . . . . . . . 159
8.9. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 162 8.8. Vestigial Locking Infrastructure From V4.0 . . . . . . . 160
8.10. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 162 9. File Locking and Share Reservations . . . . . . . . . . . . . 161
8.11. Short and Long Leases . . . . . . . . . . . . . . . . . 163 9.1. Opens and Byte-range Locks . . . . . . . . . . . . . . . 161
8.12. Clocks, Propagation Delay, and Calculating Lease 9.1.1. State-owner Definition . . . . . . . . . . . . . . . 161
Expiration . . . . . . . . . . . . . . . . . . . . . . . 164 9.1.2. Use of the Stateid and Locking . . . . . . . . . . . 162
8.13. Vestigial Locking Infrastructure From V4.0 . . . . . . . 164 9.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 165
9. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 165 9.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 165
9.1. Performance Challenges for Client-Side Caching . . . . . 166 9.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 166
9.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 166 9.5. Share Reservations . . . . . . . . . . . . . . . . . . . 167
9.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 168 9.6. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 167
9.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 170 9.7. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 168
9.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 170 9.8. Reclaim of Open and Byte-range Locks . . . . . . . . . . 169
9.3.2. Data Caching and File Locking . . . . . . . . . . . 171 10. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 169
9.3.3. Data Caching and Mandatory File Locking . . . . . . 173 10.1. Performance Challenges for Client-Side Caching . . . . . 170
9.3.4. Data Caching and File Identity . . . . . . . . . . . 173 10.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 171
9.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 174 10.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 172
9.4.1. Open Delegation and Data Caching . . . . . . . . . . 177 10.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 174
9.4.2. Open Delegation and File Locks . . . . . . . . . . . 178 10.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 175
9.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 178 10.3.2. Data Caching and File Locking . . . . . . . . . . . 176
9.4.4. Recall of Open Delegation . . . . . . . . . . . . . 181 10.3.3. Data Caching and Mandatory File Locking . . . . . . 177
9.4.5. Clients that Fail to Honor Delegation Recalls . . . 183 10.3.4. Data Caching and File Identity . . . . . . . . . . . 178
9.4.6. Delegation Revocation . . . . . . . . . . . . . . . 184 10.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 179
9.5. Data Caching and Revocation . . . . . . . . . . . . . . 184 10.4.1. Open Delegation and Data Caching . . . . . . . . . . 181
9.5.1. Revocation Recovery for Write Open Delegation . . . 185 10.4.2. Open Delegation and File Locks . . . . . . . . . . . 182
9.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 186 10.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 183
9.7. Data and Metadata Caching and Memory Mapped Files . . . 188 10.4.4. Recall of Open Delegation . . . . . . . . . . . . . 186
9.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 190 10.4.5. Clients that Fail to Honor Delegation Recalls . . . 188
9.9. Directory Caching . . . . . . . . . . . . . . . . . . . 191 10.4.6. Delegation Revocation . . . . . . . . . . . . . . . 189
10. Multi-Server Name Space . . . . . . . . . . . . . . . . . . . 192 10.4.7. Delegations via WANT_DELEGATION . . . . . . . . . . 189
10.1. Location attributes . . . . . . . . . . . . . . . . . . 192 10.5. Data Caching and Revocation . . . . . . . . . . . . . . 189
10.2. File System Presence or Absence . . . . . . . . . . . . 192 10.5.1. Revocation Recovery for Write Open Delegation . . . 190
10.3. Getting Attributes for an Absent File System . . . . . . 194 10.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 191
10.3.1. GETATTR Within an Absent File System . . . . . . . . 194 10.7. Data and Metadata Caching and Memory Mapped Files . . . 193
10.3.2. READDIR and Absent File Systems . . . . . . . . . . 195 10.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 195
10.4. Uses of Location Information . . . . . . . . . . . . . . 196 10.9. Directory Caching . . . . . . . . . . . . . . . . . . . 196
10.4.1. File System Replication . . . . . . . . . . . . . . 196 11. Multi-Server Namespace . . . . . . . . . . . . . . . . . . . 197
10.4.2. File System Migration . . . . . . . . . . . . . . . 198 11.1. Location attributes . . . . . . . . . . . . . . . . . . 197
10.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 199 11.2. File System Presence or Absence . . . . . . . . . . . . 197
10.5. Additional Client-side Considerations . . . . . . . . . 200 11.3. Getting Attributes for an Absent File System . . . . . . 199
10.6. Effecting File System Transitions . . . . . . . . . . . 201 11.3.1. GETATTR Within an Absent File System . . . . . . . . 199
10.6.1. File System Transitions and Simultaneous Access . . 202 11.3.2. READDIR and Absent File Systems . . . . . . . . . . 200
10.6.2. Simultaneous Use and Transparent Transitions . . . . 203 11.4. Uses of Location Information . . . . . . . . . . . . . . 201
10.6.3. Filehandles and File System Transitions . . . . . . 205 11.4.1. File System Replication . . . . . . . . . . . . . . 201
10.6.4. Fileid's and File System Transitions . . . . . . . . 205 11.4.2. File System Migration . . . . . . . . . . . . . . . 203
10.6.5. Fsids and File System Transitions . . . . . . . . . 206 11.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 204
10.6.6. The Change Attribute and File System Transitions . . 206 11.5. Additional Client-side Considerations . . . . . . . . . 205
10.6.7. Lock State and File System Transitions . . . . . . . 207 11.6. Effecting File System Transitions . . . . . . . . . . . 206
10.6.8. Write Verifiers and File System Transitions . . . . 211 11.6.1. File System Transitions and Simultaneous Access . . 207
10.7. Effecting File System Referrals . . . . . . . . . . . . 211 11.6.2. Simultaneous Use and Transparent Transitions . . . . 208
10.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 211 11.6.3. Filehandles and File System Transitions . . . . . . 210
10.7.2. Referral Example (READDIR) . . . . . . . . . . . . . 215 11.6.4. Fileid's and File System Transitions . . . . . . . . 210
10.8. The Attribute fs_absent . . . . . . . . . . . . . . . . 217 11.6.5. Fsids and File System Transitions . . . . . . . . . 211
10.9. The Attribute fs_locations . . . . . . . . . . . . . . . 218 11.6.6. The Change Attribute and File System Transitions . . 211
10.10. The Attribute fs_locations_info . . . . . . . . . . . . 220 11.6.7. Lock State and File System Transitions . . . . . . . 212
10.10.1. The fs_locations_server4 Structure . . . . . . . . . 222 11.6.8. Write Verifiers and File System Transitions . . . . 216
10.10.2. The fs_locations_info4 Structure . . . . . . . . . . 227 11.7. Effecting File System Referrals . . . . . . . . . . . . 216
10.10.3. The fs_locations_item4 Structure . . . . . . . . . . 228 11.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 216
10.11. The Attribute fs_status . . . . . . . . . . . . . . . . 229 11.7.2. Referral Example (READDIR) . . . . . . . . . . . . . 220
11. Directory Delegations . . . . . . . . . . . . . . . . . . . . 233 11.8. The Attribute fs_absent . . . . . . . . . . . . . . . . 223
11.1. Introduction to Directory Delegations . . . . . . . . . 233 11.9. The Attribute fs_locations . . . . . . . . . . . . . . . 223
11.2. Directory Delegation Design . . . . . . . . . . . . . . 234 11.10. The Attribute fs_locations_info . . . . . . . . . . . . 225
11.3. Attributes in Support of Directory Notifications . . . . 235 11.10.1. The fs_locations_server4 Structure . . . . . . . . . 228
11.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 235 11.10.2. The fs_locations_info4 Structure . . . . . . . . . . 233
11.5. Directory Delegation Recovery . . . . . . . . . . . . . 235 11.10.3. The fs_locations_item4 Structure . . . . . . . . . . 234
12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 235 11.11. The Attribute fs_status . . . . . . . . . . . . . . . . 235
12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 235 12. Directory Delegations . . . . . . . . . . . . . . . . . . . . 239
12.2. pNFS Definitions . . . . . . . . . . . . . . . . . . . . 237 12.1. Introduction to Directory Delegations . . . . . . . . . 239
12.2.1. Metadata . . . . . . . . . . . . . . . . . . . . . . 237 12.2. Directory Delegation Design . . . . . . . . . . . . . . 240
12.2.2. Metadata Server . . . . . . . . . . . . . . . . . . 237 12.3. Attributes in Support of Directory Notifications . . . . 241
12.2.3. pNFS Client . . . . . . . . . . . . . . . . . . . . 238 12.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 241
12.2.4. Storage Device . . . . . . . . . . . . . . . . . . . 238 12.5. Directory Delegation Recovery . . . . . . . . . . . . . 241
12.2.5. Storage Protocol . . . . . . . . . . . . . . . . . . 238 13. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 241
12.2.6. Control Protocol . . . . . . . . . . . . . . . . . . 238 13.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 241
12.2.7. Layout Types . . . . . . . . . . . . . . . . . . . . 238 13.2. pNFS Definitions . . . . . . . . . . . . . . . . . . . . 243
12.2.8. Layout . . . . . . . . . . . . . . . . . . . . . . . 239 13.2.1. Metadata . . . . . . . . . . . . . . . . . . . . . . 243
12.2.9. Layout Iomode . . . . . . . . . . . . . . . . . . . 239 13.2.2. Metadata Server . . . . . . . . . . . . . . . . . . 243
12.2.10. Device IDs . . . . . . . . . . . . . . . . . . . . . 240 13.2.3. pNFS Client . . . . . . . . . . . . . . . . . . . . 244
12.3. pNFS Operations . . . . . . . . . . . . . . . . . . . . 240 13.2.4. Storage Device . . . . . . . . . . . . . . . . . . . 244
12.4. pNFS Attributes . . . . . . . . . . . . . . . . . . . . 241 13.2.5. Storage Protocol . . . . . . . . . . . . . . . . . . 244
12.5. Layout Semantics . . . . . . . . . . . . . . . . . . . . 241 13.2.6. Control Protocol . . . . . . . . . . . . . . . . . . 244
12.5.1. Guarantees Provided by Layouts . . . . . . . . . . . 242 13.2.7. Layout Types . . . . . . . . . . . . . . . . . . . . 244
12.5.2. Getting a Layout . . . . . . . . . . . . . . . . . . 243 13.2.8. Layout . . . . . . . . . . . . . . . . . . . . . . . 245
12.5.3. Committing a Layout . . . . . . . . . . . . . . . . 244 13.2.9. Layout Iomode . . . . . . . . . . . . . . . . . . . 245
12.5.4. Recalling a Layout . . . . . . . . . . . . . . . . . 247 13.2.10. Device IDs . . . . . . . . . . . . . . . . . . . . . 246
12.5.5. Metadata Server Write Propagation . . . . . . . . . 253 13.3. pNFS Operations . . . . . . . . . . . . . . . . . . . . 246
12.6. pNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 253 13.4. pNFS Attributes . . . . . . . . . . . . . . . . . . . . 247
12.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 254 13.5. Layout Semantics . . . . . . . . . . . . . . . . . . . . 247
12.7.1. Client Recovery . . . . . . . . . . . . . . . . . . 255 13.5.1. Guarantees Provided by Layouts . . . . . . . . . . . 248
12.7.2. Dealing with Lease Expiration on the Client . . . . 255 13.5.2. Getting a Layout . . . . . . . . . . . . . . . . . . 249
12.7.3. Dealing with Loss of Layout State on the Metadata 13.5.3. Committing a Layout . . . . . . . . . . . . . . . . 250
Server . . . . . . . . . . . . . . . . . . . . . . . 257 13.5.4. Recalling a Layout . . . . . . . . . . . . . . . . . 253
12.7.4. Recovery from Metadata Server Restart . . . . . . . 257 13.5.5. Metadata Server Write Propagation . . . . . . . . . 259
12.7.5. Operations During Metadata Server Grace Period . . . 259 13.6. pNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 259
12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 260 13.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 260
12.8. Metadata and Storage Device Roles . . . . . . . . . . . 260 13.7.1. Client Recovery . . . . . . . . . . . . . . . . . . 261
12.9. Security Considerations . . . . . . . . . . . . . . . . 262 13.7.2. Dealing with Lease Expiration on the Client . . . . 261
13. PNFS: NFSv4.1 File Layout Type . . . . . . . . . . . . . . . 263 13.7.3. Dealing with Loss of Layout State on the Metadata
13.1. Client ID and Session Considerations . . . . . . . . . . 263 Server . . . . . . . . . . . . . . . . . . . . . . . 263
13.2. File Layout Definitions . . . . . . . . . . . . . . . . 264 13.7.4. Recovery from Metadata Server Restart . . . . . . . 263
13.3. File Layout Data Types . . . . . . . . . . . . . . . . . 265 13.7.5. Operations During Metadata Server Grace Period . . . 265
13.4. Interpreting the File Layout . . . . . . . . . . . . . . 268 13.7.6. Storage Device Recovery . . . . . . . . . . . . . . 266
13.5. Sparse and Dense Stripe Unit Packing . . . . . . . . . . 270 13.8. Metadata and Storage Device Roles . . . . . . . . . . . 266
13.6. Data Server Multipathing . . . . . . . . . . . . . . . . 271 13.9. Security Considerations . . . . . . . . . . . . . . . . 268
13.7. Operations Issued to NFSv4.1 Data Servers . . . . . . . 272 14. PNFS: NFSv4.1 File Layout Type . . . . . . . . . . . . . . . 269
13.8. COMMIT Through Metadata Server . . . . . . . . . . . . . 272 14.1. Client ID and Session Considerations . . . . . . . . . . 269
13.9. The Layout Iomode . . . . . . . . . . . . . . . . . . . 273 14.2. File Layout Definitions . . . . . . . . . . . . . . . . 270
13.10. Metadata and Data Server State Coordination . . . . . . 274 14.3. File Layout Data Types . . . . . . . . . . . . . . . . . 271
13.10.1. Global Stateid Requirements . . . . . . . . . . . . 274 14.4. Interpreting the File Layout . . . . . . . . . . . . . . 274
13.10.2. Data Server State Propagation . . . . . . . . . . . 274 14.5. Sparse and Dense Stripe Unit Packing . . . . . . . . . . 276
13.11. Data Server Component File Size . . . . . . . . . . . . 276 14.6. Data Server Multipathing . . . . . . . . . . . . . . . . 277
13.12. Recovery from Loss of Layout . . . . . . . . . . . . . . 277 14.7. Operations Issued to NFSv4.1 Data Servers . . . . . . . 278
13.13. Security Considerations for the File Layout Type . . . . 278 14.8. COMMIT Through Metadata Server . . . . . . . . . . . . . 279
14. Internationalization . . . . . . . . . . . . . . . . . . . . 278 14.9. The Layout Iomode . . . . . . . . . . . . . . . . . . . 280
14.1. Stringprep profile for the utf8str_cs type . . . . . . . 279 14.10. Metadata and Data Server State Coordination . . . . . . 280
14.2. Stringprep profile for the utf8str_cis type . . . . . . 281 14.10.1. Global Stateid Requirements . . . . . . . . . . . . 280
14.3. Stringprep profile for the utf8str_mixed type . . . . . 282 14.10.2. Data Server State Propagation . . . . . . . . . . . 280
14.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 284 14.11. Data Server Component File Size . . . . . . . . . . . . 283
15. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 284 14.12. Recovery from Loss of Layout . . . . . . . . . . . . . . 283
15.1. Error Definitions . . . . . . . . . . . . . . . . . . . 284 14.13. Security Considerations for the File Layout Type . . . . 284
15.2. Operations and their valid errors . . . . . . . . . . . 299 15. Internationalization . . . . . . . . . . . . . . . . . . . . 284
15.3. Callback operations and their valid errors . . . . . . . 313 15.1. Stringprep profile for the utf8str_cs type . . . . . . . 286
15.4. Errors and the operations that use them . . . . . . . . 314 15.2. Stringprep profile for the utf8str_cis type . . . . . . 287
16. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 321 15.3. Stringprep profile for the utf8str_mixed type . . . . . 289
16.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 321 15.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 290
16.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 322 16. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 290
17. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 327 16.1. Error Definitions . . . . . . . . . . . . . . . . . . . 291
17.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 327 16.2. Operations and their valid errors . . . . . . . . . . . 305
17.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 329 16.3. Callback operations and their valid errors . . . . . . . 319
17.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 331 16.4. Errors and the operations that use them . . . . . . . . 320
17.4. Operation 6: CREATE - Create a Non-Regular File Object . 333 17. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 327
17.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting 17.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 327
Recovery . . . . . . . . . . . . . . . . . . . . . . . . 336 17.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 328
17.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 337 18. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 333
17.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 337 18.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 333
17.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 339 18.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 335
17.9. Operation 11: LINK - Create Link to a File . . . . . . . 340 18.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 337
17.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 341 18.4. Operation 6: CREATE - Create a Non-Regular File Object . 339
17.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 345 18.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting
17.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 346 Recovery . . . . . . . . . . . . . . . . . . . . . . . . 342
17.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 348 18.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 343
17.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 350 18.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 343
17.15. Operation 17: NVERIFY - Verify Difference in 18.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 345
Attributes . . . . . . . . . . . . . . . . . . . . . . . 351 18.9. Operation 11: LINK - Create Link to a File . . . . . . . 346
17.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 352 18.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 347
17.17. Operation 19: OPENATTR - Open Named Attribute 18.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 351
Directory . . . . . . . . . . . . . . . . . . . . . . . 367 18.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 352
17.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 368 18.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 354
17.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 369 18.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 356
17.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 370 18.15. Operation 17: NVERIFY - Verify Difference in
17.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 372 Attributes . . . . . . . . . . . . . . . . . . . . . . . 357
17.22. Operation 25: READ - Read from File . . . . . . . . . . 373 18.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 358
17.23. Operation 26: READDIR - Read Directory . . . . . . . . . 375 18.17. Operation 19: OPENATTR - Open Named Attribute
17.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 379 Directory . . . . . . . . . . . . . . . . . . . . . . . 373
17.25. Operation 28: REMOVE - Remove File System Object . . . . 380 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 374
17.26. Operation 29: RENAME - Rename Directory Entry . . . . . 382 18.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 375
17.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 384 18.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 376
17.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 385 18.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 378
17.29. Operation 33: SECINFO - Obtain Available Security . . . 385 18.22. Operation 25: READ - Read from File . . . . . . . . . . 379
17.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 389 18.23. Operation 26: READDIR - Read Directory . . . . . . . . . 381
17.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 391 18.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 385
17.32. Operation 38: WRITE - Write to File . . . . . . . . . . 392 18.25. Operation 28: REMOVE - Remove File System Object . . . . 386
17.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 397 18.26. Operation 29: RENAME - Rename Directory Entry . . . . . 388
17.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 398 18.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 390
17.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 400 18.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 391
17.36. Operation 43: CREATE_SESSION - Create New Session and 18.29. Operation 33: SECINFO - Obtain Available Security . . . 391
Confirm Client ID . . . . . . . . . . . . . . . . . . . 417 18.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 395
17.37. Operation 44: DESTROY_SESSION - Destroy existing 18.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 397
session . . . . . . . . . . . . . . . . . . . . . . . . 427 18.32. Operation 38: WRITE - Write to File . . . . . . . . . . 398
17.38. Operation 45: FREE_STATEID - Free stateid with no 18.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 403
locks . . . . . . . . . . . . . . . . . . . . . . . . . 428 18.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 404
17.39. Operation 46: GET_DIR_DELEGATION - Get a directory 18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 406
delegation . . . . . . . . . . . . . . . . . . . . . . . 429 18.36. Operation 43: CREATE_SESSION - Create New Session and
17.40. Operation 47: GETDEVICEINFO - Get Device Information . . 434 Confirm Client ID . . . . . . . . . . . . . . . . . . . 423
17.41. Operation 48: GETDEVICELIST . . . . . . . . . . . . . . 435 18.37. Operation 44: DESTROY_SESSION - Destroy existing
17.42. Operation 49: LAYOUTCOMMIT - Commit writes made using session . . . . . . . . . . . . . . . . . . . . . . . . 433
a layout . . . . . . . . . . . . . . . . . . . . . . . . 436 18.38. Operation 45: FREE_STATEID - Free stateid with no
17.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 439 locks . . . . . . . . . . . . . . . . . . . . . . . . . 435
17.44. Operation 51: LAYOUTRETURN - Release Layout 18.39. Operation 46: GET_DIR_DELEGATION - Get a directory
Information . . . . . . . . . . . . . . . . . . . . . . 442 delegation . . . . . . . . . . . . . . . . . . . . . . . 436
17.45. Operation 52: SECINFO_NO_NAME - Get Security on 18.40. Operation 47: GETDEVICEINFO - Get Device Information . . 440
Unnamed Object . . . . . . . . . . . . . . . . . . . . . 445 18.41. Operation 48: GETDEVICELIST . . . . . . . . . . . . . . 441
17.46. Operation 53: SEQUENCE - Supply per-procedure 18.42. Operation 49: LAYOUTCOMMIT - Commit writes made using
sequencing and control . . . . . . . . . . . . . . . . . 446 a layout . . . . . . . . . . . . . . . . . . . . . . . . 442
17.47. Operation 54: SET_SSV . . . . . . . . . . . . . . . . . 453 18.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 445
17.48. Operation 55: TEST_STATEID - Test stateids for 18.44. Operation 51: LAYOUTRETURN - Release Layout
validity . . . . . . . . . . . . . . . . . . . . . . . . 455 Information . . . . . . . . . . . . . . . . . . . . . . 448
17.49. Operation 56: WANT_DELEGATION . . . . . . . . . . . . . 456 18.45. Operation 52: SECINFO_NO_NAME - Get Security on
17.50. Operation 57: DESTROY_CLIENTID - Destroy existing Unnamed Object . . . . . . . . . . . . . . . . . . . . . 451
client ID . . . . . . . . . . . . . . . . . . . . . . . 459 18.46. Operation 53: SEQUENCE - Supply per-procedure
17.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims sequencing and control . . . . . . . . . . . . . . . . . 452
Finished . . . . . . . . . . . . . . . . . . . . . . . . 460 18.47. Operation 54: SET_SSV . . . . . . . . . . . . . . . . . 459
17.52. Operation 10044: ILLEGAL - Illegal operation . . . . . . 461 18.48. Operation 55: TEST_STATEID - Test stateids for
18. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 462 validity . . . . . . . . . . . . . . . . . . . . . . . . 461
18.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 462 18.49. Operation 56: WANT_DELEGATION . . . . . . . . . . . . . 462
18.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 462 18.50. Operation 57: DESTROY_CLIENTID - Destroy existing
19. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 464 client ID . . . . . . . . . . . . . . . . . . . . . . . 465
19.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 464 18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims
19.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 466 Finished . . . . . . . . . . . . . . . . . . . . . . . . 466
19.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 467 18.52. Operation 10044: ILLEGAL - Illegal operation . . . . . . 468
19.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 470 19. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 468
19.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 473 19.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 469
19.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 474 19.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 469
19.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 477 20. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 471
19.8. Operation 10: CB_RECALL_SLOT - change flow control 20.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 471
limits . . . . . . . . . . . . . . . . . . . . . . . . . 478 20.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 473
19.9. Operation 11: CB_SEQUENCE - Supply backchannel 20.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 474
sequencing and control . . . . . . . . . . . . . . . . . 479 20.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 477
19.10. Operation 12: CB_WANTS_CANCELLED . . . . . . . . . . . . 482 20.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 480
19.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible 20.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 481
lock availability . . . . . . . . . . . . . . . . . . . 483 20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 484
19.12. Operation 10044: CB_ILLEGAL - Illegal Callback 20.8. Operation 10: CB_RECALL_SLOT - change flow control
Operation . . . . . . . . . . . . . . . . . . . . . . . 484 limits . . . . . . . . . . . . . . . . . . . . . . . . . 485
20. Security Considerations . . . . . . . . . . . . . . . . . . . 485 20.9. Operation 11: CB_SEQUENCE - Supply backchannel
21. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 485 sequencing and control . . . . . . . . . . . . . . . . . 486
21.1. Defining new layout types . . . . . . . . . . . . . . . 485 20.10. Operation 12: CB_WANTS_CANCELLED . . . . . . . . . . . . 489
22. References . . . . . . . . . . . . . . . . . . . . . . . . . 486 20.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible
22.1. Normative References . . . . . . . . . . . . . . . . . . 486 lock availability . . . . . . . . . . . . . . . . . . . 490
22.2. Informative References . . . . . . . . . . . . . . . . . 487 20.12. Operation 10044: CB_ILLEGAL - Illegal Callback
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 489 Operation . . . . . . . . . . . . . . . . . . . . . . . 491
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 490 21. Security Considerations . . . . . . . . . . . . . . . . . . . 492
Intellectual Property and Copyright Statements . . . . . . . . . 491 22. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 492
22.1. Defining new layout types . . . . . . . . . . . . . . . 492
23. References . . . . . . . . . . . . . . . . . . . . . . . . . 493
23.1. Normative References . . . . . . . . . . . . . . . . . . 493
23.2. Informative References . . . . . . . . . . . . . . . . . 494
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 496
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 497
Intellectual Property and Copyright Statements . . . . . . . . . 498
1. Introduction 1. Introduction
1.1. The NFSv4.1 Protocol 1.1. The NFSv4.1 Protocol
The NFSv4.1 protocol is a minor version of the NFSv4 protocol The NFSv4.1 protocol is a minor version of the NFSv4 protocol
described in [2]. It generally follows the guidelines for the minor described in [2]. It generally follows the guidelines for the minor
versioning model laid out in Section 10 of RFC 3530. However, it versioning model laid out in Section 10 of RFC 3530. However, it
diverges from guidelines 11 ("a client and server that supports minor diverges from guidelines 11 ("a client and server that supports minor
version X must support minor versions 0 through X-1"), and 12 ("no version X must support minor versions 0 through X-1"), and 12 ("no
skipping to change at page 22, line 46 skipping to change at page 22, line 46
A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION
operation using that client ID (eir_clientid as returned from operation using that client ID (eir_clientid as returned from
EXCHANGE_ID) is required to establish the identification on the EXCHANGE_ID) is required to establish the identification on the
server. Establishment of identification by a new incarnation of the server. Establishment of identification by a new incarnation of the
client also has the effect of immediately releasing any locking state client also has the effect of immediately releasing any locking state
that a previous incarnation of that same client might have had on the that a previous incarnation of that same client might have had on the
server. Such released state would include all lock, share server. Such released state would include all lock, share
reservation, layout state, and where the server is not supporting the reservation, layout state, and where the server is not supporting the
CLAIM_DELEGATE_PREV claim type, all delegation state associated with CLAIM_DELEGATE_PREV claim type, all delegation state associated with
the same client with the same identity. For discussion of delegation the same client with the same identity. For discussion of delegation
state recovery, see Section 9.2.1. For discussion of layout state state recovery, see Section 10.2.1. For discussion of layout state
recovery see Section 12.7.1. recovery see Section 13.7.1.
Releasing such state requires that the server be able to determine Releasing such state requires that the server be able to determine
that one client instance is the successor of another. Where this that one client instance is the successor of another. Where this
cannot be done, for any of a number of reasons, the locking state cannot be done, for any of a number of reasons, the locking state
will remain for a time subject to lease expiration (see Section 8.5) will remain for a time subject to lease expiration (see Section 8.3)
and the new client will need to wait for such state to be removed, if and the new client will need to wait for such state to be removed, if
it makes conflicting lock requests. it makes conflicting lock requests.
Client identification is encapsulated in the following Client Owner Client identification is encapsulated in the following Client Owner
structure: structure:
struct client_owner4 { struct client_owner4 {
verifier4 co_verifier; verifier4 co_verifier;
opaque co_ownerid<NFS4_OPAQUE_LIMIT>; opaque co_ownerid<NFS4_OPAQUE_LIMIT>;
}; };
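As an illustration of how the Client Owner structure maps onto the wire, the following is a minimal sketch of XDR-encoding a client_owner4. It assumes verifier4 is a fixed 8-byte opaque and NFS4_OPAQUE_LIMIT is 1024, as in the NFSv4 XDR description; the function name encode_client_owner4 and the sample values are hypothetical and not part of the specification.

```python
import struct

# Assumed constants from the NFSv4 XDR description.
NFS4_VERIFIER_SIZE = 8     # verifier4 is a fixed-length opaque
NFS4_OPAQUE_LIMIT = 1024   # upper bound on co_ownerid length


def encode_client_owner4(co_verifier: bytes, co_ownerid: bytes) -> bytes:
    """XDR-encode a client_owner4 (hypothetical helper, for illustration)."""
    if len(co_verifier) != NFS4_VERIFIER_SIZE:
        raise ValueError("co_verifier must be exactly 8 bytes")
    if len(co_ownerid) > NFS4_OPAQUE_LIMIT:
        raise ValueError("co_ownerid exceeds NFS4_OPAQUE_LIMIT")
    # Fixed-length opaque: raw bytes, no length prefix.
    out = co_verifier
    # Variable-length opaque: 4-byte big-endian length, data,
    # then zero padding up to a multiple of 4 bytes.
    pad = (-len(co_ownerid)) % 4
    out += struct.pack(">I", len(co_ownerid)) + co_ownerid + b"\x00" * pad
    return out


# Sample values are made up for the example.
encoded = encode_client_owner4(b"\x00" * 8, b"abcde")
```

Here the 5-byte owner string is padded with 3 zero bytes, so the encoded structure occupies 8 + 4 + 8 = 20 bytes.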
skipping to change at page 25, line 34 skipping to change at page 25, line 34
the session is persistent (see Section 2.10.5.5). the session is persistent (see Section 2.10.5.5).
When a session is not persistent, the client will need to create a When a session is not persistent, the client will need to create a
new session. When the existing client ID is presented to a server as new session. When the existing client ID is presented to a server as
part of creating a session and that client ID is not recognized, as part of creating a session and that client ID is not recognized, as
would happen after a server reboot, the server will reject the would happen after a server reboot, the server will reject the
request with the error NFS4ERR_STALE_CLIENTID. When this happens, request with the error NFS4ERR_STALE_CLIENTID. When this happens,
the client must obtain a new client ID by use of the EXCHANGE_ID the client must obtain a new client ID by use of the EXCHANGE_ID
operation and then use that client ID as the basis of a operation and then use that client ID as the basis of a
new session and then proceed to any other necessary recovery for the new session and then proceed to any other necessary recovery for the
server reboot case (See Section 8.6.2). server reboot case (See Section 8.4.2).
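The non-persistent session case described above can be sketched as follows. This is a minimal illustration, not an implementation of the protocol: FakeServer, StaleClientID, and establish_session are hypothetical names, and the in-memory server simply forgets all client IDs when it "reboots", which stands in for a reply of NFS4ERR_STALE_CLIENTID.

```python
class StaleClientID(Exception):
    """Stands in for an NFS4ERR_STALE_CLIENTID reply (hypothetical)."""


class FakeServer:
    """Hypothetical in-memory server that forgets client IDs on reboot."""

    def __init__(self):
        self._next_id = 1
        self._known = set()

    def reboot(self):
        self._known.clear()

    def exchange_id(self, owner):
        # Hand out a new client ID for this client owner.
        clientid = self._next_id
        self._next_id += 1
        self._known.add(clientid)
        return clientid

    def create_session(self, clientid):
        if clientid not in self._known:
            raise StaleClientID(clientid)
        return ("session", clientid)


def establish_session(server, owner, clientid=None):
    """Return (clientid, session, new_clientid_obtained)."""
    if clientid is not None:
        try:
            # Try to create a session with the existing client ID first.
            return clientid, server.create_session(clientid), False
        except StaleClientID:
            pass  # Server no longer recognizes the client ID.
    # Obtain a new client ID, then base the new session on it.
    clientid = server.exchange_id(owner)
    return clientid, server.create_session(clientid), True


server = FakeServer()
cid, session, fresh = establish_session(server, b"example-owner")
server.reboot()
cid_after, session_after, fresh_after = establish_session(
    server, b"example-owner", cid)
```

When the last element of the result is true after a reboot, the caller would go on to any other recovery needed for the server reboot case.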
In the case of the session being persistent, the client will re- In the case of the session being persistent, the client will re-
establish communication using the existing session after the reboot. establish communication using the existing session after the reboot.
This session will be associated with a client ID that has had state This session will be associated with a client ID that has had state
revoked (but the persistent session is never associated with a stale revoked (but the persistent session is never associated with a stale
client ID, because if the session is persistent, the client ID MUST client ID, because if the session is persistent, the client ID MUST
persist), and the client will receive an indication of that fact in persist), and the client will receive an indication of that fact in
the sr_status_flags field returned by the SEQUENCE operation (see the sr_status_flags field returned by the SEQUENCE operation (see
Section 17.46.4). The client can then use the existing session to do Section 18.46.4). The client can then use the existing session to do
whatever operations are necessary to determine the status of requests whatever operations are necessary to determine the status of requests
outstanding at the time of reboot, while avoiding issuing new outstanding at the time of reboot, while avoiding issuing new
requests, particularly any involving locking on that session. Such requests, particularly any involving locking on that session. Such
requests would fail with an NFS4ERR_STALE_STATEID error, if requests would fail with an NFS4ERR_STALE_STATEID error, if
attempted. attempted.
See the detailed descriptions of EXCHANGE_ID (Section 17.35) and See the detailed descriptions of EXCHANGE_ID (Section 18.35) and
CREATE_SESSION (Section 17.36) for a complete specification of these CREATE_SESSION (Section 18.36) for a complete specification of these
operations. operations.
2.4.1. Server Release of Client ID 2.4.1. Server Release of Client ID
NFSv4.1 introduces a new operation called DESTROY_CLIENTID NFSv4.1 introduces a new operation called DESTROY_CLIENTID
(Section 17.50) which the client SHOULD use to destroy a client ID it (Section 18.50) which the client SHOULD use to destroy a client ID it
no longer needs. This permits graceful, bilateral release of a no longer needs. This permits graceful, bilateral release of a
client ID. client ID.
If the server determines that the client holds no associated state If the server determines that the client holds no associated state
for its client ID (including sessions, opens, locks, delegations, for its client ID (including sessions, opens, locks, delegations,
layouts, and wants), the server may choose to unilaterally release layouts, and wants), the server may choose to unilaterally release
the client ID. The server may make this choice for an inactive the client ID. The server may make this choice for an inactive
client so that resources are not consumed by those intermittently client so that resources are not consumed by those intermittently
active clients. If the client contacts the server after this active clients. If the client contacts the server after this
release, the server must ensure the client receives the appropriate release, the server must ensure the client receives the appropriate
skipping to change at page 26, line 46 skipping to change at page 26, line 46
by the appropriate CREATE_SESSION. by the appropriate CREATE_SESSION.
When the server gets an EXCHANGE_ID for a client owner that currently When the server gets an EXCHANGE_ID for a client owner that currently
has state and an unexpired lease, the server MUST NOT destroy any has state and an unexpired lease, the server MUST NOT destroy any
state that currently exists for the client owner unless one of the state that currently exists for the client owner unless one of the
following is true: following is true:
o The principal that created the client ID for the client owner is o The principal that created the client ID for the client owner is
the same as the principal that is issuing the EXCHANGE_ID. Note the same as the principal that is issuing the EXCHANGE_ID. Note
that if the client ID was created with SP4_MACH_CRED protection that if the client ID was created with SP4_MACH_CRED protection
(Section 17.35), the principal MUST be based on RPCSEC_GSS (Section 18.35), the principal MUST be based on RPCSEC_GSS
authentication, the RPCSEC_GSS service used MUST be integrity or authentication, the RPCSEC_GSS service used MUST be integrity or
privacy, and the same GSS mechanism and principal must be used as privacy, and the same GSS mechanism and principal must be used as
that used when the client ID was created. that used when the client ID was created.
o The client ID was established with SP4_SSV protection o The client ID was established with SP4_SSV protection
(Section 17.35), and the client sends the EXCHANGE_ID with the (Section 18.35), and the client sends the EXCHANGE_ID with the
security flavor set to RPCSEC_GSS using the GSS SSV mechanism security flavor set to RPCSEC_GSS using the GSS SSV mechanism
(Section 2.10.7.4). Note that this is possible only if the server (Section 2.10.7.4). Note that this is possible only if the server
and client persist the SSV. and client persist the SSV.
o The client ID was established with SP4_SSV protection. Because o The client ID was established with SP4_SSV protection. Because
the SSV might not be persisted across client and server restart, the SSV might not be persisted across client and server restart,
and because the first time a client issues EXCHANGE_ID to a server and because the first time a client issues EXCHANGE_ID to a server
it does not have an SSV, the client MAY issue the subsequent it does not have an SSV, the client MAY issue the subsequent
EXCHANGE_ID without an SSV RPCSEC_GSS handle. Instead, as with EXCHANGE_ID without an SSV RPCSEC_GSS handle. Instead, as with
SP4_MACH_CRED protection, the principal MUST be based on SP4_MACH_CRED protection, the principal MUST be based on
skipping to change at page 27, line 51 skipping to change at page 27, line 51
The Server Owner is returned in the results of EXCHANGE_ID. When the The Server Owner is returned in the results of EXCHANGE_ID. When the
so_major_id fields are the same in two EXCHANGE_ID results, the so_major_id fields are the same in two EXCHANGE_ID results, the
connections each EXCHANGE_ID was sent over can be assumed to address connections each EXCHANGE_ID was sent over can be assumed to address
the same Server (as defined in Section 1.5). If the so_minor_id the same Server (as defined in Section 1.5). If the so_minor_id
fields are also the same, then not only do both connections connect fields are also the same, then not only do both connections connect
to the same server, but the session and other state can be shared to the same server, but the session and other state can be shared
across both connections. The reader is cautioned that multiple across both connections. The reader is cautioned that multiple
servers may deliberately or accidentally claim to have the same servers may deliberately or accidentally claim to have the same
so_major_id or so_major_id/so_minor_id; the reader should examine so_major_id or so_major_id/so_minor_id; the reader should examine
Section 2.10.4 and Section 17.35. Section 2.10.4 and Section 18.35.
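The comparison of two Server Owner results described above might be sketched as follows. The ServerOwner class, the field types, and the returned labels are assumptions made for illustration. As the text cautions, matching values are only a claim by the servers, so a real client would still verify it as discussed in the referenced sections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerOwner:
    # Field types are assumptions for this sketch.
    so_minor_id: int
    so_major_id: bytes


def compare_server_owners(a: ServerOwner, b: ServerOwner) -> str:
    """Classify what two EXCHANGE_ID results claim (hypothetical labels)."""
    if a.so_major_id != b.so_major_id:
        return "different-server"
    if a.so_minor_id != b.so_minor_id:
        # Same server, but sessions and other state are not shared.
        return "same-server"
    # Same server; session and other state can be shared
    # across both connections.
    return "same-server-shared-state"


first = ServerOwner(so_minor_id=1, so_major_id=b"srv.example")
second = ServerOwner(so_minor_id=2, so_major_id=b"srv.example")
result = compare_server_owners(first, second)
```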
The considerations for generating a so_major_id are similar to those The considerations for generating a so_major_id are similar to those
for generating a co_ownerid string (see Section 2.4). The for generating a co_ownerid string (see Section 2.4). The
consequences of two servers generating conflicting so_major_id values consequences of two servers generating conflicting so_major_id values
are less dire than they are for co_ownerid conflicts because the are less dire than they are for co_ownerid conflicts because the
client can use RPCSEC_GSS to compare the authenticity of each server client can use RPCSEC_GSS to compare the authenticity of each server
(see Section 2.10.4). (see Section 2.10.4).
2.6. Security Service Negotiation 2.6. Security Service Negotiation
skipping to change at page 28, line 28 skipping to change at page 28, line 28
that are available for use by NFS clients. These points can be that are available for use by NFS clients. These points can be
considered security policy boundaries, and in some NFS considered security policy boundaries, and in some NFS
implementations are tied to NFS export points. In turn the NFS implementations are tied to NFS export points. In turn the NFS
server may be configured such that each of these security policy server may be configured such that each of these security policy
boundaries may have different or multiple security mechanisms in use. boundaries may have different or multiple security mechanisms in use.
The security negotiation between client and server must be done with The security negotiation between client and server must be done with
a secure channel to eliminate the possibility of a third party a secure channel to eliminate the possibility of a third party
intercepting the negotiation sequence and forcing the client and intercepting the negotiation sequence and forcing the client and
server to choose a lower level of security than required or desired. server to choose a lower level of security than required or desired.
See Section 20 for further discussion. See Section 21 for further discussion.
2.6.1. NFSv4.1 Security Tuples 2.6.1. NFSv4.1 Security Tuples
An NFS server can assign one or more "security tuples" to each An NFS server can assign one or more "security tuples" to each
security policy boundary in its namespace. Each security tuple security policy boundary in its namespace. Each security tuple
consists of a security flavor (see Section 2.2.1.1), and if the consists of a security flavor (see Section 2.2.1.1), and if the
flavor is RPCSEC_GSS, a GSS-API mechanism OID, a GSS-API quality of flavor is RPCSEC_GSS, a GSS-API mechanism OID, a GSS-API quality of
protection, and an RPCSEC_GSS service. protection, and an RPCSEC_GSS service.
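One way to picture a security tuple is as a small record whose GSS-related members are present only when the flavor is RPCSEC_GSS. The sketch below is illustrative: the SecurityTuple class and its field names are hypothetical, the flavor numbers are the ONC RPC values for AUTH_SYS and RPCSEC_GSS, and the service is shown as a plain string rather than its wire encoding.

```python
from dataclasses import dataclass
from typing import Optional

AUTH_SYS = 1     # ONC RPC flavor number
RPCSEC_GSS = 6   # ONC RPC flavor number


@dataclass(frozen=True)
class SecurityTuple:
    flavor: int
    oid: Optional[str] = None      # GSS-API mechanism OID
    qop: Optional[int] = None      # GSS-API quality of protection
    service: Optional[str] = None  # RPCSEC_GSS service

    def __post_init__(self):
        gss_fields = (self.oid, self.qop, self.service)
        if self.flavor == RPCSEC_GSS:
            # RPCSEC_GSS tuples carry all three GSS-related members.
            if any(f is None for f in gss_fields):
                raise ValueError("RPCSEC_GSS needs oid, qop, and service")
        elif any(f is not None for f in gss_fields):
            raise ValueError("GSS members only apply to RPCSEC_GSS")


# A hypothetical policy boundary offering two tuples.
boundary = [
    SecurityTuple(RPCSEC_GSS, "1.2.840.113554.1.2.2", 0, "integrity"),
    SecurityTuple(AUTH_SYS),
]
```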
2.6.2. SECINFO and SECINFO_NO_NAME 2.6.2. SECINFO and SECINFO_NO_NAME
skipping to change at page 29, line 32 skipping to change at page 29, line 32
2.6.3.1. Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME 2.6.3.1. Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME
This section explains the mechanics of NFSv4.1 security This section explains the mechanics of NFSv4.1 security
negotiation. The term "put filehandle operation" refers to negotiation. The term "put filehandle operation" refers to
PUTROOTFH, PUTPUBFH, PUTFH, and RESTOREFH. PUTROOTFH, PUTPUBFH, PUTFH, and RESTOREFH.
2.6.3.1.1. Put Filehandle Operation + SAVEFH 2.6.3.1.1. Put Filehandle Operation + SAVEFH
The client is saving a filehandle for a future RESTOREFH. The server The client is saving a filehandle for a future RESTOREFH. The server
MUST NOT return NFS4ERR_WRONG to either the put filehandle operation MUST NOT return NFS4ERR_WRONGSEC to either the put filehandle
or SAVEFH. operation or SAVEFH.
2.6.3.1.2. Two or More Put Filehandle Operations 2.6.3.1.2. Two or More Put Filehandle Operations
For a series of N put filehandle operations, the server MUST NOT For a series of N put filehandle operations, the server MUST NOT
return NFS4ERR_WRONGSEC to the first N-1 put filehandle operations. return NFS4ERR_WRONGSEC to the first N-1 put filehandle operations.
The Nth put filehandle operation is handled as if it is the first in The Nth put filehandle operation is handled as if it is the first in
a series of operations, and the second in the series of operations is a series of operations, and the second in the series of operations is
not a put filehandle operation. For example if the server received not a put filehandle operation. For example if the server received
PUTFH, PUTROOTFH, LOOKUP, then the PUTFH is ignored for PUTFH, PUTROOTFH, LOOKUP, then the PUTFH is ignored for
NFS4ERR_WRONGSEC purposes, and the PUTROOTFH, LOOKUP subseries is NFS4ERR_WRONGSEC purposes, and the PUTROOTFH, LOOKUP subseries is
skipping to change at page 34, line 46 skipping to change at page 34, line 46
As described in Section 2.2.1.1.1.1, NFSv4.1 relies on RPC for As described in Section 2.2.1.1.1.1, NFSv4.1 relies on RPC for
identification, authentication, integrity, and privacy. NFSv4.1 identification, authentication, integrity, and privacy. NFSv4.1
itself provides additional security services as described in the next itself provides additional security services as described in the next
several subsections. several subsections.
2.8.1. Authorization 2.8.1. Authorization
Authorization to access a file object via an NFSv4.1 operation is Authorization to access a file object via an NFSv4.1 operation is
ultimately determined by the NFSv4.1 server. A client can ultimately determined by the NFSv4.1 server. A client can
predetermine its access to a file object via the OPEN (Section 17.16) predetermine its access to a file object via the OPEN (Section 18.16)
and the ACCESS (Section 17.1) operations. and the ACCESS (Section 18.1) operations.
Principals with appropriate access rights can modify the Principals with appropriate access rights can modify the
authorization on a file object via the SETATTR (Section 17.30) authorization on a file object via the SETATTR (Section 18.30)
operation. Four attributes that affect access rights are: mode, operation. Four attributes that affect access rights are: mode,
owner, owner_group, and acl. See Section 5. owner, owner_group, and acl. See Section 5.
2.8.2. Auditing 2.8.2. Auditing
NFSv4.1 provides auditing on a per file object basis, via the ACL NFSv4.1 provides auditing on a per file object basis, via the ACL
attribute as described in Section 6. It is outside the scope of this attribute as described in Section 6. It is outside the scope of this
specification to specify audit log formats or management policies. specification to specify audit log formats or management policies.
2.8.3. Intrusion Detection 2.8.3. Intrusion Detection
skipping to change at page 37, line 49 skipping to change at page 37, line 49
feasible to keep the cache in persistent storage and enable EOS feasible to keep the cache in persistent storage and enable EOS
through server failure and recovery. One reason that previous through server failure and recovery. One reason that previous
revisions of NFS did not support EOS was because some EOS revisions of NFS did not support EOS was because some EOS
approaches often limited parallelism. As will be explained in approaches often limited parallelism. As will be explained in
Section 2.10.5, NFSv4.1 supports both EOS and unlimited Section 2.10.5, NFSv4.1 supports both EOS and unlimited
parallelism. parallelism.
o The NFSv4.1 client (defined in Section 1.5, Paragraph 1) creates o The NFSv4.1 client (defined in Section 1.5, Paragraph 1) creates
transport connections and provides them to the server to use for transport connections and provides them to the server to use for
sending callback requests, thus solving the firewall issue sending callback requests, thus solving the firewall issue
(Section 17.34). Races between responses from client requests, (Section 18.34). Races between responses from client requests,
and callbacks caused by the requests are detected via the and callbacks caused by the requests are detected via the
session's sequencing properties which are a consequence of EOS session's sequencing properties which are a consequence of EOS
(Section 2.10.5.3). (Section 2.10.5.3).
o The NFSv4.1 client can add an arbitrary number of connections to o The NFSv4.1 client can add an arbitrary number of connections to
the session, and thus provide trunking (Section 2.10.4). the session, and thus provide trunking (Section 2.10.4).
o The NFSv4.1 client and server produce a session key independent o The NFSv4.1 client and server produce a session key independent
of client and server machine credentials which can be used to of client and server machine credentials which can be used to
compute a digest for protecting critical session management compute a digest for protecting critical session management
skipping to change at page 38, line 35 skipping to change at page 38, line 35
accessed using any of the sessions associated with that client's accessed using any of the sessions associated with that client's
client ID, when connections are associated with those sessions. When client ID, when connections are associated with those sessions. When
no connections are associated for any of the sessions associated with no connections are associated for any of the sessions associated with
the client ID for an extended time such objects as locks, opens, the client ID for an extended time such objects as locks, opens,
delegations, layouts, etc. are subject to expiration. The session delegations, layouts, etc. are subject to expiration. The session
serves as an object representing a means of access by a client to the serves as an object representing a means of access by a client to the
associated client state on the server, independent of the physical associated client state on the server, independent of the physical
means of access to that state. means of access to that state.
A single client may create multiple sessions. A single session MUST A single client may create multiple sessions. A single session MUST
NOT server multiple clients. NOT serve multiple clients.
2.10.2. NFSv4 Integration 2.10.2. NFSv4 Integration
Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major
infrastructure change such as sessions would require a new major infrastructure change such as sessions would require a new major
version number to an ONC RPC program like NFS. However, because version number to an ONC RPC program like NFS. However, because
NFSv4 encapsulates its functionality in a single procedure, COMPOUND, NFSv4 encapsulates its functionality in a single procedure, COMPOUND,
and because COMPOUND can support an arbitrary number of operations, and because COMPOUND can support an arbitrary number of operations,
sessions have been added to NFSv4.1 with little difficulty. COMPOUND sessions have been added to NFSv4.1 with little difficulty. COMPOUND
includes a minor version number field, and for NFSv4.1 this minor includes a minor version number field, and for NFSv4.1 this minor
version is set to 1. When the NFSv4 server processes a COMPOUND with version is set to 1. When the NFSv4 server processes a COMPOUND with
the minor version set to 1, it expects a different set of operations the minor version set to 1, it expects a different set of operations
than it does for NFSv4.0. NFSv4.1 defines the SEQUENCE operation, than it does for NFSv4.0. NFSv4.1 defines the SEQUENCE operation,
which is required for every COMPOUND that operates over an which is required for every COMPOUND that operates over an
established session, with the exception of some session established session, with the exception of some session
administration operations, such as DESTROY_SESSION (Section 17.37). administration operations, such as DESTROY_SESSION (Section 18.37).
2.10.2.1. SEQUENCE and CB_SEQUENCE 2.10.2.1. SEQUENCE and CB_SEQUENCE
In NFSv4.1, when the SEQUENCE operation is present, it MUST be the In NFSv4.1, when the SEQUENCE operation is present, it MUST be the
first operation in the COMPOUND procedure. The primary purpose of first operation in the COMPOUND procedure. The primary purpose of
SEQUENCE is to carry the session identifier. The session identifier SEQUENCE is to carry the session identifier. The session identifier
associates all other operations in the COMPOUND procedure with a associates all other operations in the COMPOUND procedure with a
particular session. SEQUENCE also contains required information for particular session. SEQUENCE also contains required information for
maintaining EOS (see Section 2.10.5). Session-enabled NFSv4.1 maintaining EOS (see Section 2.10.5). Session-enabled NFSv4.1
COMPOUND requests thus have the form: COMPOUND requests thus have the form:
skipping to change at page 40, line 41 skipping to change at page 40, line 41
caches (see Section 2.10.5.1). Note that even the backchannel caches (see Section 2.10.5.1). Note that even the backchannel
requires a reply cache because some callback operations are requires a reply cache because some callback operations are
nonidempotent. nonidempotent.
2.10.3.1. Association of Connections, Channels, and Sessions 2.10.3.1. Association of Connections, Channels, and Sessions
Each channel is associated with zero or more transport connections. Each channel is associated with zero or more transport connections.
A connection can be associated with one channel or both channels of a A connection can be associated with one channel or both channels of a
session; the client and server negotiate whether a connection will session; the client and server negotiate whether a connection will
carry traffic for one channel or both channels via the CREATE_SESSION carry traffic for one channel or both channels via the CREATE_SESSION
(Section 17.36) and the BIND_CONN_TO_SESSION (Section 17.34) (Section 18.36) and the BIND_CONN_TO_SESSION (Section 18.34)
operations. When a session is created via CREATE_SESSION, the operations. When a session is created via CREATE_SESSION, the
connection that transported the CREATE_SESSION request is connection that transported the CREATE_SESSION request is
automatically associated with the fore channel, and optionally the automatically associated with the fore channel, and optionally the
backchannel. If the client specifies no state protection backchannel. If the client specifies no state protection
(Section 17.35) when the session is created, then when SEQUENCE is (Section 18.35) when the session is created, then when SEQUENCE is
transmitted on a different connection, the connection is transmitted on a different connection, the connection is
automatically associated with the fore channel of the session automatically associated with the fore channel of the session
specified in the SEQUENCE operation. specified in the SEQUENCE operation.
A connection's association with a session is not exclusive. A A connection's association with a session is not exclusive. A
connection associated with the channel(s) of one session may be connection associated with the channel(s) of one session may be
simultaneously associated with the channel(s) of other sessions simultaneously associated with the channel(s) of other sessions
including sessions associated with other client IDs. including sessions associated with other client IDs.
It is permissible for connections of multiple transport types to be It is permissible for connections of multiple transport types to be
skipping to change at page 41, line 36 skipping to change at page 41, line 36
server in order to increase the speed of data transfer. NFSv4.1 server in order to increase the speed of data transfer. NFSv4.1
supports two types of trunking: session trunking and client ID supports two types of trunking: session trunking and client ID
trunking. NFSv4.1 servers MUST support trunking. trunking. NFSv4.1 servers MUST support trunking.
Session trunking is essentially the association of multiple Session trunking is essentially the association of multiple
connections, each with a potentially different target network connections, each with a potentially different target network
address, to the same session. address, to the same session.
Client ID trunking is the association of multiple sessions to the Client ID trunking is the association of multiple sessions to the
same client ID, major server owner ID (Section 2.5), and server scope same client ID, major server owner ID (Section 2.5), and server scope
(Section 10.6.7). When two servers return the same major server (Section 11.6.7). When two servers return the same major server
owner and server scope it means the two servers are cooperating on owner and server scope it means the two servers are cooperating on
locking state management which is a prerequisite for client ID locking state management which is a prerequisite for client ID
trunking. trunking.
Understanding and distinguishing session and client ID trunking Understanding and distinguishing session and client ID trunking
requires understanding how the results of the EXCHANGE_ID requires understanding how the results of the EXCHANGE_ID
(Section 17.35) operation identify a server. Suppose a client issues (Section 18.35) operation identify a server. Suppose a client issues
EXCHANGE_ID over two different connections each with a possibly EXCHANGE_ID over two different connections each with a possibly
different target network address but each EXCHANGE_ID with the same different target network address but each EXCHANGE_ID with the same
value in the eia_clientowner field. If the same NFSv4.1 server is value in the eia_clientowner field. If the same NFSv4.1 server is
listening over each connection, then each EXCHANGE_ID result MUST listening over each connection, then each EXCHANGE_ID result MUST
return the same values of eir_clientid, eir_server_owner.so_major_id return the same values of eir_clientid, eir_server_owner.so_major_id
and eir_server_scope. The client can then treat each connection as and eir_server_scope. The client can then treat each connection as
referring to the same server (subject to verification, see referring to the same server (subject to verification, see
Paragraph 5 later in this section), and it can use each connection to Paragraph 5 later in this section), and it can use each connection to
trunk requests and replies. The question is whether session trunking trunk requests and replies. The question is whether session trunking
and/or client ID trunking applies. and/or client ID trunking applies.
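The comparison of EXCHANGE_ID results described above can be sketched as follows. This is a minimal illustration, not part of the draft: the ExchangeIdResult record and the same_server_claimed helper are hypothetical names, and only the three fields named in the text are modeled. A match is only a claim by the servers; the client is still expected to verify it before trunking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExchangeIdResult:
    """Hypothetical record holding the EXCHANGE_ID result fields named in the text."""
    eir_clientid: int
    so_major_id: bytes        # stands in for eir_server_owner.so_major_id
    eir_server_scope: bytes


def same_server_claimed(a: ExchangeIdResult, b: ExchangeIdResult) -> bool:
    """True if two results claim the same client ID, major owner ID, and scope.

    This only detects the claim; verification (RPCSEC_GSS or SSV) is separate.
    """
    return (
        a.eir_clientid == b.eir_clientid
        and a.so_major_id == b.so_major_id
        and a.eir_server_scope == b.eir_server_scope
    )
```

A client would run this check on the results obtained over each connection and treat a True result as a candidate for trunking only.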
skipping to change at page 43, line 8 skipping to change at page 43, line 8
the client does not have to trust the servers' claims. The client the client does not have to trust the servers' claims. The client
may verify these claims before trunking traffic in the following may verify these claims before trunking traffic in the following
ways: ways:
o For session trunking, clients SHOULD reliably verify if o For session trunking, clients SHOULD reliably verify if
connections between different network paths are in fact associated connections between different network paths are in fact associated
with the same NFSv4.1 server and usable on the same session, and with the same NFSv4.1 server and usable on the same session, and
servers MUST allow clients to perform reliable verification. When servers MUST allow clients to perform reliable verification. When
a client ID is created, the client SHOULD specify that a client ID is created, the client SHOULD specify that
BIND_CONN_TO_SESSION is to be verified according to the SP4_SSV or BIND_CONN_TO_SESSION is to be verified according to the SP4_SSV or
SP4_MACH_CRED (Section 17.35) state protection options. For SP4_MACH_CRED (Section 18.35) state protection options. For
SP4_SSV, reliable verification depends on a shared secret (the SP4_SSV, reliable verification depends on a shared secret (the
SSV) that is established via the SET_SSV (Section 17.47) SSV) that is established via the SET_SSV (Section 18.47)
operation. operation.
When a new connection is associated with the session (via the When a new connection is associated with the session (via the
BIND_CONN_TO_SESSION operation, see Section 17.34), if the client BIND_CONN_TO_SESSION operation, see Section 18.34), if the client
specified SP4_SSV state protection for the BIND_CONN_TO_SESSION specified SP4_SSV state protection for the BIND_CONN_TO_SESSION
operation, the client MUST issue the BIND_CONN_TO_SESSSION with operation, the client MUST issue the BIND_CONN_TO_SESSION with
RPCSEC_GSS protection, using integrity or privacy, and a RPCSEC_GSS protection, using integrity or privacy, and a
RPCSEC_GSS using the GSS SSV mechanism (Section 2.10.7.4 If the RPCSEC_GSS using the GSS SSV mechanism (Section 2.10.7.4). The
client mistakenly tries to associate a connection to a session of RPCSEC_GSS handle is created by CREATE_SESSION (Section 18.36).
a wrong server, the server will either reject the attempt because
it is not aware of the session identifier of the If the client mistakenly tries to associate a connection to a
session of a wrong server, the server will either reject the
attempt because it is not aware of the session identifier of the
BIND_CONN_TO_SESSION arguments, or it will reject the attempt BIND_CONN_TO_SESSION arguments, or it will reject the attempt
because the RPCSEC_GSS authentication fails. Even if the server because the RPCSEC_GSS authentication fails. Even if the server
mistakenly or maliciously accepts the connection association mistakenly or maliciously accepts the connection association
attempt, the RPCSEC_GSS verifier it computes in the response will attempt, the RPCSEC_GSS verifier it computes in the response will
not be verified by the client, so the client will know it cannot use not be verified by the client, so the client will know it cannot use
the connection for trunking the specified session. the connection for trunking the specified session.
If the client specified SP4_MACH_CRED state protection, the If the client specified SP4_MACH_CRED state protection, the
BIND_CONN_TO_SESSION operation will use RPCSEC_GSS integrity or BIND_CONN_TO_SESSION operation will use RPCSEC_GSS integrity or
privacy, using the same credential that was used when the client privacy, using the same credential that was used when the client
ID was created. Mutual authentication via RPCSEC_GSS assures the ID was created. Mutual authentication via RPCSEC_GSS assures the
client that the connection is associated with the correct sesssion client that the connection is associated with the correct session
of the correct server. of the correct server.
o For client ID trunking, the client has at least two options for o For client ID trunking, the client has at least two options for
verifying that the same client ID obtained from two different verifying that the same client ID obtained from two different
EXCHANGE_ID operations came from the same server. The first EXCHANGE_ID operations came from the same server. The first
option is to use RPCSEC_GSS authentication when issuing each option is to use RPCSEC_GSS authentication when issuing each
EXCHANGE_ID. Each time an EXCHANGE_ID is issued with RPCSEC_GSS EXCHANGE_ID. Each time an EXCHANGE_ID is issued with RPCSEC_GSS
authentication, the client notes the principal name of GSS target. authentication, the client notes the principal name of the GSS
If the EXCHANGE_ID results indicate client ID trunking is target. If the EXCHANGE_ID results indicate client ID trunking is
possible, and the GSS targets' principal names are the same, the possible, and the GSS targets' principal names are the same, the
servers are the same and client ID trunking is allowed. servers are the same and client ID trunking is allowed.
The second option for verification is to use SP4_SSV protection. The second option for verification is to use SP4_SSV protection.
When the client issues EXCHANGE_ID is specifies SP4_SSV When the client issues EXCHANGE_ID it specifies SP4_SSV
protection. The first EXCHANGE_ID the client issues always has to protection. The first EXCHANGE_ID the client issues always has to
be confirmed by a CREATE_SESSION call. The client then issues be confirmed by a CREATE_SESSION call. The client then issues
SET_SSV on the sessions. Later the client issues EXCHANGE_ID to a SET_SSV on the sessions. Later the client issues EXCHANGE_ID to a
different destination network address than the first EXCHANGE_ID was different destination network address than the first EXCHANGE_ID was
issued with. The client checks that each EXCHANGE_ID reply has issued with. The client checks that each EXCHANGE_ID reply has
the same eir_clientid, eir_server_owner.so_major_id, and the same eir_clientid, eir_server_owner.so_major_id, and
eir_server_scope. If so, the client verifies the claim by issuing eir_server_scope. If so, the client verifies the claim by issuing
a CREATE_SESSION to the second destination address, protected with a CREATE_SESSION to the second destination address, protected with
RPCSEC_GSS integrity using an RPCSEC_GSS handle returned by the RPCSEC_GSS integrity using an RPCSEC_GSS handle returned by the
second EXCHANGE_ID. If the server accepts the CREATE_SESSION second EXCHANGE_ID. If the server accepts the CREATE_SESSION
skipping to change at page 46, line 13 skipping to change at page 46, line 16
number of replies in the cache, and use a least recently used (LRU) number of replies in the cache, and use a least recently used (LRU)
approach to replacing cache entries with new entries when the cache approach to replacing cache entries with new entries when the cache
is full. In NFSv4.1, the number of outstanding requests is bounded is full. In NFSv4.1, the number of outstanding requests is bounded
by the size of the slot table, and a sequence id per slot is used to by the size of the slot table, and a sequence id per slot is used to
tell the replier when it is safe to delete a cached reply. tell the replier when it is safe to delete a cached reply.
In the NFSv4.1 reply cache, when the requester issues a new request, In the NFSv4.1 reply cache, when the requester issues a new request,
it selects a slot id in the range 0..N, where N is the replier's it selects a slot id in the range 0..N, where N is the replier's
current maximum slot id granted to the requester on the session over current maximum slot id granted to the requester on the session over
which the request is to be issued. The value of N starts out as which the request is to be issued. The value of N starts out as
equal to ca_maxrequests - 1 (Section 17.36), but can be adjusted by equal to ca_maxrequests - 1 (Section 18.36), but can be adjusted by
the response to SEQUENCE or CB_SEQUENCE as described later in this the response to SEQUENCE or CB_SEQUENCE as described later in this
section. The slot id must be unused by any of the requests which the section. The slot id must be unused by any of the requests which the
requester has already active on the session. "Unused" here means the requester has already active on the session. "Unused" here means the
requester has no outstanding request for that slot id. requester has no outstanding request for that slot id.
A slot contains a sequence id and the cached reply corresponding to A slot contains a sequence id and the cached reply corresponding to
the request sent with that sequence id. The sequence id is a 32 bit the request sent with that sequence id. The sequence id is a 32 bit
unsigned value, and is therefore in the range 0..0xFFFFFFFF (2^32 - unsigned value, and is therefore in the range 0..0xFFFFFFFF (2^32 -
1). The first time a slot is used, the requester must specify a 1). The first time a slot is used, the requester must specify a
sequence id of one (1) (Section 17.36). Each time a slot is re-used, sequence id of one (1) (Section 18.36). Each time a slot is re-used,
the request MUST specify a sequence id that is one greater than that the request MUST specify a sequence id that is one greater than that
of the previous request on the slot. If the previous sequence id was of the previous request on the slot. If the previous sequence id was
0xFFFFFFFF, then the next request for the slot MUST have the sequence 0xFFFFFFFF, then the next request for the slot MUST have the sequence
id set to zero (i.e. (2^32 - 1) + 1 mod 2^32). id set to zero (i.e. (2^32 - 1) + 1 mod 2^32).
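The slot and sequence id rules above can be illustrated with a small requester-side sketch. It is an assumption-laden illustration rather than anything defined by the draft: the class name RequesterSlotTable and its methods are hypothetical, retransmission (which reuses the same sequence id) is deliberately left out, and dynamic adjustment of N is not modeled.

```python
SEQ_MODULUS = 2 ** 32  # sequence ids are 32 bit unsigned values


class RequesterSlotTable:
    """Hypothetical requester-side view of a session's slot table."""

    def __init__(self, ca_maxrequests: int):
        # N starts out equal to ca_maxrequests - 1.
        self.max_slot_id = ca_maxrequests - 1
        self.last_seq = {}   # slot id -> last sequence id sent on that slot
        self.in_use = set()  # slot ids with an outstanding request

    def acquire(self):
        """Pick an unused slot id in 0..N and the sequence id for a new request.

        Returns (slot_id, sequence_id), or None if every slot is in use.
        """
        for slot_id in range(self.max_slot_id + 1):
            if slot_id in self.in_use:
                continue
            prev = self.last_seq.get(slot_id)
            # First use of a slot is sequence id 1; afterwards add one,
            # wrapping from 0xFFFFFFFF to 0.
            seq = 1 if prev is None else (prev + 1) % SEQ_MODULUS
            self.last_seq[slot_id] = seq
            self.in_use.add(slot_id)
            return slot_id, seq
        return None

    def release(self, slot_id: int) -> None:
        """Mark the slot unused once the reply has been received."""
        self.in_use.discard(slot_id)
```

Keeping the last sequence id per slot, rather than per session, is what lets requests on different slots proceed independently.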
The sequence id accompanies the slot id in each request. It is for The sequence id accompanies the slot id in each request. It is for
the critical check at the server: it is used to efficiently determine the critical check at the server: it is used to efficiently determine
whether a request using a certain slot id is a retransmit or a new, whether a request using a certain slot id is a retransmit or a new,
never-before-seen request. It is not feasible for the client to never-before-seen request. It is not feasible for the client to
assert that it is retransmitting to implement this, because for any assert that it is retransmitting to implement this, because for any
skipping to change at page 52, line 45 skipping to change at page 53, line 4
The client must not simply wait forever for the expected server reply The client must not simply wait forever for the expected server reply
to arrive before responding to the CB_COMPOUND that won the race, to arrive before responding to the CB_COMPOUND that won the race,
because it is possible that it will be delayed indefinitely. The because it is possible that it will be delayed indefinitely. The
client should assume the likely case that the reply will arrive client should assume the likely case that the reply will arrive
within the average round trip time for COMPOUND requests to the within the average round trip time for COMPOUND requests to the
server, and wait that period of time. If that period expires, it server, and wait that period of time. If that period expires, it
can respond to the CB_COMPOUND with NFS4ERR_DELAY. can respond to the CB_COMPOUND with NFS4ERR_DELAY.
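The bounded wait described above might look like the following sketch. The function name, the use of a threading.Event to signal arrival of the server reply, and the "PROCEED" return value are all assumptions made for illustration; only NFS4ERR_DELAY and the idea of waiting roughly one average round trip come from the text.

```python
import threading


def callback_disposition(reply_arrived: threading.Event,
                         avg_rtt_seconds: float) -> str:
    """Wait up to one average round trip for the racing reply.

    Returns "PROCEED" (hypothetical marker: the reply arrived, so the
    CB_COMPOUND can be processed normally) or "NFS4ERR_DELAY".
    """
    # Event.wait returns True if the event was set before the timeout.
    if reply_arrived.wait(timeout=avg_rtt_seconds):
        return "PROCEED"
    return "NFS4ERR_DELAY"
```

The thread that receives the COMPOUND reply would call reply_arrived.set(); the callback handler never blocks longer than the supplied timeout.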
There are other scenarios under which callbacks may race replies, There are other scenarios under which callbacks may race replies,
among them pNFS layout recalls, described in Section 12.5.4.2. among them pNFS layout recalls, described in Section 13.5.4.2.
2.10.5.4. COMPOUND and CB_COMPOUND Construction Issues 2.10.5.4. COMPOUND and CB_COMPOUND Construction Issues
Very large requests and replies may pose both buffer management Very large requests and replies may pose both buffer management
issues (especially with RDMA) and reply cache issues. When the issues (especially with RDMA) and reply cache issues. When the
session is created, (Section 17.36), for each channel (fore and session is created, (Section 18.36), for each channel (fore and
back), the client and server negotiate the maximum sized request they back), the client and server negotiate the maximum sized request they
will send or process (ca_maxrequestsize), the maximum sized reply will send or process (ca_maxrequestsize), the maximum sized reply
they will return or process (ca_maxresponsesize), and the maximum they will return or process (ca_maxresponsesize), and the maximum
sized reply they will store in the reply cache sized reply they will store in the reply cache
(ca_maxresponsesize_cached). (ca_maxresponsesize_cached).
If a request exceeds ca_maxrequestsize, the reply will have the If a request exceeds ca_maxrequestsize, the reply will have the
status NFS4ERR_REQ_TOO_BIG. A replier MAY return NFS4ERR_REQ_TOO_BIG status NFS4ERR_REQ_TOO_BIG. A replier MAY return NFS4ERR_REQ_TOO_BIG
as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the
request (which means no operations in the request executed, and the request (which means no operations in the request executed, and the
skipping to change at page 54, line 5 skipping to change at page 54, line 12
in order for correct operation of exactly once semantics. If the in order for correct operation of exactly once semantics. If the
client retries the request, the server will have cached a reply that client retries the request, the server will have cached a reply that
contains results for ten of the eleven requested operations, with the contains results for ten of the eleven requested operations, with the
tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE. tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE.
A client needs to take care that when sending operations that change A client needs to take care that when sending operations that change
the current filehandle (except for PUTFH, PUTPUBFH, and PUTROOTFH) the current filehandle (except for PUTFH, PUTPUBFH, and PUTROOTFH)
that it not exceed the maximum reply buffer before the GETFH that it not exceed the maximum reply buffer before the GETFH
operation. Otherwise the client will have to retry the operation operation. Otherwise the client will have to retry the operation
that changed the current filehandle, in order to obtain the desired that changed the current filehandle, in order to obtain the desired
filehandle. For the OPEN operation (see Section 17.16), retry is not filehandle. For the OPEN operation (see Section 18.16), retry is not
always available as an option. The following guidelines for the always available as an option. The following guidelines for the
handling of filehandle changing operations are advised: handling of filehandle changing operations are advised:
o Within the same COMPOUND procedure, a client SHOULD issue GETFH o Within the same COMPOUND procedure, a client SHOULD issue GETFH
immediately after a current filehandle changing operation. A immediately after a current filehandle changing operation. A
client MUST issue GETFH after a current filehandle change client MUST issue GETFH after a current filehandle change
operation that is also non-idempotent (for example, the OPEN operation that is also non-idempotent (for example, the OPEN
operation). operation).
o A server MAY return NFS4ERR_REP_TOO_BIG or o A server MAY return NFS4ERR_REP_TOO_BIG or
skipping to change at page 54, line 38 skipping to change at page 54, line 45
operation (in the same COMPOUND procedure) and finds it is not operation (in the same COMPOUND procedure) and finds it is not
GETFH. The server SHOULD do this if it is unable to determine in GETFH. The server SHOULD do this if it is unable to determine in
advance whether the total response size would exceed advance whether the total response size would exceed
ca_maxresponsesize_cached or ca_maxresponsesize. ca_maxresponsesize_cached or ca_maxresponsesize.
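The GETFH guideline for clients can be checked mechanically before a COMPOUND is sent. The sketch below is illustrative only: the function name is hypothetical, operations are represented as plain strings, and the caller must supply the set of operations it regards as changing the current filehandle, since that set is not enumerated in this section.

```python
# PUTFH, PUTPUBFH and PUTROOTFH are excepted by the text above.
EXEMPT_OPS = {"PUTFH", "PUTPUBFH", "PUTROOTFH"}


def ops_missing_getfh(ops, fh_changing_ops):
    """Return indexes of filehandle-changing ops not immediately followed by GETFH.

    ops: list of operation names in COMPOUND order.
    fh_changing_ops: caller-supplied set of names that change the current
    filehandle (an assumption of this sketch).
    """
    missing = []
    for i, op in enumerate(ops):
        if op in fh_changing_ops and op not in EXEMPT_OPS:
            following = ops[i + 1] if i + 1 < len(ops) else None
            if following != "GETFH":
                missing.append(i)
    return missing
```

For example, a request of SEQUENCE, PUTFH, OPEN, GETATTR would be flagged at the OPEN, which is the non-idempotent case where the text says the client MUST issue GETFH.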
2.10.5.5. Persistence 2.10.5.5. Persistence
Since the reply cache is bounded, it is practical for the reply cache Since the reply cache is bounded, it is practical for the reply cache
to persist across server restarts. The replier MUST persist the to persist across server restarts. The replier MUST persist the
following information if it agreed to persist the session (when the following information if it agreed to persist the session (when the
session was created; see Section 17.36): session was created; see Section 18.36):
o The sessionid. o The sessionid.
o The slot table including the sequence id and cached reply for each o The slot table including the sequence id and cached reply for each
slot. slot.
The above are sufficient for a replier to provide EOS semantics for The above are sufficient for a replier to provide EOS semantics for
any requests that were sent and executed before the server restarted. any requests that were sent and executed before the server restarted.
If the replier is a client then there is no need for it to persist If the replier is a client then there is no need for it to persist
any more information, unless the client will be persisting all other any more information, unless the client will be persisting all other
skipping to change at page 55, line 18 skipping to change at page 55, line 26
SEQUENCE). Such a session is considered dead. A server MAY re- SEQUENCE). Such a session is considered dead. A server MAY re-
animate a session after a server restart so that the session will animate a session after a server restart so that the session will
accept new requests as well as retries. To re-animate a session the accept new requests as well as retries. To re-animate a session the
server needs to persist additional information through server server needs to persist additional information through server
restart: restart:
o The client ID. This is a prerequisite to let the client create o The client ID. This is a prerequisite to let the client create
more sessions associated with the same client ID as the more sessions associated with the same client ID as the
o The client ID's sequenceid that is used for creating sessions (see o The client ID's sequenceid that is used for creating sessions (see
Section 17.35 and Section 17.36). This is a prerequisite to let Section 18.35 and Section 18.36). This is a prerequisite to let
the client create more sessions. the client create more sessions.
o The principal that created the client ID. This allows the server o The principal that created the client ID. This allows the server
to authenticate the client when it issues EXCHANGE_ID. to authenticate the client when it issues EXCHANGE_ID.
o The SSV, if SP4_SSV state protection was specified when the client o The SSV, if SP4_SSV state protection was specified when the client
ID was created (see Section 17.35). This lets the client create ID was created (see Section 18.35). This lets the client create
new sessions, and associate connections with the new and existing new sessions, and associate connections with the new and existing
sessions. sessions.
o The properties of the client ID as defined in Section 17.35. o The properties of the client ID as defined in Section 18.35.
A persistent reply cache places certain demands on the server. The A persistent reply cache places certain demands on the server. The
execution of the sequence of operations (starting with SEQUENCE) and execution of the sequence of operations (starting with SEQUENCE) and
placement of its results in the persistent cache MUST be atomic. If placement of its results in the persistent cache MUST be atomic. If
a client retries a sequence of operations that was previously a client retries a sequence of operations that was previously
executed on the server the only acceptable outcomes are either the executed on the server the only acceptable outcomes are either the
original cached reply or an indication that client ID or session has original cached reply or an indication that client ID or session has
been lost (indicating a catastrophic loss of the reply cache or a been lost (indicating a catastrophic loss of the reply cache or a
session that has been deleted because the client failed to use the session that has been deleted because the client failed to use the
session for an extended period of time). session for an extended period of time).
skipping to change at page 57, line 6 skipping to change at page 57, line 14
lesser of X and Y. lesser of X and Y.
2.10.6.2. Flow Control 2.10.6.2. Flow Control
Previous versions of NFS do not provide flow control; instead they Previous versions of NFS do not provide flow control; instead they
rely on the windowing provided by transports like TCP to throttle rely on the windowing provided by transports like TCP to throttle
requests. This does not work with RDMA, which provides no operation requests. This does not work with RDMA, which provides no operation
flow control and will terminate a connection in error when limits are flow control and will terminate a connection in error when limits are
exceeded. Limits such as maximum number of requests outstanding are exceeded. Limits such as maximum number of requests outstanding are
therefore negotiated when a session is created (see the therefore negotiated when a session is created (see the
ca_maxrequests field in Section 17.36). These limits then provide ca_maxrequests field in Section 18.36). These limits then provide
the maxima which each connection associated with the session's the maxima which each connection associated with the session's
channel(s) must remain within. RDMA connections are managed within channel(s) must remain within. RDMA connections are managed within
these limits as described in section 3.3 ("Flow Control"[[Comment.4: these limits as described in section 3.3 ("Flow Control"[[Comment.4:
RFC Editor: please verify section and title of the RPCRDMA RFC Editor: please verify section and title of the RPCRDMA
document]]) of [9]; if there are multiple RDMA connections, then the document]]) of [9]; if there are multiple RDMA connections, then the
maximum number of requests for a channel will be divided among the maximum number of requests for a channel will be divided among the
RDMA connections. Put a different way, the onus is on the replier to RDMA connections. Put a different way, the onus is on the replier to
ensure that total number of RDMA credits across all connections ensure that total number of RDMA credits across all connections
associated with the replier's channel does not exceed the channel's associated with the replier's channel does not exceed the channel's
maximum number of outstanding requests. maximum number of outstanding requests.
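One way a replier could keep the total of RDMA credits within the channel's maximum is to split the maximum across its connections. The sketch below assumes an even split, which is a choice made here for illustration; the text only constrains the total. The function name is hypothetical.

```python
def divide_credits(max_requests: int, n_connections: int) -> list:
    """Split a channel's request maximum across RDMA connections.

    Returns one credit count per connection; the counts sum to exactly
    max_requests, so the total never exceeds the channel's limit.
    """
    if n_connections <= 0:
        raise ValueError("at least one connection is required")
    base, extra = divmod(max_requests, n_connections)
    # The first `extra` connections receive one additional credit.
    return [base + (1 if i < extra else 0) for i in range(n_connections)]
```

With a channel maximum of 10 and three RDMA connections, this grants 4, 3, and 3 credits.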
The limits may also be modified dynamically at the replier's choosing The limits may also be modified dynamically at the replier's choosing
by manipulating certain parameters present in each NFSv4.1 reply. In by manipulating certain parameters present in each NFSv4.1 reply. In
addition, the CB_RECALL_SLOT callback operation (see Section 19.8) addition, the CB_RECALL_SLOT callback operation (see Section 20.8)
can be issued by a server to a client to return RDMA credits to the can be issued by a server to a client to return RDMA credits to the
server, thereby lowering the maximum number of requests a client can server, thereby lowering the maximum number of requests a client can
have outstanding to the server. have outstanding to the server.
2.10.6.3. Padding 2.10.6.3. Padding
Header padding is requested by each peer at session initiation (see Header padding is requested by each peer at session initiation (see
the ca_headerpadsize argument to CREATE_SESSION in Section 17.36), the ca_headerpadsize argument to CREATE_SESSION in Section 18.36),
and subsequently used by the RPC RDMA layer, as described in [9]. and subsequently used by the RPC RDMA layer, as described in [9].
Zero padding is permitted. Zero padding is permitted.
Padding leverages the useful property that RDMA preserves alignment of Padding leverages the useful property that RDMA preserves alignment of
data, even when it is placed into anonymous (untagged) buffers. data, even when it is placed into anonymous (untagged) buffers.
If requested, client inline writes will insert appropriate pad octets If requested, client inline writes will insert appropriate pad octets
within the request header to align the data payload on the specified within the request header to align the data payload on the specified
boundary. The client is encouraged to add sufficient padding (up to boundary. The client is encouraged to add sufficient padding (up to
the negotiated size) so that the "data" field of the NFSv4.1 WRITE the negotiated size) so that the "data" field of the NFSv4.1 WRITE
operation is aligned. Most servers can make good use of such operation is aligned. Most servers can make good use of such
skipping to change at page 58, line 38 skipping to change at page 58, line 46
perform similar optimizations, if desired. perform similar optimizations, if desired.
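The pad calculation implied above can be sketched as follows. This is a minimal example, assuming the sender knows the header length and the alignment boundary in octets; the name header_pad and the choice to send no padding when alignment would need more than the negotiated size are assumptions of this sketch, not behavior stated in the text.

```python
def header_pad(header_len, alignment, max_pad):
    """Return the number of pad octets to insert after the header.

    header_len: octets preceding the data payload (assumed known)
    alignment:  boundary on which the payload should start
    max_pad:    negotiated header pad size (ca_headerpadsize)
    """
    pad = (-header_len) % alignment
    # Assumption: if alignment cannot be reached within the negotiated
    # size, send no padding rather than a partial amount.
    return pad if pad <= max_pad else 0


pad = header_pad(100, 64, 128)
assert (100 + pad) % 64 == 0
```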
2.10.6.4. Dual RDMA and Non-RDMA Transports 2.10.6.4. Dual RDMA and Non-RDMA Transports
Some RDMA transports (for example, [11]) permit a "streaming" (non- Some RDMA transports (for example, [11]) permit a "streaming" (non-
RDMA) phase, where ordinary traffic might flow before "stepping up" RDMA) phase, where ordinary traffic might flow before "stepping up"
to RDMA mode, commencing RDMA traffic. Some RDMA transports start to RDMA mode, commencing RDMA traffic. Some RDMA transports start
connections always in RDMA mode. NFSv4.1 allows, but does not connections always in RDMA mode. NFSv4.1 allows, but does not
assume, a streaming phase before RDMA mode. When a connection is assume, a streaming phase before RDMA mode. When a connection is
associated with a session, the client and server negotiate whether associated with a session, the client and server negotiate whether
the connection is used in RDMA or non-RDMA mode (see Section 17.36 the connection is used in RDMA or non-RDMA mode (see Section 18.36
and Section 17.34). and Section 18.34).
2.10.7. Sessions Security 2.10.7. Sessions Security
2.10.7.1. Session Callback Security 2.10.7.1. Session Callback Security
Via session / connection association, NFSv4.1 improves security over Via session / connection association, NFSv4.1 improves security over
that provided by NFSv4.0 for the backchannel. The connection is that provided by NFSv4.0 for the backchannel. The connection is
client-initiated (see Section 17.34), and subject to the same client-initiated (see Section 18.34), and subject to the same
firewall and routing checks as the fore channel. The connection firewall and routing checks as the fore channel. The connection
cannot be hijacked by an attacker who connects to the client port cannot be hijacked by an attacker who connects to the client port
prior to the intended server as is possible with NFSv4.0. At the prior to the intended server as is possible with NFSv4.0. At the
client's option (see Section 17.35), connection association is fully client's option (see Section 18.35), connection association is fully
authenticated before being activated (see Section 17.34). Traffic authenticated before being activated (see Section 18.34). Traffic
from the server over the backchannel is authenticated exactly as the from the server over the backchannel is authenticated exactly as the
client specifies (see Section 2.10.7.2). client specifies (see Section 2.10.7.2).
2.10.7.2. Backchannel RPC Security 2.10.7.2. Backchannel RPC Security
When the NFSv4.1 client establishes the backchannel, it informs the When the NFSv4.1 client establishes the backchannel, it informs the
server of the security flavors and principals to use when sending server of the security flavors and principals to use when sending
requests. If the security flavor is RPCSEC_GSS, the client expresses requests. If the security flavor is RPCSEC_GSS, the client expresses
the principal in the form of an established RPCSEC_GSS context. The the principal in the form of an established RPCSEC_GSS context. The
server is free to use any of the flavor/principal combinations the server is free to use any of the flavor/principal combinations the
client offers, but it MUST NOT use unoffered combinations. This way, client offers, but it MUST NOT use unoffered combinations. This way,
the client need not to provide a target GSS principal for the the client need not provide a target GSS principal for the
backchannel as it did with NFSv4.0, nor does the server have to implement backchannel as it did with NFSv4.0, nor does the server have to implement
an RPCSEC_GSS initiator as it did with NFSv4.0 [2]. an RPCSEC_GSS initiator as it did with NFSv4.0 [2].
The CREATE_SESSION (Section 17.36) and BACKCHANNEL_CTL The CREATE_SESSION (Section 18.36) and BACKCHANNEL_CTL
(Section 17.33) operations allow the client to specify flavor/ (Section 18.33) operations allow the client to specify flavor/
principal combinations. principal combinations.
Also note that the SP4_SSV state protection mode (see Section 17.35 Also note that the SP4_SSV state protection mode (see Section 18.35
and Section 2.10.7.3) has the side benefit of providing SSV-derived and Section 2.10.7.3) has the side benefit of providing SSV-derived
RPCSEC_GSS contexts (Section 2.10.7.4). RPCSEC_GSS contexts (Section 2.10.7.4).
2.10.7.3. Protection from Unauthorized State Changes 2.10.7.3. Protection from Unauthorized State Changes
As described to this point in the specification, the state model of As described to this point in the specification, the state model of
NFSv4.1 is vulnerable to an attacker that issues a SEQUENCE operation NFSv4.1 is vulnerable to an attacker that issues a SEQUENCE operation
with a forged sessionid and with a slot id that it expects the with a forged sessionid and with a slot id that it expects the
legitimate client to use next. When the legitimate client uses the legitimate client to use next. When the legitimate client uses the
slot id with the same sequence number, the server returns the slot id with the same sequence number, the server returns the
attacker's result from the reply cache which disrupts the legitimate attacker's result from the reply cache which disrupts the legitimate
client and thus denies service to it. Similarly an attacker could client and thus denies service to it. Similarly an attacker could
issue a CREATE_SESSION with a forged client ID to create a new issue a CREATE_SESSION with a forged client ID to create a new
session associated with the client ID. The attacker could issue session associated with the client ID. The attacker could issue
requests using the new session that change locking state, such as requests using the new session that change locking state, such as
LOCKU operations to release locks the legitimate client has acquired. LOCKU operations to release locks the legitimate client has acquired.
Setting a security policy on the file which requires RPCSEC_GSS Setting a security policy on the file which requires RPCSEC_GSS
credentials when manipulating the file's state is one potential work credentials when manipulating the file's state is one potential work
around, but has the disadvantage of preventing a legitimate client around, but has the disadvantage of preventing a legitimate client
from releasing state when RPCSEC_GSS is required to do so, but a GSS from releasing state when RPCSEC_GSS is required to do so, but a GSS
context cannot be obtained (possibly because the user has logged off context cannot be obtained (possibly because the user has logged off
the client). the client).
NFSv4.1 provides three options to a client for state protection which NFSv4.1 provides three options to a client for state protection which
are specified when a client creates a client ID via EXCHANGE_ID are specified when a client creates a client ID via EXCHANGE_ID
(Section 17.35). (Section 18.35).
The first (SP4_NONE) is to simply waive state protection. The first (SP4_NONE) is to simply waive state protection.
The other two options (SP4_MACH_CRED and SP4_SSV) share several The other two options (SP4_MACH_CRED and SP4_SSV) share several
traits: traits:
o An RPCSEC_GSS-based credential is used to authenticate client ID o An RPCSEC_GSS-based credential is used to authenticate client ID
and session maintenance operations, including creating and and session maintenance operations, including creating and
destroying a session, associating a connection with the session, destroying a session, associating a connection with the session,
and destroying the client ID. and destroying the client ID.
skipping to change at page 61, line 25 skipping to change at page 61, line 35
the approach suffers from lack of economy. the approach suffers from lack of economy.
The SP4_SSV protection option uses a Secret State Verifier (SSV) The SP4_SSV protection option uses a Secret State Verifier (SSV)
which is shared between a client and server. The SSV serves as the which is shared between a client and server. The SSV serves as the
secret key for an internal (that is, internal to NFSv4.1) GSS secret key for an internal (that is, internal to NFSv4.1) GSS
mechanism that uses the secret key for Message Integrity Code (MIC) mechanism that uses the secret key for Message Integrity Code (MIC)
and Wrap tokens (Section 2.10.7.4). The SP4_SSV protection option is and Wrap tokens (Section 2.10.7.4). The SP4_SSV protection option is
intended for a client that has multiple users, where the system intended for a client that has multiple users, where the system
administrator does not wish to configure a permanent machine administrator does not wish to configure a permanent machine
credential for each client. The SSV is established on the server via credential for each client. The SSV is established on the server via
SET_SSV (see Section 17.47). To prevent eavesdropping, a client SET_SSV (see Section 18.47). To prevent eavesdropping, a client
SHOULD issue SET_SSV via RPCSEC_GSS with the privacy service. SHOULD issue SET_SSV via RPCSEC_GSS with the privacy service.
Several aspects of the SSV make it intractable for an attacker to Several aspects of the SSV make it intractable for an attacker to
guess the SSV, and thus associate rogue connections with a session, guess the SSV, and thus associate rogue connections with a session,
and rogue sessions with a client ID: and rogue sessions with a client ID:
o The arguments to and results of SET_SSV include digests of the old o The arguments to and results of SET_SSV include digests of the old
and new SSV, respectively. and new SSV, respectively.
o Because the initial value of the SSV is zero, therefore known, the o Because the initial value of the SSV is zero, therefore known, the
client that opts for SP4_SSV protection and opts to apply SP4_SSV client that opts for SP4_SSV protection and opts to apply SP4_SSV
skipping to change at page 69, line 20 skipping to change at page 69, line 30
2.10.9.1. Events Requiring Client Action 2.10.9.1. Events Requiring Client Action
The following events require client action to recover. The following events require client action to recover.
2.10.9.1.1. RPCSEC_GSS Context Loss by Callback Path 2.10.9.1.1. RPCSEC_GSS Context Loss by Callback Path
If all RPCSEC_GSS contexts granted by the client to the server for If all RPCSEC_GSS contexts granted by the client to the server for
callback use have expired, the client MUST establish a new context callback use have expired, the client MUST establish a new context
via BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE via BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE
results indicates when callback contexts are nearly expired, or fully results indicates when callback contexts are nearly expired, or fully
expired (see Section 17.46.4). expired (see Section 18.46.4).
2.10.9.1.2. Connection Loss 2.10.9.1.2. Connection Loss
If the client loses the last connection of the session, and if it wants If the client loses the last connection of the session, and if it wants
to retain the session, then it must create a new connection, and if, to retain the session, then it must create a new connection, and if,
when the client ID was created, BIND_CONN_TO_SESSION was specified in when the client ID was created, BIND_CONN_TO_SESSION was specified in
the spo_must_enforce list, the client MUST use BIND_CONN_TO_SESSION the spo_must_enforce list, the client MUST use BIND_CONN_TO_SESSION
to associate the connection with the session. to associate the connection with the session.
If there was a request outstanding at the time of the connection If there was a request outstanding at the time of the connection
skipping to change at page 70, line 50 skipping to change at page 71, line 12
recover. Any non-idempotent operations that were in progress may recover. Any non-idempotent operations that were in progress may
have been performed on the server at the time of session loss. The have been performed on the server at the time of session loss. The
client has no general way to recover from this. client has no general way to recover from this.
Note that loss of session does not imply loss of lock, open, Note that loss of session does not imply loss of lock, open,
delegation, or layout state because locks, opens, delegations, and delegation, or layout state because locks, opens, delegations, and
layouts are tied to the client ID and depend on the client ID, not layouts are tied to the client ID and depend on the client ID, not
the session. Nor does loss of lock, open, delegation, or layout the session. Nor does loss of lock, open, delegation, or layout
state imply loss of session state, because the session depends on the state imply loss of session state, because the session depends on the
client ID; loss of client ID however does imply loss of session, client ID; loss of client ID however does imply loss of session,
lock, open, delegation, and layout state. See Section 8.6.2. A lock, open, delegation, and layout state. See Section 8.4.2. A
session can survive a server reboot, but lock recovery may still be session can survive a server reboot, but lock recovery may still be
needed. needed.
It is possible CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID It is possible CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID
(for example the server reboots and does not preserve client ID (for example the server reboots and does not preserve client ID
state). If so, the client needs to call EXCHANGE_ID, followed by state). If so, the client needs to call EXCHANGE_ID, followed by
CREATE_SESSION. CREATE_SESSION.
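The recovery sequence above can be sketched as a small retry routine. This is only an illustration: the callables exchange_id and create_session, the FakeServer class, and the use of strings for status codes are all hypothetical stand-ins, not part of the protocol's XDR.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_STALE_CLIENTID = "NFS4ERR_STALE_CLIENTID"


def recreate_session(client_id, exchange_id, create_session):
    """Try CREATE_SESSION; on a stale client ID, redo EXCHANGE_ID first."""
    status, session = create_session(client_id)
    if status == NFS4ERR_STALE_CLIENTID:
        # The server no longer knows this client ID (for example, it
        # rebooted), so obtain a fresh one and try once more.
        client_id = exchange_id()
        status, session = create_session(client_id)
    return status, client_id, session


class FakeServer:
    """Toy server that forgets every client ID it did not just issue."""

    def __init__(self):
        self.valid_ids = set()
        self.next_id = 1

    def exchange_id(self):
        cid = self.next_id
        self.next_id += 1
        self.valid_ids.add(cid)
        return cid

    def create_session(self, cid):
        if cid not in self.valid_ids:
            return NFS4ERR_STALE_CLIENTID, None
        return NFS4_OK, "session-for-%d" % cid


server = FakeServer()
status, cid, session = recreate_session(
    99, server.exchange_id, server.create_session)
```

Because lock, open, delegation, and layout state hang off the client ID, a real client that reaches the EXCHANGE_ID branch would also need to reclaim that state; the sketch leaves this out.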
2.10.9.2. Events Requiring Server Action 2.10.9.2. Events Requiring Server Action
The following events require server action to recover. The following events require server action to recover.
2.10.9.2.1. Client Crash and Reboot 2.10.9.2.1. Client Crash and Reboot
As described in Section 17.35, a rebooted client issues EXCHANGE_ID As described in Section 18.35, a rebooted client issues EXCHANGE_ID
in such a way that it causes the server to delete any sessions it had. in such a way that it causes the server to delete any sessions it had.
2.10.9.2.2. Client Crash with No Reboot 2.10.9.2.2. Client Crash with No Reboot
If a client crashes and never comes back, it will never issue If a client crashes and never comes back, it will never issue
EXCHANGE_ID with its old client owner. Thus the server has session EXCHANGE_ID with its old client owner. Thus the server has session
state that will never be used again. After an extended period of state that will never be used again. After an extended period of
time and if the server has resource constraints, it MAY destroy the time and if the server has resource constraints, it MAY destroy the
old session as well as locking state. old session as well as locking state.
skipping to change at page 72, line 4 skipping to change at page 72, line 14
sessionid, slot id, and sequence id in the retry match that of the sessionid, slot id, and sequence id in the retry match that of the
original request, the callback target will recognize the request as a original request, the callback target will recognize the request as a
retry even if it did see the request prior to disconnect. retry even if it did see the request prior to disconnect.
If the connection lost is the last one associated with the If the connection lost is the last one associated with the
backchannel, then the server MUST indicate that in the backchannel, then the server MUST indicate that in the
sr_status_flags field of every SEQUENCE reply until the backchannel sr_status_flags field of every SEQUENCE reply until the backchannel
is reestablished. There are two situations, each of which uses is reestablished. There are two situations, each of which uses
different status flags: no connectivity for the session's different status flags: no connectivity for the session's
backchannel, and no connectivity for any session backchannel of the backchannel, and no connectivity for any session backchannel of the
client. See Section 17.46 for a description of the appropriate flags client. See Section 18.46 for a description of the appropriate flags
in sr_status_flags. in sr_status_flags.
2.10.9.2.5. GSS Context Loss 2.10.9.2.5. GSS Context Loss
The server SHOULD monitor when the number of RPCSEC_GSS contexts The server SHOULD monitor when the number of RPCSEC_GSS contexts
assigned to the backchannel reaches one, and that one context is near assigned to the backchannel reaches one, and that one context is near
expiry (i.e. between one and two periods of lease time), and indicate expiry (i.e. between one and two periods of lease time), and indicate
so in the sr_status_flags field of all SEQUENCE replies. The server so in the sr_status_flags field of all SEQUENCE replies. The server
MUST indicate when all of the backchannel's assigned RPCSEC_GSS MUST indicate when all of the backchannel's assigned RPCSEC_GSS
contexts have expired in the sr_status_flags field of all SEQUENCE contexts have expired in the sr_status_flags field of all SEQUENCE
skipping to change at page 72, line 26 skipping to change at page 72, line 36
2.10.10. Parallel NFS and Sessions 2.10.10. Parallel NFS and Sessions
A client and server can potentially be a non-pNFS implementation, a A client and server can potentially be a non-pNFS implementation, a
metadata server implementation, a data server implementation, or two metadata server implementation, a data server implementation, or two
or three types of implementations. The EXCHGID4_FLAG_USE_NON_PNFS, or three types of implementations. The EXCHGID4_FLAG_USE_NON_PNFS,
EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not
mutually exclusive) are passed in the EXCHANGE_ID arguments and mutually exclusive) are passed in the EXCHANGE_ID arguments and
results to allow the client to indicate how it wants to use sessions results to allow the client to indicate how it wants to use sessions
created under the client ID, and to allow the server to indicate how created under the client ID, and to allow the server to indicate how
it will allow the sessions to be used. See Section 13.1 for pNFS it will allow the sessions to be used. See Section 14.1 for pNFS
sessions considerations. sessions considerations.
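Since the three flags are not mutually exclusive, they behave like a bit mask. The sketch below models that with Python's IntFlag. The bit values are placeholders chosen for illustration and are not taken from this excerpt; consult the protocol's XDR definition for the real constants. The roles helper is likewise hypothetical.

```python
from enum import IntFlag


class ExchgidFlag(IntFlag):
    # Placeholder bit values for illustration only.
    USE_NON_PNFS = 0x1
    USE_PNFS_MDS = 0x2
    USE_PNFS_DS = 0x4


def roles(flags):
    """List the names of the roles present in a flag word."""
    return [f.name for f in ExchgidFlag if f in flags]


# A client asking to use sessions for both pNFS roles.
requested = ExchgidFlag.USE_PNFS_MDS | ExchgidFlag.USE_PNFS_DS

# A hypothetical server reply that permits only the metadata server role.
granted = ExchgidFlag.USE_PNFS_MDS

assert ExchgidFlag.USE_PNFS_DS in requested
assert ExchgidFlag.USE_PNFS_DS not in granted
```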
3. Protocol Data Types 3. Protocol Data Types
The syntax and semantics to describe the data types of the NFS The syntax and semantics to describe the data types of the NFS
version 4 protocol are defined in the XDR RFC4506 [3] and RPC RFC1831 version 4 protocol are defined in the XDR RFC4506 [3] and RPC RFC1831
[4] documents. The next sections build upon the XDR data types to [4] documents. The next sections build upon the XDR data types to
define types and structures specific to this protocol. define types and structures specific to this protocol.
3.1. Basic Data Types 3.1. Basic Data Types
skipping to change at page 79, line 31 skipping to change at page 79, line 43
that clients have "layout drivers" that support one or more layout that clients have "layout drivers" that support one or more layout
types. The file server advertises the layout types it supports types. The file server advertises the layout types it supports
through the fs_layout_type file system attribute (Section 5.13.1). A through the fs_layout_type file system attribute (Section 5.13.1). A
client asks for layouts of a particular type in LAYOUTGET, and passes client asks for layouts of a particular type in LAYOUTGET, and passes
those layouts to its layout driver. those layouts to its layout driver.
The layouttype4 structure is 32 bits in length. The range The layouttype4 structure is 32 bits in length. The range
represented by the layout type is split into three parts. Type 0x0 represented by the layout type is split into three parts. Type 0x0
is reserved. Types within the range 0x00000001-0x7FFFFFFF are is reserved. Types within the range 0x00000001-0x7FFFFFFF are
globally unique and are assigned according to the description in globally unique and are assigned according to the description in
Section 21.1; they are maintained by IANA. Types within the range Section 22.1; they are maintained by IANA. Types within the range
0x80000000-0xFFFFFFFF are site specific and for "private use" only. 0x80000000-0xFFFFFFFF are site specific and for "private use" only.
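The three-way split of the layout type range can be expressed as a small classifier. This is an illustrative sketch; the function name and the returned labels are assumptions of the example.

```python
def classify_layouttype(value):
    """Classify a 32-bit layouttype4 value by the range it falls in."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("layouttype4 is a 32-bit value")
    if value == 0x0:
        return "reserved"
    if value <= 0x7FFFFFFF:
        return "iana"      # globally unique, maintained by IANA
    return "private"       # site specific, private use only


assert classify_layouttype(0x80000000) == "private"
```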
The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file
layout type is to be used. The LAYOUT4_OSD2_OBJECTS enumeration layout type is to be used. The LAYOUT4_OSD2_OBJECTS enumeration
specifies that the object layout, as defined in [29], is to be used. specifies that the object layout, as defined in [29], is to be used.
Similarly, the LAYOUT4_BLOCK_VOLUME enumeration specifies that the block/volume Similarly, the LAYOUT4_BLOCK_VOLUME enumeration specifies that the block/volume
layout, as defined in [30], is to be used. layout, as defined in [30], is to be used.
3.2.15. deviceid4 3.2.15. deviceid4
typedef uint64_t deviceid4; typedef uint32_t deviceid4;
Layout information includes device IDs that specify a storage device Layout information includes device IDs that specify a storage device
through a compact handle. Addressing and type information is through a compact handle. Addressing and type information is
obtained with the GETDEVICEINFO operation. A client must not assume obtained with the GETDEVICEINFO operation. A client must not assume
that device IDs are valid across metadata server reboots. The device that device IDs are valid across metadata server reboots. The device
ID is qualified by the layout type and is unique per file system ID is qualified by the layout type and is unique per file system
(FSID). See Section 12.2.10 for more details. (FSID). See Section 13.2.10 for more details.
3.2.16. device_addr4 3.2.16. device_addr4
struct device_addr4 { struct device_addr4 {
layouttype4 da_layout_type; layouttype4 da_layout_type;
opaque da_addr_body<>; opaque da_addr_body<>;
}; };
The device address is used to set up a communication channel with the The device address is used to set up a communication channel with the
storage device. Different layout types will require different types storage device. Different layout types will require different types
skipping to change at page 80, line 46 skipping to change at page 81, line 14
3.2.18. layout_content4 3.2.18. layout_content4
struct layout_content4 { struct layout_content4 {
layouttype4 loc_type; layouttype4 loc_type;
opaque loc_body<>; opaque loc_body<>;
}; };
The loc_body field must be interpreted based on the layout type The loc_body field must be interpreted based on the layout type
(loc_type). This document defines the loc_body for the NFSv4.1 file (loc_type). This document defines the loc_body for the NFSv4.1 file
layout type; see Section 13.3 for its definition. layout type; see Section 14.3 for its definition.
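To make the wire form of this structure concrete, the sketch below encodes a layout_content4 using the standard XDR rules: a 4-octet big-endian type, then a variable-length opaque carried as a 4-octet length, the data, and zero fill to a multiple of four octets. The function names, the sample type value 1, and the sample body are assumptions of this example; the real body contents depend on the layout type.

```python
import struct


def xdr_opaque(data):
    """Encode variable-length opaque data: length, bytes, zero padding."""
    pad = (-len(data)) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


def encode_layout_content4(loc_type, loc_body):
    """Encode loc_type followed by the opaque loc_body."""
    return struct.pack(">I", loc_type) + xdr_opaque(loc_body)


# Sample values only: type 1 with a 5-octet body.
encoded = encode_layout_content4(1, b"abcde")
assert len(encoded) % 4 == 0
```

A receiver reads loc_type first and only then hands the loc_body octets to the matching layout driver, which is why the body is carried as opaque data.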
3.2.19. layout4 3.2.19. layout4
struct layout4 { struct layout4 {
offset4 lo_offset; offset4 lo_offset;
length4 lo_length; length4 lo_length;
layoutiomode4 lo_iomode; layoutiomode4 lo_iomode;
layout_content4 lo_content; layout_content4 lo_content;
}; };
skipping to change at page 81, line 42 skipping to change at page 82, line 4
determined by the layout type and are defined in their context. The determined by the layout type and are defined in their context. The
NFSv4.1 file-based layout does not use this structure, thus the NFSv4.1 file-based layout does not use this structure, thus the
lou_body field should have a zero length. lou_body field should have a zero length.
3.2.21. layouthint4 3.2.21. layouthint4
struct layouthint4 { struct layouthint4 {
layouttype4 loh_type; layouttype4 loh_type;
opaque loh_body<>; opaque loh_body<>;
}; };
The layouthint4 structure is used by the client to pass in a hint The layouthint4 structure is used by the client to pass in a hint
about the type of layout it would like created for a particular file. about the type of layout it would like created for a particular file.
It is the structure specified by the layout_hint attribute described It is the structure specified by the layout_hint attribute described
in Section 5.13.4. The metadata server may ignore the hint, or may in Section 5.13.4. The metadata server may ignore the hint, or may
selectively ignore fields within the hint. This hint should be selectively ignore fields within the hint. This hint should be
provided at create time as part of the initial attributes within provided at create time as part of the initial attributes within
OPEN. The loh_body field is specific to the type of layout OPEN. The loh_body field is specific to the type of layout
(loh_type). The NFSv4.1 file-based layout uses the (loh_type). The NFSv4.1 file-based layout uses the
nfsv4_1_file_layouthint4 structure as defined in Section 13.3. nfsv4_1_file_layouthint4 structure as defined in Section 14.3.
3.2.22. layoutiomode4 3.2.22. layoutiomode4
enum layoutiomode4 { enum layoutiomode4 {
LAYOUTIOMODE4_READ = 1, LAYOUTIOMODE4_READ = 1,
LAYOUTIOMODE4_RW = 2, LAYOUTIOMODE4_RW = 2,
LAYOUTIOMODE4_ANY = 3 LAYOUTIOMODE4_ANY = 3
}; };
The iomode specifies whether the client intends to read or write The iomode specifies whether the client intends to read or write
skipping to change at page 87, line 45 skipping to change at page 88, line 10
This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is
set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not set, set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not set,
or if a non-readonly file system has a transition target in a or if a non-readonly file system has a transition target in a
different _handle_ class. In these cases, the server should deny a different _handle_ class. In these cases, the server should deny a
RENAME or REMOVE that would affect an OPEN file or any of the RENAME or REMOVE that would affect an OPEN file or any of the
components leading to the OPEN file. In addition, the server should components leading to the OPEN file. In addition, the server should
deny all RENAME or REMOVE requests during the grace period, in order deny all RENAME or REMOVE requests during the grace period, in order
to make sure that reclaims of files where filehandles may have to make sure that reclaims of files where filehandles may have
expired do not do a reclaim for the wrong file. expired do not do a reclaim for the wrong file.
Volatile filehandles are especially suitable for implementation of the
pseudo file systems used to bridge exports. See Section 7.5 for a
discussion of this.
4.3. One Method of Constructing a Volatile Filehandle 4.3. One Method of Constructing a Volatile Filehandle
A volatile filehandle, while opaque to the client, could contain: A volatile filehandle, while opaque to the client, could contain:
[volatile bit = 1 | server boot time | slot | generation number] [volatile bit = 1 | server boot time | slot | generation number]
o slot is an index in the server volatile filehandle table o slot is an index in the server volatile filehandle table
o generation number is the generation number for the table entry/ o generation number is the generation number for the table entry/
slot slot
When the client presents a volatile filehandle, the server makes the When the client presents a volatile filehandle, the server makes the
following checks, which assume that the check for the volatile bit following checks, which assume that the check for the volatile bit
has passed. If the server boot time is less than the current server has passed. If the server boot time is less than the current server
boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return
NFS4ERR_BADHANDLE. If the generation number does not match, return NFS4ERR_BADHANDLE. If the generation number does not match, return
skipping to change at page 109, line 11 skipping to change at page 109, line 11
should obey an invariant that has it returning a value that is equal should obey an invariant that has it returning a value that is equal
to the file object's entry in the object's parent directory, i.e. to the file object's entry in the object's parent directory, i.e.
what readdir() would have returned. Some operating environments what readdir() would have returned. Some operating environments
allow a series of two or more file systems to be mounted onto a allow a series of two or more file systems to be mounted onto a
single mount point. In this case, for the server to obey the single mount point. In this case, for the server to obey the
aforementioned invariant, it will need to find the base mount point, aforementioned invariant, it will need to find the base mount point,
and not the intermediate mount points. and not the intermediate mount points.
5.12. Directory Notification Attributes 5.12. Directory Notification Attributes
As described in Section 17.39, the client can request a minimum delay As described in Section 18.39, the client can request a minimum delay
for notifications of changes to attributes, but the server is free for notifications of changes to attributes, but the server is free
to ignore what the client requests. The client can determine in advance to ignore what the client requests. The client can determine in advance
what notification delays the server will accept by issuing a GETATTR what notification delays the server will accept by issuing a GETATTR
for either or both of two directory notification attributes. When for either or both of two directory notification attributes. When
the client calls the GET_DIR_DELEGATION operation and asks for the client calls the GET_DIR_DELEGATION operation and asks for
attribute change notifications, it should request notification attribute change notifications, it should request notification
delays that are no less than the values in the server-provided delays that are no less than the values in the server-provided
attributes. attributes.
5.12.1. dir_notif_delay 5.12.1. dir_notif_delay
skipping to change at page 110, line 46 skipping to change at page 110, line 46
client when it is more efficient to issue read and write requests to client when it is more efficient to issue read and write requests to
the metadata server or the data server. The two types of thresholds the metadata server or the data server. The two types of thresholds
described are file size thresholds and I/O size thresholds. If a described are file size thresholds and I/O size thresholds. If a
file's size is smaller than the file size threshold, data accesses file's size is smaller than the file size threshold, data accesses
should be issued to the metadata server. If an I/O is below the I/O should be issued to the metadata server. If an I/O is below the I/O
size threshold, the I/O should be issued to the metadata server. As size threshold, the I/O should be issued to the metadata server. As
defined, each threshold type is specified separately for read and defined, each threshold type is specified separately for read and
write. write.
The server may provide both types of thresholds for a file. If both The server may provide both types of thresholds for a file. If both
file size and I/O size are provided, the client must exceed both file size and I/O size are provided, the client should exceed both
thresholds before issuing its read or write requests to the data thresholds before issuing its read or write requests to the data
server. Alternatively, if only one of the specified thresholds is server. Alternatively, if only one of the specified thresholds is
exceeded, the I/O requests are issued to the metadata server. exceeded, the I/O requests are issued to the metadata server.
For each threshold type, a value of 0 indicates no read or write For each threshold type, a value of 0 indicates no read or write
should be issued to the metadata server, while a value of all 1s should be issued to the metadata server, while a value of all 1s
indicates all reads or writes should be issued to the metadata indicates all reads or writes should be issued to the metadata
server. server.
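The threshold rules above can be sketched as a small decision function. This is only an illustration, not text from either draft: the function name, the use of None for a threshold the server did not supply, the 64-bit width assumed for "all 1s", and the treatment of a value exactly equal to its threshold (sent to the data server here) are all assumptions.

```python
# Assumption: thresholds are 64-bit unsigned values, so "all 1s" is 2**64 - 1.
ALL_ONES = 2**64 - 1


def use_metadata_server(file_size, io_size, size_threshold=None, io_threshold=None):
    """Return True if this read or write should go to the metadata server.

    A threshold of None means the server did not provide that hint
    (hypothetical convention for this sketch).
    """

    def below(value, threshold):
        if threshold is None:
            return False          # no hint of this type was given
        if threshold == 0:
            return False          # 0: nothing goes to the metadata server
        if threshold == ALL_ONES:
            return True           # all 1s: everything goes to the metadata server
        return value < threshold  # assumption: equality counts as "not below"

    # The data server is used only when every supplied threshold is exceeded,
    # so falling below any one of them selects the metadata server.
    return below(file_size, size_threshold) or below(io_size, io_threshold)
```

A client would apply the function separately for reads and for writes, since each threshold type is specified separately for the two directions.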
The attribute is available on a per filehandle basis. If the current The attribute is available on a per filehandle basis. If the current
skipping to change at page 138, line 12 skipping to change at page 138, line 12
Finally, in the case where the request that creates a new file or Finally, in the case where the request that creates a new file or
directory does not also set permissions for that file or directory, directory does not also set permissions for that file or directory,
and there are also no ACEs to inherit from the parent's directory, and there are also no ACEs to inherit from the parent's directory,
then the server's choice of ACL for the new object is implementation- then the server's choice of ACL for the new object is implementation-
dependent. In this case, the server SHOULD set the ACL4_DEFAULTED dependent. In this case, the server SHOULD set the ACL4_DEFAULTED
flag on the ACL it chooses for the new object. An application flag on the ACL it chooses for the new object. An application
performing automatic inheritance takes the ACL4_DEFAULTED flag as a performing automatic inheritance takes the ACL4_DEFAULTED flag as a
sign that the ACL should be completely replaced by one generated sign that the ACL should be completely replaced by one generated
using the automatic inheritance rules. using the automatic inheritance rules.
7. Single-server Name Space 7. Single-server Namespace
This chapter describes the NFSv4 single-server name space. Single- This chapter describes the NFSv4 single-server name space. Single-
server namespaces may be presented directly to clients, or they may server namespaces may be presented directly to clients, or they may
be used as a basis to form larger multi-server namespaces (e.g. site- be used as a basis to form larger multi-server namespaces (e.g. site-
wide or organization-wide) to be presented to clients, as described wide or organization-wide) to be presented to clients, as described
in Section 10. in Section 11.
7.1. Server Exports 7.1. Server Exports
On a UNIX server, the name space describes all the files reachable by On a UNIX server, the name space describes all the files reachable by
pathnames under the root directory or "/". On a Windows NT server pathnames under the root directory or "/". On a Windows NT server
the name space constitutes all the files on disks named by mapped the namespace constitutes all the files on disks named by mapped disk
disk letters. NFS server administrators rarely make the entire letters. NFS server administrators rarely make the entire server's
server's file system name space available to NFS clients. More often file system namespace available to NFS clients. More often portions
portions of the name space are made available via an "export" of the namespace are made available via an "export" feature. In
feature. In previous versions of the NFS protocol, the root previous versions of the NFS protocol, the root filehandle for each
filehandle for each export is obtained through the MOUNT protocol; export is obtained through the MOUNT protocol; the client sends a
the client sends a string that identifies the export of name space string that identifies the export of namespace and the server returns
and the server returns the root filehandle for it. The MOUNT the root filehandle for it. The MOUNT protocol supports an EXPORTS
protocol supports an EXPORTS procedure that will enumerate the procedure that will enumerate the server's exports.
server's exports.
7.2. Browsing Exports 7.2. Browsing Exports
The NFS version 4 protocol provides a root filehandle that clients The NFS version 4 protocol provides a root filehandle that clients
can use to obtain filehandles for the exports of a particular server, can use to obtain filehandles for the exports of a particular server,
via a series of LOOKUP operations within a COMPOUND, to traverse a via a series of LOOKUP operations within a COMPOUND, to traverse a
path. A common user experience is to use a graphical user interface path. A common user experience is to use a graphical user interface
(perhaps a file "Open" dialog window) to find a file via progressive (perhaps a file "Open" dialog window) to find a file via progressive
browsing through a directory tree. The client must be able to move browsing through a directory tree. The client must be able to move
from one export to another export via single-component, progressive from one export to another export via single-component, progressive
LOOKUP operations. LOOKUP operations.
This style of browsing is not well supported by the NFS version 2 and This style of browsing is not well supported by the NFS version 2 and
3 protocols. The client expects all LOOKUP operations to remain 3 protocols. The client expects all LOOKUP operations to remain
within a single server file system. For example, the device within a single server file system. For example, the device
attribute will not change. This prevents a client from taking name attribute will not change. This prevents a client from taking
space paths that span exports. namespace paths that span exports.
An automounter on the client can obtain a snapshot of the server's In the case of Version 2 and 3, an automounter on the client can
name space using the EXPORTS procedure of the MOUNT protocol. If it obtain a snapshot of the server's namespace using the EXPORTS
understands the server's pathname syntax, it can create an image of procedure of the MOUNT protocol. If it understands the server's
the server's name space on the client. The parts of the name space pathname syntax, it can create an image of the server's namespace on
that are not exported by the server are filled in with a "pseudo file the client. The parts of the namespace that are not exported by the
system" that allows the user to browse from one mounted file system server are filled in with directories that might be arranged similarly
to another. There is a drawback to this representation of the to a version 4 "pseudo file system" that allows the user to browse
server's name space on the client: it is static. If the server from one mounted file system to another. There is a drawback to this
administrator adds a new export the client will be unaware of it. representation of the server's namespace on the client: it is static.
If the server administrator adds a new export the client will be
unaware of it.
7.3. Server Pseudo File System 7.3. Server Pseudo File System
NFS version 4 servers avoid this name space inconsistency by NFS version 4 servers avoid this name space inconsistency by
presenting all the exports for a given server within the framework of presenting all the exports for a given server within the framework of
a single namespace, for that server. An NFS version 4 client uses a single namespace, for that server. An NFS version 4 client uses
LOOKUP and READDIR operations to browse seamlessly from one export to LOOKUP and READDIR operations to browse seamlessly from one export to
another. Portions of the server name space that are not exported are another.
bridged via a "pseudo file system" that provides a view of exported
Where there are portions of the server namespace that are not
exported, clients require some way of traversing those portions to
reach actual exported file systems. A technique that servers may use
to provide for this is to bridge unexported portions of the namespace
via a "pseudo file system" that provides a view of exported
directories only. A pseudo file system has a unique fsid and behaves directories only. A pseudo file system has a unique fsid and behaves
like a normal, read only file system. like a normal, read only file system.
Based on the construction of the server's name space, it is possible Based on the construction of the server's name space, it is possible
that multiple pseudo file systems may exist. For example, that multiple pseudo file systems may exist. For example,
/a pseudo file system /a pseudo file system
/a/b real file system /a/b real file system
/a/b/c pseudo file system /a/b/c pseudo file system
/a/b/c/d real file system /a/b/c/d real file system
Each of the pseudo file systems is considered a separate entity and Each of the pseudo file systems is considered a separate entity and
therefore will have its own unique fsid. therefore MUST have its own unique fsid.
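The layout in the example can be reproduced with a short sketch. It assumes, as the example does, that every listed export path is the root of a real file system and that every intermediate directory that is not itself exported must be supplied by a pseudo file system. The function name and the string labels are hypothetical.

```python
def classify_namespace(exports):
    """Label each directory on the way to an export as 'real' or 'pseudo'.

    Assumption for this sketch: an exported path is the root of a real
    file system; any ancestor that is not exported is a pseudo directory.
    """
    exports = set(exports)
    kinds = {}
    for export in exports:
        path = ""
        for part in (p for p in export.split("/") if p):
            path += "/" + part
            if path in exports:
                kinds[path] = "real"
            else:
                kinds.setdefault(path, "pseudo")
    return dict(sorted(kinds.items()))


layout = classify_namespace(["/a/b", "/a/b/c/d"])
for path, kind in layout.items():
    print(f"{path:10} {kind} file system")
```

In a fuller model each run of pseudo directories between real file systems would be given its own fsid, which is what makes the two pseudo file systems in the example distinct.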
7.4. Multiple Roots 7.4. Multiple Roots
The DOS and Windows operating environments are sometimes described as Certain operating environments are sometimes described as having
having "multiple roots". File Systems are commonly represented as "multiple roots". In such environments individual file systems are
disk letters. MacOS represents file systems as top level names. NFS commonly represented by disk or volume names. NFS version 4 servers
version 4 servers for these platforms can construct a pseudo file for these platforms can construct a pseudo file system above these
system above these root names so that disk letters or volume names root names so that disk letters or volume names are simply directory
are simply directory names in the pseudo root. names in the pseudo root.
7.5. Filehandle Volatility 7.5. Filehandle Volatility
The nature of the server's pseudo file system is that it is a logical The nature of the server's pseudo file system is that it is a logical
representation of file system(s) available from the server. representation of file system(s) available from the server.
Therefore, the pseudo file system is most likely constructed Therefore, the pseudo file system is most likely constructed
dynamically when the server is first instantiated. It is expected dynamically when the server is first instantiated. It is expected
that the pseudo file system may not have an on disk counterpart from that the pseudo file system may not have an on disk counterpart from
which persistent filehandles could be constructed. Even though it is which persistent filehandles could be constructed. Even though it is
preferable that the server provide persistent filehandles for the preferable that the server provide persistent filehandles for the
pseudo file system, the NFS client should expect that pseudo file pseudo file system, the NFS client should expect that pseudo file
system filehandles are volatile. This can be confirmed by checking system filehandles are volatile. This can be confirmed by checking
the associated "fh_expire_type" attribute for those filehandles in the associated "fh_expire_type" attribute for those filehandles in
question. If the filehandles are volatile, the NFS client must be question. If the filehandles are volatile, the NFS client must be
prepared to recover a filehandle value (e.g. with a series of LOOKUP prepared to recover a filehandle value (e.g. with a series of LOOKUP
operations) when receiving an error of NFS4ERR_FHEXPIRED. operations) when receiving an error of NFS4ERR_FHEXPIRED.
Because it is quite likely that servers will implement pseudo file
systems using volatile filehandles, clients need to be prepared for
them, rather than assuming that all filehandles will be persistent.
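The recovery behavior described in this section can be sketched with a toy model. Nothing here is a real NFS client API: ToyServer, FHExpired, and getattr_with_recovery are hypothetical names, and the "epoch" counter is just one assumed way a server might invalidate volatile filehandles.

```python
class FHExpired(Exception):
    """Stands in for the NFS4ERR_FHEXPIRED error in this sketch."""


class ToyServer:
    """Hypothetical server whose pseudo file system handles are volatile."""

    def __init__(self):
        self.epoch = 0  # bumped whenever the pseudo file system is rebuilt

    def lookup_path(self, components):
        # Models a series of LOOKUP operations; returns an opaque handle.
        return (self.epoch, "/" + "/".join(components))

    def getattr(self, fh):
        if fh[0] != self.epoch:
            raise FHExpired()
        return {"path": fh[1]}

    def restart(self):
        self.epoch += 1


def getattr_with_recovery(server, components, fh):
    """Use fh, and on expiry recover it by walking the path again."""
    try:
        return fh, server.getattr(fh)
    except FHExpired:
        fh = server.lookup_path(components)  # re-derive the filehandle
        return fh, server.getattr(fh)
```

The design point is that the client keeps the path components alongside the filehandle, since that is the only information from which an expired handle can be rebuilt.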
7.6. Exported Root 7.6. Exported Root
If the server's root file system is exported, one might conclude that If the server's root file system is exported, one might conclude that
a pseudo-file system is unneeded. This is not necessarily so. Assume a pseudo-file system is unneeded. This is not necessarily so. Assume
the following file systems on a server: the following file systems on a server:
/ disk1 (exported) / fs1 (exported)
/a disk2 (not exported) /a fs2 (not exported)
/a/b disk3 (exported) /a/b fs3 (exported)
Because disk2 is not exported, disk3 cannot be reached with simple Because fs2 is not exported, fs3 cannot be reached with simple
LOOKUPs. The server must bridge the gap with a pseudo-file system. LOOKUPs. The server must bridge the gap with a pseudo-file system.
7.7. Mount Point Crossing 7.7. Mount Point Crossing
The server file system environment may be constructed in such a way The server file system environment may be constructed in such a way
that one file system contains a directory which is 'covered' or that one file system contains a directory which is 'covered' or
mounted upon by a second file system. For example: mounted upon by a second file system. For example:
/a/b (file system 1) /a/b (file system 1)
/a/b/c/d (file system 2) /a/b/c/d (file system 2)
skipping to change at page 141, line 5 skipping to change at page 141, line 14
It is the server's responsibility to present a pseudo file system It is the server's responsibility to present a pseudo file system
that is complete to the client. If the client sends a lookup request that is complete to the client. If the client sends a lookup request
for the path "/a/b/c/d", the server's response is the filehandle of for the path "/a/b/c/d", the server's response is the filehandle of
the file system "/a/b/c/d". In previous versions of the NFS the file system "/a/b/c/d". In previous versions of the NFS
protocol, the server would respond with the filehandle of directory protocol, the server would respond with the filehandle of directory
"/a/b/c/d" within the file system "/a/b". "/a/b/c/d" within the file system "/a/b".
The NFS client will be able to determine if it crosses a server mount The NFS client will be able to determine if it crosses a server mount
point by a change in the value of the "fsid" attribute. point by a change in the value of the "fsid" attribute.
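As a rough illustration of that check, the sketch below walks a list of (component name, fsid) pairs, as a client might collect them from LOOKUP plus GETATTR, and reports where the fsid changes. The function name and the fsid values, shown as (major, minor) tuples, are made up for the example.

```python
def mount_point_crossings(walk):
    """Return the component names at which the fsid differs from the
    fsid of the previous component on the path."""
    crossings = []
    previous_fsid = None
    for name, fsid in walk:
        if previous_fsid is not None and fsid != previous_fsid:
            crossings.append(name)
        previous_fsid = fsid
    return crossings


# Hypothetical fsid values for the path /a/b/c/d from the example above.
walk = [("a", (1, 0)), ("b", (1, 0)), ("c", (1, 0)), ("d", (2, 0))]
print(mount_point_crossings(walk))
```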
7.8. Security Policy and Name Space Presentation 7.8. Security Policy and Namespace Presentation
The application of the server's security policy needs to be carefully Because NFSv4 clients possess the ability to change the security
considered by the implementor. One may choose to limit the mechanisms used, after determining what is allowed, by using SECINFO
viewability of portions of the pseudo file system based on the and SECINFO_NONAME, the server SHOULD NOT present a different view of
server's perception of the client's ability to authenticate itself the namespace based on the security mechanism being used by a client.
properly. However, with the support of multiple security mechanisms Instead, it should present a consistent view and return
and the ability to negotiate the appropriate use of these mechanisms, NFS4ERR_WRONGSEC if an attempt is made to access data with an
the server is unable to properly determine if a client will be able inappropriate security mechanism.
to authenticate itself. If, based on its policies, the server
chooses to limit the contents of the pseudo file system, the server
may effectively hide file systems from a client that may otherwise
have legitimate access.
As suggested practice, the server should apply the security policy of If security considerations make it necessary to hide the existence of
a shared resource in the server's namespace to the components of the a particular file system, as opposed to all of the data within it,
resource's ancestors. For example: the server can apply the security policy of a shared resource in the
server's namespace to components of the resource's ancestors. For
example:
/ /
/a/b /a/b
/a/b/c /a/b/MySecretProject
The /a/b/c directory is a real file system and is the shared The /a/b/MySecretProject directory is a real file system and is the
resource. The security policy for /a/b/c is Kerberos with integrity. shared resource. Suppose the security policy for /a/b/
The server should apply the same security policy to /, /a, and /a/b. MySecretProject is Kerberos with integrity and it is desired that
This allows for the extension of the protection of the server's knowledge of the existence of this file system be very limited.
namespace to the ancestors of the real shared resource. In this case the server should apply the same security policy to
/a/b. This allows knowledge of the existence of a file system to be
secured in cases where this is desirable.
For the case of the use of multiple, disjoint security mechanisms in For the case of the use of multiple, disjoint security mechanisms in
the server's resources, the security for a particular object in the the server's resources, the security for a particular object in the
server's namespace should be the union of all security mechanisms of server's namespace should be the union of all security mechanisms of
all direct descendants. all direct descendants. A common and convenient practice, unless
strong security requirements dictate otherwise, is to make all of the
pseudo file system accessible by all of the valid security
mechanisms.
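The union rule for pseudo file system objects can be sketched as a recursive walk. The tree and flavor tables below are hypothetical inputs, the flavor labels are illustrative strings rather than protocol constants, and applying the rule recursively through nested pseudo directories is an assumption of this sketch.

```python
def effective_flavors(children, flavors, node):
    """Return the set of security mechanisms acceptable at node.

    children: maps a pseudo directory to its direct descendants
    flavors:  maps each real (exported) file system to its own mechanisms
    """
    kids = children.get(node, [])
    if not kids:
        return set(flavors[node])  # a real file system keeps its own policy
    result = set()
    for kid in kids:
        result |= effective_flavors(children, flavors, kid)
    return result


# Hypothetical namespace: two exports below the pseudo directory /a.
children = {"/": ["/a"], "/a": ["/a/b", "/a/c"]}
flavors = {"/a/b": {"krb5i"}, "/a/c": {"sys", "krb5"}}
print(sorted(effective_flavors(children, flavors, "/a")))
```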
8. File Locking and Share Reservations Where there is concern about the security of data on the wire,
clients should use strong security mechanisms to access the pseudo
file system. This prevents man-in-the-middle attacks from
redirecting LOOKUPs within the pseudo file system and thereby
revealing the existence of sensitive data, or from gaining access to
data that the client is sending by directing the client to send it
using weak security mechanisms.
8. State Management
Integrating locking into the NFS protocol necessarily causes it to be Integrating locking into the NFS protocol necessarily causes it to be
stateful. With the inclusion of such features as share reservations, stateful. With the inclusion of such features as share reservations,
file and directory delegations, recallable layouts, and support for file and directory delegations, recallable layouts, and support for
mandatory record locking the protocol becomes substantially more mandatory record locking the protocol becomes substantially more
dependent on state than the traditional combination of NFS and NLM dependent on state than the traditional combination of NFS and NLM
[XNFS]. There are three components to making this state manageable: [XNFS]. There are three components to making this state manageable:
o Clear division between client and server o Clear division between client and server
skipping to change at page 142, line 4 skipping to change at page 142, line 23
o Ability to reliably detect inconsistency in state between client o Ability to reliably detect inconsistency in state between client
and server and server
o Simple and robust recovery mechanisms o Simple and robust recovery mechanisms
In this model, the server owns the state information. The client In this model, the server owns the state information. The client
requests changes in locks and the server responds with the changes requests changes in locks and the server responds with the changes
made. Non-client-initiated changes in locking state are infrequent made. Non-client-initiated changes in locking state are infrequent
and the client receives prompt notification of them and can adjust and the client receives prompt notification of them and can adjust
its view of the locking state to reflect the server's changes. its view of the locking state to reflect the server's changes.
To support Win32 share reservations it is necessary to provide Individual pieces of state created by the server and passed to the
operations which atomically OPEN or CREATE files. Having a separate client at its request are represented by 128-bit stateids. These
share/unshare operation would not allow correct implementation of the stateids may represent a particular open file, a set of byte-range
Win32 OpenFile API. In order to correctly implement share semantics, locks held by a particular owner, or a recallable delegation of
the previous NFS protocol mechanisms used when a file is opened or privileges to access a file in particular ways, or at a particular
created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFS location.
version 4.1 protocol defines OPEN operation which looks up or creates
a file and establishes locking state on the server.
8.1. Locking
It is assumed that manipulating a lock is rare when compared to READ
and WRITE operations. It is also assumed that crashes and network
partitions are relatively rare. Therefore it is important that the
READ and WRITE operations have a lightweight mechanism to indicate if
they possess a held lock. A lock request contains the heavyweight
information required to establish a lock and uniquely define the lock
owner.
The following sections describe the transition from the heavyweight In all cases, there is a transition from the largest-gauge
information to the eventual lightweight stateid used for most client information which represents a client as a whole to the eventual
and server locking interactions. lightweight stateid used for most client and server locking
interactions. The details of this transition will vary with the type
of object but it always starts with a client_id.
8.1.1. Client and Session ID 8.1. Client and Session ID
A client must establish a client ID (see Section 2.4) and then one or A client must establish a client ID (see Section 2.4) and then one or
more sessionids (see Section 2.10) before performing any operations more sessionids (see Section 2.10) before performing any operations
to open, lock, or delegate a file object. The sessionid services as to open, lock, delegate, or obtain a layout for a file object. The
a shorthand referral to an NFSv4.1 client. sessionid serves as a shorthand reference to an NFSv4.1 client.
8.1.2. State-owner Definition
When opening a file or requesting a record lock, the client must
specify an identifier which represents the owner of the requested
lock. This identifier is in the form of a state-owner, represented
in the protocol by a state_owner4, a variable-length opaque array
which, when concatenated with the current client ID uniquely defines
the owner of lock managed by the client. This may be a thread id,
process id, or other unique value.
Owners of opens and owners of record locks are separate entities and
remain separate even if the same opaque arrays are used to designate
owners of each. The protocol distinguishes between open-owners
(represented by open_owner4 structures) and lock-owners (represented
by lock_owner4 structures).
Each open is associated with a specific open-owner while each record For some types of locking interactions, the client will represent
lock is associated with a lock-owner and an open-owner, the latter some number of internal locking entities called "owners", which
being the open-owner associated with the open file under which the normally correspond to processes internal to the client. For other
LOCK operation was done. Delegations and layouts, on the other hand, types of locking-related objects, such as delegations and layouts, no
are not associated with a specific owner but are associated the such intermediate entities are provided for, and the locking-related
client as a whole. objects are considered to be transferred directly between the server
and a unitary client.
8.1.3. Stateid Definition 8.2. Stateid Definition
When the server grants a lock of any type (including opens, record When the server grants a lock of any type (including opens, record
locks, delegations, and layouts) it responds with a unique stateid, locks, delegations, and layouts) it responds with a unique stateid,
that represents a set of locks (often a single lock) for the same that represents a set of locks (often a single lock) for the same
file, of the same type, and sharing the same ownership file, of the same type, and sharing the same ownership
characteristics. Thus opens of the same file by different open- characteristics. Thus opens of the same file by different open-
owners each have an identifying stateid. Similarly, each set of owners each have an identifying stateid. Similarly, each set of
record locks on a file owned by a specific lock-owner and gotten via record locks on a file owned by a specific lock-owner and gotten via
an open for a specific open-owner, has its own identifying stateid. an open for a specific open-owner, has its own identifying stateid.
Delegations and layouts also have associated stateids by which they Delegations and layouts also have associated stateids by which they
may be referenced. The stateid is used as a shorthand reference to a may be referenced. The stateid is used as a shorthand reference to a
lock or set of locks and given a stateid the server can determine the lock or set of locks and given a stateid the server can determine the
associated state-owner or state-owners (in the case of an open-owner/ associated state-owner or state-owners (in the case of an open-owner/
lock-owner pair) and the associated filehandle. When stateids are lock-owner pair) and the associated filehandle. When stateids are
used the current filehandle must be the one associated with that used, the current filehandle must be the one associated with that
stateid. stateid.
The server may assign stateids independently for different clients The server may assign stateids independently for different clients
and a stateid with the same bit pattern for one client may designate and a stateid with the same bit pattern for one client may designate
an entirely different set of locks for a different client. The an entirely different set of locks for a different client. The
stateid is always interpreted with respect to the client ID stateid is always interpreted with respect to the client ID
associated with the current session. Stateids apply to all sessions associated with the current session. Stateids apply to all sessions
associated with the given client ID and the client may use a stateid associated with the given client ID and the client may use a stateid
obtained from one session on another session associated with the same obtained from one session on another session associated with the same
client ID. client ID.
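A minimal sketch of that scoping rule, assuming a server keeps its locking state in a table keyed by client ID and the stateid's bit pattern. ToyStateTable and its methods are hypothetical names, not part of the protocol.

```python
class ToyStateTable:
    """Hypothetical server-side table showing that a stateid is only
    meaningful together with the client ID of the current session."""

    def __init__(self):
        self.session_to_client = {}
        self.locks = {}

    def bind_session(self, session_id, client_id):
        self.session_to_client[session_id] = client_id

    def grant(self, client_id, stateid, description):
        self.locks[(client_id, stateid)] = description

    def resolve(self, session_id, stateid):
        # The session selects the client ID; the stateid is then looked up
        # within that client's state only.
        client_id = self.session_to_client[session_id]
        return self.locks.get((client_id, stateid))
```

With two sessions bound to one client ID, either session resolves the same stateid to the same locks, while an identical bit pattern presented on another client's session resolves to that other client's state.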
8.1.3.1. Stateid Structure 8.2.1. Stateid Types
Besides special stateids, to be discussed later, each stateid
represents locking objects of one of a set of types defined by the
NFSv4.1 protocol. Note that in all these cases, where we speak of a
guarantee, there is always an implied codicil that any situation such
as a client reboot, or lock revocation, allows the guarantee to be
voided.
o Stateids may represent opens of files.
o Stateids may represent sets of byte-range locks held on a
particular file by a particular owner and all gotten under the
aegis of a particular open file.
o Stateids may represent file delegations, which are recallable
guarantees by the server to the client, that other clients will
not reference, or will not modify a particular file, until the
delegation is returned.
o Stateids may represent directory delegations, which are recallable
guarantees by the server to the client, that other clients will
not modify the directory, until the delegation is returned.
o Stateids may represent layouts, which are recallable guarantees by
the server to the client, that particular files may be accessed
via an alternate data access protocol at specific locations. Such
access is limited to particular sets of byte ranges and may
proceed until those byte ranges are reduced or the layout is
returned.
o Stateids may represent device maps, which are recallable
guarantees by the server to the client, that device id's in
layouts will not be changed to designate different devices.
8.2.2. Stateid Structure
Stateids are divided into two fields, a 96-bit "other" field Stateids are divided into two fields, a 96-bit "other" field
identifying the specific set of locks and a 32-bit "seqid" sequence identifying the specific set of locks and a 32-bit "seqid" sequence
value. Except in the case of special stateids, to be discussed value. Except in the case of special stateids, to be discussed
below, the NFSv4.1 specification requires the server to increment below, a particular value of the "other" field denotes a set of locks
seqid field by one (1) whenever it returns a stateid with an "other" of the same type (for example byte-range lock, opens, delegations, or
field on that is different that that of the previous stateid it layouts), for a specific file or directory, and sharing the same
generated for the state owner/file combination. The purpose of the ownership characteristics. The seqid designates a specific instance
incrementing the seqid is to allow the replier to communicate to the of such a set of locks, and is incremented to indicate changes in
requester the order in which operations that modified locking state such a set of locks, either by the addition or deletion of locks from
associated with a stateid have been processed. This is necessary for the set, a change in the byte-range they apply to, or an upgrade or
the scenario where the state owner is sending multiple parallel downgrade in the type of one or more locks.
operations with the same stateid as an argument, or in the case of
OPEN, the same file as an argument. In this scenario, at least one
returned stateid differs from the other returned stateids. Without
knowing the order of how the operations executed, the client cannot
tell which of the returned stateids corresponds to the current state
of the file/state owner combination. This is a problem because
subsequent operations on the same file/state owner combination
require the latest stateid to be used in the arguments. The visibility
of the "seqid" value in the stateid allows a client to determine
which of the returned stateids is the latest.
In the case of stateids associated with opens, i.e. the stateids When such a set of locks is first created the server returns a
returned by OPEN (the state for the open, rather than that for the stateid with seqid value of one. On subsequent operations which
delegation), OPEN_DOWNGRADE, or CLOSE, the server MUST provide an modify the set of locks the server is required to increment the seqid
"seqid" value starting at one for the first use of a given "other" field by one (1) whenever it returns a stateid for the same state
value and incremented by one with each subsequent operation returning owner/file/type combination and there is some change in the set of
a stateid. locks actually designated. In this case the server will return a
stateid with an other field the same as previously used for that
state owner/file/type combination, with an incremented seqid field.
In the case of other sorts of stateids (i.e. stateids associated with The purpose of the incrementing of the seqid is to allow the replier
record locks and delegations), the server MAY provide an incrementing to communicate to the requester the order in which operations that
sequence value on successive stateids returned with same identifying modified locking state associated with a stateid have been processed
field, or it may return the value zero. If it does return a non-zero and to make it possible for the client to issue requests that are
"seqid" value it MUST start at one and be incremented by one with conditional on the set of locks not having changed since the stateid
each subsequent operation returning a stateid with same "other" in question was returned.
value, just as is done with open state.
The client when using a stateid as a parameter to an operation, must, When stateids are sent to the server by the client, it has two
except in the case of a special stateid, set the sequence value to choices with regard to the seqid sent. It may set the seqid to zero
zero. If the value is non-zero, the server MUST return the error to indicate to the server that it wishes the most up-to-date seqid
NFS4ERR_BAD_STATEID. for that stateid's "other" field to be used. This would be the
common choice in the case of a stateid sent with a READ or WRITE
operation. It also may set a non-zero value in which case the server
checks if that seqid is the correct one. In that case the server is
required to return NFS4ERR_OLD_STATEID if the seqid is lower than the
most current value and NFS4ERR_BAD_STATEID if the seqid is greater
than the most current value. This would be the common choice in the
case of stateids sent with a CLOSE or OPEN_DOWNGRADE. Because OPENs
may be sent in parallel for the same owner, a client might close a
file without knowing that an OPEN upgrade had been done by the
server, changing the lock in question. If CLOSE were sent with a
zero seqid, the OPEN upgrade would be canceled before the client even
received an indication that it had happened.
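The seqid handling described above can be illustrated with a minimal sketch. The function name check_seqid and the string constants are hypothetical and are not part of the protocol definition. The sketch also assumes plain integer comparison and ignores any question of 32-bit wraparound.

```python
# Illustrative only: how a server might compare a client-supplied seqid
# against the most current seqid for the same "other" value.
NFS4_OK = "NFS4_OK"
NFS4ERR_OLD_STATEID = "NFS4ERR_OLD_STATEID"
NFS4ERR_BAD_STATEID = "NFS4ERR_BAD_STATEID"


def check_seqid(sent_seqid: int, current_seqid: int) -> str:
    """Return the result of checking the seqid sent by the client."""
    if sent_seqid == 0:
        # Zero asks the server to use the most up-to-date seqid
        # (the common choice for READ and WRITE).
        return NFS4_OK
    if sent_seqid < current_seqid:
        # The client is using a stateid instance that has been superseded.
        return NFS4ERR_OLD_STATEID
    if sent_seqid > current_seqid:
        # The server never issued this seqid for this "other" value.
        return NFS4ERR_BAD_STATEID
    return NFS4_OK
```

With a non-zero seqid on CLOSE, a client that has not yet seen the result of a parallel OPEN upgrade would receive NFS4ERR_OLD_STATEID instead of silently closing the upgraded open.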
8.1.3.2. Special Stateids 8.2.3. Special Stateids
Stateid values whose "other" field is either all zeros or all ones Stateid values whose "other" field is either all zeros or all ones
are reserved. They may not be assigned by the server but have are reserved. They may not be assigned by the server but have
special meanings defined by the protocol. The particular meaning special meanings defined by the protocol. The particular meaning
depends on whether the "other" field is all zeros or all ones and the depends on whether the "other" field is all zeros or all ones and the
specific value of the "seqid" field. specific value of the "seqid" field.
The following combinations of "other" and "seqid" are defined in The following combinations of "other" and "seqid" are defined in
NFSv4.1: NFSv4.1:
o When "other" and "seqid" are both zero, the stateid is treated as o When "other" and "seqid" are both zero, the stateid is treated as
a special anonymous stateid, which can be used in READ, WRITE, and a special anonymous stateid, which can be used in READ, WRITE, and
SETATTR requests to indicate the absence of any open state SETATTR requests to indicate the absence of any open state
associated with the request. When an anonymous stateid value is associated with the request. When an anonymous stateid value is
used, and an existing open denies the form of access requested, used, and an existing open denies the form of access requested,
then access will be denied to the request. This stateid MUST NOT then access will be denied to the request. This stateid MUST NOT
be used on operations to data servers (Section 13.7), nor may it be used on operations to data servers (Section 14.7), nor may it
be used as the argument to the WANT_DELEGATION (Section 17.49) be used as the argument to the WANT_DELEGATION (Section 18.49)
operation. operation.
o When "other" and "seqid" are both all ones, the stateid is a o When "other" and "seqid" are both all ones, the stateid is a
special read bypass stateid. When this value is used in WRITE or special read bypass stateid. When this value is used in WRITE or
SETATTR, it is treated like the anonymous value. When used in SETATTR, it is treated like the anonymous value. When used in
READ, the server MAY grant access, even if access would normally READ, the server MAY grant access, even if access would normally
be denied to READ requests. This stateid MUST NOT be used on be denied to READ requests. This stateid MUST NOT be used on
operations to data servers, nor may it be used as the argument to operations to data servers, nor may it be used as the argument to
the WANT_DELEGATION operation. the WANT_DELEGATION operation.
o When "other" is zero and "seqid" is one, the stateid represents o When "other" is zero and "seqid" is one, the stateid represents
the current stateid, which is whatever value is the last stateid the current stateid, which is whatever value is the last stateid
returned by an operation within the COMPOUND. In the case of an returned by an operation within the COMPOUND. In the case of an
OPEN, the stateid returned for the open file, and not the OPEN, the stateid returned for the open file, and not the
delegation is used. The stateid passed to the operation in place delegation is used. The stateid passed to the operation in place
of the special value has its "seqid" value set to zero. If there of the special value has its "seqid" value set to zero, except
is no operation in the COMPOUND which has returned a stateid when the current stateid is used by the operation CLOSE or
value, the server MUST return the error NFS4ERR_BAD_STATEID. OPEN_DOWNGRADE. If there is no operation in the COMPOUND which
has returned a stateid value, the server MUST return the error
NFS4ERR_BAD_STATEID.
If a stateid value is used which has all zero or all ones in the If a stateid value is used which has all zero or all ones in the
"other" field, but does not match one of the cases above, the server "other" field, but does not match one of the cases above, the server
MUST return the error NFS4ERR_BAD_STATEID. MUST return the error NFS4ERR_BAD_STATEID.
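The special stateid combinations listed above can be summarized in a short sketch. The function name classify_stateid and the returned labels are hypothetical. The sketch assumes the 96-bit "other" field is carried as 12 bytes and the "seqid" as an unsigned 32-bit integer.

```python
# Illustrative classification of special stateids by "other" and "seqid".
OTHER_ZEROS = bytes(12)          # 96 bits, all zeros
OTHER_ONES = b"\xff" * 12        # 96 bits, all ones
SEQID_ALL_ONES = 0xFFFFFFFF      # 32 bits, all ones


def classify_stateid(other: bytes, seqid: int) -> str:
    """Return a label for the stateid, or an error name for an undefined
    combination of a reserved "other" value and "seqid"."""
    if other not in (OTHER_ZEROS, OTHER_ONES):
        return "ordinary"                    # assigned by the server
    if other == OTHER_ZEROS and seqid == 0:
        return "anonymous"
    if other == OTHER_ONES and seqid == SEQID_ALL_ONES:
        return "read_bypass"
    if other == OTHER_ZEROS and seqid == 1:
        return "current"
    # Reserved "other" value with an undefined "seqid".
    return "NFS4ERR_BAD_STATEID"
```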
Special stateids, unlike other stateids are not associated with Special stateids, unlike other stateids, are not associated with
individual client ID's or filehandles and can be used with all valid individual client ID's or filehandles and can be used with all valid
client ID's and filehandles. In the case of a special stateid client ID's and filehandles. In the case of a special stateid
designating the current current stateid, the current stateid value designating the current stateid, the current stateid value
substituted for the special stateid is associated with a particular substituted for the special stateid is associated with a particular
client ID and filehandle. client ID and filehandle.
8.1.3.3. Stateid Lifetime and Validation 8.2.4. Stateid Lifetime and Validation
Stateids must remain valid until either a client reboot or a sever Stateids must remain valid until either a client reboot or a server
reboot or until the client returns all of the locks associated with reboot or until the client returns all of the locks associated with
the stateid by means of an operation such as CLOSE or DELEGRETURN. the stateid by means of an operation such as CLOSE or DELEGRETURN.
If the locks are lost due to revocation the stateid remains a valid If the locks are lost due to revocation the stateid remains a valid
designation of that revoked state until the client frees it by using designation of that revoked state until the client frees it by using
FREE_STATEID. Stateids associated with record locks are an FREE_STATEID. Stateids associated with record locks are an
exception. They remain valid even if a LOCKU free all remaining exception. They remain valid even if a LOCKU frees all remaining
locks, so long as the open file with which they are associated locks, so long as the open file with which they are associated
remains open, unless the client does a FREE_STATEID to cause the remains open, unless the client does a FREE_STATEID to cause the
stateid to be freed. stateid to be freed.
An "other" value must never be reused for a different purpose (i.e. An "other" value must never be reused for a different purpose (i.e.
different filehandle, owner, or type of locks) within the context of different filehandle, owner, or type of locks) within the context of
a single client ID. A server may retain the "other" value for the a single client ID. A server may retain the "other" value for the
same purpose beyond the point where it may otherwise be freed but if same purpose beyond the point where it may otherwise be freed but if
it does so, it must maintain "seqid" continuity with previous values, it does so, it must maintain "seqid" continuity with previous values.
in all case in which it is required to return incrementing "seqid"
values in general.
One mechanism that may be used to satisfy the requirement that the One mechanism that may be used to satisfy the requirement that the
server recognize invalid and out-of-date stateids is for the server server recognize invalid and out-of-date stateids is for the server
to divide the "other" field of the stateid into two fields. to divide the "other" field of the stateid into two fields.
o An index into a table of locking-state structures. o An index into a table of locking-state structures.
o A generation number which is incremented on each allocation of a o A generation number which is incremented on each allocation of a
table entry for a particular use. table entry for a particular use.
skipping to change at page 146, line 43 skipping to change at page 147, line 29
delegation, directory delegation, layout). delegation, directory delegation, layout).
o The last "seqid" value returned corresponding to the current o The last "seqid" value returned corresponding to the current
"other" value. "other" value.
With this information, the following procedure would be used to With this information, the following procedure would be used to
validate an incoming stateid and return an appropriate error, when validate an incoming stateid and return an appropriate error, when
necessary: necessary:
o If the server has restarted resulting in loss of all leased state o If the server has restarted resulting in loss of all leased state
but the sessionid and clientID are still valid, return but the sessionid and client Id are still valid, return
NFS4ERR_STALE_STATEID. (If server restart has resulted in an NFS4ERR_STALE_STATEID. (If server restart has resulted in an
invalid client ID or sessionid is invalid, SEQUENCE will return an invalid client ID or sessionid is invalid, SEQUENCE will return an
error - not NFS4ERR_STATE_STATEID - and the operation that takes a error - not NFS4ERR_STALE_STATEID - and the operation that takes a
stateid as an argument will never be processed.) stateid as an argument will never be processed.)
o If the "other" field is all zeros or all ones, check that the o If the "other" field is all zeros or all ones, check that the
"other" and "seqid" match a defined combination for a special "other" and "seqid" match a defined combination for a special
stateid and that that stateid can be used in the current context. stateid and that that stateid can be used in the current context.
If not, then return NFS4ERR_BAD_STATEID. If not, then return NFS4ERR_BAD_STATEID.
o If the "seqid" field is not zero, return NFS4ERR_BAD_STATEID. o If the "seqid" field is not zero, and it is greater than the
current sequence value corresponding to the current "other" field,
return NFS4ERR_BAD_STATEID.
o If the "seqid" field is not zero, and it is less than the current
sequence value corresponding to the current "other" field, return
NFS4ERR_OLD_STATEID.
o Otherwise divide the "other" into a table index and an entry o Otherwise divide the "other" into a table index and an entry
generation. generation.
o If the table index field is outside the range of the associated o If the table index field is outside the range of the associated
table, return NFS4ERR_BAD_STATEID. table, return NFS4ERR_BAD_STATEID.
o If the selected table entry is of a different generation than that o If the selected table entry is of a different generation than that
specified in the incoming stateid, return NFS4ERR_BAD_STATEID. specified in the incoming stateid, return NFS4ERR_BAD_STATEID.
o If the selected table entry does not match the current file o If the selected table entry does not match the current file
handle, return NFS4ERR_BAD_STATEID. handle, return NFS4ERR_BAD_STATEID.
o If the client ID in the table entry does not match the client ID o If the client ID in the table entry does not match the client ID
associated with the current session, return NFS4ERR_BAD_STATEID. associated with the current session, return NFS4ERR_BAD_STATEID.
o If the stateid represents revoked state, then return
NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or NFS4ERR_DELEG_REVOKED,
as appropriate.
o If the stateid type is not valid for the context in which the o If the stateid type is not valid for the context in which the
stateid appears, return NFS4ERR_BAD_STATEID. stateid appears, return NFS4ERR_BAD_STATEID.
o Otherwise, the stateid is valid and the table entry should contain o Otherwise, the stateid is valid and the table entry should contain
any additional information about the associated set of locks, such any additional information about the type of stateid and
as open-owner and lock-owner information, as well as information information associated with that particular type of stateid, such
on the specific locks, such as open modes and octet ranges. as the associated set of locks, such as open-owner and lock-owner
information, as well as information on the specific locks, such as
8.1.4. Use of the Stateid and Locking open modes and octet ranges.
All READ, WRITE and SETATTR operations contain a stateid. For the
purposes of this section, SETATTR operations which change the size
attribute of a file are treated as if they are writing the area
between the old and new size (i.e. the range truncated or added to
the file by means of the SETATTR), even where SETATTR is not
explicitly mentioned in the text.
If the state-owner performs a READ or WRITE in a situation in which
it has established a lock or share reservation on the server (any
OPEN constitutes a share reservation) the stateid (previously
returned by the server) must be used to indicate what locks,
including both record locks and share reservations, are held by the
state-owner. If no state is established by the client, either record
lock or share reservation, a special stateid for anonymous state
(zero as "other" and "seqid") is used. Regardless of whether a stateid
for anonymous state or a stateid returned by the server is used, if
there is a conflicting share reservation or mandatory record lock
held on the file, the server MUST refuse to service the READ or WRITE
operation.
Share reservations are established by OPEN operations and by their
nature are mandatory in that when the OPEN denies READ or WRITE
operations, that denial results in such operations being rejected
with error NFS4ERR_LOCKED. Record locks may be implemented by the
server as either mandatory or advisory, or the choice of mandatory or
advisory behavior may be determined by the server on the basis of the
file being accessed (for example, some UNIX-based servers support a
"mandatory lock bit" on the mode attribute such that if set, record
locks are required on the file before I/O is possible). When record
locks are advisory, they only prevent the granting of conflicting
lock requests and have no effect on READs or WRITEs. Mandatory
record locks, however, prevent conflicting I/O operations. When they
are attempted, they are rejected with NFS4ERR_LOCKED. When the
client gets NFS4ERR_LOCKED on a file it knows it has the proper share
reservation for, it will need to issue a LOCK request on the region
of the file that includes the region the I/O was to be performed on,
with an appropriate locktype (i.e. READ*_LT for a READ operation,
WRITE*_LT for a WRITE operation).
Note that for UNIX environments that support mandatory file locking,
the distinction between advisory and mandatory locking is subtle. In
fact, advisory and mandatory record locks are exactly the same in so
far as the APIs and requirements on implementation. If the mandatory
lock attribute is set on the file, the server checks to see if the
lock-owner has an appropriate shared (read) or exclusive (write)
record lock on the region it wishes to read or write to. If there is
no appropriate lock, the server checks if there is a conflicting lock
(which can be done by attempting to acquire the conflicting lock on
the behalf of the lock-owner, and if successful, release the lock
after the READ or WRITE is done), and if there is, the server returns
NFS4ERR_LOCKED.
For Windows environments, there are no advisory record locks, so the
server always checks for record locks during I/O requests.
Thus, the NFS version 4 LOCK operation does not need to distinguish
between advisory and mandatory record locks. It is the NFS version 4
server's processing of the READ and WRITE operations that introduces
the distinction.
Every stateid with the exception of special stateid values, whether
returned by an OPEN-type operation (i.e. OPEN, OPEN_DOWNGRADE), or
by a LOCK-type operation (i.e. LOCK or LOCKU), defines an access
mode for the file (i.e. READ, WRITE, or READ-WRITE) as established
by the original OPEN which caused the allocation of the open stateid
and as modified by subsequent OPENs and OPEN_DOWNGRADEs for the same
open-owner/file pair. Stateids returned by record lock operations
imply the access mode for the open stateid associated with the lock
set represented by the stateid. Delegation stateids have an access
mode based on the type of delegation. When a READ, WRITE, or SETATTR
which specifies the size attribute, is done, the operation is subject
to checking against the access mode to verify that the operation is
appropriate given the OPEN with which the operation is associated.
In the case of WRITE-type operations (i.e. WRITEs and SETATTRs which
set size), the server must verify that the access mode allows writing
and return an NFS4ERR_OPENMODE error if it does not. In the case of
READ, the server may perform the corresponding check on the access
mode, or it may choose to allow READ on opens for WRITE only, to
accommodate clients whose write implementation may unavoidably do
reads (e.g. due to buffer cache constraints). However, even if READs
are allowed in these circumstances, the server MUST still check for
locks that conflict with the READ (e.g. another open specifying denial
of READs). Note that a server which does enforce the access mode
check on READs need not explicitly check for conflicting share
reservations since the existence of OPEN for read access guarantees
that no conflicting share reservation can exist.
The read bypass special stateid (all bits of "other" and "seqid" set
to one) indicates a desire to bypass locking checks. The
server MAY allow READ operations to bypass locking checks at the
server, when this special stateid is used. However, WRITE operations
with this special stateid value MUST NOT bypass locking checks and
are treated exactly the same as if a special stateid for anonymous
state were used.
A lock may not be granted while a READ or WRITE operation using one
of the special stateids is being performed and the range of the lock
request conflicts with the range of the READ or WRITE operation. For
the purposes of this paragraph, a conflict occurs when a shared lock
is requested and a WRITE operation is being performed, or an
exclusive lock is requested and either a READ or a WRITE operation is
being performed. A SETATTR that sets size is treated similarly to a
WRITE as discussed above.
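The access mode check described in this section for READ and for WRITE-type operations might be sketched as follows. The names ACCESS_READ, ACCESS_WRITE, and check_access_mode are hypothetical, as is the allow_read_on_write_only policy flag, which models the server's option to permit READ on opens for WRITE only. The use of NFS4ERR_OPENMODE for a failed READ check is an assumption based on the "corresponding check" wording. Conflicting share reservations and record locks must still be checked separately.

```python
# Illustrative access mode check against the mode implied by the stateid.
ACCESS_READ = 1    # hypothetical bit for read access
ACCESS_WRITE = 2   # hypothetical bit for write access


def check_access_mode(op: str, access: int,
                      allow_read_on_write_only: bool = False) -> str:
    """op is "READ", "WRITE", or "SETATTR_SIZE" (a SETATTR that sets size)."""
    if op in ("WRITE", "SETATTR_SIZE"):
        # WRITE-type operations always require write access.
        return "NFS4_OK" if access & ACCESS_WRITE else "NFS4ERR_OPENMODE"
    if op == "READ":
        # The server may enforce the check, or allow READ on write-only
        # opens to accommodate clients whose writes require reads.
        if access & ACCESS_READ or allow_read_on_write_only:
            return "NFS4_OK"
        return "NFS4ERR_OPENMODE"
    raise ValueError("unsupported operation: " + op)
```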
8.2. Lock Ranges
The protocol allows a lock owner to request a lock with an octet
range and then either upgrade, downgrade, or unlock a sub-range of
the initial lock. It is expected that this will be an uncommon type
of request. In any case, servers or server file systems may not be
able to support sub-range lock semantics. In the event that a server
receives a locking request that represents a sub-range of current
locking state for the lock owner, the server is allowed to return the
error NFS4ERR_LOCK_RANGE to signify that it does not support sub-
range lock operations. Therefore, the client should be prepared to
receive this error and, if appropriate, report the error to the
requesting application.
The client is discouraged from combining multiple independent locking
ranges that happen to be adjacent into a single request since the
server may not support sub-range requests and for reasons related to
the recovery of file locking state in the event of server failure.
As discussed in the section "Server Failure and Recovery" below, the
server may employ certain optimizations during recovery that work
effectively only when the client's behavior during lock recovery is
similar to the client's locking behavior prior to server failure.
8.3. Upgrading and Downgrading Locks
If a client has a write lock on a record, it can request an atomic
downgrade of the lock to a read lock via the LOCK request, by setting
the type to READ_LT. If the server supports atomic downgrade, the
request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP.
The client should be prepared to receive this error, and if
appropriate, report the error to the requesting application.
If a client has a read lock on a record, it can request an atomic
upgrade of the lock to a write lock via the LOCK request by setting
the type to WRITE_LT or WRITEW_LT. If the server does not support
atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade
can be achieved without an existing conflict, the request will
succeed. Otherwise, the server will return either NFS4ERR_DENIED or
NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the
client issued the LOCK request with the type set to WRITEW_LT and the
server has detected a deadlock. The client should be prepared to
receive such errors and if appropriate, report the error to the
requesting application.
8.4. Blocking Locks
Some clients require the support of blocking locks. NFSv4.1 does not
provide a callback when a previously unavailable lock becomes
available. Clients thus have no choice but to continually poll for
the lock. This presents a fairness problem. Two new lock types are
added, READW and WRITEW, and are used to indicate to the server that
the client is requesting a blocking lock. The server should maintain
an ordered list of pending blocking locks. When the conflicting lock
is released, the server may wait the lease period for the first
waiting client to re-request the lock. After the lease period
expires the next waiting client request is allowed the lock. Clients
are required to poll at an interval sufficiently small that they are
likely to acquire the lock in a timely manner. The server is not
required to maintain a list of pending blocked locks as it is used to
increase fairness and not correct operation. Because of the
unordered nature of crash recovery, storing of lock state to stable
storage would be required to guarantee ordered granting of blocking
locks.
Servers may also note the lock types and delay returning denial of
the request to allow extra time for a conflicting lock to be
released, allowing a successful return. In this way, clients can
avoid the burden of needlessly frequent polling for blocking locks.
The server should take care in the length of delay in the event the
client retransmits the request.
If a server receives a blocking lock request, denies it, and then Note that a stateid may be valid in general, as would be reported by
later receives a nonblocking request for the same lock, which is also the TEST_STATEID operation, but be invalid for a particular
denied, then it should remove the lock in question from its list of operation, as, for example, when a stateid which doesn't represent
pending blocking locks. Clients should use such a nonblocking byte-range locks is passed to non-from_open case of LOCK or to LOCKU,
request to indicate to the server that this is the last time they or when a stateid which does not represent an open is passed to CLOSE
intend to poll for the lock, as may happen when the process or OPEN_DOWNGRADE. In such cases, the server MUST return
requesting the lock is interrupted. This is a courtesy to the NFS4ERR_BAD_STATEID.
server, to prevent it from unnecessarily waiting a lease period
before granting other lock requests. However, clients are not
required to perform this courtesy, and servers must not depend on
them doing so. Also, clients must be prepared for the possibility
that this final locking request will be accepted.
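The stateid validation procedure given in "Stateid Lifetime and Validation" can be sketched as a table lookup. Everything named here is hypothetical: the Entry fields, the split of "other" into an 8-byte index and a 4-byte generation, and the server_restarted flag standing in for whatever restart detection a server uses. Special stateids are assumed to have been handled before this function is called. The seqid comparison is done after the table entry is located, since the current seqid is kept in the entry.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

BAD = "NFS4ERR_BAD_STATEID"
OLD = "NFS4ERR_OLD_STATEID"


@dataclass
class Entry:
    generation: int                      # bumped on each reuse of the slot
    filehandle: bytes
    client_id: int
    kind: str                            # "open", "lock", "delegation", ...
    seqid: int                           # last seqid returned
    revoked_error: Optional[str] = None  # e.g. "NFS4ERR_ADMIN_REVOKED"


def make_other(index: int, generation: int) -> bytes:
    # Assumed layout: 8-byte table index followed by 4-byte generation.
    return index.to_bytes(8, "big") + generation.to_bytes(4, "big")


def validate_stateid(table: List[Optional[Entry]], other: bytes, seqid: int,
                     current_fh: bytes, client_id: int,
                     allowed_kinds: Set[str],
                     server_restarted: bool = False) -> str:
    if server_restarted:
        return "NFS4ERR_STALE_STATEID"
    if other in (bytes(12), b"\xff" * 12):
        raise ValueError("special stateids are handled separately")
    index = int.from_bytes(other[:8], "big")
    generation = int.from_bytes(other[8:], "big")
    if index >= len(table) or table[index] is None:
        return BAD                       # index outside the table
    entry = table[index]
    if entry.generation != generation:
        return BAD                       # slot was reallocated
    if seqid != 0 and seqid > entry.seqid:
        return BAD                       # seqid never issued
    if seqid != 0 and seqid < entry.seqid:
        return OLD                       # superseded instance
    if entry.filehandle != current_fh:
        return BAD
    if entry.client_id != client_id:
        return BAD
    if entry.revoked_error is not None:
        return entry.revoked_error
    if entry.kind not in allowed_kinds:
        return BAD                       # wrong type for this operation
    return "NFS4_OK"
```

The allowed_kinds argument models the last check: a stateid may be valid in general yet unusable for a particular operation, such as a lock stateid passed to CLOSE.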
8.5. Lease Renewal 8.3. Lease Renewal
The purpose of a lease is to allow a server to remove stale locks The purpose of a lease is to allow a server to remove stale locking-
that are held by a client that has crashed or is otherwise related objects that are held by a client that has crashed or is
unreachable. It is not a mechanism for cache consistency and lease otherwise unreachable. It is not a mechanism for cache consistency
renewals may not be denied if the lease interval has not expired. and lease renewals may not be denied if the lease interval has not
expired.
Since each session is associated with a specific client, any Since each session is associated with a specific client, any
operation issued on that session is an indication that the associated operation issued on that session is an indication that the associated
client is reachable. When a request is issued for a given session, client is reachable. When a request is issued for a given session,
successful execution of a SEQUENCE operation (or successful retrieval successful execution of a SEQUENCE operation (or successful retrieval
of the result of SEQUENCE from the reply cache) will result in all of the result of SEQUENCE from the reply cache) will result in all
leases for the associated client to be implicitly renewed. This leases for the associated client to be implicitly renewed. In
approach allows for low overhead lease renewal which scales well. In addition, whenever a new stateid is created or updated (i.e. returned
the typical case no extra RPC calls are required for lease renewal with a new seqid value), all leases for the associated client are also
and in the worst case one RPC is required every lease period, via a renewed. This approach allows for low overhead lease renewal which
COMPOUND that consists solely of a single SEQUENCE operation. The scales well. In the typical case no extra RPC calls are required for
number of locks held by the client is not a factor since all state lease renewal and in the worst case one RPC is required every lease
for the client is involved with the lease renewal action. period, via a COMPOUND that consists solely of a single SEQUENCE
operation. The number of locks held by the client is not a factor
since all state for the client is involved with the lease renewal
action.
Since all operations that create a new lease also renew existing Since all operations that create a new lease also renew existing
leases, the server must maintain a common lease expiration time for leases, the server must maintain a common lease expiration time for
all valid leases for a given client. This lease time can then be all valid leases for a given client. This lease time can then be
easily updated upon implicit lease renewal actions. easily updated upon implicit lease renewal actions.
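A common lease expiration time per client might be kept as in the following sketch. The class name ClientLease and its methods are hypothetical, and time is passed in explicitly so the example is deterministic. A real server would call renew on each successful SEQUENCE for any session of the client.

```python
# Illustrative per-client lease: one expiration time covers all state.
class ClientLease:
    def __init__(self, lease_period: float, now: float) -> None:
        self.lease_period = lease_period
        self.expires_at = now + lease_period

    def renew(self, now: float) -> None:
        """Implicit renewal: a single update covers every lock, open,
        delegation, and layout held by the client."""
        self.expires_at = now + self.lease_period

    def expired(self, now: float) -> bool:
        return now >= self.expires_at
```

Because only one timestamp is updated, the cost of renewal does not depend on the number of locks the client holds.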
8.6. Crash Recovery 8.4. Crash Recovery
The important requirement in crash recovery is that both the client The important requirement in crash recovery is that both the client
and the server know when the other has failed. Additionally, it is and the server know when the other has failed. Additionally, it is
required that a client sees a consistent view of data across server required that a client sees a consistent view of data across server
restarts or reboots. All READ and WRITE operations that may have restarts or reboots. All READ and WRITE operations that may have
been queued within the client or network buffers must wait until the been queued within the client or network buffers must wait until the
client has successfully recovered the locks protecting the READ and client has successfully recovered the locks protecting the READ and
WRITE operations. WRITE operations. Any that reach the server before it can safely
determine that it has re-established enough locking state to be sure
that such requests can be safely processed must be rejected, either
because the state presented is no longer valid
(NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID) or because
subsequent recovery of locks may make execution of the operation
inappropriate (NFS4ERR_GRACE).
8.6.1. Client Failure and Recovery 8.4.1. Client Failure and Recovery
In the event that a client fails, the server may release the client's In the event that a client fails, the server may release the client's
locks when the associated leases have expired. Conflicting locks locks when the associated leases have expired. Conflicting locks
from another client may only be granted after this lease expiration. from another client may only be granted after this lease expiration.
When a client has not failed and re-establishes his lease before When a client has not failed and re-establishes his lease before
expiration occurs, requests for conflicting locks will not be expiration occurs, requests for conflicting locks will not be
granted. granted.
To minimize client delay upon restart, lock requests are associated To minimize client delay upon restart, lock requests are associated
with an instance of the client by a client supplied verifier. This with an instance of the client by a client-supplied verifier. This
verifier is part of the initial EXCHANGE_ID call made by the client. verifier is part of the client_owner4 sent in the initial EXCHANGE_ID
The server returns a client ID as a result of the EXCHANGE_ID call made by the client. The server returns a client ID as a result
operation. The client then confirms the use of the client ID by of the EXCHANGE_ID operation. The client then confirms the use of
establishing a session associated with that client ID. All locks, the client ID by establishing a session associated with that client
including opens, record locks, delegations, and layout obtained by ID. All locks, including opens, record locks, delegations, and
sessions using that client ID are associated with that client ID. layout obtained by sessions using that client ID are associated with
that client ID.
Since the verifier will be changed by the client upon each Since the verifier will be changed by the client upon each
initialization, the server can compare a new verifier to the verifier initialization, the server can compare a new verifier to the verifier
associated with currently held locks and determine that they do not associated with currently held locks and determine that they do not
match. This signifies the client's new instantiation and subsequent match. This signifies the client's new instantiation and subsequent
loss of locking state. As a result, the server is free to release loss of locking state. As a result, the server is free to release
all locks held which are associated with the old client ID which was all locks held which are associated with the old client ID which was
derived from the old verifier. At this point conflicting locks from derived from the old verifier. At this point conflicting locks from
other clients, kept waiting while the leaser had not yet expired, can other clients, kept waiting while the lease had not yet expired, can
be granted. be granted.
Note that the verifier must have the same uniqueness properties of Note that the verifier must have the same uniqueness properties of
the verifier for the COMMIT operation. the verifier for the COMMIT operation.
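The verifier comparison described above can be sketched as follows. This is an illustrative simplification, not the protocol's actual state machine: the class name SketchServer, the ClientRecord fields, and the string lock labels are hypothetical, and the sketch discards the old state immediately rather than waiting for the new client ID to be confirmed by session creation.

```python
from dataclasses import dataclass, field


@dataclass
class ClientRecord:
    client_id: int
    verifier: bytes
    locks: set = field(default_factory=set)


class SketchServer:
    """Hypothetical server-side bookkeeping keyed by the client's id string."""

    def __init__(self):
        self._next_id = 1
        self.by_owner = {}  # client id string -> ClientRecord

    def exchange_id(self, owner_id, verifier):
        """Return (client_id, released_locks) for an EXCHANGE_ID-like call."""
        rec = self.by_owner.get(owner_id)
        if rec is not None and rec.verifier == verifier:
            # Same client instance: keep the client ID and its locks.
            return rec.client_id, set()
        # New client, or a new instantiation of a known client: locks tied
        # to the old verifier may be released so waiting requests can proceed.
        released = rec.locks if rec is not None else set()
        new_rec = ClientRecord(self._next_id, verifier)
        self._next_id += 1
        self.by_owner[owner_id] = new_rec
        return new_rec.client_id, released
```

A call with an unchanged verifier returns the existing client ID and releases nothing; a call with a changed verifier returns a fresh client ID together with the set of locks that became releasable.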
8.6.2. Server Failure and Recovery 8.4.2. Server Failure and Recovery
If the server loses locking state (usually as a result of a restart If the server loses locking state (usually as a result of a restart
or reboot), it must allow clients time to discover this fact and re- or reboot), it must allow clients time to discover this fact and re-
establish the lost locking state. The client must be able to re- establish the lost locking state. The client must be able to re-
establish the locking state without having the server deny valid establish the locking state without having the server deny valid
requests because the server has granted conflicting access to another requests because the server has granted conflicting access to another
client. Likewise, if there is a possibility that clients have not client. Likewise, if there is a possibility that clients have not
yet re-established their locking state for a file, the server must yet re-established their locking state for a file, and that such
disallow READ and WRITE operations for that file. locking state might make it invalid to perform READ or WRITE
operations, for example through the establishment of mandatory locks,
the server must disallow READ and WRITE operations for that file.
A client can determine that loss of locking state has occurred via A client can determine that loss of locking state has occurred via
several methods. several methods.
1. When a SEQUENCE succeeds, but sr_status_flags in the reply to 1. When a SEQUENCE succeeds, but sr_status_flags in the reply to
SEQUENCE indicates SEQ4_STATUS_RESTART_RECLAIM_NEEDED (see SEQUENCE indicates SEQ4_STATUS_RESTART_RECLAIM_NEEDED (see
Section 17.46.4), this indicates client's client ID and session Section 18.46.4), this indicates client's client ID and session
are valid (have persisted through server restart) and the client are valid (have persisted through server restart) and the client
can now re-establish its lock state (Section 8.6.2.1). can now re-establish its lock state (Section 8.4.2.1).
2. When an operation returns NFS4ERR_STALE_STATEID, this indicates a 2. When an operation returns NFS4ERR_STALE_STATEID, this indicates a
stateid invalidated by a server reboot or restart. Since the stateid invalidated by a server reboot or restart. Since the
operation that returned NFS4ERR_STALE_STATEID MUST have been operation that returned NFS4ERR_STALE_STATEID MUST have been
preceded by SEQUENCE, and SEQUENCE did not return an error, this preceded by SEQUENCE, and SEQUENCE did not return an error, this
means the client ID and session are valid. The client can now means the client ID and session are valid. The client can now
re-establish its lock state as described in Section 8.6.2.1. Note re-establish its lock state as described in Section 8.4.2.1. Note
that the server should (MUST) have set that the server should (MUST) have set
SEQ4_STATUS_RESTART_RECLAIM_NEEDED in the sr_status_flags of the SEQ4_STATUS_RESTART_RECLAIM_NEEDED in the sr_status_flags of the
results of the SEQUENCE operation, and thus this situation should results of the SEQUENCE operation, and thus this situation should
be the same as that described above. be the same as that described above.
3. When a SEQUENCE operation returns NFS4ERR_STALE_CLIENTID, this 3. When a SEQUENCE operation returns NFS4ERR_STALE_CLIENTID, this
means both sessionid SEQUENCE refers to (field sa_sessionid) and means both sessionid SEQUENCE refers to (field sa_sessionid) and
the implied client ID are now invalid, where the client ID was the implied client ID are now invalid, where the client ID was
invalidated by server reboot or restart or by lease expiration. invalidated by server reboot or restart or by lease expiration.
When SEQUENCE returns NFS4ERR_STALE_CLIENTID, the client must When SEQUENCE returns NFS4ERR_STALE_CLIENTID, the client must
establish a new client ID (see Section 8.1.1) and re-establish establish a new client ID (see Section 8.1) and re-establish its
its lock state (Section 8.6.2.1). lock state (Section 8.4.2.1).
4. When a SEQUENCE operation returns NFS4ERR_BADSESSION, this may 4. When a SEQUENCE operation returns NFS4ERR_BADSESSION, this may
mean the session has been destroyed, but the client ID is still mean the session has been destroyed, but the client ID is still
valid. The client issues a CREATE_SESSION request with the valid. The client issues a CREATE_SESSION request with the
client ID to re-establish the session. If CREATE_SESSION fails client ID to re-establish the session. If CREATE_SESSION fails
with NFS4ERR_STALE_CLIENTID, the client must establish a new with NFS4ERR_STALE_CLIENTID, the client must establish a new
client ID (see Section 8.1.1) and re-establish its lock state client ID (see Section 8.1) and re-establish its lock state
(Section 8.6.2.1). If CREATE_SESSION succeeds, the client must (Section 8.4.2.1). If CREATE_SESSION succeeds, the client must
then re-establish its lock state (Section 8.6.2.1). then re-establish its lock state (Section 8.4.2.1).
5. When an operation, neither SEQUENCE nor preceded by SEQUENCE (for 5. When an operation, neither SEQUENCE nor preceded by SEQUENCE (for
example, CREATE_SESSION, DESTROY_SESSION) returns example, CREATE_SESSION, DESTROY_SESSION) returns
NFS4ERR_STALE_CLIENTID. The client MUST establish a new client NFS4ERR_STALE_CLIENTID. The client MUST establish a new client
ID (Section 8.1.1) and re-establish its lock state ID (Section 8.1) and re-establish its lock state
(Section 8.6.2.1). (Section 8.4.2.1).
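The five detection cases above can be summarized as a small decision function. This is a sketch under stated assumptions: the function name recovery_steps, its parameters, and the string step labels are hypothetical, and the CREATE_SESSION step after EXCHANGE_ID is taken from the State Reclaim text ("Once a session is established using the new client ID") rather than from the list itself.

```python
RECLAIM = "reclaim locks"
NEW_CLIENT_ID = ["EXCHANGE_ID", "CREATE_SESSION", RECLAIM]


def recovery_steps(op, status, status_flags=frozenset(),
                   create_session_status="NFS4_OK"):
    """Return the ordered recovery steps a client would take (sketch)."""
    if status == "NFS4ERR_STALE_CLIENTID":
        # Cases 3 and 5: the client ID is gone, so start from scratch.
        return list(NEW_CLIENT_ID)
    if status == "NFS4ERR_STALE_STATEID":
        # Case 2: SEQUENCE succeeded, so client ID and session are valid.
        return [RECLAIM]
    if op == "SEQUENCE" and status == "NFS4ERR_BADSESSION":
        # Case 4: try to rebuild the session on the existing client ID.
        if create_session_status == "NFS4ERR_STALE_CLIENTID":
            return ["CREATE_SESSION"] + NEW_CLIENT_ID
        return ["CREATE_SESSION", RECLAIM]
    if (op == "SEQUENCE" and status == "NFS4_OK"
            and "SEQ4_STATUS_RESTART_RECLAIM_NEEDED" in status_flags):
        # Case 1: client ID and session persisted through the restart.
        return [RECLAIM]
    return []  # no loss of locking state indicated
```

In every branch the final step is lock reclaim; the cases differ only in how much of the client ID and session must be rebuilt first.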
8.6.2.1. State Reclaim 8.4.2.1. State Reclaim
When state information and the associated locks are lost as a result
of a server reboot, the protocol must provide a way to cause that
state to be re-established. The approach used is to define, for most
types of locking state (layouts are an exception), a request whose
function is to allow the client to re-establish on the server a lock
first gotten on a previous instance. Generally these requests are
variants of the requests normally used to create locks of that type
and are referred to as "reclaim-type" requests and the process of re-
establishing such locks is referred to as "reclaiming" them.
Because each client must have an opportunity to reclaim all of the
locks that it has without the possibility that some other client will
be granted a conflicting lock, a special period called the "grace
period" is devoted to the reclaim process. During this period, only
reclaim-type locking requests are allowed, unless the server is able
to reliably determine (through state persistently maintained across
reboot instances), that granting any such lock cannot possibly
conflict with a subsequent reclaim. When a request is made to obtain
a new lock (i.e. not a reclaim-type request) during the grace period
and such a determination cannot be made, the server must return the
error NFS4ERR_GRACE.
Once a session is established using the new client ID, the client Once a session is established using the new client ID, the client
will use reclaim-type locking requests (i.e. LOCK requests with will use reclaim-type locking requests (e.g. LOCK requests with
reclaim set to true and OPEN operations with a claim type of reclaim set to true and OPEN operations with a claim type of
CLAIM_PREVIOUS) to re-establish its locking state. Once this is CLAIM_PREVIOUS. See Section 9.8) to re-establish its locking state.
done, or if there is no such locking state to reclaim, the client
does a RECLAIM_COMPLETE operation to indicate that it has reclaimed Once this is done, or if there is no such locking state to reclaim,
all of the locking state that it will reclaim. Once a client does a the client does a global RECLAIM_COMPLETE operation, i.e. one with
RECLAIM_COMPLETE operation, it may attempt non-reclaim locking the one_fs argument set to false, to indicate that it has reclaimed
all of the locking state that it will reclaim. Once a client does
such a RECLAIM_COMPLETE operation, it may attempt non-reclaim locking
operations, although it may get NFS4ERR_GRACE errors on these until operations, although it may get NFS4ERR_GRACE errors on these until
the period of special handling is over. the period of special handling is over. See Section 11.6.7 for a
discussion of the analogous handling of lock reclamation in the case of
filesystems transitioning from server to server.
Note that if the client ID persisted through a server reboot (which Note that if the client ID persisted through a server reboot (which
will be self-evident if the client never received a will be self-evident if the client never received a
NFS4ERR_STALE_CLIENTID error, and instead got NFS4ERR_STALE_CLIENTID error, and instead got
SEQ4_STATUS_RESTART_RECLAIM_NEEDED status from SEQUENCE SEQ4_STATUS_RESTART_RECLAIM_NEEDED status from SEQUENCE
(Section 17.46.4), no client ID was re-established. For reasons (Section 18.46.4), no client ID was re-established. See Paragraph 2
described in Section 17.46.5, OPEN reclaims that perform upgrades can of Section 9.8 for discussion of some restrictions on use of upgrade
cause the client and server to not have the same view of open state. semantics in connection with reclaim that are the result of some
Therefore, the client MUST NOT perform an OPEN reclaim that is also issues that apply to this situation.
an OPEN upgrade (Section 8.10) unless the client precedes the OPEN
upgrade/reclaim with a TEST_STATEID operation in the same COMPOUND.
The stateid used in TEST_STATEID will be that returned by the reclaim
OPEN the OPEN upgrade/reclaim is upgrading the open state from.
Alternatively, the client can avoid OPEN upgrade during the reclaim
phase.
The period of special handling of locking and READs and WRITEs, is During the grace period, the server must reject READ and WRITE
referred to as the "grace period". During the grace period, clients
recover locks and the associated state using reclaim-type locking
requests. During this period, the server must reject READ and WRITE
operations and non-reclaim locking requests (i.e. other LOCK and OPEN operations and non-reclaim locking requests (i.e. other LOCK and OPEN
operations) with an error of NFS4ERR_GRACE, unless it is able to operations) with an error of NFS4ERR_GRACE, unless it is able to
guarantee that these may be done safely, as described below. guarantee that these may be done safely, as described below.
The grace period may last until all clients who are known to possibly The grace period may last until all clients who are known to possibly
have had locks have done a RECLAIM_COMPLETE operation, indicating have had locks have done a global RECLAIM_COMPLETE operation,
that they have finished reclaiming the locks they held before the indicating that they have finished reclaiming the locks they held
server reboot. The server is assumed to maintain in stable storage a before the server reboot. The server is assumed to maintain in
list of clients who may have such locks. The server may also stable storage a list of clients who may have such locks. The server
terminate the grace period before all clients have done may also terminate the grace period before all clients have done a
RECLAIM_COMPLETE. The server SHOULD NOT terminate the grace period global RECLAIM_COMPLETE. The server SHOULD NOT terminate the grace
before a time equal to the lease period in order to give clients an period before a time equal to the lease period in order to give
opportunity to find out about the server reboot. Some additional clients an opportunity to find out about the server reboot. Some
time in order to allow time to establish a new client ID and session additional time in order to allow time to establish a new client ID
and to effect lock reclaims may be added. and session and to effect lock reclaims may be added. Note that
analogous rules apply to filesystem-specific grace periods discussed
in Section 11.6.7.
If the server can reliably determine that granting a non-reclaim If the server can reliably determine that granting a non-reclaim
request will not conflict with reclamation of locks by other clients, request will not conflict with reclamation of locks by other clients,
the NFS4ERR_GRACE error does not have to be returned even within the the NFS4ERR_GRACE error does not have to be returned even within the
grace period, although NFS4ERR_GRACE must always be returned to grace period, although NFS4ERR_GRACE must always be returned to
clients attempting a non-reclaim lock request before doing their own clients attempting a non-reclaim lock request before doing their own
RECLAIM_COMPLETE. For the server to be able to service READ and global RECLAIM_COMPLETE. For the server to be able to service READ
WRITE operations during the grace period, it must again be able to and WRITE operations during the grace period, it must again be able
guarantee that no possible conflict could arise between a potential to guarantee that no possible conflict could arise between a
reclaim locking request and the READ or WRITE operation. If the potential reclaim locking request and the READ or WRITE operation.
server is unable to offer that guarantee, the NFS4ERR_GRACE error If the server is unable to offer that guarantee, the NFS4ERR_GRACE
must be returned to the client. error must be returned to the client.
For a server to provide simple, valid handling during the grace For a server to provide simple, valid handling during the grace
period, the easiest method is to simply reject all non-reclaim period, the easiest method is to simply reject all non-reclaim
locking requests and READ and WRITE operations by returning the locking requests and READ and WRITE operations by returning the
NFS4ERR_GRACE error. However, a server may keep information about NFS4ERR_GRACE error. However, a server may keep information about
granted locks in stable storage. With this information, the server granted locks in stable storage. With this information, the server
could determine if a regular lock or READ or WRITE operation can be could determine if a regular lock or READ or WRITE operation can be
safely processed. safely processed.
For example, if the server maintained on stable storage summary For example, if the server maintained on stable storage summary
information on whether mandatory locks exist, either mandatory record information on whether mandatory locks exist, either mandatory record
locks, or share reservations specifying deny modes, many requests locks, or share reservations specifying deny modes, many requests
could be allowed during the grace period. If it is known that no could be allowed during the grace period. If it is known that no
such share reservations exist, OPEN requests that do not specify deny such share reservations exist, OPEN requests that do not specify deny
modes may be safely granted. If, in addition, it is known that no modes may be safely granted. If, in addition, it is known that no
mandatory record locks exist, either through information stored on mandatory record locks exist, either through information stored on
stable storage or simply because the server does not support such stable storage or simply because the server does not support such
locks, READ and WRITE requests may be safely processed during the locks, READ and WRITE requests may be safely processed during the
grace period. grace period.
Another important case is where it is known that no mandatory byte-
range locks exist, either because the server does not provide support
for them, or because their absence is known from persistently
recorded data. In this case, READ and WRITE operations specifying
stateids derived from reclaim-type operations may be validly processed
during the grace period because the fact of the valid reclaim ensures
that no lock subsequently granted can prevent the IO.
To reiterate, for a server that allows non-reclaim lock and I/O To reiterate, for a server that allows non-reclaim lock and I/O
requests to be processed during the grace period, it MUST determine requests to be processed during the grace period, it MUST determine
that no lock subsequently reclaimed will be rejected and that no lock that no lock subsequently reclaimed will be rejected and that no lock
subsequently reclaimed would have prevented any I/O operation subsequently reclaimed would have prevented any I/O operation
processed during the grace period. processed during the grace period.
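The grace-period rules discussed above might be sketched as a single server-side check. The function name grace_status and all of its keyword parameters are hypothetical; the two "no_..." flags stand for facts the server knows reliably, for example from stable storage. Non-reclaim LOCK requests are always rejected here, which is a conservative choice of this sketch rather than a requirement stated in the text.

```python
def grace_status(op, *, reclaim=False, reclaim_complete_done=False,
                 deny=False, no_deny_reservations=False,
                 no_mandatory_locks=False, stateid_from_reclaim=False):
    """Status a server could return for a request during the grace period."""
    if op in ("OPEN", "LOCK"):
        if reclaim:
            return "NFS4_OK"  # reclaim-type requests are always eligible
        if not reclaim_complete_done:
            # Client has not yet done its own global RECLAIM_COMPLETE.
            return "NFS4ERR_GRACE"
        if op == "OPEN" and not deny and no_deny_reservations:
            return "NFS4_OK"  # cannot conflict with a later reclaim
        return "NFS4ERR_GRACE"
    if op in ("READ", "WRITE"):
        # I/O is safe only if no mandatory record lock can appear, and
        # either no deny-mode reservation exists or the stateid itself
        # came from a successful reclaim.
        if no_mandatory_locks and (no_deny_reservations or stateid_from_reclaim):
            return "NFS4_OK"
        return "NFS4ERR_GRACE"
    raise ValueError("unsupported operation: " + op)
```

With every flag left at its default, the function reduces to the simple strategy of rejecting all non-reclaim locking requests and all READ and WRITE operations with NFS4ERR_GRACE.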
Clients should be prepared for the return of NFS4ERR_GRACE errors for Clients should be prepared for the return of NFS4ERR_GRACE errors for
non-reclaim lock and I/O requests. In this case the client should non-reclaim lock and I/O requests. In this case the client should
employ a retry mechanism for the request. A delay (on the order of employ a retry mechanism for the request. A delay (on the order of
several seconds) between retries should be used to avoid overwhelming several seconds) between retries should be used to avoid overwhelming
skipping to change at page 156, line 15 skipping to change at page 154, line 14
A server may, upon restart, establish a new value for the lease A server may, upon restart, establish a new value for the lease
period. Therefore, clients should, once a new client ID is period. Therefore, clients should, once a new client ID is
established, refetch the lease_time attribute and use it as the basis established, refetch the lease_time attribute and use it as the basis
for lease renewal for the lease associated with that server. for lease renewal for the lease associated with that server.
However, the server must establish, for this restart event, a grace However, the server must establish, for this restart event, a grace
period at least as long as the lease period for the previous server period at least as long as the lease period for the previous server
instantiation. This allows the client state obtained during the instantiation. This allows the client state obtained during the
previous server instance to be reliably re-established. previous server instance to be reliably re-established.
8.6.3. Network Partitions and Recovery 8.4.3. Network Partitions and Recovery
If the duration of a network partition is greater than the lease If the duration of a network partition is greater than the lease
period provided by the server, the server will have not received a period provided by the server, the server will have not received a
lease renewal from the client. If this occurs, the server may free lease renewal from the client. If this occurs, the server may free
all locks held for the client, or it may allow the lock state to all locks held for the client, or it may allow the lock state to
remain for a considerable period, subject to the constraint that if a remain for a considerable period, subject to the constraint that if a
request for a conflicting lock is made, locks associated with expired request for a conflicting lock is made, locks associated with expired
leases do not prevent such a conflicting lock from being granted but leases do not prevent such a conflicting lock from being granted but
are revoked as necessary so as not to interfere with such conflicting are revoked as necessary so as not to interfere with such conflicting
requests. requests.
skipping to change at page 159, line 4 skipping to change at page 156, line 49
edge conditions will record in stable storage every lock that is edge conditions will record in stable storage every lock that is
acquired, removing the lock record from stable storage only when the acquired, removing the lock record from stable storage only when the
lock is released. For the two edge conditions discussed above, the lock is released. For the two edge conditions discussed above, the
harshest a server can be, and still support a grace period for harshest a server can be, and still support a grace period for
reclaims, requires that the server record in stable storage reclaims, requires that the server record in stable storage
some minimal information. For example, a server some minimal information. For example, a server
implementation could, for each client, save in stable storage a implementation could, for each client, save in stable storage a
record containing: record containing:
o the client's id string o the client's id string
o a boolean that indicates if the client's lease expired or if there o a boolean that indicates if the client's lease expired or if there
was administrative intervention (see Section 8.7) to revoke a was administrative intervention (see Section 8.5) to revoke a
record lock, share reservation, or delegation and there has been record lock, share reservation, or delegation and there has been
no acknowledgement (via FREE_STATEID) of such revocation. no acknowledgement (via FREE_STATEID) of such revocation.
o a boolean that indicates whether the client may have locks that it o a boolean that indicates whether the client may have locks that it
believes to be reclaimable in situations in which the grace period believes to be reclaimable in situations in which the grace period
was terminated, making the server's view of lock reclaimability was terminated, making the server's view of lock reclaimability
suspect. The server will set this for any client record in stable suspect. The server will set this for any client record in stable
storage where the client has not done a RECLAIM_COMPLETE, before storage where the client has not done a suitable RECLAIM_COMPLETE
it grants any new (i.e. not reclaimed) lock to any client. (global or filesystem-specific depending on the target of the lock
request) before it grants any new (i.e. not reclaimed) lock to any
client.
Assuming the above record keeping, for the first edge condition, Assuming the above record keeping, for the first edge condition,
after the server reboots, the record that client A's lease expired after the server reboots, the record that client A's lease expired
means that another client could have acquired a conflicting record means that another client could have acquired a conflicting record
lock, share reservation, or delegation. Hence the server must reject lock, share reservation, or delegation. Hence the server must reject
a reclaim from client A with the error NFS4ERR_NO_GRACE. a reclaim from client A with the error NFS4ERR_NO_GRACE.
For the second edge condition, after the server reboots for a second For the second edge condition, after the server reboots for a second
time, the indication that the client had not completed its reclaims time, the indication that the client had not completed its reclaims
at the time at which the grace period ended means that the server at the time at which the grace period ended means that the server
must reject a reclaim from client A with the error NFS4ERR_NO_GRACE. must reject a reclaim from client A with the error NFS4ERR_NO_GRACE.
When either edge condition occurs, the client's attempt to reclaim When either edge condition occurs, the client's attempt to reclaim
locks will result in the error NFS4ERR_NO_GRACE. When this is locks will result in the error NFS4ERR_NO_GRACE. When this is
received, or after the client reboots with no lock state, the client received, or after the client reboots with no lock state, the client
will issue a RECLAIM_COMPLETE. When the RECLAIM_COMPLETE is will issue a global RECLAIM_COMPLETE. When the RECLAIM_COMPLETE is
received, the server and client are again in agreement regarding received, the server and client are again in agreement regarding
reclaimable locks and both booleans in persistent storage can be reclaimable locks and both booleans in persistent storage can be
reset, to be set again only when there is a subsequent event that reset, to be set again only when there is a subsequent event that
causes lock reclaim operations to be questionable. causes lock reclaim operations to be questionable.
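The record keeping and the handling of the two edge conditions can be sketched as follows. The class and function names are hypothetical, and the reclaim_complete_done field is an assumption of this sketch: it stands for per-server-instance knowledge of whether the client has done its RECLAIM_COMPLETE, which the text does not list as part of the stable-storage record.

```python
from dataclasses import dataclass


@dataclass
class ClientStableRecord:
    id_string: bytes
    lease_expired_or_revoked: bool = False  # first boolean in the record
    reclaim_suspect: bool = False           # second boolean in the record
    reclaim_complete_done: bool = False     # assumption: per-instance state


def before_granting_new_lock(records):
    """Mark clients that have not finished reclaiming as suspect."""
    for rec in records:
        if not rec.reclaim_complete_done:
            rec.reclaim_suspect = True


def on_server_restart(records):
    """A new server instance has seen no RECLAIM_COMPLETE yet."""
    for rec in records:
        rec.reclaim_complete_done = False


def reclaim_status(rec):
    """Decide whether a reclaim from this client can be trusted."""
    if rec.lease_expired_or_revoked or rec.reclaim_suspect:
        return "NFS4ERR_NO_GRACE"
    return "NFS4_OK"


def on_reclaim_complete(rec):
    """Client and server agree again, so both booleans are reset."""
    rec.lease_expired_or_revoked = False
    rec.reclaim_suspect = False
    rec.reclaim_complete_done = True
```

In the second edge condition, before_granting_new_lock marks client A as suspect when the grace period ends without A's RECLAIM_COMPLETE; after the next restart reclaim_status returns NFS4ERR_NO_GRACE for A until A issues RECLAIM_COMPLETE.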
Regardless of the level and approach to record keeping, the server Regardless of the level and approach to record keeping, the server
MUST implement one of the following strategies (which apply to MUST implement one of the following strategies (which apply to
reclaims of share reservations, record locks, and delegations): reclaims of share reservations, record locks, and delegations):
1. Reject all reclaims with NFS4ERR_NO_GRACE. This is extremely 1. Reject all reclaims with NFS4ERR_NO_GRACE. This is extremely
skipping to change at page 160, line 20 skipping to change at page 158, line 19
environment. However, one potential approach is described below. environment. However, one potential approach is described below.
When the client receives NFS4ERR_NO_GRACE, it could examine the When the client receives NFS4ERR_NO_GRACE, it could examine the
change attribute of the objects the client is trying to reclaim state change attribute of the objects the client is trying to reclaim state
for, and use that to determine whether to re-establish the state via for, and use that to determine whether to re-establish the state via
normal OPEN or LOCK requests. This is acceptable provided the normal OPEN or LOCK requests. This is acceptable provided the
client's operating environment allows it. In other words, the client client's operating environment allows it. In other words, the client
implementor is advised to document for his users the behavior. The implementor is advised to document for his users the behavior. The
client could also inform the application that its record lock or client could also inform the application that its record lock or
share reservations (whether they were delegated or not) have been share reservations (whether they were delegated or not) have been
lost, such as via a UNIX signal, a GUI pop-up window, etc. See the lost, such as via a UNIX signal, a GUI pop-up window, etc. See
section, "Data Caching and Revocation" for a discussion of what the Section 10.5 for a discussion of what the client should do for
client should do for dealing with unreclaimed delegations on client dealing with unreclaimed delegations on client state.
state.
For further discussion of revocation of locks see Section 8.7. For further discussion of revocation of locks see Section 8.5.
8.7. Server Revocation of Locks 8.5. Server Revocation of Locks
At any point, the server can revoke locks held by a client and the At any point, the server can revoke locks held by a client and the
client must be prepared for this event. When the client detects that client must be prepared for this event. When the client detects that
its locks have been or may have been revoked, the client is its locks have been or may have been revoked, the client is
responsible for validating the state information between itself and responsible for validating the state information between itself and
the server. Validating locking state for the client means that it the server. Validating locking state for the client means that it
must verify or reclaim state for each lock currently held. must verify or reclaim state for each lock currently held.
The first occasion of lock revocation is upon server reboot or The first occasion of lock revocation is upon server reboot or
restart. In this instance the client will receive an error restart. In this instance the client will receive an error
(NFS4ERR_STALE_STATEID on an operation that takes a stateid as an (NFS4ERR_STALE_STATEID on an operation that takes a stateid as an
argument or NFS4ERR_STALE_CLIENTID on an operation that takes a argument or NFS4ERR_STALE_CLIENTID on an operation that takes a
sessionid or client ID) and the client will proceed with normal crash sessionid or client ID) and the client will proceed with normal crash
recovery as described in the Section 8.6.2.1. recovery as described in the Section 8.4.2.1.
The second occasion of lock revocation is the inability to renew the The second occasion of lock revocation is the inability to renew the
lease before expiration, as discussed above. While this is lease before expiration, as discussed above. While this is
considered a rare or unusual event, the client must be prepared to considered a rare or unusual event, the client must be prepared to
recover. The server is responsible for determining lease expiration, recover. The server is responsible for determining lease expiration,
and deciding exactly how to deal with it, informing the client of the and deciding exactly how to deal with it, informing the client of the
scope of the lock revocation. The client then uses the status scope of the lock revocation. The client then uses the status
information provided by the server in the SEQUENCE results (field information provided by the server in the SEQUENCE results (field
sr_status_flags, see Section 17.46.4) to synchronize its locking sr_status_flags, see Section 18.46.4) to synchronize its locking
state with that of the server, in order to recover. state with that of the server, in order to recover.
The third occasion of lock revocation can occur as a result of The third occasion of lock revocation can occur as a result of
revocation of locks within the lease period, either because of revocation of locks within the lease period, either because of
administrative intervention, or because a recallable lock (a administrative intervention, or because a recallable lock (a
delegation or layout) was not returned within the lease period after delegation or layout) was not returned within the lease period after
having been recalled. While these are considered rare events, they having been recalled. While these are considered rare events, they
are possible and the client must be prepared to deal with them. When are possible and the client must be prepared to deal with them. When
either of these events occurs, the client finds out about the either of these events occurs, the client finds out about the
situation through the status returned by the SEQUENCE operation. Any situation through the status returned by the SEQUENCE operation. Any
use of stateids associated with revoked locks will receive the error use of stateids associated with locks revoked during the lease period
NFS4ERR_ADMIN_REVOKED or NFS4ERR_DELEG_REVOKED, as appropriate. will receive the error NFS4ERR_ADMIN_REVOKED or
NFS4ERR_DELEG_REVOKED, as appropriate.
In all situations in which a subset of locking state may have been In all situations in which a subset of locking state may have been
revoked, which include all cases in which locking state is revoked revoked, which include all cases in which locking state is revoked
within the lease period, it is up to the client to determine which within the lease period, it is up to the client to determine which
locks have been revoked and which have not. It does this by using locks have been revoked and which have not. It does this by using
the TEST_STATEID operation on the appropriate set of stateids. Once the TEST_STATEID operation on the appropriate set of stateids. Once
the set of revoked locks has been determined, the applications can be the set of revoked locks has been determined, the applications can be
notified, and the invalidated stateids can be freed and lock notified, and the invalidated stateids can be freed and lock
revocation acknowledged by using FREE_STATEID. revocation acknowledged by using FREE_STATEID.
8.8. Share Reservations 8.6. Short and Long Leases
When determining the time period for the server lease, the usual
lease tradeoffs apply. Short leases are good for fast server
recovery at a cost of increased operations to effect lease renewal
(when there are no other operations during the period to effect lease
renewal as a side-effect). Long leases are certainly kinder and
gentler to servers trying to handle very large numbers of clients.
The number of extra requests to effect lock renewal drops in inverse
proportion to the lease time. The disadvantages of long leases
include the possibility of slower recovery after certain failures.
After server failure, a longer grace period may be required when some
clients do not promptly reclaim their locks and do a global
RECLAIM_COMPLETE. In the event of client failure, there can be a
longer period for leases to expire thus forcing conflicting requests
to wait.
Long leases are usable if the server is able to store lease state in
non-volatile memory. Upon recovery, the server can reconstruct the
lease state from its non-volatile memory and continue operation with
its clients and therefore long leases would not be an issue.
8.7. Clocks, Propagation Delay, and Calculating Lease Expiration
To avoid the need for synchronized clocks, lease times are granted by
the server as a time delta. However, there is a requirement that the
client and server clocks do not drift excessively over the duration
of the lock. There is also the issue of propagation delay across the
network which could easily be several hundred milliseconds as well as
the possibility that requests will be lost and need to be
retransmitted.
To take propagation delay into account, the client should subtract it
from lease times (e.g. if the client estimates the one-way
propagation delay as 200 msec, then it can assume that the lease is
already 200 msec old when it gets it). In addition, it will take
another 200 msec to get a response back to the server. So the client
must send a lock renewal or write data back to the server 400 msec
before the lease would expire.
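The arithmetic above can be sketched as a small helper. This is only an illustration: the function name and the 90-second lease used in the example are assumptions, not values taken from the protocol.

```python
def latest_send_offset_ms(lease_ms: int, one_way_delay_ms: int) -> int:
    """Latest time to send a renewal, in ms after the lease grant arrived.

    Hypothetical helper: the grant is already one_way_delay_ms old when the
    client receives it, and the renewal needs another one_way_delay_ms to
    reach the server, so the safety margin is twice the one-way delay.
    """
    if lease_ms <= 0 or one_way_delay_ms < 0:
        raise ValueError("lease must be positive and delay non-negative")
    margin_ms = 2 * one_way_delay_ms
    # Never return a negative offset; zero means "renew immediately".
    return max(0, lease_ms - margin_ms)


# Assumed 90-second lease with the 200 msec one-way delay from the text.
print(latest_send_offset_ms(90_000, 200))  # 89600
```

With a 200 msec one-way delay the margin is 400 msec, matching the figure given above.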
The server's lease period configuration should take into account the
network distance of the clients that will be accessing the server's
resources. It is expected that the lease period will take into
account the network propagation delays and other network delay
factors for the client population. Since the protocol does not allow
for an automatic method to determine an appropriate lease period, the
server's administrator may have to tune the lease period.
8.8. Vestigial Locking Infrastructure From V4.0
There are a number of operations and fields within existing
operations that no longer have a function in minor version one. In
one way or another, these changes are all due to the implementation
of sessions which provides client context and exactly-once semantics
as a base feature of the protocol, separate from locking itself.
The following operations have become mandatory-to-not-implement. The
server should return NFS4ERR_NOTSUPP if these operations are found in
an NFSv4.1 COMPOUND.
o SETCLIENTID since its function has been replaced by EXCHANGE_ID.
o SETCLIENTID_CONFIRM since client ID confirmation now happens by
means of CREATE_SESSION.
o OPEN_CONFIRM because OPENs no longer require confirmation to
establish an owner-based sequence value.
o RELEASE_LOCKOWNER because lock-owners with no associated locks do
not have any sequence-related state and so can be deleted by the
server at will.
o RENEW because every SEQUENCE operation for a session causes lease
renewal, making a separate operation useless.
Also, there are a number of fields, present in existing operations
related to locking that have no use in minor version one. They were
used in minor version zero to perform functions now provided in a
different fashion.
o Sequence ids used to sequence requests for a given state-owner and
to provide retry protection, now provided via sessions.
o Client IDs used to identify the client associated with a given
request. Client identification is now available using the client
ID associated with the current session, without needing an
explicit client ID field.
Such vestigial fields in existing operations should be set by the
client to zero. When they are not, the server MUST return an
NFS4ERR_INVAL error.
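A server-side check for the vestigial operations and fields described above might look like the following sketch. The function name, the use of strings for status codes, and the way vestigial field values are passed in are all assumptions made for illustration.

```python
# Operations that the text lists as mandatory-to-not-implement in NFSv4.1.
VESTIGIAL_OPS = frozenset({
    "SETCLIENTID",
    "SETCLIENTID_CONFIRM",
    "OPEN_CONFIRM",
    "RELEASE_LOCKOWNER",
    "RENEW",
})


def check_v41_op(op_name: str, vestigial_fields=()) -> str:
    """Return a symbolic status for one operation in an NFSv4.1 COMPOUND.

    vestigial_fields is a hypothetical iterable holding the values of any
    vestigial fields (sequence ids, client IDs) carried by the operation.
    """
    if op_name in VESTIGIAL_OPS:
        return "NFS4ERR_NOTSUPP"
    # Vestigial fields are expected to be zero; anything else is rejected.
    if any(value != 0 for value in vestigial_fields):
        return "NFS4ERR_INVAL"
    return "NFS4_OK"
```

For example, check_v41_op("OPEN", [0, 0]) is accepted, while a non-zero owner sequence id or client ID field is rejected.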
9. File Locking and Share Reservations
To support Win32 share reservations it is necessary to provide
operations which atomically OPEN or CREATE files. Having a separate
share/unshare operation would not allow correct implementation of the
Win32 OpenFile API. In order to correctly implement share semantics,
the previous NFS protocol mechanisms used when a file is opened or
created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFS
version 4.1 protocol defines an OPEN operation which looks up or
creates a file and establishes locking state on the server.
9.1. Opens and Byte-range Locks
It is assumed that manipulating a lock is rare when compared to READ
and WRITE operations. It is also assumed that crashes and network
partitions are relatively rare. Therefore it is important that the
READ and WRITE operations have a lightweight mechanism to indicate if
they possess a held lock. A lock request contains the heavyweight
information required to establish a lock and uniquely define the lock
owner.
9.1.1. State-owner Definition
When opening a file or requesting a record lock, the client must
specify an identifier which represents the owner of the requested
lock. This identifier is in the form of a state-owner, represented
in the protocol by a state_owner4, a variable-length opaque array
which, when concatenated with the current client ID, uniquely defines
the owner of a lock managed by the client. This may be a thread id,
process id, or other unique value.
Owners of opens and owners of record locks are separate entities and
remain separate even if the same opaque arrays are used to designate
owners of each. The protocol distinguishes between open-owners
(represented by open_owner4 structures) and lock-owners (represented
by lock_owner4 structures).
Each open is associated with a specific open-owner while each record
lock is associated with a lock-owner and an open-owner, the latter
being the open-owner associated with the open file under which the
LOCK operation was done. Delegations and layouts, on the other hand,
are not associated with a specific owner but are associated with the
client as a whole.
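One way to model the owner identities described in this section is sketched below. The class and field names are illustrative assumptions; the protocol only defines the state_owner4, open_owner4 and lock_owner4 structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenOwner:
    clientid: int   # client ID taken from the current session
    owner: bytes    # opaque array chosen by the client


@dataclass(frozen=True)
class LockOwner:
    clientid: int
    owner: bytes


# The same opaque bytes designate two distinct owners, because an
# open-owner and a lock-owner are separate kinds of entity.
open_owner = OpenOwner(clientid=1, owner=b"pid-4242")
lock_owner = LockOwner(clientid=1, owner=b"pid-4242")
distinct_owners = {open_owner, lock_owner}  # holds two entries
```

Dataclass equality compares the class as well as the fields, so the two objects never compare equal even though their contents match.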
9.1.2. Use of the Stateid and Locking
All READ, WRITE and SETATTR operations contain a stateid. For the
purposes of this section, SETATTR operations which change the size
attribute of a file are treated as if they are writing the area
between the old and new size (i.e. the range truncated or added to
the file by means of the SETATTR), even where SETATTR is not
explicitly mentioned in the text. The stateid passed to these
operations must be one that represents an open, a set of byte-range
locks, or a delegation, or it may be a special stateid representing
anonymous access or the special bypass stateid.
If the state-owner performs a READ or WRITE in a situation in which
it has established a lock or share reservation on the server (any
OPEN constitutes a share reservation) the stateid (previously
returned by the server) must be used to indicate what locks,
including both record locks and share reservations, are held by the
state-owner. If no state is established by the client, either record
lock or share reservation, a special stateid for anonymous state
(zero as "other" and "seqid") is used. Regardless of whether a stateid
for anonymous state or a stateid returned by the server is used, if
there is a conflicting share reservation or mandatory record lock
held on the file, the server MUST refuse to service the READ or WRITE
operation.
Share reservations are established by OPEN operations and by their
nature are mandatory in that when the OPEN denies READ or WRITE
operations, that denial results in such operations being rejected
with error NFS4ERR_LOCKED. Record locks may be implemented by the
server as either mandatory or advisory, or the choice of mandatory or
advisory behavior may be determined by the server on the basis of the
file being accessed (for example, some UNIX-based servers support a
"mandatory lock bit" on the mode attribute such that if set, record
locks are required on the file before I/O is possible). When record
locks are advisory, they only prevent the granting of conflicting
lock requests and have no effect on READs or WRITEs. Mandatory
record locks, however, prevent conflicting I/O operations. When they
are attempted, they are rejected with NFS4ERR_LOCKED. When the
client gets NFS4ERR_LOCKED on a file it knows it has the proper share
reservation for, it will need to issue a LOCK request on the region
of the file that includes the region the I/O was to be performed on,
with an appropriate locktype (i.e. READ*_LT for a READ operation,
WRITE*_LT for a WRITE operation).
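The choice of lock type after an NFS4ERR_LOCKED error can be sketched as a small client-side helper. The function name and the blocking flag are assumptions for illustration.

```python
def lock_type_for_io(io_op: str, blocking: bool = False) -> str:
    """Pick the lock type a client would request before retrying an I/O.

    READ maps to READ_LT or READW_LT, WRITE maps to WRITE_LT or WRITEW_LT.
    The blocking variant is chosen when the caller is willing to poll.
    """
    if io_op not in ("READ", "WRITE"):
        raise ValueError("io_op must be 'READ' or 'WRITE'")
    return f"{io_op}{'W' if blocking else ''}_LT"
```

The lock would be requested on a region that includes the region of the rejected I/O, after which the READ or WRITE can be retried.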
Note that for UNIX environments that support mandatory file locking,
the distinction between advisory and mandatory locking is subtle. In
fact, advisory and mandatory record locks are exactly the same as far as
the APIs and implementation requirements are concerned. If the mandatory
lock attribute is set on the file, the server checks to see if the
lock-owner has an appropriate shared (read) or exclusive (write)
record lock on the region it wishes to read or write to. If there is
no appropriate lock, the server checks if there is a conflicting lock
(which can be done by attempting to acquire the conflicting lock on
behalf of the lock-owner, and if successful, release the lock after
the READ or WRITE is done), and if there is, the server returns
NFS4ERR_LOCKED.
For Windows environments, there are no advisory record locks, so the
server always checks for record locks during I/O requests.
Thus, the NFS version 4 LOCK operation does not need to distinguish
between advisory and mandatory record locks. It is the NFS version 4
server's processing of the READ and WRITE operations that introduces
the distinction.
Every stateid which is validly passed to READ, WRITE or SETATTR, with
the exception of special stateid values, defines an access mode for
the file (i.e. READ, WRITE, or READ-WRITE).
o For stateids associated with opens, this is the mode defined by
the original OPEN which caused the allocation of the open stateid
and as modified by subsequent OPENs and OPEN_DOWNGRADEs for the
same open-owner/file pair.
o For stateids associated with record locks, the appropriate mode is
the access mode for the open stateid associated with the lock set
represented by the stateid.
o For delegation stateids the access mode is based on the type of
delegation.
When a READ, WRITE, or SETATTR which specifies the size attribute is
done, the operation is subject to checking against the access mode to
verify that the operation is appropriate given the OPEN with which
the operation is associated.
In the case of WRITE-type operations (i.e. WRITEs and SETATTRs which
set size), the server must verify that the access mode allows writing
and return an NFS4ERR_OPENMODE error if it does not. In the case of
READ, the server may perform the corresponding check on the access
mode, or it may choose to allow READ on opens for WRITE only, to
accommodate clients whose write implementation may unavoidably do
reads (e.g. due to buffer cache constraints). However, even if READs
are allowed in these circumstances, the server MUST still check for
locks that conflict with the READ (e.g. another open specifying denial
of READs). Note that a server which does enforce the access mode
check on READs need not explicitly check for conflicting share
reservations since the existence of OPEN for read access guarantees
that no conflicting share reservation can exist.
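The access mode check described above can be sketched as follows. The function name, the string operation names, and the enforce_read_check flag (standing for the server's choice about READs on write-only opens) are illustrative assumptions.

```python
ACCESS_READ = 0x1
ACCESS_WRITE = 0x2


def check_access_mode(op: str, access_mode: int,
                      enforce_read_check: bool = True) -> str:
    """Check an I/O operation against the access mode of its stateid.

    op is "READ", "WRITE", or "SETATTR_SIZE" (a SETATTR that sets size,
    which is treated as a WRITE-type operation).
    """
    if op in ("WRITE", "SETATTR_SIZE"):
        # Writing always requires write access.
        if access_mode & ACCESS_WRITE:
            return "NFS4_OK"
        return "NFS4ERR_OPENMODE"
    if op == "READ":
        # The server may allow READs on write-only opens.
        if enforce_read_check and not access_mode & ACCESS_READ:
            return "NFS4ERR_OPENMODE"
        return "NFS4_OK"
    raise ValueError("unsupported operation")
```

A server that passes enforce_read_check=False would still need to check separately for locks and share reservations that conflict with the READ.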
The read bypass special stateid (all bits of "other" and "seqid" set
to one) indicates a desire to bypass locking checks. The
server MAY allow READ operations to bypass locking checks at the
server, when this special stateid is used. However, WRITE operations
with this special stateid value MUST NOT bypass locking checks and
are treated exactly the same as if a special stateid for anonymous
state were used.
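The two special stateids can be recognized with a sketch like the one below. It assumes a stateid made of a 32-bit "seqid" and a 12-byte "other" field; the 12-byte size and all names here are assumptions for illustration.

```python
NFS4_OTHER_SIZE = 12  # assumed size of the "other" field, in bytes


def classify_stateid(seqid: int, other: bytes) -> str:
    """Classify a stateid as anonymous, read bypass, or regular."""
    if len(other) != NFS4_OTHER_SIZE:
        raise ValueError("unexpected size for the 'other' field")
    if seqid == 0 and other == bytes(NFS4_OTHER_SIZE):
        return "anonymous"        # all bits zero
    if seqid == 0xFFFFFFFF and other == b"\xff" * NFS4_OTHER_SIZE:
        return "read_bypass"      # all bits one
    return "regular"


def skips_lock_checks(io_op: str, kind: str,
                      server_allows_bypass: bool) -> bool:
    """Only a READ with the read bypass stateid may skip locking checks."""
    return io_op == "READ" and kind == "read_bypass" and server_allows_bypass
```

A WRITE carrying the read bypass stateid never skips the checks, so it is handled as if the anonymous stateid had been used.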
A lock may not be granted while a READ or WRITE operation using one
of the special stateids is being performed and the scope of the lock
to be granted would conflict with the READ or WRITE operation. This
can occur when:
o A mandatory byte-range lock is requested with a range that conflicts
with the range of the READ or WRITE operation. For the purposes
of this paragraph, a conflict occurs when a shared lock is
requested and a WRITE operation is being performed, or an
exclusive lock is requested and either a READ or a WRITE operation
is being performed.
o A share reservation is requested which denies reading and/or
writing and the corresponding operation is being performed.
o A delegation is to be granted and the delegation type would
prevent the I/O operation, i.e. READ and WRITE conflict with a
write delegation and WRITE conflicts with a read delegation.
A SETATTR that sets size is treated similarly to a WRITE as discussed
above.
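The three conflict cases listed above can be sketched as predicates that a server might evaluate before granting a lock while special-stateid I/O is in progress. All names are illustrative assumptions, and a SETATTR that sets size is expected to be passed in as "WRITE".

```python
DENY_READ = 0x1
DENY_WRITE = 0x2


def ranges_overlap(off_a: int, len_a: int, off_b: int, len_b: int) -> bool:
    """True when two byte ranges share at least one byte."""
    return off_a < off_b + len_b and off_b < off_a + len_a


def byte_range_lock_conflicts(lock_type: str, lock_off: int, lock_len: int,
                              io_op: str, io_off: int, io_len: int) -> bool:
    """lock_type is "shared" or "exclusive"; io_op is "READ" or "WRITE"."""
    if not ranges_overlap(lock_off, lock_len, io_off, io_len):
        return False
    if lock_type == "shared":
        return io_op == "WRITE"
    if lock_type == "exclusive":
        return io_op in ("READ", "WRITE")
    raise ValueError("unknown lock type")


def share_deny_conflicts(deny_bits: int, io_op: str) -> bool:
    """A deny bit conflicts with the matching I/O in progress."""
    wanted = DENY_READ if io_op == "READ" else DENY_WRITE
    return bool(deny_bits & wanted)


def delegation_conflicts(deleg_type: str, io_op: str) -> bool:
    """Write delegations conflict with any I/O, read ones with WRITE."""
    if deleg_type == "write":
        return io_op in ("READ", "WRITE")
    if deleg_type == "read":
        return io_op == "WRITE"
    raise ValueError("unknown delegation type")
```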
When a client holds a delegation, it is particularly important to
make sure that the stateid sent conveys the association of the
operation with the delegation, to keep the delegation from being
avoidably recalled. When the delegation stateid, a stateid for an open
associated with that delegation, or a stateid representing byte-range
locks derived from such an open is used, the server knows that the READ,
WRITE, or SETATTR does not conflict with the delegation, but is
issued under the aegis of the delegation. Even though it is possible
for the server to determine from the clientid (gotten from the
sessionid) that the client does in fact have a delegation, the server
is not obliged to check this, so using a special stateid can result
in avoidable recall of the delegation.
9.2. Lock Ranges
The protocol allows a lock owner to request a lock with an octet
range and then either upgrade, downgrade, or unlock a sub-range of
the initial lock. It is expected that this will be an uncommon type
of request. In any case, servers or server file systems may not be
able to support sub-range lock semantics. In the event that a server
receives a locking request that represents a sub-range of current
locking state for the lock owner, the server is allowed to return the
error NFS4ERR_LOCK_RANGE to signify that it does not support sub-
range lock operations. Therefore, the client should be prepared to
receive this error and, if appropriate, report the error to the
requesting application.
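A server without sub-range support might detect the case described above as in the following sketch. The names are assumptions, and ranges are simplified to (offset, length) pairs with finite lengths.

```python
def lock_request_status(request, held_ranges, supports_subrange: bool) -> str:
    """Decide whether a lock request hits the sub-range restriction.

    request is an (offset, length) pair; held_ranges lists the ranges
    already locked by the same lock owner.
    """
    req_off, req_len = request
    for held_off, held_len in held_ranges:
        inside = (held_off <= req_off
                  and req_off + req_len <= held_off + held_len)
        is_same_range = (req_off, req_len) == (held_off, held_len)
        # A proper sub-range of held state is what the server may refuse.
        if inside and not is_same_range and not supports_subrange:
            return "NFS4ERR_LOCK_RANGE"
    return "NFS4_OK"
```

A request that matches a held range exactly, or that does not fall inside any held range, is not treated as a sub-range operation here.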
The client is discouraged from combining multiple independent locking
ranges that happen to be adjacent into a single request since the
server may not support sub-range requests and for reasons related to
the recovery of file locking state in the event of server failure.
As discussed in Section 8.4.2, the server may employ certain
optimizations during recovery that work effectively only when the
client's behavior during lock recovery is similar to the client's
locking behavior prior to server failure.
9.3. Upgrading and Downgrading Locks
If a client has a write lock on a record, it can request an atomic
downgrade of the lock to a read lock via the LOCK request, by setting
the type to READ_LT. If the server supports atomic downgrade, the
request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP.
The client should be prepared to receive this error, and if
appropriate, report the error to the requesting application.
If a client has a read lock on a record, it can request an atomic
upgrade of the lock to a write lock via the LOCK request by setting
the type to WRITE_LT or WRITEW_LT. If the server does not support
atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade
can be achieved without an existing conflict, the request will
succeed. Otherwise, the server will return either NFS4ERR_DENIED or
NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the
client issued the LOCK request with the type set to WRITEW_LT and the
server has detected a deadlock. The client should be prepared to
receive such errors and if appropriate, report the error to the
requesting application.
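The outcomes described in this section can be summarized in a sketch of the server's decision. The function and parameter names are illustrative assumptions, and conflict and deadlock detection are reduced to boolean inputs.

```python
def change_lock_type(held_type: str, requested_type: str,
                     supports_downgrade: bool, supports_upgrade: bool,
                     has_conflict: bool = False,
                     deadlock_detected: bool = False) -> str:
    """Return the status for an atomic upgrade or downgrade via LOCK.

    held_type is "READ" or "WRITE"; requested_type is one of
    "READ_LT", "WRITE_LT", or "WRITEW_LT".
    """
    if held_type == "WRITE" and requested_type == "READ_LT":
        # Downgrade from a write lock to a read lock.
        return "NFS4_OK" if supports_downgrade else "NFS4ERR_LOCK_NOTSUPP"
    if held_type == "READ" and requested_type in ("WRITE_LT", "WRITEW_LT"):
        # Upgrade from a read lock to a write lock.
        if not supports_upgrade:
            return "NFS4ERR_LOCK_NOTSUPP"
        if not has_conflict:
            return "NFS4_OK"
        if requested_type == "WRITEW_LT" and deadlock_detected:
            return "NFS4ERR_DEADLOCK"
        return "NFS4ERR_DENIED"
    raise ValueError("not an upgrade or downgrade request")
```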
9.4. Blocking Locks
Some clients require the support of blocking locks. NFSv4.1 does not
provide a callback when a previously unavailable lock becomes
available. Clients thus have no choice but to continually poll for
the lock. This presents a fairness problem. Two new lock types are
added, READW and WRITEW, and are used to indicate to the server that
the client is requesting a blocking lock. The server should maintain
an ordered list of pending blocking locks. When the conflicting lock
is released, the server may wait the lease period for the first
waiting client to re-request the lock. After the lease period
expires the next waiting client request is allowed the lock. Clients
are required to poll at an interval sufficiently small that they are
likely to acquire the lock in a timely manner. The server is not
required to maintain a list of pending blocked locks as it is used to
increase fairness and is not needed for correct operation. Because of the
unordered nature of crash recovery, storing of lock state to stable
storage would be required to guarantee ordered granting of blocking
locks.
Servers may also note the lock types and delay returning denial of
the request to allow extra time for a conflicting lock to be
released, allowing a successful return. In this way, clients can
avoid the burden of needlessly frequent polling for blocking locks.
The server should take care in the length of delay in the event the
client retransmits the request.
If a server receives a blocking lock request, denies it, and then
later receives a nonblocking request for the same lock, which is also
denied, then it should remove the lock in question from its list of
pending blocking locks. Clients should use such a nonblocking
request to indicate to the server that this is the last time they
intend to poll for the lock, as may happen when the process
requesting the lock is interrupted. This is a courtesy to the
server, to prevent it from unnecessarily waiting a lease period
before granting other lock requests. However, clients are not
required to perform this courtesy, and servers must not depend on
them doing so. Also, clients must be prepared for the possibility
that this final locking request will be accepted.
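The optional fairness list can be sketched for a single lock as shown below. This is one possible reading of the behavior described above, not a required implementation: the class name, the explicit "now" timestamps, and the decision to evaluate expiry lazily when a request arrives are all assumptions.

```python
from collections import deque


class BlockingLockQueue:
    """Ordered list of pending blocking requests for one lock."""

    def __init__(self, lease_s: float):
        self.lease_s = lease_s
        self.holder = None
        self.waiters = deque()
        self.reserved_until = None  # end of the head waiter's window

    def _expire_reservation(self, now: float) -> None:
        # If the first waiter did not re-request within a lease period,
        # drop it and give the next waiter a fresh window.
        if (self.holder is None and self.waiters
                and self.reserved_until is not None
                and now >= self.reserved_until):
            self.waiters.popleft()
            self.reserved_until = now + self.lease_s if self.waiters else None

    def request(self, owner, blocking: bool, now: float) -> str:
        self._expire_reservation(now)
        if self.holder is None and (not self.waiters
                                    or self.waiters[0] == owner):
            if self.waiters:
                self.waiters.popleft()
            self.holder = owner
            self.reserved_until = None
            return "NFS4_OK"
        if blocking:
            if owner not in self.waiters:
                self.waiters.append(owner)
        elif owner in self.waiters:
            # A denied nonblocking request signals the final poll.
            self.waiters.remove(owner)
        return "NFS4ERR_DENIED"

    def release(self, now: float) -> None:
        self.holder = None
        self.reserved_until = now + self.lease_s if self.waiters else None
```

In this sketch a nonblocking request from the first waiter is still granted when the lock is free, which matches the note that the final locking request may be accepted.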
9.5. Share Reservations
A share reservation is a mechanism to control access to a file. It A share reservation is a mechanism to control access to a file. It
is a separate and independent mechanism from record locking. When a is a separate and independent mechanism from record locking. When a
client opens a file, it issues an OPEN operation to the server client opens a file, it issues an OPEN operation to the server
specifying the type of access required (READ, WRITE, or BOTH) and the specifying the type of access required (READ, WRITE, or BOTH) and the
type of access to deny others (deny NONE, READ, WRITE, or BOTH). If type of access to deny others (deny NONE, READ, WRITE, or BOTH). If
the OPEN fails the client will fail the application's open request. the OPEN fails the client will fail the application's open request.
Pseudo-code definition of the semantics: Pseudo-code definition of the semantics:
skipping to change at page 162, line 14 skipping to change at page 167, line 38
const OPEN4_SHARE_ACCESS_READ = 0x00000001; const OPEN4_SHARE_ACCESS_READ = 0x00000001;
const OPEN4_SHARE_ACCESS_WRITE = 0x00000002; const OPEN4_SHARE_ACCESS_WRITE = 0x00000002;
const OPEN4_SHARE_ACCESS_BOTH = 0x00000003; const OPEN4_SHARE_ACCESS_BOTH = 0x00000003;
const OPEN4_SHARE_DENY_NONE = 0x00000000; const OPEN4_SHARE_DENY_NONE = 0x00000000;
const OPEN4_SHARE_DENY_READ = 0x00000001; const OPEN4_SHARE_DENY_READ = 0x00000001;
const OPEN4_SHARE_DENY_WRITE = 0x00000002; const OPEN4_SHARE_DENY_WRITE = 0x00000002;
const OPEN4_SHARE_DENY_BOTH = 0x00000003; const OPEN4_SHARE_DENY_BOTH = 0x00000003;
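Using the constants above, a share reservation conflict check could be sketched as follows. This is an illustrative Python rendering and is not claimed to be the elided pseudo-code itself: the function name is hypothetical, and the rule that a request with no access bits is invalid is an assumption.

```python
OPEN4_SHARE_ACCESS_READ = 0x00000001
OPEN4_SHARE_ACCESS_WRITE = 0x00000002
OPEN4_SHARE_ACCESS_BOTH = 0x00000003

OPEN4_SHARE_DENY_NONE = 0x00000000
OPEN4_SHARE_DENY_READ = 0x00000001
OPEN4_SHARE_DENY_WRITE = 0x00000002
OPEN4_SHARE_DENY_BOTH = 0x00000003


def check_share(req_access: int, req_deny: int,
                cur_access: int, cur_deny: int) -> str:
    """Check a new OPEN against the access and deny bits already held.

    cur_access and cur_deny are the unions of the bits held by other
    opens of the same file.
    """
    if req_access == 0:
        return "NFS4ERR_INVAL"  # assumed: an OPEN must request some access
    # Conflict if the new access is denied by an existing open, or the
    # new deny bits would deny access an existing open already has.
    if (req_access & cur_deny) or (req_deny & cur_access):
        return "NFS4ERR_SHARE_DENIED"
    return "NFS4_OK"
```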
8.9. OPEN/CLOSE Operations 9.6. OPEN/CLOSE Operations
To provide correct share semantics, a client MUST use the OPEN To provide correct share semantics, a client MUST use the OPEN
operation to obtain the initial filehandle and indicate the desired operation to obtain the initial filehandle and indicate the desired
access and what access, if any, to deny. Even if the client intends access and what access, if any, to deny. Even if the client intends
to use a special stateid for anonymous state or read bypass, it must to use a special stateid for anonymous state or read bypass, it must
still obtain the filehandle for the regular file with the OPEN still obtain the filehandle for the regular file with the OPEN
operation so the appropriate share semantics can be applied. For operation so the appropriate share semantics can be applied. For
clients that do not have a deny mode built into their open clients that do not have a deny mode built into their open
programming interfaces, deny equal to NONE should be used. programming interfaces, deny equal to NONE should be used.
skipping to change at page 162, line 45 skipping to change at page 168, line 21
CLOSE. CLOSE.
The LOOKUP operation will return a filehandle without establishing The LOOKUP operation will return a filehandle without establishing
any lock state on the server. Without a valid stateid, the server any lock state on the server. Without a valid stateid, the server
will assume the client has the least access. For example, a file will assume the client has the least access. For example, a file
opened with deny READ/WRITE using a filehandle obtained through opened with deny READ/WRITE using a filehandle obtained through
LOOKUP could only be read using the special read bypass stateid and LOOKUP could only be read using the special read bypass stateid and
could not be written at all because it would not have a valid stateid could not be written at all because it would not have a valid stateid
and the special anonymous stateid would not be allowed access. and the special anonymous stateid would not be allowed access.
8.10. Open Upgrade and Downgrade 9.7. Open Upgrade and Downgrade
When an OPEN is done for a file and the open-owner for which the open When an OPEN is done for a file and the open-owner for which the open
is being done already has the file open, the result is to upgrade the is being done already has the file open, the result is to upgrade the
open file status maintained on the server to include the access and open file status maintained on the server to include the access and
deny bits specified by the new OPEN as well as those for the existing deny bits specified by the new OPEN as well as those for the existing
OPEN. The result is that there is one open file, as far as the OPEN. The result is that there is one open file, as far as the
protocol is concerned, and it includes the union of the access and protocol is concerned, and it includes the union of the access and
deny bits for all of the OPEN requests completed. Only a single deny bits for all of the OPEN requests completed. Only a single
CLOSE will be done to reset the effects of both OPENs. Note that the CLOSE will be done to reset the effects of both OPENs. Note that the
client, when issuing the OPEN, may not know that the same file is in client, when issuing the OPEN, may not know that the same file is in
skipping to change at page 163, line 27 skipping to change at page 168, line 52
When multiple open files on the client are merged into a single open When multiple open files on the client are merged into a single open
file object on the server, the close of one of the open files (on the file object on the server, the close of one of the open files (on the
client) may necessitate change of the access and deny status of the client) may necessitate change of the access and deny status of the
open file on the server. This is because the union of the access and open file on the server. This is because the union of the access and
deny bits for the remaining opens may be smaller (i.e. a proper deny bits for the remaining opens may be smaller (i.e. a proper
subset) than previously. The OPEN_DOWNGRADE operation is used to subset) than previously. The OPEN_DOWNGRADE operation is used to
make the necessary change and the client should use it to update the make the necessary change and the client should use it to update the
server so that share reservation requests by other clients are server so that share reservation requests by other clients are
handled properly. handled properly.
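The client-side bookkeeping described above can be sketched as follows. The function names and the tuple-based return value are assumptions for illustration; each local open is represented by its (access, deny) bit pair.

```python
def union_bits(opens):
    """Union of the access and deny bits over a list of (access, deny)."""
    access = deny = 0
    for open_access, open_deny in opens:
        access |= open_access
        deny |= open_deny
    return access, deny


def action_after_local_close(remaining_opens, server_access: int,
                             server_deny: int):
    """Decide what to send after one local open of a file is closed.

    remaining_opens holds the (access, deny) pairs of the local opens
    still merged into the single open file object on the server.
    """
    if not remaining_opens:
        return ("CLOSE",)
    access, deny = union_bits(remaining_opens)
    if (access, deny) != (server_access, server_deny):
        # The remaining union is smaller, so tell the server.
        return ("OPEN_DOWNGRADE", access, deny)
    return ("NONE",)
```

For example, if a read-only open and a write-only open were merged on the server as access 0x3, closing the write-only open leaves a union of 0x1 and calls for an OPEN_DOWNGRADE.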
8.11. Short and Long Leases Because of the possibility that the client will issue multiple opens
for the same owner in parallel, it may be the case that an open
When determining the time period for the server lease, the usual upgrade may happen without the client knowing beforehand that this
lease tradeoffs apply. Short leases are good for fast server could happen. Because of this possibility, CLOSEs and
recovery at a cost of increased operations to effect lease renewal OPEN_DOWNGRADEs should generally be issued with a non-zero seqid in
(when there are no other operations during the period to effect lease the stateid, to avoid the possibility that the status change
renewal as a side-effect). Long leases are certainly kinder and associated with an open upgrade is inadvertently lost.
gentler to servers trying to handle very large numbers of clients.
The number of extra requests to effect lock renewal drop in inverse
proportion to the lease time. The disadvantages of long leases
include the possibility of slower recovery after certain failures.
After server failure, a longer grace period may be required when some
clients do not promptly reclaim their locks and do a
RECLAIM_COMPLETE. In the event of client failure, it can longer
period for leases to expire thus forcing conflicting requests to
wait.
Long leases are usable if the server is able to store lease state in
non-volatile memory. Upon recovery, the server can reconstruct the
lease state from its non-volatile memory and continue operation with
its clients and therefore long leases would not be an issue.
8.12. Clocks, Propagation Delay, and Calculating Lease Expiration
To avoid the need for synchronized clocks, lease times are granted by
the server as a time delta. However, there is a requirement that the
client and server clocks do not drift excessively over the duration
of the lock. There is also the issue of propagation delay across the
network which could easily be several hundred milliseconds as well as
the possibility that requests will be lost and need to be
retransmitted.
To take propagation delay into account, the client should subtract it
from lease times (e.g. if the client estimates the one-way
propagation delay as 200 msec, then it can assume that the lease is
already 200 msec old when it gets it). In addition, it will take
another 200 msec to get a response back to the server. So the client
must send a lock renewal or write data back to the server 400 msec
before the lease would expire.
The server's lease period configuration should take into account the
network distance of the clients that will be accessing the server's
resources. It is expected that the lease period will take into
account the network propagation delays and other network delay
factors for the client population. Since the protocol does not allow
for an automatic method to determine an appropriate lease period, the
server's administrator may have to tune the lease period.
8.13. Vestigial Locking Infrastructure From V4.0
There are a number of operations and fields within existing
operations that no longer have a function in minor version one. In
one way or another, these changes are all due to the implementation
of sessions which provides client context and exactly once semantics
as a base feature of the protocol, separate from locking itself.
The following operations have become mandatory-to-not-implement. The
server should return NFS4ERR_NOTSUPP if these operations are found in
an NFSv4.1 COMPOUND.
o SETCLIENTID since its function has been replaced by EXCHANGE_ID.
o SETCLIENTID_CONFIRM since client ID confirmation now happens by
means of CREATE_SESSION.
o OPEN_CONFIRM because OPEN's no longer require confirmation to
establish an owner-based sequence value.
o RELEASE_LOCKOWNER because lock-owners with no associated locks 9.8. Reclaim of Open and Byte-range Locks
have any sequence-related state and so can be deleted by the
server at will.
o RENEW because every SEQUENCE operation for a session causes lease Special forms of the LOCK and OPEN operations are provided when it is
renewal, making a separate operation useless. necessary to re-establish byte-range locks or opens after a server
failure.
Also, there are a number of fields, present in existing operations o To reclaim existing opens, an OPEN operation is performed using a
related to locking that have no use in minor version one. They were CLAIM_PREVIOUS. Because the client, in this type of situation,
used in minor version zero to perform functions now provided in a will have already opened the file and have the filehandle of the
different fashion. target file, this operation requires that the current filehandle
be the target file, rather than a directory and no file name is
specified.
o Sequence ids used to sequence requests for a given state-owner and o To reclaim byte-range locks, a LOCK operation with the reclaim
to provide retry protection, now provided via sessions. parameter set to true is used.
o Client IDs used to identify the client associated with a given For reasons described in Section 18.46.5, OPEN reclaims that perform
request. Client identification is now available using the client upgrades can cause the client and server to not have the same view of
ID associated with the current session, without needing an open state. Therefore, the client MUST NOT perform an OPEN reclaim
explicit client ID field. that is also an OPEN upgrade (Section 9.7) unless the client precedes
the OPEN upgrade/reclaim with a TEST_STATEID operation in the same
COMPOUND. The stateid used in TEST_STATEID will be the one returned
by the reclaim OPEN whose open state the OPEN upgrade/reclaim is
upgrading. Alternatively, the client can avoid OPEN upgrade during the
reclaim phase.
Such vestigial fields in existing operations should be set by the Reclaims of opens associated with delegations are discussed in
client to zero. When they are not, the server MUST return an Section 10.2.1.
NFS4ERR_INVAL error.
9. Client-Side Caching 10. Client-Side Caching
Client-side caching of data, of file attributes, and of file names is Client-side caching of data, of file attributes, and of file names is
essential to providing good performance with the NFS protocol. essential to providing good performance with the NFS protocol.
Providing distributed cache coherence is a difficult problem and Providing distributed cache coherence is a difficult problem and
previous versions of the NFS protocol have not attempted it. previous versions of the NFS protocol have not attempted it.
Instead, several NFS client implementation techniques have been used Instead, several NFS client implementation techniques have been used
to reduce the problems that a lack of coherence poses for users. to reduce the problems that a lack of coherence poses for users.
These techniques have not been clearly defined by earlier protocol These techniques have not been clearly defined by earlier protocol
specifications and it is often unclear what is valid or invalid specifications and it is often unclear what is valid or invalid
client behavior. client behavior.
skipping to change at page 166, line 5 skipping to change at page 170, line 18
defines a more limited set of caching guarantees to allow locks and defines a more limited set of caching guarantees to allow locks and
share reservations to be used without destructive interference from share reservations to be used without destructive interference from
client side caching. client side caching.
In addition, the NFS version 4 protocol introduces a delegation In addition, the NFS version 4 protocol introduces a delegation
mechanism which allows many decisions normally made by the server to mechanism which allows many decisions normally made by the server to
be made locally by clients. This mechanism provides efficient be made locally by clients. This mechanism provides efficient
support of the common cases where sharing is infrequent or where support of the common cases where sharing is infrequent or where
sharing is read-only. sharing is read-only.
9.1. Performance Challenges for Client-Side Caching 10.1. Performance Challenges for Client-Side Caching
Caching techniques used in previous versions of the NFS protocol have Caching techniques used in previous versions of the NFS protocol have
been successful in providing good performance. However, several been successful in providing good performance. However, several
scalability challenges can arise when those techniques are used with scalability challenges can arise when those techniques are used with
very large numbers of clients. This is particularly true when very large numbers of clients. This is particularly true when
clients are geographically distributed which classically increases clients are geographically distributed which classically increases
the latency for cache revalidation requests. the latency for cache revalidation requests.
The previous versions of the NFS protocol repeat their file data The previous versions of the NFS protocol repeat their file data
cache validation requests at the time the file is opened. This cache validation requests at the time the file is opened. This
skipping to change at page 166, line 29 skipping to change at page 170, line 42
In this case, repeated reference to the server to find that no In this case, repeated reference to the server to find that no
conflicts exist is expensive. A better option with regards to conflicts exist is expensive. A better option with regards to
performance is to allow a client that repeatedly opens a file to do performance is to allow a client that repeatedly opens a file to do
so without reference to the server. This is done until potentially so without reference to the server. This is done until potentially
conflicting operations from another client actually occur. conflicting operations from another client actually occur.
A similar situation arises in connection with file locking. Sending A similar situation arises in connection with file locking. Sending
file lock and unlock requests to the server as well as the read and file lock and unlock requests to the server as well as the read and
write requests necessary to make data caching consistent with the write requests necessary to make data caching consistent with the
locking semantics (see the section "Data Caching and File Locking") locking semantics (see Section 10.3.2) can severely limit performance.
can severely limit performance. When locking is used to provide When locking is used to provide protection against infrequent
protection against infrequent conflicts, a large penalty is incurred. conflicts, a large penalty is incurred. This penalty may discourage
This penalty may discourage the use of file locking by applications. the use of file locking by applications.
The NFS version 4 protocol provides more aggressive caching The NFS version 4 protocol provides more aggressive caching
strategies with the following design goals: strategies with the following design goals:
o Compatibility with a large range of server semantics. o Compatibility with a large range of server semantics.
o Provide the same caching benefits as previous versions of the NFS o Provide the same caching benefits as previous versions of the NFS
protocol when unable to provide the more aggressive model. protocol when unable to provide the more aggressive model.
o Requirements for aggressive caching are organized so that a large o Requirements for aggressive caching are organized so that a large
portion of the benefit can be obtained even when not all of the portion of the benefit can be obtained even when not all of the
requirements can be met. requirements can be met.
The appropriate requirements for the server are discussed in later The appropriate requirements for the server are discussed in later
sections in which specific forms of caching are covered. (see the sections in which specific forms of caching are covered (see
section "Open Delegation"). Section 10.4).
9.2. Delegation and Callbacks 10.2. Delegation and Callbacks
Recallable delegation of server responsibilities for a file to a Recallable delegation of server responsibilities for a file to a
client improves performance by avoiding repeated requests to the client improves performance by avoiding repeated requests to the
server in the absence of inter-client conflict. With the use of a server in the absence of inter-client conflict. With the use of a
"callback" RPC from server to client, a server recalls delegated "callback" RPC from server to client, a server recalls delegated
responsibilities when another client engages in sharing of a responsibilities when another client engages in sharing of a
delegated file. delegated file.
A delegation is passed from the server to the client, specifying the A delegation is passed from the server to the client, specifying the
object of the delegation and the type of delegation. There are object of the delegation and the type of delegation. There are
skipping to change at page 168, line 9 skipping to change at page 172, line 22
At the time the client receives a delegation recall, it may have At the time the client receives a delegation recall, it may have
substantial state that needs to be flushed to the server. Therefore, substantial state that needs to be flushed to the server. Therefore,
the server should allow sufficient time for the delegation to be the server should allow sufficient time for the delegation to be
returned since it may involve numerous RPCs to the server. If the returned since it may involve numerous RPCs to the server. If the
server is able to determine that the client is diligently flushing server is able to determine that the client is diligently flushing
state to the server as a result of the recall, the server may extend state to the server as a result of the recall, the server may extend
the usual time allowed for a recall. However, the time allowed for the usual time allowed for a recall. However, the time allowed for
recall completion should not be unbounded. recall completion should not be unbounded.
An example of this is when responsibility to mediate opens on a given An example of this is when responsibility to mediate opens on a given
file is delegated to a client (see the section "Open Delegation"). file is delegated to a client (see Section 10.4). The server will
The server will not know what opens are in effect on the client. not know what opens are in effect on the client. Without this
Without this knowledge the server will be unable to determine if the knowledge the server will be unable to determine if the access and
access and deny state for the file allows any particular open until deny state for the file allows any particular open until the
the delegation for the file has been returned. delegation for the file has been returned.
A client failure or a network partition can result in failure to A client failure or a network partition can result in failure to
respond to a recall callback. In this case, the server will revoke respond to a recall callback. In this case, the server will revoke
the delegation which in turn will render useless any modified state the delegation which in turn will render useless any modified state
still on the client. still on the client.
9.2.1. Delegation Recovery 10.2.1. Delegation Recovery
There are three situations that delegation recovery must deal with: There are three situations that delegation recovery must deal with:
o Client reboot or restart o Client reboot or restart
o Server reboot or restart o Server reboot or restart
o Network partition (full or callback-only) o Network partition (full or callback-only)
In the event the client reboots or restarts, the failure to renew In the event the client reboots or restarts, the failure to renew
skipping to change at page 168, line 51 skipping to change at page 173, line 16
To allow for this type of client recovery, the server MAY extend the To allow for this type of client recovery, the server MAY extend the
period for delegation recovery beyond the typical lease expiration period for delegation recovery beyond the typical lease expiration
period. This implies that requests from other clients that conflict period. This implies that requests from other clients that conflict
with these delegations will need to wait. Because the normal recall with these delegations will need to wait. Because the normal recall
process may require significant time for the client to flush changed process may require significant time for the client to flush changed
state to the server, other clients need to be prepared for delays that state to the server, other clients need to be prepared for delays that
occur because of a conflicting delegation. This longer interval occur because of a conflicting delegation. This longer interval
would increase the window for clients to reboot and consult stable would increase the window for clients to reboot and consult stable
storage so that the delegations can be reclaimed. For open storage so that the delegations can be reclaimed. For open
delegations, such delegations are reclaimed using OPEN with a claim delegations, such delegations are reclaimed using OPEN with a claim
type of CLAIM_DELEGATE_PREV. (See the sections on "Data Caching and type of CLAIM_DELEGATE_PREV. (See Section 10.5 and Section 18.16 for
Revocation" and "Operation 18: OPEN" for discussion of open discussion of open delegation and the details of OPEN respectively).
delegation and the details of OPEN respectively).
A server MAY support a claim type of CLAIM_DELEGATE_PREV, but if it A server MAY support a claim type of CLAIM_DELEGATE_PREV, but if it
does, it MUST NOT remove delegations upon a CREATE_SESSION that does, it MUST NOT remove delegations upon a CREATE_SESSION that
confirms a client ID created by EXCHANGE_ID, and instead MUST, for a confirms a client ID created by EXCHANGE_ID, and instead MUST, for a
period of time no less than that of the value of the lease_time period of time no less than that of the value of the lease_time
attribute, maintain the client's delegations to allow time for the attribute, maintain the client's delegations to allow time for the
client to issue CLAIM_DELEGATE_PREV requests. The server that client to issue CLAIM_DELEGATE_PREV requests. The server that
supports CLAIM_DELEGATE_PREV MUST support the DELEGPURGE operation. supports CLAIM_DELEGATE_PREV MUST support the DELEGPURGE operation.
When the server reboots or restarts, delegations are reclaimed (using When the server reboots or restarts, delegations are reclaimed (using
skipping to change at page 169, line 38 skipping to change at page 173, line 50
o Upon reclaim, a client reporting resources assigned to it by an o Upon reclaim, a client reporting resources assigned to it by an
earlier server instance must be granted those resources. earlier server instance must be granted those resources.
o The server has unquestionable authority to determine whether o The server has unquestionable authority to determine whether
delegations are to be granted and, once granted, whether they are delegations are to be granted and, once granted, whether they are
to be continued. to be continued.
o The use of callbacks is not to be depended upon until the client o The use of callbacks is not to be depended upon until the client
has proven its ability to receive them. has proven its ability to receive them.
When a client needs to reclaim a delegation and there is no
associated open, the client may use the CLAIM_PREVIOUS variant of the
WANT_DELEGATION operation. However, since the server is not required
to support this operation, an alternative is to reclaim a dummy open
together with the delegation using an OPEN of type CLAIM_PREVIOUS.
The dummy open file can be released using a CLOSE to re-establish the
original state to be reclaimed, a delegation without an associated
open.
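The two ways of reclaiming a delegation that has no associated open can be sketched as an operation sequence. This is an illustrative Python fragment only: the function name is hypothetical, and the boolean parameter stands in for whatever means the client uses to learn whether the server supports WANT_DELEGATION, which the text does not specify.

```python
from typing import List, Optional, Tuple


def reclaim_delegation_without_open(
    server_supports_want_delegation: bool,
) -> List[Tuple[str, Optional[str]]]:
    """Return (operation, claim type) pairs for reclaiming a bare delegation."""
    if server_supports_want_delegation:
        # Direct reclaim of the delegation, with no open involved.
        return [("WANT_DELEGATION", "CLAIM_PREVIOUS")]
    # Fallback: reclaim a dummy open together with the delegation, then
    # release the dummy open so that only the delegation remains.
    return [("OPEN", "CLAIM_PREVIOUS"), ("CLOSE", None)]
```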
When a client has more than a single open associated with a
delegation, state for those additional opens can be established using
OPEN operations of type CLAIM_DELEGATE_CUR. When these are used to
establish opens associated with reclaimed delegations, the server
MUST allow them within the grace period.
When a network partition occurs, delegations are subject to freeing When a network partition occurs, delegations are subject to freeing
by the server when the lease renewal period expires. This is similar by the server when the lease renewal period expires. This is similar
to the behavior for locks and share reservations. For delegations, to the behavior for locks and share reservations. For delegations,
however, the server may extend the period in which conflicting however, the server may extend the period in which conflicting
requests are held off. Eventually the occurrence of a conflicting requests are held off. Eventually the occurrence of a conflicting
request from another client will cause revocation of the delegation. request from another client will cause revocation of the delegation.
A loss of the backchannel (e.g. by later network configuration A loss of the backchannel (e.g. by later network configuration
change) will have the same effect. A recall request will fail and change) will have the same effect. A recall request will fail and
revocation of the delegation will result. revocation of the delegation will result.
A client normally finds out about revocation of a delegation when it A client normally finds out about revocation of a delegation when it
uses a stateid associated with a delegation and receives the error uses a stateid associated with a delegation and receives the error
NFS4ERR_EXPIRED. It also may find out about delegation revocation NFS4ERR_EXPIRED. It also may find out about delegation revocation
after a client reboot when it attempts to reclaim a delegation and after a client reboot when it attempts to reclaim a delegation and
receives that same error. Note that in the case of a revoked write receives that same error. Note that in the case of a revoked write
open delegation, there are issues because data may have been modified open delegation, there are issues because data may have been modified
by the client whose delegation is revoked and separately by other by the client whose delegation is revoked and separately by other
clients. See the section "Revocation Recovery for Write Open clients. See Section 10.5.1 for a discussion of such issues. Note
Delegation" for a discussion of such issues. Note also that when also that when delegations are revoked, information about the revoked
delegations are revoked, information about the revoked delegation delegation will be written by the server to stable storage (as
will be written by the server to stable storage (as described in the described in Section 13.7). This is done to deal with the case in
section "Crash Recovery"). This is done to deal with the case in
which a server reboots after revoking a delegation but before the which a server reboots after revoking a delegation but before the
client holding the revoked delegation is notified about the client holding the revoked delegation is notified about the
revocation. revocation.
9.3. Data Caching 10.3. Data Caching
When applications share access to a set of files, they need to be When applications share access to a set of files, they need to be
implemented so as to take account of the possibility of conflicting implemented so as to take account of the possibility of conflicting
access by another application. This is true whether the applications access by another application. This is true whether the applications
in question execute on different clients or reside on the same in question execute on different clients or reside on the same
client. client.
Share reservations and record locks are the facilities the NFS Share reservations and record locks are the facilities the NFS
version 4 protocol provides to allow applications to coordinate version 4 protocol provides to allow applications to coordinate
access by providing mutual exclusion facilities. The NFS version 4 access by providing mutual exclusion facilities. The NFS version 4
protocol's data caching must be implemented such that it does not protocol's data caching must be implemented such that it does not
invalidate the assumptions that those using these facilities depend invalidate the assumptions that those using these facilities depend
upon. upon.
9.3.1. Data Caching and OPENs 10.3.1. Data Caching and OPENs
In order to avoid invalidating the sharing assumptions that In order to avoid invalidating the sharing assumptions that
applications rely on, NFS version 4 clients should not provide cached applications rely on, NFS version 4 clients should not provide cached
data to applications or modify it on behalf of an application when it data to applications or modify it on behalf of an application when it
would not be valid to obtain or modify that same data via a READ or would not be valid to obtain or modify that same data via a READ or
WRITE operation. WRITE operation.
Furthermore, in the absence of open delegation (see the section "Open Furthermore, in the absence of open delegation (see Section 10.4),
Delegation") two additional rules apply. Note that these rules are two additional rules apply. Note that these rules are obeyed in
obeyed in practice by many NFS version 2 and version 3 clients. practice by many NFS version 2 and version 3 clients.
o First, cached data present on a client must be revalidated after o First, cached data present on a client must be revalidated after
doing an OPEN. Revalidating means that the client fetches the doing an OPEN. Revalidating means that the client fetches the
change attribute from the server, compares it with the cached change attribute from the server, compares it with the cached
change attribute, and if different, declares the cached data (as change attribute, and if different, declares the cached data (as
well as the cached attributes) as invalid. This is to ensure that well as the cached attributes) as invalid. This is to ensure that
the data for the OPENed file is still correctly reflected in the the data for the OPENed file is still correctly reflected in the
client's cache. This validation must be done at least when the client's cache. This validation must be done at least when the
client's OPEN operation includes DENY=WRITE or BOTH thus client's OPEN operation includes DENY=WRITE or BOTH thus
terminating a period in which other clients may have had the terminating a period in which other clients may have had the
skipping to change at page 171, line 31 skipping to change at page 176, line 9
a file OPENed for write. This is complementary to the first rule. a file OPENed for write. This is complementary to the first rule.
If the data is not flushed at CLOSE, the revalidation done after If the data is not flushed at CLOSE, the revalidation done after
client OPENs a file is unable to achieve its purpose. The other client OPENs a file is unable to achieve its purpose. The other
aspect to flushing the data before close is that the data must be aspect to flushing the data before close is that the data must be
committed to stable storage, at the server, before the CLOSE committed to stable storage, at the server, before the CLOSE
operation is requested by the client. In the case of a server operation is requested by the client. In the case of a server
reboot or restart and a CLOSEd file, it may not be possible to reboot or restart and a CLOSEd file, it may not be possible to
retransmit the data to be written to the file. Hence, this retransmit the data to be written to the file. Hence, this
requirement. requirement.
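The two rules above can be sketched as a small client-side cache model. This is a minimal sketch under stated assumptions, not a client implementation: the `CachedFile` class, its fields, and the `flush_and_commit` callback (standing in for the WRITE and COMMIT traffic to the server) are all hypothetical.

```python
class CachedFile:
    """Per-file client cache state (illustrative)."""

    def __init__(self):
        self.change = None   # change attribute seen when the cache was filled
        self.data = None     # cached file data; None means nothing is cached
        self.dirty = False   # True while modified data has not reached the server


def revalidate_after_open(cache, server_change):
    """First rule: compare change attributes after OPEN.

    Returns True if the cached data is still valid. On a mismatch the cached
    data is declared invalid and the new change attribute is recorded.
    """
    if cache.change != server_change:
        cache.data = None
        cache.change = server_change
        return False
    return True


def close_file(cache, flush_and_commit):
    """Second rule: modified data is flushed and committed before CLOSE."""
    if cache.dirty:
        flush_and_commit(cache.data)
        cache.dirty = False
    return "CLOSE"
```

The ordering in `close_file` is the point of the sketch: the CLOSE is only issued once the modified data has been handed to the server for stable storage.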
9.3.2. Data Caching and File Locking 10.3.2. Data Caching and File Locking
For those applications that choose to use file locking instead of For those applications that choose to use file locking instead of
share reservations to exclude inconsistent file access, there is an share reservations to exclude inconsistent file access, there is an
analogous set of constraints that apply to client side data caching. analogous set of constraints that apply to client side data caching.
These rules are effective only if the file locking is used in a way These rules are effective only if the file locking is used in a way
that matches in an equivalent way the actual READ and WRITE that matches in an equivalent way the actual READ and WRITE
operations executed. This is as opposed to file locking that is operations executed. This is as opposed to file locking that is
based on pure convention. For example, it is possible to manipulate based on pure convention. For example, it is possible to manipulate
a two-megabyte file by dividing the file into two one-megabyte a two-megabyte file by dividing the file into two one-megabyte
regions and protecting access to the two regions by file locks on regions and protecting access to the two regions by file locks on
skipping to change at page 173, line 13 skipping to change at page 177, line 40
unrelated unlock. However, it would not be valid to write the entire unrelated unlock. However, it would not be valid to write the entire
block in which that single written octet was located since it block in which that single written octet was located since it
includes an area that is not locked and might be locked by another includes an area that is not locked and might be locked by another
client. Client implementations can avoid this problem by dividing client. Client implementations can avoid this problem by dividing
files with modified data into those for which all modifications are files with modified data into those for which all modifications are
done to areas covered by an appropriate record lock and those for done to areas covered by an appropriate record lock and those for
which there are modifications not covered by a record lock. Any which there are modifications not covered by a record lock. Any
writes done for the former class of files must not include areas not writes done for the former class of files must not include areas not
locked and thus not modified on the client. locked and thus not modified on the client.
9.3.3. Data Caching and Mandatory File Locking 10.3.3. Data Caching and Mandatory File Locking
Client side data caching needs to respect mandatory file locking when Client side data caching needs to respect mandatory file locking when
it is in effect. The presence of mandatory file locking for a given it is in effect. The presence of mandatory file locking for a given
file is indicated when the client gets back NFS4ERR_LOCKED from a file is indicated when the client gets back NFS4ERR_LOCKED from a
READ or WRITE on a file it has an appropriate share reservation for. READ or WRITE on a file it has an appropriate share reservation for.
When mandatory locking is in effect for a file, the client must check When mandatory locking is in effect for a file, the client must check
for an appropriate file lock for data being read or written. If a for an appropriate file lock for data being read or written. If a
lock exists for the range being read or written, the client may lock exists for the range being read or written, the client may
satisfy the request using the client's validated cache. If an satisfy the request using the client's validated cache. If an
appropriate file lock is not held for the range of the read or write, appropriate file lock is not held for the range of the read or write,
the read or write request must not be satisfied by the client's cache the read or write request must not be satisfied by the client's cache
and the request must be sent to the server for processing. When a and the request must be sent to the server for processing. When a
read or write request partially overlaps a locked region, the request read or write request partially overlaps a locked region, the request
should be subdivided into multiple pieces with each region (locked or should be subdivided into multiple pieces with each region (locked or
not) treated appropriately. not) treated appropriately.
9.3.4. Data Caching and File Identity 10.3.4. Data Caching and File Identity
When clients cache data, the file data needs to be organized When clients cache data, the file data needs to be organized
according to the file system object to which the data belongs. For according to the file system object to which the data belongs. For
NFS version 3 clients, the typical practice has been to assume for NFS version 3 clients, the typical practice has been to assume for
the purpose of caching that distinct filehandles represent distinct the purpose of caching that distinct filehandles represent distinct
file system objects. The client then has the choice to organize and file system objects. The client then has the choice to organize and
maintain the data cache on this basis. maintain the data cache on this basis.
In the NFS version 4 protocol, there is now the possibility to have In the NFS version 4 protocol, there is now the possibility to have
significant deviations from a "one filehandle per object" model significant deviations from a "one filehandle per object" model
skipping to change at page 174, line 35 skipping to change at page 179, line 14
caching) cannot be done reliably. Note that if GETATTR does not caching) cannot be done reliably. Note that if GETATTR does not
return the fileid attribute for both filehandles, it will return return the fileid attribute for both filehandles, it will return
it for neither of the filehandles, since the fsid for both it for neither of the filehandles, since the fsid for both
filehandles is the same. filehandles is the same.
o If GETATTR directed to the two filehandles returns different o If GETATTR directed to the two filehandles returns different
values for the fileid attribute, then they are distinct objects. values for the fileid attribute, then they are distinct objects.
o Otherwise they are the same object. o Otherwise they are the same object.
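The final steps of this comparison can be sketched in Python. The function name and the use of `None` for "fileid not returned" are assumptions made for illustration, and the sketch covers only the fileid steps shown here; it presumes the earlier checks in the list have already been applied and that both filehandles have the same fsid.

```python
from typing import Optional


def compare_fileids(fileid_a: Optional[int], fileid_b: Optional[int]) -> Optional[bool]:
    """Decide whether two filehandles with the same fsid name the same object.

    Returns True for the same object, False for distinct objects, and None
    when the question cannot be answered reliably.
    """
    if fileid_a is None or fileid_b is None:
        return None   # fileid not returned: cannot be determined reliably
    if fileid_a != fileid_b:
        return False  # different fileid values: distinct objects
    return True       # otherwise: the same object
```

A client receiving `None` would have to avoid any operation, such as client-side data caching, that depends on knowing whether the two filehandles share an object.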
9.4. Open Delegation 10.4. Open Delegation
When a file is being OPENed, the server may delegate further handling When a file is being OPENed, the server may delegate further handling
of opens and closes for that file to the opening client. Any such of opens and closes for that file to the opening client. Any such
delegation is recallable, since the circumstances that allowed for delegation is recallable, since the circumstances that allowed for
the delegation are subject to change. In particular, if the server the delegation are subject to change. In particular, if the server
receives a conflicting OPEN from another client, the server must receives a conflicting OPEN from another client, the server must
recall the delegation before deciding whether the OPEN from the other recall the delegation before deciding whether the OPEN from the other
client may be granted. Making a delegation is up to the server and client may be granted. Making a delegation is up to the server and
clients should not assume that any particular OPEN either will or clients should not assume that any particular OPEN either will or
will not result in an open delegation. The following is a typical will not result in an open delegation. The following is a typical
skipping to change at page 175, line 49 skipping to change at page 180, line 28
For a read open delegation, opens that cannot be handled locally For a read open delegation, opens that cannot be handled locally
(opens for write or that deny read access) must be sent to the (opens for write or that deny read access) must be sent to the
server. server.
When an open delegation is made, the response to the OPEN contains an When an open delegation is made, the response to the OPEN contains an
open delegation structure which specifies the following: open delegation structure which specifies the following:
o the type of delegation (read or write) o the type of delegation (read or write)
o space limitation information to control flushing of data on close o space limitation information to control flushing of data on close
(write open delegation only, see the section "Open Delegation and (write open delegation only, see Section 10.4.1)
Data Caching")
o an nfsace4 specifying read and write permissions o an nfsace4 specifying read and write permissions
o a stateid to represent the delegation for READ and WRITE o a stateid to represent the delegation for READ and WRITE
The delegation stateid is separate and distinct from the stateid for The delegation stateid is separate and distinct from the stateid for
the OPEN proper. The standard stateid, unlike the delegation the OPEN proper. The standard stateid, unlike the delegation
stateid, is associated with a particular lock_owner and will continue stateid, is associated with a particular lock_owner and will continue
to be valid after the delegation is recalled and the file remains to be valid after the delegation is recalled and the file remains
open. open.
When a request internal to the client is made to open a file and open When a request internal to the client is made to open a file and open
delegation is in effect, it will be accepted or rejected solely on delegation is in effect, it will be accepted or rejected solely on
the basis of the following conditions. Any requirement for other the basis of the following conditions. Any requirement for other
checks to be made by the delegate should result in open delegation checks to be made by the delegate should result in open delegation
being denied so that the checks can be made by the server itself. being denied so that the checks can be made by the server itself.
o The access and deny bits for the request and the file as described o The access and deny bits for the request and the file as described
in the section "Share Reservations". in Section 9.5.
o The read and write permissions as determined below. o The read and write permissions as determined below.
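The first of these conditions, the access and deny check that the delegate performs locally, can be sketched as a bitmask test. This is a minimal sketch: the function name and the list-of-pairs representation of existing opens are hypothetical, and the constants are illustrative bit values for the share access and share deny modes.

```python
# Illustrative bit values; BOTH is the bitwise OR of READ and WRITE.
ACCESS_READ, ACCESS_WRITE, ACCESS_BOTH = 1, 2, 3
DENY_NONE, DENY_READ, DENY_WRITE, DENY_BOTH = 0, 1, 2, 3


def open_conflicts(req_access, req_deny, existing_opens):
    """Return True if a requested open conflicts with any existing open.

    `existing_opens` is a list of (access, deny) pairs for opens already
    in effect on the file. A conflict exists when the request asks for an
    access mode that an existing open denies, or denies an access mode
    that an existing open already holds.
    """
    for access, deny in existing_opens:
        if (req_access & deny) or (req_deny & access):
            return True
    return False
```

A request that passes this check still has to pass the permission check before the delegate may grant it.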
The nfsace4 passed with delegation can be used to avoid frequent The nfsace4 passed with delegation can be used to avoid frequent
ACCESS calls. The permission check should be as follows: ACCESS calls. The permission check should be as follows:
o If the nfsace4 indicates that the open may be done, then it should o If the nfsace4 indicates that the open may be done, then it should
be granted without reference to the server. be granted without reference to the server.
o If the nfsace4 indicates that the open may not be done, then an o If the nfsace4 indicates that the open may not be done, then an
skipping to change at page 177, line 5 skipping to change at page 181, line 30
The use of delegation together with various other forms of caching The use of delegation together with various other forms of caching
creates the possibility that no server authentication will ever be creates the possibility that no server authentication will ever be
performed for a given user since all of the user's requests might be performed for a given user since all of the user's requests might be
satisfied locally. Where the client is depending on the server for satisfied locally. Where the client is depending on the server for
authentication, the client should be sure authentication occurs for authentication, the client should be sure authentication occurs for
each user by use of the ACCESS operation. This should be the case each user by use of the ACCESS operation. This should be the case
even if an ACCESS operation would not be required otherwise. As even if an ACCESS operation would not be required otherwise. As
mentioned before, the server may enforce frequent authentication by mentioned before, the server may enforce frequent authentication by
returning an nfsace4 denying all access with every open delegation. returning an nfsace4 denying all access with every open delegation.
9.4.1. Open Delegation and Data Caching 10.4.1. Open Delegation and Data Caching
OPEN delegation allows much of the message overhead associated with OPEN delegation allows much of the message overhead associated with
the opening and closing of files to be eliminated. An open when an open the opening and closing of files to be eliminated. An open when an open
delegation is in effect does not require that a validation message be delegation is in effect does not require that a validation message be
sent to the server. The continued endurance of the "read open sent to the server. The continued endurance of the "read open
delegation" provides a guarantee that no OPEN for write and thus no delegation" provides a guarantee that no OPEN for write and thus no
write has occurred. Similarly, when closing a file opened for write write has occurred. Similarly, when closing a file opened for write
and if write open delegation is in effect, the data written does not and if write open delegation is in effect, the data written does not
have to be flushed to the server until the open delegation is have to be flushed to the server until the open delegation is
recalled. The continued endurance of the open delegation provides a recalled. The continued endurance of the open delegation provides a
skipping to change at page 178, line 20 skipping to change at page 182, line 45
With respect to authentication, flushing modified data to the server With respect to authentication, flushing modified data to the server
after a CLOSE has occurred may be problematic. For example, the user after a CLOSE has occurred may be problematic. For example, the user
of the application may have logged off the client and unexpired of the application may have logged off the client and unexpired
authentication credentials may not be present. In this case, the authentication credentials may not be present. In this case, the
client may need to take special care to ensure that local unexpired client may need to take special care to ensure that local unexpired
credentials will in fact be available. This may be accomplished by credentials will in fact be available. This may be accomplished by
tracking the expiration time of credentials and flushing data well in tracking the expiration time of credentials and flushing data well in
advance of their expiration or by making private copies of advance of their expiration or by making private copies of
credentials to assure their availability when needed. credentials to assure their availability when needed.
9.4.2. Open Delegation and File Locks 10.4.2. Open Delegation and File Locks
When a client holds a write open delegation, lock operations are When a client holds a write open delegation, lock operations are
performed locally. This includes those required for mandatory file performed locally. This includes those required for mandatory file
locking. This can be done since the delegation implies that there locking. This can be done since the delegation implies that there
can be no conflicting locks. Similarly, all of the revalidations can be no conflicting locks. Similarly, all of the revalidations
that would normally be associated with obtaining locks and the that would normally be associated with obtaining locks and the
flushing of data associated with the releasing of locks need not be flushing of data associated with the releasing of locks need not be
done. done.
When a client holds a read open delegation, lock operations are not When a client holds a read open delegation, lock operations are not
performed locally. All lock operations, including those requesting performed locally. All lock operations, including those requesting
non-exclusive locks, are sent to the server for resolution. non-exclusive locks, are sent to the server for resolution.
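The two paragraphs above amount to a simple routing rule for lock requests. The following is a minimal sketch of that rule only; the names Delegation and lock_handled_locally are hypothetical and do not appear in the protocol.

```python
from enum import Enum


class Delegation(Enum):
    """Kind of open delegation the client currently holds (illustrative names)."""
    NONE = "none"
    READ = "read"
    WRITE = "write"


def lock_handled_locally(delegation: Delegation) -> bool:
    """Return True when a lock request can be resolved on the client.

    Only a write open delegation implies that no conflicting locks can
    exist, so only then may the client resolve locks itself. With a read
    open delegation (or none), every lock request, including requests for
    non-exclusive locks, goes to the server.
    """
    return delegation is Delegation.WRITE
```

A client might consult such a predicate before deciding whether to send a LOCK operation or to record the lock in local state.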
9.4.3. Handling of CB_GETATTR 10.4.3. Handling of CB_GETATTR
The server needs to employ special handling for a GETATTR where the The server needs to employ special handling for a GETATTR where the
target is a file that has a write open delegation in effect. The target is a file that has a write open delegation in effect. The
reason for this is that the client holding the write delegation may reason for this is that the client holding the write delegation may
have modified the data and the server needs to reflect this change to have modified the data and the server needs to reflect this change to
the second client that submitted the GETATTR. Therefore, the client the second client that submitted the GETATTR. Therefore, the client
holding the write delegation needs to be interrogated. The server holding the write delegation needs to be interrogated. The server
will use the CB_GETATTR operation. The only attributes that the will use the CB_GETATTR operation. The only attributes that the
server can reliably query via CB_GETATTR are size and change. server can reliably query via CB_GETATTR are size and change.
skipping to change at page 181, line 38 skipping to change at page 186, line 9
CB_GETATTR and responds to the second client as in the last step. CB_GETATTR and responds to the second client as in the last step.
This methodology resolves issues of clock differences between client This methodology resolves issues of clock differences between client
and server and other scenarios where the use of CB_GETATTR breaks and server and other scenarios where the use of CB_GETATTR breaks
down. down.
It should be noted that the server is under no obligation to use It should be noted that the server is under no obligation to use
CB_GETATTR and therefore the server MAY simply recall the delegation CB_GETATTR and therefore the server MAY simply recall the delegation
to avoid its use. to avoid its use.
9.4.4. Recall of Open Delegation 10.4.4. Recall of Open Delegation
The following events necessitate recall of an open delegation: The following events necessitate recall of an open delegation:
o Potentially conflicting OPEN request (or READ/WRITE done with o Potentially conflicting OPEN request (or READ/WRITE done with
"special" stateid) "special" stateid)
o SETATTR issued by another client o SETATTR issued by another client
o REMOVE request for the file o REMOVE request for the file
skipping to change at page 182, line 31 skipping to change at page 186, line 51
no previous CLOSE operation has been sent to the server, a CLOSE no previous CLOSE operation has been sent to the server, a CLOSE
operation must be sent to the server. operation must be sent to the server.
o If a file has other open references at the client, then OPEN o If a file has other open references at the client, then OPEN
operations must be sent to the server. The appropriate stateids operations must be sent to the server. The appropriate stateids
will be provided by the server for subsequent use by the client will be provided by the server for subsequent use by the client
since the delegation stateid will no longer be valid. These OPEN since the delegation stateid will no longer be valid. These OPEN
requests are done with the claim type of CLAIM_DELEGATE_CUR. This requests are done with the claim type of CLAIM_DELEGATE_CUR. This
will allow the presentation of the delegation stateid so that the will allow the presentation of the delegation stateid so that the
client can establish the appropriate rights to perform the OPEN. client can establish the appropriate rights to perform the OPEN.
(see the section "Operation 18: OPEN" for details.) (see Section 18.16, which describes the OPEN operation, for
details.)
o If there are granted file locks, the corresponding LOCK operations o If there are granted file locks, the corresponding LOCK operations
need to be performed. This applies to the write open delegation need to be performed. This applies to the write open delegation
case only. case only.
o For a write open delegation, if at the time of recall the file is o For a write open delegation, if at the time of recall the file is
not open for write, all modified data for the file must be flushed not open for write, all modified data for the file must be flushed
to the server. If the delegation had not existed, the client to the server. If the delegation had not existed, the client
would have done this data flush before the CLOSE operation. would have done this data flush before the CLOSE operation.
o For a write open delegation when a file is still open at the time o For a write open delegation when a file is still open at the time
skipping to change at page 183, line 31 skipping to change at page 188, line 5
except as part of delegation return. Only in the case of closing the except as part of delegation return. Only in the case of closing the
open that resulted in obtaining the delegation would clients be open that resulted in obtaining the delegation would clients be
likely to do this early, since, in that case, the close once done likely to do this early, since, in that case, the close once done
will not be undone. Regardless of the client's choices on scheduling will not be undone. Regardless of the client's choices on scheduling
these actions, all must be performed before the delegation is these actions, all must be performed before the delegation is
returned, including (when applicable) the close that corresponds to returned, including (when applicable) the close that corresponds to
the open that resulted in the delegation. These actions can be the open that resulted in the delegation. These actions can be
performed either in previous requests or in previous operations in performed either in previous requests or in previous operations in
the same COMPOUND request. the same COMPOUND request.
9.4.5. Clients that Fail to Honor Delegation Recalls 10.4.5. Clients that Fail to Honor Delegation Recalls
A client may fail to respond to a recall for various reasons, such as A client may fail to respond to a recall for various reasons, such as
a failure of the backchannel from server to the client. The client a failure of the backchannel from server to the client. The client
may be unaware of a failure in the backchannel. This lack of may be unaware of a failure in the backchannel. This lack of
awareness could result in the client finding out long after the awareness could result in the client finding out long after the
failure that its delegation has been revoked, and another client has failure that its delegation has been revoked, and another client has
modified the data for which the client had a delegation. This is modified the data for which the client had a delegation. This is
especially a problem for the client that held a write delegation. especially a problem for the client that held a write delegation.
The server also has a dilemma in that the client that fails to The server also has a dilemma in that the client that fails to
skipping to change at page 184, line 28 skipping to change at page 189, line 5
of time after the server attempted to recall the delegation. of time after the server attempted to recall the delegation.
This period of time MUST NOT be less than the value of the This period of time MUST NOT be less than the value of the
lease_time attribute. lease_time attribute.
o When the client holds a delegation, it cannot rely on operations o When the client holds a delegation, it cannot rely on operations
that take a stateid to renew delegation leases across backchannel that take a stateid to renew delegation leases across backchannel
failures. The client that wants to keep delegations in force failures. The client that wants to keep delegations in force
across backchannel failures must use SEQUENCE to do so and check across backchannel failures must use SEQUENCE to do so and check
the sr_status_flags for the SEQ4_STATUS_CB_PATH_DOWN status. the sr_status_flags for the SEQ4_STATUS_CB_PATH_DOWN status.
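The check described in the last item can be sketched as a bit test on the flags returned by SEQUENCE. This is an illustration only: the numeric value given to SEQ4_STATUS_CB_PATH_DOWN is an assumption and should be taken from the protocol's XDR definition, and the function name backchannel_down is hypothetical.

```python
# Assumed bit value for illustration; consult the XDR definition of
# the SEQUENCE status flags for the authoritative constant.
SEQ4_STATUS_CB_PATH_DOWN = 0x00000001


def backchannel_down(sr_status_flags: int) -> bool:
    """Return True when the SEQUENCE reply reports a backchannel failure.

    A client holding delegations would treat a True result as a sign that
    recalls may not be reaching it and that its delegations are at risk.
    """
    return bool(sr_status_flags & SEQ4_STATUS_CB_PATH_DOWN)
```

A client that sees this flag set would typically try to restore the backchannel, or return its delegations, before the server revokes them.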
9.4.6. Delegation Revocation 10.4.6. Delegation Revocation
At the point a delegation is revoked, if there are associated opens At the point a delegation is revoked, if there are associated opens
on the client, the applications holding these opens need to be on the client, the applications holding these opens need to be
notified. This notification usually occurs by returning errors for notified. This notification usually occurs by returning errors for
READ/WRITE operations or when a close is attempted for the open file. READ/WRITE operations or when a close is attempted for the open file.
If no opens exist for the file at the point the delegation is If no opens exist for the file at the point the delegation is
revoked, then notification of the revocation is unnecessary. revoked, then notification of the revocation is unnecessary.
However, if there is modified data present at the client for the However, if there is modified data present at the client for the
file, the user of the application should be notified. Unfortunately, file, the user of the application should be notified. Unfortunately,
it may not be possible to notify the user since active applications it may not be possible to notify the user since active applications
may not be present at the client. See the section "Revocation may not be present at the client. See Section 10.5.1 for additional
Recovery for Write Open Delegation" for additional details. details.
9.5. Data Caching and Revocation 10.4.7. Delegations via WANT_DELEGATION
In addition to providing delegations as part of the response to OPEN
operations, servers may optionally provide delegations separate from
open, via the WANT_DELEGATION operation. This allows delegations to
be obtained in advance of an OPEN that might benefit from them, or to
deal with cases in which a delegation has been recalled and the
client wants to make an attempt to re-establish it if the absence of
use by other clients allows that.
When a delegation is obtained using WANT_DELEGATION, any open files
for the same filehandle held by that client are to be treated as
subordinate to the delegation, just as if they had been created using
an OPEN of type CLAIM_DELEGATE_CUR.
10.5. Data Caching and Revocation
When locks and delegations are revoked, the assumptions upon which When locks and delegations are revoked, the assumptions upon which
successful caching depend are no longer guaranteed. For any locks or successful caching depend are no longer guaranteed. For any locks or
share reservations that have been revoked, the corresponding owner share reservations that have been revoked, the corresponding owner
needs to be notified. This notification includes applications with a needs to be notified. This notification includes applications with a
file open that has a corresponding delegation which has been revoked. file open that has a corresponding delegation which has been revoked.
Cached data associated with the revocation must be removed from the Cached data associated with the revocation must be removed from the
client. In the case of modified data existing in the client's cache, client. In the case of modified data existing in the client's cache,
that data must be removed from the client without it being written to that data must be removed from the client without it being written to
the server. As mentioned, the assumptions made by the client are no the server. As mentioned, the assumptions made by the client are no
skipping to change at page 185, line 24 skipping to change at page 190, line 17
open file or on the close. Where the methods available to a client open file or on the close. Where the methods available to a client
make such notification impossible because errors for certain make such notification impossible because errors for certain
operations may not be returned, more drastic action such as signals operations may not be returned, more drastic action such as signals
or process termination may be appropriate. The justification for or process termination may be appropriate. The justification for
this is that an invariant on which an application depends may be this is that an invariant on which an application depends may be
violated. Depending on how errors are typically treated for the violated. Depending on how errors are typically treated for the
client operating environment, further levels of notification client operating environment, further levels of notification
including logging, console messages, and GUI pop-ups may be including logging, console messages, and GUI pop-ups may be
appropriate. appropriate.
9.5.1. Revocation Recovery for Write Open Delegation 10.5.1. Revocation Recovery for Write Open Delegation
Revocation recovery for a write open delegation poses the special Revocation recovery for a write open delegation poses the special
issue of modified data in the client cache while the file is not issue of modified data in the client cache while the file is not
open. In this situation, any client which does not flush modified open. In this situation, any client which does not flush modified
data to the server on each close must ensure that the user receives data to the server on each close must ensure that the user receives
appropriate notification of the failure as a result of the appropriate notification of the failure as a result of the
revocation. Since such situations may require human action to revocation. Since such situations may require human action to
correct problems, notification schemes in which the appropriate user correct problems, notification schemes in which the appropriate user
or administrator is notified may be necessary. Logging and console or administrator is notified may be necessary. Logging and console
messages are typical examples. messages are typical examples.
skipping to change at page 186, line 9 skipping to change at page 191, line 5
contents in these situations or mark the results specially to warn contents in these situations or mark the results specially to warn
users of possible problems. users of possible problems.
Saving of such modified data in delegation revocation situations may Saving of such modified data in delegation revocation situations may
be limited to files of a certain size or might be used only when be limited to files of a certain size or might be used only when
sufficient disk space is available within the target file system. sufficient disk space is available within the target file system.
Such saving may also be restricted to situations when the client has Such saving may also be restricted to situations when the client has
sufficient buffering resources to keep the cached copy available sufficient buffering resources to keep the cached copy available
until it is properly stored to the target file system. until it is properly stored to the target file system.
9.6. Attribute Caching 10.6. Attribute Caching
The attributes discussed in this section do not include named The attributes discussed in this section do not include named
attributes. Individual named attributes are analogous to files and attributes. Individual named attributes are analogous to files and
caching of the data for these needs to be handled just as data caching of the data for these needs to be handled just as data
caching is for ordinary files. Similarly, LOOKUP results from an caching is for ordinary files. Similarly, LOOKUP results from an
OPENATTR directory are to be cached on the same basis as any other OPENATTR directory are to be cached on the same basis as any other
pathnames and similarly for directory contents. pathnames and similarly for directory contents.
Clients may cache file attributes obtained from the server and use Clients may cache file attributes obtained from the server and use
them to avoid subsequent GETATTR requests. Such caching is write them to avoid subsequent GETATTR requests. Such caching is write
skipping to change at page 188, line 8 skipping to change at page 193, line 5
client will either eventually have to write the access time to the client will either eventually have to write the access time to the
server with bad performance effects, or it would never update the server with bad performance effects, or it would never update the
server's time_access, thereby resulting in a situation where an server's time_access, thereby resulting in a situation where an
application that caches access time between a close and open of the application that caches access time between a close and open of the
same file observes the access time oscillating between the past and same file observes the access time oscillating between the past and
present. The time_access attribute always means the time of last present. The time_access attribute always means the time of last
access to a file by a read that was satisfied by the server. This access to a file by a read that was satisfied by the server. This
way clients will tend to see only time_access changes that go forward way clients will tend to see only time_access changes that go forward
in time. in time.
9.7. Data and Metadata Caching and Memory Mapped Files 10.7. Data and Metadata Caching and Memory Mapped Files
Some operating environments include the capability for an application Some operating environments include the capability for an application
to map a file's content into the application's address space. Each to map a file's content into the application's address space. Each
time the application accesses a memory location that corresponds to a time the application accesses a memory location that corresponds to a
block that has not been loaded into the address space, a page fault block that has not been loaded into the address space, a page fault
occurs and the file is read (or if the block does not exist in the occurs and the file is read (or if the block does not exist in the
file, the block is allocated and then instantiated in the file, the block is allocated and then instantiated in the
application's address space). application's address space).
As long as each memory mapped access to the file requires a page As long as each memory mapped access to the file requires a page
skipping to change at page 190, line 16 skipping to change at page 195, line 13
are record locks for. are record locks for.
o Clients and servers MAY deny a record lock on a file they know is o Clients and servers MAY deny a record lock on a file they know is
memory mapped. memory mapped.
o A client MAY deny memory mapping a file that it knows requires o A client MAY deny memory mapping a file that it knows requires
mandatory locking for I/O. If mandatory locking is enabled after mandatory locking for I/O. If mandatory locking is enabled after
the file is opened and mapped, the client MAY deny the application the file is opened and mapped, the client MAY deny the application
further access to its mapped file. further access to its mapped file.
9.8. Name Caching 10.8. Name Caching
The results of LOOKUP and READDIR operations may be cached to avoid The results of LOOKUP and READDIR operations may be cached to avoid
the cost of subsequent LOOKUP operations. Just as in the case of the cost of subsequent LOOKUP operations. Just as in the case of
attribute caching, inconsistencies may arise among the various client attribute caching, inconsistencies may arise among the various client
caches. To mitigate the effects of these inconsistencies and given caches. To mitigate the effects of these inconsistencies and given
the context of typical file system APIs, an upper time boundary is the context of typical file system APIs, an upper time boundary is
maintained on how long a client name cache entry can be kept without maintained on how long a client name cache entry can be kept without
verifying that the entry has not been made invalid by a directory verifying that the entry has not been made invalid by a directory
change operation performed by another client. When a client is change operation performed by another client. When a client is
not making changes to a directory for which there exist name cache not making changes to a directory for which there exist name cache
skipping to change at page 191, line 16 skipping to change at page 196, line 12
directories when the contents of the corresponding directory are directories when the contents of the corresponding directory are
modified. For a client to use the change_info4 information modified. For a client to use the change_info4 information
appropriately and correctly, the server must report the pre and post appropriately and correctly, the server must report the pre and post
operation change attribute values atomically. When the server is operation change attribute values atomically. When the server is
unable to report the before and after values atomically with respect unable to report the before and after values atomically with respect
to the directory operation, the server must indicate that fact in the to the directory operation, the server must indicate that fact in the
change_info4 return value. When the information is not atomically change_info4 return value. When the information is not atomically
reported, the client should not assume that other clients have not reported, the client should not assume that other clients have not
changed the directory. changed the directory.
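The rule above for using change_info4 can be sketched as follows. This is a minimal illustration under stated assumptions: the class ChangeInfo and its fields atomic, before and after are modeled on the description in the text, and update_name_cache is a hypothetical helper, not part of the protocol.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ChangeInfo:
    """Illustrative stand-in for a change_info4 result."""
    atomic: bool   # True if before/after were obtained atomically
    before: int    # directory change attribute before the operation
    after: int     # directory change attribute after the operation


def update_name_cache(cached_change: int,
                      cinfo: ChangeInfo) -> Tuple[Optional[int], bool]:
    """Decide whether cached names for a directory may be kept.

    Returns (new_cached_change, keep_entries). The cache is kept only when
    the values were reported atomically and the pre-operation value matches
    what the client had cached, which indicates that no other client
    changed the directory in between. Otherwise the cached change value is
    dropped (None) and the entries must be revalidated.
    """
    if cinfo.atomic and cached_change == cinfo.before:
        return cinfo.after, True
    return None, False
```

For example, a client that cached change value 5 and receives an atomic result with before 5 and after 6 keeps its entries and records 6; any other outcome leads to revalidation.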
9.9. Directory Caching 10.9. Directory Caching
The results of READDIR operations may be used to avoid subsequent The results of READDIR operations may be used to avoid subsequent
READDIR operations. Just as in the cases of attribute and name READDIR operations. Just as in the cases of attribute and name
caching, inconsistencies may arise among the various client caches. caching, inconsistencies may arise among the various client caches.
To mitigate the effects of these inconsistencies, and given the To mitigate the effects of these inconsistencies, and given the
context of typical file system APIs, the following rules should be context of typical file system APIs, the following rules should be
followed: followed:
o Cached READDIR information for a directory which is not obtained o Cached READDIR information for a directory which is not obtained
in a single READDIR operation must always be a consistent snapshot in a single READDIR operation must always be a consistent snapshot
skipping to change at page 192, line 10 skipping to change at page 197, line 7
directories when the contents of the corresponding directory are directories when the contents of the corresponding directory are
modified. For a client to use the change_info4 information modified. For a client to use the change_info4 information
appropriately and correctly, the server must report the pre and post appropriately and correctly, the server must report the pre and post
operation change attribute values atomically. When the server is operation change attribute values atomically. When the server is
unable to report the before and after values atomically with respect unable to report the before and after values atomically with respect
to the directory operation, the server must indicate that fact in the to the directory operation, the server must indicate that fact in the
change_info4 return value. When the information is not atomically change_info4 return value. When the information is not atomically
reported, the client should not assume that other clients have not reported, the client should not assume that other clients have not
changed the directory. changed the directory.
10. Multi-Server Name Space 11. Multi-Server Namespace
NFSv4.1 supports attributes that allow a namespace to extend beyond NFSv4.1 supports attributes that allow a namespace to extend beyond
the boundaries of a single server. Use of such multi-server the boundaries of a single server. Use of such multi-server
namespaces is optional, and for many purposes, single-server namespaces is optional, and for many purposes, single-server
namespaces are perfectly acceptable. Use of multi-server namespaces namespaces are perfectly acceptable. Use of multi-server namespaces
can provide many advantages, however, by separating a file system's can provide many advantages, however, by separating a file system's
logical position in a name space from the (possibly changing) logical position in a name space from the (possibly changing)
logistical and administrative considerations that result in logistical and administrative considerations that result in
particular file systems being located on particular servers. particular file systems being located on particular servers.
10.1. Location attributes 11.1. Location attributes
NFSv4 contains recommended attributes that allow file systems on one NFSv4 contains recommended attributes that allow file systems on one
server to be associated with one or more instances of that file server to be associated with one or more instances of that file
system on other servers. These attributes specify such file systems system on other servers. These attributes specify such file systems
by specifying a server name (either a DNS name or an IP address) by specifying a server name (either a DNS name or an IP address)
together with the path of that file system within that server's together with the path of that file system within that server's
single-server name space. single-server name space.
The fs_locations_info recommended attribute allows specification of The fs_locations_info recommended attribute allows specification of
one or more file system instance locations where the data corresponding one or more file system instance locations where the data corresponding
skipping to change at page 192, line 46 skipping to change at page 197, line 43
as well as information to help the client efficiently effect as as well as information to help the client efficiently effect as
seamless a transition as possible among multiple file system seamless a transition as possible among multiple file system
instances, when and if that should be necessary. instances, when and if that should be necessary.
The fs_locations recommended attribute is inherited from NFSv4.0 and The fs_locations recommended attribute is inherited from NFSv4.0 and
only allows specification of the file system locations where the data only allows specification of the file system locations where the data
corresponding to a given file system may be found. Servers should corresponding to a given file system may be found. Servers should
make this attribute available whenever fs_locations_info is make this attribute available whenever fs_locations_info is
supported, but client use of fs_locations_info is to be preferred. supported, but client use of fs_locations_info is to be preferred.
10.2. File System Presence or Absence 11.2. File System Presence or Absence
A given location in an NFSv4 namespace (typically but not necessarily A given location in an NFSv4 namespace (typically but not necessarily
a multi-server namespace) can have a number of file system instance a multi-server namespace) can have a number of file system instance
locations associated with it (via the fs_locations or locations associated with it (via the fs_locations or
fs_locations_info attribute). There may also be an actual current fs_locations_info attribute). There may also be an actual current
file system at that location, accessible via normal namespace file system at that location, accessible via normal namespace
operations (e.g. LOOKUP). In this case, the file system is said to operations (e.g. LOOKUP). In this case, the file system is said to
be "present" at that position in the namespace and clients will be "present" at that position in the namespace and clients will
typically use it, reserving use of additional locations specified via typically use it, reserving use of additional locations specified via
the location-related attributes to situations in which the principal the location-related attributes to situations in which the principal
skipping to change at page 194, line 8 skipping to change at page 199, line 5
being within an absent file system happens at the start of every being within an absent file system happens at the start of every
operation, operations which change the current filehandle so that it operation, operations which change the current filehandle so that it
is within an absent file system will not result in an error. This is within an absent file system will not result in an error. This
allows such combinations as PUTFH-GETATTR and LOOKUP-GETATTR to be allows such combinations as PUTFH-GETATTR and LOOKUP-GETATTR to be
used to get attribute information, particularly location attribute used to get attribute information, particularly location attribute
information, as discussed below. information, as discussed below.
The recommended file system attribute fs_absent can be used to The recommended file system attribute fs_absent can be used to
interrogate the present/absent status of a given file system. interrogate the present/absent status of a given file system.
10.3. Getting Attributes for an Absent File System 11.3. Getting Attributes for an Absent File System
When a file system is absent, most attributes are not available, but When a file system is absent, most attributes are not available, but
it is necessary to allow the client access to the small set of it is necessary to allow the client access to the small set of
attributes that are available, and most particularly those that give attributes that are available, and most particularly those that give
information about the correct current locations for this file system, information about the correct current locations for this file system,
fs_locations and fs_locations_info. fs_locations and fs_locations_info.
10.3.1. GETATTR Within an Absent File System 11.3.1. GETATTR Within an Absent File System
As mentioned above, an exception is made for GETATTR in that As mentioned above, an exception is made for GETATTR in that
attributes may be obtained for a filehandle within an absent file attributes may be obtained for a filehandle within an absent file
system. This exception only applies if the attribute mask contains system. This exception only applies if the attribute mask contains
at least one attribute bit that indicates the client is interested in at least one attribute bit that indicates the client is interested in
a result regarding an absent file system: fs_locations, a result regarding an absent file system: fs_locations,
fs_locations_info, or fs_absent. If none of these attributes is fs_locations_info, or fs_absent. If none of these attributes is
requested, GETATTR will result in an NFS4ERR_MOVED error. requested, GETATTR will result in an NFS4ERR_MOVED error.
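The GETATTR rule above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the attribute mask is modelled as a collection of attribute names; the function name and the string status values are hypothetical and are not part of the protocol definition.

```python
# Attributes whose presence in the mask signals interest in an absent
# file system, as listed in the paragraph above.
ABSENT_FS_ATTRS = frozenset({"fs_locations", "fs_locations_info", "fs_absent"})


def getattr_absent_status(requested_attrs):
    """Status of a GETATTR on a filehandle within an absent file system.

    requested_attrs is a hypothetical stand-in for the attribute bitmask,
    modelled here as an iterable of attribute names.
    """
    # No location-related attribute requested: the exception does not apply.
    if ABSENT_FS_ATTRS.isdisjoint(requested_attrs):
        return "NFS4ERR_MOVED"
    # At least one of the three attributes was requested: GETATTR proceeds.
    return "NFS4_OK"
```

For example, a mask naming only size and type yields NFS4ERR_MOVED, while adding fs_locations to the same mask lets the request proceed.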
When a GETATTR is done on an absent file system, the set of supported When a GETATTR is done on an absent file system, the set of supported
skipping to change at page 195, line 19 skipping to change at page 200, line 16
attributes supported with the results. attributes supported with the results.
Handling of VERIFY/NVERIFY is similar to GETATTR in that if the Handling of VERIFY/NVERIFY is similar to GETATTR in that if the
attribute mask does not include fs_locations, fs_locations_info, or attribute mask does not include fs_locations, fs_locations_info, or
fs_absent, the error NFS4ERR_MOVED will result. It differs in that fs_absent, the error NFS4ERR_MOVED will result. It differs in that
any appearance in the attribute mask of an attribute not supported any appearance in the attribute mask of an attribute not supported
for an absent file system (and note that this will include some for an absent file system (and note that this will include some
normally mandatory attributes), will also cause an NFS4ERR_MOVED normally mandatory attributes), will also cause an NFS4ERR_MOVED
result. result.
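A minimal sketch of the VERIFY/NVERIFY rule follows. The set of attributes supported for an absent file system is not enumerated in the text shown here, so it is passed in as a parameter; that parameter, the function name, and the modelling of the mask as a set of names are all assumptions made for illustration.

```python
LOCATION_ATTRS = frozenset({"fs_locations", "fs_locations_info", "fs_absent"})


def verify_absent_is_moved(requested_attrs, supported_when_absent):
    """Return True if VERIFY/NVERIFY on an absent file system gives NFS4ERR_MOVED.

    supported_when_absent is a hypothetical, caller-supplied set naming the
    attributes the server supports for an absent file system.
    """
    requested = set(requested_attrs)
    # Same first check as GETATTR: a location-related attribute is required.
    if requested.isdisjoint(LOCATION_ATTRS):
        return True
    # Unlike GETATTR, any attribute unsupported for an absent file system
    # (even a normally mandatory one) also produces NFS4ERR_MOVED.
    return not requested <= set(supported_when_absent)
```

The second check is the point of difference from GETATTR: a mask that mixes fs_locations with an attribute unavailable for an absent file system still fails.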
10.3.2. READDIR and Absent File Systems 11.3.2. READDIR and Absent File Systems
A READDIR performed when the current filehandle is within an absent A READDIR performed when the current filehandle is within an absent
file system will result in an NFS4ERR_MOVED error, since, unlike the file system will result in an NFS4ERR_MOVED error, since, unlike the
case of GETATTR, no such exception is made for READDIR. case of GETATTR, no such exception is made for READDIR.
Attributes for an absent file system may be fetched via a READDIR for Attributes for an absent file system may be fetched via a READDIR for
a directory in a present file system, when that directory contains a directory in a present file system, when that directory contains
the root directories of one or more absent file systems. In this the root directories of one or more absent file systems. In this
case, the handling is as follows: case, the handling is as follows:
skipping to change at page 196, line 5 skipping to change at page 201, line 5
rdattr_error then the occurrence of the root of an absent file rdattr_error then the occurrence of the root of an absent file
system within the directory will result in the READDIR failing system within the directory will result in the READDIR failing
with an NFS4ERR_MOVED error. with an NFS4ERR_MOVED error.
o The unavailability of an attribute because of a file system's o The unavailability of an attribute because of a file system's
absence, even one that is ordinarily mandatory, does not result in absence, even one that is ordinarily mandatory, does not result in
any error indication. The set of attributes returned for the root any error indication. The set of attributes returned for the root
directory of the absent file system in that case is simply directory of the absent file system in that case is simply
restricted to those actually available. restricted to those actually available.
10.4. Uses of Location Information 11.4. Uses of Location Information
The location-bearing attributes (fs_locations and fs_locations_info) The location-bearing attributes (fs_locations and fs_locations_info)
provide, together with the possibility of absent file systems, a provide, together with the possibility of absent file systems, a
number of important facilities in providing reliable, manageable, and number of important facilities in providing reliable, manageable, and
scalable data access. scalable data access.
When a file system is present, these attributes can provide When a file system is present, these attributes can provide
alternative locations, to be used to access the same data, in the alternative locations, to be used to access the same data, in the
event that server failures, communications problems, or other event that server failures, communications problems, or other
difficulties, make continued access to the current file system difficulties, make continued access to the current file system
skipping to change at page 196, line 42 skipping to change at page 201, line 42
that there are cases in which this term can be used, like that there are cases in which this term can be used, like
"replication", when there is no actual data migration per se. "replication", when there is no actual data migration per se.
Where a file system was not previously present, specification of file Where a file system was not previously present, specification of file
system location provides a means by which file systems located on one system location provides a means by which file systems located on one
server can be associated with a name space defined by another server, server can be associated with a name space defined by another server,
thus allowing a general multi-server namespace facility. Designation thus allowing a general multi-server namespace facility. Designation
of such a location, in place of an absent file system, is called of such a location, in place of an absent file system, is called
"referral". "referral".
10.4.1. File System Replication 11.4.1. File System Replication
The fs_locations and fs_locations_info attributes provide alternative The fs_locations and fs_locations_info attributes provide alternative
locations, to be used to access data in place of or in addition to locations, to be used to access data in place of or in addition to
the current file system instance. On first access to a file system, the current file system instance. On first access to a file system,
the client should obtain the value of the set of alternate locations by the client should obtain the value of the set of alternate locations by
interrogating the fs_locations or fs_locations_info attribute, with interrogating the fs_locations or fs_locations_info attribute, with
the latter being preferred. the latter being preferred.
In the event that server failures, communications problems, or other In the event that server failures, communications problems, or other
difficulties, make continued access to the current file system difficulties, make continued access to the current file system
skipping to change at page 197, line 50 skipping to change at page 202, line 50
or the visibility of that change on any of the associated replicas. or the visibility of that change on any of the associated replicas.
Where a file system is not writable but represents a read-only copy Where a file system is not writable but represents a read-only copy
(possibly periodically updated) of a writable file system, similar (possibly periodically updated) of a writable file system, similar
requirements apply to the propagation of updates. It must be requirements apply to the propagation of updates. It must be
guaranteed that any change visible on the original file system guaranteed that any change visible on the original file system
instance is already visible on any replica before the client instance is already visible on any replica before the client
transitions access to that replica, to avoid any possibility that a transitions access to that replica, to avoid any possibility that a
client, in effecting a transition to a replica, will see any reversion client, in effecting a transition to a replica, will see any reversion
in file system state. The specific means by which this will be in file system state. The specific means by which this will be
prevented varies based on fs4_status_type reported as part of the prevented varies based on fs4_status_type reported as part of the
fs_status attribute. (See Section 10.11). fs_status attribute. (See Section 11.11).
10.4.2. File System Migration 11.4.2. File System Migration
When a file system is present and becomes absent, clients can be When a file system is present and becomes absent, clients can be
given the opportunity to have continued access to their data, at an given the opportunity to have continued access to their data, at an
alternate location, as specified by the fs_locations or alternate location, as specified by the fs_locations or
fs_locations_info attribute. Typically, a client will be accessing fs_locations_info attribute. Typically, a client will be accessing
the file system in question, get an NFS4ERR_MOVED error, and then use the file system in question, get an NFS4ERR_MOVED error, and then use
the fs_locations or fs_locations_info attribute to determine the new the fs_locations or fs_locations_info attribute to determine the new
location of the data. When fs_locations_info is used, additional location of the data. When fs_locations_info is used, additional
information will be available which will define the nature of the information will be available which will define the nature of the
client's handling of the transition to a new server. client's handling of the transition to a new server.
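The typical migration sequence described above (access, NFS4ERR_MOVED, location lookup, retry) can be sketched as follows. This is an illustrative outline only: do_access and fetch_locations are hypothetical caller-supplied callbacks standing in for real protocol exchanges, and the sketch ignores the additional transition information that fs_locations_info can provide.

```python
def access_with_migration(server, do_access, fetch_locations):
    """Attempt an access and follow location information on NFS4ERR_MOVED.

    do_access(server) -> (status, result) and fetch_locations(server) -> list
    of candidate servers are hypothetical callbacks; fetch_locations models
    reading fs_locations_info (preferred) or fs_locations.
    Returns (server_used, status, result).
    """
    status, result = do_access(server)
    if status != "NFS4ERR_MOVED":
        return server, status, result
    # The file system has become absent here: try each advertised location.
    for candidate in fetch_locations(server):
        status, result = do_access(candidate)
        if status != "NFS4ERR_MOVED":
            return candidate, status, result
    # No advertised location was usable.
    return server, "NFS4ERR_MOVED", None
```

A real client would also have to deal with state, filehandles, and verifiers across the transition; the sketch covers only the choice of a new location.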
skipping to change at page 199, line 12 skipping to change at page 204, line 12
degree indicated by the fs_locations_info attribute). Where file degree indicated by the fs_locations_info attribute). Where file
systems are writable, a change made on the original file system must systems are writable, a change made on the original file system must
be visible on all migration targets. Where a file system is not be visible on all migration targets. Where a file system is not
writable but represents a read-only copy (possibly periodically writable but represents a read-only copy (possibly periodically
updated) of a writable file system, similar requirements apply to the updated) of a writable file system, similar requirements apply to the
propagation of updates. Any change visible in the original file propagation of updates. Any change visible in the original file
system must already be effected on all migration targets, to avoid system must already be effected on all migration targets, to avoid
any possibility that a client, in effecting a transition to the any possibility that a client, in effecting a transition to the
migration target, will see any reversion in file system state. migration target, will see any reversion in file system state.
10.4.3. Referrals 11.4.3. Referrals
Referrals provide a way of placing a file system in a location Referrals provide a way of placing a file system in a location
essentially without respect to its physical location on a given essentially without respect to its physical location on a given
server. This allows a single server or a set of servers to present a server. This allows a single server or a set of servers to present a
multi-server namespace that encompasses file systems located on multi-server namespace that encompasses file systems located on
multiple servers. Some likely uses of this include establishment of multiple servers. Some likely uses of this include establishment of
site-wide or organization-wide namespaces, or even knitting such site-wide or organization-wide namespaces, or even knitting such
together into a truly global namespace. together into a truly global namespace.
Referrals occur when a client determines, upon first referencing a Referrals occur when a client determines, upon first referencing a
skipping to change at page 200, line 6 skipping to change at page 205, line 6
Use of multi-server namespaces is enabled by NFSv4 but is not Use of multi-server namespaces is enabled by NFSv4 but is not
required. The use of multi-server namespaces and their scope will required. The use of multi-server namespaces and their scope will
depend on the applications used, and system administration depend on the applications used, and system administration
preferences. preferences.
Multi-server namespaces can be established by a single server Multi-server namespaces can be established by a single server
providing a large set of referrals to all of the included file providing a large set of referrals to all of the included file
systems. Alternatively, a single multi-server namespace may be systems. Alternatively, a single multi-server namespace may be
administratively segmented with separate referral file systems (on administratively segmented with separate referral file systems (on
separate servers) for each separately-administered section of the separate servers) for each separately-administered section of the
name space. Any segment or the top-level referral file system may namespace. Any segment or the top-level referral file system may use
use replicated referral file systems for higher availability. replicated referral file systems for higher availability.
Multi-server namespaces are, for the most part, uniform, in Multi-server namespaces are, for the most part, uniform, in
that the same data made available to one client at a given location that the same data made available to one client at a given location
in the namespace is made available to all clients at that location. in the namespace is made available to all clients at that location.
There are, however, facilities provided which allow different clients to There are, however, facilities provided which allow different clients to
be directed to different sets of data, so as to adapt to such client be directed to different sets of data, so as to adapt to such client
characteristics as CPU architecture. characteristics as CPU architecture.
10.5. Additional Client-side Considerations 11.5. Additional Client-side Considerations
When clients make use of servers that implement referrals, When clients make use of servers that implement referrals,
replication, and migration, care should be taken so that a user who replication, and migration, care should be taken so that a user who
mounts a given file system that includes a referral or a relocated mounts a given file system that includes a referral or a relocated
file system continues to see a coherent picture of that user-side file file system continues to see a coherent picture of that user-side file
system despite the fact that it contains a number of server-side file system despite the fact that it contains a number of server-side file
systems which may be on different servers. systems which may be on different servers.
One important issue is upward navigation from the root of a server- One important issue is upward navigation from the root of a server-
side file system to its parent (specified as ".." in UNIX). The side file system to its parent (specified as ".." in UNIX). The
skipping to change at page 201, line 10 skipping to change at page 206, line 10
change. It is expected that clients will cache information related change. It is expected that clients will cache information related
to traversing referrals so that future client side requests are to traversing referrals so that future client side requests are
resolved locally without server communication. This is usually resolved locally without server communication. This is usually
rooted in client-side name lookup caching. Clients should rooted in client-side name lookup caching. Clients should
periodically purge this data for referral points in order to detect periodically purge this data for referral points in order to detect
changes in location information. When the change attribute changes changes in location information. When the change attribute changes
for directories that hold referral entries or for the referral for directories that hold referral entries or for the referral
entries themselves, clients should consider any associated cached entries themselves, clients should consider any associated cached
referral information to be out of date. referral information to be out of date.
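One way a client might organize such a referral cache is sketched below. The class name, the method names, and the default maximum age are assumptions chosen for illustration; the text does not prescribe a purge interval or a cache structure.

```python
import time


class ReferralCache:
    """Hypothetical client-side cache of referral location information."""

    def __init__(self, max_age_seconds=300.0, clock=time.monotonic):
        self._max_age = max_age_seconds  # assumed purge interval
        self._clock = clock
        self._entries = {}  # path -> (change_attr, locations, stored_at)

    def store(self, path, change_attr, locations):
        self._entries[path] = (change_attr, list(locations), self._clock())

    def observe_change(self, path, change_attr):
        """Drop the entry if the change attribute seen now differs."""
        entry = self._entries.get(path)
        if entry is not None and entry[0] != change_attr:
            del self._entries[path]

    def lookup(self, path):
        """Return cached locations, or None if missing or too old."""
        entry = self._entries.get(path)
        if entry is None:
            return None
        _, locations, stored_at = entry
        if self._clock() - stored_at > self._max_age:
            del self._entries[path]  # periodic purge of referral data
            return None
        return locations
```

The injectable clock is there only so the age-based purge can be exercised without waiting; a client would normally use the default.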
10.6. Effecting File System Transitions 11.6. Effecting File System Transitions
Transitions between file system instances, whether due to switching Transitions between file system instances, whether due to switching
between replicas upon server unavailability or in response to between replicas upon server unavailability or in response to
server-initiated migration events, are best dealt with together. Even server-initiated migration events, are best dealt with together. Even
though the prototypical use cases of replication and migration though the prototypical use cases of replication and migration
contain distinctive sets of features, when all possibilities for contain distinctive sets of features, when all possibilities for
these operations are considered, the underlying unity of these these operations are considered, the underlying unity of these
operations, from the client's point of view, is clear, even though for operations, from the client's point of view, is clear, even though for
the server, pragmatic considerations will normally force different the server, pragmatic considerations will normally force different
implementation strategies for planned and unplanned transitions. implementation strategies for planned and unplanned transitions.
skipping to change at page 202, line 9 skipping to change at page 207, line 9
types. Two file systems that belong to such a class share some types. Two file systems that belong to such a class share some
important aspect of file system behavior that clients may depend upon important aspect of file system behavior that clients may depend upon
when present, to easily effect a seamless transition between file when present, to easily effect a seamless transition between file
system instances. Conversely, where the file systems do not belong system instances. Conversely, where the file systems do not belong
to such a common class, the client has to deal with various sorts of to such a common class, the client has to deal with various sorts of
implementation discontinuities which may cause performance or other implementation discontinuities which may cause performance or other
issues in effecting a transition. issues in effecting a transition.
Where the fs_locations_info attribute is available, such file system Where the fs_locations_info attribute is available, such file system
classification data will be made directly available to the client. classification data will be made directly available to the client.
See Section 10.10 for details. When only fs_locations is available, See Section 11.10 for details. When only fs_locations is available,
default assumptions with regard to such classifications have to be default assumptions with regard to such classifications have to be
inferred. See Section 10.9 for details. inferred. See Section 11.9 for details.
In cases in which one server is expected to accept opaque values from In cases in which one server is expected to accept opaque values from
the client that originated from another server, it is a wise the client that originated from another server, it is a wise
implementation practice for the servers to encode the "opaque" values implementation practice for the servers to encode the "opaque" values
in big-endian octet order. If this is done, servers acting as in big-endian octet order. If this is done, servers acting as
replicas or immigrating file systems will be able to parse values replicas or immigrating file systems will be able to parse values
like stateids, directory cookies, filehandles, etc. even if their like stateids, directory cookies, filehandles, etc. even if their
native octet order is different from that of other servers native octet order is different from that of other servers
cooperating in the replication and migration of the file system. cooperating in the replication and migration of the file system.
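As a small illustration of this practice, the sketch below packs the fields behind an "opaque" value in big-endian octet order so that the resulting octets are the same whatever the native order of the encoding server. The field layout (two 64-bit identifiers and a 32-bit generation number) is invented for the example and is not taken from any server implementation.

```python
import struct

# Hypothetical layout: 64-bit file system id, 64-bit file id, 32-bit generation.
# The ">" prefix selects big-endian octet order with no padding.
_HANDLE_FORMAT = ">QQI"


def encode_handle(fs_id, file_id, generation):
    """Encode the fields as an opaque value in big-endian octet order."""
    return struct.pack(_HANDLE_FORMAT, fs_id, file_id, generation)


def decode_handle(opaque):
    """Decode an opaque value produced by any cooperating server."""
    return struct.unpack(_HANDLE_FORMAT, opaque)
```

Because the octet order is fixed by the format string rather than by the host, a replica with a different native order decodes the same field values.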
10.6.1. File System Transitions and Simultaneous Access 11.6.1. File System Transitions and Simultaneous Access
When a single file system may be accessed at multiple locations, When a single file system may be accessed at multiple locations,
whether this is because of an indication of file system identity as whether this is because of an indication of file system identity as
reported by the fs_locations or fs_locations_info attributes or reported by the fs_locations or fs_locations_info attributes or
because two file system instances have corresponding locations on because two file system instances have corresponding locations on
server addresses which connect to the same server as indicated by a server addresses which connect to the same server as indicated by a
common so_major_id field in the eir_server_owner field returned by common so_major_id field in the eir_server_owner field returned by
EXCHANGE_ID, the client will, depending on specific circumstances as EXCHANGE_ID, the client will, depending on specific circumstances as
discussed below, either: discussed below, either:
skipping to change at page 203, line 5 skipping to change at page 208, line 5
depending on the attributes of the source and destination file depending on the attributes of the source and destination file
system instances, as specified in the fs_locations_info attribute. system instances, as specified in the fs_locations_info attribute.
Which of these choices is possible, and how a transition is effected Which of these choices is possible, and how a transition is effected
is governed by equivalence classes of file system instances as is governed by equivalence classes of file system instances as
reported by the fs_locations_info attribute, and, for file system reported by the fs_locations_info attribute, and, for file system
instances in the same location within multiple single-server instances in the same location within multiple single-server
namespaces, by the so_major_id field in the eir_server_owner field namespaces, by the so_major_id field in the eir_server_owner field
returned by EXCHANGE_ID. returned by EXCHANGE_ID.
10.6.2. Simultaneous Use and Transparent Transitions 11.6.2. Simultaneous Use and Transparent Transitions
When two file system instances have the same location within their When two file system instances have the same location within their
respective single-server namespaces and those two server IP addresses respective single-server namespaces and those two server IP addresses
return the same so_major_id value in the eir_server_owner value returned return the same so_major_id value in the eir_server_owner value returned
in response to EXCHANGE_ID, those file system instances can be in response to EXCHANGE_ID, those file system instances can be
treated as the same, and either used together simultaneously or treated as the same, and either used together simultaneously or
serially with no transition activity required on the part of the serially with no transition activity required on the part of the
client. client.
Whether simultaneous use of the two file system instances is valid is Whether simultaneous use of the two file system instances is valid is
skipping to change at page 203, line 34 skipping to change at page 208, line 34
indicate that these instances belong to different _handle_, _fileid_, indicate that these instances belong to different _handle_, _fileid_,
_verifier_, _change_ classes, whether the two instances are shown _verifier_, _change_ classes, whether the two instances are shown
belonging to the same _simultaneous-use_ class or not. belonging to the same _simultaneous-use_ class or not.
Where these conditions do not apply, a non-transparent file system Where these conditions do not apply, a non-transparent file system
instance transition is required with the details depending on the instance transition is required with the details depending on the
respective _handle_, _fileid_, _verifier_, _change_ classes of the respective _handle_, _fileid_, _verifier_, _change_ classes of the
two file system instances and whether the two servers in question two file system instances and whether the two servers in question
have the same eir_server_scope value as reported by EXCHANGE_ID. have the same eir_server_scope value as reported by EXCHANGE_ID.
10.6.2.1. Simultaneous Use of File System Instances 11.6.2.1. Simultaneous Use of File System Instances
When the conditions above hold, in either of the following two cases, When the conditions above hold, in either of the following two cases,
the client may use the two file system instances simultaneously. the client may use the two file system instances simultaneously.
o The fs_locations_info attribute does not contain separate per-IP o The fs_locations_info attribute does not contain separate per-IP
address entries for file system instances at the distinct IP address entries for file system instances at the distinct IP
addresses. This includes the case in which the fs_locations_info addresses. This includes the case in which the fs_locations_info
attribute is unavailable. attribute is unavailable.
o The fs_locations_info attribute indicates that two file system o The fs_locations_info attribute indicates that two file system
skipping to change at page 204, line 10 skipping to change at page 209, line 10
that happens because the two IP addresses connect to the same that happens because the two IP addresses connect to the same
physical server or because different servers connect to clustered physical server or because different servers connect to clustered
file systems and export their data in common. When simultaneous use file systems and export their data in common. When simultaneous use
is in effect, any change made to one file system instance must be is in effect, any change made to one file system instance must be
immediately reflected in the other file system instance(s). Locks immediately reflected in the other file system instance(s). Locks
are treated as part of a common lease, associated with a common are treated as part of a common lease, associated with a common
client ID. Depending on the details of the eir_server_owner returned client ID. Depending on the details of the eir_server_owner returned
by EXCHANGE_ID, the two server instances may be accessed by different by EXCHANGE_ID, the two server instances may be accessed by different
sessions or a single session in common. sessions or a single session in common.
10.6.2.2. Transparent File System Transitions 11.6.2.2. Transparent File System Transitions
When the conditions above hold and the fs_locations_info attribute When the conditions above hold and the fs_locations_info attribute
explicitly shows the file system instances for these distinct IP explicitly shows the file system instances for these distinct IP
addresses as belonging to different _simultaneous-use_ classes, the addresses as belonging to different _simultaneous-use_ classes, the
file system instances should not be used by the client file system instances should not be used by the client
simultaneously, but rather serially with one being used unless and simultaneously, but rather serially with one being used unless and
until communication difficulties, lack of responsiveness, or an until communication difficulties, lack of responsiveness, or an
explicit migration event causes another file system instance (or set explicit migration event causes another file system instance (or set
of file system instances sharing a common _simultaneous-use_ class) to of file system instances sharing a common _simultaneous-use_ class) to
be used. be used.
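The choice among simultaneous use, transparent serial use, and a non-transparent transition can be summarized in a rough sketch. It assumes the two instances are compared on three inputs only: whether they share a location, whether the servers report a common so_major_id, and the _simultaneous-use_ classes (if any) shown by fs_locations_info. The function and its return strings are hypothetical, and the additional conditions involving the _handle_, _fileid_, _verifier_, and _change_ classes are deliberately not modelled.

```python
def classify_transition(same_location, same_major_id, simultaneous_use_classes=None):
    """Classify how a client may use two file system instances.

    simultaneous_use_classes is None when fs_locations_info has no separate
    per-address entries (or is unavailable); otherwise it is a pair holding
    the _simultaneous-use_ class of each instance.
    """
    # Without a common location and a common so_major_id, the instances
    # cannot be treated as the same.
    if not (same_location and same_major_id):
        return "non-transparent"
    # No per-address information: simultaneous use is permitted.
    if simultaneous_use_classes is None:
        return "simultaneous"
    class_a, class_b = simultaneous_use_classes
    if class_a == class_b:
        return "simultaneous"
    # Different classes: use one instance at a time, switching transparently.
    return "transparent-serial"
```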
skipping to change at page 205, line 7 skipping to change at page 210, line 7
transition, except where their staleness is recognized and transition, except where their staleness is recognized and
reported by the new server. Except where such staleness requires reported by the new server. Except where such staleness requires
it, no lock reclamation is needed. it, no lock reclamation is needed.
o Write verifiers are presumed to retain their validity and can be o Write verifiers are presumed to retain their validity and can be
presented to COMMIT, with the expectation that if COMMIT on the presented to COMMIT, with the expectation that if COMMIT on the
new server accepts them as valid, then that server has all of the new server accepts them as valid, then that server has all of the
data unstably written to the original server and has committed it data unstably written to the original server and has committed it
to stable storage as requested. to stable storage as requested.
10.6.3. Filehandles and File System Transitions 11.6.3. Filehandles and File System Transitions
There are a number of ways in which filehandles can be handled across There are a number of ways in which filehandles can be handled across
a file system transition. These can be divided into two broad a file system transition. These can be divided into two broad
classes depending upon whether the two file systems across which the classes depending upon whether the two file systems across which the
transition happens share sufficient state to effect some sort of transition happens share sufficient state to effect some sort of
continuity of file system handling. continuity of file system handling.
When there is no such co-operation in filehandle assignment, the two When there is no such co-operation in filehandle assignment, the two
file systems are reported as being in different _handle_ classes. In file systems are reported as being in different _handle_ classes. In
this case, all filehandles are assumed to expire as part of the file this case, all filehandles are assumed to expire as part of the file
skipping to change at page 205, line 30 skipping to change at page 210, line 30
FH4_VOL_MIGRATION bit, which only affects behavior when FH4_VOL_MIGRATION bit, which only affects behavior when
fs_locations_info is not available. fs_locations_info is not available.
When there is co-operation in filehandle assignment, the two file When there is co-operation in filehandle assignment, the two file
systems are reported as being in the same _handle_ class. In this systems are reported as being in the same _handle_ class. In this
case, persistent filehandles remain valid after the file system case, persistent filehandles remain valid after the file system
transition, while volatile filehandles (excluding those which are transition, while volatile filehandles (excluding those which are
only volatile due to the FH4_VOL_MIGRATION bit) are subject to only volatile due to the FH4_VOL_MIGRATION bit) are subject to
expiration on the target server. expiration on the target server.
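The two cases above reduce to a small decision, sketched here under simplifying assumptions. The function name, its parameters, and the three result strings are hypothetical; in particular, the caller is assumed to have already worked out whether a filehandle is volatile for some reason other than the FH4_VOL_MIGRATION bit.

```python
def filehandle_after_transition(same_handle_class, volatile_other_than_migration):
    """Expected fate of a filehandle across a file system transition.

    same_handle_class: the two file systems are reported in the same
        _handle_ class.
    volatile_other_than_migration: the filehandle is volatile for a reason
        other than the FH4_VOL_MIGRATION bit.
    """
    # Different _handle_ classes: every filehandle is assumed to expire.
    if not same_handle_class:
        return "expired"
    # Same class, but otherwise volatile: may expire on the target server.
    if volatile_other_than_migration:
        return "may-expire"
    # Same class and persistent (or volatile only because of the
    # FH4_VOL_MIGRATION bit): remains valid.
    return "valid"
```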